Monitoring

Developers and administrators can monitor application, system, and infrastructure metrics for Grainite. Grainite exposes a Prometheus endpoint on port 5064 for gathering and processing monitoring data. Metrics created with the counters and gauges APIs (discussed in the API section of the documentation) are exposed to Prometheus by Grainite and can be visualized in a tool such as Grafana.
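As a rough illustration of reading the endpoint without a full Prometheus server, the sketch below fetches one scrape and parses the Prometheus text exposition format. Only port 5064 comes from this page; the host name, the /metrics path, and the helper names parse_metrics and fetch_metrics are assumptions made for the example.

```python
import urllib.request

# Assumed scrape URL: the host and the /metrics path are placeholders;
# only port 5064 is stated in the documentation.
METRICS_URL = "http://localhost:5064/metrics"


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.

    Simplified: assumes label values contain no spaces and ignores
    optional trailing timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        series, value = parts[0], float(parts[1])
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), value))
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Fetch and parse one scrape from the (assumed) endpoint URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

In practice a Prometheus server would normally scrape this endpoint on a schedule; a script like this is mainly useful for a quick check that metrics are being published.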

Developer-defined application metrics

Developers can define application metrics using the counters and gauges APIs. The Userflows example included in our Samples demonstrates the following developer-defined application metrics.

| Description | Metric Used | Metric Type |
| --- | --- | --- |
| Completed flows | gxsapp_completed_flows_total | Counter |
| Abandoned flows | gxsapp_abandoned_flows_total | Counter |
| Current flows | gxsapp_current_flow_counts_current | Gauge |
| Current flows by type | gxsapp_current_flow_counts_current | Gauge |

Grainite-defined metrics

Grainite provides built-in metrics that allow developers and administrators to monitor:

  • Application runtime metrics

  • Event processing metrics

  • Database metrics

Application Runtime Metrics

| Description | Metric Used | Metric Type |
| --- | --- | --- |
| Rate of action invocations per minute | gxssys_action_count_total | Gauge |
| Count of action errors | gxssys_action_errors_total | Counter |
| Average action execution latency | gxssys_grain_execution_us_total | Counter |
| Paused endpoints due to failures | gxssys_endpoint_paused_total | Gauge |
| Task execution errors | gxsapp_gxtask_execution_errors_total | Counter |
| Task instance execution errors | gxsapp_gxtask_instance_execution_errors_total | Counter |
| Task execution status | gxsapp_gxtask_execution_status_current | Gauge |
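Counters such as gxssys_action_errors_total only ever increase, so they are usually read as a rate between two scrapes rather than as a raw value. The following is a minimal sketch of that calculation, assuming two samples of the same counter taken a known number of seconds apart; the function name per_minute_rate is hypothetical, and the reset handling mirrors the common Prometheus convention rather than anything specific to Grainite.

```python
def per_minute_rate(prev_value, curr_value, elapsed_seconds):
    """Per-minute increase of a monotonically increasing counter.

    prev_value and curr_value are two samples of the same counter,
    taken elapsed_seconds apart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, which normally means it was reset
        # (for example after a restart), so count from zero instead.
        delta = curr_value
    return delta * 60.0 / elapsed_seconds


# Example: errors went from 120 to 150 over a 30-second scrape interval.
errors_per_minute = per_minute_rate(120, 150, 30)
```

When Prometheus and Grafana are available, the built-in rate functions do the same job and also handle missing scrapes.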

Event processing metrics

| Description | Metric Used | Metric Type |
| --- | --- | --- |
| Message delivery latency over the last 30s window of data, published for the 50th/95th/99th percentiles | gxssys_message_delay_ms_total | Counter |
| Topic consumption latency over the last 30s window of data, published for the 50th/95th/99th percentiles | gxssys_subscription_delay_ms_total | Counter |
| Batch size of fetched requests | gxdtopic_tot_fetch_batch_size_total, gxdtopic_tot_batch_size_cnt_total | Counter |
| Total events fetched from topic | gxdtopic_tot_fetched_events_total | Counter |
| Total events fetched and processed | gxdtopic_tot_consumed_events_total | Counter |
| Total Grain-to-Grain messages fetched | gxdg2g_tot_fetched_messages_total | Counter |
| Total Grain-to-Grain messages fetched and processed | gxdg2g_tot_consumed_messages_total | Counter |
| Number of events that have been pulled from a topic but have not yet been processed | gxdexec_cur_inflight_events_current | Gauge |
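The fetched and consumed counters come in pairs, so one way to read them together is as a backlog: items fetched but not yet processed. The sketch below makes that assumption explicit; it is not a documented Grainite formula, the function name processing_backlog is hypothetical, and the gxdexec_cur_inflight_events_current gauge remains the directly published figure for in-flight events.

```python
# Pairs of (fetched counter, consumed counter) taken from the table above.
COUNTER_PAIRS = {
    "topic_events": (
        "gxdtopic_tot_fetched_events_total",
        "gxdtopic_tot_consumed_events_total",
    ),
    "grain_to_grain_messages": (
        "gxdg2g_tot_fetched_messages_total",
        "gxdg2g_tot_consumed_messages_total",
    ),
}


def processing_backlog(samples):
    """Estimate fetched-but-unprocessed items for each counter pair.

    samples maps metric names to their latest values. The subtraction is
    an assumed interpretation of the counters, clamped at zero in case
    the two values were scraped at slightly different moments.
    """
    backlog = {}
    for label, (fetched_key, consumed_key) in COUNTER_PAIRS.items():
        if fetched_key in samples and consumed_key in samples:
            diff = samples[fetched_key] - samples[consumed_key]
            backlog[label] = max(diff, 0.0)
    return backlog
```

A backlog that keeps growing across scrapes would suggest that processing is not keeping up with fetching.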

Database Metrics

| Description | Metric Used | Metric Type |
| --- | --- | --- |
| Average latency to process requests to database | gxsctl_tot_work_process_latency_total, gxsctl_tot_work_total | Counter |
| Cumulative count of writes to Grains | gxggrain_tot_update_total | Counter |
| Disk currently used by apps and system | gxsdat_cur_disk_used_size_current | Gauge |
| Number of Grain updates that have materialized | gxpmatr_tot_fetched_logs_total | Counter |
| Number of Grain updates pending | gxggrain_cur_update_current | Gauge |
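The average database latency is listed with two counters, which suggests dividing the growth in total latency by the growth in request count over the same window. The sketch below assumes that reading and takes two scrapes as plain dictionaries; the function name windowed_average is hypothetical, and the latency unit is whatever Grainite publishes, which this page does not state.

```python
def windowed_average(prev, curr, total_key, count_key):
    """Average of a quantity over the window between two scrapes.

    prev and curr map metric names to counter values. Returns None when
    no requests were counted in the window, or when a counter reset
    makes the window unusable.
    """
    delta_total = curr[total_key] - prev[total_key]
    delta_count = curr[count_key] - prev[count_key]
    if delta_count <= 0 or delta_total < 0:
        return None
    return delta_total / delta_count


# Example with made-up values: 30 requests added 600 latency units.
prev = {"gxsctl_tot_work_process_latency_total": 1000.0, "gxsctl_tot_work_total": 10.0}
curr = {"gxsctl_tot_work_process_latency_total": 1600.0, "gxsctl_tot_work_total": 40.0}
avg_latency = windowed_average(
    prev, curr, "gxsctl_tot_work_process_latency_total", "gxsctl_tot_work_total"
)
```

Dividing deltas rather than the raw totals gives the average for the recent window instead of the average since the process started.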

Cluster Health

| Description | Metric Used | Metric Type |
| --- | --- | --- |
| Total high load and total stalled metrics indicate the health of the compute capability of the Grainite cluster | gxssrvr2_tot_highload_hz, gxssrvr2_tot_stalled_hz | Counter |
| WAL disk used metric provides the current utilization of the Grainite Write Ahead Log (WAL) | gxwwal_cur_disk_rlused_size_current | Gauge |
| The current allowed rate and current target rate help to determine whether there is a continuous event execution overload on the cluster | gxdfctrl_cur_allowed_rate_current, gxdfctrl_cur_target_rate_current | Gauge |
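Because the allowed and target rates are meant to reveal a continuous overload, a single scrape is less informative than a run of consecutive scrapes. The sketch below flags a sustained condition under one loudly stated assumption: that an allowed rate below the target rate means the cluster is being throttled. This page does not define that relationship, so the comparison may need to be adjusted; the function name and the default threshold are also hypothetical.

```python
def sustained_overload(samples, min_consecutive=5):
    """Return True if the assumed overload condition holds continuously.

    samples is an iterable of (allowed_rate, target_rate) pairs, oldest
    first, read from gxdfctrl_cur_allowed_rate_current and
    gxdfctrl_cur_target_rate_current on successive scrapes.

    ASSUMPTION: allowed_rate < target_rate is treated as "overloaded".
    """
    streak = 0
    for allowed, target in samples:
        # Extend the streak while the condition holds, otherwise reset it.
        streak = streak + 1 if allowed < target else 0
        if streak >= min_consecutive:
            return True
    return False


# Example with made-up values: three consecutive scrapes below target.
history = [(100, 100), (80, 100), (70, 100), (60, 100)]
overloaded = sustained_overload(history, min_consecutive=3)
```

Requiring several consecutive scrapes avoids alerting on a brief spike that clears on its own.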

Infrastructure Metrics

Cloud providers' monitoring solutions can be used to gather infrastructure-level metrics. We recommend monitoring the following metrics:

  • CPU usage: CPU usage by each Kubernetes node is measured in the number of CPU cores

  • CPU utilization: CPU utilization by each node measured as a percent of available CPU resources

  • Bytes transmitted: Throughput of network traffic being sent out of each node measured in bytes

  • Bytes received: Throughput of network traffic being received by each node, measured in bytes

  • Memory usage: Memory usage by each node measured in GiB

  • Disk read: Throughput of disk IOPS being read by each node to its persistent disk

  • Disk write: Throughput of disk IOPS being written by each node to its persistent disk

Additional metrics can be added as desired for your deployments within the cloud provider's monitoring console.
