Supported monitoring metrics

This page lists Cloud Monitoring metrics available for Memorystore for Redis Cluster, and describes what each metric measures.

Cloud Monitoring metrics

Cluster-level metrics

These metrics provide a high-level view of the health and performance of the cluster as a whole. They help you understand the cluster's capacity and utilization, and identify potential bottlenecks or areas for improvement.

Metric name	Description
`redis.googleapis.com/cluster/clients/average_connected_clients`	Mean current number of client connections across the cluster.
`redis.googleapis.com/cluster/clients/maximum_connected_clients`	Maximum current number of client connections across the cluster.
`redis.googleapis.com/cluster/clients/total_connected_clients`	Current number of client connections to the cluster.
`redis.googleapis.com/cluster/stats/total_connections_received_count`	Count of cluster-level total client connections created in the last one minute.
`redis.googleapis.com/cluster/stats/total_rejected_connections_count`	Number of connections rejected because of the maxclients limit.
`redis.googleapis.com/cluster/commandstats/total_usec_count`	The total time consumed per command.
`redis.googleapis.com/cluster/commandstats/total_calls_count`	Total number of calls for this command in one minute.
`redis.googleapis.com/cluster/cpu/average_utilization`	Mean CPU utilization for the cluster from 0.0 to 1.0.
`redis.googleapis.com/cluster/cpu/maximum_utilization`	Maximum CPU utilization for the cluster from 0.0 to 1.0. Make sure that CPU utilization doesn't exceed 0.8 for the primary node and 0.5 for each replica that's designated as a read replica. For more information, see CPU usage best practices.
`redis.googleapis.com/cluster/stats/average_expired_keys`	Mean number of key expiration events for the primaries.
`redis.googleapis.com/cluster/stats/maximum_expired_keys`	Maximum number of key expiration events for the primaries.
`redis.googleapis.com/cluster/stats/total_expired_keys_count`	Total number of key expiration events for the primaries.
`redis.googleapis.com/cluster/stats/average_evicted_keys`	Mean number of evicted keys due to memory capacity for the primaries.
`redis.googleapis.com/cluster/stats/maximum_evicted_keys`	Maximum number of evicted keys due to memory capacity for the primaries.
`redis.googleapis.com/cluster/stats/total_evicted_keys_count`	Number of evicted keys due to memory capacity on primaries.
`redis.googleapis.com/cluster/keyspace/total_keys`	Number of keys stored in the cluster.
`redis.googleapis.com/cluster/stats/average_keyspace_hits`	Mean number of successful key lookups across the cluster.
`redis.googleapis.com/cluster/stats/maximum_keyspace_hits`	Maximum number of successful key lookups across the cluster.
`redis.googleapis.com/cluster/stats/total_keyspace_hits_count`	Total number of successful key lookups across the cluster.
`redis.googleapis.com/cluster/stats/average_keyspace_misses`	Mean number of failed key lookups across the cluster.
`redis.googleapis.com/cluster/stats/maximum_keyspace_misses`	Maximum number of failed key lookups across the cluster.
`redis.googleapis.com/cluster/stats/total_keyspace_misses_count`	Total number of failed key lookups across the cluster.
`redis.googleapis.com/cluster/memory/average_utilization`	Mean memory utilization across the cluster from 0.0 to 1.0.
`redis.googleapis.com/cluster/memory/maximum_utilization`	Maximum memory utilization across the cluster from 0.0 to 1.0.
`redis.googleapis.com/cluster/memory/total_used_memory`	Total memory usage of the cluster.
`redis.googleapis.com/cluster/memory/size`	Memory size of the cluster.
`redis.googleapis.com/cluster/replication/average_ack_lag`	Mean acknowledgment lag (in seconds) of replicas across the cluster. Acknowledgment lag is a bottleneck on the primary node in a cluster. This bottleneck is caused by replicas that can't keep up with the information that the primary node sends to them. When this happens, the primary node must wait for the acknowledgment that the replicas received the information. This might slow down transaction commits and cause a performance hit on the primary node.
`redis.googleapis.com/cluster/replication/maximum_ack_lag`	Maximum acknowledgment lag (in seconds) of replicas across the cluster.
`redis.googleapis.com/cluster/replication/average_offset_diff`	Mean replication offset diff (in bytes) across the cluster. Replication offset diff means the number of bytes that have not been replicated between replicas and their primaries.
`redis.googleapis.com/cluster/replication/maximum_offset_diff`	Maximum replication offset diff (in bytes) across the cluster. Replication offset diff means the number of bytes that have not been replicated between replicas and their primaries.
`redis.googleapis.com/cluster/stats/total_net_input_bytes_count`	Count of incoming network bytes received by the cluster endpoints.
`redis.googleapis.com/cluster/stats/total_net_output_bytes_count`	Count of outgoing network bytes sent from the cluster endpoints.
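
The CPU guidance in the `maximum_utilization` description can be expressed as a small check. The following is a minimal sketch, not an official client: the function name and constants are hypothetical, and it assumes you have already read a utilization sample (0.0 to 1.0) from Cloud Monitoring.

```python
# Limits taken from the maximum_utilization description in the table above.
PRIMARY_CPU_LIMIT = 0.8
READ_REPLICA_CPU_LIMIT = 0.5


def cpu_over_limit(utilization: float, is_read_replica: bool) -> bool:
    """Return True when a utilization sample (0.0 to 1.0) exceeds its limit."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    # Hypothetical helper: read replicas get the lower limit.
    limit = READ_REPLICA_CPU_LIMIT if is_read_replica else PRIMARY_CPU_LIMIT
    return utilization > limit


print(cpu_over_limit(0.6, is_read_replica=False))  # within the primary limit
print(cpu_over_limit(0.6, is_read_replica=True))   # above the read replica limit
```

A sample of 0.6 is acceptable on a primary node but exceeds the limit for a read replica.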

Node-level metrics

These metrics offer detailed insights into the health and performance of individual nodes within the cluster. They are useful for troubleshooting issues with specific nodes and optimizing their performance.

Metric name	Description
`redis.googleapis.com/cluster/node/clients/connected_clients`	Number of clients connected to the cluster node.
`redis.googleapis.com/cluster/node/clients/blocked_clients`	Number of client connections blocked by the cluster node.
`redis.googleapis.com/cluster/node/server/uptime`	Measures the uptime of the cluster node.
`redis.googleapis.com/cluster/node/stats/connections_received_count`	Count of total client connections created in the last one minute on the cluster node.
`redis.googleapis.com/cluster/node/stats/rejected_connections_count`	Number of connections rejected because of maxclients limit by the cluster node.
`redis.googleapis.com/cluster/node/commandstats/usec_count`	The total time consumed per command in the cluster node.
`redis.googleapis.com/cluster/node/commandstats/calls_count`	Total number of calls for this command on the cluster node in one minute.
`redis.googleapis.com/cluster/node/cpu/utilization`	CPU utilization for the cluster node from 0.0 to 1.0.
`redis.googleapis.com/cluster/node/stats/expired_keys_count`	Total number of expiration events in the cluster node.
`redis.googleapis.com/cluster/node/stats/evicted_keys_count`	Total number of evicted keys by the cluster node.
`redis.googleapis.com/cluster/node/keyspace/total_keys`	Number of keys stored in the cluster node.
`redis.googleapis.com/cluster/node/stats/keyspace_hits_count`	Number of successful lookup of keys in the cluster node.
`redis.googleapis.com/cluster/node/stats/keyspace_misses_count`	Number of failed lookup of keys in the cluster node.
`redis.googleapis.com/cluster/node/memory/utilization`	Memory utilization within the cluster node from 0.0 to 1.0.
`redis.googleapis.com/cluster/node/memory/usage`	Total memory usage of the cluster node.
`redis.googleapis.com/cluster/node/stats/net_input_bytes_count`	Count of incoming network bytes received by the cluster node.
`redis.googleapis.com/cluster/node/stats/net_output_bytes_count`	Count of outgoing network bytes sent from the cluster node.
`redis.googleapis.com/cluster/node/replication/offset`	Measures the replication offset bytes of the cluster node.
`redis.googleapis.com/cluster/node/server/healthy`	Determines whether a cluster node is available and functioning correctly. This metric is in Preview.

Cross-region replication metrics

This section lists metrics used for cross-region replication.

Metric name	Description
`redis.googleapis.com/cluster/cross_cluster_replication/secondary_replication_links`	This metric shows the number of shard links between the primary and secondary clusters. Within a cross-region replication (CRR) group, a primary cluster reports the number of CRR replication links that it has with the secondary clusters in the group. For each secondary cluster, this number is expected to be equal to the number of shards. If, unexpectedly, the number drops below the number of shards, this identifies the number of shards where replication between the replicator and follower has ceased. In an ideal state, this metric should have the same number as the primary cluster shard count.
`redis.googleapis.com/cluster/cross_cluster_replication/secondary_maximum_replication_offset_diff`	Maximum replication offset difference between primary shards and secondary shards.
`redis.googleapis.com/cluster/cross_cluster_replication/secondary_average_replication_offset_diff`	Average replication offset difference between primary shards and secondary shards.
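
The relationship between `secondary_replication_links` and the shard count can be checked with simple arithmetic. This is an illustrative sketch only: the function name is hypothetical, and it assumes you supply the primary cluster's shard count and the link count reported for one secondary cluster.

```python
def shards_without_replication(shard_count: int, replication_links: int) -> int:
    """Return how many shards lack a CRR link to one secondary cluster."""
    if shard_count < 0 or replication_links < 0:
        raise ValueError("counts must be non-negative")
    # In the ideal state the link count equals the shard count, so the gap is 0.
    return max(shard_count - replication_links, 0)


print(shards_without_replication(shard_count=6, replication_links=6))
print(shards_without_replication(shard_count=6, replication_links=4))
```

A nonzero result suggests that replication has ceased for that many shards and is worth investigating.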

Backup metrics

This section lists backup and import metrics.

Cluster-level metrics

Metric name	Description
`redis.googleapis.com/cluster/backup/last_backup_start_time`	The start time of the last backup operation.
`redis.googleapis.com/cluster/backup/last_backup_status`	The status of the last backup operation. Statuses are `1` (success) and `0` (failure).
`redis.googleapis.com/cluster/backup/last_backup_duration`	The duration of the last backup operation (in milliseconds).
`redis.googleapis.com/cluster/backup/last_backup_size`	The size of the last backup (in bytes).
`redis.googleapis.com/cluster/import/last_import_start_time`	The start time of the last import operation.
`redis.googleapis.com/cluster/import/last_import_duration`	The duration of the last import operation (in milliseconds).

Persistence metrics

This section lists persistence metrics and provides sample use cases for them.

RDB persistence metrics

Cluster-level metrics

Metric name	Description
`redis.googleapis.com/cluster/persistence/rdb_saves_count`	This metric shows the cumulative number of times your cluster has taken an RDB snapshot (also known as save). This metric has a `status_code` field. To check if a snapshot has failed, you can filter the `status_code` field for the following error: 3 - INTERNAL_ERROR
`redis.googleapis.com/cluster/persistence/rdb_save_ages`	This metric shows a distribution of snapshot ages for all nodes across the cluster. Ideally, the values in the distribution are less than or equal to your snapshot frequency.
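
Filtering on the `status_code` field can be written as a Cloud Monitoring filter string. The sketch below only builds the string. The helper name is hypothetical, and it assumes that `status_code` is exposed as a metric label whose failing value is the string `3`. Check the label's actual values in Metrics Explorer before relying on this.

```python
def failed_operation_filter(metric_type: str, status_code: str = "3") -> str:
    """Build a Cloud Monitoring filter that selects one status_code value.

    Assumption: status_code is a metric label and "3" marks INTERNAL_ERROR.
    """
    return (
        f'metric.type = "{metric_type}" '
        f'AND metric.labels.status_code = "{status_code}"'
    )


print(failed_operation_filter(
    "redis.googleapis.com/cluster/persistence/rdb_saves_count"
))
```

The same helper can be pointed at other metrics that carry a `status_code` field by passing a different metric type.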

Node-level metrics

Metric name	Description
`redis.googleapis.com/cluster/node/persistence/rdb_bgsave_in_progress`	This metric shows if an RDB BGSAVE is currently in progress on the cluster node. TRUE means in progress.
`redis.googleapis.com/cluster/node/persistence/rdb_last_bgsave_status`	This metric shows the success of the last BGSAVE on the cluster node. TRUE means success. If no BGSAVE has occurred, the value may default to TRUE.
`redis.googleapis.com/cluster/node/persistence/rdb_saves_count`	This metric shows the cumulative number of RDB saves executed on the cluster node.
`redis.googleapis.com/cluster/node/persistence/rdb_last_save_age`	This metric shows the time, in seconds, since the last successful snapshot.
`redis.googleapis.com/cluster/node/persistence/rdb_next_save_time_until`	This metric shows the time, in seconds, remaining until the next snapshot.
`redis.googleapis.com/cluster/node/persistence/current_save_keys_total`	This metric shows the number of keys in the current RDB save executing on the cluster node.
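
To compare snapshot ages against your snapshot frequency, you could flag nodes whose `rdb_last_save_age` exceeds the configured interval. This is a minimal sketch under stated assumptions: the function name and node identifiers are hypothetical, and the ages are values you have already read from Cloud Monitoring.

```python
def nodes_with_stale_snapshots(
    last_save_age_by_node: dict[str, float],
    snapshot_interval_s: float,
) -> list[str]:
    """Return nodes whose last snapshot is older than the snapshot interval."""
    return sorted(
        node
        for node, age_s in last_save_age_by_node.items()
        if age_s > snapshot_interval_s
    )


# Hypothetical node names and ages (seconds), with a one-hour snapshot interval.
ages = {"node-a": 1200.0, "node-b": 4000.0, "node-c": 3600.0}
print(nodes_with_stale_snapshots(ages, snapshot_interval_s=3600.0))
```

An age equal to the interval is treated as acceptable, matching the "less than or equal to" guidance for snapshot ages.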

AOF persistence metrics

Cluster-level metrics

Metric name	Description
`redis.googleapis.com/cluster/persistence/aof_fsync_lags`	This metric shows a distribution of the lag (from data write to durable storage sync) for all nodes in the cluster. It is only emitted for clusters with appendfsync=everysec. Ideally, the values in the distribution are less than or equal to your AOF sync frequency.
`redis.googleapis.com/cluster/persistence/aof_rewrite_count`	This metric shows the cumulative number of times for your cluster that a node has triggered an AOF rewrite. This metric has a `status_code` field. To check if AOF rewrites are failing, you can filter the `status_code` field for the following error: 3 - INTERNAL_ERROR

Node-level metrics

Metric name	Description
`redis.googleapis.com/cluster/node/persistence/aof_last_write_status`	This metric shows the success of the most recent AOF write on the cluster node. TRUE means success. If no write has occurred, the value may default to TRUE.
`redis.googleapis.com/cluster/node/persistence/aof_last_bgrewrite_status`	This metric shows the success of the last AOF bgrewrite operation on the cluster node. TRUE means success. If no bgrewrite has occurred, the value may default to TRUE.
`redis.googleapis.com/cluster/node/persistence/aof_fsync_lag`	This metric shows the AOF lag between memory and persistent store in the cluster node. It is only applicable for AOF-enabled clusters where appendfsync=EVERYSEC.
`redis.googleapis.com/cluster/node/persistence/aof_rewrites_count`	This metric shows the count of AOF rewrites in the cluster node. To check if AOF rewrites are failing, you can filter the `status_code` field for the following error: 3 - INTERNAL_ERROR
`redis.googleapis.com/cluster/node/persistence/aof_fsync_errors_count`	This metric shows the count of AOF fsync() call errors. It is only applicable for AOF-enabled clusters where appendfsync=EVERYSEC or appendfsync=ALWAYS.

Common persistence metrics

Metrics that are applicable to both AOF and RDB persistence mechanisms.

Node-level metrics

Metric name	Description
`redis.googleapis.com/cluster/node/persistence/auto_restore_count`	This metric shows the count of restores from the dumpfile (AOF or RDB).

Sample use cases for persistence metrics

Checking if AOF write operations cause latency and memory pressure

Suppose that you detect increased latency or memory usage on your cluster or the node within the cluster. In this case you may want to check if the extra usage is related to AOF persistence.

Because AOF rewrite operations can trigger transient load spikes, you can inspect the aof_rewrites_count metric, which gives you the cumulative count of AOF rewrites over the lifetime of the cluster or the node within the cluster. Suppose this metric shows you that increments in the rewrites count correspond to latency increases. In this circumstance, you could address the issue by reducing the write rate or increasing the shard count to reduce the frequency of rewrites.
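
One way to check whether increments in a cumulative count line up with latency increases is to split latency samples by whether the count rose in the same interval. The following is a rough sketch, not a built-in feature: the function name and the sample data are hypothetical, and it assumes both series were exported with aligned timestamps.

```python
from statistics import mean


def latency_by_rewrite_activity(rewrite_counts, latencies_ms):
    """Compare latency in intervals where the cumulative count rose or not.

    rewrite_counts[i] and latencies_ms[i] are samples for the same interval.
    Returns (mean latency when the count rose, mean latency otherwise).
    An entry is None when no samples fall in that group.
    """
    if len(rewrite_counts) != len(latencies_ms):
        raise ValueError("series must have the same length")
    during, outside = [], []
    # Start at 1 because each sample is compared with the previous one.
    for i in range(1, len(rewrite_counts)):
        rose = rewrite_counts[i] > rewrite_counts[i - 1]
        (during if rose else outside).append(latencies_ms[i])
    return (
        mean(during) if during else None,
        mean(outside) if outside else None,
    )


# Hypothetical samples: cumulative aof_rewrites_count and latency per interval.
counts = [4, 4, 5, 5, 5, 6]
latency = [1.0, 1.0, 3.0, 1.0, 1.0, 5.0]
print(latency_by_rewrite_activity(counts, latency))  # (4.0, 1.0)
```

A markedly higher mean in the first position suggests that rewrites coincide with the latency increases. The same comparison applies to rdb_saves_count.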

Checking if RDB save operations cause latency and memory pressure

Suppose that you detect increased latency or memory usage on your cluster or the node within the cluster. In this case you may want to check if the extra usage is related to RDB persistence.

Because RDB save operations can trigger transient load spikes, you can inspect the rdb_saves_count metric, which gives the cumulative count of RDB saves over the lifetime of the cluster or the node within the cluster. Suppose this metric shows you that increments in the RDB saves count correspond to latency increases. In this circumstance, you could lengthen the RDB snapshot interval to lower the frequency of saves. You could also scale out the cluster to reduce the baseline load levels.

Interpreting metrics for Memorystore for Redis Cluster

As seen in the lists above, many of the metrics come in three variations: average, maximum, and total.

For Memorystore for Redis Cluster, we provide average and maximum variations of the same metric so you can use them both to identify hotspotting for that metric family.

The total value for the metric is independent, and provides separate insight unrelated to the hotspotting purpose of average and maximum.

Understanding average and maximum metrics

Suppose you compare the average_keyspace_hits and maximum_keyspace_hits values for your cluster. The greater the difference between the two metrics, the more hot spotting of hits there is in your cluster. Ideally, average_keyspace_hits and maximum_keyspace_hits are close in value, because this means that hits are evenly distributed across your cluster.

This principle applies to all metrics that have the average and maximum variations of the same metric.

Hot spotting example

Comparing average_keyspace_hits and maximum_keyspace_hits across all of the shards in your cluster indicates whether hot spotting occurs. For example, suppose the shards in a 6-shard cluster have the following number of hits:

Shard 1 – 2 hits
Shard 2 – 2 hits
Shard 3 – 2 hits
Shard 4 – 2 hits
Shard 5 – 2 hits
Shard 6 – 8 hits

In this example the average_keyspace_hits returns a value of 3, and the maximum_keyspace_hits returns 8, indicating that shard 6 is hot.
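
The arithmetic in this example can be reproduced with a short script. This is only an illustration of the comparison. The function name is hypothetical, and the per-shard hit counts are the example values above rather than data read from Cloud Monitoring.

```python
def hot_spot_summary(hits_per_shard):
    """Return (average hits, maximum hits, 1-based index of the hottest shard)."""
    average_hits = sum(hits_per_shard) / len(hits_per_shard)
    maximum_hits = max(hits_per_shard)
    # Shards are numbered from 1 in the example, so shift the list index.
    hottest_shard = hits_per_shard.index(maximum_hits) + 1
    return average_hits, maximum_hits, hottest_shard


hits = [2, 2, 2, 2, 2, 8]  # shards 1 through 6 from the example
print(hot_spot_summary(hits))  # (3.0, 8, 6)
```

The gap between the average (3.0) and the maximum (8) is what signals the hot spot. With evenly distributed hits, the two values would be equal.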

We provide node-level metrics that you can use to identify hotspots in the cluster.