Monitor a Google Cloud Managed Service for Apache Kafka cluster

You can use the Google Cloud console or the Cloud Monitoring API to monitor Managed Service for Apache Kafka.

This document provides an overview of the metrics available for Managed Service for Apache Kafka. It also shows you how to monitor your Managed Service for Apache Kafka usage in the Google Cloud console by using Monitoring.

If you want to view metrics from other Google Cloud resources in addition to the complete set of Managed Service for Apache Kafka metrics, use Monitoring.
Otherwise, you can use the monitoring dashboards with a selection of metrics provided within Managed Service for Apache Kafka. For more information, see the following topics:

Overview of the Managed Service for Apache Kafka metrics

Managed Service for Apache Kafka exports several metrics available in the open-source Kafka distribution, as well as service-specific metrics like consumer group offset lag. For monitoring, the Managed Service for Apache Kafka service is identified by the service URL managedkafka.googleapis.com.

Managed Service for Apache Kafka metrics are organized into four resource categories:

Cluster: These metrics are intended for maintaining overall cluster health.
Topic: These metrics include producer and consumer rates and errors. These metrics monitor the overall health of Kafka applications and issues specific to a broker.
Topic Partition: These metrics are intended for monitoring and debugging performance problems specific to individual partitions. An example is uneven key distribution.
Topic Partition Consumer Group: These metrics monitor consumer application health, primarily consumer lag. Open-source Kafka error metrics for consumer groups are available only at the topic level, not by partition.
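
As a rough illustration of how partition-level data can reveal uneven key distribution, the sketch below compares the busiest partition with the average. The function name, the sample values, and the choice of per-partition message counts as input are assumptions made for illustration; they are not part of the service.

```python
from statistics import mean
from typing import Mapping


def partition_skew(messages_per_partition: Mapping[int, int]) -> float:
    """Return max/mean of per-partition message counts (1.0 means perfectly even)."""
    counts = list(messages_per_partition.values())
    if not counts:
        raise ValueError("at least one partition is required")
    average = mean(counts)
    if average == 0:
        return 1.0  # no traffic at all, so nothing is skewed
    return max(counts) / average


# Hypothetical sample: partition 2 receives most of the keys.
sample = {0: 100, 1: 100, 2: 400}
print(partition_skew(sample))
```

A value well above 1.0 suggests that a few keys dominate one partition; what counts as "too high" depends on the workload.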

Some metrics can be grouped by broker. While the Managed Service for Apache Kafka service itself does not expose brokers as a resource, monitoring them is essential to detect failure scenarios like latency due to overloaded brokers.

Metric names follow a convention that combines the service API URL, the monitored resource, and the metric name. For example, the identifier of the topic message_in_count metric is managedkafka.googleapis.com/Topic/message_in_count.
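
The naming convention can be sketched as a small helper that joins the three parts with slashes. The helper name is hypothetical; only the service URL and the example identifier come from this document.

```python
SERVICE = "managedkafka.googleapis.com"  # service URL named in this document


def metric_identifier(resource: str, metric: str, service: str = SERVICE) -> str:
    """Join service, monitored resource, and metric name with slashes."""
    return "/".join(part.strip("/") for part in (service, resource, metric))


print(metric_identifier("Topic", "message_in_count"))
```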

To access these metrics, see View a single Managed Service for Apache Kafka metric.

Before you begin

Before you use Monitoring, ensure that you've prepared a Managed Service for Apache Kafka project with billing enabled. One way to do this is to complete the Quickstart for Managed Service for Apache Kafka.

Required roles and permissions

To get the permissions that you need to view monitoring charts, ask your administrator to grant you the Managed Kafka Viewer (roles/managedkafka.Viewer) IAM role on your project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about this role, see Managed Service for Apache Kafka predefined roles.

View a single Managed Service for Apache Kafka metric

To view a single Managed Service for Apache Kafka metric by using the Google Cloud console, perform the following steps:

In the Google Cloud console, go to the Monitoring page.

In the navigation pane, select Metrics explorer.
In the Configuration section, click Select a metric.
In the filter, enter Apache Kafka.
In Active resources, select one of the following:
- Apache Kafka Cluster
- Apache Kafka Topic
- Apache Kafka Topic Partition
- Apache Kafka Topic Partition Consumer Group
Select a metric and click Apply.

The page for a specific metric opens.

You can learn more about the monitoring dashboard by reading the Cloud Monitoring documentation.
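
If you use the Cloud Monitoring API instead of the console, time series are selected with a filter string. The sketch below only builds such a filter; it does not call the API. The helper name is hypothetical, and the split of the identifier into a metric type and a resource type is an assumption, so confirm the exact strings in Metrics explorer before relying on them.

```python
from typing import Mapping, Optional


def build_filter(
    metric_type: str,
    resource_type: str,
    resource_labels: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a Cloud Monitoring filter that selects one metric on one resource type."""
    clauses = [
        f'metric.type = "{metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    # Optional label restrictions, sorted so the output is deterministic.
    for key, value in sorted((resource_labels or {}).items()):
        clauses.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(clauses)


# Assumed values for illustration only; verify them in Metrics explorer.
print(
    build_filter(
        "managedkafka.googleapis.com/message_in_count",
        "managedkafka.googleapis.com/Topic",
    )
)
```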

Cluster metrics

Metric	Description	Equivalent MBean name
cpu/core_usage_time	Cumulative CPU usage of the cluster in vCPU. This can be useful for understanding the overall cost of operation for the cluster.	N/A
cpu/limit	Current CPU count configured for the cluster. Can be used to monitor CPU utilization as a ratio with the `cpu/core_usage_time` metric.	N/A
memory/usage	Current RAM usage on the cluster. Can be used to monitor RAM utilization as a ratio with the `memory/limit` metric.	N/A
memory/limit	Current configured RAM size of the cluster. Can be used to monitor RAM utilization as a ratio with the `memory/usage` metric.	N/A
cluster_byte_in_count	The total number of bytes from clients sent to all topics.	`kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`
cluster_byte_out_count	The total number of bytes sent to clients from all topics.	`kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec`
cluster_message_in_count	The total number of messages that have been published to all topics.	`kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`
request_count	The total number of requests made to the broker.	`kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+)`
request_byte_count	The total size, in bytes, of requests made to the cluster.	`kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)`
partitions	The current number of partitions handled by this cluster, broken down by broker.	`kafka.server:type=ReplicaManager,name=PartitionCount`
request_latencies	The number of milliseconds taken for each request, at various percentiles.	`kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}`
consumer_groups	The current number of consumer groups consuming from the broker.	`kafka.server:type=GroupMetadataManager,name=NumGroups`
offline_partitions	The number of offline topic partitions as observed by the controller.	`kafka.controller:type=KafkaController,name=OfflinePartitionCount`
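
The utilization ratios mentioned in the table can be sketched as follows. This is a minimal sketch: the function names and sample values are hypothetical, and treating cpu/core_usage_time as cumulative vCPU-seconds (so that the difference between two samples is divided by the elapsed time) is an assumption.

```python
def memory_utilization(usage_bytes: float, limit_bytes: float) -> float:
    """Fraction of configured RAM in use: memory/usage divided by memory/limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return usage_bytes / limit_bytes


def cpu_utilization(
    usage_start: float,
    usage_end: float,
    interval_seconds: float,
    cpu_limit: float,
) -> float:
    """Average CPU utilization between two samples of the cumulative usage metric.

    Assumes the cumulative metric is in vCPU-seconds.
    """
    if interval_seconds <= 0 or cpu_limit <= 0:
        raise ValueError("interval_seconds and cpu_limit must be positive")
    used_vcpu_seconds = usage_end - usage_start
    return used_vcpu_seconds / (interval_seconds * cpu_limit)


# Hypothetical samples: 6 GiB used of 8 GiB, and 120 vCPU-seconds
# consumed over 60 seconds on a cluster configured with 4 vCPUs.
print(memory_utilization(6 * 1024**3, 8 * 1024**3))
print(cpu_utilization(100.0, 220.0, 60.0, 4))
```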

Topic metrics

Metric	Description	Equivalent MBean name
message_in_count	The total number of messages published to the topic.	`kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=([-.\w]+)`
byte_in_count	The total number of bytes from clients sent to the topic.	`kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=([-.\w]+)`
topic_request_count	The total number of produce and fetch requests made to the topic.	`kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+)` `kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+)`
topic_error_count	The total number of failed produce and failed fetch requests made to the topic.	`kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=([-.\w]+)` `kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=([-.\w]+)`
byte_out_count	The total number of bytes sent to clients.	`kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=([-.\w]+)`
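
One way to use the request and error counters together is to compute an error ratio for a topic over some time window. The sketch below assumes both counts cover the same window; the function name and sample numbers are hypothetical.

```python
def topic_error_ratio(error_count: int, request_count: int) -> float:
    """Share of produce and fetch requests that failed in a given window."""
    if error_count < 0 or request_count < 0:
        raise ValueError("counts must be non-negative")
    if request_count == 0:
        return 0.0  # no requests in the window, so no failures to report
    return error_count / request_count


# Hypothetical window: 5 failed requests out of 200.
print(topic_error_ratio(5, 200))
```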

Partition metrics

Metric	Description	Equivalent MBean name
consumer_lag	Replication lag in messages between leader and each follower replica.	`kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)`
log_segments	The current number of log segments. This is useful to make sure storage tiering remains healthy.	`kafka.log:type=Log,name=NumLogSegments,topic=([-.\w]+),partition=([0-9]+)`
first_offset	The first offset for each partition in the topic. In combination with the `last_offset`, it can be used to monitor an upper bound on the total number of messages stored as well as to find the actual offset of the oldest message.	`kafka.log:type=Log,name=LogStartOffset,topic=([-.\w]+),partition=([0-9]+)`
last_offset	The last offset in the partition. This can be used to find the latest offset for each partition over time. This can be useful in identifying the specific offset needed to reprocess data starting from a particular time in the past.	`kafka.log:type=Log,name=LogEndOffset,topic=([-.\w]+),partition=([0-9]+)`
byte_size	The size of the partition on disk in bytes.	N/A
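
The upper bound on stored messages described for first_offset and last_offset can be sketched as a subtraction per partition. The function names and sample offsets are hypothetical. The result is an upper bound rather than an exact count because not every offset in the range necessarily holds a retained message (for example, on compacted topics).

```python
from typing import Mapping, Tuple


def stored_message_upper_bound(first_offset: int, last_offset: int) -> int:
    """Upper bound on messages in one partition: last_offset minus first_offset."""
    if last_offset < first_offset:
        raise ValueError("last_offset must not be smaller than first_offset")
    return last_offset - first_offset


def topic_upper_bound(offsets: Mapping[int, Tuple[int, int]]) -> int:
    """Sum the per-partition bounds; keys are partition numbers."""
    return sum(stored_message_upper_bound(first, last) for first, last in offsets.values())


# Hypothetical (first_offset, last_offset) pairs for two partitions.
sample = {0: (100, 250), 1: (0, 40)}
print(topic_upper_bound(sample))
```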

Consumer group metrics

Metric	Description	Equivalent MBean name
offset_lag	The number of messages that the consumer group has not yet committed on the partition.	N/A
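
Because this metric is reported per partition, a consumer health check typically aggregates it. The sketch below totals the lag for one consumer group and lists the partitions above a threshold; the function name, the threshold, and the sample values are assumptions chosen for illustration.

```python
from typing import List, Mapping, Tuple


def summarize_lag(
    offset_lag_by_partition: Mapping[int, int],
    threshold: int,
) -> Tuple[int, List[int]]:
    """Return total lag and the sorted partitions whose lag exceeds the threshold."""
    total = sum(offset_lag_by_partition.values())
    lagging = sorted(
        partition
        for partition, lag in offset_lag_by_partition.items()
        if lag > threshold
    )
    return total, lagging


# Hypothetical lag readings for one consumer group on a three-partition topic.
total, lagging = summarize_lag({0: 10, 1: 0, 2: 500}, threshold=100)
print(total, lagging)
```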