Configure metrics collection

Autopilot Standard

This document describes how to configure Google Kubernetes Engine (GKE) to send metrics to Cloud Monitoring. Metrics in Cloud Monitoring can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Cloud Monitoring API.

GKE provides several sources of metrics:

System metrics: metrics from essential system components, describing low-level resources such as CPU, memory and storage.
Google Cloud Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
Packages of observability metrics:
- Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
- Kube state metrics: a curated set of metrics exported from the kube state service, used to monitor the state of Kubernetes objects like Pods, Deployments, and more. For the set of included metrics, see Use kube state metrics.
  
  The kube state package is a managed solution. If you need greater flexibility (for example, to collect additional metrics, manage scrape intervals, or scrape other resources), you can disable the package, if it is enabled, and deploy your own instance of the open source kube state metrics service. For more information, see the Google Cloud Managed Service for Prometheus exporter documentation for Kube state metrics.
- cAdvisor/Kubelet: a curated set of cAdvisor and Kubelet metrics. For the set of included metrics, see Use cAdvisor/Kubelet metrics.
  
  The cAdvisor/Kubelet package is a managed solution. If you need greater flexibility (for example, to collect additional metrics, manage scrape intervals, or scrape other resources), you can disable the package, if it is enabled, and deploy your own instance of the open source cAdvisor/Kubelet metrics services.
- NVIDIA Data Center GPU Manager (DCGM) metrics: metrics from DCGM that provide a comprehensive view of GPU health, performance, and utilization.

You can also configure automatic application monitoring for certain workloads.

System metrics

When a cluster is created, GKE by default collects certain metrics emitted by system components.

You have a choice whether or not to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.

All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
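As a rough sketch of how that prefix can be used, the following shell function lists the metric descriptors whose type starts with `kubernetes.io/` through the Cloud Monitoring API. The function name and the `PROJECT_ID` placeholder are assumptions made for this example, and it assumes that `curl` and an authenticated Google Cloud CLI are available.

```shell
# Monitoring filter that matches metric types under the kubernetes.io prefix.
GKE_SYSTEM_METRIC_FILTER='metric.type = starts_with("kubernetes.io/")'

# Hypothetical helper: list matching metric descriptors for one project.
list_gke_system_metrics() {
  project_id="$1"
  curl -sS -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=${GKE_SYSTEM_METRIC_FILTER}" \
    "https://monitoring.googleapis.com/v3/projects/${project_id}/metricDescriptors"
}

# Example: list_gke_system_metrics PROJECT_ID
```

Passing the filter with `--data-urlencode` avoids hand-escaping the quotes and parentheses in the query string.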

Pricing

Cloud Monitoring does not charge for the ingestion of GKE system metrics. For more information, see Cloud Monitoring pricing.

Configuring collection of system metrics

To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information like CPU usage, memory usage, and disk usage is not available for a cluster when viewing observability metrics.

For GKE Autopilot clusters, you cannot disable the collection of system metrics.
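The two settings described above might be wrapped as follows. This is a minimal sketch: the function names are invented for this example, and `CLUSTER_NAME` and `COMPUTE_LOCATION` are placeholders for your own values.

```shell
# Hypothetical helper: send only system metrics to Cloud Monitoring.
enable_system_metrics() {
  gcloud container clusters update "$1" \
    --location="$2" \
    --monitoring=SYSTEM
}

# Hypothetical helper: stop sending metrics to Cloud Monitoring.
# Applies to Standard clusters only; Autopilot does not accept NONE.
disable_metric_collection() {
  gcloud container clusters update "$1" \
    --location="$2" \
    --monitoring=NONE
}

# Example: enable_system_metrics CLUSTER_NAME COMPUTE_LOCATION
```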

See Observability for GKE for more details about Cloud Monitoring integration with GKE.

To configure the collection of system metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

List of system metrics

System metrics include metrics from essential system components important for Kubernetes. For a list of these metrics, see GKE system metrics.

If you enable Cloud Monitoring for your cluster, then you can't disable system monitoring (--monitoring=SYSTEM).

Metrics enabled by default in GKE Enterprise

In the following table, a checkmark indicates which metrics are enabled by default when you create and register a new cluster in a project with GKE Enterprise enabled:

Metric name	Autopilot	Standard
System
API server
Scheduler
Controller Manager
Persistent volume (Storage)
Pods
Deployment
StatefulSet
DaemonSet
HorizontalPodAutoscaler
cAdvisor
Kubelet
NVIDIA Data Center GPU Manager (DCGM) metrics

All registered clusters in a project that has GKE Enterprise enabled can use the packages for control plane metrics, kube state metrics, and cAdvisor/Kubelet metrics without any additional charges. Otherwise, these metrics incur Cloud Monitoring charges.

Troubleshooting system metrics

If system metrics are not available in Cloud Monitoring as expected, see Troubleshoot system metrics.

Package: Control plane metrics

You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.

For more information, see Collect and view control plane metrics.
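For illustration, enabling this package on an existing cluster could look like the following sketch. The function name is an assumption made for this example. The `--monitoring` values come from the Available metrics table in this document, and `SYSTEM` is listed again because the flag replaces the previous set of values.

```shell
# Hypothetical helper: enable control plane metrics alongside system metrics.
enable_control_plane_metrics() {
  cluster_name="$1"
  location="$2"
  gcloud container clusters update "$cluster_name" \
    --location="$location" \
    --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
}

# Example: enable_control_plane_metrics CLUSTER_NAME COMPUTE_LOCATION
```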

Package: Kube state metrics

You can configure a GKE cluster to send a curated set of kube state metrics in Prometheus format to Cloud Monitoring. This package of kube state metrics includes metrics for Pods, Deployments, StatefulSets, DaemonSets, HorizontalPodAutoscaler resources, Persistent Volumes, and Persistent Volume Claims.

For more information, see Collect and view Kube state metrics.
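A possible way to enable the full kube state package is sketched below. The variable and function names are invented for this example. The sketch assumes that Managed Service for Prometheus should be enabled in the same call, mirroring the `--enable-managed-prometheus` flag used in the update example elsewhere in this document.

```shell
# Kube state components from the Available metrics table, as one list.
KUBE_STATE_COMPONENTS="STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA"

# Hypothetical helper: enable system metrics plus all kube state components.
enable_kube_state_metrics() {
  gcloud container clusters update "$1" \
    --location="$2" \
    --enable-managed-prometheus \
    --monitoring="SYSTEM,${KUBE_STATE_COMPONENTS}"
}

# Example: enable_kube_state_metrics CLUSTER_NAME COMPUTE_LOCATION
```

To collect only some of the kube state metrics, shorten the list in `KUBE_STATE_COMPONENTS`.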

Package: cAdvisor/Kubelet metrics

You can configure a GKE cluster to send a curated set of cAdvisor/Kubelet metrics in Prometheus format to Cloud Monitoring. The curated set of metrics is a subset of the large set of cAdvisor/Kubelet metrics built into every Kubernetes deployment by default. The curated cAdvisor/Kubelet package is designed to provide the most useful metrics, reducing ingestion volume and associated costs.

For more information, see Collect and view cAdvisor/Kubelet metrics.
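The package can also be requested when a cluster is created. The following is a sketch only: the function name is an assumption, the `--enable-managed-prometheus` flag is included on the assumption that this package relies on Managed Service for Prometheus, and version requirements are not checked.

```shell
# Hypothetical helper: create a cluster with system and cAdvisor/Kubelet metrics.
create_cluster_with_cadvisor_kubelet() {
  gcloud container clusters create "$1" \
    --location="$2" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,CADVISOR,KUBELET
}

# Example: create_cluster_with_cadvisor_kubelet CLUSTER_NAME COMPUTE_LOCATION
```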

Package: NVIDIA Data Center GPU Manager (DCGM) metrics

You can monitor GPU utilization, performance, and health by configuring GKE to send NVIDIA Data Center GPU Manager (DCGM) metrics to Cloud Monitoring.

For more information, see Collect and view NVIDIA Data Center GPU Manager (DCGM) metrics.

Disable metric packages

You can disable the use of metric packages in the cluster. You might want to disable certain packages to reduce costs or if you are using an alternate mechanism for collecting the metrics, like Google Cloud Managed Service for Prometheus and an exporter.

Console

To disable the collection of metrics from the Details tab for the cluster, do the following:

In the Google Cloud console, go to the Kubernetes clusters page:
Go to Kubernetes clusters

If you use the search bar to find this page, then select the result whose subheading is Kubernetes Engine.
Click your cluster's name.
In the Features row labelled Cloud Monitoring, click the Edit icon.
In the Components drop-down menu, clear the metric components that you want to disable.
Click OK.
Click Save Changes.

gcloud

Open a terminal window with Google Cloud SDK and the Google Cloud CLI installed. One way to do this is to use Cloud Shell.
In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Call the gcloud container clusters update command and pass an updated set of values to the --monitoring flag. The set of values supplied to the --monitoring flag overrides any previous setting.

For example, to turn off the collection of all metrics except system metrics, run the following command:
```
gcloud container clusters update CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --enable-managed-prometheus \
    --monitoring=SYSTEM
```
This command disables the collection of any previously configured metric packages.
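After an update, you might want to confirm which components are still enabled. The following sketch reads them back from the cluster description. The function name is invented for this example, and the field path is assumed to follow the GKE API's `monitoringConfig` structure. The names that the API reports may differ from the values that the `--monitoring` flag accepts.

```shell
# Hypothetical helper: print the monitoring components enabled on a cluster.
show_monitoring_components() {
  gcloud container clusters describe "$1" \
    --location="$2" \
    --format="value(monitoringConfig.componentConfig.enableComponents)"
}

# Example: show_monitoring_components CLUSTER_NAME COMPUTE_LOCATION
```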

Terraform

To configure the collection of metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

Understanding your Monitoring bill

You can use Cloud Monitoring to identify the control plane or kube state metrics that are writing the largest numbers of samples. These metrics are contributing the most to your costs. After you identify the most expensive metrics, you can modify your scrape configs to filter these metrics appropriately.

The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on billable metrics without affecting observability. The Metrics Management page reports the following information:

Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
Data about labels and cardinality of metrics.
Number of reads for each metric.
Use of metrics in alerting policies and custom dashboards.
Rate of metric-write errors.

You can also use the Metrics Management page to exclude unneeded metrics, eliminating the cost of ingesting them.

To view the Metrics Management page, do the following:

In the Google Cloud console, go to the Metrics management page:
Go to Metrics management

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous day.

For more information about the Metrics Management page, see View and manage metric usage.

To identify which control plane or kube state metrics have the largest number of samples being ingested, do the following:

In the Google Cloud console, go to the Metrics management page:
Go to Metrics management

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
On the Billable samples ingested scorecard, click View charts.
Locate the Namespace Volume Ingestion chart, and then click More chart options.
In the Metric field, verify that the following resource and metric are selected:
Metric Ingestion Attribution and Samples written by attribution id.
In the Filters page, do the following:
1. In the Label field, verify that the value is attribution_dimension.
2. In the Comparison field, verify that the value is = (equals).
3. In the Value field, select cluster.
Clear the Group by setting.
Optionally, filter for only certain metrics. For example, control plane API server metrics all include "apiserver" as part of the metric name, and kube state Pod metrics all include "kube_pod" as part of the metric name, so you can filter for metrics containing those strings:
- Click Add Filter.
- In the Label field, select metric_type.
- In the Comparison field, select =~ (equals regex).
- In the Value field, enter .*apiserver.* or .*kube_pod.*.
Optionally, group the number of samples ingested by GKE region or project:
- Click Group by.
- Ensure metric_type is selected.
- To group by GKE region, select location.
- To group by project, select project_id.
- Click OK.
Optionally, group the number of samples ingested by GKE cluster name:
- Click Group by.
- To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.
- Click OK.
To see the ingestion volume for each of the metrics, in the toggle labeled Chart Table Both, select Both. The table shows the ingested volume for each metric in the Value column.

Click the Value column header twice to sort the metrics by descending ingestion volume.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Because the metrics in the observability packages are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested.
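To preview what the optional regex filters in the preceding steps would select, you can try the patterns locally. This sketch uses `grep` on a short list of metric type names that are illustrative only, not read from a project. It assumes that the `=~` comparison matches against the whole value, which is why the patterns are wrapped in `.*`.

```shell
# Illustrative metric type names (assumed for this sketch).
sample_metric_types='prometheus.googleapis.com/apiserver_request_total/counter
prometheus.googleapis.com/kube_pod_status_ready/gauge
prometheus.googleapis.com/scheduler_pending_pods/gauge
kubernetes.io/container/cpu/core_usage_time'

# Keep only the names that the regex matches in full (-x).
filter_metric_types() {
  printf '%s\n' "$sample_metric_types" | grep -E -x "$1"
}

filter_metric_types '.*apiserver.*'
filter_metric_types '.*kube_pod.*'
```

Each call prints only the names that contain the given string, so the first selects the API server metric and the second selects the kube state Pod metric.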

Other metrics

In addition to the system metrics and metric packages described in this document, Istio metrics are also available for GKE clusters. For pricing information, see Cloud Monitoring pricing.

Available metrics

The following table indicates supported values for the --monitoring flag for the create and update commands.

Source	`--monitoring` value	Metrics Collected
None	`NONE`	No metrics sent to Cloud Monitoring; no metric collection agent installed in the cluster. This value isn't supported for Autopilot clusters.
System	`SYSTEM`	Metrics from essential system components required for Kubernetes. For a complete list of the metrics, see Kubernetes metrics.
API server	`API_SERVER`	Metrics from `kube-apiserver`. For a complete list of the metrics, see API server metrics.
Scheduler	`SCHEDULER`	Metrics from `kube-scheduler`. For a complete list of the metrics, see Scheduler metrics.
Controller Manager	`CONTROLLER_MANAGER`	Metrics from `kube-controller-manager`. For a complete list of the metrics, see Controller Manager metrics.
Persistent volume (Storage)	`STORAGE`	Storage metrics from `kube-state-metrics`. Includes metrics for Persistent Volume and Persistent Volume Claims. For a complete list of the metrics, see Storage metrics.
Pod	`POD`	Pod metrics from `kube-state-metrics`. For a complete list of the metrics, see Pod metrics.
Deployment	`DEPLOYMENT`	Deployment metrics from `kube-state-metrics`. For a complete list of the metrics, see Deployment metrics.
StatefulSet	`STATEFULSET`	StatefulSet metrics from `kube-state-metrics`. For a complete list of the metrics, see StatefulSet metrics.
DaemonSet	`DAEMONSET`	DaemonSet metrics from `kube-state-metrics`. For a complete list of the metrics, see DaemonSet metrics.
HorizontalPodAutoscaler	`HPA`	HPA metrics from `kube-state-metrics`. For a complete list of the metrics, see HorizontalPodAutoscaler metrics.
cAdvisor	`CADVISOR`	cAdvisor metrics from the cAdvisor/Kubelet metrics package. For a complete list of the metrics, see cAdvisor metrics.
Kubelet	`KUBELET`	Kubelet metrics from the cAdvisor/Kubelet metrics package. For a complete list of the metrics, see Kubelet metrics.
NVIDIA Data Center GPU Manager (DCGM) metrics	`DCGM`	Metrics from NVIDIA Data Center GPU Manager (DCGM).
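
Because a misspelled value is easy to miss in a long comma-separated list, a small pre-check such as the following sketch can catch typos before the command is run. The helper is hypothetical and only compares against the values in the table above. It does not check whether a combination is supported for a given cluster mode or version.

```shell
# Values listed in the table above.
VALID_MONITORING_VALUES="NONE SYSTEM API_SERVER SCHEDULER CONTROLLER_MANAGER STORAGE POD DEPLOYMENT STATEFULSET DAEMONSET HPA CADVISOR KUBELET DCGM"

# Return 0 if every comma-separated entry is a known value, 1 otherwise.
validate_monitoring_value() {
  [ -n "$1" ] || { echo "empty --monitoring value" >&2; return 1; }
  for item in $(printf '%s' "$1" | tr ',' ' '); do
    case " $VALID_MONITORING_VALUES " in
      *" $item "*) ;;
      *) echo "unknown --monitoring value: $item" >&2; return 1 ;;
    esac
  done
  return 0
}

# Example: validate_monitoring_value "SYSTEM,POD,HPA" && echo "ok"
```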

You can also collect Prometheus-style metrics exposed by any GKE workload by using Google Cloud Managed Service for Prometheus, which lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.

What's next

Learn how to troubleshoot system metrics.