This document describes how to configure Google Kubernetes Engine (GKE) to send metrics to Cloud Monitoring. Metrics in Cloud Monitoring can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Cloud Monitoring API.
GKE provides several sources of metrics:
- System metrics: metrics from essential system components, describing low-level resources such as CPU, memory and storage.
- Google Cloud Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
- Packages of observability metrics:
  - Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
  - Kube state metrics: a curated set of metrics exported from the kube state service, used to monitor the state of Kubernetes objects like Pods, Deployments, and more. For the set of included metrics, see Use kube state metrics.
    The kube state package is a managed solution. If you need greater flexibility (for example, to collect additional metrics, manage scrape intervals, or scrape other resources), you can disable the package, if it is enabled, and deploy your own instance of the open source kube state metrics service. For more information, see the Google Cloud Managed Service for Prometheus exporter documentation for Kube state metrics.
  - cAdvisor/Kubelet: a curated set of cAdvisor and Kubelet metrics. For the set of included metrics, see Use cAdvisor/Kubelet metrics.
    The cAdvisor/Kubelet package is a managed solution. If you need greater flexibility (for example, to collect additional metrics, manage scrape intervals, or scrape other resources), you can disable the package, if it is enabled, and deploy your own instance of the open source cAdvisor/Kubelet metrics services.
  - NVIDIA Data Center GPU Manager (DCGM) metrics: metrics from DCGM that provide a comprehensive view of GPU health, performance, and utilization.
You can also configure automatic application monitoring for certain workloads.
System metrics
When a cluster is created, GKE by default collects certain metrics emitted by system components.
You have a choice whether or not to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.
All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
Pricing
Cloud Monitoring does not charge for the ingestion of GKE system metrics. For more information, see Cloud Monitoring pricing.
Configuring collection of system metrics
To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.
To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information like CPU usage, memory usage, and disk usage is not available for a cluster when viewing observability metrics.
For GKE Autopilot clusters, you cannot disable the collection of system metrics.
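As a rough sketch of the two cases above, the following script prints the corresponding gcloud commands instead of running them. The monitoring_cmd helper and the default cluster name and location are hypothetical, and the --location flag is borrowed from the update example later in this document. Copy a printed command into a terminal to apply it to a real cluster.

```shell
# Placeholders: replace with your own cluster name and location.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
COMPUTE_LOCATION="${COMPUTE_LOCATION:-us-central1}"

# Hypothetical helper: print (do not run) a gcloud command that sets
# the --monitoring flag on an existing cluster.
monitoring_cmd() {
  printf 'gcloud container clusters update %s --location=%s --monitoring=%s\n' \
    "$CLUSTER_NAME" "$COMPUTE_LOCATION" "$1"
}

monitoring_cmd SYSTEM   # enable system metric collection
monitoring_cmd NONE     # disable it (not possible on Autopilot clusters)
```

Printing the command first makes it easy to review the flag value before changing a live cluster.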
See Observability for GKE for more details about Cloud Monitoring integration with GKE.
To configure the collection of system metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster.
For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.
List of system metrics
System metrics include metrics from essential system components important for Kubernetes. For a list of these metrics, see GKE system metrics.
If you enable Cloud Monitoring for your cluster, then you can't disable system monitoring (--monitoring=SYSTEM).
Metrics enabled by default in GKE Enterprise
In the following table, a checkmark indicates which metrics are enabled by default when you create and register a new cluster in a project with GKE Enterprise enabled:
Metric name | Autopilot | Standard |
---|---|---|
System | | |
API server | | |
Scheduler | | |
Controller Manager | | |
Persistent volume (Storage) | | |
Pods | | |
Deployment | | |
StatefulSet | | |
DaemonSet | | |
HorizontalPodAutoscaler | | |
cAdvisor | | |
Kubelet | | |
NVIDIA Data Center GPU Manager (DCGM) metrics | | |
All registered clusters in a project that has GKE Enterprise enabled can use the packages for control plane metrics, kube state metrics, and cAdvisor/Kubelet metrics without any additional charges. Otherwise, these metrics incur Cloud Monitoring charges.
Troubleshooting system metrics
If system metrics are not available in Cloud Monitoring as expected, see Troubleshoot system metrics.
Package: Control plane metrics
You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.
For more information, see Collect and view control plane metrics.
Package: Kube state metrics
You can configure a GKE cluster to send a curated set of kube state metrics in Prometheus format to Cloud Monitoring. This package of kube state metrics includes metrics for Pods, Deployments, StatefulSets, DaemonSets, HorizontalPodAutoscaler resources, Persistent Volumes, and Persistent Volume Claims.
For more information, see Collect and view Kube state metrics.
Package: cAdvisor/Kubelet metrics
You can configure a GKE cluster to send a curated set of cAdvisor/Kubelet metrics in Prometheus format to Cloud Monitoring. The curated set of metrics is a subset of the large set of cAdvisor/Kubelet metrics built into every Kubernetes deployment by default. The curated cAdvisor/Kubelet package is designed to provide the most useful metrics, reducing ingestion volume and associated costs.
For more information, see Collect and view cAdvisor/Kubelet metrics.
Package: NVIDIA Data Center GPU Manager (DCGM) metrics
You can monitor GPU utilization, performance, and health by configuring GKE to send NVIDIA Data Center GPU Manager (DCGM) metrics to Cloud Monitoring.
For more information, see Collect and view NVIDIA Data Center GPU Manager (DCGM) metrics.
Disable metric packages
You can disable the use of metric packages in the cluster. You might want to disable certain packages to reduce costs or if you are using an alternate mechanism for collecting the metrics, like Google Cloud Managed Service for Prometheus and an exporter.
Console
To disable the collection of metrics from the Details tab for the cluster, do the following:
- In the Google Cloud console, go to the Kubernetes clusters page. If you use the search bar to find this page, then select the result whose subheading is Kubernetes Engine.
- Click your cluster's name.
- In the Features row labeled Cloud Monitoring, click the Edit icon.
- In the Components drop-down menu, clear the metric components that you want to disable.
- Click OK.
- Click Save Changes.
gcloud
- Open a terminal window with the Google Cloud SDK and the Google Cloud CLI installed. One way to do this is to use Cloud Shell.
  - In the Google Cloud console, activate Cloud Shell.
    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Call the gcloud container clusters update command and pass an updated set of values to the --monitoring flag. The set of values supplied to the --monitoring flag overrides any previous setting.
  For example, to turn off the collection of all metrics except system metrics, run the following command:

      gcloud container clusters update CLUSTER_NAME \
          --location=COMPUTE_LOCATION \
          --enable-managed-prometheus \
          --monitoring=SYSTEM

  This command disables the collection of any previously configured metric packages.
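Because the values passed to the --monitoring flag replace any previous setting, turning off a single package means passing every component that you want to keep. The following sketch shows one way to compute that list. The starting value is an assumed example and is not read from a real cluster, without_component is a hypothetical helper, and the resulting command is printed instead of run.

```shell
# Assumed example of the components currently enabled on a cluster.
current="SYSTEM,API_SERVER,SCHEDULER,POD"

# Hypothetical helper: remove one component from a comma-separated list.
without_component() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v -x -F "$2" | paste -s -d , -
}

new_value="$(without_component "$current" SCHEDULER)"

# Print the command instead of running it. CLUSTER_NAME and COMPUTE_LOCATION
# are placeholders to replace with your own values.
echo "gcloud container clusters update CLUSTER_NAME --location=COMPUTE_LOCATION --enable-managed-prometheus --monitoring=$new_value"
```

In this example the printed command keeps SYSTEM, API_SERVER, and POD, and drops only SCHEDULER.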
Terraform
To configure the collection of metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster.
For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.
Understanding your Monitoring bill
You can use Cloud Monitoring to identify the control plane or kube state metrics that are writing the largest numbers of samples. These metrics are contributing the most to your costs. After you identify the most expensive metrics, you can modify your scrape configs to filter these metrics appropriately.
The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on billable metrics without affecting observability. The Metrics Management page reports the following information:
- Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
- Data about labels and cardinality of metrics.
- Number of reads for each metric.
- Use of metrics in alerting policies and custom dashboards.
- Rate of metric-write errors.
You can also use the Metrics Management page to exclude unneeded metrics, eliminating the cost of ingesting them.
To view the Metrics Management page, do the following:
- In the Google Cloud console, go to the Metrics management page. If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous day.
For more information about the Metrics Management page, see View and manage metric usage.
To identify which control plane or kube state metrics have the largest number of samples being ingested, do the following:
- In the Google Cloud console, go to the Metrics management page. If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- On the Billable samples ingested scorecard, click View charts.
- Locate the Namespace Volume Ingestion chart, and then click More chart options.
- In the Metric field, verify that the following resource and metric are selected: Metric Ingestion Attribution and Samples written by attribution id.
- In the Filters page, do the following:
  - In the Label field, verify that the value is attribution_dimension.
  - In the Comparison field, verify that the value is = (equals).
  - In the Value field, select cluster.
- Clear the Group by setting.
- Optionally, filter for only certain metrics. For example, control plane API server metrics all include "apiserver" as part of the metric name, and kube state Pod metrics all include "kube_pod" as part of the metric name, so you can filter for metrics containing those strings:
  - Click Add Filter.
  - In the Label field, select metric_type.
  - In the Comparison field, select =~ (equals regex).
  - In the Value field, enter .*apiserver.* or .*kube_pod.*.
- Optionally, group the number of samples ingested by GKE region or project:
  - Click Group by.
  - Ensure metric_type is selected.
  - To group by GKE region, select location.
  - To group by project, select project_id.
  - Click OK.
- Optionally, group the number of samples ingested by GKE cluster name:
  - Click Group by.
  - To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.
  - Click OK.
- To see the ingestion volume for each of the metrics, in the toggle labeled Chart Table Both, select Both. The table shows the ingested volume for each metric in the Value column.
- Click the Value column header twice to sort the metrics by descending ingestion volume.
These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Because the metrics in the observability packages are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested.
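To preview what the optional regex filters in the preceding steps would select, the following sketch applies the same patterns to a few sample metric type strings with grep. The sample names are assumptions chosen for illustration and are not output from Cloud Monitoring, and filter_metrics is a hypothetical helper. The -x option makes grep match the whole line, on the assumption that the console filter also matches the full string, which the leading and trailing .* in the documented patterns suggest.

```shell
# Sample metric types, assumed for illustration only.
sample_metrics='prometheus.googleapis.com/apiserver_request_total/counter
prometheus.googleapis.com/kube_pod_status_phase/gauge
prometheus.googleapis.com/scheduler_pending_pods/gauge'

# Hypothetical helper: keep only the sample lines that fully match the
# extended regular expression passed as the first argument.
filter_metrics() {
  printf '%s\n' "$sample_metrics" | grep -E -x "$1"
}

filter_metrics '.*apiserver.*'   # keeps only the API server sample
filter_metrics '.*kube_pod.*'    # keeps only the kube state Pod sample
```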
Other metrics
In addition to the system metrics and metric packages described in this document, Istio metrics are also available for GKE clusters. For pricing information, see Cloud Monitoring pricing.
Available metrics
The following table indicates supported values for the --monitoring flag for the create and update commands.
Source | --monitoring value | Metrics Collected |
---|---|---|
None | NONE | No metrics sent to Cloud Monitoring; no metric collection agent installed in the cluster. This value isn't supported for Autopilot clusters. |
System | SYSTEM | Metrics from essential system components required for Kubernetes. For a complete list of the metrics, see Kubernetes metrics. |
API server | API_SERVER | Metrics from kube-apiserver. For a complete list of the metrics, see API server metrics. |
Scheduler | SCHEDULER | Metrics from kube-scheduler. For a complete list of the metrics, see Scheduler metrics. |
Controller Manager | CONTROLLER_MANAGER | Metrics from kube-controller-manager. For a complete list of the metrics, see Controller Manager metrics. |
Persistent volume (Storage) | STORAGE | Storage metrics from kube-state-metrics. Includes metrics for Persistent Volume and Persistent Volume Claims. For a complete list of the metrics, see Storage metrics. |
Pod | POD | Pod metrics from kube-state-metrics. For a complete list of the metrics, see Pod metrics. |
Deployment | DEPLOYMENT | Deployment metrics from kube-state-metrics. For a complete list of the metrics, see Deployment metrics. |
StatefulSet | STATEFULSET | StatefulSet metrics from kube-state-metrics. For a complete list of the metrics, see StatefulSet metrics. |
DaemonSet | DAEMONSET | DaemonSet metrics from kube-state-metrics. For a complete list of the metrics, see DaemonSet metrics. |
HorizontalPodAutoscaler | HPA | HPA metrics from kube-state-metrics. For a complete list of the metrics, see HorizontalPodAutoscaler metrics. |
cAdvisor | CADVISOR | cAdvisor metrics from the cAdvisor/Kubelet metrics package. For a complete list of the metrics, see cAdvisor metrics. |
Kubelet | KUBELET | Kubelet metrics from the cAdvisor/Kubelet metrics package. For a complete list of the metrics, see Kubelet metrics. |
NVIDIA Data Center GPU Manager (DCGM) metrics | DCGM | Metrics from NVIDIA Data Center GPU Manager (DCGM). |
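Values from the table can be combined by passing them to the --monitoring flag as a comma-separated list. The following is a minimal sketch, assuming a hypothetical monitoring_value helper, that checks each name against the table before joining them. It only catches misspelled names. It does not check other constraints, such as NONE being combined with other values or a package requiring a particular cluster version.

```shell
# Values taken from the table of supported --monitoring values.
SUPPORTED="NONE SYSTEM API_SERVER SCHEDULER CONTROLLER_MANAGER STORAGE POD DEPLOYMENT STATEFULSET DAEMONSET HPA CADVISOR KUBELET DCGM"

# Hypothetical helper: verify each argument against SUPPORTED, then print
# the arguments joined with commas. Returns 1 on the first unknown value.
monitoring_value() {
  result=""
  for component in "$@"; do
    case " $SUPPORTED " in
      *" $component "*) ;;
      *) echo "unsupported --monitoring value: $component" >&2; return 1 ;;
    esac
    result="${result:+$result,}$component"
  done
  printf '%s\n' "$result"
}

# Example: system metrics plus the three control plane sources.
monitoring_value SYSTEM API_SERVER SCHEDULER CONTROLLER_MANAGER
```

The printed string can then be used as the value of --monitoring in a gcloud container clusters create or update command.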
You can also collect Prometheus-style metrics exposed by any GKE workload by using Google Cloud Managed Service for Prometheus, which lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
What's next
- Learn how to troubleshoot system metrics.