Configure metrics collection


Google Kubernetes Engine (GKE) makes it easy to send metrics to Cloud Monitoring. Once in Cloud Monitoring, metrics can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Cloud Monitoring API.

GKE provides several sources of metrics:

  • System metrics: metrics from essential system components, describing low-level resources such as CPU, memory and storage.
  • Google Cloud Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
  • Packages of observability metrics:

    • Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
    • Kube state metrics: a curated set of metrics exported from the kube state service, used to monitor the state of Kubernetes objects like Pods, Deployments, and more; for the complete set, see Use kube state metrics.

      The kube state package is a managed solution. If you need greater flexibility—for example, if you need to manage scrape intervals or to scrape other resources—you can deploy your own instance of the open source kube state metrics service. For more information, see the Google Cloud Managed Service for Prometheus exporter documentation for Kube state metrics.

System metrics

When a cluster is created, GKE by default collects certain metrics emitted by system components.

You have a choice whether or not to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.

All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
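As a rough sketch of how the kubernetes.io prefix appears in practice, the following helper reads one system metric through the Cloud Monitoring API's timeSeries.list method. The function name, the chosen metric type, and the caller-supplied time window are illustrative assumptions, not part of this page; the helper also assumes that curl and an authenticated gcloud CLI are available.

```shell
# List time series for one GKE system metric through the
# Cloud Monitoring API (projects.timeSeries.list).
# Usage: list_system_metric PROJECT_ID START_TIME END_TIME
# Times are RFC 3339 timestamps, for example 2024-01-01T00:00:00Z.
list_system_metric() (
  project_id="$1"
  start_time="$2"
  end_time="$3"
  # Assumed example metric; substitute any metric type under kubernetes.io/.
  filter='metric.type = "kubernetes.io/container/cpu/core_usage_time"'
  curl -sS -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=${filter}" \
    --data-urlencode "interval.startTime=${start_time}" \
    --data-urlencode "interval.endTime=${end_time}" \
    "https://monitoring.googleapis.com/v3/projects/${project_id}/timeSeries"
)
```

Calling the function with a project ID and a short time window returns JSON; the helper only defines the request and does not run anything until you call it.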

Pricing

Cloud Monitoring does not charge for the ingestion of GKE system metrics. For more information, see Cloud Monitoring pricing.

Configuring collection of system metrics

To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information such as CPU usage, memory usage, and disk usage is not available for a cluster in the Observability tab or the GKE section of the Google Cloud console.

For GKE Autopilot clusters, you cannot disable the collection of system metrics.
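As a minimal sketch of the two settings described above, the following helper prints, rather than runs, the update command for either value. The function name and the sample cluster name and location are hypothetical; only the --monitoring and --location flags and the SYSTEM and NONE values come from this page.

```shell
# Print the gcloud command that sets system metric collection for a cluster.
# Usage: system_metrics_cmd CLUSTER_NAME COMPUTE_LOCATION SYSTEM|NONE
# The command is only printed, so you can review it before running it.
system_metrics_cmd() (
  cluster="$1"
  location="$2"
  mode="$3"
  case "$mode" in
    SYSTEM|NONE) ;;
    *) echo "mode must be SYSTEM or NONE" >&2; exit 2 ;;
  esac
  printf 'gcloud container clusters update %s --location=%s --monitoring=%s\n' \
    "$cluster" "$location" "$mode"
)

# Example with hypothetical names:
system_metrics_cmd example-cluster us-central1 SYSTEM
```

Keep in mind that the NONE value is not accepted for Autopilot clusters, because system metric collection cannot be disabled there.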

See Observability for GKE for more details about Cloud Monitoring integration with GKE.

To configure the collection of system metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

List of system metrics

System metrics include metrics from essential system components important for Kubernetes. For a list of these metrics, see GKE system metrics.

Troubleshooting system metrics

If system metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm that the metrics agent has sufficient memory

In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the DaemonSet crashes repeatedly, you can check the termination reason with the following instructions:

  1. Get the names of the GKE metrics agent Pods:

    kubectl get pods -n kube-system -l component=gke-metrics-agent
    

    Find the Pod with the status CrashLoopBackOff.

    The output is similar to the following:

    NAME                    READY STATUS           RESTARTS AGE
    gke-metrics-agent-5857x 0/1   CrashLoopBackOff 6        12m
    
  2. Describe the Pod that has the status CrashLoopBackOff:

    kubectl describe pod POD_NAME -n kube-system
    

    Replace POD_NAME with the name of the Pod from the previous step.

    If the termination reason of the Pod is OOMKilled, the agent needs additional memory.

    The output is similar to the following:

      containerStatuses:
      ...
      lastState:
        terminated:
          ...
          exitCode: 1
          finishedAt: "2021-11-22T23:36:32Z"
          reason: OOMKilled
          startedAt: "2021-11-22T23:35:54Z"
    
  3. Add a node label to the node with the failing metrics agent. You can use either a persistent or temporary node label. We recommend you try adding an additional 20 MB. If the agent keeps crashing, you can run this command again, replacing the node label with one requesting a higher amount of additional memory.

    To update a node pool with a persistent label, run the following command:

    gcloud container node-pools update NODEPOOL_NAME \
        --cluster=CLUSTER_NAME \
        --node-labels=ADDITIONAL_MEMORY_NODE_LABEL \
        --location=COMPUTE_LOCATION
    

    Replace the following:

    • NODEPOOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of the existing cluster.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the following:
      • To add 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
      • To add 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20
      • To add 50 MB: cloud.google.com/gke-metrics-agent-scaling-level=50
      • To add 100 MB: cloud.google.com/gke-metrics-agent-scaling-level=100
      • To add 200 MB: cloud.google.com/gke-metrics-agent-scaling-level=200
      • To add 500 MB: cloud.google.com/gke-metrics-agent-scaling-level=500
    • COMPUTE_LOCATION: the Compute Engine location of the cluster.

    Alternatively, you can add a temporary node label that won't persist after an upgrade by using the following command:

    kubectl label node/NODE_NAME \
    ADDITIONAL_MEMORY_NODE_LABEL --overwrite
    

    Replace the following:

    • NODE_NAME: the name of the node of the affected metrics agent.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the values from the preceding example.
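The troubleshooting steps above can be sketched as two small helpers. The function names are hypothetical; the first uses kubectl's jsonpath output to read only the termination reason instead of scanning the full describe output, and the second builds one of the node labels listed in step 3 while rejecting sizes that are not in that list.

```shell
# Print the last termination reason of each container in a metrics agent Pod.
# Usage: agent_termination_reason POD_NAME
agent_termination_reason() (
  kubectl get pod "$1" -n kube-system \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
)

# Print the node label that requests the given amount of additional memory.
# Only the sizes listed in step 3 are accepted.
# Usage: scaling_label MEGABYTES
scaling_label() (
  case "$1" in
    10|20|50|100|200|500)
      printf 'cloud.google.com/gke-metrics-agent-scaling-level=%s\n' "$1" ;;
    *)
      echo "unsupported size: $1 (use 10, 20, 50, 100, 200, or 500)" >&2
      exit 2 ;;
  esac
)

# Example: start with the recommended 20 MB.
scaling_label 20
```

If agent_termination_reason reports OOMKilled, you could pass the printed label to either command from step 3, for example kubectl label node/NODE_NAME "$(scaling_label 20)" --overwrite.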

Package: Control plane metrics

You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.

Requirements

Sending metrics emitted by Kubernetes control plane components to Cloud Monitoring requires GKE control plane version 1.22.13 or later and requires that the collection of system metrics be enabled.
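One way to check the version requirement from a script is sketched below. The helper name is hypothetical, the comparison ignores any suffix such as -gke.900, and it relies on the version sort provided by sort -V (available in GNU coreutils), so treat it as an illustration rather than an official check.

```shell
# Return success if a GKE version is at least the required version.
# Only the numeric MAJOR.MINOR.PATCH part is compared; a suffix such as
# "-gke.900" is stripped first. Relies on "sort -V" (version sort).
# Usage: version_at_least CURRENT REQUIRED
version_at_least() (
  current="${1%%-*}"
  required="${2%%-*}"
  lowest="$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n 1)"
  [ "$lowest" = "$required" ]
)

# Hypothetical version string for illustration:
if version_at_least "1.27.4-gke.900" "1.22.13"; then
  echo "control plane metrics supported"
fi
```

To obtain the real control plane version, you could run gcloud container clusters describe CLUSTER_NAME --location=COMPUTE_LOCATION --format="value(currentMasterVersion)" and pass the result as the first argument.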

Configuring collection of control plane metrics

To enable Kubernetes control plane metrics in an existing GKE cluster, follow these steps:

Console

You can enable control plane metrics for a cluster either from the Observability tab for the cluster or from the Details tab for the cluster. When you use the Observability tab, you can preview the available charts and metrics before you enable the metric package.

To enable control plane metrics from the Observability tab for the cluster, do the following:

  1. In the navigation panel of the Google Cloud console, select Kubernetes Engine, and then select Clusters:

    Go to Kubernetes Clusters

  2. Click your cluster's name and then select the Observability tab.

  3. Select Control Plane from the list of features.

  4. Click Enable package.

    If the control plane metrics are already enabled, then you see a set of charts for control plane metrics instead.

To enable control plane metrics from the Details tab for the cluster, do the following:

  1. In the navigation panel of the Google Cloud console, select Kubernetes Engine, and then select Clusters:

    Go to Kubernetes Clusters

  2. Click your cluster's name.

  3. In the Features row labelled Cloud Monitoring, click the Edit icon.

  4. In the Edit Cloud Monitoring dialog that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the Components drop-down menu, select the control plane components from which you would like to collect metrics: API Server, Scheduler, or Controller Manager.

  6. Click OK.

  7. Click Save Changes.

gcloud

  1. Open a terminal window with Google Cloud SDK and the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Pass one or more of the values API_SERVER, SCHEDULER, or CONTROLLER_MANAGER to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

    For example, to collect metrics from the API server, scheduler, and controller manager, run this command:

    gcloud container clusters update CLUSTER_NAME \
        --location=COMPUTE_LOCATION \
        --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
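
After updating the cluster, you may want to confirm which components are enabled. The following sketch reads the cluster's monitoring configuration with gcloud container clusters describe. The helper names are hypothetical, and the field path monitoringConfig.componentConfig.enableComponents and the list separators handled here are assumptions about the describe output. The names printed come from the GKE API and are not guaranteed to match the --monitoring flag values one for one (system metrics, for instance, may appear as SYSTEM_COMPONENTS), so inspect the list rather than feeding it back into the flag.

```shell
# Print the monitoring components currently enabled on a cluster, one per line.
# Usage: enabled_components CLUSTER_NAME COMPUTE_LOCATION
enabled_components() (
  gcloud container clusters describe "$1" --location="$2" \
    --format="value(monitoringConfig.componentConfig.enableComponents)" |
    tr ';,' '\n\n' | sed '/^$/d'
)

# Return success if the named component appears in the enabled list.
# Usage: component_enabled CLUSTER_NAME COMPUTE_LOCATION COMPONENT
component_enabled() (
  enabled_components "$1" "$2" | grep -qxF "$3"
)
```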
    

Terraform

To configure the collection of Kubernetes control plane metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

Using control plane metrics

For more information about these metrics, see Use control plane metrics.

Dashboards for visualizing control plane metrics are available on the GKE Observability tab in the Google Cloud console. For information about these dashboards, see View observability metrics.

Package: Kube state metrics

You can configure a GKE cluster to send a curated set of kube state metrics in Prometheus format to Cloud Monitoring. This package of kube state metrics includes metrics for Pods, Deployments, StatefulSets, DaemonSets, HorizontalPodAutoscaler resources, Persistent Volumes, and Persistent Volume Claims.

For GKE Autopilot clusters starting with version 1.27.4-gke.900, the Kube state metrics package is enabled by default.

Requirements

To collect kube state metrics, your GKE cluster must meet the following requirements:

You can enable system metrics and Google Cloud Managed Service for Prometheus at the same time that you enable the package of kube state metrics. Google Cloud Managed Service for Prometheus managed collection is enabled by default on new clusters.

Configuring collection of kube state metrics

To enable kube state metrics in an existing GKE cluster, follow these steps:

Console

You can enable kube state metrics from the Observability tab for either a cluster or a Deployment within a cluster. You can also preview the available charts and metrics before you enable the metric package.

On the Observability tab for a cluster, the set of charts for kube state metrics is divided across two items in the filter menu:

  • Workloads State: includes the metrics for Pods, Deployments, StatefulSets, DaemonSets, and HorizontalPodAutoscaler resources.
  • Storage > Persistent: includes the metrics for Persistent Volumes and Persistent Volume Claims.

You can enable either or both sets of metrics.

To enable kube state metrics from the Observability tab for a cluster, do the following:

  1. In the navigation panel of the Google Cloud console, select Kubernetes Engine, and then select Clusters:

    Go to Kubernetes Clusters

  2. Click your cluster's name and then select the Observability tab.

  3. Select either Workloads State or Storage > Persistent from the list of features.

  4. Click Enable package.

    If the kube state metrics package is already enabled, then you see a set of charts for kube state metrics instead.

To enable kube state metrics from the Observability tab for a Deployment, do the following:

  1. In the navigation panel of the Google Cloud console, select Kubernetes Engine, and then select Workloads:

    Go to Kubernetes Workloads

  2. Click the name of your Deployment and then select the Observability tab.

  3. Select Kube State from the list of features.

  4. Click Enable package. The package is enabled for the entire cluster.

    If the kube state metrics package is already enabled, then you see a set of charts for metrics from Pods, Deployments, and Horizontal Pod Autoscalers.

To configure kube state metrics from the Details tab for the cluster, do the following:

  1. In the navigation panel of the Google Cloud console, select Kubernetes Engine, and then select Clusters:

    Go to Kubernetes Clusters

  2. Click your cluster's name.

  3. In the Features row labelled Cloud Monitoring, click the Edit icon.

  4. In the Edit Cloud Monitoring dialog that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the Components drop-down menu, select the kube state components from which you would like to collect metrics:

    • Persistent Volume (Storage)
    • Pods
    • Deployment
    • StatefulSet
    • DaemonSet
    • Horizontal Pod Autoscaler
  6. Click OK.

  7. Click Save Changes.

gcloud

  1. Open a terminal window with Google Cloud SDK and the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Pass one or more of the following values to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands:

    • DAEMONSET
    • DEPLOYMENT
    • HPA
    • POD
    • STATEFULSET
    • STORAGE — this option includes metrics for Persistent Volume and Persistent Volume Claims

    For example, to collect metrics for Deployments and Pods in an existing cluster, run the following command:

    gcloud container clusters update CLUSTER_NAME \
        --location=COMPUTE_LOCATION \
        --enable-managed-prometheus \
        --monitoring=SYSTEM,DEPLOYMENT,POD
    

    The set of values supplied to the --monitoring flag overrides any previous setting. In the preceding example, if the cluster had been previously configured to collect DAEMONSET metrics, the example command turns off collection of those metrics.
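Because the flag replaces the whole set, one defensive pattern is to merge the values you already use with the ones you want to add before running the update. The helper below is a hypothetical sketch of that merge; it works purely on the comma-separated strings you give it and does not query the cluster.

```shell
# Merge two comma-separated lists of --monitoring values, dropping duplicates
# and keeping first-seen order.
# Usage: merge_components "SYSTEM,DAEMONSET" "DEPLOYMENT,POD"
merge_components() (
  printf '%s,%s\n' "$1" "$2" |
    tr ',' '\n' |
    awk 'NF && !seen[$0]++' |
    paste -sd ',' -
)

# Keep the previously configured DAEMONSET metrics while adding two more.
# Prints: SYSTEM,DAEMONSET,DEPLOYMENT,POD
merge_components "SYSTEM,DAEMONSET" "SYSTEM,DEPLOYMENT,POD"
```

The printed string can then be supplied as the value of --monitoring in the update command from the preceding example.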

Terraform

To configure the collection of kube state metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

Using kube state metrics

See Use kube state metrics for the following:

  • Information about querying your kube state metrics.
  • Tables of kube state metrics.

Dashboards for visualizing kube state metrics are available on the GKE Observability tab in the Google Cloud console. For information about these dashboards, see View observability metrics.

Pricing and quotas for observability packages

The information in this section applies to the following observability packages: control plane metrics and kube state metrics.

GKE control plane metrics and kube state metrics use Google Cloud Managed Service for Prometheus to load metrics into Cloud Monitoring. Cloud Monitoring charges for the ingestion of these metrics are based on the number of samples ingested. However, these metrics are free of charge for the registered clusters that belong to a project that has GKE Enterprise edition enabled.

For more information, see Cloud Monitoring pricing.

Understanding your Monitoring bill

You can use Cloud Monitoring to identify the control plane or kube state metrics that are writing the largest numbers of samples. These metrics are contributing the most to your costs. After you identify the most expensive metrics, you can modify your scrape configs to filter these metrics appropriately.

The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on chargeable metrics without affecting observability. The Metrics Management page reports the following information:

  • Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
  • Data about labels and cardinality of metrics.
  • Use of metrics in alerting policies and custom dashboards.
  • Rate of metric-write errors.

To view the Metrics Management page, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics management:

    Go to Metrics management

  2. In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous day.

For more information about the Metrics Management page, see View and manage metric usage.

To identify which control plane or kube state metrics have the largest number of samples being ingested, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics management:

    Go to Metrics management

  2. On the Billable samples ingested scorecard, click View charts.

  3. Locate the Namespace Volume Ingestion chart, and then click More chart options.

  4. In the Metric field, verify that the following resource and metric are selected:
    Metric Ingestion Attribution and Samples written by attribution id.

  5. In the Filters page, do the following:

    1. In the Label field, verify that the value is attribution_dimension.

    2. In the Comparison field, verify that the value is = (equals).

    3. In the Value field, select cluster.

  6. Clear the Group by setting.

  7. Optionally, filter for only certain metrics. For example, control plane API server metrics all include "apiserver" as part of the metric name, and kube state Pod metrics all include "kube_pod" as part of the metric name, so you can filter for metrics containing those strings:

    • Click Add Filter.

    • In the Label field, select metric_type.

    • In the Comparison field, select =~ (equals regex).

    • In the Value field, enter .*apiserver.* or .*kube_pod.*.

  8. Optionally, group the number of samples ingested by GKE region or project:

    • Click Group by.

    • Ensure metric_type is selected.

    • To group by GKE region, select location.

    • To group by project, select project_id.

    • Click OK.

  9. Optionally, group the number of samples ingested by GKE cluster name:

    • Click Group by.

    • To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.

    • Click OK.

  10. To see the ingestion volume for each of the metrics, in the toggle labeled Chart Table Both, select Both. The table shows the ingested volume for each metric in the Value column.

    Click the Value column header twice to sort the metrics by descending ingestion volume.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Because the metrics in the observability packages are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested.
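To preview what the regular expressions from step 7 would select, you could test them locally against a list of metric types. This sketch assumes that the filter is matched against the whole metric type, which is why grep is run with -x. The helper name is hypothetical, and the sample metric names are illustrative only; the real list comes from your project.

```shell
# Count metric types that fully match a regular expression.
# Reads one metric type per line on standard input.
# Usage: count_matching REGEX
count_matching() (
  grep -Exc "$1"
)

# Illustrative metric names only.
sample_metrics='prometheus.googleapis.com/apiserver_request_total/counter
prometheus.googleapis.com/kube_pod_status_ready/gauge
prometheus.googleapis.com/kube_deployment_spec_replicas/gauge'

# Each pattern matches exactly one of the three sample names.
printf '%s\n' "$sample_metrics" | count_matching '.*apiserver.*'
printf '%s\n' "$sample_metrics" | count_matching '.*kube_pod.*'
```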

Quota

Control plane metrics and kube state metrics consume the "Time series ingestion requests per minute" quota of the Cloud Monitoring API. Before enabling the metrics packages, check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching that quota's limit, you can request a quota-limit increase before enabling either observability package.

Other metrics

In addition to the system metrics and metric packages described in this document, Istio metrics are also available for GKE clusters. For pricing information, see Cloud Monitoring pricing.