Manage GKE metrics

Google Kubernetes Engine (GKE) makes it easy to send metrics to Cloud Monitoring. Once in Cloud Monitoring, metrics can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Cloud Monitoring API.

GKE provides several sources of metrics:

  • System metrics: metrics from essential system components, describing low-level resources such as CPU, memory and storage.
  • Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
  • Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
  • Workload metrics (Deprecated): metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application).

System metrics

When a cluster is created, GKE by default collects certain metrics emitted by system components.

You can choose whether to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.

All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
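
As a rough illustration of the prefix, the following shell sketch checks whether a metric type falls under the system metrics namespace. The helper name is_system_metric is hypothetical. The commented curl call assumes the Cloud Monitoring API's metricDescriptors list method with a starts_with filter, and PROJECT_ID is a placeholder.

```shell
#!/bin/sh
# Succeeds if the metric type carries the GKE system metrics prefix.
# (is_system_metric is an illustrative helper, not part of any Google tool.)
is_system_metric() {
  case "$1" in
    kubernetes.io/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: list system metric descriptors through the Cloud Monitoring API
# (assumes gcloud is authenticated; not run here):
#   curl -s -G \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     --data-urlencode 'filter=metric.type = starts_with("kubernetes.io/")' \
#     "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"
```

The helper only inspects the string, so it can be used to sort metric types from any source into system metrics and everything else.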

Pricing

Cloud Monitoring does not charge for the ingestion of GKE system metrics. For more information, see Cloud Monitoring pricing.

Configuring collection of system metrics

To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information like CPU usage, memory usage, and disk usage is not available for a cluster in the Observability tab or the GKE section of the Google Cloud console. Additionally, the Cloud Monitoring GKE Dashboard does not contain information about the cluster.
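
The two settings above can be sketched as follows. The helper monitoring_update_cmd is hypothetical and only prints the gcloud command rather than running it; the cluster name and zone in the usage comments are placeholders.

```shell
#!/bin/sh
# Print (do not run) the gcloud command that sets the --monitoring flag.
# $1: cluster name, $2: zone, $3: value for --monitoring (SYSTEM or NONE here).
# (monitoring_update_cmd is an illustrative helper, not a gcloud feature.)
monitoring_update_cmd() {
  printf 'gcloud container clusters update %s --zone=%s --monitoring=%s\n' \
    "$1" "$2" "$3"
}

# Enable system metric collection:
#   monitoring_update_cmd CLUSTER_NAME ZONE SYSTEM
# Disable system metric collection:
#   monitoring_update_cmd CLUSTER_NAME ZONE NONE
```

Printing the command first makes it easy to review before applying it; piping the output to sh would run it against the cluster.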

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring integration with GKE.

To configure the collection of system metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

List of system metrics

System metrics include metrics from essential system components important for core Kubernetes functionality. For a list of these metrics, see GKE system metrics.

Troubleshooting system metrics

If system metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm that the metrics agent has sufficient memory

In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the agent's DaemonSet Pods crash repeatedly, you can check the termination reason with the following instructions:

  1. Get the names of the GKE metrics agent Pods:

    kubectl get pods -n kube-system -l component=gke-metrics-agent
    

    Find the Pod with the status CrashLoopBackOff.

    The output is similar to the following:

    NAME                    READY STATUS           RESTARTS AGE
    gke-metrics-agent-5857x 0/1   CrashLoopBackOff 6        12m
    
  2. Describe the Pod that has the status CrashLoopBackOff:

    kubectl describe pod POD_NAME -n kube-system
    

    Replace POD_NAME with the name of the Pod from the previous step.

    If the termination reason of the Pod is OOMKilled, the agent needs additional memory.

    The output is similar to the following:

      containerStatuses:
      ...
      lastState:
        terminated:
          ...
          exitCode: 1
          finishedAt: "2021-11-22T23:36:32Z"
          reason: OOMKilled
          startedAt: "2021-11-22T23:35:54Z"
    
  3. Add a node label to the node with the failing metrics agent. You can use either a persistent or temporary node label.

    To update a node pool with a persistent label, run the following command:

    gcloud container node-pools update NODEPOOL_NAME \
     --cluster=CLUSTER_NAME \
     --node-labels=ADDITIONAL_MEMORY_NODE_LABEL \
     --zone ZONE
    

    Replace the following:

    • NODEPOOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of the existing cluster.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the following additional memory node labels:
      • To add an additional 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
      • To add an additional 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20
    • ZONE: the zone in which the cluster is running.

    Alternatively, you can add a temporary node label that does not persist after an upgrade by using the following command:

    kubectl label node/NODE_NAME \
    ADDITIONAL_MEMORY_NODE_LABEL --overwrite
    

    Replace the following:

    • NODE_NAME: the name of the node of the affected metrics agent.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the values from the preceding example.
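
The troubleshooting steps above can be sketched as a pair of small shell helpers. Both function names are hypothetical. The first filters the table printed by kubectl get pods, and the second only prints the temporary-label command for review. The jsonpath expressions in the usage comments are assumptions about where kubectl exposes the termination reason and the node name.

```shell
#!/bin/sh
# Print the names of Pods whose STATUS column is CrashLoopBackOff.
# Reads the table output of `kubectl get pods` on stdin and skips the header.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Print (do not run) the kubectl command that adds a temporary scaling label.
# $1: node name, $2: scaling level; only 10 and 20 are listed in this document.
label_node_cmd() {
  case "$2" in
    10|20) ;;
    *) echo "scaling level must be 10 or 20" >&2; return 1 ;;
  esac
  printf 'kubectl label node/%s cloud.google.com/gke-metrics-agent-scaling-level=%s --overwrite\n' \
    "$1" "$2"
}

# Example usage against a live cluster (not run here):
#   kubectl get pods -n kube-system -l component=gke-metrics-agent | crashlooping_pods
#   kubectl get pod POD_NAME -n kube-system \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
#   kubectl get pod POD_NAME -n kube-system -o jsonpath='{.spec.nodeName}'
#   label_node_cmd NODE_NAME 10
```

Keeping the label step as a printed command rather than an executed one leaves the decision to apply it with the operator, which matters because the label changes the agent's memory allocation on that node.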

Control plane metrics

You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.

Requirements

Sending metrics emitted by Kubernetes control plane components to Cloud Monitoring requires GKE control plane version 1.22.13 or later and requires that the collection of system metrics be enabled.

Configuring collection of control plane metrics

To enable Kubernetes control plane metrics in an existing GKE cluster, follow these steps:

Console

  1. In the Google Cloud console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the row labeled Cloud Monitoring, click the Edit icon.

  4. In the Edit Cloud Monitoring dialog box that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the Components dropdown menu, select the control plane components from which you would like to collect metrics: API Server, Scheduler, or Controller Manager.

  6. Click OK.

  7. Click Save Changes.

gcloud

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Pass one or more of the values API_SERVER, SCHEDULER, or CONTROLLER_MANAGER to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

    For example, to collect metrics from the API server, scheduler, and controller manager, run this command:

    gcloud container clusters update [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
    
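
After the update, one way to confirm which components are enabled is to read the cluster's monitoring configuration back. The sketch below assumes that gcloud container clusters describe exposes the list under monitoringConfig.componentConfig.enableComponents and that the API reports the API server as APISERVER; treat both as assumptions to verify against your gcloud version. The helper has_component is hypothetical.

```shell
#!/bin/sh
# Succeeds if the component list on stdin contains the named component.
# Items may be separated by semicolons, commas, spaces, or tabs.
# (has_component is an illustrative helper, not a gcloud feature.)
has_component() {
  tr -s ';, \t' '\n\n\n\n' | grep -qxF -- "$1"
}

# Example usage against a live cluster (not run here):
#   gcloud container clusters describe CLUSTER_ID \
#     --zone=ZONE --project=PROJECT_ID \
#     --format="value(monitoringConfig.componentConfig.enableComponents)" \
#     | has_component APISERVER && echo "API server metrics enabled"
```

The helper matches whole names only, so SCHEDULER does not match a longer component name that happens to contain it.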

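Terraform

To configure the collection of control plane metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster.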

Using control plane metrics

See Use control plane metrics for the following:

Dashboards for visualizing control plane metrics are available on the GKE Observability tab in the Google Cloud console. For information about these dashboards, see View observability metrics.

Pricing

GKE control plane metrics use Google Cloud Managed Service for Prometheus to ingest metrics into Cloud Monitoring. Cloud Monitoring charges for the ingestion of GKE control plane metrics based on the number of samples ingested. For more information, see Cloud Monitoring pricing.

Understanding your Monitoring bill

To identify which control plane metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:

  1. In the Google Cloud console, select Monitoring:

    Go to Monitoring

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.

  4. Click Add Filter.

  5. In the Label field, select attribution_dimension.

  6. In the Comparison field, select = (equals).

  7. In the Value field, enter cluster.

  8. Click Done.

  9. Optionally, filter for only certain metrics. In particular, since API server metrics all include "apiserver" as part of the metric name and since Scheduler metrics all include "scheduler" as part of the metric name, you can restrict to metrics containing those strings:

    • Click Add Filter.

    • In the Label field, select metric_type.

    • In the Comparison field, select =~ (equals regex).

    • In the Value field, enter .*apiserver.* or .*scheduler.*.

    • Click Done.

  10. Optionally, group the number of samples ingested by GKE region or project:

    • Click Group by.

    • Ensure metric_type is selected.

    • To group by GKE region, select location.

    • To group by project, select project_id.

    • Click OK.

  11. Optionally, group the number of samples ingested by GKE cluster name:

    • Click Group by.

    • To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.

    • Click OK.

  12. Sort the list of metrics in descending order by clicking the column header Value above the list of metrics.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Because GKE control plane metrics are charged by the number of samples ingested, the metrics with the highest ingestion rates contribute most to your bill.
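
If you copy the resulting values out of Metrics Explorer, the same filter-and-sort logic can be reproduced offline. The sketch below assumes a hypothetical text format of one "metric_type sample_count" pair per line; the helper rank_metrics and the sample counts in the usage comment are illustrative only.

```shell
#!/bin/sh
# Rank metric types by sample count, highest first.
# stdin: lines of "<metric_type> <sample_count>" (a hypothetical export format).
# $1 (optional): regular expression the metric type must match, mirroring the
# "=~" filter in step 9; defaults to matching every metric type.
rank_metrics() {
  awk -v re="${1:-.*}" '$1 ~ re' | sort -k2,2nr
}

# Example with made-up counts (not real measurements):
#   printf '%s\n' \
#     'prometheus.googleapis.com/apiserver_request_total/counter 900' \
#     'prometheus.googleapis.com/scheduler_pending_pods/gauge 50' \
#     | rank_metrics apiserver
```

Passing apiserver or scheduler as the pattern restricts the ranking to one component, which matches the optional filter in the steps above.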

Quota

Control plane metrics consume the "Time series ingestion requests per minute" quota of the Cloud Monitoring API. Before enabling control plane metrics, check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching that quota's limit, you can request a quota-limit increase before enabling control plane metrics.

Other metrics

In addition to the system metrics and control plane metrics in this document, Istio metrics are also available for GKE clusters.