Troubleshooting system metrics


This page shows you how to resolve system metrics-related issues on your Google Kubernetes Engine (GKE) clusters.

If you need additional assistance, reach out to Cloud Customer Care.

Metrics from your cluster not appearing in Cloud Monitoring

Ensure that you've enabled the Monitoring API and the Logging API on your project. You should also confirm that you're able to view your project in the Cloud Monitoring overview in the Google Cloud console.
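
If you use the Google Cloud CLI, one way to confirm this is to list the enabled services and enable any that are missing. The API names used here, monitoring.googleapis.com and logging.googleapis.com, are the standard service names for Cloud Monitoring and Cloud Logging:

  gcloud services list --enabled | grep -E 'monitoring|logging'
  gcloud services enable monitoring.googleapis.com logging.googleapis.com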

If the issue persists, check the following potential causes:

  • Have you enabled monitoring on your cluster?

    Monitoring is enabled by default for clusters created from the Google Cloud console and from the Google Cloud CLI, but you can verify this by viewing the cluster's details in the Google Cloud console or by running the following command:

    gcloud container clusters describe CLUSTER_NAME
    

    The output from this command should include SYSTEM_COMPONENTS in the list of enableComponents in the monitoringConfig section, similar to the following example:

    monitoringConfig:
      componentConfig:
        enableComponents:
        - SYSTEM_COMPONENTS
    

    If monitoring isn't enabled, run the following command to enable it:

    gcloud container clusters update CLUSTER_NAME --monitoring=SYSTEM
    
  • How long has it been since your cluster was created or had monitoring enabled?

    It can take up to one hour for a new cluster's metrics to start appearing in Cloud Monitoring.

  • Is a heapster or gke-metrics-agent (the OpenTelemetry Collector) running in your cluster in the kube-system namespace?

    This Pod might be failing to schedule because your cluster is running low on resources. Check whether Heapster or the OpenTelemetry Collector is running by running kubectl get pods --namespace=kube-system and checking for Pods with heapster or gke-metrics-agent in the name.
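
    For example, the following command filters the Pod list for the metrics agents (piping to grep is just one way to narrow the output):

    kubectl get pods --namespace=kube-system | grep -E 'heapster|gke-metrics-agent'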

  • Is your cluster's control plane able to communicate with the nodes?

    Cloud Monitoring relies on that communication. You can check whether the control plane is communicating with the nodes by running the following command:

    kubectl logs POD_NAME
    

    If this command returns an error, then the SSH tunnels might be causing the issue. For troubleshooting steps, see Troubleshoot SSH issues.
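
    For example, you can run this check against one of the metrics agent Pods. The following sketch assumes the agent Pods carry the component=gke-metrics-agent label and reads logs from the first one it finds:

    kubectl logs --namespace=kube-system \
        $(kubectl get pods --namespace=kube-system \
            -l component=gke-metrics-agent -o name | head -n 1)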

  • Can the service account for the nodes write metrics?

    Each GKE node has a service account that it uses to authenticate to Google Cloud APIs and services, including Cloud Monitoring. By default, GKE uses the Compute Engine default service account for nodes, unless you specify a custom service account. Depending on your organization's policies, the Compute Engine default service account might have insufficient permissions to write to Cloud Monitoring.

    1. If you don't know which service account your nodes use, find the service account. For Autopilot clusters, the service account is listed in the cluster's overall details. For Standard clusters, the service account is specified per node pool.

      Console

      1. With the relevant project selected, go to the GKE Clusters page in the Google Cloud console:

        Go to Clusters

      2. Select the cluster to view its details page.

      3. Depending on the cluster mode, do one of the following:

        • For Autopilot clusters, check the value of Service account in the Security section.
        • For Standard clusters, go to the Nodes tab and click the name of the relevant node pool. Then on the Node pool details page, check the value of Service account in the Security section.

      gcloud

      For Autopilot clusters, run the following command:

      gcloud container clusters describe CLUSTER_NAME \
          --location=LOCATION \
          --flatten=nodeConfig \
          --format='value(serviceAccount)'
      

      For Standard clusters, run the following command. This returns all node pools in the cluster with their service accounts:

      gcloud container clusters describe CLUSTER_NAME \
          --location=LOCATION \
          --flatten=nodePools \
          --format='table(nodePools.name,nodePools.config.serviceAccount)'
      

      Replace the following:

      • CLUSTER_NAME: the name of your cluster
      • LOCATION: the location of the cluster

      In the output, check the service account email address:

      • If the service account is listed as default, the nodes use the Compute Engine default service account, which has the email address PROJECT_NUMBER-compute@developer.gserviceaccount.com.
      • If you see a different email address, the nodes are using a custom service account.
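
      If you need the project number to construct that default address, one way to look it up is:

      gcloud projects describe PROJECT_ID --format='value(projectNumber)'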

    2. To check if the service account has write permissions for Cloud Monitoring (or any permissions at all), do the following:

      Console

      1. With the relevant project selected, go to the IAM page in the Google Cloud console:

        Go to IAM

      2. In the Principal column, look for the email address of the node service account. If the service account email isn't in the list, the node service account has no permissions. If the account is in the list, check which roles are listed in the Role column and whether any of them include write permissions for Cloud Monitoring.

      gcloud

      Run the following command to list all the roles granted to the service account that you identified:

      gcloud projects get-iam-policy PROJECT_ID \
          --flatten=bindings \
          --filter=bindings.members:serviceAccount:SERVICE_ACCOUNT_EMAIL \
          --format='value(bindings.role)'
      

      Replace the following:

      • PROJECT_ID: your Google Cloud project ID
      • SERVICE_ACCOUNT_EMAIL: the email address of the node service account

      If the output is empty, the node service account has no permissions. If the output includes any roles, check if any of these roles include write permissions for Cloud Monitoring.

    If you don't see the necessary permissions after you follow these steps, you can grant relevant roles to the service account, or migrate your workloads to a new cluster or node pool that uses a different service account. The minimum role required to write system metrics is roles/monitoring.metricWriter. To learn more about the minimum roles required for GKE, see Harden your clusters.
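
    For example, one way to grant the minimum role to the node service account is the following command (adjust to your organization's IAM practices):

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
        --role="roles/monitoring.metricWriter"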

Confirm that the metrics agent has sufficient memory

If you've tried the preceding troubleshooting steps and the metrics still aren't appearing, the metrics agent might have insufficient memory.

In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the DaemonSet crashes repeatedly, you can check the termination reason with the following instructions:

  1. Get the names of the GKE metrics agent Pods:

    kubectl get pods -n kube-system -l component=gke-metrics-agent
    
  2. Find the Pod with the status CrashLoopBackOff.

    The output is similar to the following:

    NAME                    READY STATUS           RESTARTS AGE
    gke-metrics-agent-5857x 0/1   CrashLoopBackOff 6        12m
    
  3. Describe the Pod that has the status CrashLoopBackOff:

    kubectl describe pod POD_NAME -n kube-system
    

    Replace POD_NAME with the name of the Pod from the previous step.

    If the termination reason of the Pod is OOMKilled, the agent needs additional memory.

    The output is similar to the following:

      containerStatuses:
      ...
      lastState:
        terminated:
          ...
          exitCode: 1
          finishedAt: "2021-11-22T23:36:32Z"
          reason: OOMKilled
          startedAt: "2021-11-22T23:35:54Z"
    
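
    Alternatively, you can read just the termination reason for the agent container directly from the Pod status; the following jsonpath query is one possible approach and assumes the agent container is the first one listed:

    kubectl get pod POD_NAME -n kube-system \
        -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
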
  4. Add a node label to the node with the failing metrics agent. You can use either a persistent or temporary node label. We recommend that you try adding an additional 20 MB. If the agent keeps crashing, you can run this command again, replacing the node label with one requesting a higher amount of additional memory.

    To update a node pool with a persistent label, run the following command:

    gcloud container node-pools update NODEPOOL_NAME \
        --cluster=CLUSTER_NAME \
        --node-labels=ADDITIONAL_MEMORY_NODE_LABEL \
        --location=COMPUTE_LOCATION
    

    Replace the following:

    • NODEPOOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of the existing cluster.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the following values:
      • To add 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
      • To add 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20
      • To add 50 MB: cloud.google.com/gke-metrics-agent-scaling-level=50
      • To add 100 MB: cloud.google.com/gke-metrics-agent-scaling-level=100
      • To add 200 MB: cloud.google.com/gke-metrics-agent-scaling-level=200
      • To add 500 MB: cloud.google.com/gke-metrics-agent-scaling-level=500
    • COMPUTE_LOCATION: the Compute Engine location of the cluster.
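
    For example, to request an additional 20 MB for a node pool, the command might look like the following (the pool, cluster, and location names here are hypothetical placeholders):

    gcloud container node-pools update example-pool \
        --cluster=example-cluster \
        --node-labels=cloud.google.com/gke-metrics-agent-scaling-level=20 \
        --location=us-central1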

    Alternatively, you can add a temporary node label that won't persist after an upgrade by using the following command:

    kubectl label node/NODE_NAME \
        ADDITIONAL_MEMORY_NODE_LABEL --overwrite
    

    Replace the following:

    • NODE_NAME: the name of the node of the affected metrics agent.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the values from the preceding list.
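
    To confirm that the label was applied, one way to check is to display it as a column in the node listing:

    kubectl get nodes -L cloud.google.com/gke-metrics-agent-scaling-level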

What's next