Troubleshooting Autopilot clusters


This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.

Cluster issues

Cannot create a cluster: 0 nodes registered

The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:

All cluster resources were brought up, but: only 0 nodes out of 2 have registered.

To resolve the issue, do the following:

  1. Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:

    gcloud iam service-accounts describe SERVICE_ACCOUNT
    

    Replace SERVICE_ACCOUNT with the email address of the service account, such as my-iam-account@my-first-project.iam.gserviceaccount.com.

    If the service account is disabled, the output is similar to the following:

    disabled: true
    displayName: my-service-account
    email: my-service-account@my-project.iam.gserviceaccount.com
    ...
    
  2. If the service account is disabled, enable it:

    gcloud iam service-accounts enable SERVICE_ACCOUNT
    

If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:SERVICE_ACCOUNT" \
    --role roles/container.nodeServiceAccount
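
The check-and-enable steps above can be combined into a small script. The following is a minimal sketch, not an official tool: the function name ensure_sa_enabled is hypothetical, and the sketch assumes that gcloud prints an empty value for the disabled field when the account is enabled.

```shell
#!/bin/sh
# Sketch: check whether a service account is disabled and enable it if so.
# ensure_sa_enabled is a hypothetical helper name, not part of gcloud.
ensure_sa_enabled() {
  sa_email="$1"
  # Print only the "disabled" field; assumed to be empty when the field is absent.
  state="$(gcloud iam service-accounts describe "$sa_email" \
    --format='value(disabled)')" || return 1
  # Normalize case, because the boolean may be rendered as "True" or "true".
  state="$(printf '%s' "$state" | tr '[:upper:]' '[:lower:]')"
  if [ "$state" = "true" ]; then
    echo "Service account is disabled; enabling it."
    gcloud iam service-accounts enable "$sa_email"
  else
    echo "Service account is already enabled."
  fi
}
```

For example, call ensure_sa_enabled my-service-account@my-project.iam.gserviceaccount.com before you retry cluster creation.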

Scaling issues

Node scale up failed: Pod is at risk of not being scheduled

The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.

The error message in your Kubernetes event log is similar to the following:

LAST SEEN   TYPE      REASON          OBJECT                          MESSAGE
12s         Warning   FailedScaleUp   pod/pod-test-5b97f7c978-h9lvl   Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
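
To find events like the one above, you can filter the event list by reason. The following is a minimal sketch; the function name list_failed_scale_ups and the default namespace are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: list Warning events whose reason is FailedScaleUp.
# list_failed_scale_ups is a hypothetical helper name.
list_failed_scale_ups() {
  namespace="${1:-default}"
  kubectl get events --namespace "$namespace" \
    --field-selector reason=FailedScaleUp,type=Warning
}
```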

Serial port logging might be disabled at the organization level through an organization policy that enforces the compute.disableSerialPortLogging constraint. Serial port logging could also be disabled at the project or virtual machine (VM) instance level.

To resolve this issue, do the following:

  1. Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
  2. If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
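
The two steps above might look like the following from the command line. This is a sketch under assumptions: the function name enable_serial_port_logging is hypothetical, and the metadata update is likely to be rejected or to have no effect while the organization policy constraint is still enforced.

```shell
#!/bin/sh
# Sketch: inspect the effective organization policy, then enable serial port
# logging in project metadata. enable_serial_port_logging is a hypothetical name.
enable_serial_port_logging() {
  project_id="$1"
  # Show the effective policy for the constraint in this project.
  gcloud org-policies describe compute.disableSerialPortLogging \
    --project "$project_id" --effective
  # Set the project-wide metadata key. This call requires the
  # compute.projects.setCommonInstanceMetadata permission.
  gcloud compute project-info add-metadata \
    --project "$project_id" \
    --metadata serial-port-logging-enable=true
}
```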

Nodes fail to scale up: Pod zonal resources exceeded

The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.

The error message in your logs is similar to the following:

    "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
              ...

This error refers to a noScaleUp event, where node auto-provisioning did not provision any node group for the Pod in the zone.
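
If you export the relevant log entries to a local file, a quick tally of the messageId values can show how often this reason appears. The following is a rough sketch that uses plain text matching instead of a JSON parser; the function name count_message_ids is hypothetical, and the sketch counts every messageId in the file, not only those under napFailureReasons.

```shell
#!/bin/sh
# Sketch: count each distinct "messageId" value in an exported log file.
# count_message_ids is a hypothetical helper name.
count_message_ids() {
  log_file="$1"
  # Extract each "messageId": "..." pair, keep only the quoted value,
  # then count duplicates and sort by frequency.
  grep -o '"messageId": *"[^"]*"' "$log_file" \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort | uniq -c | sort -rn
}
```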

If you encounter this error, confirm the following:

Workload issues

Pods stuck in Pending state

A Pod might get stuck in the Pending status if you select a specific node for the Pod to use, but the sum of the resource requests in the Pod and in the DaemonSets that must run on that node exceeds the node's maximum allocatable capacity. In that case, the Pod can't be scheduled on the selected node and remains Pending.

To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.

You can also try scheduling your DaemonSets before you schedule your regular workload Pods.
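
To see which Pods are affected and why the scheduler rejected them, you can list Pending Pods and then inspect the events of one of them. The following is a minimal sketch; both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: find Pending Pods and inspect one of them.
# show_pending_pods and describe_pending_pod are hypothetical helper names.
show_pending_pods() {
  kubectl get pods --all-namespaces --field-selector status.phase=Pending
}

describe_pending_pod() {
  pod_name="$1"
  namespace="${2:-default}"
  # The Events section of the output typically explains why scheduling failed.
  kubectl describe pod "$pod_name" --namespace "$namespace"
}
```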

Consistently unreliable workload performance on a specific node

In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by cordoning and draining it with the following command:

kubectl drain NODE_NAME --ignore-daemonsets

Replace NODE_NAME with the name of the problematic node. You can find the node name by running kubectl get nodes.

GKE does the following:

  • Evicts existing workloads from the node and stops scheduling workloads on that node.
  • Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
  • Terminates any workloads that remain on the node and repairs or recreates the node over time.
  • If you use Autopilot, GKE shuts down and replaces the node immediately and ignores any configured PodDisruptionBudgets.
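
The drain step can be wrapped with a quick sanity check so that a mistyped node name fails early. The following is a minimal sketch; the function name drain_problem_node is hypothetical, and the Pod listing is included only so that you can compare placement before and after the drain.

```shell
#!/bin/sh
# Sketch: verify that the node exists, record its Pods, then drain it.
# drain_problem_node is a hypothetical helper name.
drain_problem_node() {
  node_name="$1"
  # Stop early if the node name doesn't match an existing node.
  kubectl get node "$node_name" >/dev/null || return 1
  # List the Pods that currently run on the node.
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$node_name"
  # Cordon the node and evict its Pods, leaving DaemonSet Pods in place.
  kubectl drain "$node_name" --ignore-daemonsets
}
```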