Troubleshooting Autopilot clusters

This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.

Cannot create a cluster: 0 nodes registered

The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled. Cluster creation fails with the following error message:

All cluster resources were brought up, but: only 0 nodes out of 2 have registered.

To resolve the issue, do the following:

  1. Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:

    gcloud iam service-accounts describe SERVICE_ACCOUNT
    

    Replace SERVICE_ACCOUNT with the email address of the service account, such as my-iam-account@my-first-project.iam.gserviceaccount.com.

    If the service account is disabled, the output is similar to the following:

    disabled: true
    displayName: my-service-account
    email: my-service-account@my-project.iam.gserviceaccount.com
    ...
    
  2. Enable the service account:

    gcloud iam service-accounts enable SERVICE_ACCOUNT
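
The two steps above can be combined into a small script. The following is a minimal sketch, not an official tool: the function names is_disabled and enable_if_disabled are hypothetical, and the sketch assumes that the describe output contains a top-level `disabled: true` line, as in the sample output above.

```shell
# Hypothetical helper: reads the output of
# `gcloud iam service-accounts describe` on stdin and succeeds only if it
# contains a top-level `disabled: true` line (assumed output shape).
is_disabled() {
  grep -Eq '^disabled:[[:space:]]*true[[:space:]]*$'
}

# Hypothetical wrapper: enables the service account only when the describe
# output reports it as disabled.
enable_if_disabled() {
  sa="$1"
  if gcloud iam service-accounts describe "$sa" | is_disabled; then
    gcloud iam service-accounts enable "$sa"
  else
    echo "Service account $sa is not reported as disabled; nothing to do."
  fi
}
```

For example, you might run `enable_if_disabled SERVICE_ACCOUNT`, where SERVICE_ACCOUNT is the same email address that you used in the steps above.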
    

Node scale up failed: Pod is at risk of not being scheduled

The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.

The error message in your Kubernetes event log is similar to the following:

LAST SEEN   TYPE      REASON          OBJECT                          MESSAGE
12s         Warning   FailedScaleUp   pod/pod-test-5b97f7c978-h9lvl   Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
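
To look for events like the one above, you can filter the event log by type and reason. The following is a minimal sketch that assumes kubectl is already configured for your Autopilot cluster. The function name list_failed_scale_ups and the `default` namespace fallback are illustrative and are not part of the original guidance.

```shell
# List Warning events whose reason is FailedScaleUp in one namespace,
# oldest first. The namespace defaults to "default" (an assumption).
list_failed_scale_ups() {
  namespace="${1:-default}"
  kubectl get events \
    --namespace "$namespace" \
    --field-selector type=Warning,reason=FailedScaleUp \
    --sort-by=.lastTimestamp
}
```

For example, `list_failed_scale_ups my-namespace` would show the matching events for Pods in a namespace named my-namespace.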

Serial port logging might be disabled at the organization level through an organization policy that enforces the compute.disableSerialPortLogging constraint. Serial port logging could also be disabled at the project or virtual machine (VM) instance level.

To resolve this issue, do the following:

  1. Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
  2. If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
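
The steps above can be sketched with the gcloud CLI as follows. This is a hedged example rather than a definitive procedure: the function names are hypothetical, and the metadata key `serial-port-logging-enable` comes from Compute Engine's serial port logging feature rather than from this page, so confirm it against the Compute Engine documentation before you rely on it.

```shell
# Show the effective organization policy for the constraint in a project.
# You need permission to view organization policies to run this.
show_serial_port_policy() {
  project_id="$1"
  gcloud resource-manager org-policies describe \
    compute.disableSerialPortLogging \
    --project "$project_id" --effective
}

# Enable serial port logging in project-wide metadata. This requires the
# compute.projects.setCommonInstanceMetadata IAM permission. The metadata
# key is an assumption that is not stated on this page.
enable_serial_port_logging() {
  project_id="$1"
  gcloud compute project-info add-metadata \
    --project "$project_id" \
    --metadata serial-port-logging-enable=true
}
```

For example, you might run `show_serial_port_policy PROJECT_ID` first, and run `enable_serial_port_logging PROJECT_ID` only if the constraint isn't enforced.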

Nodes fail to scale up: Pod zonal resources exceeded

The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.

The error message in your logs is similar to the following:

    "napFailureReasons": [
      {
        "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
        ...

This error refers to a noScaleUp event, where node auto-provisioning did not provision any node group for the Pod in the zone.
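
To find log entries that contain this message ID, you can search Cloud Logging from the command line. The following is a minimal sketch: the function names are hypothetical, and the `cluster-autoscaler-visibility` log name in the filter is an assumption about where these events are written, so adjust the filter if your entries appear in a different log.

```shell
# Build a Logging filter that matches entries containing the message ID
# shown above. The log name substring is an assumption.
no_scale_up_filter() {
  printf '%s' 'logName:"cluster-autoscaler-visibility" AND "no.scale.up.nap.pod.zonal.resources.exceeded"'
}

# Read up to 10 matching entries from a project as JSON.
find_zonal_resource_failures() {
  project_id="$1"
  gcloud logging read "$(no_scale_up_filter)" \
    --project "$project_id" --limit 10 --format json
}
```

For example, `find_zonal_resource_failures PROJECT_ID` would print any recent matching entries for the project.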

If you encounter this error, confirm the following:

Pods stuck in Pending state

A Pod can get stuck in the Pending state if you select a specific node for the Pod, but the sum of the resource requests of the Pod and of the DaemonSets that must run on that node exceeds the node's maximum allocatable capacity. In that case, the Pod remains unscheduled.
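
The capacity check described above is simple arithmetic. The following sketch illustrates it with a hypothetical helper named fits_on_node. All three values must be integers in the same unit, such as millicores of CPU, and the numbers in the usage example are made up for illustration, not real Autopilot limits.

```shell
# Succeeds if the Pod's request plus the DaemonSet requests fit within the
# node's allocatable capacity. All values are integers in the same unit
# (for example, millicores of CPU or MiB of memory).
fits_on_node() {
  pod_request="$1"
  daemonset_requests="$2"
  node_allocatable="$3"
  [ $((pod_request + daemonset_requests)) -le "$node_allocatable" ]
}
```

For example, `fits_on_node 3800 400 3920` fails because 3800 + 400 = 4200 exceeds 3920, which corresponds to a Pod that would stay Pending, while `fits_on_node 3000 400 3920` succeeds.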

To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.
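
One way to review workload sizes is to list the CPU and memory requests of each Pod. The following is a minimal sketch that assumes kubectl is configured for your cluster. The function name list_pod_requests and the `default` namespace fallback are illustrative.

```shell
# Print each Pod's name with the CPU and memory requests of its containers.
# Containers without a request show <none>.
list_pod_requests() {
  namespace="${1:-default}"
  kubectl get pods --namespace "$namespace" \
    -o custom-columns='NAME:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu,MEMORY:.spec.containers[*].resources.requests.memory'
}
```

You can then compare the listed requests against the supported maximum resource requests for Autopilot.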

You can also try scheduling your DaemonSets before you schedule your regular workload Pods.