This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.
Cluster issues
Cannot create a cluster: 0 nodes registered
The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:
All cluster resources were brought up, but: only 0 nodes out of 2 have registered.
To resolve the issue, do the following:
Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:
gcloud iam service-accounts describe SERVICE_ACCOUNT
Replace SERVICE_ACCOUNT with the email address of the service account, such as my-iam-account@my-first-project.iam.gserviceaccount.com.
If the service account is disabled, the output is similar to the following:
disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
...
If the service account is disabled, enable it:
gcloud iam service-accounts enable SERVICE_ACCOUNT
If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SERVICE_ACCOUNT" \
--role roles/container.nodeServiceAccount
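To confirm that the service account now holds the role, you can list the principals bound to it on the project. This is a hedged example; PROJECT_ID and SERVICE_ACCOUNT are the same placeholders as in the preceding command:
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.role=roles/container.nodeServiceAccount" \
    --format="value(bindings.members)"
The output should include serviceAccount:SERVICE_ACCOUNT.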
Scaling issues
Node scale up failed: Pod is at risk of not being scheduled
The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.
The error message in your Kubernetes event log is similar to the following:
LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning FailedScaleUp pod/pod-test-5b97f7c978-h9lvl Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
Serial port logging might be disabled at the organization level through an
organization policy that enforces the compute.disableSerialPortLogging
constraint. Serial port logging could also be disabled at the project or virtual
machine (VM) instance level.
To resolve this issue, do the following:
- Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
- If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata, as shown in the example after this list. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
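For example, you can first check whether the constraint is enforced on your project and then enable serial port logging by setting the serial-port-logging-enable key in the project metadata. This is a sketch rather than the only way to change the setting; PROJECT_ID is a placeholder:
gcloud resource-manager org-policies describe \
    compute.disableSerialPortLogging --project PROJECT_ID
gcloud compute project-info add-metadata \
    --project PROJECT_ID \
    --metadata serial-port-logging-enable=true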
Nodes fail to scale up: Pod zonal resources exceeded
The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.
The error message in your logs is similar to the following:
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
...
This error refers to a noScaleUp event, where node auto-provisioning did not provision any node group for the Pod in the zone.
If you encounter this error, confirm the following, as shown in the example checks after this list:
- Your Pods have sufficient memory and CPU.
- The Pod IP address CIDR range is large enough to support your anticipated maximum cluster size.
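For example, you can inspect the Pod's resource requests and the Pod IPv4 CIDR block of a VPC-native cluster. These commands are a sketch; POD_NAME, CLUSTER_NAME, and LOCATION are placeholders:
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].resources.requests}'
gcloud container clusters describe CLUSTER_NAME \
    --location LOCATION \
    --format "value(ipAllocationPolicy.clusterIpv4CidrBlock)"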
Workload issues
Pods stuck in Pending state
A Pod might get stuck in the Pending status if you select a specific node for your Pod to use, but the sum of the resource requests in the Pod and in the DaemonSets that must run on the node exceeds the maximum allocatable capacity of the node. As a result, the Pod remains unscheduled.
To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.
You can also try scheduling your DaemonSets before you schedule your regular workload Pods.
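For example, to compare a node's allocatable capacity with the resources already requested on it, and to list the requests of the Pods in a namespace, you can run commands like the following. This is a sketch; NODE_NAME and NAMESPACE are placeholders, and the custom-columns output lists one value per container:
kubectl describe node NODE_NAME
kubectl get pods -n NAMESPACE \
    -o custom-columns="NAME:.metadata.name,CPU_REQUESTS:.spec.containers[*].resources.requests.cpu,MEMORY_REQUESTS:.spec.containers[*].resources.requests.memory"
In the node description, compare the Allocatable section with the Allocated resources section to see how much capacity remains for your Pod.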
Consistently unreliable workload performance on a specific node
In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by cordoning and draining it with the following command:
kubectl drain NODE_NAME --ignore-daemonsets
Replace NODE_NAME with the name of the problematic node. You can find the node name by running kubectl get nodes.
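To confirm that the node is cordoned, you can check its status; a cordoned node shows SchedulingDisabled in the STATUS column. Because Autopilot replaces the node shortly afterwards, this state might be brief:
kubectl get nodes NODE_NAME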
GKE does the following:
- Evicts existing workloads from the node and stops scheduling workloads on that node.
- Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
- Terminates any workloads that remain on the node and repairs or recreates the node over time.
- If you use Autopilot, GKE shuts down and replaces the node immediately and ignores any configured PodDisruptionBudgets.