This page shows you how to resolve errors with your deployed workloads in Google Kubernetes Engine (GKE).
For more general advice about troubleshooting your applications, see Troubleshooting Applications in the Kubernetes documentation.
All errors: Check Pod status
If there are issues with a workload's Pods, Kubernetes updates the Pod status
with an error message. View these errors by checking the status of a Pod using
the Google Cloud console or the kubectl
command-line tool.
Console
Perform the following steps:
In the Google Cloud console, go to the Workloads page.
Select the workload that you want to investigate. The Overview tab displays the status of the workload.
From the Managed Pods section, click any error status message.
kubectl
To see all Pods running in your cluster, run the following command:
kubectl get pods
The output is similar to the following:
NAME READY STATUS RESTARTS AGE
POD_NAME 0/1 CrashLoopBackOff 23 8d
Potential errors are listed in the Status
column.
To get more detailed information about a specific Pod, run the following command:
kubectl describe pod POD_NAME
Replace POD_NAME
with the name of the Pod that you
want to investigate.
In the output, the Events
field shows more information about errors.
If you'd like more information, view the container logs:
kubectl logs POD_NAME
These logs can help you identify if a command or code in the container caused the Pod to crash.
After you identify the error, use the following sections to try to resolve the issue.
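If a namespace contains many Pods, it can help to filter the listing down to the Pods that need attention. The following is a minimal sketch, not an official tool: list_unhealthy_pods is a hypothetical helper name, and it assumes the default kubectl get pods column layout shown earlier, where STATUS is the third column.

```shell
# Hypothetical helper: print the name and status of every Pod whose STATUS
# column is neither Running nor Completed. It reads `kubectl get pods`
# output on stdin, so it also works on saved output.
list_unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 "\t" $3 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods | list_unhealthy_pods
```

The helper skips the header row and prints one tab-separated line per Pod, which you can then pass to kubectl describe pod.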
Error: CrashLoopBackOff
A status of CrashLoopBackOff doesn't indicate a specific error. Instead, it
indicates that a container is repeatedly crashing after restarting.
When a container crashes or exits shortly after starting (CrashLoop), Kubernetes
attempts to restart the container. With each failed restart, the delay (BackOff)
before the next attempt increases exponentially (10s, 20s, 40s, and so on), up to
a maximum of five minutes.
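To make the back-off schedule concrete, the following sketch prints the delay before each restart attempt. It only illustrates the doubling and the five-minute cap described above. backoff_delays is a hypothetical name, and the function doesn't query Kubernetes.

```shell
# Print the delay, in seconds, before each of the first N restart attempts:
# start at 10s, double after every failed restart, and cap at 300s.
# Written for POSIX sh, so the variables are not function-local.
backoff_delays() {
  attempts="$1"
  delay=10
  i=1
  while [ "$i" -le "$attempts" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
}

# backoff_delays 7 prints 10, 20, 40, 80, 160, 300, 300 (one value per line).
```

After the sixth attempt, the delay stays at 300 seconds for every further failed restart.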
The following sections help you identify why your container might be crashing.
Use the Crashlooping Pods interactive playbook
Begin troubleshooting what's causing a CrashLoopBackOff
status by using the
interactive playbook in the Google Cloud console:
Go to the Crashlooping Pods interactive playbook:
In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.
In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.
Work through each of the sections to help you identify the cause:
- Identify Application Errors
- Investigate Out Of Memory Issues
- Investigate Node Disruptions
- Investigate Liveness Probe Failures
- Correlate Change Events
Optional: To get notifications about future
CrashLoopBackOff
errors, in the Future Mitigation Tips section, select Create an Alert.
Inspect logs
A container might crash for many reasons, and checking a Pod's logs can aid you in troubleshooting the root cause.
You can check the logs with the Google Cloud console or the kubectl
command-line tool.
Console
Perform the following steps:
Go to the Workloads page in the Google Cloud console.
Select the workload that you want to investigate. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Logs tab.
kubectl
View all Pods running in your cluster:
kubectl get pods
In the output of the preceding command, look for a Pod with the CrashLoopBackOff error in the Status column.
Get the Pod's logs:
kubectl logs POD_NAME
Replace POD_NAME with the name of the problematic Pod.
You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.
Check the exit code of the crashed container
To better understand why your container crashed, find the exit code:
Describe the Pod:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the problematic Pod.
Review the value in the containers: CONTAINER_NAME: last state: exit code field:
- If the exit code is 1, the container crashed because the application crashed.
- If the exit code is 0, check how long your app was running. Containers
exit when your application's main process exits. If your app finishes
execution very quickly, the container might continue to restart. If
you experience this error, one solution is to set the restartPolicy
field to OnFailure. After you make this change, the app only restarts when the exit code isn't 0.
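Instead of reading the exit code from the describe output, you can read it from the Pod's status with a JSONPath query. The following sketch assumes that the container has terminated at least once, so that lastState.terminated is populated. Both function names are hypothetical, and the fallback branch for codes other than 0 and 1 is an illustrative assumption, not guidance from this page.

```shell
# Print each container's name and the exit code of its previous termination.
# Requires kubectl access to the cluster.
last_exit_codes() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: map an exit code to the guidance in this section.
explain_exit_code() {
  case "$1" in
    0) echo "main process exited normally; check how long the app ran" ;;
    1) echo "application crashed; check the container logs" ;;
    *) echo "other exit code; check the container logs and Pod events" ;;
  esac
}

# Usage:
#   last_exit_codes POD_NAME
#   explain_exit_code 1
```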
Connect to a running container
To run bash commands from the container so that you can test the network or check if you have access to files or databases used by your application, open a shell to the Pod:
kubectl exec -it POD_NAME -- /bin/bash
If there's more than one container in your Pod, add -c CONTAINER_NAME.
Errors: ImagePullBackOff and ErrImagePull
A status of ImagePullBackOff
or ErrImagePull
indicates that the image used
by a container cannot be loaded from the image registry.
You can verify this issue using the Google Cloud console or the kubectl
command-line tool.
Console
Perform the following steps:
In the Google Cloud console, go to the Workloads page.
Select the workload that you want to investigate. The Overview tab displays the status of the workload.
From the Managed Pods section, click the problematic Pod.
From the Pod's menu, click the Events tab.
kubectl
To get more information about a Pod's container image, run the following command:
kubectl describe pod POD_NAME
Issue: The image isn't found
If your image is not found, complete the following steps:
- Verify that the name of the image is correct.
- Verify that the tag for the image is correct. (Try :latest or no tag to pull the latest image.)
- If the image has a full registry path, verify that it exists in the Docker registry that you are using. If you provide only the image name, check the Docker Hub registry.
In GKE Standard clusters, try to pull the Docker image manually:
Use SSH to connect to the node:
For example, to use SSH to connect to a VM, run the following command:
gcloud compute ssh VM_NAME --zone=ZONE_NAME
Replace the following:
VM_NAME: the name of the VM.
ZONE_NAME: a Compute Engine zone.
Generate a config file at /home/[USER]/.docker/config.json:
docker-credential-gcr configure-docker
Ensure that the config file at /home/[USER]/.docker/config.json includes the registry of the image in the credHelpers field. For example, the following file includes authentication information for images hosted at asia.gcr.io, eu.gcr.io, gcr.io, marketplace.gcr.io, and us.gcr.io:
{
  "auths": {},
  "credHelpers": {
    "asia.gcr.io": "gcr",
    "eu.gcr.io": "gcr",
    "gcr.io": "gcr",
    "marketplace.gcr.io": "gcr",
    "us.gcr.io": "gcr"
  }
}
Try to pull the image:
docker pull IMAGE_NAME
If pulling the image manually works, you probably need to specify
ImagePullSecrets
on a Pod. Pods can only reference image pull Secrets in their own namespace, so this process needs to be done one time per namespace.
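As a rough illustration of referencing an image pull Secret from a Pod, the following sketch prints a minimal manifest with an imagePullSecrets entry. The function name, the container name app, and the argument order are hypothetical. The sketch assumes that a Secret of type docker-registry already exists in the Pod's namespace.

```shell
# Hypothetical helper: print a minimal Pod manifest that references an image
# pull Secret. Arguments: Pod name, image, Secret name.
render_pod_with_pull_secret() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: app
    image: $2
  imagePullSecrets:
  - name: $3
EOF
}

# Usage (requires kubectl access; run once per namespace):
#   kubectl create secret docker-registry SECRET_NAME \
#     --namespace=NAMESPACE \
#     --docker-server=REGISTRY_SERVER \
#     --docker-username=USERNAME \
#     --docker-password=PASSWORD
#   render_pod_with_pull_secret POD_NAME IMAGE_NAME SECRET_NAME \
#     | kubectl apply --namespace=NAMESPACE -f -
```

For workloads that a controller manages, such as a Deployment, the same imagePullSecrets field belongs in the Pod template rather than in a standalone Pod.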
Error: Permission denied
If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and have access to the image. Try one of the following methods depending on the registry in which you host your images.
Artifact Registry
If your image is in Artifact Registry, your node pool's service account needs read access to the repository that contains the image.
Grant the
artifactregistry.reader
role
to the service account:
gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
--location=REPOSITORY_LOCATION \
--member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
--role="roles/artifactregistry.reader"
Replace the following:
REPOSITORY_NAME: the name of your Artifact Registry repository.
REPOSITORY_LOCATION: the region of your Artifact Registry repository.
SERVICE_ACCOUNT_EMAIL: the email address of the IAM service account associated with your node pool.
Container Registry
If your image is in Container Registry, your node pool's service account needs read access to the Cloud Storage bucket that contains the image.
Grant the roles/storage.objectViewer
role
to the service account so that it can read from the bucket:
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
--role=roles/storage.objectViewer
Replace the following:
SERVICE_ACCOUNT_EMAIL: the email of the service account associated with your node pool. You can list all the service accounts in your project using gcloud iam service-accounts list.
BUCKET_NAME: the name of the Cloud Storage bucket that contains your images. You can list all the buckets in your project using gcloud storage ls.
If your registry administrator set up
gcr.io repositories in Artifact Registry
to store images for the gcr.io
domain instead of Container Registry, you must
grant read access to Artifact Registry instead of Container Registry.
Private registry
If your image is in a private registry, you might require keys to access the images. For more information, see Using private registries in the Kubernetes documentation.
Error 401 Unauthorized: Cannot pull images from private container registry repository
An error similar to the following might occur when you pull an image from a private Container Registry repository:
gcr.io/PROJECT_ID/IMAGE:TAG: rpc error: code = Unknown desc = failed to pull and
unpack image gcr.io/PROJECT_ID/IMAGE:TAG: failed to resolve reference
gcr.io/PROJECT_ID/IMAGE:TAG: unexpected status code [manifests 1.0]: 401 Unauthorized
Warning Failed 3m39s (x4 over 5m12s) kubelet Error: ErrImagePull
Warning Failed 3m9s (x6 over 5m12s) kubelet Error: ImagePullBackOff
Normal BackOff 2s (x18 over 5m12s) kubelet Back-off pulling image
To resolve the error, complete the following steps:
Identify the node running the Pod:
kubectl describe pod POD_NAME | grep "Node:"
Verify that the node you identified in the previous step has the storage scope:
gcloud compute instances describe NODE_NAME \
--zone=COMPUTE_ZONE \
--format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following scopes:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
If the node doesn't contain one of these scopes, recreate the node pool.
Recreate the node pool that the node belongs to with sufficient scope. You cannot modify the scopes of existing nodes; you must recreate the nodes with the correct scope.
Recommended: Create a new node pool with the gke-default scope:
gcloud container node-pools create NODE_POOL_NAME \
--cluster=CLUSTER_NAME \
--zone=COMPUTE_ZONE \
--scopes="gke-default"
Create a new node pool with only storage scope:
gcloud container node-pools create NODE_POOL_NAME \
--cluster=CLUSTER_NAME \
--zone=COMPUTE_ZONE \
--scopes="https://www.googleapis.com/auth/devstorage.read_only"
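If you need to check several nodes, the scope verification from the earlier step can be scripted. The following sketch reads the flattened scopes output and reports whether it lists one of the two scopes named in that step. has_image_pull_scope is a hypothetical helper, and it deliberately checks only those two scopes.

```shell
# Hypothetical helper: read the flattened scopes output on stdin and return
# exit status 0 if it lists devstorage.read_only or cloud-platform.
has_image_pull_scope() {
  grep -Eq 'auth/(devstorage\.read_only|cloud-platform)([[:space:]]|$)'
}

# Usage (requires gcloud access):
#   gcloud compute instances describe NODE_NAME \
#     --zone=COMPUTE_ZONE \
#     --format="flattened(serviceAccounts[].scopes)" \
#     | has_image_pull_scope && echo "scope OK" || echo "scope missing"
```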
Error: Pod unschedulable
A status of PodUnschedulable
indicates that your Pod cannot be scheduled
because of insufficient resources or some configuration error.
If you have configured control plane metrics, you can find more information about these errors in scheduler metrics and API server metrics.
Use the unschedulable Pods interactive playbook
You can troubleshoot PodUnschedulable
errors using the interactive playbook
in the Google Cloud console:
Go to the unschedulable Pods interactive playbook:
In the Cluster drop-down list, select the cluster that you want to troubleshoot. If you can't find your cluster, enter the name of the cluster in the Filter field.
In the Namespace drop-down list, select the namespace that you want to troubleshoot. If you can't find your namespace, enter the namespace in the Filter field.
To help you identify the cause, work through each of the sections in the playbook:
- Investigate CPU and Memory
- Investigate Max Pods per Node
- Investigate Autoscaler Behavior
- Investigate Other Failure Modes
- Correlate Change Events
Optional: To get notifications about future
PodUnschedulable
errors, in the Future Mitigation Tips section, select Create an Alert.
Error: Insufficient resources
You might encounter an error indicating a lack of CPU, memory, or another
resource. For example, the message No nodes are available that match all of the
predicates: Insufficient cpu (2) indicates that, on two nodes, there isn't enough
CPU available to fulfill a Pod's requests.
If your Pod's resource requests exceed what a single node in any eligible node pool can provide, GKE doesn't schedule the Pod and doesn't trigger a scale-up to add a new node. For GKE to schedule the Pod, you must either request fewer resources for the Pod or create a new node pool with sufficient resources.
You can also enable node auto-provisioning so that GKE can automatically create node pools with nodes where the unscheduled Pods can run.
The default CPU request is 100m, or 10% of a CPU (one core).
If you want to request more or fewer resources, specify the value in the Pod
specification under spec: containers: resources: requests.
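CPU requests can be written either in millicores (100m) or in cores (0.1), which makes them awkward to compare by eye. The following sketch normalizes a CPU quantity to millicores. cpu_to_millicores is a hypothetical helper that handles only the plain m suffix and decimal core values.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity such as "100m",
# "0.5", or "2" to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 + 0.5 }' ;;
  esac
}

# cpu_to_millicores 100m prints 100
# cpu_to_millicores 0.1  prints 100
# cpu_to_millicores 2    prints 2000
```

With both values in millicores, you can compare a Pod's request with the allocatable CPU that kubectl describe nodes reports for a node.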
Error: MatchNodeSelector
MatchNodeSelector
indicates that there are no nodes that match the Pod's
label selector.
To verify this, check the labels specified in the Pod specification's
nodeSelector field, under spec: nodeSelector.
To see how nodes in your cluster are labeled, run the following command:
kubectl get nodes --show-labels
To attach a label to a node, run the following command:
kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE
Replace the following:
NODE_NAME: the node that you want to add a label to.
LABEL_KEY: the label's key.
LABEL_VALUE: the label's value.
For more information, refer to Assigning Pods to Nodes in the Kubernetes documentation.
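To check whether any node carries the label that a nodeSelector expects, you can filter the --show-labels listing. The following sketch assumes that the labels appear as a comma-separated list in the last column, as in the default output. nodes_with_label is a hypothetical helper. Against a live cluster, kubectl get nodes -l LABEL_KEY=LABEL_VALUE gives an equivalent result.

```shell
# Hypothetical helper: read `kubectl get nodes --show-labels --no-headers`
# output on stdin and print the names of nodes that have the exact
# KEY=VALUE label passed as the first argument.
nodes_with_label() {
  awk -v want="$1" '{
    n = split($NF, labels, ",")
    for (i = 1; i <= n; i++) {
      if (labels[i] == want) { print $1; break }
    }
  }'
}

# Usage:
#   kubectl get nodes --show-labels --no-headers | nodes_with_label LABEL_KEY=LABEL_VALUE
```

If the helper prints nothing, no node matches that label, which is consistent with a MatchNodeSelector error.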
Error: PodToleratesNodeTaints
PodToleratesNodeTaints
indicates that the Pod can't be scheduled to any node
because the Pod doesn't have tolerations that correspond to existing
node taints.
To verify that this is the case, run the following command:
kubectl describe nodes NODE_NAME
In the output, check the Taints
field, which lists key-value pairs and
scheduling effects.
If the effect listed is NoSchedule, then no Pod can be scheduled on that node
unless it has a matching toleration.
One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:
kubectl taint nodes NODE_NAME key:NoSchedule-
Error: PodFitsHostPorts
PodFitsHostPorts indicates that a host port that the Pod is attempting to use
is already in use on the node.
To resolve this issue, check the Pod specification's hostPort value under
spec: containers: ports: hostPort. You might need to change this value to
another port.
Error: Does not have minimum availability
If a node has adequate resources but you still see the Does not have minimum availability
message, check the node's status. If the status is SchedulingDisabled or
Cordoned, the node cannot schedule new Pods. You can check the status of a
node using the Google Cloud console or the kubectl command-line tool.
Console
Perform the following steps:
Go to the Google Kubernetes Engine page in the Google Cloud console.
Select the cluster that you want to investigate. The Nodes tab displays the Nodes and their status.
To enable scheduling on the node, perform the following steps:
From the list, click the node that you want to investigate.
From the Node Details section, click Uncordon.
kubectl
To get statuses of your nodes, run the following command:
kubectl get nodes
To enable scheduling on the node, run:
kubectl uncordon NODE_NAME
Error: Maximum Pods per node limit reached
If the Maximum Pods per node limit is reached by all nodes in the cluster, the
Pods are stuck in the Unschedulable state. Under the Pod Events tab, you see a
message that includes the phrase Too many pods.
To resolve this error, complete the following steps:
Check the Maximum pods per node configuration from the Nodes tab in GKE cluster details in the Google Cloud console.
Get a list of nodes:
kubectl get nodes
For each node, verify the number of Pods running on the node:
kubectl get pods -o wide | grep NODE_NAME | wc -l
If the limit is reached, add a new node pool or add additional nodes to the existing node pool.
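The per-node count in the earlier step covers only the current namespace unless you pass --all-namespaces. As a rough alternative, the following sketch tallies Pods for every node in one pass. count_pods_per_node is a hypothetical helper that reads one node name per line. The count includes every Pod that the API lists for a node, so treat it as approximate.

```shell
# Hypothetical helper: read one node name per line on stdin and print
# "NODE COUNT" for each distinct node, sorted by node name.
count_pods_per_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (requires kubectl access):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | count_pods_per_node
```

Pods that are not yet scheduled appear under a placeholder such as <none> instead of a node name.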
Issue: Maximum node pool size reached with cluster autoscaler enabled
If the node pool has reached its maximum size according to its cluster autoscaler configuration, GKE does not trigger scale up for the Pod that would otherwise be scheduled with this node pool. If you want the Pod to be scheduled with this node pool, change the cluster autoscaler configuration.
Issue: Maximum node pool size reached with cluster autoscaler disabled
If the node pool has reached its maximum number of nodes, and cluster autoscaler is disabled, GKE cannot schedule the Pod with the node pool. Increase the size of your node pool or enable cluster autoscaler for GKE to resize your cluster automatically.
Error: Unbound PersistentVolumeClaims
Unbound PersistentVolumeClaims
indicates that the Pod references a
PersistentVolumeClaim that is not bound. This error might happen if your
PersistentVolume failed to provision. You can verify that provisioning failed by
getting the events for your PersistentVolumeClaim and examining them for
failures.
To get events, run the following command:
kubectl describe pvc STATEFULSET_NAME-PVC_NAME-0
Replace the following:
STATEFULSET_NAME: the name of the StatefulSet object.
PVC_NAME: the name of the PersistentVolumeClaim object.
This can also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim.
To resolve this error, try to pre-provision the volume again.
Error: Insufficient quota
Verify that your project has sufficient Compute Engine quota for
GKE to scale up your cluster. If GKE attempts to
add a node to your cluster to schedule the Pod, and scaling up would exceed your
project's available quota, you receive the scale.up.error.quota.exceeded
error
message.
To learn more, see ScaleUp errors.
Issue: Deprecated APIs
Ensure that you are not using deprecated APIs that are removed with your cluster's minor version. To learn more, see GKE deprecations.
What's next
If you need additional assistance, reach out to
Cloud Customer Care.