Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems using Google Kubernetes Engine.

Debugging Kubernetes resources

If you are experiencing an issue related to your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.

If you are having an issue with your application, its Pods, or its controller object, refer to Troubleshooting Applications.

The kubectl command isn't found

First, install the kubectl binary by running the following command:

sudo gcloud components update kubectl

Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable enables you to use kubectl commands without typing their full file path.

Alternatively, add the following line to ~/.bashrc (or ~/.bash_profile on macOS, or wherever your shell stores environment variables):

export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/

Finally, run the following command to load your updated .bashrc (or .bash_profile) file:

source ~/.bashrc

kubectl commands return "connection refused" error

Set the cluster context with the following command:

gcloud container clusters get-credentials CLUSTER_NAME

If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:

gcloud container clusters list

kubectl commands return "failed to negotiate an api version" error

Ensure kubectl has authentication credentials:

gcloud auth application-default login

The kubectl logs, attach, exec, and port-forward commands hang

These commands rely on the cluster's master being able to talk to the nodes in the cluster. However, because the master isn't in the same Compute Engine network as your cluster's nodes, we rely on SSH tunnels to enable secure communication.

GKE saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. GKE also adds a firewall rule to your Compute Engine network allowing SSH access from the master's IP address to each node in the cluster.

If any of the above kubectl commands hangs, it's likely that the master is unable to open SSH tunnels with the nodes. Check for these potential causes:

  1. The cluster doesn't have any nodes.

    If you've scaled down the number of nodes in your cluster to zero, SSH tunnels won't work.

    To fix it, resize your cluster to have at least one node.

  2. Pods in the cluster have gotten stuck in a terminating state and have prevented nodes that no longer exist from being removed from the cluster.

    This is an issue that should only affect Kubernetes version 1.1, but could be caused by repeated resizing of the cluster.

    To fix it, delete the Pods that have been in a terminating state for more than a few minutes. The old nodes are then removed from the master's API and replaced by the new nodes.

  3. Your network's firewall rules don't allow for SSH access to the master.

    All Compute Engine networks are created with a firewall rule called "default-allow-ssh" that allows SSH access from all IP addresses (requiring a valid private key, of course). GKE also inserts an SSH rule for each cluster of the form gke-<cluster_name>-<random-characters>-ssh that allows SSH access specifically from the cluster's master IP to the cluster's nodes. If neither of these rules exists, then the master will be unable to open SSH tunnels.

    To fix it, re-add a firewall rule that allows access from the master's IP address to VMs with the tag that's on all the cluster's nodes (see the example command after this list).

  4. Your project's common metadata entry for "ssh-keys" is full.

    If the project's metadata entry named "ssh-keys" is close to the 32KiB size limit, then GKE isn't able to add its own SSH key to enable it to open SSH tunnels. You can see your project's metadata by running gcloud compute project-info describe [--project=PROJECT], then check the length of the list of ssh-keys.

    To fix it, delete some of the SSH keys that are no longer needed.

  5. You have set a metadata field with the key "ssh-keys" on the VMs in the cluster.

    The node agent on VMs prefers per-instance ssh-keys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the master's SSH key in the project metadata won't be respected by the nodes. To check, run gcloud compute instances describe <VM-name> and look for an "ssh-keys" field in the metadata.

    To fix it, delete the per-instance SSH keys from the instance metadata.
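
For cause 3, a sketch of how you might check for and re-create the SSH firewall rule (the rule name, node tag, and master IP below are placeholders; the master's IP address is the endpoint field of the cluster):

gcloud compute firewall-rules list --filter="name~gke-.*-ssh"
gcloud container clusters describe [CLUSTER_NAME] --format="get(endpoint)"
gcloud compute firewall-rules create "[CLUSTER_NAME]-allow-master-ssh" \
  --network="[NETWORK]" \
  --source-ranges="[MASTER_IP]/32" \
  --target-tags="[NODE_TAG]" \
  --allow=tcp:22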

It's worth noting that these features are not required for the correct functioning of the cluster. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.

Metrics from your cluster aren't showing up in Stackdriver

Ensure that you have activated the Stackdriver Monitoring API and the Stackdriver Logging API on your project, and that you are able to view your project in Stackdriver.

If the issue persists, check the following potential causes:

  1. Ensure that you have enabled monitoring on your cluster.

    Monitoring is enabled by default for clusters created from GCP Console and the gcloud command-line tool, but you can verify this by running the following command or by clicking into the cluster's details in GCP Console:

    gcloud container clusters describe [CLUSTER_NAME]
    

    The output from the gcloud command-line tool should state that "monitoringService" is "monitoring.googleapis.com", and Cloud Monitoring should be shown as enabled in GCP Console.

    If monitoring is not enabled, run the following command to enable it:

    gcloud container clusters update [CLUSTER_NAME] --monitoring-service=monitoring.googleapis.com
    
  2. How long has it been since your cluster was created or had monitoring enabled?

    It can take up to an hour for a new cluster's metrics to start appearing in Stackdriver Monitoring.

  3. Is a Heapster Pod running in your cluster in the "kube-system" namespace?

    It's possible that this Pod is failing to schedule because your cluster is too full. Check whether it's running by calling kubectl get pods --namespace=kube-system.

  4. Is your cluster's master able to communicate with the nodes?

    Stackdriver Monitoring relies on this communication. You can check whether this is the case by running kubectl logs [POD_NAME]. If this command returns an error, the SSH tunnels may be causing the issue; see the section above.
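
For example, checks 3 and 4 can be combined by listing the kube-system Pods and then fetching logs from one of them (a sketch; the grep filter and Pod name placeholder are illustrative):

kubectl get pods --namespace=kube-system | grep heapster
kubectl logs --namespace=kube-system [HEAPSTER_POD_NAME]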

If you are having an issue related to the Stackdriver Logging agent, see its troubleshooting documentation.

For more information, refer to the Stackdriver documentation.

Error 404: Resource "not found" when calling gcloud container commands

Re-authenticate to the gcloud command-line tool:

gcloud auth login

Error 400/403: Missing edit permissions on account

Your Compute Engine and/or Kubernetes Engine service account has been deleted or edited.

When you enable the Compute Engine or Kubernetes Engine API, a service account is created and given edit permissions on your project. If at any point you edit the permissions, remove the account entirely, or disable the API, cluster creation and all management functionality will fail.

To resolve the issue, re-enable the Kubernetes Engine API, which correctly restores your service accounts and permissions.

  1. Visit the APIs & Services page.
  2. Select your project.
  3. Click __Enable APIs and Services__.
  4. Search for Kubernetes Engine, then select the API from the search results.
  5. Click __Enable__. If you have previously enabled the API, you must first disable it and then enable it again. It can take several minutes for the API and related services to be enabled.

Alternatively, use the gcloud command-line tool:

gcloud services enable container.googleapis.com

Replicating 1.8.x (and earlier) automatic firewall rules on 1.9.x and later

If your cluster runs Kubernetes version 1.9.x or later, the automatic firewall rules no longer allow workloads in a GKE cluster to initiate communication with other Compute Engine VMs that are outside the cluster but on the same network.

You can replicate the automatic firewall rules behavior of a 1.8.x (and earlier) cluster by performing the following steps:

First, find your cluster's network:

gcloud container clusters describe [CLUSTER_NAME] --format="get(network)"

Then get the cluster's IPv4 CIDR used for the containers:

gcloud container clusters describe [CLUSTER_NAME] --format="get(clusterIpv4Cidr)"

Finally, create a firewall rule for the network, with the CIDR as the source range, and allow all protocols:

gcloud compute firewall-rules create "[CLUSTER_NAME]-to-all-vms-on-network" --network="[NETWORK]" --source-ranges="[CLUSTER_IPV4_CIDR]" --allow=tcp,udp,icmp,esp,ah,sctp

Restore default service account to your GCP project

GKE's default service account, container-engine-robot, can accidentally become unbound from a project. GKE Service Agent is an IAM role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations.

You can check whether the service account has been removed from your project by running gcloud projects get-iam-policy [PROJECT_ID] or by visiting the IAM & admin menu in Google Cloud Platform Console. If the command or the dashboard does not display container-engine-robot among your service accounts, the service account has become unbound.

If you removed the GKE Service Agent role binding, run the following commands to restore the role binding:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format "value(projectNumber)")
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
  --role roles/container.serviceAgent

To confirm that the role binding was granted:

gcloud projects get-iam-policy $PROJECT_ID

If you see the service account name along with the container.serviceAgent role, the role binding has been granted. For example:

- members:
  - serviceAccount:service-1234567890@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent

Troubleshooting issues with deployed workloads

GKE returns an error if there are issues with a workload's Pods. You can check the status of a Pod using the kubectl command-line tool or Google Cloud Platform Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Output:

NAME            READY   STATUS              RESTARTS    AGE
[POD_NAME]      0/1     CrashLoopBackOff    23          8d

To get more detailed information about a specific Pod, run:

kubectl describe pod [POD_NAME]

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click on the error status message.

The following sections explain some common errors returned by workloads and how to resolve them.

CrashLoopBackOff

CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs might aid in troubleshooting the root cause.

By default, crashed containers restart with an exponential backoff delay that is capped at five minutes. You can change this behavior by setting the restartPolicy field in the Deployment's Pod specification, under spec: restartPolicy. The field's default value is Always.
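
For reference, a minimal sketch of where the field sits in a Pod specification (the names and image are illustrative; note that Pods managed by controllers such as Deployments only accept Always, so OnFailure and Never apply to bare Pods and Jobs):

apiVersion: v1
kind: Pod
metadata:
  name: example-app               # illustrative name
spec:
  restartPolicy: OnFailure        # Always (default), OnFailure, or Never
  containers:
  - name: app
    image: gcr.io/[PROJECT_ID]/example-app:1.0    # illustrative image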

You can find out why your Pod's container is crashing using the kubectl command-line tool or GCP Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Look for the Pod with the CrashLoopBackOff error.

To get the Pod's logs, run:

kubectl logs [POD_NAME]

where [POD_NAME] is the name of the problematic Pod.

You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.
  4. From the Pod's menu, click the Logs tab.

Check the exit code of the crashed container

You can find the exit code in the output of kubectl describe pod [POD_NAME], in the containers: [CONTAINER_NAME]: last state: exit code field.

  • If the exit code is 1, the container crashed because the application crashed.
  • If the exit code is 0, verify how long your app was running. Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart.
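
You can also read the exit code directly with a JSONPath query (a sketch; the container index assumes a single-container Pod):

kubectl get pod [POD_NAME] -o jsonpath="{.status.containerStatuses[0].lastState.terminated.exitCode}"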

Connect to a running container

Open a shell to the Pod:

kubectl exec -it [POD_NAME] -- /bin/bash

If there is more than one container in your Pod, add -c [CONTAINER_NAME].

Now you can run bash commands from inside the container: you can test the network or check whether you have access to files or databases used by your application.

ImagePullBackOff and ErrImagePull

ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.

You can verify this issue using GCP Console or the kubectl command-line tool.

kubectl

To get more information about a Pod's container image, run the following command:

kubectl describe pod [POD_NAME]

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.
  4. From the Pod's menu, click the Events tab.

If the image is not found

If your image is not found:

  1. Verify that the image's name is correct.
  2. Verify that the image's tag is correct. (Try :latest or no tag to pull the latest image).
  3. If the image has a full registry path, verify that it exists in the Docker registry you are using. If you provide only the image name, check the Docker Hub registry.
  4. Try to pull the docker image manually:

    • SSH into the node:
      For example, to SSH into example-instance in the us-central1-a zone:

      gcloud compute ssh example-instance --zone us-central1-a
      
    • Run docker pull [IMAGE_NAME].
      If this works, you probably need to specify ImagePullSecrets on the Pod (see the sketch at the end of this section). Pods can only reference image pull secrets in their own namespace, so this process needs to be done once per namespace.

If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and/or have access to the image.

If you are using a private registry, it may require keys to read images.
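
A minimal sketch of wiring up an image pull secret for a private registry (the secret name, registry, and credentials are placeholders; the secret must exist in the same namespace as the Pod):

kubectl create secret docker-registry [SECRET_NAME] \
  --docker-server=[REGISTRY_SERVER] \
  --docker-username=[USERNAME] \
  --docker-password=[PASSWORD] \
  --docker-email=[EMAIL]

Then reference the secret in the Pod specification under spec: imagePullSecrets:

spec:
  imagePullSecrets:
  - name: [SECRET_NAME]
  containers:
  - name: app
    image: [REGISTRY_SERVER]/[IMAGE_NAME]:[TAG]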

PodUnschedulable

PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.

Insufficient resources

You might encounter an error indicating a lack of CPU, memory, or another resource. For example: "No nodes are available that match all of the predicates: Insufficient cpu (2)", which indicates that on two nodes there isn't enough CPU available to fulfill a Pod's requests.

The default CPU request is 100m, or 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
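
For example, a sketch of explicit requests in a Pod specification (the container name and values are illustrative):

spec:
  containers:
  - name: app
    image: [IMAGE_NAME]
    resources:
      requests:
        cpu: 250m
        memory: 64Mi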

MatchNodeSelector

MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.

To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.

To see how nodes in your cluster are labelled, run the following command:

kubectl get nodes --show-labels

To attach a label to a node, run:

kubectl label nodes [NODE_NAME] [LABEL_KEY]=[LABEL_VALUE]
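
For example, after labelling a node with disktype=ssd, a Pod with the following nodeSelector is scheduled onto it (a sketch; the label key and value are illustrative):

spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: [IMAGE_NAME]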

For more information, refer to Assigning Pods to Nodes.

PodToleratesNodeTaints

PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod does not currently tolerate any node's taint.

To verify that this is the case, run the following command:

kubectl describe nodes [NODE_NAME]

In the output, check the Taints field, which lists key-value pairs and scheduling effects.

If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.

One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint:

kubectl taint nodes [NODE_NAME] key:NoSchedule-
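
Alternatively, add a matching toleration to the Pod specification (a sketch; the key, value, and effect are illustrative and must match the taint reported on the node):

spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
  - name: app
    image: [IMAGE_NAME]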

PodFitsHostPorts

PodFitsHostPorts indicates that a hostPort the Pod is attempting to use is already in use on the node.

To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.
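
For reference, the field lives here in the Pod specification (a sketch; the port numbers are illustrative):

spec:
  containers:
  - name: app
    image: [IMAGE_NAME]
    ports:
    - containerPort: 8080
      hostPort: 8080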

Does not have minimum availability

If your Nodes have enough resources but you still see the Does not have minimum availability message, check whether the Nodes have a SchedulingDisabled or Cordoned status: in that case, they don't accept new Pods.

kubectl

To get statuses of your Nodes, run the following command:

kubectl get nodes
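
Illustrative output for a cordoned Node (the name, age, and version are placeholders):

NAME          STATUS                     ROLES    AGE   VERSION
[NODE_NAME]   Ready,SchedulingDisabled   <none>   8d    [VERSION]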

To enable scheduling on the Node, run:

kubectl uncordon [NODE_NAME]

Console

Perform the following steps:

  1. Visit the GKE Clusters dashboard in GCP Console.

  2. Select the desired cluster. The Nodes tab displays the Nodes and their status.

To enable scheduling on the Node, perform the following steps:

  1. From the list, click the desired Node.

  2. From the Node Details page, click the Uncordon button.
