Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems using Google Kubernetes Engine (GKE).

Debugging Kubernetes resources

If you are experiencing an issue related to your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.

If you are having an issue with your application, its Pods, or its controller object, refer to Troubleshooting Applications.

If you are having an issue related to connectivity between Compute Engine VMs that are in the same Virtual Private Cloud (VPC) network or two VPC networks connected with VPC Network Peering, refer to Troubleshooting connectivity between virtual machine (VM) instances with internal IP addresses.

If you are experiencing packet loss when sending traffic from a cluster to an external IP address using Cloud NAT, VPC-native clusters, or the IP masquerade agent, see Troubleshooting Cloud NAT packet loss from a GKE cluster.

Troubleshooting issues with the kubectl command

The kubectl command isn't found

  1. Install the kubectl binary by running the following command:

    gcloud components update kubectl
    
  2. Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable enables you to use kubectl commands without typing their full file path.

    Alternatively, add the following line to ~/.bashrc (or ~/.bash_profile on macOS, or wherever your shell stores environment variables):

    export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/
    
  3. Run the following command to load your updated .bashrc (or .bash_profile) file:

    source ~/.bashrc
    

kubectl commands return "connection refused" error

Set the cluster context with the following command:

gcloud container clusters get-credentials CLUSTER_NAME

If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:

gcloud container clusters list

kubectl command times out

After creating a cluster, attempting to run the kubectl command against the cluster returns an error, such as Unable to connect to the server: dial tcp IP_ADDRESS: connect: connection timed out or Unable to connect to the server: dial tcp IP_ADDRESS: i/o timeout.

This can occur when kubectl is unable to talk to the cluster control plane.

To resolve this issue, set the cluster context using the following command:

gcloud container clusters get-credentials CLUSTER_NAME [--region=REGION | --zone=ZONE]

kubectl commands return "failed to negotiate an api version" error

Ensure kubectl has authentication credentials:

gcloud auth application-default login

The kubectl logs, attach, exec, and port-forward commands stop responding

These commands rely on the cluster's control plane (master) being able to talk to the nodes in the cluster. However, because the control plane isn't in the same Compute Engine network as your cluster's nodes, we rely on SSH tunnels to enable secure communication.

GKE saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. GKE also adds a firewall rule to your Compute Engine network allowing SSH access from the control plane's IP address to each node in the cluster.

If any of the above kubectl commands don't run, it's likely that the API server is unable to open SSH tunnels with the nodes. Check for these potential causes:

  1. The cluster doesn't have any nodes.

    If you've scaled down the number of nodes in your cluster to zero, SSH tunnels won't work.

    To fix it, resize your cluster to have at least one node.

  2. Pods in the cluster are stuck in a terminating state, which prevents nodes that no longer exist from being removed from the cluster.

    This is an issue that should only affect Kubernetes version 1.1, but could be caused by repeated resizing of the cluster.

    To fix it, delete the Pods that have been in a terminating state for more than a few minutes. The old nodes are then removed from the control plane and replaced by the new nodes.

  3. Your network's firewall rules don't allow for SSH access to the control plane.

    All Compute Engine networks are created with a firewall rule called default-allow-ssh that allows SSH access from all IP addresses (requiring a valid private key, of course). GKE also inserts an SSH rule for each public cluster of the form gke-CLUSTER_NAME-RANDOM_CHARACTERS-ssh that allows SSH access specifically from the cluster's control plane to the cluster's nodes. If neither of these rules exists, then the control plane can't open SSH tunnels.

    To fix it, re-add a firewall rule that allows access from the control plane's IP address to VMs with the tag that's on all the cluster's nodes (see the example command after this list).

  4. Your project's common metadata entry for "ssh-keys" is full.

    If the project's metadata entry named "ssh-keys" is close to its maximum size limit, then GKE isn't able to add its own SSH key, and therefore can't open SSH tunnels. You can see your project's metadata by running the following command:

    gcloud compute project-info describe [--project=PROJECT_ID]
    

    And then check the length of the list of ssh-keys.

    To fix it, delete some of the SSH keys that are no longer needed.

  5. You have set a metadata field with the key "ssh-keys" on the VMs in the cluster.

    The node agent on VMs prefers per-instance ssh-keys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the control plane's SSH key in the project metadata won't be respected by the nodes. To check, run gcloud compute instances describe VM_NAME and look for an ssh-keys field in the metadata.

    To fix it, delete the per-instance SSH keys from the instance metadata.
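If the firewall rule described in cause 3 is missing, you can recreate it. The following is a minimal sketch, not the exact rule GKE would generate: the rule name, network name, node tag, and control plane IP address are placeholders that you substitute with values from your own cluster.

gcloud compute firewall-rules create gke-CLUSTER_NAME-ssh-restore \
    --network=NETWORK_NAME \
    --allow=tcp:22 \
    --source-ranges=CONTROL_PLANE_IP/32 \
    --target-tags=NODE_TAG

You can find the node tag on any of the cluster's Compute Engine instances, and the control plane's IP address is shown as the cluster endpoint in the cluster's details.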

It's worth noting that these features are not required for the correct functioning of the cluster. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.

Troubleshooting 4xx errors

Error 404: Resource "not found" when calling gcloud container commands

Re-authenticate to the gcloud command-line tool:

gcloud auth login

Error 400/403: Missing edit permissions on account

Your Compute Engine default service account, the Google APIs Service Agent, or the service account associated with GKE has been deleted or edited manually.

When you enable the Compute Engine or Kubernetes Engine API, the Compute Engine default service account and the Google APIs Service Agent are created and assigned edit permissions on your project, and the Google Kubernetes Engine service account is created and assigned the Kubernetes Engine Service Agent role on your project. If at any point you edit those permissions, remove the role bindings on the project, remove the service account entirely, or disable the API, cluster creation and all management functionality will fail.

The name of your Google Kubernetes Engine service account is as follows, where PROJECT_NUMBER is your project number:

service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com

The following command can be used to verify that the Google Kubernetes Engine service account has the Kubernetes Engine Service Agent role assigned on the project:

gcloud projects get-iam-policy PROJECT_ID

Replace PROJECT_ID with your project ID.

To resolve the issue, if you have removed the Kubernetes Engine Service Agent role from your Google Kubernetes Engine service account, add it back. Otherwise, you must re-enable the Kubernetes Engine API, which will correctly restore your service accounts and permissions. You can do this in the gcloud tool or the Cloud Console.

Console

  1. Go to the APIs & Services page in Cloud Console.

    Go to APIs & Services

  2. Select your project.

  3. Click Enable APIs and Services.

  4. Search for Kubernetes, then select the API from the search results.

  5. Click Enable. If you have previously enabled the API, you must first disable it and then enable it again. It can take several minutes for the API and related services to be enabled.

gcloud

Run the following command in the gcloud tool:

gcloud services enable container.googleapis.com

Error 400: Cannot attach RePD to an optimized VM

Regional persistent disks are restricted from being used with memory-optimized machines or compute-optimized machines.

Consider using a non-regional persistent disk storage class if using a regional persistent disk is not a hard requirement. If using a regional persistent disk is a hard requirement, consider scheduling strategies such as taints and tolerations to ensure that the Pods that need regional PD are scheduled on a node pool that does not use optimized machines.
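For example, one way to steer such Pods is a nodeSelector that targets the non-optimized node pool. The following is a minimal sketch; the Pod name, image, claim name, and the standard-pool node pool name are placeholders, and the PersistentVolumeClaim is assumed to use a regional PD StorageClass.

apiVersion: v1
kind: Pod
metadata:
  name: regional-pd-pod              # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: standard-pool   # non-optimized node pool (placeholder)
  containers:
  - name: app
    image: IMAGE_NAME
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: regional-pd-claim   # placeholder PVC backed by a regional PD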

Troubleshooting issues with GKE cluster creation

Error CONDITION_NOT_MET: Constraint constraints/compute.vmExternalIpAccess violated

You have the organization policy constraint constraints/compute.vmExternalIpAccess configured to Deny All or to restrict external IPs to specific VM instances at the organization, folder, or project level in which you are trying to create a public GKE cluster.

When you create public GKE clusters, the underlying Compute Engine VMs, which make up the worker nodes of this cluster, have external IP addresses assigned. If you configure the organization policy constraint constraints/compute.vmExternalIpAccess to Deny All or to restrict external IPs to specific VM instances, then the policy prevents the GKE worker nodes from obtaining external IP addresses, which results in cluster creation failure.

To find the logs of the cluster creation operation, you can review the GKE Cluster Operations Audit Logs using Logs Explorer with a search query similar to the following:

resource.type="gke_cluster"
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName="google.container.v1beta1.ClusterManager.CreateCluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.project_id="PROJECT_ID"

To resolve this issue, ensure that the effective policy for the constraint constraints/compute.vmExternalIpAccess is Allow All on the project where you are trying to create a GKE public cluster. See Restricting external IP addresses to specific VM instances for information on working with this constraint. After setting the constraint to Allow All, delete the failed cluster and create a new cluster. This is required because repairing the failed cluster is not possible.
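To check the effective policy before recreating the cluster, you can describe the constraint for the project. This is a sketch; the org-policies command surface can differ slightly depending on your gcloud version.

gcloud resource-manager org-policies describe compute.vmExternalIpAccess \
    --project=PROJECT_ID \
    --effective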

Troubleshooting issues with deployed workloads

GKE returns an error if there are issues with a workload's Pods. You can check the status of a Pod using the kubectl command-line tool or Cloud Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Output:

NAME       READY  STATUS             RESTARTS  AGE
POD_NAME   0/1    CrashLoopBackOff   23        8d

To get more detailed information about a specific Pod, run the following command:

kubectl describe pod POD_NAME

Replace POD_NAME with the name of the desired Pod.

Console

Perform the following steps:

  1. Go to the Workloads page in Cloud Console.

    Go to Workloads

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the error status message.

The following sections explain some common errors returned by workloads and how to resolve them.

CrashLoopBackOff

CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs might aid in troubleshooting the root cause.

By default, crashed containers restart with an exponential backoff delay that is capped at five minutes. You can change this behavior by setting the restartPolicy field in the Deployment's Pod specification, under spec: restartPolicy. The field's default value is Always.
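For reference, the following minimal sketch shows where the field sits in a Pod specification; the Pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod       # placeholder name
spec:
  restartPolicy: Always   # default; OnFailure and Never are the other accepted values
  containers:
  - name: app
    image: IMAGE_NAME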

You can find out why your Pod's container is crashing using the kubectl command-line tool or Cloud Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Look for the Pod with the CrashLoopBackOff error.

To get the Pod's logs, run the following command:

kubectl logs POD_NAME

Replace POD_NAME with the name of the problematic Pod.

You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.

Console

Perform the following steps:

  1. Go to the Workloads page in Cloud Console.

    Go to Workloads

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.

  4. From the Pod's menu, click the Logs tab.

Check "Exit Code" of the crashed container

You can find the exit code by performing the following tasks:

  1. Run the following command:

    kubectl describe pod POD_NAME
    

    Replace POD_NAME with the name of the Pod.

  2. Review the value in the containers: CONTAINER_NAME: last state: exit code field:

    • If the exit code is 1, the container crashed because the application crashed.
    • If the exit code is 0, verify how long your app ran.

    Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart. You can also read the exit code directly with a JSONPath query, as shown after this list.
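If you prefer not to scan the describe output, the exit code can be read directly with a JSONPath query. This is a sketch that assumes a single-container Pod (container index 0) whose container has terminated at least once.

kubectl get pod POD_NAME \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'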

Connect to a running container

Open a shell to the Pod:

kubectl exec -it POD_NAME -- /bin/bash

If there is more than one container in your Pod, add -c CONTAINER_NAME.

Now you can run bash commands from inside the container: for example, you can test the network or check whether you have access to files or databases used by your application.
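For example, the following checks are often useful once you have a shell in the container. The hostname, port, and file path are placeholders, and the commands assume the relevant tools exist in the container image.

# Check DNS resolution from inside the Pod's network namespace
nslookup kubernetes.default

# Check TCP connectivity to a dependency (placeholder host and port)
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/DATABASE_HOST/5432' && echo open || echo closed

# Check access to a file used by the application (placeholder path)
ls -l /path/to/config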

ImagePullBackOff and ErrImagePull

ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.

You can verify this issue using Cloud Console or the kubectl command-line tool.

kubectl

To get more information about a Pod's container image, run the following command:

kubectl describe pod POD_NAME

Console

Perform the following steps:

  1. Go to the Workloads page in Cloud Console.

    Go to Workloads

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.

  4. From the Pod's menu, click the Events tab.

If the image is not found

If your image is not found:

  1. Verify that the image's name is correct.
  2. Verify that the image's tag is correct. (Try :latest or no tag to pull the latest image).
  3. If the image has a full registry path, verify that it exists in the Docker registry you are using. If you provide only the image name, check the Docker Hub registry.
  4. Try to pull the docker image manually:

    • SSH into the node:

      For example, to SSH into example-instance in the us-central1-a zone:

      gcloud compute ssh example-instance --zone us-central1-a
      
    • Run docker pull IMAGE_NAME.

    If this option works, you probably need to specify imagePullSecrets on the Pod. Pods can only reference image pull secrets in their own namespace, so this process needs to be done once per namespace (see the sketch after this list).
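As a sketch of that process, you can create a registry credential Secret and reference it from the Pod. The Secret name, registry server, credentials, and namespace are placeholders.

kubectl create secret docker-registry regcred \
    --docker-server=REGISTRY_SERVER \
    --docker-username=USERNAME \
    --docker-password=PASSWORD \
    --namespace=NAMESPACE

apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod            # placeholder name
  namespace: NAMESPACE
spec:
  containers:
  - name: app
    image: REGISTRY_SERVER/IMAGE_NAME:TAG
  imagePullSecrets:
  - name: regcred                    # must exist in the same namespace as the Pod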

Permission denied error

If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and have access to the image. Try one of the following methods depending on the registry in which you host your images.

Artifact Registry

If your image is in Artifact Registry, your node pool's service account needs read access to the repository that contains the image.

Grant the artifactregistry.reader role to the service account:

gcloud artifacts repositories add-iam-policy-binding REPOSITORY_NAME \
    --location=REPOSITORY_LOCATION \
    --member=SERVICE_ACCOUNT \
    --role="roles/artifactregistry.reader"

Replace the following:

  • REPOSITORY_NAME: the name of your Artifact Registry repository.
  • REPOSITORY_LOCATION: the region or multi-region of your Artifact Registry repository.
  • SERVICE_ACCOUNT: the name of the service account associated with your node pool.

Container Registry

If your image is in Container Registry, your node pool's service account needs read access to the Cloud Storage bucket that contains the image.

Grant the roles/storage.objectViewer role to the service account so that it can read from the bucket:

gsutil iam set gs://BUCKET_NAME \
    serviceAccount:SERVICE_ACCOUNT:roles/storage.objectViewer

Replace the following:

  • BUCKET_NAME: the name of the Cloud Storage bucket that contains your images. You can list all the buckets in your project using gsutil ls.
  • SERVICE_ACCOUNT: the name of the service account associated with your node pool.

Private registry

If your image is in a private registry, you might require keys to access the images. See Using private registries for more information.

Pod unschedulable

PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.

Insufficient resources

You might encounter an error indicating a lack of CPU, memory, or another resource. For example: "No nodes are available that match all of the predicates: Insufficient cpu (2)" which indicates that on two nodes there isn't enough CPU available to fulfill a Pod's requests.

The default CPU request is 100m, that is, 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
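The following minimal sketch shows where explicit requests go in a container spec; the Pod name, image, and request values are placeholders to adapt to your workload.

apiVersion: v1
kind: Pod
metadata:
  name: sized-pod         # placeholder name
spec:
  containers:
  - name: app
    image: IMAGE_NAME
    resources:
      requests:
        cpu: 250m         # placeholder CPU request
        memory: 256Mi     # placeholder memory request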

MatchNodeSelector

MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.

To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.

To see how nodes in your cluster are labelled, run the following command:

kubectl get nodes --show-labels

To attach a label to a node, run the following command:

kubectl label nodes NODE_NAME LABEL_KEY=LABEL_VALUE

Replace the following:

  • NODE_NAME: the desired node.
  • LABEL_KEY: the label's key.
  • LABEL_VALUE: the label's value.

For more information, refer to Assigning Pods to Nodes.

PodToleratesNodeTaints

PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod doesn't tolerate any node's existing taint.

To verify that this is the case, run the following command:

kubectl describe nodes NODE_NAME

In the output, check the Taints field, which lists key-value pairs and scheduling effects.

If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.

One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint, run the following command:

kubectl taint nodes NODE_NAME key:NoSchedule-
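Alternatively, if the taint is intentional, add a matching toleration to the Pod specification. The following is a minimal sketch that assumes a taint with key KEY, value VALUE, and the NoSchedule effect; the Pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod      # placeholder name
spec:
  tolerations:
  - key: "KEY"
    operator: "Equal"
    value: "VALUE"
    effect: "NoSchedule"
  containers:
  - name: app
    image: IMAGE_NAME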

PodFitsHostPorts

PodFitsHostPorts indicates that a hostPort that a Pod is requesting is already in use on the node.

To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.

Does not have minimum availability

If a node has adequate resources but you still see the Does not have minimum availability message, check the node's status. If the node's status is SchedulingDisabled or Cordoned, the node cannot schedule new Pods. You can check the status of a node using Cloud Console or the kubectl command-line tool.

kubectl

To get statuses of your nodes, run the following command:

kubectl get nodes

To enable scheduling on the node, run:

kubectl uncordon NODE_NAME

Console

Perform the following steps:

  1. Go to the Google Kubernetes Engine page in Cloud Console.

    Go to Google Kubernetes Engine

  2. Select the desired cluster. The Nodes tab displays the Nodes and their status.

To enable scheduling on the Node, perform the following steps:

  1. From the list, click the desired Node.

  2. From the Node details pane, click the Uncordon button.

Unbound PersistentVolumeClaims

Unbound PersistentVolumeClaims indicates that the Pod references a PersistentVolumeClaim that is not bound. This error might happen if your PersistentVolume failed to provision. You can verify that provisioning failed by getting the events for your PersistentVolumeClaim and examining them for failures.

To get events, run the following command:

kubectl describe pvc STATEFULSET_NAME-PVC_NAME-0

Replace the following:

  • STATEFULSET_NAME: the name of the StatefulSet object.
  • PVC_NAME: the name of the PersistentVolumeClaim object.

This may also happen if there was a configuration error during your manual pre-provisioning of a PersistentVolume and its binding to a PersistentVolumeClaim. You can try to pre-provision the volume again.

Connectivity issues

As mentioned in the Network Overview discussion, it is important to understand how Pods are wired from their network namespaces to the root namespace on the node in order to troubleshoot effectively. For the following discussion, unless otherwise stated, assume that the cluster uses GKE's native CNI rather than Calico's. That is, no network policy has been applied.

Pods on select nodes have no availability

If Pods on select nodes have no network connectivity, ensure that the Linux bridge is up:

ip address show cbr0

If the Linux bridge is down, raise it:

sudo ip link set cbr0 up

Ensure that the node is learning Pod MAC addresses attached to cbr0:

arp -an

Pods on select nodes have minimal connectivity

If Pods on select nodes have minimal connectivity, you should first confirm whether there are any lost packets by running tcpdump in the toolbox container:

sudo toolbox bash

Install tcpdump in the toolbox if you have not done so already:

apt install -y tcpdump

Run tcpdump against cbr0:

tcpdump -ni cbr0 host HOSTNAME and port PORT_NUMBER and [TCP|UDP|ICMP]

Should it appear that large packets are being dropped downstream from the bridge (for example, the TCP handshake completes, but no SSL hellos are received), ensure that the Linux bridge MTU is correctly set to the MTU of the cluster's VPC network.

ip address show cbr0

When overlays are used (for example, Weave or Flannel), this MTU must be further reduced to accommodate encapsulation overhead on the overlay.
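One way to compare the bridge MTU with the MTU of the node's primary interface is shown below; the interface name eth0 is an assumption and can differ by node image.

# MTU of the Linux bridge
ip address show cbr0 | grep mtu

# MTU of the node's primary interface, for comparison (eth0 assumed)
ip address show eth0 | grep mtu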

Intermittent failed connections

Connections to and from the Pods are forwarded by iptables. Flows are tracked as entries in the conntrack table and, where there are many workloads per node, conntrack table exhaustion may manifest as a failure. These can be logged in the serial console of the node, for example:

nf_conntrack: table full, dropping packet

If you are able to determine that intermittent issues are driven by conntrack exhaustion, you may increase the size of the cluster (thus reducing the number of workloads and flows per node), or increase nf_conntrack_max:

new_ct_max=$(awk '$1 == "MemTotal:" { printf "%d\n", $2/32; exit; }' /proc/meminfo)
sysctl -w net.netfilter.nf_conntrack_max="${new_ct_max:?}" \
  && echo "net.netfilter.nf_conntrack_max=${new_ct_max:?}" >> /etc/sysctl.conf

"bind: Address already in use" reported for a container

A container in a Pod is unable to start because, according to the container logs, the port that the application is trying to bind to is already in use. The container is crash looping. For example, in Cloud Logging:

resource.type="container"
textPayload:"bind: Address already in use"
resource.labels.container_name="redis"

2018-10-16 07:06:47.000 CEST 16 Oct 05:06:47.533 # Creating Server TCP listening socket *:60250: bind: Address already in use
2018-10-16 07:07:35.000 CEST 16 Oct 05:07:35.753 # Creating Server TCP listening socket *:60250: bind: Address already in use

When Docker crashes, sometimes a running container gets left behind and is stale. The process is still running in the network namespace allocated for the Pod, and listening on its port. Because Docker and the kubelet don't know about the stale container, they try to start a new container with a new process, which is unable to bind on the port because it gets added to the network namespace already associated with the Pod.

To diagnose this problem:

  1. Get the UUID of the Pod from its .metadata.uid field:

    kubectl get pod -o custom-columns="name:.metadata.name,UUID:.metadata.uid" ubuntu-6948dd5657-4gsgg
    
    name                      UUID
    ubuntu-6948dd5657-4gsgg   db9ed086-edba-11e8-bdd6-42010a800164
    
  2. Get the output of the following commands from the node:

    docker ps -a
    ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep [Pod UUID]
    
  3. Check the running processes from this Pod. Because the cgroup paths contain the UUID of the Pod, you can grep for the Pod UUID in the ps output. Also grep the line before each match, so that the docker-containerd-shim processes, which have the container ID in their arguments, are included as well. Cut the rest of the cgroup column to get simpler output:

    # ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep -B 1 db9ed086-edba-11e8-bdd6-42010a800164 | sed s/'blkio:.*'/''/
    1283089     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim 276e173b0846e24b704d4 12:
    1283107 1283089 Ss   sys_pause            4026532393         pause           /pause                                     12:
    1283150     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim ab4c7762f5abf40951770 12:
    1283169 1283150 Ss   do_wait              4026532393         sh              /bin/sh -c echo hello && sleep 6000000     12:
    1283185 1283169 S    hrtimer_nanosleep    4026532393           sleep           sleep 6000000                            12:
    1283244     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim 44e76e50e5ef4156fd5d3 12:
    1283263 1283244 Ss   sigsuspend           4026532393         nginx           nginx: master process nginx -g daemon off; 12:
    1283282 1283263 S    ep_poll              4026532393           nginx           nginx: worker process
    
  4. From this list, you can see the container IDs, which should be visible in docker ps as well.

    In this case:

    • docker-containerd-shim 276e173b0846e24b704d4 for pause
    • docker-containerd-shim ab4c7762f5abf40951770 for sh with sleep (sleep-ctr)
    • docker-containerd-shim 44e76e50e5ef4156fd5d3 for nginx (echoserver-ctr)
  5. Check those in the docker ps output:

    # docker ps --no-trunc | egrep '276e173b0846e24b704d4|ab4c7762f5abf40951770|44e76e50e5ef4156fd5d3'
    44e76e50e5ef4156fd5d383744fa6a5f14460582d0b16855177cbed89a3cbd1f   gcr.io/google_containers/echoserver@sha256:3e7b182372b398d97b747bbe6cb7595e5ffaaae9a62506c725656966d36643cc                   "nginx -g 'daemon off;'"                                                                                                                                                                                                                                                                                                                                                                     14 hours ago        Up 14 hours                             k8s_echoserver-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
    ab4c7762f5abf40951770d3e247fa2559a2d1f8c8834e5412bdcec7df37f8475   ubuntu@sha256:acd85db6e4b18aafa7fcde5480872909bd8e6d5fbd4e5e790ecc09acc06a8b78                                                "/bin/sh -c 'echo hello && sleep 6000000'"                                                                                                                                                                                                                                                                                                                                                   14 hours ago        Up 14 hours                             k8s_sleep-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
    276e173b0846e24b704d41cf4fbb950bfa5d0f59c304827349f4cf5091be3327   k8s.gcr.io/pause-amd64:3.1
    

    In normal cases, all container IDs from ps show up in the docker ps output. If there is one that you don't see, it's a stale container, and you will probably see a child process of its docker-containerd-shim process listening on the TCP port that is reported as already in use.

    To verify this, execute netstat in the container's network namespace. Get the pid of any container process (so NOT docker-containerd-shim) for the Pod.

    From the above example:

    • 1283107 - pause
    • 1283169 - sh
    • 1283185 - sleep
    • 1283263 - nginx master
    • 1283282 - nginx worker
    # nsenter -t 1283107 --net netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast
    
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # nsenter -t 1283169 --net netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast
    

    You can also execute netstat using ip netns, but you need to link the network namespace of the process manually, as Docker is not doing the link:

    # ln -s /proc/1283169/ns/net /var/run/netns/1283169
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns list
    1283169 (id: 2)
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns exec 1283169 netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # rm /var/run/netns/1283169
    

Mitigation:

The short term mitigation is to identify stale processes by the method outlined above, and end the processes using the kill [PID] command.

Long term mitigation involves identifying why Docker is crashing and fixing that. Possible reasons include:

  • Zombie processes piling up, causing the node to run out of PIDs
  • A bug in Docker
  • Resource pressure or OOM

Error: "failed to allocate for range 0: no IP addresses in range set"

GKE version 1.18.17 and later fixed an issue where out-of-memory (OOM) events would result in incorrect Pod eviction if the Pod was deleted before its containers were started. This incorrect eviction could result in orphaned pods that continued to have reserved IP addresses from the allocated node range. Over time, GKE ran out of IP addresses to allocate to new pods because of the build-up of orphaned pods. This led to the error message failed to allocate for range 0: no IP addresses in range set, because the allocated node range didn't have available IPs to assign to new pods.

To resolve this issue, upgrade your cluster and node pools to GKE version 1.18.17 or later.

To prevent this issue and resolve it on clusters with GKE versions prior to 1.18.17, increase your resource limits to avoid OOM events in the future, and then reclaim the IP addresses by removing the orphaned pods.

Remove the orphaned pods from affected nodes

You can remove the orphaned pods by draining the node, upgrading the node pool, or moving the affected directories.

Draining the node (recommended)

  1. Cordon the node to prevent new pods from scheduling on it:

     kubectl cordon NODE
    

    Replace NODE with the name of the node you want to drain.

  2. Drain the node. GKE automatically reschedules pods managed by deployments onto other nodes. Use the --force flag to drain orphaned pods that don't have a managing resource.

     kubectl drain NODE --force
    
  3. Uncordon the node to allow GKE to schedule new pods on it:

     kubectl uncordon NODE
    

Moving affected directories

You can identify orphaned Pod directories in /var/lib/kubelet/pods and move them out of the main directory to allow GKE to terminate the pods.
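A sketch of that comparison follows: list the Pod UIDs that the kubelet still has on disk, compare them against the Pods that the API server still tracks on the node, and move any directory whose UID has no matching Pod. The node name and the /tmp destination are placeholders.

# On the affected node: Pod UIDs present on disk
sudo ls /var/lib/kubelet/pods/

# From a workstation: UIDs of Pods the API server still tracks on that node
kubectl get pods --all-namespaces \
    --field-selector spec.nodeName=NODE_NAME \
    -o custom-columns=UID:.metadata.uid

# On the node: move a directory whose UID appears only on disk
sudo mv /var/lib/kubelet/pods/ORPHANED_POD_UID /tmp/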

Troubleshooting issues with terminating resources

Namespace stuck in Terminating state

Namespaces use Kubernetes finalizers to prevent deletion when one or more resources within a namespace still exist. When you delete a namespace using the kubectl delete command, the namespace enters the Terminating state until Kubernetes deletes its dependent resources and clears all finalizers. The namespace lifecycle controller first lists all resources in the namespace that GKE needs to delete. If GKE can't delete a dependent resource, or if the namespace lifecycle controller can't verify that the namespace is empty, the namespace remains in the Terminating state until you resolve the issue.

To resolve a namespace stuck in the Terminating state, you need to identify and remove the unhealthy component(s) blocking the deletion. Try one of the following solutions.

Find and remove unavailable API services

  1. List unavailable API services:

    kubectl get apiservice | grep False
    
  2. Troubleshoot any unresponsive services:

    kubectl describe apiservice API_SERVICE
    

    Replace API_SERVICE with the name of the unresponsive service.

  3. Check if the namespace is still terminating:

    kubectl get ns | grep Terminating
    

Find and remove remaining resources

  1. List all the resources remaining in the terminating namespace:

    kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get -n NAMESPACE
    

    Replace NAMESPACE with the name of the namespace you want to delete.

  2. Remove any resources displayed in the output.

  3. Check if the namespace is still terminating:

    kubectl get ns | grep Terminating
    

Force delete the namespace

You can remove the finalizers blocking namespace deletion to force the namespace to terminate.

  1. Save the namespace manifest as a YAML file:

    kubectl get ns NAMESPACE -o yaml > ns-terminating.yml
    
  2. Open the manifest in a text editor and remove all values in the spec.finalizers field:

    vi ns-terminating.yml
    
  3. Verify that the finalizers field is empty:

    cat ns-terminating.yml
    

    The output should look like the following:

    apiVersion: v1
    kind: Namespace
    metadata:
      annotations:
      name: NAMESPACE
    spec:
      finalizers:
    status:
      phase: Terminating
    
  4. Start an HTTP proxy to access the Kubernetes API:

    kubectl proxy
    
  5. Replace the namespace manifest using curl:

    curl -H "Content-Type: application/yaml" -X PUT --data-binary @ns-terminating.yml http://127.0.0.1:8001/api/v1/namespaces/NAMESPACE/finalize
    
  6. Check if the namespace is still terminating:

    kubectl get ns | grep Terminating
    

Troubleshooting issues with disk performance

The performance of the boot disk is important because the boot disk for GKE nodes is not only used for the operating system but also for the following:

  • docker images
  • the container filesystem for anything that is not mounted as a volume (that is, the overlay filesystem); this often includes directories like /tmp
  • disk-backed emptyDir volumes, unless the node uses local SSD.

Disk performance is shared for all disks of the same disk type on a node. For example, if you have a 100 GB pd-standard boot disk and a 100 GB pd-standard PersistentVolume with lots of activity, the performance of the boot disk will be that of a 200 GB disk. Also, if there is a lot of activity on the PersistentVolume, this will impact the performance of the boot disk as well.

If you encounter messages similar to the following on your nodes, these could be symptoms of low disk performance:

INFO: task dockerd:2314 blocked for more than 300 seconds.
fs: disk usage and inodes count on following dirs took 13.572074343s
PLEG is not healthy: pleg was last seen active 6m46.842473987s ago; threshold is 3m0s

To help resolve such issues, review the following:

  • Ensure you have consulted the Storage disk type comparisons and chosen a persistent disk type to suit your needs.
  • This issue often occurs for nodes that use standard persistent disks with a size of less than 200 GB. Consider increasing the size of your disks or switching to SSDs, especially for clusters used in production.
  • Consider enabling local SSD for ephemeral storage on your node pools. This is particularly effective if you have containers that frequently use emptyDir volumes.
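For the last suggestion, the following sketch creates a node pool with one local SSD attached to each node; the pool name is a placeholder, and flag availability can vary by gcloud version.

gcloud container node-pools create ssd-pool \
    --cluster=CLUSTER_NAME \
    [--region=REGION | --zone=ZONE] \
    --local-ssd-count=1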

Troubleshooting Cloud NAT packet loss from a GKE cluster

Because node VMs in GKE private clusters don't have external IP addresses, they can't connect to the internet by themselves. You can use Cloud NAT to allocate the external IP addresses and ports that allow private clusters to make public connections.

If a node VM runs out of its allocation of external ports and IP addresses from Cloud NAT, packets will drop. To avoid this, you can reduce the outbound packet rate or increase the allocation of available Cloud NAT source IP addresses and ports. The following sections describe how to diagnose and troubleshoot packet loss from Cloud NAT in the context of GKE private clusters.

Diagnosing packet loss

This section explains how to log dropped packets using Cloud Logging, and diagnose the cause of dropped packets using Cloud Monitoring.

Logging dropped packets

You can log dropped packets with the following query in Cloud Logging:

resource.type="nat_gateway"
resource.labels.region=REGION
resource.labels.gateway_name=GATEWAY_NAME
jsonPayload.allocation_status="DROPPED"

Replace the following:

  • REGION: the name of the region that the cluster is in.
  • GATEWAY_NAME: the name of the Cloud NAT gateway.

This command returns a list of all packets dropped by a Cloud NAT gateway, but does not identify the cause.

Monitoring causes for packet loss

To identify the causes of dropped packets, query the NAT gateway metrics in Cloud Monitoring. Packets are dropped for one of the following reasons, indicated by these error codes: OUT_OF_RESOURCES, ENDPOINT_ALLOCATION_FAILED, or NAT_ALLOCATION_FAILED.

To identify packets dropped due to OUT_OF_RESOURCES or ENDPOINT_ALLOCATION_FAILED error codes, use the following query:

fetch nat_gateway
  metric 'router.googleapis.com/nat/dropped_sent_packets_count'
  filter (resource.gateway_name == NAT_NAME)
  align rate(1m)
  every 1m
  group_by [metric.reason],
    [value_dropped_sent_packets_count_aggregate:
       aggregate(value.dropped_sent_packets_count)]

To identify packets dropped due to the NAT_ALLOCATION_FAILED error code, use the following query:

fetch nat_gateway
  metric 'router.googleapis.com/nat/nat_allocation_failed'
  group_by 1m,
    [value_nat_allocation_failed_count_true:
       count_true(value.nat_allocation_failed)]
  every 1m

Troubleshooting Cloud NAT with GKE IP masquerading

If the previous queries return empty results, and GKE Pods are unable to communicate with external IP addresses, troubleshoot your configuration:

The following configurations are common causes:

Cloud NAT configured to apply only to the subnet's primary IP address range

When Cloud NAT is configured only for the subnet's primary IP address range, packets sent from the cluster to external IP addresses must have a source node IP address. In this Cloud NAT configuration:

  • Pods can send packets to external IP addresses if those external IP address destinations are subject to IP masquerading. When deploying the ip-masq-agent, verify that the nonMasqueradeCIDRs list doesn't contain the destination IP address. Packets sent to those destinations are first converted to source node IP addresses before being processed by Cloud NAT.
  • To allow the Pods to connect to all external IP addresses with this Cloud NAT configuration, ensure that the ip-masq-agent is deployed and that the nonMasqueradeCIDRs list contains only the node and Pod IP address ranges of the cluster. Packets sent to destinations outside of the cluster are first converted to source node IP addresses before being processed by Cloud NAT.
  • To prevent Pods from sending packets to some external IP addresses, you need to explicitly exempt those addresses from masquerading. With the ip-masq-agent deployed, add the external IP addresses you want to block to the nonMasqueradeCIDRs list. Packets sent to those destinations leave the node with their original Pod IP address sources. The Pod IP addresses come from a secondary IP address range of the cluster's subnet. In this configuration, Cloud NAT won't operate on that secondary range.

Cloud NAT configured to apply only to the subnet's secondary IP address range used for Pod IPs

When Cloud NAT is configured only for the subnet's secondary IP address range used by the cluster's Pod IPs, packets sent from the cluster to external IP addresses must have a source Pod IP address. In this Cloud NAT configuration:

  • Using an IP masquerade agent causes packets to lose their source Pod IP address when processed by Cloud NAT. To keep the source Pod IP address, specify destination IP address ranges in a nonMasqueradeCIDRs list. With the ip-masq-agent deployed, any packets sent to destinations on the nonMasqueradeCIDRs list retain their source Pod IP addresses before being processed by Cloud NAT.
  • To allow the Pods to connect to all external IP addresses with this Cloud NAT configuration, ensure that the ip-masq-agent is deployed and that the nonMasqueradeCIDRs list is as large as possible (0.0.0.0/0 specifies all IP address destinations). Packets sent to all destinations retain their source Pod IP addresses before being processed by Cloud NAT.
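For reference, the nonMasqueradeCIDRs list used in both configurations is set in the ip-masq-agent ConfigMap in the kube-system namespace. The following is a minimal sketch with placeholder CIDRs; adjust the list to match your cluster's node and Pod ranges.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
    - NODE_CIDR          # placeholder, for example the subnet's primary range
    - POD_CIDR           # placeholder, for example the Pods' secondary range
    resyncInterval: 60s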

Optimizations to avoid packet loss

You can reduce packet loss by optimizing your application's outbound connection rate or by increasing the number of Cloud NAT source IP addresses and ports allocated to the gateway.

Optimizing your application

When an application makes multiple outbound connections to the same destination IP address and port, it can quickly exhaust the number of connections that Cloud NAT can make to that destination, which is bounded by the allocated NAT source address and source port tuples. In this scenario, reducing the application's outbound packet rate helps to reduce packet loss.

For details about how Cloud NAT uses NAT source addresses and source ports to make connections, including limits on the number of simultaneous connections to a destination, refer to Ports and connections.

Reducing the rate of outbound connections from the application can help to mitigate packet loss. You can accomplish this by reusing open connections. Common methods of reusing connections include connection pooling, multiplexing connections using protocols such as HTTP/2, or establishing persistent connections reused for multiple requests. For more information, see Ports and Connections.

Node version not compatible with control plane version

Check what version of Kubernetes your cluster's control plane is running, and then check what version of Kubernetes your cluster's node pools are running. If any of the cluster's node pools are more than two minor versions older than the control plane, this might be causing issues with your cluster.
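A quick way to compare the two versions is sketched below; the format string is one option among several, and the location flag follows the same convention as the other commands on this page.

# Control plane version
gcloud container clusters describe CLUSTER_NAME \
    [--region=REGION | --zone=ZONE] \
    --format="value(currentMasterVersion)"

# Kubernetes version reported by each node
kubectl get nodes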

Periodically, the GKE team performs upgrades of the cluster control plane on your behalf. Control planes are upgraded to newer stable versions of Kubernetes. By default, a cluster's nodes have auto-upgrade enabled, and it is recommended that you do not disable it.

If auto-upgrade is disabled for a cluster's nodes, and you do not manually upgrade your node pool version to a version that is compatible with the control plane, your control plane will eventually become incompatible with your nodes as the control plane is automatically upgraded over time. Incompatibility between your cluster's control plane and the nodes can cause unexpected issues.

The Kubernetes version and version skew support policy guarantees that control planes are compatible with nodes up to two minor versions older than the control plane. For example, Kubernetes 1.19 control planes are compatible with Kubernetes 1.19, 1.18, and 1.17 nodes. To resolve this issue, manually upgrade the node pool version to a version that is compatible with the control plane.

If you are concerned about the upgrade process causing disruption to workloads running on the affected nodes, follow the steps in the Migrating the workloads section of the Migrating workloads to different machine types tutorial. These steps let you migrate gracefully by creating a new node pool and then cordoning and draining the old node pool.

Metrics from your cluster aren't showing up in Cloud Monitoring

Ensure that you have activated the Cloud Monitoring API and the Cloud Logging API on your project, and that you are able to view your project in Cloud Monitoring.

If the issue persists, check the following potential causes:

  1. Ensure that you have enabled monitoring on your cluster.

    Monitoring is enabled by default for clusters created from the Google Cloud Console and from the gcloud command-line tool, but you can verify by running the following command or clicking into the cluster's details in the Cloud Console:

    gcloud container clusters describe CLUSTER_NAME
    

    The output from this command should state that the "monitoringService" is "monitoring.googleapis.com", and Cloud Monitoring should be enabled in the Cloud Console.

    If monitoring is not enabled, run the following command to enable it:

    gcloud container clusters update CLUSTER_NAME --monitoring-service=monitoring.googleapis.com
    
  2. How long has it been since your cluster was created or had monitoring enabled?

    It can take up to an hour for a new cluster's metrics to start appearing in Cloud Monitoring.

  3. Is a heapster or gke-metrics-agent (the OpenTelemetry Collector) running in your cluster in the "kube-system" namespace?

    This Pod might be failing to schedule because your cluster is running low on resources. Check whether Heapster or the OpenTelemetry Collector is running by calling kubectl get pods --namespace=kube-system and checking for Pods with heapster or gke-metrics-agent in the name.

  4. Is your cluster's control plane able to communicate with the nodes?

    Cloud Monitoring relies on this communication between the control plane and the nodes. You can check whether this is the case by running the following command:

    kubectl logs POD_NAME
    

    If this command returns an error, then the SSH tunnels may be causing the issue. See this section for further information.

If you are having an issue related to the Cloud Logging agent, see its troubleshooting documentation.

For more information, refer to the Logging documentation.

Missing permissions on account for Shared VPC clusters

For Shared VPC clusters, ensure that the service project's GKE service account has a binding for the Host Service Agent User role on the host project. You can do this using the gcloud tool.

To check if the role binding exists, run the following command in your host project:

gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:SERVICE_ACCOUNT_NAME"

Replace the following:

  • PROJECT_ID: your host project ID.
  • SERVICE_ACCOUNT_NAME: the GKE service account name.

In the output, look for the roles/container.hostServiceAgentUser role:

ROLE
...
roles/container.hostServiceAgentUser
...

If the hostServiceAgentUser role isn't in the list, follow the instructions in Granting the Host Service Agent User role to add the binding to the service account.
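For reference, the binding is granted on the host project. The following sketch assumes SERVICE_PROJECT_NUMBER is the service project's number; see Granting the Host Service Agent User role for the authoritative steps.

gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:service-SERVICE_PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com" \
    --role="roles/container.hostServiceAgentUser"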

Restore default service account to your GCP project

GKE's default service account, container-engine-robot, can accidentally become unbound from a project. GKE Service Agent is an Identity and Access Management (IAM) role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations.

You can check whether the service account has been removed from your project using the gcloud tool or the Cloud Console.

gcloud

Run the following command:

gcloud projects get-iam-policy PROJECT_ID

Replace PROJECT_ID with your project ID.

Console

Visit the IAM & Admin page in Cloud Console.

If the command or the dashboard does not display container-engine-robot among your service accounts, the service account has become unbound.

If you removed the GKE Service Agent role binding, run the following commands to restore the role binding:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format "value(projectNumber)")
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
  --role roles/container.serviceAgent

To confirm that the role binding was granted:

gcloud projects get-iam-policy $PROJECT_ID

If you see the service account name along with the container.serviceAgent role, the role binding has been granted. For example:

- members:
  - serviceAccount:service-1234567890@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent

Pods stuck in pending state after enabling Node Allocatable

If you are experiencing an issue with Pods stuck in pending state after enabling Node Allocatable, please note the following:

Starting with version 1.7.6, GKE reserves CPU and memory for Kubernetes overhead, including Docker and the operating system. See Cluster architecture for information on how much of each machine type's resources is available for scheduling Pods.

If Pods are pending after an upgrade, we suggest the following:

  1. Ensure CPU and memory requests for your Pods do not exceed their peak usage. Because GKE reserves CPU and memory for overhead, Pods cannot request those reserved resources. Pods that request more CPU or memory than they use prevent other Pods from requesting these resources, and might leave the cluster underutilized. For more information, see How Pods with resource requests are scheduled.

  2. Consider resizing your cluster. For instructions, see Resizing a cluster.

  3. Revert this change by downgrading your cluster. For instructions, see Manually upgrading a cluster or node pool.

Cluster's root Certificate Authority is expiring soon

Your cluster's root Certificate Authority is expiring soon. To prevent normal cluster operations from being interrupted, you must perform a credential rotation.
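Credential rotation is started and completed with cluster update commands, as sketched below; between the two steps, the nodes must be updated so that they pick up the new credentials, as described in the credential rotation documentation.

gcloud container clusters update CLUSTER_NAME \
    [--region=REGION | --zone=ZONE] \
    --start-credential-rotation

gcloud container clusters update CLUSTER_NAME \
    [--region=REGION | --zone=ZONE] \
    --complete-credential-rotation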

Seeing error "Instance 'Foo' does not contain 'instance-template' metadata"

You may see an error "Instance 'Foo' does not contain 'instance-template' metadata" as a status of a node pool that fails to upgrade, scale, or perform automatic node repair.

This message indicates that the metadata of VM instances allocated by GKE was corrupted. This typically happens when custom-authored automation or scripts attempt to add new instance metadata (like block-project-ssh-keys) and, instead of just adding or updating values, also delete existing metadata. You can read about VM instance metadata in Setting custom metadata.

If any of the critical metadata values (among others: instance-template, kube-labels, kubelet-config, kubeconfig, cluster-name, configure-sh, cluster-uid) were deleted, the node or the entire node pool might end up in an unstable state, because these values are crucial for GKE operations.

If the instance metadata was corrupted, the best way to recover it is to re-create the node pool that contains the corrupted VM instances. You will need to add a new node pool to your cluster and increase its node count, while cordoning and removing the nodes in the affected node pool. This is similar to the process explained in Migrating workloads to different machine types.

To find who edited the instance metadata and when, you can review the Compute Engine audit logging information or find logs using Logs Explorer with a search query similar to the following:

resource.type="gce_instance_group_manager"
protoPayload.methodName="v1.compute.instanceGroupManagers.setInstanceTemplate"

In the logs you may find the request originator IP address and user agent:

requestMetadata: {
  callerIp: "REDACTED"
  callerSuppliedUserAgent: "google-api-go-client/0.5 GoogleContainerEngine/v1"
}

Mounting a volume stops responding due to the fsGroup setting

One issue that can cause PersistentVolume mounting to fail is a Pod that is configured with the fsGroup setting. Normally, mounts automatically retry and the mount failure resolves itself. However, if the PersistentVolume is large, setting ownership is slow and can cause the mount to fail.

To confirm if a failed mount error is due to the fsGroup setting, you can check the logs for the Pod. If the issue is related to the fsGroup setting, you will see the following log entry:

Setting volume ownership for /var/lib/kubelet/pods/POD_UUID and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699

If the PersistentVolume does not mount within a few minutes, try the following to resolve this issue: