Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems using Kubernetes Engine.

Debugging Kubernetes resources

If you are experiencing an issue related to your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.

If you are having an issue with your application, its Pods, or its controller object, refer to Troubleshooting Applications.

The kubectl command isn't found

First, install the kubectl binary by running the following command:

sudo gcloud components update kubectl

Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable enables you to use kubectl commands without typing their full file path.

Alternatively, add the following line to ~/.bashrc (or ~/.bash_profile in macOS, or wherever your shell stores environment variables):

export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/

Finally, run the following command to load your updated .bashrc (or .bash_profile) file:

source ~/.bashrc
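
To confirm that kubectl is now on your PATH, you can run the following commands; both should succeed without you typing a full file path:

which kubectl
kubectl version --client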

kubectl commands return "connection refused" error

Set the cluster context with the following command:

gcloud container clusters get-credentials CLUSTER_NAME

If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:

gcloud container clusters list
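
Once credentials are fetched, you can confirm that kubectl is pointed at the intended cluster:

kubectl config current-context
kubectl cluster-info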

kubectl commands return "failed to negotiate an api version" error

Ensure kubectl has authentication credentials:

gcloud auth application-default login
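
If the error persists, one thing worth checking is which account gcloud is currently using, and then re-fetching the cluster credentials. CLUSTER_NAME is a placeholder:

gcloud auth list
gcloud container clusters get-credentials CLUSTER_NAME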

The kubectl logs, attach, exec, and port-forward commands hang

These commands rely on the cluster's master being able to talk to the nodes in the cluster. However, because the master isn't in the same Compute Engine network as your cluster's nodes, we rely on SSH tunnels to enable secure communication.

Kubernetes Engine saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. Kubernetes Engine also adds a firewall rule to your Compute Engine network allowing SSH access from the master's IP address to each node in the cluster.

If any of the above kubectl commands hang, it's likely that the master is unable to open SSH tunnels with the nodes. Check for these potential causes:

  1. The cluster doesn't have any nodes.

    If you've scaled down the number of nodes in your cluster to zero, SSH tunnels won't work.

    To fix it, resize your cluster to have at least one node (see the example commands after this list).

  2. Pods in the cluster have gotten stuck in a terminating state and have prevented nodes that no longer exist from being removed from the cluster.

    This issue should only affect Kubernetes version 1.1, and can be triggered by repeated resizing of the cluster.

    To fix it, delete the Pods that have been in a terminating state for more than a few minutes. The old nodes are then removed from the master's API and replaced by the new nodes.

  3. Your network's firewall rules don't allow for SSH access to the master.

    All Compute Engine networks are created with a firewall rule called "default-allow-ssh" that allows SSH access from all IP addresses (requiring a valid private key, of course). Kubernetes Engine also inserts an SSH rule for each cluster of the form gke-<cluster_name>-<random-characters>-ssh that allows SSH access specifically from the cluster's master IP to the cluster's nodes. If neither of these rules exists, then the master will be unable to open SSH tunnels.

    To fix it, re-add a firewall rule allowing access from the master's IP address to VMs with the tag that's on all the cluster's nodes (see the example commands after this list).

  4. Your project's common metadata entry for "sshKeys" is full.

    If the project's metadata entry named "sshKeys" is close to the 32KiB size limit, then Kubernetes Engine isn't able to add the SSH key it needs to open SSH tunnels. You can see your project's metadata by running gcloud compute project-info describe [--project=PROJECT] and checking the length of the list of sshKeys.

    To fix it, delete some of the SSH keys that are no longer needed.

  5. You have set a metadata field with the key "sshKeys" on the VMs in the cluster.

    The node agent on VMs prefers per-instance sshKeys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the master's SSH key in the project metadata won't be respected by the nodes. To check, run gcloud compute instances describe <VM-name> and look for an "sshKeys" field in the metadata.

    To fix it, delete the per-instance SSH keys from the instance metadata (see the example commands after this list).
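
As a concrete sketch of the fixes for causes 1, 3, and 5, the following commands may help. CLUSTER_NAME, ZONE, NETWORK, MASTER_IP, NODE_TAG, and VM-name are placeholders you must substitute, and the firewall rule name shown is illustrative:

# Cause 1: resize the cluster to at least one node.
gcloud container clusters resize CLUSTER_NAME --size 1 --zone ZONE

# Cause 3: re-add a firewall rule allowing SSH from the master's IP to the nodes.
gcloud compute firewall-rules create CLUSTER_NAME-ssh-repair \
  --network NETWORK \
  --source-ranges MASTER_IP/32 \
  --target-tags NODE_TAG \
  --allow tcp:22

# Cause 5: remove per-instance SSH keys so the project-wide keys are used again.
gcloud compute instances remove-metadata VM-name --keys sshKeys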

Note that these features are not required for the cluster to function correctly. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.

Metrics from your cluster aren't showing up in Stackdriver

Ensure that you have activated the Stackdriver Monitoring API and the Stackdriver Logging API on your project, and that you are able to view your project in Stackdriver.

If the issue persists, check the following potential causes:

  1. Ensure that you have enabled monitoring on your cluster.

    Monitoring is enabled by default for clusters created from the Developers Console and the gcloud command-line tool, but you can verify by running the following command or clicking into the cluster's details in the Developers Console:

    gcloud container clusters describe CLUSTER_NAME
    

    The output from the gcloud command-line tool should state that the "monitoringService" is "monitoring.googleapis.com", and Cloud Monitoring should be enabled in the Developers Console.

    If monitoring is not enabled, run the following command to enable it:

    gcloud container clusters update CLUSTER_NAME --monitoring-service=monitoring.googleapis.com
    
  2. How long has it been since your cluster was created or had monitoring enabled?

    It can take up to an hour for a new cluster's metrics to start appearing in Stackdriver Monitoring.

  3. Is a heapster Pod running in your cluster in the "kube-system" namespace?

    It's possible that this Pod is failing to schedule because your cluster is too full. Check whether it's running by calling kubectl get pods --namespace=kube-system.

  4. Is your cluster's master able to communicate with the nodes?

    Stackdriver Monitoring relies on the master being able to communicate with the nodes. You can check whether this is the case by running kubectl logs [POD-NAME]. If this command returns an error, the SSH tunnels may be causing the issue; see the section on hanging kubectl commands above, and the command sketch after this list.
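
As a sketch, the following commands collect the checks above. CLUSTER_NAME and [POD-NAME] are placeholders, and the gcloud services enable command assumes a gcloud release that supports it:

# Check 1: confirm the cluster's monitoring service.
gcloud container clusters describe CLUSTER_NAME --format="value(monitoringService)"

# Enable the Stackdriver APIs if they aren't already enabled.
gcloud services enable monitoring.googleapis.com logging.googleapis.com

# Check 3: look for the heapster Pod in the kube-system namespace.
kubectl get pods --namespace=kube-system

# Check 4: confirm the master can reach the nodes.
kubectl logs [POD-NAME]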

If you are having an issue related to the Stackdriver Logging agent, see its troubleshooting documentation.

For more information, refer to the Stackdriver documentation.

Error 404: Resource "not found" when calling gcloud container commands

Re-authenticate to the gcloud command-line tool:

gcloud auth login

Error 400/403: Missing edit permissions on account

Your Compute Engine and/or Kubernetes Engine service account has been deleted or edited.

When you enable the Compute Engine or Kubernetes Engine API, a service account is created and given edit permissions on your project. If at any point you edit the permissions, remove the account entirely, or disable the API, cluster creation and all management functionality will fail.

To resolve the issue, re-enable the Kubernetes Engine API. This correctly restores your service accounts and permissions.

  1. Visit the APIs & Services page.
  2. Select your project.
  3. Enable the Kubernetes Engine API. If you have previously enabled the API, you must first Disable it and then Enable it again. Wait for the API and related services to be enabled. This can take several minutes.

Alternatively, use the gcloud command-line tool:

gcloud services enable container.googleapis.com
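
To confirm that the API is now enabled, you can list the enabled services and look for container.googleapis.com:

gcloud services list --enabled | grep container.googleapis.com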

Replicating 1.8.x (and earlier) automatic firewall rules on 1.9.x and later

If your cluster runs Kubernetes version 1.9.x or later, the automatic firewall rules have changed to disallow workloads in a Kubernetes Engine cluster from initiating communication with other Compute Engine VMs that are outside the cluster but on the same network.

You can replicate the automatic firewall rules behavior of a 1.8.x (and earlier) cluster by performing the following steps:

First, find your cluster's network:

gcloud container clusters describe [CLUSTER_NAME] --format="get(network)"

Then get the cluster's IPv4 CIDR used for the containers:

gcloud container clusters describe [CLUSTER_NAME] --format="get(clusterIpv4Cidr)"

Finally, create a firewall rule for the network, with the CIDR as the source range, and allow all protocols:

gcloud compute firewall-rules create "[CLUSTER_NAME]-to-all-vms-on-network" --network="[NETWORK]" --source-ranges="[CLUSTER_IPV4_CIDR]" --allow=tcp,udp,icmp,esp,ah,sctp
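
These three steps can be combined into a single bash snippet. This sketch assumes a cluster named CLUSTER_NAME and that each describe command returns a single value:

NETWORK=$(gcloud container clusters describe CLUSTER_NAME --format="get(network)")
CLUSTER_IPV4_CIDR=$(gcloud container clusters describe CLUSTER_NAME --format="get(clusterIpv4Cidr)")
gcloud compute firewall-rules create "CLUSTER_NAME-to-all-vms-on-network" \
  --network="$NETWORK" \
  --source-ranges="$CLUSTER_IPV4_CIDR" \
  --allow=tcp,udp,icmp,esp,ah,sctp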

Restore default service account to your GCP project

Kubernetes Engine's default service account, container-engine-robot, can accidentally become unbound from a project. Kubernetes Engine Service Agent is an IAM role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations.

You can check to see if the service account has been removed from your project by running gcloud projects get-iam-policy [PROJECT_ID] or by visiting the IAM & admin menu in Google Cloud Platform Console. If the command or the dashboard does not display container-engine-robot among your service accounts, the service account has become unbound.

If you removed the Kubernetes Engine Service Agent role binding, run the following commands to restore the role binding:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format "value(projectNumber)")
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
  --role roles/container.serviceAgent

To confirm that the role binding was granted:

gcloud projects get-iam-policy $PROJECT_ID

If you see the service account name along with the container.serviceAgent role, the role binding has been granted. For example:

- members:
  - serviceAccount:service-1234567890@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
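
Alternatively, you can filter the policy for just that role. This filter form assumes a gcloud release that supports --flatten and --filter:

gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/container.serviceAgent" \
  --format="value(bindings.members)"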