Troubleshooting pages


This page provides a list of the troubleshooting pages in Google Kubernetes Engine (GKE), organized by the tasks you'll typically complete when building your GKE environment. For example, you might begin by setting up a cluster, then move on to configuring networking, provisioning storage, and establishing cluster security. From there, you might deploy your workloads and begin managing and monitoring your cluster.

This page also provides access to more general troubleshooting topics: known issues and 4xx errors.

Cluster setup

Topic | Description
Cluster creation | Resolve issues with creating clusters.
Autopilot clusters | Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues.
Kubectl command-line tool | Troubleshoot the kubectl command-line tool in GKE, including issues with authentication and authorization. This page also explains how to troubleshoot the Konnectivity proxy to check whether it's causing the kubectl logs, attach, exec, or port-forward commands to stop responding.
Standard node pools | Troubleshoot GKE Standard node pools, including issues with node pool creation, best-effort provisioning, corrupted instance metadata, and migrating workloads to new node pools.
Node registration | Troubleshoot issues that occur when adding nodes to your GKE Standard cluster, such as node registration failures and missing prerequisites for successful node registration.
Container runtime | Troubleshoot container runtimes in GKE, including issues with containerd, dockershim, and private registries.

Networking

Topic | Description
Cluster connectivity | Troubleshoot network connectivity, including issues with Pod network connectivity.
IP address management in VPC clusters | Troubleshoot managing IP addresses in VPC-native clusters, including issues with subnet exhaustion and default SNAT.
DNS | Troubleshoot issues that occur with the Cloud DNS service in GKE, including issues with Cloud DNS quotas and response policies.
Cluster network isolation | Troubleshoot cluster network isolation, including issues with cluster creation, control plane access, VPC Network Peering, and connectivity to public resources.
Load balancing | Troubleshoot load balancing, including issues with BackendConfig, Ingress security policies, 500 series errors with NEGs, and internal Ingress.
Multi Cluster Ingress | Troubleshoot MultiClusterIngress and MultiClusterService resources, including issues with VIPs, 502 responses, and config cluster migration.
Cloud NAT packet loss from a cluster | Troubleshoot packet loss from Cloud NAT in clusters with private nodes, including how to use Cloud Logging and Cloud Monitoring to identify the cause of packet loss.

Storage

Topic | Description
Storage | Troubleshoot storage, including issues with regional persistent disks, disk performance, and volume expansion.

Cluster security

Topic | Description
Authentication | Troubleshoot authentication in GKE, including issues with RBAC, Workload Identity Federation for GKE, and the GKE metadata server.
Service accounts | Troubleshoot service accounts, including restoring the default service account and enabling the Compute Engine default service account.
Application-layer secrets | Troubleshoot issues that can occur when configuring application-layer secrets encryption, including failed updates and errors with Cloud Key Management Service keys.

Cluster's root Certificate Authority expiring soon

Topic | Description
Root Certificate Authority (CA) expiring | If your cluster's root CA is expiring soon, learn how to perform a credential rotation so that normal cluster operations aren't interrupted.

Workloads

Topic | Description
Deployed workloads | Troubleshoot errors for workloads running in a GKE cluster, including CrashLoopBackOff, ImagePullBackOff, and PodUnschedulable.
Arm workloads | Troubleshoot issues with Arm workloads, including Pods on Arm nodes crashing.
TPUs | Troubleshoot TPUs, including issues with quota, node auto-provisioning, workload configuration, and scheduling.
GPUs | Troubleshoot GPUs, including issues with GPU driver installation, device plugin errors, and container images.

Cluster management

Topic | Description
Upgrades | Troubleshoot issues with GKE cluster upgrades, such as a kube-apiserver that's unhealthy after a control plane upgrade or workloads that are evicted after an upgrade.
Webhooks | Troubleshoot admission webhooks and keep your cluster control plane stable when you use them.
Namespace stuck in the Terminating state | Troubleshoot namespaces stuck in the Terminating state by identifying and removing the unhealthy components that are blocking deletion.

Monitoring

Topic | Description
System metrics | Troubleshoot system metrics not appearing in Cloud Monitoring.
Monitoring dashboards | Troubleshoot monitoring dashboards, including issues with enabling monitoring, missing Kubernetes resources, and permissions.
Logging | Troubleshoot logging, including issues with enabling logging, missing logs, and quotas.

4xx errors

Topic | Description
4xx errors | Troubleshoot some of the 400, 401, 403, and 404 errors that you might encounter when using GKE. This page also explains how to troubleshoot "missing edit permissions on account" errors.

Known issues

Topic | Description
Known issues | Identify and resolve known issues that might affect your use of GKE.