This page lists troubleshooting pages for common issues you might
encounter when using Google Kubernetes Engine (GKE). This page is for
Admins and architects, Security specialists, Networking specialists, or
Storage specialists who troubleshoot GKE configurations. To learn more about GKE roles, see Common GKE Enterprise user roles and tasks.
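Use this page to diagnose and resolve issues that you encounter at various
stages of working with your GKE infrastructure: cluster setup, networking,
storage, cluster security, workloads, cluster management, and monitoring.
This page also links to more general troubleshooting topics, such as known issues.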
Cluster setup
Topic |
Description |
Cluster creation |
Resolve issues with creating clusters. |
Autopilot clusters |
Diagnose and troubleshoot GKE Autopilot clusters, including cluster creation, namespace deletion, scaling, and workload issues. |
Kubectl command-line tool |
Troubleshoot the kubectl command-line tool in
GKE, including issues with authentication and authorization.
The linked page also explains how to troubleshoot the Konnectivity proxy
to check whether it's causing the kubectl logs, attach,
exec, or port-forward commands to stop responding. |
Standard node pools |
Troubleshoot GKE Standard node pools,
including issues with node pool creation, best-effort provisioning,
corrupted instance metadata, and migrating workloads to new node pools. |
Node registration |
Troubleshoot issues that occur when adding nodes to your
GKE Standard cluster, such as node registration
failures and missing prerequisites for successful node registration. |
Container runtime |
Troubleshoot container runtimes in GKE, including
issues with containerd and dockershim, and
private registries. |
Networking
Topic |
Description |
Cluster connectivity |
Troubleshoot network connectivity, including issues with
Pod network connectivity. |
IP address management in VPC clusters |
Troubleshoot managing IP addresses in VPC-native
clusters, including issues with subnet exhaustion and default SNAT. |
Kube-dns in GKE |
Learn to identify the source of kube-dns issues by investigating areas
such as the /etc/resolv.conf file and network policies. Also
learn how to resolve common issues like intermittent DNS timeouts. |
Cloud DNS in GKE |
Learn to identify the source of Cloud DNS issues in
GKE with checks such as verifying basic settings and
investigating logs. Also learn how to resolve errors such as API rate
limits or insufficient quota. |
Cluster network isolation |
Troubleshoot cluster network isolation, including issues with cluster creation,
control plane access, VPC Network Peering, and connectivity to public
resources. |
Load balancing |
Troubleshoot load balancing, including issues with BackendConfig,
Ingress security policies, 500 series errors with NEGs, and internal
Ingress. |
Multi Cluster Ingress |
Troubleshoot MultiClusterIngress and
MultiClusterService resources, including issues with VIPs,
502 responses, and config cluster migration. |
Cloud NAT packet loss from a cluster |
Troubleshoot packet loss from Cloud NAT in clusters with private
nodes, including how to use Cloud Logging and Cloud Monitoring to
identify the cause of packet loss. |
Storage
Topic |
Description |
Storage |
Troubleshoot storage, including issues with regional persistent disks,
disk performance, and volume expansion. |
Cluster security
Topic |
Description |
Authentication |
Troubleshoot authentication in GKE, including issues
with RBAC, Workload Identity Federation for GKE, and the GKE
metadata server. |
Service accounts |
Troubleshoot service accounts, including restoring the default service
account and enabling the Compute Engine default service account. |
Application-layer secrets |
Troubleshoot issues that can occur when configuring application-layer
secrets encryption, including failed updates and errors
where you're unable to use a Cloud KMS key. |
Cluster's root Certificate Authority expiring soon
Workloads
Topic |
Description |
Deployed workloads |
Troubleshoot errors for workloads running in a GKE
cluster, including CrashLoopBackOff, ImagePullBackOff, and PodUnschedulable.
Read the PodUnschedulable section for advice on errors like
MatchNodeSelector and "Does not have minimum availability". |
Arm workloads |
Troubleshoot issues with Arm workloads, including Pods on Arm nodes
crashing. |
TPUs |
Troubleshoot TPUs, including issues with quota, node
auto-provisioning, workload configuration, and scheduling. |
GPUs |
Troubleshoot GPUs, including issues with GPU driver installation,
device plugin errors, and container images. |
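Before you open the Deployed workloads page, it can help to see which of the errors named in the table above your Pods are actually reporting. The following is a minimal sketch, not an official GKE tool. It assumes that you have already parsed the output of kubectl get pods -o json into a Python dictionary. The function name summarize_pod_problems is hypothetical. The field paths it reads (status.containerStatuses[].state.waiting.reason and the PodScheduled condition with reason Unschedulable) are standard Kubernetes Pod status fields.

```python
from collections import defaultdict

# Container waiting reasons named in the Deployed workloads row.
WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def summarize_pod_problems(pod_list):
    """Group Pod names by the problem they report.

    pod_list is assumed to be the parsed JSON document printed by
    `kubectl get pods -o json` (a dict with an "items" list).
    """
    problems = defaultdict(list)
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})

        # A container that is stuck reports a waiting reason.
        for container in status.get("containerStatuses", []):
            reason = container.get("state", {}).get("waiting", {}).get("reason")
            if reason in WAITING_REASONS and name not in problems[reason]:
                problems[reason].append(name)

        # An unschedulable Pod has PodScheduled=False with reason "Unschedulable".
        # It is grouped under "PodUnschedulable" to match the name used on this page.
        for condition in status.get("conditions", []):
            if (
                condition.get("type") == "PodScheduled"
                and condition.get("status") == "False"
                and condition.get("reason") == "Unschedulable"
            ):
                problems["PodUnschedulable"].append(name)
    return dict(problems)
```

One way to use it is to pipe kubectl get pods --all-namespaces -o json into a script that calls json.load(sys.stdin) and passes the result to the function. The returned dictionary maps each problem name to the Pods that report it, which shows you which section of the linked page to start with.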
Cluster management
Topic |
Description |
Upgrades |
Troubleshoot issues with GKE cluster upgrades, such as
a kube-apiserver that's unhealthy after a control plane
upgrade or workloads that are evicted after an upgrade. |
Webhooks |
Understand how to troubleshoot and ensure the stability of your
cluster control plane when using admission webhooks. |
Namespace stuck in the Terminating state |
Troubleshoot issues with namespaces stuck in the
Terminating state by identifying and removing the unhealthy
components that are blocking deletion. |
Monitoring
Topic |
Description |
System metrics |
Troubleshoot system metrics not appearing in Cloud Monitoring. |
Monitoring dashboards |
Troubleshoot monitoring dashboards, including issues with enabling
monitoring, missing Kubernetes resources, and permissions. |
Logging |
Troubleshoot logging, including issues with enabling logging, missing
logs, and quotas. |
4xx errors
Known issues
Topic |
Description |
Known issues |
Identify and resolve known issues that might
affect your use of GKE. |