Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems using Google Kubernetes Engine.

Debugging Kubernetes resources

If you are experiencing an issue related to your cluster, refer to Troubleshooting Clusters in the Kubernetes documentation.

If you are having an issue with your application, its Pods, or its controller object, refer to Troubleshooting Applications.

The kubectl command isn't found

First, install the kubectl binary by running the following command:

sudo gcloud components update kubectl

Answer "yes" when the installer prompts you to modify your $PATH environment variable. Modifying this variable enables you to use kubectl commands without typing their full file path.

Alternatively, add the following line to ~/.bashrc (or ~/.bash_profile on macOS, or wherever your shell stores environment variables):

export PATH=$PATH:/usr/local/share/google/google-cloud-sdk/bin/

Finally, run the following command to load your updated .bashrc (or .bash_profile) file:

source ~/.bashrc

kubectl commands return "connection refused" error

Set the cluster context with the following command:

gcloud container clusters get-credentials CLUSTER_NAME

If you are unsure of what to enter for CLUSTER_NAME, use the following command to list your clusters:

gcloud container clusters list

kubectl commands return "failed to negotiate an api version" error

Ensure kubectl has authentication credentials:

gcloud auth application-default login

The kubectl logs, attach, exec, and port-forward commands hang

These commands rely on the cluster's master being able to talk to the nodes in the cluster. However, because the master isn't in the same Compute Engine network as your cluster's nodes, we rely on SSH tunnels to enable secure communication.

GKE saves an SSH public key file in your Compute Engine project metadata. All Compute Engine VMs using Google-provided images regularly check their project's common metadata and their instance's metadata for SSH keys to add to the VM's list of authorized users. GKE also adds a firewall rule to your Compute Engine network allowing SSH access from the master's IP address to each node in the cluster.

If any of the above kubectl commands hangs, it's likely that the master is unable to open SSH tunnels with the nodes. Check for these potential causes:

  1. The cluster doesn't have any nodes.

    If you've scaled down the number of nodes in your cluster to zero, SSH tunnels won't work.

    To fix it, resize your cluster to have at least one node.

  2. Pods in the cluster have gotten stuck in a terminating state and have prevented nodes that no longer exist from being removed from the cluster.

    This is an issue that should only affect Kubernetes version 1.1, but could be caused by repeated resizing of the cluster.

    To fix it, delete the Pods that have been in a terminating state for more than a few minutes. The old nodes are then removed from the master's API and replaced by the new nodes.

  3. Your network's firewall rules don't allow for SSH access to the master.

    All Compute Engine networks are created with a firewall rule called "default-allow-ssh" that allows SSH access from all IP addresses (requiring a valid private key, of course). GKE also inserts an SSH rule for each cluster of the form gke-<cluster_name>-<random-characters>-ssh that allows SSH access specifically from the cluster's master IP to the cluster's nodes. If neither of these rules exists, then the master will be unable to open SSH tunnels.

    To fix it, re-add a firewall rule that allows access from the master's IP address to VMs with the tag that's on all the cluster's nodes (see the example command after this list).

  4. Your project's common metadata entry for "ssh-keys" is full.

    If the project's metadata entry named "ssh-keys" is close to the 32KiB size limit, then GKE isn't able to add its own SSH key to enable it to open SSH tunnels. You can see your project's metadata by running gcloud compute project-info describe [--project=PROJECT], then check the length of the list of ssh-keys.

    To fix it, delete some of the SSH keys that are no longer needed.

  5. You have set a metadata field with the key "ssh-keys" on the VMs in the cluster.

    The node agent on VMs prefers per-instance ssh-keys to project-wide SSH keys, so if you've set any SSH keys specifically on the cluster's nodes, then the master's SSH key in the project metadata won't be respected by the nodes. To check, run gcloud compute instances describe <VM-name> and look for an "ssh-keys" field in the metadata.

    To fix it, delete the per-instance SSH keys from the instance metadata.
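
For cause 3, a sketch of how you might check for and re-create the SSH firewall rule (the rule name, node tag, and master IP below are placeholders; the master's IP address is the endpoint field of the cluster):

gcloud compute firewall-rules list --filter="name~gke-.*-ssh"
gcloud container clusters describe [CLUSTER_NAME] --format="get(endpoint)"
gcloud compute firewall-rules create "[CLUSTER_NAME]-allow-master-ssh" \
  --network="[NETWORK]" \
  --source-ranges="[MASTER_IP]/32" \
  --target-tags="[NODE_TAG]" \
  --allow=tcp:22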

It's worth noting that these features are not required for the correct functioning of the cluster. If you prefer to keep your cluster's network locked down from all outside access, be aware that features like these won't work.

Metrics from your cluster aren't showing up in Stackdriver

Ensure that you have activated the Stackdriver Monitoring API and the Stackdriver Logging API on your project, and that you are able to view your project in Stackdriver.

If the issue persists, check the following potential causes:

  1. Ensure that you have enabled monitoring on your cluster.

    Monitoring is enabled by default for clusters created from GCP Console and the gcloud command-line tool, but you can verify this by running the following command or by clicking into the cluster's details in GCP Console:

    gcloud container clusters describe [CLUSTER_NAME]
    

    The output from the gcloud command-line tool should state that "monitoringService" is "monitoring.googleapis.com", and Cloud Monitoring should be shown as enabled in GCP Console.

    If monitoring is not enabled, run the following command to enable it:

    gcloud container clusters update [CLUSTER_NAME] --monitoring-service=monitoring.googleapis.com
    
  2. How long has it been since your cluster was created or had monitoring enabled?

    It can take up to an hour for a new cluster's metrics to start appearing in Stackdriver Monitoring.

  3. Is a Heapster Pod running in your cluster in the "kube-system" namespace?

    It's possible that this Pod is failing to schedule because your cluster is too full. Check whether it's running by calling kubectl get pods --namespace=kube-system.

  4. Is your cluster's master able to communicate with the nodes?

    Stackdriver Monitoring relies on this communication. You can check whether this is the case by running kubectl logs [POD_NAME]. If this command returns an error, the SSH tunnels may be causing the issue; see the section above.
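
For example, checks 3 and 4 can be combined by listing the kube-system Pods and then fetching logs from one of them (a sketch; the grep filter and Pod name placeholder are illustrative):

kubectl get pods --namespace=kube-system | grep heapster
kubectl logs --namespace=kube-system [HEAPSTER_POD_NAME]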

If you are having an issue related to the Stackdriver Logging agent, see its troubleshooting documentation.

For more information, refer to the Stackdriver documentation.

Error 404: Resource "not found" when calling gcloud container commands

Re-authenticate to the gcloud command-line tool:

gcloud auth login

Error 400/403: Missing edit permissions on account

Your Compute Engine and/or Kubernetes Engine service account has been deleted or edited.

When you enable the Compute Engine or Kubernetes Engine API, a service account is created and given edit permissions on your project. If at any point you edit the permissions, remove the account entirely, or disable the API, cluster creation and all management functionality will fail.

To resolve the issue, re-enable the Kubernetes Engine API, which correctly restores your service accounts and permissions.

  1. Visit the APIs & Services page.
  2. Select your project.
  3. Click __Enable APIs and Services__.
  4. Search for Kubernetes Engine, then select the API from the search results.
  5. Click __Enable__. If you have previously enabled the API, you must first disable it and then enable it again. It can take several minutes for the API and related services to be enabled.

Alternatively, use the gcloud command-line tool:

gcloud services enable container.googleapis.com

Replicating 1.8.x (and earlier) automatic firewall rules on 1.9.x and later

If your cluster runs Kubernetes version 1.9.x or later, the automatic firewall rules no longer allow workloads in a GKE cluster to initiate communication with other Compute Engine VMs that are outside the cluster but on the same network.

You can replicate the automatic firewall rules behavior of a 1.8.x (and earlier) cluster by performing the following steps:

First, find your cluster's network:

gcloud container clusters describe [CLUSTER_NAME] --format="get(network)"

Then get the cluster's IPv4 CIDR used for the containers:

gcloud container clusters describe [CLUSTER_NAME] --format="get(clusterIpv4Cidr)"

Finally, create a firewall rule for the network, with the CIDR as the source range, and allow all protocols:

gcloud compute firewall-rules create "[CLUSTER_NAME]-to-all-vms-on-network" --network="[NETWORK]" --source-ranges="[CLUSTER_IPV4_CIDR]" --allow=tcp,udp,icmp,esp,ah,sctp

Restore default service account to your GCP project

GKE's default service account, container-engine-robot, can accidentally become unbound from a project. GKE Service Agent is an IAM role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations.

You can check whether the service account has been removed from your project by running gcloud projects get-iam-policy [PROJECT_ID] or by visiting the IAM & admin menu in Google Cloud Platform Console. If the command or the dashboard does not display container-engine-robot among your service accounts, the service account has become unbound.

If you removed the GKE Service Agent role binding, run the following commands to restore the role binding:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format "value(projectNumber)")
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
  --role roles/container.serviceAgent

To confirm that the role binding was granted:

gcloud projects get-iam-policy $PROJECT_ID

If you see the service account name along with the container.serviceAgent role, the role binding has been granted. For example:

- members:
  - serviceAccount:service-1234567890@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent

Troubleshooting issues with deployed workloads

GKE returns an error if there are issues with a workload's Pods. You can check the status of a Pod using the kubectl command-line tool or Google Cloud Platform Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Output:

NAME            READY   STATUS              RESTARTS    AGE
[POD_NAME]      0/1     CrashLoopBackOff    23          8d

To get more detailed information about a specific Pod, run:

kubectl describe pod [POD_NAME]

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click on the error status message.

The following sections explain some common errors returned by workloads and how to resolve them.

CrashLoopBackOff

CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs might aid in troubleshooting the root cause.

By default, crashed containers restart with an exponential backoff delay that is capped at five minutes. You can change this behavior by setting the restartPolicy field in the Deployment's Pod specification, under spec: restartPolicy. The field's default value is Always.
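
For reference, a minimal sketch of where the field sits in a Pod specification (the names and image are illustrative; note that Pods managed by controllers such as Deployments only accept Always, so OnFailure and Never apply to bare Pods and Jobs):

apiVersion: v1
kind: Pod
metadata:
  name: example-app               # illustrative name
spec:
  restartPolicy: OnFailure        # Always (default), OnFailure, or Never
  containers:
  - name: app
    image: gcr.io/[PROJECT_ID]/example-app:1.0    # illustrative image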

You can find out why your Pod's container is crashing using the kubectl command-line tool or GCP Console.

kubectl

To see all Pods running in your cluster, run the following command:

kubectl get pods

Look for the Pod with the CrashLoopBackOff error.

To get the Pod's logs, run:

kubectl logs [POD_NAME]

where [POD_NAME] is the name of the problematic Pod.

You can also pass in the -p flag to get the logs for the previous instance of a Pod's container, if it exists.

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.
  4. From the Pod's menu, click the Logs tab.

Check the exit code of the crashed container

You can find the exit code in the output of kubectl describe pod [POD_NAME], in the containers: [CONTAINER_NAME]: last state: exit code field.

  • If the exit code is 1, the container crashed because the application crashed.
  • If the exit code is 0, verify how long your app was running. Containers exit when your application's main process exits. If your app finishes execution very quickly, the container might continue to restart.
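
You can also read the exit code directly with a JSONPath query (a sketch; the container index assumes a single-container Pod):

kubectl get pod [POD_NAME] -o jsonpath="{.status.containerStatuses[0].lastState.terminated.exitCode}"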

Connect to a running container

Open a shell to the Pod:

kubectl exec -it [POD_NAME] -- /bin/bash

If there is more than one container in your Pod, add -c [CONTAINER_NAME].

Now you can run bash commands from inside the container: you can test the network or check whether you have access to files or databases used by your application.

ImagePullBackOff and ErrImagePull

ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.

You can verify this issue using GCP Console or the kubectl command-line tool.

kubectl

To get more information about a Pod's container image, run the following command:

kubectl describe pod [POD_NAME]

Console

Perform the following steps:

  1. Visit the GKE Workloads dashboard in GCP Console.

  2. Select the desired workload. The Overview tab displays the status of the workload.

  3. From the Managed Pods section, click the problematic Pod.
  4. From the Pod's menu, click the Events tab.

If the image is not found

If your image is not found:

  1. Verify that the image's name is correct.
  2. Verify that the image's tag is correct. (Try :latest or no tag to pull the latest image).
  3. If the image has a full registry path, verify that it exists in the Docker registry you are using. If you provide only the image name, check the Docker Hub registry.
  4. Try to pull the docker image manually:

    • SSH into the node:
      For example, to SSH into example-instance in the us-central1-a zone:

      gcloud compute ssh example-instance --zone us-central1-a
      
    • Run docker pull [IMAGE_NAME].
      If this works, you probably need to specify ImagePullSecrets on the Pod (see the sketch at the end of this section). Pods can only reference image pull secrets in their own namespace, so this process needs to be done once per namespace.

If you encounter a "permission denied" or "no pull access" error, verify that you are logged in and/or have access to the image.

If you are using a private registry, it may require keys to read images.
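
A minimal sketch of wiring up an image pull secret for a private registry (the secret name, registry, and credentials are placeholders; the secret must exist in the same namespace as the Pod):

kubectl create secret docker-registry [SECRET_NAME] \
  --docker-server=[REGISTRY_SERVER] \
  --docker-username=[USERNAME] \
  --docker-password=[PASSWORD] \
  --docker-email=[EMAIL]

Then reference the secret in the Pod specification under spec: imagePullSecrets:

spec:
  imagePullSecrets:
  - name: [SECRET_NAME]
  containers:
  - name: app
    image: [REGISTRY_SERVER]/[IMAGE_NAME]:[TAG]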

PodUnschedulable

PodUnschedulable indicates that your Pod cannot be scheduled because of insufficient resources or some configuration error.

Insufficient resources

You might encounter an error indicating a lack of CPU, memory, or another resource. For example: "No nodes are available that match all of the predicates: Insufficient cpu (2)", which indicates that on two nodes there isn't enough CPU available to fulfill a Pod's requests.

The default CPU request is 100m, or 10% of one CPU core. If you want to request more or fewer resources, specify the value in the Pod specification under spec: containers: resources: requests.
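
For example, a sketch of explicit requests in a Pod specification (the container name and values are illustrative):

spec:
  containers:
  - name: app
    image: [IMAGE_NAME]
    resources:
      requests:
        cpu: 250m
        memory: 64Mi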

MatchNodeSelector

MatchNodeSelector indicates that there are no nodes that match the Pod's label selector.

To verify this, check the labels specified in the Pod specification's nodeSelector field, under spec: nodeSelector.

To see how nodes in your cluster are labelled, run the following command:

kubectl get nodes --show-labels

To attach a label to a node, run:

kubectl label nodes [NODE_NAME] [LABEL_KEY]=[LABEL_VALUE]
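
For example, after labelling a node with disktype=ssd, a Pod with the following nodeSelector is scheduled onto it (a sketch; the label key and value are illustrative):

spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: [IMAGE_NAME]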

For more information, refer to Assigning Pods to Nodes.

PodToleratesNodeTaints

PodToleratesNodeTaints indicates that the Pod can't be scheduled to any node because the Pod does not currently tolerate any node's taint.

To verify that this is the case, run the following command:

kubectl describe nodes [NODE_NAME]

In the output, check the Taints field, which lists key-value pairs and scheduling effects.

If the effect listed is NoSchedule, then no Pod can be scheduled on that node unless it has a matching toleration.

One way to resolve this issue is to remove the taint. For example, to remove a NoSchedule taint:

kubectl taint nodes [NODE_NAME] key:NoSchedule-
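
Alternatively, add a matching toleration to the Pod specification (a sketch; the key, value, and effect are illustrative and must match the taint reported on the node):

spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
  - name: app
    image: [IMAGE_NAME]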

PodFitsHostPorts

PodFitsHostPorts indicates that a hostPort the Pod is attempting to use is already in use on the node.

To resolve this issue, check the Pod specification's hostPort value under spec: containers: ports: hostPort. You might need to change this value to another port.
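
For reference, the field lives here in the Pod specification (a sketch; the port numbers are illustrative):

spec:
  containers:
  - name: app
    image: [IMAGE_NAME]
    ports:
    - containerPort: 8080
      hostPort: 8080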

Does not have minimum availability

If your Nodes have enough resources but you still see the Does not have minimum availability message, check whether the Nodes have a SchedulingDisabled or Cordoned status: in that case, they don't accept new Pods.

kubectl

To get statuses of your Nodes, run the following command:

kubectl get nodes
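
Illustrative output for a cordoned Node (the name, age, and version are placeholders):

NAME          STATUS                     ROLES    AGE   VERSION
[NODE_NAME]   Ready,SchedulingDisabled   <none>   8d    [VERSION]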

To enable scheduling on the Node, run:

kubectl uncordon [NODE_NAME]

Console

Perform the following steps:

  1. Visit the GKE Clusters dashboard in GCP Console.

  2. Select the desired cluster. The Nodes tab displays the Nodes and their status.

To enable scheduling on the Node, perform the following steps:

  1. From the list, click the desired Node.

  2. From the Node Details page, click the Uncordon button.
