GKE known issues

Autopilot Standard

This page lists known issues for GKE. It is intended for admins and architects who manage the lifecycle of the underlying technology infrastructure and who respond to alerts and pages when service level objectives (SLOs) aren't met or applications fail.

This page lists known issues for all supported versions and for the two minor versions that immediately precede the earliest version in extended support. After a version reaches the end of extended support, all N-2 issues are removed. For example, when version 1.30 reaches the end of extended support, known issues specific to versions 1.28 and earlier are removed.

If you're part of the Google Developer Program, save this page to receive notifications when a release note related to this page is published. To learn more, see Saved Pages.

Category	Identified version(s)	Fixed version(s)	Issue and workaround
Operation	1.33 versions before 1.33.4-gke.1036000	1.33.4-gke.1036000 and later	Incorrect performance tier for dynamically provisioned Lustre instances When dynamically provisioning a Lustre instance, the instance creation fails with an `InvalidArgument` error for PerUnitStorageThroughput, regardless of the perUnitStorageThroughput value specified in the API request. Workaround: Upgrade the GKE cluster to version 1.33.4-gke.1036000 or later. If using the Stable channel, a newer version might not be available yet. In this case, you can manually select a version from the Regular or Rapid channels that includes the fix.
Operation	1.32.3-gke.1099000 and later 1.33	1.33.3-gke.1266000 and later	Input/output error when renaming or moving files using Cloud Storage FUSE CSI driver When using an affected version of the Cloud Storage FUSE CSI driver, renaming or moving files in Cloud Storage buckets might fail with an I/O error. Workaround: Temporarily add a specific sidecar image definition to your Pod manifest. In the `spec.containers` section of your Pod manifest, add a container definition with `name: gke-gcsfuse-sidecar` and `image: gcr.io/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.9-gke.2`. For more information, see the Pod manifest in Configure a private image for the sidecar container. After you upgrade your cluster to a fixed GKE version or later, you must remove the entire `gke-gcsfuse-sidecar` block from your manifest. After you remove this block, GKE resumes automatically injecting and managing the correct sidecar image for your upgraded cluster version.
Logging and monitoring	All versions	TBD	Race condition in `gke-metrics-agent` DaemonSets When you add the `cloud.google.com/gke-metrics-agent-scaling-level` label to a node pool to manually increase the memory allocation of the `gke-metrics-agent` DaemonSet, a race condition occurs in the DaemonSet during new node creation. This race condition results in intermittent restarts and might result in lost metrics. This issue occurs because there's a delay before the label is added to a new node in the node pool. During this delay, the DaemonSet creates a Pod with the original memory allocation on that node. After the label is applied, the DaemonSet creates a Pod that has the new memory allocation. The existing Pod isn't completely deleted when the updated Pod starts. Both of these Pods try to bind to the same port number. Workaround: After you add the `cloud.google.com/gke-metrics-agent-scaling-level` label to a node pool, restart the `gke-metrics-agent` DaemonSet: kubectl -n kube-system rollout restart daemonset gke-metrics-agent
Upgrades	1.33	1.33.2-gke.1043000	Lowered open files limit with containerd 2.0 For node pools running GKE version 1.33, which uses containerd 2.0, the default soft limit for open files (`ulimit -n`) for containers is lowered to 1024. This is a change in containerd itself (see containerd PR #8924) where `LimitNOFILE` was removed from `containerd.service` as a best practice, making the default system soft limit apply. Workloads expecting a higher default soft limit (for example, those implicitly relying on the previous higher default) might experience failures, such as `Too many open files` errors. Solution: Upgrade from an earlier 1.33 patch version to 1.33.2-gke.1043000 or later. Workaround: Increase the open files limit for your workloads using one of the following methods: Application-level adjustment: Modify the application code or configuration to explicitly raise the limit, for example with `ulimit -n` or `prlimit --nofile=524288`. Containerd NRI ulimit adjuster plugin: Use the `containerd/nri ulimit-adjuster` plugin to adjust the ulimit. Downgrade node pool (Standard only): For Standard GKE clusters, you can temporarily downgrade your node pool to version 1.32 to avoid this issue until a permanent resolution is available. For more information about the migration to containerd 2, see Migrate nodes to containerd 2.
Upgrades	1.31.5-gke.1169000, 1.32.1-gke.1376000	1.31.7-gke.1164000, 1.32.3-gke.1512000	Invalid CRD status.storedVersions for managed CRDs Some GKE-managed CRDs might have an invalid `status.storedVersions` field, which creates a risk of breaking access to CRD objects after an upgrade. This issue impacts clusters that meet both of the following conditions: The cluster used an affected GKE version at some point in time. The cluster had unsupported (`served=false`) versions of CRD objects stored in etcd. Workaround: The recommended workaround is to delay cluster upgrades until the issue is resolved. Alternatively, if you know your cluster contains unsupported versions of CRD objects, you can add these versions to the `status.storedVersions` field by using the `kubectl patch` command.
Operation, logging and monitoring	1.32.2-gke.1652000, 1.31.6-gke.1221000, 1.30.10-gke.1227000	1.32.2-gke.1652003 or later 1.31.6-gke.1221001 or later 1.30.10-gke.1227001 or later	Missing metrics or workload autoscaler not scaling You might observe gaps in metrics data on the affected versions after the cluster size increases by more than five nodes. This issue might also impact autoscaling operations. This issue impacts only clusters upgraded to the impacted versions. Newly created clusters should work as expected. Workaround: If you are impacted, you can downgrade one patch version or upgrade to the newer fixed versions.
Operation			Google Cloud Hyperdisk size and attachment limits Typically, a Pod that cannot be successfully scheduled due to a node's volume attachment limits triggers the auto-provisioning of a new node. When workloads that use a Hyperdisk product are scheduled to a node that runs a C3 VM, node auto-provisioning does not occur and the Pod is scheduled onto the already full node. The workload is scheduled onto the node despite the lack of an available disk attachment. The workload also fails to start, due to an error like the following: AttachVolume.Attach failed for volume "[VOLUME NAME]" : rpc error: code = InvalidArgument desc = Failed to Attach: failed when waiting for zonal op: rpc error: code = InvalidArgument desc = operation operation-[OPERATION NUMBERS] failed (UNSUPPORTED_OPERATION): Maximum hyperdisk-balanced disks count should be less than or equal to [16], Requested : [17] The problem is present for all Hyperdisk products on C3 machines. Hyperdisk attachment limits vary by VM vCPU number and Hyperdisk product. For more information, see Hyperdisk performance limits. Workaround: Hyperdisk products trigger auto-provisioning on other VM shapes. We recommend a shape that supports only Hyperdisk.
Operation	1.32.3-gke.1927000, 1.32.3-gke.1785000, 1.32.3-gke.1717000, 1.32.3-gke.1440000, 1.32.3-gke.1170000, 1.32.3-gke.1250000, 1.32.3-gke.1671000, 1.32.3-gke.1596000, 1.32.3-gke.1298000		gke-metadata-server is OOMKilled on TPU/GPU nodes On GKE TPU nodes (for example `ct4p-hightpu-4t`) and GPU nodes (for example `a3-highgpu-8g`), the kernel might terminate the `gke-metadata-server` with an `OOMKilled` when the server's memory usage exceeds 100 MB. Workaround: If you observe `OOMKilled` events for `gke-metadata-server` on TPU or GPU nodes, contact the GKE Identity on-call team (Component ID: 395450) for mitigation options.
Operation	1.32.0-gke.1358000 to 1.32.4-gke.1106006, 1.32.4-gke.1236007, 1.32.4-gke.1353001, 1.32.4-gke.1415001, 1.32.4-gke.1533000 1.33.0-gke.0 to 1.33.0-gke.1552000	1.32.4-gke.1353003 and later 1.33.0-gke.1552000 and later	Volume resizes might be stuck due to dangling NodePendingResize status on PVCs. Clusters on version 1.32 that have nodes on versions 1.31 or earlier will fail to update PersistentVolumeClaim status during resizing. This incorrect status prevents subsequent resize operations from beginning, effectively preventing further resizes. A PVC in this state has a `status.allocatedResourceStatuses` field that contains `NodePendingResize`, or has both a `status.allocatedResources` field and a `status.conditions.type: FileSystemResizePending` condition. If a PVC was created while your cluster was on an affected version, you might see this issue persist after your cluster upgrades to a known fixed version. In this scenario, your PVC needs to be patched to remove the `status.allocatedResources` field with a one-time workaround. Workaround: PVCs stuck due to dangling status can be patched to remove that status. You can use patch commands like the following to remove the dangling status: `kubectl patch pvc $PVC_NAME --subresource='status' --type='merge' -p '{"status":{"allocatedResourceStatuses":null}}'` and `kubectl patch pvc $PVC_NAME --subresource='status' --type='merge' -p '{"status":{"allocatedResources":null}}'`
Operation	1.32.4-gke.1236007, 1.32.4-gke.1353001	1.32.4-gke.1353003 and later	The PDCSI driver might log excessively GKE clusters on specific versions of 1.32 might emit excessive log messages from the PDCSI driver. This excess logging can consume Cloud Logging Write API quota. Workaround: You can reduce this excessive logging by adding an exclusion filter. To exclude the log messages from being ingested into Cloud Logging, use the following query: resource.type="k8s_container" resource.labels.container_name="gce-pd-driver" (sourceLocation.file="cache.go" OR "Cannot process volume group")
Operation	1.27.16-gke.2440000 and later 1.28.15-gke.1673000 and later 1.29.13-gke.1038000 and later 1.30.9 and later 1.31.7 and later 1.32.1-gke.1357001 and later	1.27.16-gke.2691000 and later 1.28.15-gke.2152000 and later 1.29.15-gke.1218000 and later 1.30.11-gke.1190000 and later 1.31.7-gke.1363000 and later 1.32.4-gke.1106000 and later 1.33.0-gke.1552000 and later	Pods that attempt to mount NFS persistent volumes on COS nodes which previously had a Read-Only (RO) mount will only be mounted in RO mode For GKE version 1.27 and later, NFS volumes using the Kubernetes in-tree CSI driver are only able to mount persistent volumes in RO mode after a previous RO mount on the same node. Workaround: Downgrade node pools to a version that's earlier than the impacted versions.
Operation	1.32 versions from 1.32.1-gke.1357001 up to, but not including, 1.32.4-gke.1106000 All 1.33 versions earlier than 1.33.0-gke.1552000	1.32 release: 1.32.4-gke.1106000 and later 1.33 release: 1.33.0-gke.1552000 and later	Pods attempting to mount NFS persistent volumes on Ubuntu nodes will be unable to run. For GKE version 1.32 and later, NFS volumes using the Kubernetes in-tree CSI driver won't be able to mount persistent volumes on Ubuntu nodes. When this happens, you might see the following error messages: "MountVolume.SetUp failed for volume 'nfs-storage' : mount failed: exit status 1" Output: Mount failed: mount failed: exit status 127 "Output: chroot: failed to run command 'mount': No such file or directory failed to run command mount on Ubuntu nodes" In addition to seeing these error messages, Pods using these volumes won't be able to start. Workaround: Downgrade node pools to version 1.31.
Operation	>= 1.28.15-gke.1436000, < 1.28.15-gke.1668000, >= 1.29.12-gke.1040000, < 1.29.13-gke.1028000, >= 1.30.8-gke.1053000, < 1.30.8-gke.1287000, >= 1.31.4-gke.1057000, < 1.31.6-gke.1020000, >= 1.32.0-gke.1281000, < 1.32.1-gke.1302000	1.28.15-gke.1668000 or later 1.29.13-gke.1028000 or later 1.30.8-gke.1287000 or later 1.31.6-gke.1020000 or later 1.32.1-gke.1302000 or later	Pods using io_uring-related syscalls might be stuck in Terminating Pods that use io_uring-related syscalls might enter the D (disk sleep) state, which is also called TASK_UNINTERRUPTIBLE, due to a bug in the Linux kernel. Processes in D state cannot be woken by signals, including `SIGKILL`. When a Pod is affected by this known issue, its containers might fail to terminate normally. In the containerd logs, you might observe repeated messages similar to the following: `Kill container [container-id]` indicating that the system is repeatedly attempting to stop a container. Concurrently, the kubelet logs display error messages, such as the following: `"Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"` or `"Error syncing pod, skipping" err="[failed to \"KillContainer\" for \"container-name\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \\\"container-id\\\" to be killed: wait container \\\"container-id\\\": context deadline exceeded\", failed to \"KillPodSandbox\" for \"pod-uuid\" with KillPodSandboxError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"]" pod="pod-name" podUID="pod-uuid"` These symptoms point to processes within the container that are stuck in an uninterruptible sleep (D-state), which prevents proper termination of the Pod. Workloads that use io_uring directly, or that use io_uring indirectly through a language runtime like NodeJS, might be affected by this known issue. 
Affected workloads have a process whose `State` field in the `/proc/<pid>/status` file shows D (disk sleep), and show the `io_uring` string as part of the contents of `/proc/<pid>/stack`. NodeJS workloads might be able to disable the use of io_uring through `UV_USE_IO_URING=0`. Workaround: Upgrade the cluster nodes to a fixed version or later.
Operation	1.28, 1.29, 1.30, 1.31	1.30.8-gke.1261000 and later 1.31.4-gke.1183000 and later 1.32.0-gke.1448000 and later	Workloads using Image streaming fail with authentication errors A bug in the Image streaming feature might cause workloads to fail when a set of specific conditions is met while the container reads files. Error messages related to authentication failures might be visible in the gcfsd log. To check if you are impacted, search the logs with this search query: `resource.type="k8s_node" resource.labels.project_id="[project_id]" resource.labels.cluster_name="[cluster_name]" logName="projects/[project_id]/logs/gcfsd" "backend.FileContent failed" "Request is missing required authentication credential."` The presence of these errors indicates that the nodes are impacted. If you are impacted by this issue, you can upgrade your node pools to a patched GKE version.
Operation	1.30.0 to 1.30.5-gke.1443001 1.31.0 to 1.31.1-gke.1678000	1.30.5-gke.1628000 and later 1.31.1-gke.1846000 and later	Increased Pod eviction rates on GKE versions 1.30 and 1.31 Some versions of GKE 1.30 and GKE 1.31 that use COS 113 and COS 117, respectively, have kernels that were built with the option `CONFIG_LRU_GEN_ENABLED=y`. This option enables the kernel feature Multi-Gen LRU, which causes the kubelet to miscalculate memory usage and might lead to the kubelet evicting Pods. The config option `CONFIG_LRU_GEN_ENABLED` is disabled in cos-113-18244-151-96 and cos-117-18613-0-76. You might not always see an unusual Pod eviction rate because this issue depends on the workload's memory usage pattern. There is a higher risk of the kubelet evicting Pods for workloads that haven't set a memory limit in the resources field. This is because the workloads might request more memory than what the kubelet reports as available. If you see higher memory usage of an application after upgrading to the mentioned GKE versions without any other changes, then you might be affected by the kernel option. To check if there are unusual Pod eviction rates, analyze the following metrics with Metrics Explorer: `kubernetes.io/container_memory_used_bytes` and `kubernetes.io/container_memory_request_bytes` You can use the following PromQL queries. 
Replace the values for `cluster_name`, `namespace_name`, `metadata_system_top_level_controller_type` and `metadata_system_top_level_controller_name` with the workload name and type that you want to analyze: `max by (pod_name)(max_over_time(kubernetes_io:container_memory_used_bytes{monitored_resource="k8s_container",memory_type="non-evictable",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))` `sum by (pod_name)(avg_over_time(kubernetes_io:container_memory_request_bytes{monitored_resource="k8s_container",cluster_name="REPLACE_cluster_name",namespace_name="REPLACE_namespace",metadata_system_top_level_controller_type="REPLACE_controller_type",metadata_system_top_level_controller_name="REPLACE_controller_name"}[${__interval}]))` If you see unusual spikes in the memory usage that go above the requested memory, the workload might be getting evicted more often. Workaround: If you can't upgrade to the fixed versions and if you're running in a GKE environment where you can deploy privileged Pods, you can disable the Multi-Gen LRU option by using a DaemonSet. First, update the GKE node pools where you want to run the DaemonSet with a label to disable the Multi-Gen LRU option. For example, `disable-mglru: "true"`. Next, update the `nodeSelector` parameter in the DaemonSet manifest with the same label you used in the preceding step. For example, see the `disable-mglru.yaml` file in the GoogleCloudPlatform/k8s-node-tools repository. Finally, deploy the DaemonSet to your cluster. After the DaemonSet is running in all the selected node pools, the change is effective immediately and the kubelet memory usage calculation is back to normal.
Operation	1.28, 1.29, 1.30, 1.31	1.28.14-gke.1175000 and later 1.29.9-gke.1341000 and later 1.30.5-gke.1355000 and later 1.31.1-gke.1621000 and later	Pods stuck in Terminating status A bug in the container runtime (containerd) might cause Pods and containers to be stuck in Terminating status with errors similar to the following: `OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown` If you are impacted by this issue, you can upgrade your nodes to a GKE version with a fixed version of containerd.
Operation	1.28,1.29	1.28.9-gke.1103000 and later 1.29.4-gke.1202000 and later 1.30: All versions	Image streaming fails because of symbolic links A bug in the Image streaming feature might cause containers to fail to start. Containers running on a node with Image streaming enabled on specific GKE versions might fail to be created with the following error: `"CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd container: failed to mount [PATH]: too many levels of symbolic links"` If you are impacted by this issue, you can check for empty layers or duplicate layers. If you can't remove empty layers or duplicate layers, then disable Image streaming.
Operation	1.27,1.28,1.29	1.28.9-gke.1103000 and later 1.29.4-gke.1202000 and later 1.30: All versions	Image streaming fails because of missing files A bug in the Image streaming feature might cause containers to fail because of a missing file or files. Containers running on a node with Image streaming enabled on the following versions might fail to start or run with errors indicating that certain files don't exist. The following are examples of such errors: `No such file or directory` `Executable file not found in $PATH` If you are impacted by this issue, you can disable Image streaming.
Networking,Upgrades and updates	1.28		Gateway TLS configuration error We've identified an issue with configuring TLS for Gateways in clusters running GKE version 1.28.4-gke.1083000. This affects TLS configurations using either an SSLCertificate or a CertificateMap. If you're upgrading a cluster with existing Gateways, updates made to the Gateway will fail. For new Gateways, the load balancers won't be provisioned. This issue will be fixed in an upcoming GKE 1.28 patch version.
Upgrades and updates	1.27	1.27.8 or later	GPU device plugin issue Clusters that are running GPUs and are upgraded from 1.26 to a 1.27 patch version earlier than 1.27.8 might experience issues with their nodes' GPU device plugins (`nvidia-gpu-device-plugin`). Do the following steps depending on the state of your cluster: If your cluster is running version 1.26 and has GPUs, don't manually upgrade your cluster until version 1.27.8 is available in your cluster's release channel. If your cluster is running an earlier 1.27 patch version and the nodes are affected, restart the nodes or manually delete the `nvidia-gpu-device-plugin` Pod on the nodes (the add-on manager will create a new working plugin). If your cluster is using auto-upgrades, this doesn't affect you as automatic upgrades will only move clusters to patch versions with the fix.
Operation	1.27,1.28	1.27.5-gke.1300 and later 1.28.1-gke.1400 and later	Autoscaling for all workloads stops HorizontalPodAutoscaler (HPA) and VerticalPodAutoscaler (VPA) might stop autoscaling all workloads in a cluster if it contains misconfigured `autoscaling/v2` HPA objects. The issue impacts clusters running earlier patch versions of GKE version 1.27 and 1.28 (for example, 1.27.3-gke.100). Workaround: Correct misconfigured `autoscaling/v2` HPA objects by making sure the fields in `spec.metrics.resource.target` match, for example: When `spec.metrics.resource.target.type` is `Utilization` then target should be `averageUtilization` When `spec.metrics.resource.target.type` is `AverageValue` then target should be `averageValue` For more details on how to configure `autoscaling/v2` HPA objects, see the HorizontalPodAutoscaler Kubernetes documentation.
Operation	1.28,1.29	1.28.7-gke.1026000 1.29.2-gke.1060000	Container Threat Detection fails to deploy Container Threat Detection might fail to deploy on Autopilot clusters running the following GKE versions: 1.28.6-gke.1095000 to 1.28.7-gke.1025000 1.29.1-gke.1016000 to 1.29.1-gke.1781000
Networking, Upgrades	1.27, 1.28, 1.29, 1.30	1.30.4-gke.1282000 or later 1.29.8-gke.1157000 or later 1.28.13-gke.1078000 or later 1.27.16-gke.1342000 or later	Connectivity issues for `hostPort` Pods after control plane upgrade Clusters with network policy enabled might experience connectivity issues with `hostPort` Pods. Additionally, newly created Pods might take an additional 30 to 60 seconds to be ready. The issue is triggered when the GKE control plane of a cluster is upgraded to one of the following GKE versions: 1.30 to 1.30.4-gke.1281999 1.29.1-gke.1545000 to 1.29.8-gke.1156999 1.28.7-gke.1042000 to 1.28.13-gke.1077999 1.27.12-gke.1107000 to 1.27.16-gke.1341999 Workaround: Upgrade or recreate nodes immediately after the GKE control plane upgrade.
Networking	1.31, 1.32	1.32.3-gke.1927000 or later 1.31.7-gke.1390000 or later	Broken UDP traffic between Pods that run on the same node Clusters with intra-node visibility enabled might experience broken UDP traffic between Pods that run on the same node. The issue is triggered when the GKE cluster node is upgraded to or created with one of the following GKE versions: 1.32.1-gke.1729000 or later 1.31.6-gke.1020000 or later The impacted path is Pod-to-Pod UDP traffic on the same node through `hostPort` or Service. Resolution: Upgrade the cluster to one of the following fixed versions: 1.32.3-gke.1927000 or later 1.31.7-gke.1390000 or later
Operation	1.29,1.30,1.31	1.29.10-gke.1071000 or later 1.30.5-gke.1723000 or later 1.31.2-gke.1115000 or later	Incompatible Ray Operator and Cloud KMS database encryption Some Ray Operator versions are incompatible with Cloud KMS database encryption. Workaround: Upgrade the cluster control plane to a fixed version or later.
Upgrades and updates	1.30, 1.31	1.30.8-gke.1051000 or later 1.31.1-gke.2008000 and later	GPU Maintenance Handler Pod Stuck in CrashLoopBackOff State With this issue, gpu-maintenance-handler Pods are stuck in a `CrashLoopBackOff` state on their respective nodes. This state prevents the `upcoming maintenance` label from being applied to GKE nodes, which can impact both node-drain and Pod-eviction processes for workloads. The error is similar to the following: `"Node upcoming maintenance label not applied due to error: Node "gke-yyy-yyy" is invalid: metadata.labels: Invalid value: "-62135596800": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"` If you are impacted by this issue, you can resolve it by upgrading your control plane to a GKE version that includes the fix.
Operation	1.33.1-gke.1522000 and later	1.33.4-gke.1142000 and later	Pods fail to start on nodes with Image streaming enabled On nodes with Image streaming enabled, workloads might fail to start with the following error signature: `Failed to create pod sandbox ... context deadline exceeded` An affected node's serial-port logs also contain the following error signature: `task gcfsd ... blocked for more than X seconds` The presence of these two error signatures indicates a deadlock in one of the Image streaming components. This deadlock inhibits Pods from successfully starting. Mitigation: Restart the node for a quick mitigation. Note that the restarted node might still encounter the deadlock again. For a more robust mitigation, disable Image streaming on the node pool by running the following: gcloud container node-pools update NODE_POOL_NAME --cluster CLUSTER_NAME --no-enable-image-streaming Note: disabling Image streaming re-creates all of the nodes within a node pool.
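
The Identified version(s) and Fixed version(s) columns in the preceding table use version strings such as `1.33.4-gke.1036000`. The following Python sketch shows one way to check whether a version falls in an affected range. The function names `parse_gke_version` and `is_affected` are hypothetical, and the sketch assumes that versions order numerically, component by component, which matches how the table writes ranges such as `>= 1.31.4-gke.1057000, < 1.31.6-gke.1020000`.

```python
import re
from typing import Tuple

# Matches full version strings such as "1.31.4-gke.1057000".
_GKE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version: str) -> Tuple[int, ...]:
    """Split a full GKE version string into four integers for comparison."""
    match = _GKE_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"not a full GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(current: str, first_affected: str, first_fixed: str) -> bool:
    """Return True when first_affected <= current < first_fixed.

    Assumption: numeric, component-wise ordering of the version parts.
    """
    low = parse_gke_version(first_affected)
    high = parse_gke_version(first_fixed)
    return low <= parse_gke_version(current) < high


# Range taken from the io_uring row of the table (1.31 line).
print(is_affected("1.31.5-gke.1000000", "1.31.4-gke.1057000", "1.31.6-gke.1020000"))
print(is_affected("1.31.6-gke.1020000", "1.31.4-gke.1057000", "1.31.6-gke.1020000"))
```

Short forms that appear in the table, such as `1.33` or `1.30.9`, are deliberately rejected by this sketch because they don't identify a single patch version.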

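For the "Autoscaling for all workloads stops" issue in the preceding table, the workaround is to make the fields in `spec.metrics.resource.target` match. The following Python sketch illustrates that rule on an `autoscaling/v2` HPA object that is already loaded as a dictionary. The function name `find_mismatched_targets` is hypothetical, and the sketch checks only the two type and field pairs named in the workaround.

```python
from typing import Any, Dict, List

# Pairs taken from the workaround text: target type -> field that must be set.
_EXPECTED_FIELD = {
    "Utilization": "averageUtilization",
    "AverageValue": "averageValue",
}


def find_mismatched_targets(hpa: Dict[str, Any]) -> List[str]:
    """List resource metrics whose target type and value field don't match."""
    problems = []
    metrics = hpa.get("spec", {}).get("metrics", [])
    for index, metric in enumerate(metrics):
        if metric.get("type") != "Resource":
            continue  # The workaround covers only resource metrics.
        target = metric.get("resource", {}).get("target", {})
        target_type = target.get("type")
        expected = _EXPECTED_FIELD.get(target_type)
        if expected is not None and expected not in target:
            problems.append(
                f"spec.metrics[{index}]: target type {target_type} requires {expected}"
            )
    return problems


# Example: type is Utilization, but averageValue is set instead of averageUtilization.
misconfigured = {
    "spec": {
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageValue": "500m"},
                },
            }
        ]
    }
}
print(find_mismatched_targets(misconfigured))
```

An empty list means that no mismatch of the two kinds described in the workaround was found. It does not validate the rest of the HPA object.
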
What's next

If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.
