This page shows you how to resolve storage-related issues on your Google Kubernetes Engine (GKE) clusters.
If you need additional assistance, reach out to Cloud Customer Care.
Error 400: Cannot attach RePD to an optimized VM
Regional persistent disks are restricted from being used with memory-optimized machines or compute-optimized machines.
Consider using a non-regional persistent disk storage class if using a regional persistent disk is not a hard requirement. If a regional persistent disk is a hard requirement, consider scheduling strategies such as taints and tolerations to ensure that the Pods that need regional persistent disks are scheduled on a node pool that doesn't use optimized machines, as in the sketch that follows.
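For example, a minimal sketch of this approach, assuming a hypothetical cluster `my-cluster`, a node pool `repd-pool` of non-optimized machines, and a taint key `repd-only` (all illustrative names, not values from this page):

```
# Create a node pool of non-optimized machines and taint it so that only
# Pods that tolerate the taint are scheduled there. All names are
# illustrative.
gcloud container node-pools create repd-pool \
    --cluster=my-cluster \
    --machine-type=n2-standard-4 \
    --node-taints=repd-only=true:NoSchedule

# Give the Pods that need regional persistent disks a matching toleration.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: repd-workload
spec:
  tolerations:
  - key: repd-only
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: nginx
EOF
```

Note that a toleration only allows scheduling onto the tainted pool; to force these Pods onto it, also add a node selector for the pool's node labels.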
Troubleshooting issues with disk performance
The performance of the boot disk is important because the boot disk for GKE nodes is not only used for the operating system but also for the following:
- Docker images.
- The container filesystem for everything that is not mounted as a volume (that is, the overlay filesystem); this often includes directories like `/tmp`.
- Disk-backed `emptyDir` volumes, unless the node uses local SSD.
Disk performance is shared for all disks of the same disk type on a node. For example, if you have a 100 GB `pd-standard` boot disk and a 100 GB `pd-standard` PersistentVolume with lots of activity, the performance of the boot disk is that of a 200 GB disk. Also, if there is a lot of activity on the PersistentVolume, this impacts the performance of the boot disk as well.
If you encounter messages similar to the following on your nodes, these could be symptoms of low disk performance:
INFO: task dockerd:2314 blocked for more than 300 seconds.
fs: disk usage and inodes count on following dirs took 13.572074343s
PLEG is not healthy: pleg was last seen active 6m46.842473987s ago; threshold is 3m0s
To help resolve such issues, review the following:
- Ensure you have consulted the Storage disk type comparisons and chosen a persistent disk type to suit your needs.
- This issue often occurs for nodes that use standard persistent disks with a size of less than 200 GB. Consider increasing the size of your disks or switching to SSDs, especially for clusters used in production.
- Consider enabling local SSD for ephemeral storage on your node pools. This is particularly effective if you have containers that frequently use `emptyDir` volumes. See the example after this list.
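For example, a sketch of a node pool that uses a 200 GB SSD boot disk and local SSD for ephemeral storage; the cluster name, pool name, and machine type are illustrative, and the machine type must support local SSD:

```
# Node pool with an SSD boot disk and local SSD backing ephemeral
# storage (including emptyDir volumes). All names are illustrative.
gcloud container node-pools create fast-storage-pool \
    --cluster=my-cluster \
    --machine-type=n2-standard-4 \
    --disk-type=pd-ssd \
    --disk-size=200 \
    --ephemeral-storage-local-ssd count=1
```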
Mounting a volume stops responding due to the `fsGroup` setting
One issue that can cause PersistentVolume mounting to fail is a Pod that is configured with the `fsGroup` setting. Normally, mounts automatically retry and the mount failure resolves itself. However, if the PersistentVolume has a large number of files, the kubelet attempts to change ownership of each file on the filesystem, which can increase volume mount latency. When this happens, you might see an error similar to the following:
Unable to attach or mount volumes for pod; skipping pod ... timed out waiting for the condition
To confirm whether a failed mount error is due to the `fsGroup` setting, check the logs for the Pod. If the issue is related to the `fsGroup` setting, you see the following log entry:
Setting volume ownership for /var/lib/kubelet/pods/POD_UUID and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
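On GKE, one way to look for this entry is in Cloud Logging, where node logs are recorded under the `k8s_node` resource type. The following filter is a sketch and might need adjusting for your cluster and time window:

```
# Search kubelet logs for the fsGroup ownership message.
gcloud logging read '
  resource.type="k8s_node"
  AND log_id("kubelet")
  AND textPayload:"Setting volume ownership"
' --limit=10
```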
If the PersistentVolume does not mount within a few minutes, try the following steps to resolve this issue:
- Reduce the number of files in the volume.
- Stop using the `fsGroup` setting.
- Change the application's `fsGroupChangePolicy` to `OnRootMismatch` (see the sketch after this list).
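For example, a minimal Pod sketch that sets `fsGroupChangePolicy: OnRootMismatch`; the Pod name, image, and claim name are illustrative:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo            # illustrative name
spec:
  securityContext:
    fsGroup: 2000
    # Skip the recursive ownership change when the volume root already
    # matches fsGroup, instead of walking every file on each mount.
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: app
    image: nginx                # illustrative image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-pvc         # illustrative claim name
EOF
```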
Slow disk operations cause Pod creation failures
For more information, refer to containerd issue #4604.
Affected GKE node versions: 1.18, 1.19, 1.20.0 to 1.20.15-gke.2100, 1.21.0 to 1.21.9-gke.2000, 1.21.10 to 1.21.10-gke.100, 1.22.0 to 1.22.6-gke.2000, 1.22.7 to 1.22.7-gke.100, 1.23.0 to 1.23.3-gke.700, 1.23.4 to 1.23.4-gke.100
The following example errors might be displayed in the `k8s_node container-runtime` logs:
Error: failed to reserve container name "container-name-abcd-ef12345678-91011_default_12131415-1234-5678-1234-12345789012_0": name "container-name-abcd-ef12345678-91011_default_12131415-1234-5678-1234-12345789012_0" is reserved for "1234567812345678123456781234567812345678123456781234567812345678"
Mitigation
- If Pods are failing, consider using `restartPolicy: Always` or `restartPolicy: OnFailure` in your PodSpec (see the sketch after this list).
- Increase the boot disk IOPS (for example, upgrade the disk type or increase the disk size).
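For example, a sketch of a PodSpec with an explicit restart policy, so containers that fail during slow disk operations are retried. The name and image are illustrative, and `Always` is already the default for bare Pods:

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: retry-demo              # illustrative name
spec:
  # Retry containers that fail, for example during slow disk operations.
  restartPolicy: Always
  containers:
  - name: app
    image: nginx                # illustrative image
EOF
```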
Fix
This issue is fixed in containerd 1.6.0+. GKE versions with this fix are 1.20.15-gke.2100+, 1.21.9-gke.2000+, 1.21.10-gke.100+, 1.22.6-gke.2000+, 1.22.7-gke.100+, 1.23.3-gke.1700+, and 1.23.4-gke.100+.
Volume expansion changes not reflecting in the container file system
When performing volume expansion, always make sure to update the PersistentVolumeClaim. Changing a PersistentVolume directly can result in volume expansion not happening. This could lead to one of the following scenarios:
- If a PersistentVolume object is modified directly, both the PersistentVolume and PersistentVolumeClaim values are updated to a new value, but the file system size is not reflected in the container and is still using the old volume size.
- If a PersistentVolume object is modified directly, followed by updates to the PersistentVolumeClaim where the `status.capacity` field is updated to a new size, this can result in changes to the PersistentVolume but not the PersistentVolumeClaim or the container file system.
To resolve this issue, complete the following steps:
- Keep the modified PersistentVolume object as it was.
- Edit the PersistentVolumeClaim object and set `spec.resources.requests.storage` to a value that is higher than the one used in the PersistentVolume (see the sketch after this list).
- Verify if the PersistentVolume is resized to the new value.
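For example, the second step can be done with `kubectl patch`; the claim name and the target size of 200Gi are illustrative:

```
# Raise the requested size on the PersistentVolumeClaim, not on the
# PersistentVolume. Claim name and size are illustrative.
kubectl patch pvc my-pvc \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
```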
After these changes, the PersistentVolume, PersistentVolumeClaim, and container file system should be automatically resized by the kubelet.
Verify whether the changes are reflected in the Pod:
kubectl exec POD_NAME -- /bin/bash -c "df -h"
Replace `POD_NAME` with the name of the Pod attached to the PersistentVolumeClaim.
The selected machine type should have local SSD(s)
You might encounter the following error when creating a cluster or a node pool that uses Local SSD:
The selected machine type (c3-standard-22-lssd) has a fixed number of local SSD(s): 4. The EphemeralStorageLocalSsdConfig's count field should be left unset or set to 4, but was set to 1.
In the error message, you might see `LocalNvmeSsdBlockConfig` instead of `EphemeralStorageLocalSsdConfig`, depending on which you specified.
This error occurs when the number of Local SSD disks specified does not match the number of Local SSD disks included with the machine type.
To resolve this issue, specify a number of Local SSD disks that matches the machine type that you want.
For third generation machine series, you must omit the Local SSD count flag, and the correct value is configured automatically, as in the following sketch.
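For example, for the `c3-standard-22-lssd` machine type from the error message, a sketch that omits the count and lets GKE configure it; the cluster and pool names are illustrative:

```
# For third generation machine series, omit count= and let GKE set the
# fixed number of local SSDs for the machine type (4 for
# c3-standard-22-lssd).
gcloud container node-pools create lssd-pool \
    --cluster=my-cluster \
    --machine-type=c3-standard-22-lssd \
    --ephemeral-storage-local-ssd
```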
Hyperdisk Storage Pools: Cluster or node pool creation fails
You might encounter the `ZONE_RESOURCE_POOL_EXHAUSTED` error or similar Compute Engine resource errors when trying to provision Hyperdisk Balanced disks as your node's boot or attached disks in a Hyperdisk Storage Pool.
This happens when you're trying to create a GKE cluster or node pool in a zone that's running low on resources, for example:
- The zone might not have enough Hyperdisk Balanced disks available.
- The zone might not have enough capacity to create the nodes of the machine type you specified, like `c3-standard-4`.
To resolve this issue:
- Select a new zone within the same region with enough capacity for your chosen machine type and where Hyperdisk Balanced Storage Pools are available.
- Delete the existing storage pool and recreate it in the new zone. This is because storage pools are zonal resources.
- Create your cluster or node pool in the new zone, as in the sketch after this list.
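For example, a sketch of recreating the storage pool and node pool in a new zone. The zones, names, capacity, and performance values are all illustrative assumptions; check the `gcloud compute storage-pools` and `gcloud container node-pools` references for the exact flags available in your environment:

```
# Storage pools are zonal: delete the old pool and recreate it in the
# new zone. All names and values are illustrative.
gcloud compute storage-pools delete my-storage-pool --zone=us-central1-a

gcloud compute storage-pools create my-storage-pool \
    --zone=us-central1-b \
    --storage-pool-type=hyperdisk-balanced \
    --provisioned-capacity=10TB \
    --provisioned-iops=10000 \
    --provisioned-throughput=1024

# Create the node pool in the new zone with boot disks in the pool.
gcloud container node-pools create hyperdisk-pool \
    --cluster=my-cluster \
    --node-locations=us-central1-b \
    --machine-type=c3-standard-4 \
    --disk-type=hyperdisk-balanced \
    --storage-pools=projects/PROJECT_ID/zones/us-central1-b/storagePools/my-storage-pool
```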