This document gives troubleshooting guidance for storage issues.
Volume fails to attach
This issue can occur if a virtual disk is attached to the wrong virtual machine. This might be due to Issue #32727 in Kubernetes 1.12.
The output of gkectl diagnose cluster looks like this:
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-776459c3-d350-11e9-9db8-e297f465bc84:
    virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-776459c3-d350-11e9-9db8-e297f465bc84.vmdk"
    IS attached to machine "gsl-test-user-9b46dbf9b-9wdj7" but IS NOT listed in the Node.Status
1 storage errors
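To confirm the mismatch, you can compare this output against the volumes the node itself reports as attached. This is a sketch; NODE_NAME is a placeholder for the node that backs the machine named in the error:

kubectl get node NODE_NAME -o jsonpath='{.status.volumesAttached}'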
One or more Pods are stuck in the ContainerCreating state with warnings like this:
Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Warning  FailedAttachVolume  6s (x6 over 31s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-776459c3-d350-11e9-9db8-e297f465bc84" : Failed to add disk 'scsi0:6'.
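To list the affected Pods across all namespaces, one option is:

kubectl get pods --all-namespaces | grep ContainerCreating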
To resolve this issue:
If a virtual disk is attached to the wrong virtual machine, you might need to manually detach it:
1. Drain the node. See Safely draining a node. You might want to include the --ignore-daemonsets and --delete-local-data flags in your kubectl drain command (see the example after these steps).
2. Edit the VM's hardware configuration in vCenter to remove the volume.
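For example, assuming the affected node is named NODE_NAME (a placeholder), the drain command might look like this:

kubectl drain NODE_NAME --ignore-daemonsets --delete-local-data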
Volume is lost
This issue can occur if a virtual disk was permanently deleted. This can happen if an operator manually deletes a virtual disk or the virtual machine it is attached to. If you see a "not found" error related to your VMDK file, it is likely that the virtual disk was permanently deleted.
The output of gkectl diagnose cluster looks like this:
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-52161704-d350-11e9-9db8-e297f465bc84:
    virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk"
    IS NOT found
1 storage errors
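To verify whether the VMDK still exists, you can list it in the datastore with the govc CLI. This is a sketch, assuming govc is configured for your vCenter; the datastore name and path are taken from the error message above:

# Returns an error if the virtual disk file no longer exists.
govc datastore.ls -ds datastore_nfs kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk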
One or more Pods are stuck in the ContainerCreating state:
Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Warning  FailedAttachVolume  71s (x28 over 42m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-52161704-d350-11e9-9db8-e297f465bc84" : File []/vmfs/volumes/43416d29-03095e58/kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk was not found
To prevent this issue from occurring, manage your virtual machines as described in Resizing a user cluster and Upgrading clusters.
To resolve this issue, you might need to manually clean up related Kubernetes resources:
1. Delete the PVC that referenced the PV by running kubectl delete pvc [PVC_NAME].
2. Delete the Pod that referenced the PVC by running kubectl delete pod [POD_NAME].
3. Repeat step 2. Yes, really. See Kubernetes issue 74374. An example of the full sequence follows these steps.
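As a sketch, assuming the PVC is named my-pvc and the Pod that mounted it is named my-pod (both hypothetical names):

kubectl delete pvc my-pvc
kubectl delete pod my-pod
# Delete the Pod a second time to work around Kubernetes issue 74374.
kubectl delete pod my-pod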
vSphere CSI volume fails to detach
This issue occurs if the CNS > Searchable privilege has not been granted to the vSphere user.
If you find Pods stuck in the ContainerCreating phase with FailedAttachVolume warnings, it could be due to a failed detach on a different node.
To check for CSI detach errors:
kubectl get volumeattachments -o=custom-columns=NAME:metadata.name,DETACH_ERROR:status.detachError.message
The output is similar to the following:
NAME                                                                   DETACH_ERROR
csi-0e80d9be14dc09a49e1997cc17fc69dd8ce58254bd48d0d8e26a554d930a91e5   rpc error: code = Internal desc = QueryVolume failed for volumeID: "57549b5d-0ad3-48a9-aeca-42e64a773469". ServerFaultCode: NoPermission
csi-164d56e3286e954befdf0f5a82d59031dbfd50709c927a0e6ccf21d1fa60192d
csi-8d9c3d0439f413fa9e176c63f5cc92bd67a33a1b76919d42c20347d52c57435c
csi-e40d65005bc64c45735e91d7f7e54b2481a2bd41f5df7cc219a2c03608e8e7a8
To resolve this issue, add the CNS > Searchable privilege to your vCenter user account. The detach operation automatically retries until it succeeds.
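If you prefer the command line over the vSphere client, one way to grant the privilege is with the govc CLI. This is a sketch, not the documented procedure: it assumes govc is configured for your vCenter, and the role name CnsSearchable, the privilege ID Cns.Searchable, and the principal are assumptions to adapt to your environment:

# Create a role that carries the CNS Searchable privilege (role name is an example).
govc role.create CnsSearchable Cns.Searchable
# Grant that role to the vSphere user at the vCenter root (principal is a placeholder).
govc permissions.set -principal 'K8S-USER@VSPHERE.LOCAL' -role CnsSearchable /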
CSI volume creation fails with NotSupported error
This issue occurs when an ESXi host in the vSphere cluster is running a version lower than ESXi 6.7U3.
The output of kubectl describe pvc includes this error:
Failed to provision volume with StorageClass: rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to create disk.:Fault cause: vmodl.fault.NotSupported
To resolve this issue, upgrade your ESXi hosts to version 6.7U3 or later.
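To check which version each ESXi host is currently running, one option is the govc CLI. This is a sketch that assumes govc is configured for your vCenter:

# List every ESXi host and print its product version.
govc find / -type h | while read -r host; do
  echo "$host: $(govc object.collect -s "$host" summary.config.product.version)"
done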
vSphere CSI volume fails to attach
This known issue in the open-source vSphere CSI driver occurs when a node is shut down, deleted, or fails.
The output of kubectl describe pod looks like this:
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Warning  FailedAttachVolume  2m30s  attachdetach-controller  Multi-Attach error for volume "pvc-xxxxx" Volume is already exclusively attached to one node and can't be attached to another
To resolve this issue:
Note the name of the PersistentVolumeClaim (PVC) in the preceding output.
Find the VolumeAttachments that are associated with that PVC. For example:
kubectl get volumeattachments | grep pvc-xxxxx
The output shows the names of the VolumeAttachments. For example:
csi-yyyyy csi.vsphere.vmware.com pvc-xxxxx node-zzzzz ...
Describe the VolumeAttachments. For example:
kubectl describe volumeattachments csi-yyyyy | grep "Deletion Timestamp"
Make a note of the deletion timestamp in the output. For example:
Deletion Timestamp: 2021-03-10T22:14:58Z
Wait until the time specified by the deletion timestamp, and then force delete the VolumeAttachment. To do this, edit the VolumeAttachment object and delete the finalizer. For example:
kubectl edit volumeattachment csi-yyyyy

In the editor, remove the finalizer entry:

Finalizers:
  external-attacher/csi-vsphere-vmware-com
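If you prefer a non-interactive alternative to kubectl edit, removing the finalizer with kubectl patch also works; csi-yyyyy is the placeholder VolumeAttachment name from the previous steps:

# Clear the finalizers so the VolumeAttachment can be garbage collected.
kubectl patch volumeattachment csi-yyyyy --type=merge -p '{"metadata":{"finalizers":null}}'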