Troubleshooting

This page includes troubleshooting steps for some common issues and errors.

FAILED instance

The FAILED status means that the instance data has been lost and the instance must be deleted.

Parallelstore instances in a FAILED state continue to be billed until they're deleted.

To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance.

To delete an instance, read Manage instances: Delete an instance.

Timeouts during dfuse mount or network tests

If, when mounting your Parallelstore instance, the dfuse -m command times out; or if network test commands such as self_test or daos health net-test time out, this may be due to a network connectivity issue.

To verify connectivity to the Parallelstore servers, run

self_test --use-daos-agent-env -r 1

If the test reports a connection issue, two possible reasons are:

The DAOS agent may have selected the wrong network interface during setup

You may need to exclude network interfaces that are not able to reach the IPs in the access_points list.

Run ifconfig to list the available network interfaces. An example output may show several network interfaces such as eth0, docker0, ens8, lo, etc.
Stop the daos_agent.
Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces. Uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example:
```
exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
```
Restart the daos_agent.

The instance or client IP address conflicts with internal IP addresses

Parallelstore instances and clients cannot use an IP address from the 172.17.0.0/16 subnet range. See Known issues for more information.

`ENOSPC` when there is unused capacity in the instance

If your instance uses minimum or (the default of) balanced striping, you might run into ENOSPC errors even if the existing files are not using all of the capacity of the instance. This is likely to happen when writing large files that are generally greater than 8 GiB, or when importing such files from Cloud Storage.

Use maximum file striping to reduce the likelihood of these errors.

Google Kubernetes Engine troubleshooting

The following section lists some common issues and steps to resolve them.

`Transport endpoint is not connected` in workload Pods

This error is due to dfuse termination. In most cases, dfuse was terminated because of out-of-memory. Use the Pod annotations gke-parallelstore/[cpu-limit|memory-limit] to allocate more resources to the Parallelstore sidecar container. You can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limitation if you don't know how much memory you want to allocate to it. Note that this only works with Standard clusters; with Autopilot clusters, you cannot use value 0 to unset the sidecar container resource limits and requests. You have to explicitly set a larger resource limit for the sidecar container.

Once you've modified the annotations, you must restart your workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.

Pod event warnings

If your workload Pods cannot start up, check the Pod events:

kubectl describe pod POD_NAME -n NAMESPACE

The following solutions are for common errors.

CSI driver enablement issues

Common CSI driver enablement errors are as follows:

MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi:
attacher.MountDevice failed to create newCsiDriverClient:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi:
mounter.SetUpAt failed to get CSI client:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

These warnings indicate that the CSI driver is not enabled, or not running.

If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It takes a few minutes for the CSI driver Pods to be functional after cluster operations.

Otherwise, confirm that the CSI driver is enabled on your cluster. See Enable the CSI driver for details. If the CSI is enabled, each node shows a Pod named parallelstore-csi-node-id up and running.

AttachVolume.Attach failures

After the Pod is scheduled to a node, the volume will be attached to the node and the mounter Pod will be created if using node mount.

This happens on the controller and involves the AttachVolume step in from attachdetach-controller.

Error code	Pod event warning	Solution
InvalidArgument	`AttachVolume.Attach failed for volume "volume" : rpc error: code = InvalidArgument desc = an error occurred while preparing mount options: invalid mount options`	Invalid mount flags are passed to PersistentVolume or StorageClass. Check supported dfuse mount options for more details.
NotFound	`AttachVolume.Attach failed for volume "volume" : rpc error: code = NotFound desc = failed to get instance "instance"`	The Parallelstore instance does not exist. Verify that the PersistentVolume's volumeHandle has the correct format.

MountVolume.MountDevice failures

After the volume is attached to a node, the volume will be staged to the node.

This happens on the node and involves the MountVolume.MountDevice step in from kubelet.

Error code	Pod event warning	Solution
FailedPrecondition	`MountVolume.MountDevice failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod "pod" expected to exist but was not found`	This error is usually caused by the mounter pod being manually deleted. Delete all of the workloads consuming the PVC and re-deploy them. This will create a new mounter Pod.
DeadlineExceeded	`MountVolume.MountDevice failed for volume "volume": rpc error: code = DeadlineExceeded desc = context deadline exceeded`	There's trouble connecting to the Parallelstore instance. Verify that your VPC network and your access points are configured correctly.

MountVolume.SetUp failures

After the volume is staged to the node, the volume will be mounted and provided to the container on the Pod. This happens on the node and involves the MountVolume.SetUp step in kubelet.

Pod mount

Error code	Pod event warning	Solution
ResourceExhausted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed` `MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container terminated due to OOMKilled, exit code: 137`	The dfuse process ended, which is usually caused by an out-of-memory (OOM) condition. Consider increasing the sidecar container memory limit by using the `gke-parallelstore/memory-limit` annotation. If you're unsure about the amount of memory you want to allocate to the parallelstore-sidecar, we recommend setting `gke-parallelstore/memory-limit: "0"` to eliminate the memory restriction imposed by Parallelstore.
Aborted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit` `MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists`	The volume mount operation was aborted due to rate limiting or existing operations. This warning is normal and should be transient.
InvalidArgument	`MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =`	If you supplied invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields with the invalid arguments. For dynamic provisioning, check the Storage Class. For static provisioning, check the Persistent Volume.
FailedPrecondition	`MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec`	The Parallelstore sidecar container was not injected. Check that the `gke-parallelstore/volumes: "true"` Pod annotation is set correctly.

Node mount

Error code	Pod event warning	Solution
Aborted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit` `MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists`	The volume mount operation was aborted due to rate limit or existing operations. This warning is normal and should be transient.
InvalidArgument	`MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =`	If you have supplied invalid arguments in the StorageClass or Persistent Volume, the error log will indicate the fields with the invalid arguments. For dynamic provisioning, check the Storage Class. For static provisioning, check the Persistent Volume.
FailedPrecondition	`MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod expected to exist but was not found`	The Parallelstore mounter pod does not exist. If the mounter Pod was accidentally deleted, re-create all workloads to prompt re-creation.
DeadlineExceeded	`MountVolume.SetUp failed for volume "volume" : rpc error: code = DeadlineExceeded desc = timeout waiting for mounter pod gRPC server to become available`	The mounter Pod's gRPC server did not start. Check the mounter pod's logs for any errors.

Troubleshooting VPC networks

Permission denied to add peering for service `servicenetworking.googleapis.com`

ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have 
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.

This error means that you don't have servicenetworking.services.addPeering IAM permission on your user account.

See Access control with IAM for instructions on adding one of the following roles to your account:

roles/compute.networkAdmin or
roles/servicenetworking.networksAdmin

Cannot modify allocated ranges in CreateConnection

ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection.

This error is returned when you have already created a vpc-peering on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges:

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com \
  --force

Or, add the new IP range to the existing connection:

Retrieve the list of existing IP ranges for the peering:

EXISTING_RANGES=$(
  gcloud services vpc-peerings list \
    --network=NETWORK_NAME \
    --service=servicenetworking.googleapis.com \
    --format="value(reservedPeeringRanges.list())"
)

Then, add the new range to the peering:

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=$EXISTING_RANGES,IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com

IP address range exhausted

Instance creation might fail with the following range exhausted error:

ERROR: (gcloud.alpha.Parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted

If you see this error message, follow the VPC guide to either recreate the IP range or extend the existing IP range.

If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.

Maintenance blocked due to restrictive Pod Disruption Budget

The Google Cloud console might display the following error message indicating that maintenance can't proceed because a Pod Disruption Budget (PDB) is configured to allow zero Pod evictions:

GKE can't perform maintenance because the Pod Disruption Budget allows for 0 Pods evictions.

If you see this error message, identify the problematic Pod by completing the following steps:

Click the error message to open the error insight panel.
Check the Unpermissive Pod Disruption Budgets section for the Pod's name.
If the Pod is parallelstorecsi-mount, you can disregard this error as it won't prevent maintenance. For any other Pod, examine your PDB.

Troubleshooting

FAILED instance

Timeouts during dfuse mount or network tests

The DAOS agent may have selected the wrong network interface during setup

The instance or client IP address conflicts with internal IP addresses

ENOSPC when there is unused capacity in the instance

Google Kubernetes Engine troubleshooting

Transport endpoint is not connected in workload Pods

Pod event warnings

CSI driver enablement issues

AttachVolume.Attach failures

MountVolume.MountDevice failures

MountVolume.SetUp failures

Pod mount

Node mount

Troubleshooting VPC networks

Permission denied to add peering for service servicenetworking.googleapis.com

Cannot modify allocated ranges in CreateConnection

IP address range exhausted

Maintenance blocked due to restrictive Pod Disruption Budget

`ENOSPC` when there is unused capacity in the instance

`Transport endpoint is not connected` in workload Pods

Permission denied to add peering for service `servicenetworking.googleapis.com`