Troubleshooting

This page includes troubleshooting steps for some common issues and errors.

FAILED instance

The FAILED status means that the instance data has been lost and the instance must be deleted.

Parallelstore instances in a FAILED state continue to be billed until they're deleted.

To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance.

To delete an instance, read Manage instances: Delete an instance.
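As a rough illustration of the state check, the sketch below wraps the describe command in a shell function. The function name is hypothetical, and the `alpha` release track is an assumption based on the error output shown later on this page; adjust it to the track you use.

```shell
# Sketch: print the state of a Parallelstore instance and flag FAILED instances.
# check_instance_state is a hypothetical helper name. The "alpha" track is an
# assumption; INSTANCE_ID and LOCATION are supplied by the caller.
check_instance_state() {
  local instance_id="$1" location="$2" state
  state="$(gcloud alpha parallelstore instances describe "$instance_id" \
    --location="$location" --format="value(state)")" || return 1
  echo "state: ${state}"
  if [ "$state" = "FAILED" ]; then
    # A FAILED instance has lost its data and is billed until it is deleted.
    echo "Instance data is lost; delete the instance to stop billing."
    return 2
  fi
}
```

Calling `check_instance_state INSTANCE_ID LOCATION` returns exit status 2 for a FAILED instance, which makes it usable in scripts that audit several instances.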

Timeouts during dfuse mount or network tests

If the dfuse -m command times out when mounting your Parallelstore instance, or if network test commands such as self_test or daos health net-test time out, the DAOS agent may have selected the wrong network interface during setup.

To verify the DAOS agent, run:

```
self_test --use-daos-agent-env -r 1
```

If the test reports a connection issue, you may need to exclude network interfaces that are not able to reach the IPs in the access_points list.

1. Run ifconfig to list the available network interfaces. The output may show several interfaces, such as eth0, docker0, ens8, and lo.
2. Stop the daos_agent.
3. Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces. Uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example:
   ```
   exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
   ```
4. Restart the daos_agent.
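The exclusion list in the steps above can be generated rather than typed by hand. The following is a minimal sketch, assuming you already know which interface should remain available to the agent; `build_exclude_line` is a hypothetical helper, not part of DAOS.

```shell
# Sketch: build an exclude_fabric_ifaces line that excludes every listed
# interface except the one to keep.
# Usage: build_exclude_line KEEP_IFACE IFACE...
build_exclude_line() {
  local keep="$1" iface list=""
  shift
  for iface in "$@"; do
    # Skip the interface the agent should keep using.
    [ "$iface" = "$keep" ] && continue
    # Append the quoted name, adding ", " only between entries.
    list="${list:+${list}, }\"${iface}\""
  done
  printf 'exclude_fabric_ifaces: [%s]\n' "$list"
}
```

For example, `build_exclude_line eth0 eth0 docker0 ens8 lo` prints the same exclude_fabric_ifaces line as the example above. If the agent runs as a systemd service (an assumption; this page does not name the unit), stopping and restarting it would look like `sudo systemctl stop daos_agent` and `sudo systemctl start daos_agent`.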

`ENOSPC` when there is unused capacity in the instance

If your instance uses minimum striping or balanced striping (the default), you might see ENOSPC errors even when the existing files don't use all of the instance's capacity. This is most likely to happen when writing large files, generally those greater than 8 GiB, or when importing such files from Cloud Storage.

Use maximum file striping to reduce the likelihood of these errors.

Google Kubernetes Engine troubleshooting

The following section lists some common issues and steps to resolve them.

`Transport endpoint is not connected` in workload Pods

This error occurs when dfuse terminates. In most cases, dfuse was terminated because it ran out of memory. Use the Pod annotations gke-parallelstore/cpu-limit and gke-parallelstore/memory-limit to allocate more resources to the Parallelstore sidecar container. If you don't know how much memory to allocate, you can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limit. This only works on Standard clusters. On Autopilot clusters, you cannot use the value 0 to unset the sidecar container resource limits and requests; you must explicitly set a larger resource limit for the sidecar container.

Once you've modified the annotations, you must restart your workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.
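To show where the annotations sit in a Pod manifest, the sketch below prints a metadata fragment for given limits. `sidecar_annotations` is a hypothetical helper, and the values you pass are illustrative; choose limits that fit your workload.

```shell
# Sketch: print the Pod metadata fragment that sets the Parallelstore sidecar
# resource limits. The annotation keys come from this page; the helper name
# and any values passed in are illustrative assumptions.
# Usage: sidecar_annotations CPU_LIMIT MEMORY_LIMIT
sidecar_annotations() {
  local cpu="$1" mem="$2"
  cat <<EOF
metadata:
  annotations:
    gke-parallelstore/volumes: "true"
    gke-parallelstore/cpu-limit: "${cpu}"
    gke-parallelstore/memory-limit: "${mem}"
EOF
}
```

Merge the printed fragment into the Pod spec (or the Pod template of your workload controller) and then recreate the Pod, because annotations added to a running Pod don't change its resource allocation.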

Pod event warnings

If your workload Pods cannot start up, check the Pod events:

```
kubectl describe pod POD_NAME -n NAMESPACE
```

The following solutions are for common errors.

CSI driver enablement issues

Common CSI driver enablement errors are as follows:

```
MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi:
attacher.MountDevice failed to create newCsiDriverClient:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi:
mounter.SetUpAt failed to get CSI client:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
```

These warnings indicate that the CSI driver is not enabled, or not running.

If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It takes a few minutes for the CSI driver Pods to be functional after cluster operations.

Otherwise, confirm that the CSI driver is enabled on your cluster. See Enable the CSI driver for details. If the CSI driver is enabled, each node has a running Pod named parallelstore-csi-node-id.
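One way to check for the CSI node Pods is to filter the cluster's Pod list by name. This is a sketch only: `list_csi_node_pods` is a hypothetical helper, and it searches all namespaces because this page does not state which namespace the driver Pods run in.

```shell
# Sketch: list the Parallelstore CSI node Pods and the nodes they run on.
# Searches every namespace (the driver's namespace is not assumed) and keeps
# the header row so the columns stay readable.
list_csi_node_pods() {
  kubectl get pods --all-namespaces -o wide \
    | grep -E 'NAMESPACE|parallelstore-csi-node'
}
```

If a node is missing from the output, or its Pod is not in the Running status, the driver is not yet functional on that node.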

MountVolume.SetUp failures

| Error code | Pod event warning | Solution |
| --- | --- | --- |
| ResourceExhausted | `MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed`<br>`MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container terminated due to OOMKilled, exit code: 137` | The dfuse process was killed, usually because it ran out of memory. Consider increasing the sidecar container memory limit by using the annotation `gke-parallelstore/memory-limit`. If you're unsure how much memory to allocate to the parallelstore-sidecar, we recommend setting `gke-parallelstore/memory-limit: "0"` to eliminate the memory restriction imposed by Parallelstore. |
| Aborted | `MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit`<br>`MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists` | The volume mount operation was aborted because of rate limiting or because another operation on the volume is in progress. This warning is normal and should be transient. |
| InvalidArgument | `MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =` | If you supplied invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields with the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume. |
| FailedPrecondition | `MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec` | The Parallelstore sidecar container was not injected. Check that the Pod annotation `gke-parallelstore/volumes: "true"` is set correctly. |
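When scanning many Pod events, it can help to extract the error code from each warning and map it to the guidance in the table above. The following sketch does that with sed and a case statement; `explain_mount_error` is a hypothetical helper, and its one-line hints only summarize the table.

```shell
# Sketch: pull the gRPC error code out of a MountVolume.SetUp warning and
# print a one-line hint based on the table above.
# Usage: explain_mount_error "WARNING_TEXT"
explain_mount_error() {
  local code
  # Capture the word that follows "rpc error: code = ".
  code="$(printf '%s\n' "$1" \
    | sed -n 's/.*rpc error: code = \([A-Za-z]*\).*/\1/p')"
  case "$code" in
    ResourceExhausted)
      echo "ResourceExhausted: raise gke-parallelstore/memory-limit" ;;
    Aborted)
      echo "Aborted: transient, caused by rate limiting or an operation already in progress" ;;
    InvalidArgument)
      echo "InvalidArgument: check the StorageClass or PersistentVolume fields" ;;
    FailedPrecondition)
      echo "FailedPrecondition: check the gke-parallelstore/volumes annotation" ;;
    *)
      echo "unknown: no recognized rpc error code found"
      return 1 ;;
  esac
}
```

Pass the warning text from the Pod events as a single quoted argument. The function returns a nonzero status for warnings that don't contain one of the four codes.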

Troubleshooting VPC networks

Permission denied to add peering for service `servicenetworking.googleapis.com`

```
ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.
```

This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.

See Access control with IAM for instructions on adding one of the following roles to your account:

roles/compute.networkAdmin or
roles/servicenetworking.networksAdmin
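For reference, granting one of these roles at the project level might look like the sketch below. It assumes a project-level binding is appropriate in your organization; `grant_network_admin` is a hypothetical helper, and the project ID and email address are supplied by the caller.

```shell
# Sketch: grant roles/compute.networkAdmin to a user at the project level.
# Swap in roles/servicenetworking.networksAdmin if that role fits better.
# Usage: grant_network_admin PROJECT_ID USER_EMAIL
grant_network_admin() {
  local project_id="$1" user_email="$2"
  gcloud projects add-iam-policy-binding "$project_id" \
    --member="user:${user_email}" \
    --role="roles/compute.networkAdmin"
}
```

The caller needs permission to modify the project's IAM policy, so this step is typically run by a project owner or administrator.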

Cannot modify allocated ranges in CreateConnection

```
ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection.
```

This error is returned when a VPC peering connection already exists on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges:

```
gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com \
  --force
```

Or, add the new IP range to the existing connection:

Retrieve the list of existing IP ranges for the peering:

```
EXISTING_RANGES=$(
  gcloud services vpc-peerings list \
    --network=NETWORK_NAME \
    --service=servicenetworking.googleapis.com \
    --format="value(reservedPeeringRanges.list())"
)
```

Then, add the new range to the peering:

```
gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges="${EXISTING_RANGES},IP_RANGE_NAME" \
  --service=servicenetworking.googleapis.com
```

IP address range exhausted

Problem: Instance creation fails with a range exhausted error:

```
ERROR: (gcloud.alpha.Parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted
```

Solution: Follow the VPC guide to either recreate the IP range or extend the existing IP range.

If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.
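Before recreating or extending a range, it may help to see which ranges are currently allocated for peering and how large they are. The sketch below is one way to list them; `list_peering_ranges` is a hypothetical helper, and the output columns are an illustrative choice rather than something the VPC guide prescribes.

```shell
# Sketch: list the global address ranges reserved for VPC peering, with their
# start address, prefix length, and network.
list_peering_ranges() {
  gcloud compute addresses list \
    --global \
    --filter="purpose=VPC_PEERING" \
    --format="table(name,address,prefixLength,network.basename())"
}
```

A small prefixLength value means a larger range. Comparing the ranges against the number of instances on the network can indicate whether the range is too small.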