This page includes troubleshooting steps for some common issues and errors.
FAILED instance
The FAILED status means that the instance data has been lost and the instance must be deleted. Parallelstore instances in a FAILED state continue to be billed until they're deleted.
To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance. To delete an instance, see Manage instances: Delete an instance.
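As a quick sketch, assuming the gcloud CLI's Parallelstore command group is available in your release (it may sit under beta), checking an instance's state and then deleting it looks like this, with INSTANCE_ID and LOCATION as placeholders:
gcloud parallelstore instances describe INSTANCE_ID \
    --location=LOCATION \
    --format="value(state)"
gcloud parallelstore instances delete INSTANCE_ID \
    --location=LOCATION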
Timeouts during dfuse mount or network tests
If the dfuse -m command times out when mounting your Parallelstore instance, or if network test commands such as self_test or daos health net-test time out, the DAOS agent may have selected the wrong network interface during setup.
To verify the DAOS agent's interface selection, run:
self_test --use-daos-agent-env -r 1
If the test reports a connection issue, you may need to exclude network interfaces that cannot reach the IPs in the access_points list.
To exclude interfaces, follow these steps (a sketch of the full sequence appears after this list):
1. Run ifconfig to list the available network interfaces. Example output may show several network interfaces, such as eth0, docker0, ens8, and lo.
2. Stop the daos_agent.
3. Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces: uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example:
   exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
4. Restart the daos_agent.
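Assuming daos_agent is managed by systemd on your client image (your environment may run it differently), the stop/edit/restart sequence might look like:
sudo systemctl stop daos_agent
sudoedit /etc/daos/daos_agent.yml    # uncomment exclude_fabric_ifaces and set the values
sudo systemctl start daos_agent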
Google Kubernetes Engine troubleshooting
The following section lists some common issues and steps to resolve them.
Transport endpoint is not connected in workload Pods
This error is due to dfuse termination, in most cases because dfuse ran out of memory. Use the Pod annotations gke-parallelstore/cpu-limit and gke-parallelstore/memory-limit to allocate more resources to the Parallelstore sidecar container. If you don't know how much memory to allocate, you can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limitation. Note that this only works with Standard clusters; with Autopilot clusters, you cannot use the value 0 to unset the sidecar container resource limits and requests, and must instead explicitly set a larger resource limit for the sidecar container.
Once you've modified the annotations, you must restart your workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.
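As an illustration, the annotation block on a workload Pod might look like the following; the resource values shown here are placeholders, not recommendations:
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    gke-parallelstore/volumes: "true"
    gke-parallelstore/cpu-limit: "2"
    gke-parallelstore/memory-limit: 4Gi
spec:
  # ... containers and Parallelstore volume definitions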
Pod event warnings
If your workload Pods cannot start up, check the Pod events:
kubectl describe pod POD_NAME -n NAMESPACE
The following solutions are for common errors.
CSI driver enablement issues
Common CSI driver enablement errors are as follows:
MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi:
attacher.MountDevice failed to create newCsiDriverClient:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi:
mounter.SetUpAt failed to get CSI client:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
These warnings indicate that the CSI driver is not enabled, or not running.
If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It takes a few minutes for the CSI driver Pods to be functional after cluster operations.
Otherwise, confirm that the CSI driver is enabled on your cluster. See Enable the CSI driver for details. If the CSI driver is enabled, each node shows a Pod named parallelstore-csi-node-id up and running.
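One way to check, assuming the managed driver's Pods run in the kube-system namespace as GKE-managed CSI drivers typically do:
kubectl get pods -n kube-system | grep parallelstore-csi-node
Each node in the cluster should show one matching Pod in the Running state.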
MountVolume.SetUp failures
| Error code | Pod event warning | Solution |
|---|---|---|
| ResourceExhausted | | The dfuse process was killed, which is usually caused by OOM. Consider increasing the sidecar container memory limit by using the annotation gke-parallelstore/memory-limit. If you're unsure about the amount of memory you want to allocate to the parallelstore-sidecar, we recommend setting gke-parallelstore/memory-limit: "0" to remove the memory limitation. |
| Aborted | | The volume mount operation was aborted due to rate limiting or existing operations. This warning is normal and should be transient. |
| InvalidArgument | MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc = | If you have supplied invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields with the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume. |
| FailedPrecondition | MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec | The Parallelstore sidecar container was not injected. Check that the Pod annotation gke-parallelstore/volumes: "true" is set correctly (see the check after this table). |
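To confirm whether the sidecar was injected into a Pod, you can list the container names in its spec; the exact sidecar name may vary, but a Parallelstore-related container should appear:
kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.spec.containers[*].name}'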
Troubleshooting VPC networks
Permission denied to add peering for service servicenetworking.googleapis.com
ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.
This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission. See Access control with IAM for instructions on adding one of the following roles to your account: roles/compute.networkAdmin or roles/servicenetworking.networksAdmin.
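For example, someone with permission to manage IAM on the project can grant one of these roles with gcloud, where PROJECT_ID and USER_EMAIL are placeholders:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/servicenetworking.networksAdmin"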
Cannot modify allocated ranges in CreateConnection
ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection.
This error is returned when you have already created a vpc-peering on this network with different IP ranges. There are two possible solutions:
Replace the existing IP ranges:
gcloud services vpc-peerings update \
--network=NETWORK_NAME \
--ranges=IP_RANGE_NAME \
--service=servicenetworking.googleapis.com \
--force
Or, add the new IP range to the existing connection:
Retrieve the list of existing IP ranges for the peering:
EXISTING_RANGES=$(gcloud services vpc-peerings list \
    --network=NETWORK_NAME \
    --service=servicenetworking.googleapis.com \
    --format="value(reservedPeeringRanges.list())")
Then, add the new range to the peering:
gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=$EXISTING_RANGES,IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com
IP address range exhausted
Problem: Instance creation fails with range exhausted error:
ERROR: (gcloud.alpha.Parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted
Solution: Follow the VPC guide to either recreate the IP range or extend the existing IP range.
If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.
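As a sketch, allocating a new, larger range for the peering might look like the following; RANGE_NAME, NETWORK_NAME, and the prefix length are placeholders (a smaller prefix length, such as /20 instead of /24, yields more addresses):
gcloud compute addresses create RANGE_NAME \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=20 \
    --network=NETWORK_NAME
Then update the peering connection to include the new range, using the vpc-peerings update command shown earlier on this page.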