Symptom
During the Cassandra restoration in Apigee hybrid, you may encounter errors in the restore logs.
Error message
You see one of the following in the logs:
java.net.ConnectException: Connection timed out (Connection timed out)
/tmp/tmp/schema.cql:662:OperationTimedOut: errors={'10.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.1
/tmp/tmp/schema.cql:6409:AlreadyExists: Table 'kvm_myorg.kvm_map_keys_descriptor' already exists
Possible causes
Cause | Description | Troubleshooting instructions applicable for |
---|---|---|
Connection timed out | This error is a connectivity error between the apigee-cassandra-restore pods and the apigee-cassandra-default-* pods. | Apigee hybrid |
Operation timed out | This error occurs if restoration times out after more than 15 minutes. | Apigee hybrid |
Already exists | This error message is not related to the cause of the issue; it is the result of a retry of the restoration job. | Apigee hybrid |
Cause: Connection timed out
The following error is a connectivity error between the apigee-cassandra-restore pods and the apigee-cassandra-default-* pods:
java.net.ConnectException: Connection timed out (Connection timed out)
Diagnosis
- If your host network is not reachable from the pod network, ensure that hostNetwork is set to false under cassandra in overrides.yaml, as shown in Restoring a region from a backup.
- To test the connectivity, log in to the apigee-mart or apigee-runtime pod, which is in the same network as the apigee-cassandra-restore job. You can also use any other pod in the pod network.
  - Obtain the name of the apigee-mart pod:
    kubectl -n apigee get po --selector=app=apigee-mart --no-headers -o custom-columns=":metadata.name"
  - Execute a bash session inside the MART pod:
    kubectl exec -it MART_POD_NAME -n apigee -- bash
    Replace MART_POD_NAME with the name of the MART pod. For example, apigee-mart-myorg--9a8228a-191-t0xh1-qz5fl.
  - Run connectivity tests against the Cassandra ports:
    curl -v -m 5 telnet://apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local:9042
    curl -v -m 5 telnet://apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local:7001
    A Connection timed out error in the output means you have connectivity issues. A Connected to message means the connection is successful; press Ctrl+C to close the connection and proceed. A sketch that repeats these checks against every Cassandra pod follows this list.
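The commands above test only the first Cassandra pod. The following is a minimal sketch, assuming three apigee-cassandra-default replicas and the default service name, that runs the same checks against every Cassandra pod from inside the MART pod:
# Run the same curl connectivity checks against ports 9042 and 7001 on each Cassandra pod.
MART_POD=$(kubectl -n apigee get po --selector=app=apigee-mart --no-headers -o custom-columns=":metadata.name" | head -n 1)
for i in 0 1 2; do
  for port in 9042 7001; do
    # Expect "Connected to" for a healthy connection; "Connection timed out" indicates a connectivity issue.
    kubectl -n apigee exec "${MART_POD}" -- curl -v -m 5 "telnet://apigee-cassandra-default-${i}.apigee-cassandra-default.apigee.svc.cluster.local:${port}"
  done
done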
Resolution
Ensure that hostNetwork is set to false in the overrides.yaml file used for the restore, and repeat the restoration process. A minimal example fragment is shown after the output below. If the setting is already set to false but you still see connectivity errors, ensure that the Cassandra pods are up and running with the following command:
kubectl get pods -n apigee -l app=apigee-cassandra
Your output should look something like the following example:
NAME                         READY   STATUS    RESTARTS   AGE
apigee-cassandra-default-0   1/1     Running   0          14m
apigee-cassandra-default-1   1/1     Running   0          13m
apigee-cassandra-default-2   1/1     Running   0          11m
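For reference, a minimal sketch of the relevant overrides.yaml fragment (your file will contain many additional properties):
cassandra:
  hostNetwork: false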
Cause: Operation timed out
The following error occurs if the restoration times out after more than 15 minutes. It indicates I/O issues, such as the storage or network not being able to transfer the uncompressed contents of the backup in time.
/tmp/tmp/schema.cql:662:OperationTimedOut: errors={'10.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.1
Diagnosis
- Check the apigee-cassandra-default-0 logs to note the timestamp of the beginning of the restoration:
  kubectl logs apigee-cassandra-default-0 -n apigee | grep 'MigrationManager.java' | head -n 1
- Compare the timestamp with the latest log of table creation:
  kubectl logs apigee-cassandra-default-0 -n apigee | grep 'Create new table' | tail -n 1
  The results from this comparison should show that the Cassandra pod was still in the process of creating tables after the timeout had been exceeded.
- Test the storage bandwidth with the following commands:
  kubectl -n apigee exec -it apigee-cassandra-default-0 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'
  kubectl -n apigee exec -it apigee-cassandra-default-1 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'
  kubectl -n apigee exec -it apigee-cassandra-default-2 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'
  If the write speed reported by dd is less than 100 MB/s, this could indicate that an appropriate StorageClass (SSD) is not being used.
- Test the networking bandwidth:
  - Run netcat on the Cassandra pod to listen on the port:
    kubectl -n apigee exec -it apigee-cassandra-default-0 -- bash -c 'nc -l -p 3456 > /dev/null'
  - In a separate shell session, obtain the name of the apigee-mart pod:
    kubectl -n apigee get po --selector=app=apigee-mart --no-headers -o custom-columns=":metadata.name"
  - Execute a bash session inside the apigee-mart pod. You can also use any other pod in the pod network:
    kubectl exec -it MART_POD_NAME -n apigee -- bash
    Replace MART_POD_NAME with the name of the MART pod. For example, apigee-mart-myorg--9a8228a-191-t0xh1-qz5fl.
  - Run a network bandwidth test against the Cassandra pod that is still running netcat:
    dd if=/dev/zero bs=50M count=1 | nc apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local 3456
    You can repeat the process for the other Cassandra pods; a sketch that does so follows this list. If the resulting speed is less than 10 MB/s, network bandwidth is most likely the cause of the issue.
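The following is a minimal sketch of that repetition, run from inside the MART pod. It assumes three apigee-cassandra-default replicas and that you have already started the nc listener on port 3456 on each Cassandra pod, as described above:
# Send 50 MB to each Cassandra pod in turn and note the throughput reported by dd.
for i in 0 1 2; do
  dd if=/dev/zero bs=50M count=1 | nc "apigee-cassandra-default-${i}.apigee-cassandra-default.apigee.svc.cluster.local" 3456
done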
Resolution
Once you have confirmed slow I/O with the steps above, ensure that your cluster adheres to the minimum network and storage requirements, and then test the bandwidth again.
Cause: Already exists
Diagnosis
You see an error similar to the following:
/tmp/tmp/schema.cql:6409:AlreadyExists: Table 'kvm_myorg.kvm_map_keys_descriptor' already exists
Resolution
This error message is not related to the cause of the issue; it is the result of a retry of the restoration job. The actual error message should be displayed in the logs of the first pod that failed.
Obtain the logs from the initial failure in order to diagnose the issue.
If the problem still persists, go to Must gather diagnostic information.
Must gather diagnostic information
If the problem persists even after following the above instructions, gather the following diagnostic information and then contact Google Cloud Customer Care:
- In addition to the usual data you might be asked to provide, collect the diagnostic data from all the Cassandra pods with the following command:
  for p in $(kubectl -n apigee get pods -l app=apigee-cassandra --no-headers -o custom-columns=":metadata.name") ; do \
    for com in info describecluster failuredetector version status ring info gossipinfo compactionstats tpstats netstats cfstats proxyhistograms gcstats ; do \
      kubectl -n apigee exec ${p} -- bash -c 'nodetool -u $APIGEE_JMX_USER -pw $APIGEE_JMX_PASSWORD '"$com"' 2>&1 ' \
        | tee /tmp/k_cassandra_nodetool_${com}_${p}_$(date +%Y.%m.%d_%H.%M.%S).txt | head -n 40 ; echo '...' ; \
    done ; \
  done
- Compress the resulting files and provide them in the support case:
  tar -cvzf /tmp/cassandra_data_$(date +%Y.%m.%d_%H.%M.%S).tar.gz /tmp/k_cassandra_nodetool*
- Collect and provide logs from the restore pod; a sketch for collecting them follows this list. Note that logs are short-lived, so they should be collected right after the failure.
- If you followed the diagnosis steps above, collect all the console outputs, copy them into a file, and attach the file to the support case.
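The following is a minimal sketch for collecting the restore pod logs. The name filter is an assumption based on the apigee-cassandra-restore pod naming used above; adjust it if your restore pods are named differently.
# Save the logs of every restore pod to /tmp, timestamped, right after the failure.
for p in $(kubectl -n apigee get pods --no-headers -o custom-columns=":metadata.name" | grep apigee-cassandra-restore) ; do
  kubectl -n apigee logs "${p}" > /tmp/${p}_$(date +%Y.%m.%d_%H.%M.%S).log
done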