Troubleshooting Cassandra restore


Symptom

During the Cassandra restoration in Apigee hybrid, you may encounter errors in the restore logs.

Error message

You see one of the following in the logs:

java.net.ConnectException: Connection timed out (Connection timed out)
/tmp/tmp/schema.cql:662:OperationTimedOut: errors={'10.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.1
/tmp/tmp/schema.cql:6409:AlreadyExists: Table 'kvm_myorg.kvm_map_keys_descriptor' already exists

Possible causes

The troubleshooting instructions in each section below apply to Apigee hybrid.

Cause                  Description
Connection timed out   A connectivity error between the apigee-cassandra-restore pods and the apigee-cassandra-default-* pods
Operation timed out    The restoration runs for more than 15 minutes and times out
Already exists         Not the root cause of the failure; the message appears because the restoration job was retried

Cause: Connection timed out

The following error is a connectivity error between apigee-cassandra-restore pods and apigee-cassandra-default-* pods:

java.net.ConnectException: Connection timed out (Connection timed out)

Diagnosis

  1. If your host network is not reachable from the pod network, ensure that hostNetwork is set to false under cassandra in overrides.yaml, as shown in Restoring a region from a backup. You can verify what the running Cassandra pods actually use with the check shown after these steps.
  2. To test the connectivity, log in to the apigee-mart or apigee-runtime pod, which is in the same network as the apigee-cassandra-restore job. You can also use any other pod in the pod network.
    1. Obtain the name of the apigee-mart pod:
      kubectl -n apigee get po --selector=app=apigee-mart --no-headers -o custom-columns=":metadata.name"
    2. Execute a bash session inside the mart pod:
      kubectl exec -it MART_POD_NAME -n apigee -- bash

      Replace MART_POD_NAME with the name of the MART pod. For example, apigee-mart-myorg--9a8228a-191-t0xh1-qz5fl.

    3. Run connectivity tests against Cassandra ports:
      curl -v -m 5 telnet://apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local:9042
      curl -v -m 5 telnet://apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local:7001

    If you get a Connection timed out error in the output, you have a connectivity issue. If you see a Connected to message instead, the connection is successful; press Ctrl + C to close the connection and proceed.
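
To confirm whether the running Cassandra pods actually use host networking, you can query the pod spec directly. This is a quick check that relies only on the standard Kubernetes hostNetwork field of the pod spec; an empty result or false means host networking is disabled:

kubectl -n apigee get pod apigee-cassandra-default-0 -o jsonpath='{.spec.hostNetwork}'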

Resolution

Ensure that hostNetwork is set to false in the overrides.yaml file used for the restore, and repeat the restoration process (an example snippet is shown after the output below). If the setting is already set to false but you still see connectivity errors, ensure that the Cassandra pods are up and running with the following command:

kubectl get pods -n apigee -l app=apigee-cassandra

Your output should look something like the following example:

NAME                         READY   STATUS    RESTARTS   AGE
apigee-cassandra-default-0   1/1     Running   0          14m
apigee-cassandra-default-1   1/1     Running   0          13m
apigee-cassandra-default-2   1/1     Running   0          11m
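
For reference, the relevant part of an overrides.yaml used for the restore might look like the following. This is a minimal sketch only; the restore-specific properties are omitted, and your file will contain additional settings (see Restoring a region from a backup for the full configuration):

cassandra:
  hostNetwork: false
  # ... your existing backup and restore configuration goes here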

Cause: Operation timed out

The following error occurs if the restoration runs for more than 15 minutes and times out. It indicates an I/O issue: the storage or the network cannot transfer the uncompressed contents of the backup in time.

/tmp/tmp/schema.cql:662:OperationTimedOut: errors={'10.0.0.1': 'Client
request timeout. See Session.execute[_async](timeout)'}, last_host=10.0.0.1

Diagnosis

  1. Check the apigee-cassandra-default-0 logs to note the timestamp of the beginning of the restoration:

    kubectl logs apigee-cassandra-default-0 -n apigee | grep 'MigrationManager.java' | head -n 1
  2. Compare the timestamp with the latest log of table creation:

    kubectl logs apigee-cassandra-default-0 -n apigee | grep 'Create new table' | tail -n 1

    The results from this comparison should show that the Cassandra pod was still in the process of creating tables after the timeout had been exceeded.

  3. Test the storage bandwidth with the following commands:

    kubectl -n apigee exec -it apigee-cassandra-default-0 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'
    kubectl -n apigee exec -it apigee-cassandra-default-1 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'
    kubectl -n apigee exec -it apigee-cassandra-default-2 -- bash -c 'dd if=/dev/zero of=/opt/apigee/data/test.img bs=200M count=1 ; rm /opt/apigee/data/test.img'

    If the write speed is less than 100 MB/s, the Cassandra volumes are most likely not backed by an appropriate (SSD-based) StorageClass. You can check which StorageClass the volumes use with the commands shown after these steps.

  4. Test the networking bandwidth:

    1. Run netcat on the Cassandra pod to listen on port 3456:

      kubectl -n apigee exec -it apigee-cassandra-default-0 -- bash -c 'nc -l -p 3456 > /dev/null'
    2. In a separate shell session, obtain the name of the apigee-mart pod:

      kubectl -n apigee get po --selector=app=apigee-mart --no-headers -o custom-columns=":metadata.name"
    3. Execute a bash session inside the apigee-mart pod. You can also use any other pod in the pod network:

      kubectl exec -it MART_POD_NAME -n apigee -- bash

      Replace MART_POD_NAME with the name of the MART pod. For example, apigee-mart-myorg--9a8228a-191-t0xh1-qz5fl.

    4. Run a network bandwidth test against the Cassandra pod that is still running the netcat:

      dd if=/dev/zero bs=50M count=1 | nc apigee-cassandra-default-0.apigee-cassandra-default.apigee.svc.cluster.local 3456

    You can repeat the test for the other Cassandra pods. If the resulting speed is less than 10 MB/s, network bandwidth is most likely the cause of the issue.
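
If the storage test in step 3 showed slow writes, you can check which StorageClass backs the Cassandra volumes. The following is a quick check using standard kubectl output; the persistent volume claim names (typically cassandra-data-apigee-cassandra-default-*) may differ in your cluster:

kubectl -n apigee get pvc
kubectl get storageclass

Confirm that the StorageClass listed for the Cassandra volume claims is backed by SSD storage.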

Resolution

After confirming slow I/O with the steps above, ensure that your cluster meets the minimum network and storage requirements, and then test the bandwidth again.

Cause: Already exists

Diagnosis

You see an error similar to the following:

/tmp/tmp/schema.cql:6409:AlreadyExists: Table 'kvm_myorg.kvm_map_keys_descriptor' already exists

Resolution

This error message does not indicate the root cause of the failure; it appears because the restoration job was retried. The actual error message is in the logs of the first pod that failed.

Obtain the logs from the initial failure in order to diagnose the issue.
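
One way to find those logs, assuming the restore pods carry the apigee-cassandra-restore name prefix mentioned above, is to list the pods oldest first and then read the logs of the earliest failed one:

kubectl -n apigee get pods --sort-by=.metadata.creationTimestamp | grep apigee-cassandra-restore
kubectl -n apigee logs RESTORE_POD_NAME

Replace RESTORE_POD_NAME with the name of the oldest failed restore pod from the first command's output.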

If the problem still persists, go to Must gather diagnostic information.

Must gather diagnostic information

If the problem persists even after following the above instructions, gather the following diagnostic information and then contact Apigee Support:

  1. In addition to the usual data you might be asked to provide, collect the diagnostic data from all the Cassandra pods with the following command:

    for p in $(kubectl -n apigee get pods -l app=apigee-cassandra --no-headers -o custom-columns=":metadata.name") ; do \
            for com in info describecluster failuredetector version status ring gossipinfo compactionstats tpstats netstats cfstats proxyhistograms gcstats ; do kubectl \
            -n apigee exec ${p} -- bash -c 'nodetool -u $APIGEE_JMX_USER -pw $APIGEE_JMX_PASSWORD '"$com"' 2>&1 '\
            | tee /tmp/k_cassandra_nodetool_${com}_${p}_$(date +%Y.%m.%d_%H.%M.%S).txt | head -n 40 ; echo '...' ; done; done
  2. Compress the collected files and provide the archive in the support case:

    tar -cvzf /tmp/cassandra_data_$(date +%Y.%m.%d_%H.%M.%S).tar.gz /tmp/k_cassandra_nodetool*
  3. Collect and provide the logs from the restore pods (see the sketch after this list). Note that the logs are short-lived, so collect them right after the failure.
  4. If you followed the diagnosis steps above, collect all the console outputs, copy them into a file, and attach the file to the support case.
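
As an example of step 3, and assuming the restore pods carry the apigee-cassandra-restore name prefix, you could capture the logs of all restore pods into files with a loop similar to the one in step 1:

for p in $(kubectl -n apigee get pods --no-headers -o custom-columns=":metadata.name" | grep apigee-cassandra-restore) ; do \
        kubectl -n apigee logs ${p} > /tmp/${p}.log ; done

Attach the resulting /tmp/*.log files to the support case along with the archive from step 2.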