Cassandra troubleshooting guide

This topic discusses steps you can take to troubleshoot and fix problems with the Cassandra datastore. Cassandra is a persistent datastore that runs in the cassandra component of the hybrid runtime architecture. See also Runtime service configuration overview.

Cassandra pods are stuck in the Pending state

Symptom

When starting up, the Cassandra pods remain in the Pending state.

Error message

When you use kubectl to view the pod states, you see that one or more Cassandra pods are stuck in the Pending state. The Pending state indicates that Kubernetes is unable to schedule the pod onto a node, so the pod cannot start. For example:

kubectl get pods -n namespace

NAME                                     READY   STATUS      RESTARTS   AGE
adah-resources-install-4762w             0/4     Completed   0          10m
apigee-cassandra-default-0               0/1     Pending     0          10m
...

Possible causes

A pod stuck in the Pending state can have multiple causes. For example:

Cause: Insufficient resources
Description: There is not enough CPU or memory available to create the pod.

Cause: Volume not created
Description: The pod is waiting for the persistent volume to be created.

Diagnosis

Use kubectl to describe the pod to determine the source of the error:

kubectl -n namespace describe pods pod_name

For example:

kubectl describe pods apigee-cassandra-default-0 -n apigee

The output may show one of these possible problems:

  • If the problem is insufficient resources, you will see a Warning message that indicates insufficient CPU or memory.
  • If the error message indicates that the pod has unbound immediate PersistentVolumeClaims (PVCs), the pod's PersistentVolumeClaim has not been bound to a PersistentVolume, usually because the volume could not be provisioned.
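
As an additional check, you can list the events recorded for the pod directly. This is a generic Kubernetes technique; the pod name below is from the example above:

kubectl get events -n namespace --field-selector involvedObject.name=apigee-cassandra-default-0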

Resolution

Insufficient resources

Modify the Cassandra node pool so that it has sufficient CPU and memory resources. See Resizing a node pool for details.
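
Before resizing, you can check how much CPU and memory is allocatable on a Cassandra node and how much is already requested. This is a generic Kubernetes check; node_name is a placeholder for one of your Cassandra nodes:

kubectl describe node node_name

In the output, compare the Allocatable section with the Allocated resources section to see whether enough unreserved CPU and memory remain for the Cassandra pod's resource requests.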

Persistent volume not created

If you determine that the problem is with the persistent volume, describe the PersistentVolumeClaim (PVC) to determine why it is not being created:

  1. List the PVCs in the cluster:
    kubectl -n namespace get pvc
    
    NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    cassandra-data-apigee-cassandra-default-0   Bound    pvc-b247faae-0a2b-11ea-867b-42010a80006e   10Gi       RWO            standard       15m
    ...
  2. Describe the PVC for the pod that is failing. For example, the following command describes the PVC bound to the pod apigee-cassandra-default-0:
    kubectl -n apigee describe pvc cassandra-data-apigee-cassandra-default-0
    
    Events:
      Type     Reason              Age                From                         Message
      ----     ------              ----               ----                         -------
      Warning  ProvisioningFailed  3m (x143 over 5h)  persistentvolume-controller  storageclass.storage.k8s.io "apigee-sc" not found

    Note that in this example, the StorageClass named apigee-sc does not exist. To resolve this problem, create the missing StorageClass in the cluster, as explained in Change the default StorageClass.
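
To confirm which StorageClasses exist in the cluster, you can list them. This is a generic Kubernetes check:

kubectl get storageclass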

See also Debugging Pods.

Cassandra pods are stuck in the CrashLoopBackOff state

Symptom

When starting up, the Cassandra pods remain in the CrashLoopBackOff state.

Error message

When you use kubectl to view the pod states, you see that one or more Cassandra pods are in the CrashLoopBackOff state. This state indicates that the pod's container starts, crashes, and is restarted by Kubernetes in a repeating back-off loop. For example:

kubectl get pods -n namespace

NAME                                     READY   STATUS            RESTARTS   AGE
adah-resources-install-4762w             0/4     Completed         0          10m
apigee-cassandra-default-0               0/1     CrashLoopBackOff  0          10m
...

Possible causes

A pod stuck in the CrashLoopBackOff state can have multiple causes. For example:

Cause: Data center differs from previous data center
Description: The Cassandra pod has a persistent volume that contains data from a previous cluster, and the new pod is not able to join the old cluster. This usually happens when stale persistent volumes from a previous Cassandra cluster remain on the same Kubernetes node, for example after you delete and recreate Cassandra in the cluster.

Cause: Truststore directory not found
Description: The Cassandra pod is not able to create a TLS connection. This usually happens when the provided keys and certificates are invalid, missing, or have other issues.

Diagnosis

Check the Cassandra error log to determine the cause of the problem.

  1. List the pods to get the ID of the Cassandra pod that is failing:
    kubectl get pods -n namespace
  2. Check the failing pod's log:
    kubectl logs pod_id -n namespace
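
If the container has already restarted, the current log may not include the original failure. You can view the log from the previous container instance with the --previous flag:

kubectl logs pod_id -n namespace --previous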

Resolution

Look for the following clues in the pod's log:

Data center differs from previous data center

If you see this log message:

Cannot start node if snitch's data center (us-east1) differs from previous data center
  • Check whether there are any stale or old PVCs in the cluster and delete them.
  • If this is a fresh install, delete all the PVCs and retry the setup. For example:
    kubectl -n namespace get pvc
    kubectl -n namespace delete pvc cassandra-data-apigee-cassandra-default-0
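
Depending on the reclaim policy of the underlying PersistentVolume, the volume itself can remain after its PVC is deleted. PersistentVolumes are cluster-scoped, so you can list them without a namespace and delete any stale volumes left over from the previous Cassandra cluster:

kubectl get pv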

Truststore directory not found

If you see this log message:

Caused by: java.io.FileNotFoundException: /apigee/cassandra/ssl/truststore.p12
(No such file or directory)

Verify that the key and certificates provided in your overrides file are correct and valid. For example:

cassandra:
  sslRootCAPath: path-to-root-ca-file
  sslCertPath: path-to-tls-cert-file
  sslKeyPath: path-to-tls-key-file
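
As a generic sanity check, you can confirm that a certificate and private key belong together by comparing their public key digests with openssl. These commands are illustrative, assume an RSA key, and use the same placeholder file names as the example above:

openssl x509 -noout -modulus -in path-to-tls-cert-file | openssl md5
openssl rsa -noout -modulus -in path-to-tls-key-file | openssl md5

If the two digests differ, the certificate and key do not match.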

Node failure

Symptom

When starting up, the Cassandra pods remain in the Pending state. This problem can indicate an underlying node failure.

Diagnosis

  1. Determine which Cassandra pods are not running:
    $ kubectl get pods -n your_namespace
        NAME                  READY   STATUS    RESTARTS   AGE
        cassandra-default-0   0/1     Pending   0          13s
        cassandra-default-1   1/1     Running   0          8d
        cassandra-default-2   1/1     Running   0          8d
  2. Check the worker nodes. If one is in the NotReady state, then that is the node that has failed:
    kubectl get nodes
    NAME                                              STATUS   ROLES    AGE   VERSION            INTERNAL-IP
    gke-hybrid-cluster-apigee-data-178811f1-lv5j      Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.198
    gke-hybrid-cluster-apigee-data-d63b8b8d-n41g      NotReady <none>   34d   v1.21.5-gke.1302   10.138.15.200
    gke-hybrid-cluster-apigee-data-ec752c0b-b1cr      Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.199
    gke-hybrid-cluster-apigee-runtime-ba502ff4-57mq   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.204
    gke-hybrid-cluster-apigee-runtime-ba502ff4-hwkb   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.203
    gke-hybrid-cluster-apigee-runtime-bfa558e0-08vw   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.201
    gke-hybrid-cluster-apigee-runtime-bfa558e0-xvsc   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.202
    gke-hybrid-cluster-apigee-runtime-d12de7df-693w   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.241
    gke-hybrid-cluster-apigee-runtime-d12de7df-fn0w   Ready    <none>   34d   v1.21.5-gke.1302   10.138.15.206
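
To see why a node is NotReady, you can describe it and inspect the Conditions section of the output. This is a generic Kubernetes check; the node name below is the failed node from the example above:

kubectl describe node gke-hybrid-cluster-apigee-data-d63b8b8d-n41g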

Resolution

  1. Remove the dead Cassandra node from the Cassandra ring. Run nodetool status to find the Host ID of the down (DN) node, then remove it:
    $ kubectl exec -it apigee-cassandra-default-0 -n your_namespace -- nodetool status
    $ kubectl exec -it apigee-cassandra-default-0 -n your_namespace -- nodetool removenode deadnode_hostID
  2. Remove the PersistentVolumeClaim (PVC) associated with the dead node to prevent the Cassandra pod from attempting to come back up on the dead node because of node affinity:
    kubectl get pvc -n your_namespace
    kubectl delete pvc volumeClaim_name -n your_namespace
  3. Update the volume template and create a PersistentVolume for the newly added node. The following is an example volume template:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: cassandra-data-3
    spec:
      capacity:
        storage: 100Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local-storage
      local:
        path: /apigee/data
      nodeAffinity:
        "required":
          "nodeSelectorTerms":
          - "matchExpressions":
            - "key": "kubernetes.io/hostname"
              "operator": "In"
              "values": ["gke-hybrid-cluster-apigee-data-d63b8b8d-n41g"]
  4. Replace the values with the hostname of the new node and apply the template:
    kubectl apply -f volume-template.yaml
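
After applying the template, you can verify that the new PersistentVolume exists, that the claim binds, and that the pending Cassandra pod is scheduled. These are generic verification commands:

kubectl get pv
kubectl get pvc -n your_namespace
kubectl get pods -n your_namespace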

Create a client container for debugging

This section explains how to create a client container from which you can access Cassandra debugging utilities such as cqlsh. These utilities allow you to query Cassandra tables and can be useful for debugging purposes.

Create the client container

To create the client container, follow these steps:

  1. The container uses the TLS certificate from the apigee-cassandra-user-setup pod. The first step is to fetch the name of the Kubernetes secret that holds this certificate:
    kubectl get secrets -n apigee --field-selector type=kubernetes.io/tls | grep apigee-cassandra-user-setup | awk '{print $1}'

    This command returns the secret name. For example: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls.

  2. Open a new file and paste the following pod spec into it:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
      name: cassandra-client-name   # For example: my-cassandra-client
      namespace: apigee
    spec:
      containers:
      - name: cassandra-client-name
        image: "gcr.io/apigee-release/hybrid/apigee-hybrid-cassandra-client:1.7.6"
        imagePullPolicy: Always
        command:
        - sleep
        - "3600"
        env:
        - name: CASSANDRA_SEEDS
          value: apigee-cassandra-default.apigee.svc.cluster.local
        - name: APIGEE_DML_USER
          valueFrom:
            secretKeyRef:
              key: dml.user
              name: apigee-datastore-default-creds
        - name: APIGEE_DML_PASSWORD
          valueFrom:
            secretKeyRef:
              key: dml.password
              name: apigee-datastore-default-creds
        volumeMounts:
        - mountPath: /opt/apigee/ssl
          name: tls-volume
          readOnly: true
      volumes:
      - name: tls-volume
        secret:
          defaultMode: 420
          secretName: your-secret-name    # For example: apigee-cassandra-user-setup-rg-hybrid-b7d3b9c-tls
      restartPolicy: Never
  3. Save the file with a .yaml extension. For example: my-spec.yaml.
  4. Apply the spec to your cluster:
    kubectl apply -f your-spec-file.yaml -n apigee
  5. Log in to the container:
    kubectl exec -n apigee cassandra-client-name -it -- bash
  6. Connect to the Cassandra cqlsh interface with the following command. Enter the command exactly as shown:
    cqlsh ${CASSANDRA_SEEDS} -u ${APIGEE_DML_USER} -p ${APIGEE_DML_PASSWORD} --ssl
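
Once connected, you can run a simple query to confirm that the connection works. For example, the following statement lists the keyspaces using the standard Cassandra system tables; the keyspaces you see depend on your installation:

SELECT keyspace_name FROM system_schema.keyspaces;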

Deleting the client pod

Use this command to delete the Cassandra client pod:

kubectl delete pods -n apigee cassandra-client-name

Additional resources

See Introduction to Apigee and Apigee hybrid playbooks.