Back up and restore clusters

This page describes how to back up and restore clusters created with Google Distributed Cloud. These instructions apply to all cluster types supported by Google Distributed Cloud.

Back up a cluster

The backup process has two parts. First, a snapshot is taken of the etcd store. Then, the related PKI certificates are saved to a tar file. The etcd store is the Kubernetes backing store for all cluster data and contains all the Kubernetes objects and custom objects required to manage cluster state. The PKI certificates are used for authentication over TLS. This data is backed up from the cluster's control plane or, for a high-availability (HA) cluster, from one of the control planes.

We recommend that you back up your clusters regularly so that your snapshot data stays relatively current. How often you back up depends on how frequently significant changes occur in your clusters.

Make a snapshot of the etcd store

In Google Distributed Cloud, a Pod named etcd-CONTROL_PLANE_NAME in the kube-system namespace runs etcd for that control plane. To back up the cluster's etcd store, perform the following steps from your admin workstation:

  1. Use kubectl get po to identify the etcd Pod.

    kubectl --kubeconfig CLUSTER_KUBECONFIG get po -n kube-system \
        -l 'component=etcd,tier=control-plane'
    

    The response includes the etcd Pod name and its status.

  2. Use kubectl describe pod to see the containers running in the etcd Pod, including the etcd container.

    kubectl --kubeconfig CLUSTER_KUBECONFIG describe pod ETCD_POD_NAME -n kube-system
    
  3. Run a shell in the etcd container:

    kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
        ETCD_POD_NAME --container etcd --namespace kube-system \
        -- /bin/sh
    
  4. From the shell within the etcd container, use etcdctl (version 3 of the API) to save a snapshot, snapshot.db, of the etcd store.

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        snapshot save snapshotDATESTAMP.db
    

    Replace DATESTAMP with the current date so that later snapshots don't overwrite earlier ones.

  5. Exit from the shell in the container and run the following command to copy the snapshot file to the admin workstation. Use the same DATESTAMP value that you used in the previous step.

    kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
        kube-system/ETCD_POD_NAME:snapshotDATESTAMP.db \
        --container etcd snapshot.db
    
  6. Store the snapshot file in a location that is outside of the cluster and is not dependent on the cluster's operation.
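
The snapshot steps above can also be run without an interactive shell. The following is a minimal sketch, not an official procedure: the function names and the CLUSTER_KUBECONFIG shell variable are illustrative assumptions, and the sketch assumes the etcd container provides /bin/sh and etcdctl as used in the steps above. Unlike step 5, it keeps the date-stamped file name on the workstation.

```shell
#!/bin/sh
# Sketch only: function names and the CLUSTER_KUBECONFIG variable are
# assumptions made for illustration, not part of the product.

# Build a date-stamped snapshot file name such as snapshot20240131.db.
snapshot_name() {
    printf 'snapshot%s.db\n' "$(date +%Y%m%d)"
}

# Find the etcd Pod name with the same label selector used in step 1.
etcd_pod_name() {
    kubectl --kubeconfig "$CLUSTER_KUBECONFIG" get po -n kube-system \
        -l 'component=etcd,tier=control-plane' \
        -o jsonpath='{.items[0].metadata.name}'
}

# Save a snapshot inside the etcd container, then copy it to the
# current directory on the workstation.
backup_etcd() {
    pod=$(etcd_pod_name) || return 1
    file=$(snapshot_name)
    kubectl --kubeconfig "$CLUSTER_KUBECONFIG" exec "$pod" \
        --container etcd --namespace kube-system -- \
        /bin/sh -c "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
            --cacert=/etc/kubernetes/pki/etcd/ca.crt \
            --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
            --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
            snapshot save $file" || return 1
    kubectl --kubeconfig "$CLUSTER_KUBECONFIG" cp \
        "kube-system/$pod:$file" --container etcd "$file"
}
```

To use the sketch, set CLUSTER_KUBECONFIG to the path of the cluster kubeconfig file and call backup_etcd. On an HA cluster, the selector matches several Pods and the sketch uses the first one returned.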

Archive the PKI certificates

The certificates to be backed up are located in the /etc/kubernetes/pki directory of the control plane. The PKI certificates, together with the etcd store snapshot.db file, are needed to recover a cluster if the control plane goes down completely. The following steps create a tar file that contains the PKI certificates.

  1. Use ssh to connect to the cluster's control plane as root.

    ssh root@CONTROL_PLANE_NAME
    
  2. From the control plane, create a tar file, certs_backup.tar.gz, with the contents of the /etc/kubernetes/pki directory.

    tar -czvf certs_backup.tar.gz -C /etc/kubernetes/pki .
    

    Creating the tar file from within the control plane preserves all the certificate file permissions.

  3. Exit the control plane and, from the workstation, copy the tar file containing the certificates to a preferred location on the workstation.

    sudo scp root@CONTROL_PLANE_NAME:certs_backup.tar.gz BACKUP_PATH
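
Because the archive might sit unused for a long time, it can help to record a checksum when the backup is made and check it before a restore. The following is a minimal sketch under the assumption that the GNU sha256sum tool is available on the workstation. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: record_checksum and verify_checksum are illustrative names.

# Write a SHA-256 checksum file next to the backup file given as $1,
# for example BACKUP_PATH/certs_backup.tar.gz.
record_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeed only if the backup file still matches its recorded checksum.
verify_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

The same two functions work for the etcd snapshot file.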
    

Restore a cluster

Restoring a cluster from a backup is a last resort. Use it only when a cluster has failed catastrophically and cannot be returned to service any other way, for example, when the etcd data is corrupted or the etcd Pod is in a crash loop.

The cluster restore process has two parts. First, the PKI certificates are restored on the control plane. Then, the etcd store data is restored.

Restore PKI certificates

Assuming you have backed up PKI certificates as described in Archive the PKI certificates, the following steps describe how to restore the certificates from the tar file to a control plane.

  1. Copy the PKI certificates tar file, certs_backup.tar.gz, from the workstation to the cluster control plane.

    sudo scp -r BACKUP_PATH/certs_backup.tar.gz root@CONTROL_PLANE_NAME:~/
    
  2. Use ssh to connect to the cluster's control plane as root.

    ssh root@CONTROL_PLANE_NAME
    
  3. From the control plane, extract the contents of the tar file to the /etc/kubernetes/pki directory.

    tar -xzvf certs_backup.tar.gz -C /etc/kubernetes/pki/
    
  4. Exit the control plane.
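
Before extracting the archive in step 3, you might want to confirm that it contains the certificate files that the etcdctl commands on this page reference. The following sketch lists the archive and checks for those files. The function name is hypothetical, and the file list covers only the files named on this page, not everything in /etc/kubernetes/pki.

```shell
#!/bin/sh
# Sketch only: check_certs_archive is an illustrative name.

# Succeed only if the archive given as $1 lists every etcd certificate
# file referenced by the etcdctl commands on this page.
check_certs_archive() {
    archive=$1
    listing=$(tar -tzf "$archive") || return 1
    for f in etcd/ca.crt etcd/server.crt etcd/server.key \
             etcd/healthcheck-client.crt etcd/healthcheck-client.key; do
        # Entries are stored with a leading "./" because the archive
        # was created with "-C /etc/kubernetes/pki .".
        printf '%s\n' "$listing" | grep -Fqx "./$f" || {
            echo "missing from archive: $f" >&2
            return 1
        }
    done
}
```

For example, run check_certs_archive certs_backup.tar.gz on the control plane and extract the archive only if the function succeeds.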

Restore the etcd store

When restoring the etcd store, the process depends upon whether or not the cluster is running in high availability (HA) mode and, if so, whether or not quorum has been preserved. Use the following guidance to restore the etcd store for a given cluster failure situation:

  • If the failed cluster is not running in HA mode, restore the etcd store on the control plane with the following steps.

  • If the cluster is running in HA mode and quorum is preserved, do nothing. As long as quorum is preserved, you don't need to restore failed clusters.

  • If the cluster is running in HA mode and quorum is lost, repeat the following steps to restore the etcd store for each failed member.
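
The guidance above can be sketched as a small decision helper. etcd keeps quorum while a majority of its members are healthy, that is, more than half. The function names are hypothetical, and the sketch assumes you already know how many etcd members the failed cluster has and how many are still healthy.

```shell
#!/bin/sh
# Sketch only: quorum_size and restore_action are illustrative names.

# Print the number of healthy members needed for quorum in a cluster
# of $1 members (a majority: more than half).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# restore_action MEMBERS HEALTHY
# Print which case of the guidance applies to a failed cluster.
restore_action() {
    members=$1
    healthy=$2
    if [ "$members" -le 1 ]; then
        echo "non-HA: restore etcd on the control plane"
    elif [ "$healthy" -ge "$(quorum_size "$members")" ]; then
        echo "HA, quorum preserved: no restore needed"
    else
        echo "HA, quorum lost: restore etcd on each failed member"
    fi
}
```

For example, a three-member cluster needs two healthy members for quorum, so restore_action 3 1 reports that quorum is lost.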

Follow these steps from the workstation to remove and restore the etcd store on a control plane for a failed cluster:

  1. Create a /backup directory in the root directory of the control plane.

    ssh root@CONTROL_PLANE_NAME "mkdir /backup"
    

    This step is not strictly required, but we recommend it. The following steps assume you have created a /backup directory.

  2. Copy the etcd snapshot file, snapshot.db, from the workstation to the backup directory on the cluster control plane.

    sudo scp snapshot.db root@CONTROL_PLANE_NAME:/backup
    
  3. Use SSH to connect to the control plane node:

    ssh root@CONTROL_PLANE_NAME
    
  4. Stop the etcd and kube-apiserver static pods by moving their manifest files out of the /etc/kubernetes/manifests directory and into the /backup directory.

    sudo mv /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /backup/kube-apiserver.yaml
    
  5. Remove the etcd data directory.

    rm -rf /var/lib/etcd/
    
  6. Run etcdctl snapshot restore using Docker.

    sudo docker run --rm -t \
        -v /var/lib:/var/lib \
        -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
        -v /backup:/backup \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.2.24 etcdctl \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        --data-dir=/var/lib/etcd \
        --name=CONTROL_PLANE_NAME \
        --initial-advertise-peer-urls=https://CONTROL_PLANE_IP:2380 \
        --initial-cluster=CONTROL_PLANE_NAME=https://CONTROL_PLANE_IP:2380 \
        snapshot restore /backup/snapshot.db
    

    The entries for --name, --initial-advertise-peer-urls, and --initial-cluster can be found in the etcd.yaml manifest file that was moved to the /backup directory.

  7. Ensure that /var/lib/etcd was recreated and that a new member is created in /var/lib/etcd/member.

  8. Move the etcd and kube-apiserver manifests back to the /etc/kubernetes/manifests directory so that the static pods can restart.

    sudo mv /backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
    sudo mv /backup/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
    
  9. Exit the control plane and, from the workstation, run a shell in the etcd container:

    kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
        ETCD_POD_NAME --container etcd --namespace kube-system \
        -- /bin/sh
    
  10. Use etcdctl to confirm the added member is working properly.

    ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt  \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --endpoints=CONTROL_PLANE_IP:2379 \
        endpoint health
    

    If you are restoring multiple failed members, run the command again after all failed members have been restored, and list the control plane IP addresses of all restored members in the --endpoints flag.

    For example:

    ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt  \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --endpoints=10.200.0.3:2379,10.200.0.4:2379,10.200.0.5:2379 \
        endpoint health
    

    If the command reports that each endpoint is healthy, your cluster should be working properly.
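
Two parts of the restore procedure involve copying values by hand: step 6 takes its --name, --initial-advertise-peer-urls, and --initial-cluster values from the etcd.yaml manifest, and step 10 needs a comma-separated --endpoints value. The following sketch shows one way to produce both. The function names are hypothetical, and the sketch assumes the manifest lists each etcd flag on its own line in the form "- --flag=value".

```shell
#!/bin/sh
# Sketch only: etcd_flag and endpoints_list are illustrative names.

# etcd_flag FLAG MANIFEST
# Print the value of one etcd flag (given without leading dashes)
# from a manifest such as /backup/etcd.yaml.
etcd_flag() {
    sed -n "s/^[[:space:]]*-[[:space:]]*--$1=\(.*\)\$/\1/p" "$2" | head -n 1
}

# endpoints_list IP...
# Join control plane IP addresses into a value for --endpoints,
# using the etcd client port 2379 from the commands above.
endpoints_list() {
    out=
    for ip in "$@"; do
        out="${out:+$out,}$ip:2379"
    done
    printf '%s\n' "$out"
}
```

For example, etcd_flag initial-cluster /backup/etcd.yaml prints the value to pass to --initial-cluster, and endpoints_list 10.200.0.3 10.200.0.4 10.200.0.5 prints the --endpoints value used in the last example.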