Back up and restore clusters

This page describes how to back up and restore clusters created with GKE on Bare Metal. These instructions apply to all cluster types supported by GKE on Bare Metal.

Back up a cluster

The backup process has two parts. First, a snapshot is made from the etcd store. Then, the related PKI certificates are saved to a tar file. The etcd store is the Kubernetes backing store for all cluster data and contains all the Kubernetes objects and custom objects required to manage cluster state. The PKI certificates are used for authentication over TLS. This data is backed up from the cluster's control plane or from one of the control planes for a high-availability (HA) deployment.

We recommend that you back up your clusters regularly so that your snapshot data stays relatively current. How often you back up depends on how frequently significant changes occur in your clusters.

Make a snapshot of the etcd store

In GKE on Bare Metal, a Pod named etcd-CONTROL_PLANE_NAME in the kube-system namespace runs etcd for that control plane. To back up the cluster's etcd store, perform the following steps from your admin workstation:

  1. Use kubectl get po to identify the etcd Pod.

    kubectl --kubeconfig CLUSTER_KUBECONFIG get po -n kube-system \
        -l 'component=etcd,tier=control-plane'

    The response includes the etcd Pod name and its status.

  2. Use kubectl describe pod to see the containers running in the etcd Pod, including the etcd container.

    kubectl --kubeconfig CLUSTER_KUBECONFIG describe pod ETCD_POD_NAME -n kube-system
  3. Run a shell in the etcd container:

    kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
        ETCD_POD_NAME --container etcd --namespace kube-system \
        -- bin/sh
  4. From the shell within the etcd container, use etcdctl (version 3 of the API) to save a snapshot of the etcd store to a file in the /tmp directory.

    ETCDCTL_API=3 etcdctl --endpoints= \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/peer.crt \
        --key=/etc/kubernetes/pki/etcd/peer.key \
        snapshot save /tmp/snapshotDATESTAMP.db

    Replace DATESTAMP with the current date so that the new snapshot doesn't overwrite a previous one.

  5. Exit from the shell in the container and run the following command to copy the snapshot file to the admin workstation.

    kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
        kube-system/ETCD_POD_NAME:/tmp/snapshotDATESTAMP.db \
        --container etcd snapshot.db
  6. Copy the etcdctl binary from the etcd Pod so that you can use the same binary during the restore process.

    kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
      kube-system/ETCD_POD_NAME:/usr/local/bin/etcdctl \
      --container etcd etcdctl
  7. Store the snapshot file and the etcdctl binary in a location that is outside of the cluster and is not dependent on the cluster's operation.
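
The DATESTAMP placeholder used in the snapshot step can be generated rather than typed by hand. The following is a minimal sketch, not part of GKE on Bare Metal: the function name snapshot_name and the YYYYMMDD format are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any GKE on Bare Metal tooling):
# print a date-stamped snapshot file name such as snapshot20240131.db.
snapshot_name() {
    # $1: optional date stamp; defaults to today's UTC date as YYYYMMDD.
    stamp="${1:-$(date -u +%Y%m%d)}"
    printf 'snapshot%s.db\n' "$stamp"
}

# Example use inside the etcd container (the remaining etcdctl flags are
# the ones shown in the snapshot step above):
#   etcdctl ... snapshot save "/tmp/$(snapshot_name)"
```

A date-only stamp keeps one snapshot per day. If you take several snapshots a day, a longer format such as `+%Y%m%dT%H%M%S` would avoid name collisions.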

Archive the PKI certificates

The certificates to be backed up are located in the /etc/kubernetes/pki directory of the control plane. The PKI certificates, together with the etcd store snapshot.db file, are needed to recover a cluster in the event that the control plane goes down completely. The following steps create a tar file containing the PKI certificates.

  1. Use ssh to connect to the cluster's control plane as root.

  2. From the control plane, create a tar file, certs_backup.tar.gz, with the contents of the /etc/kubernetes/pki directory.

    tar -czvf certs_backup.tar.gz -C /etc/kubernetes/pki .

    Creating the tar file from within the control plane preserves all the certificate file permissions.

  3. Exit the control plane and, from the workstation, copy the tar file containing the certificates to a preferred location on the workstation.

    sudo scp root@CONTROL_PLANE_NAME:certs_backup.tar.gz BACKUP_PATH
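
Before relying on the archive, it can be worth confirming that it actually contains the files you expect. The following is a minimal sketch that lists the archive and looks for one entry. The function name archive_has_file is hypothetical, and the check assumes the archive was created with `-C /etc/kubernetes/pki .` as in the preceding steps, so that entries are stored with a leading `./`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if ARCHIVE contains the entry ./RELATIVE_PATH.
# Usage: archive_has_file ARCHIVE RELATIVE_PATH
archive_has_file() {
    # tar -tzf lists the entries of a gzip-compressed archive, one per line.
    # grep -qxF matches the whole line literally and prints nothing.
    tar -tzf "$1" | grep -qxF "./$2"
}

# Example use on the workstation:
#   archive_has_file BACKUP_PATH/certs_backup.tar.gz etcd/ca.crt \
#       && echo "etcd CA certificate is in the archive"
```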

Restore a cluster

Restoring a cluster from a backup is a last resort and should be used only when a cluster has failed catastrophically and cannot be returned to service any other way, for example, when the etcd data is corrupted or the etcd Pod is in a crash loop.

The cluster restore process has two parts. First, the PKI certificates are restored on the control plane. Then, the etcd store data is restored.

Restore PKI certificates

Assuming you have backed up PKI certificates as described in Archive the PKI certificates, the following steps describe how to restore the certificates from the tar file to a control plane.

  1. Copy the PKI certificates tar file, certs_backup.tar.gz, from the workstation to the cluster control plane.

    sudo scp -r BACKUP_PATH/certs_backup.tar.gz root@CONTROL_PLANE_NAME:~/
  2. Use ssh to connect to the cluster's control plane as root.

  3. From the control plane, extract the contents of the tar file to the /etc/kubernetes/pki directory.

    tar -xzvf certs_backup.tar.gz -C /etc/kubernetes/pki/
  4. Exit the control plane.

Restore the etcd store

When restoring the etcd store, the process depends upon whether or not the cluster is running in high availability (HA) mode and, if so, whether or not quorum has been preserved. Use the following guidance to restore the etcd store for a given cluster failure situation:

  • If the failed cluster is not running in HA mode, restore the etcd store on the control plane with the following steps.

  • If the cluster is running in HA mode and quorum is preserved, do nothing. As long as quorum is preserved, you don't need to restore the etcd store for failed members.

  • If the cluster is running in HA mode and quorum is lost, repeat the following steps to restore the etcd store for each failed member.
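
The guidance above can be expressed as a small decision helper. This is a sketch under stated assumptions: etcd quorum is a majority of members (floor(N/2) + 1), the cluster has already been judged to have failed, and the function names quorum_size and restore_action and their output strings are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the three cases listed above.

# quorum_size TOTAL: number of healthy members needed for a majority.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# restore_action TOTAL HEALTHY: print which of the three cases applies,
# given the total and healthy etcd member counts of a failed cluster.
restore_action() {
    total=$1
    healthy=$2
    if [ "$total" -le 1 ]; then
        # Not running in HA mode: restore on the single control plane.
        echo "restore-on-control-plane"
    elif [ "$healthy" -ge "$(quorum_size "$total")" ]; then
        # HA mode with quorum preserved: no etcd restore is needed.
        echo "do-nothing"
    else
        # HA mode with quorum lost: restore each failed member.
        echo "restore-each-failed-member"
    fi
}

# Example: restore_action 3 1   prints restore-each-failed-member
```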

Follow these steps from the workstation to remove and restore the etcd store on a control plane for a failed cluster:

  1. Create a /backup directory in the root directory of the control plane.

    ssh root@CONTROL_PLANE_NAME "mkdir /backup"

    This step is not strictly required, but we recommend it. The following steps assume you have created a /backup directory.

  2. Copy the etcd snapshot file, snapshot.db, and the etcdctl binary from the workstation to the /backup directory on the cluster control plane.

    sudo scp snapshot.db root@CONTROL_PLANE_NAME:/backup
    sudo scp etcdctl root@CONTROL_PLANE_NAME:/backup
  3. Use ssh to connect to the control plane node.

  4. Stop the etcd and kube-apiserver static pods by moving their manifest files out of the /etc/kubernetes/manifests directory and into the /backup directory.

    sudo mv /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /backup/kube-apiserver.yaml
  5. Remove the etcd data directory.

    sudo rm -rf /var/lib/etcd/
  6. Run etcdctl snapshot restore using the saved binary.

    sudo chmod +x /backup/etcdctl
    sudo ETCDCTL_API=3 /backup/etcdctl \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        --data-dir=/var/lib/etcd \
        --name=CONTROL_PLANE_NAME \
        --initial-advertise-peer-urls=https://CONTROL_PLANE_IP:2380 \
        --initial-cluster=CONTROL_PLANE_NAME=https://CONTROL_PLANE_IP:2380 \
        snapshot restore /backup/snapshot.db

    The entries for --name, --initial-advertise-peer-urls, and --initial-cluster can be found in the etcd.yaml manifest file that was moved to the /backup directory.

  7. Ensure that /var/lib/etcd was recreated and that a new member is created in /var/lib/etcd/member.

  8. Change the owner of the /var/lib/etcd directory and its contents, including /var/lib/etcd/member, to 2003. Starting with GKE on Bare Metal release 1.10.0, the etcd container runs as a non-root user with a UID and GID of 2003.

    sudo chown -R 2003:2003 /var/lib/etcd
  9. Move the etcd and kube-apiserver manifests back to the /etc/kubernetes/manifests directory so that the static pods can restart.

    sudo mv /backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
    sudo mv /backup/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
  10. Run a shell in the etcd container:

    kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
        ETCD_POD_NAME --container etcd --namespace kube-system \
        -- bin/sh
  11. From the shell within the etcd container, use etcdctl to confirm that the restored member is working properly.

    ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt  \
        --key=/etc/kubernetes/pki/etcd/peer.key \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --endpoints=CONTROL_PLANE_IP:2379 \
        endpoint health

    If you are restoring multiple failed members, wait until all failed members have been restored, then run the command with the control plane IP addresses of all restored members in the --endpoints field.

    For example:

    ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt  \
        --key=/etc/kubernetes/pki/etcd/peer.key \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --endpoints=,, \
        endpoint health

    If the command reports success for each endpoint, your cluster should be working properly.
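
Two parts of the restore procedure lend themselves to small helpers: reading flag values out of the saved etcd.yaml manifest, and building the comma-separated --endpoints value for the health check. The following is a minimal sketch. The function names manifest_flag and join_endpoints are hypothetical, the manifest is assumed to list each flag on its own line in the form `- --flag=value`, and port 2379 is taken from the health check command above.

```shell
#!/bin/sh
# Hypothetical helpers; not part of any GKE on Bare Metal tooling.

# manifest_flag FILE FLAG: print the value of the first "- --FLAG=VALUE"
# entry in a static Pod manifest (assumes one flag per line).
manifest_flag() {
    sed -n "s/^[[:space:]]*-[[:space:]]*--$2=\(.*\)\$/\1/p" "$1" | head -n 1
}

# join_endpoints IP...: print IP:2379 for each argument, joined by commas,
# in the form that the etcdctl --endpoints flag accepts.
join_endpoints() {
    out=""
    for ip in "$@"; do
        out="${out:+$out,}${ip}:2379"
    done
    printf '%s\n' "$out"
}

# Example use on the control plane and in the etcd container:
#   manifest_flag /backup/etcd.yaml initial-cluster
#   etcdctl ... --endpoints="$(join_endpoints IP_1 IP_2 IP_3)" endpoint health
```

Reading the values from the manifest avoids retyping them by hand in the snapshot restore command, which is where a mismatch in --name or --initial-cluster is most likely to be introduced.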