Backing up and restoring clusters

This page describes how to manually create and restore backups of GKE On-Prem admin and user clusters. This page also provides a script that you can use to automatically back up your clusters.

You should create backups for recovery from foreseen disasters that might damage etcd data and Secrets. Be sure to store backups in a location that is outside of the cluster and that is not dependent on the cluster's operation. If you want to be safe, consider creating a copy of the backup, too.

While the etcd events Pod that runs in every cluster is not vital to the restoration of a user cluster, you can follow a similar process to back it up.

Restrictions

  • Backing up application-specific data is out of scope for this feature.
  • Secrets remain valid until you manually rotate them.
  • Workloads scheduled after you create a backup aren't restored with that backup.
  • Currently, you aren't able to restore from failed cluster upgrades.
  • This procedure is not intended to restore a deleted cluster.

Known issues

When you run sudo commands, you might encounter the following error:

sudo: unable to resolve host gke-admin-master-[CLUSTER_ID]

If you do, add the following line to the /etc/hosts file:

127.0.0.1 gke-admin-master-[CLUSTER_ID]

User cluster backups

A user cluster backup contains a snapshot of the user cluster's etcd. A cluster's etcd contains, among other things, all of the Kubernetes objects and any custom objects required to manage cluster state. This snapshot contains the data required to recreate the cluster's components and workloads.

Backing up a user cluster

A user cluster's etcd is stored in its control plane node, which you can access using the admin cluster's kubeconfig.

To create a snapshot of etcd, execute the following steps:

  1. Shell into the kube-etcd container:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] exec \
    -it -n [USER_CLUSTER_NAME] kube-etcd-0 -c \
    kube-etcd -- bin/sh

    where:

    • [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
    • [USER_CLUSTER_NAME] is name of the user cluster. Specifically, you're passing in a namespace in the admin cluster that is named after the user cluster.
  2. From the shell, use etcdctl to a create backup named snapshot.db in the local directory:

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etcd.local.config/certificates/etcdCA.crt \
    --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key \
    snapshot save snapshot.db
  3. Exit the container:

    exit
  4. Copy the backup out of the kube-etcd container using kubectl cp:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    [USER_CLUSTER_NAME]/kube-etcd-0:snapshot.db [DIRECTORY] -c kube-etcd

    where [RELATIVE_DIRECTORY] is a path where you want to store your backup.

Restoring a user cluster backup

  • Before you restore a backup, be sure to diagnose your cluster and resolve existing issues. Restoring a backup to a problematic cluster might recreate or exacerbate issues. Contact the GKE On-Prem support team for further assistance on restoring your clusters.

  • If you created a HA user cluster, you should run these steps once per etcd cluster member. You can use the same snapshot when restoring each etcd member. Don't take these steps unless all etcd Pods are crashlooping: this indicates that there is data corruption.

Crashlooping etcd Pod

The following instructions explain how to restore a backup in cases where a user cluster's etcd data has become damaged and its etcd Pod is crashlooping. You can recover by deploying a etcd Pod to the existing Pod's volumes and overwriting the damaged data with the backup, assuming that the user cluster's API server is running and can schedule new Pods.

  1. Copy the etcd Pod specification below to a file, restore-etcd.yaml, after populating the following placeholder values:

    • [MEMBER_NUMBER] is the numbered Pod that you are restoring.
    • [NODE_NAME] is the node on which the [MEMBER_NUMBER[ Pod is running.
    • [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
    • [USER_CLUSTER_NAME] is the name of the user cluster.
    • [DEFAULT_TOKEN] is used for authentication. You can find this value by running the following command:

      kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
      -n [USER_CLUSTER_NAME] get pods kube-etcd-0 \
      -o yaml | grep default-token

    restore-etcd.yaml

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        Component: restore-etcd-[MEMBER_NUMBER]
      name: restore-etcd-0
      namespace: restore-etcd-[MEMBER_NUMBER]
    spec:
      restartPolicy: Never
      containers:
      - command: ["/bin/sh"]
        args: ["-ec", "while :; do echo '.'; sleep 5 ; done"]
        image: gcr.io/gke-on-prem-release/etcd:v3.2.24-1-gke.0
        imagePullPolicy: IfNotPresent
        name: restore-etcd
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/etcd
          name: data
        - mountPath: /etcd.local.config/certificates
          name: etcd-certs
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: [DEFAULT_TOKEN]
          readOnly: true
      dnsPolicy: ClusterFirst
      hostname: restore-etcd-0
      imagePullSecrets:
      - name: private-registry-creds
      nodeSelector:
        kubernetes.googleapis.com/cluster-name: [USER_CLUSTER_NAME]
        kubernetes.io.hostname: [NODE_NAME]
      priority: 0
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      subdomain: restore-etcd
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-kube-etcd-[MEMBER_NUMBER]
      - name: etcd-certs
        secret:
          defaultMode: 420
          secretName: kube-etcd-certs
      - name: [DEFAULT_TOKEN]
        secret:
          defaultMode: 420
          secretName: [DEFAULT_TOKEN]
  2. Deploy the Pod:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] create -f restore-etcd.yaml
  3. Copy etcd's backup file, snapshot.db, to the new Pod. snapshot.db lives at the relative directory where you created the backup:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    cp [RELATIVE_DIRECTORY]/snapshot.db \
    [USER_CLUSTER_NAME]/restore-etcd-0:snapshot.db
  4. Shell into the restore-etcd Pod:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -it -n [USER_CLUSTER_NAME] exec restore-etcd-0 -- bin/sh
  5. Run the following command to create a new default.etcd folder containing the backup:

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etcd.local.config/certificates/etcdCA.crt \
    --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key \
    snapshot restore snapshot.db
  6. Overwrite the damaged etcd data with the backup:

    rm -r var/lib/etcd/*; cp -r default.etcd/* var/lib/etcd/
  7. Exit the container:

    exit
  8. Delete the crashing etcd Pod:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] delete pod kube-etcd-0
  9. Verify that the etcd Pod is no longer crashing.

  10. Remove restore-etcd.yaml and delete the restore-etcd Pod:

    rm restore-etcd.yaml;
    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] delete pod restore-etcd-0

Admin cluster backups

An admin cluster backup contains the following:

  • A snapshot of the admin cluster's etcd.
  • Admin control plane's Secrets, which are required for authenticating to the admin and user clusters.

Complete the following steps before you create an admin cluster backup:

  1. Find the admin cluster's external IP address, which is used to SSH in to the admin cluster control plane:

    kubectl --kubeconfig [ADMIN_KUBECONFIG] get nodes -n kube-system -o wide | grep gke-admin-master

    where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.

  2. Create an SSH key called vsphere_tmp from the admin cluster's private key.

    You can find the private key from the admin clusters Secrets:

    kubectl --kubeconfig [ADMIN_KUBECONFIG] get secrets sshkeys -n kube-system -o yaml

    In the command output, you can find the private key in the vsphere_tmp field.

    Copy the private key to vsphere_tmp:

    echo "[PRIVATE_KEY]" | base64 -d > vsphere_tmp; chmod 600 vsphere_tmp
  3. Check that you can shell into the admin control plane using this private key:

    ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
  4. Exit the container:

    exit

Backing up an admin cluster

You can back up an admin cluster's etcd and its control plane's Secrets.

etcd

To back up the admin cluster's etcd:

  1. Get the etcd Pod's name:

    kubectl --kubeconfig [ADMIN_KUBECONFIG] get pods \
    -n kube-system | grep etcd-gke-admin-master
  2. Shell into Pod's kube-etcd container:

    kubectl --kubeconfig [ADMIN_KUBECONFIG]  exec -it \
    -n kube-system [ADMIN_ETCD_POD] -- bin/sh

    where [ADMIN_ETCD_POD] is the name of the etcd Pod.

  3. From the shell, use etcdctl to a create backup named snapshot.db in the local directory:

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save snapshot.db
  4. Exit the container:

    exit
  5. Copy the backup out of the kube-etcd container using kubectl cp:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    kube-system/[ADMIN_ETCD_POD]:snapshot.db [RELATIVE_DIRECTORY]

    where [RELATIVE_DIRECTORY] is a path where you want to store your backup.

Secrets

To back up the admin control plane's Secrets:

  1. Shell into the admin control plane node:

    ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]

    where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.

  2. Create a local backup directory. (This is optional, but highly recommended. You need to change the backup Secrets' permissions to copy them out of the node):

    mkdir backup
  3. Locally copy the Secrets to the local backup directory:

    sudo cp -r /etc/kubernetes/pki/* backup/
  4. Change permissions of the backup Secrets:

    sudo chmod -R +rw backup/
  5. Exit the container:

    exit
  6. Run scp to copy the backup folder out of the admin control plane node:

    sudo scp -r -i vsphere_tmp  ubuntu@[EXTERNAL_IP]:backup/ [RELATIVE_DIRECTORY]

    where [RELATIVE_DIRECTORY] is a path where you want to store your backup.

Restoring an admin cluster

The following procedure recreates a backed-up admin cluster and all of the user control planes it managed when its etcd snapshot was created.

  1. Run scp to copy snapshot.db to the admin control plane:

    sudo scp -i vsphere_tmp snapshot.db ubuntu@[EXTERNAL_IP]:

    where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.

  2. Shell into the admin control plane:

    sudo ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
  3. Copy snapshot.db/ to /mnt:

    sudo cp snapshot.db /mnt/
  4. Make temporary directory, like backup:

    mkdir backup
  5. Exit the admin control plane:

    exit
  6. Copy the certificates to backup/:

    sudo scp -r -i vsphere_tmp [BACKUP_CERT_FILE] ubuntu@[EXTERNAL_IP]:backup/
  7. Shell into the admin control plane node:

    ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]

    where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.

  8. Run kubeadm reset. This stops anything still running in the admin cluster, deletes all etcd data, and deletes Secrets in /etc/kubernetes/pki/:

    sudo kubeadm reset --ignore-preflight-errors=all
  9. Copy the backup Secrets to /etc/kubernetes/pki/:

    sudo cp -r backup/* /etc/kubernetes/pki/
  10. Run etcdctl restore with Docker:

    sudo docker run --rm \
    -v '/mnt:/backup' \
    -v '/var/lib/etcd:/var/lib/etcd' --env ETCDCTL_API=3 'k8s.gcr.io/etcd-amd64:3.1.12' /bin/sh -c "etcdctl snapshot restore '/backup/snapshot.db'; mv /default.etcd/member/ /var/lib/etcd/"
  11. Run kubeadm init. This reuses all of the backup Secrets and restarts etcd with the restored snapshot:

    sudo kubeadm init --config /etc/kubernetes/kubeadm_config.yaml --ignore-preflight-errors=DirAvailable--var-lib-etcd
  12. Exit the admin control plane:

    exit
  13. Copy the newly generated kubeconfig file out of the admin node:

    sudo scp -i vsphere_tmp ubuntu@[EXTERNAL_IP]:[HOME]/.kube/config kubeconfig

    where:

    • [EXTERNAL_IP] is the admin control plane's external IP address.
    • [HOME] is the home directory on the admin node.

    Now you can use this new kubeconfig file to access restored cluster.

Backup script

You can use the script given here to automatically back up your clusters. Before you run the script, fill in values for the five variables at the beginning of the script.

Set USER_CLUSTER_NAMESPACE to the name of your user cluster. The name of your user cluster is a namespace in the admin cluster.

Set EXTERNAL_IP to the VIP that you reserved for the admin control plane service.

Set SSH_PRIVATE_KEY to the path of the SSH key you created when you set up your admin workstation.

#!/usr/bin/env bash

# Automates manual steps for taking backups of user and admin clusters.
# Fill in the variables below before running the script.

BACKUP_DIR=""                       # path to store user and admin cluster backups
ADMIN_CLUSTER_KUBECONFIG=""         # path to admin cluster kubeconfig
USER_CLUSTER_NAMESPACE=""           # user cluster namespace
EXTERNAL_IP=""                      # admin control plane node external ip - follow steps in documentation
SSH_PRIVATE_KEY=""                  # path to vsphere ssh private key - follow steps in documentation


mkdir -p $BACKUP_DIR
mkdir $BACKUP_DIR/pki

# USER CLUSTER BACKUP

# Snapshot user cluster etcd
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG}  exec -it -n ${USER_CLUSTER_NAMESPACE} kube-etcd-0 -c kube-etcd -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key snapshot save ${USER_CLUSTER_NAMESPACE}_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp ${USER_CLUSTER_NAMESPACE}/kube-etcd-0:user_snapshot.db $BACKUP_DIR/

# ADMIN CLUSTER BACKUP

# Copy admin certs
ssh -o StrictHostKeyChecking=no -i vsphere_tmp ubuntu@${EXTERNAL_IP} 'sudo chmod -R +rw /etc/kubernetes/pki/*'
scp -o StrictHostKeyChecking=no -r -i vsphere_tmp ubuntu@${EXTERNAL_IP}:/etc/kubernetes/pki/* ${BACKUP_DIR}/pki/

# Snapshot admin cluster etcd
admin_etcd=$(kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} get pods -n kube-system -o=name | grep etcd | cut -c 5-)
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG}  exec -it -n kube-system ${admin_etcd} -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save admin_snapshot.db"

Troubleshooting

For more information, refer to Troubleshooting.

Diagnosing cluster issues using gkectl

Use gkectl diagnosecommands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.

Default logging behavior

For gkectl and gkeadm it is sufficient to use the default logging settings:

  • By default, log entries are saved as follows:

    • For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.
    • For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.
  • All log entries are saved in the log file, even if they are not printed in the terminal (when --alsologtostderr is false).
  • The -v5 verbosity level (default) covers all the log entries needed by the support team.
  • The log file also contains the command executed and the failure message.

We recommend that you send the log file to the support team when you need help.

Specifying a non-default location for the log file

To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify will not be symlinked with the local directory.

To specify a non-default location for the gkeadm log file, use the --log_file flag.

Locating Cluster API logs in the admin cluster

If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:

  1. Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
  2. Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager

What's next