This document shows how to back up and restore the etcd store for an admin cluster created with Google Distributed Cloud (software only) for VMware. This document also provides a script that you can use to automatically back up a cluster's etcd store. You can also back up and restore an admin cluster using the gkectl command-line tool.
You can create a backup file for recovery from unexpected disasters that might damage your cluster's etcd data. Store the backup file in a location that is outside of the cluster and isn't dependent on the cluster's operation.
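For example, one option is to copy the backup files that you create later in this document from your admin workstation to a Cloud Storage bucket. The following is a minimal sketch, where BUCKET_NAME is a bucket that you have already created:
gsutil cp [RELATIVE_DIRECTORY]/snapshot.db gs://BUCKET_NAME/admin-cluster/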
Limitations
The backup and restore procedure described in this document has the following limitations:
This procedure doesn't back up application-specific data.
This procedure doesn't back up your PersistentVolumes.
Workloads scheduled after you create a backup aren't restored with that backup.
You can't restore a cluster after a failed upgrade.
This procedure isn't intended to restore a deleted cluster.
For more information about limitations, see Infrastructure incompatibility.
Backing up an admin cluster
An admin cluster backup contains the following:
- A snapshot of the admin cluster's etcd.
- Admin control plane's Secrets, which are required for authenticating to the admin and user clusters.
Complete the following steps before you create an admin cluster backup:
Find the admin cluster's external IP address, which is used to SSH in to the admin cluster control plane:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -n kube-system -o wide | grep master
where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
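If you prefer to capture the address in a shell variable rather than reading it from the table output, the following is a minimal sketch that assumes the node's external IP address appears in the EXTERNAL-IP (seventh) column of the -o wide output:
EXTERNAL_IP=$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -o wide | grep master | awk '{print $7}')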
Create an SSH key called vsphere_tmp from the admin cluster's private key. You can find the private key in the admin cluster's Secrets:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets sshkeys -n kube-system -o yaml
In the command output, you can find the private key in the vsphere_tmp field. Copy the private key to vsphere_tmp:
echo "[PRIVATE_KEY]" | base64 -d > vsphere_tmp; chmod 600 vsphere_tmp
where [PRIVATE_KEY] is the base64-encoded value of the vsphere_tmp field from the previous output.
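Alternatively, you can extract and decode the key in one step. The following is a minimal sketch, assuming the Secret stores the private key under the data field vsphere_tmp as shown in the output above:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secret sshkeys -n kube-system \
    -o jsonpath='{.data.vsphere_tmp}' | base64 -d > vsphere_tmp
chmod 600 vsphere_tmp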
Check that you can shell into the admin control plane using this private key:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Exit the admin control plane node:
exit
Backing up an admin cluster's etcd store
To back up the admin cluster's etcd store:
Get the etcd Pod's name:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get pods \
    -n kube-system -l component=etcd,tier=control-plane \
    -o jsonpath='{$.items[*].metadata.name}{"\n"}'
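If you prefer to capture the Pod name in a shell variable instead of copying it from the output, the following is a minimal sketch (assuming a single admin control-plane etcd Pod):
ADMIN_ETCD_POD=$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get pods \
    -n kube-system -l component=etcd,tier=control-plane \
    -o jsonpath='{.items[0].metadata.name}')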
Shell into the Pod's kube-etcd container:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] exec -it \
    -n kube-system [ADMIN_ETCD_POD] -- /bin/sh
where [ADMIN_ETCD_POD] is the name of the etcd Pod.
From the shell, use etcdctl to create a backup named snapshot.db in the /tmp directory:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save /tmp/snapshot.db
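Optionally, you can sanity-check the snapshot before copying it out of the container; depending on the etcdctl version in the image, snapshot status reports the snapshot's hash, revision, and size:
ETCDCTL_API=3 etcdctl --write-out=table snapshot status /tmp/snapshot.db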
Exit the container:
exit
Copy the backup out of the kube-etcd container using kubectl cp:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    kube-system/[ADMIN_ETCD_POD]:tmp/snapshot.db [RELATIVE_DIRECTORY]
where [RELATIVE_DIRECTORY] is a path where you want to store your backup.
Backing up an admin cluster's Secrets
To back up the admin control plane's Secrets:
Use SSH to connect to the admin control plane node:
ssh -i vsphere_tmp ubuntu@EXTERNAL_IP
Replace EXTERNAL_IP with the admin control plane's external IP address, which you noted previously.
Optional but highly recommended: Create a local backup directory. You need to change the backup Secrets' permissions to copy them out of the node.
mkdir backup
Copy the Secrets to the local backup directory:
sudo cp -r /etc/kubernetes/pki/* backup/
Change the permissions of the backup Secrets:
sudo chmod -R a+rX backup/
Exit the admin control plane node:
exit
Run scp to copy the backup folder out of the admin control plane node:
sudo scp -r -i vsphere_tmp ubuntu@EXTERNAL_IP:backup/ RELATIVE_DIRECTORY
Replace RELATIVE_DIRECTORY with a path where you want to store your backup.
Restoring an admin cluster
The following procedure recreates a backed-up admin cluster and all of the user control planes it managed when its etcd snapshot was created.
Run scp to copy snapshot.db to the admin control plane:
sudo scp -i vsphere_tmp snapshot.db ubuntu@[EXTERNAL_IP]:
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Shell into the admin control plane:
sudo ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Copy snapshot.db to /mnt:
sudo cp snapshot.db /mnt/
Make a temporary directory, such as backup:
mkdir backup
Exit the admin control plane:
exit
Copy the certificates to backup/:
sudo scp -r -i vsphere_tmp [BACKUP_CERT_FILE] ubuntu@[EXTERNAL_IP]:backup/
where [BACKUP_CERT_FILE] is the path to the Secrets backup that you created previously.
Shell into the admin control plane node:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Stop kube-etcd and kube-apiserver by moving their static Pod manifests out of the manifests directory:
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml
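Before restoring the etcd data, you can optionally confirm that the kubelet has stopped both static Pods. This is a sanity check rather than part of the documented procedure:
sudo crictl ps | grep -E 'etcd|kube-apiserver'
# Expect no output once both static Pods have stopped.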
Copy the backup Secrets to /etc/kubernetes/pki/:
sudo cp -r backup/* /etc/kubernetes/pki/
Run etcdctl restore with Docker:
sudo docker run --rm \
    -v '/mnt:/backup' \
    -v '/var/lib/etcd:/var/lib/etcd' \
    --env ETCDCTL_API=3 \
    'gcr.io/gke-on-prem-release/etcd-util:GKE_ON_PREM_VERSION' /bin/sh -c \
    "etcdctl snapshot restore '/backup/snapshot.db'; rm -r /var/lib/etcd/*; mv /default.etcd/member/ /var/lib/etcd/"
where GKE_ON_PREM_VERSION is the etcd-util image tag that corresponds to your Google Distributed Cloud version.
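To confirm that the restore populated the etcd data directory, you can list its contents. This quick check assumes the standard etcd layout, in which the member directory contains snap and wal subdirectories:
sudo ls /var/lib/etcd/member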
Restart kube-etcd and kube-apiserver:
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
Verify that kube-etcd and kube-apiserver have started:
sudo crictl ps -a
Copy /etc/kubernetes/admin.conf to a .kube folder so that it can be accessed from the admin workstation:
mkdir -p [HOME]/.kube
sudo cp -i /etc/kubernetes/admin.conf [HOME]/.kube/config
sudo chown $(id -u):$(id -g) [HOME]/.kube/config
Exit the admin control plane:
exit
Copy the newly generated kubeconfig file out of the admin node:
sudo scp -i vsphere_tmp ubuntu@[EXTERNAL_IP]:[HOME]/.kube/config kubeconfig
sudo chown $(id -u):$(id -g) kubeconfig
where:
- [EXTERNAL_IP] is the admin control plane's external IP address.
- [HOME] is the home directory on the admin node.
Now you can use this new kubeconfig file to access the restored cluster.
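For example, a quick read-only check with the copied kubeconfig file:
kubectl --kubeconfig kubeconfig get nodes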
Troubleshooting an admin cluster restore
If you encounter an issue when restoring the admin cluster, contact Google Support to resolve it. In the meantime, you can check the following to troubleshoot further.
Find the etcd container ID:
sudo crictl ps -a | grep [ADMIN_ETCD_POD]
where [ADMIN_ETCD_POD] is the name of the etcd Pod.
Examine the logs from the etcd container:
sudo crictl logs [ETCD_CONTAINER_ID]
where [ETCD_CONTAINER_ID] is the ID of the etcd container.
Look for permission denied log messages such as the following:
etcdserver: create snapshot directory error:mkdir /var/lib/etcd/member/snap: permission denied
If you find permission denied messages, update the ownership of /opt/data/var/lib/etcd/:
sudo chown -R 2001:2001 /opt/data/var/lib/etcd/
Verify that kube-etcd and kube-apiserver have started:
sudo crictl ps
Automatic cluster backup
You can use the script given here as an example of how to automatically back up your clusters. Note that the following script is not supported, and should only be used as a reference for writing a more robust and complete script. Before you run the script, fill in values for the variables at the beginning of the script:
- Set BACKUP_DIR to the path where you want to store the admin and user cluster backups. This path should not exist.
- Set ADMIN_CLUSTER_KUBECONFIG to the path of the admin cluster's kubeconfig file.
- Set USER_CLUSTER_NAMESPACE to the name of your user cluster. The name of your user cluster is a namespace in the admin cluster.
- Set EXTERNAL_IP to the VIP that you reserved for the admin control plane service.
- Set SSH_PRIVATE_KEY to the path of your SSH key.
- If you are using a private network, set JUMP_IP to your network's jump server's IP address.
#!/usr/bin/env bash
# Automates manual steps for taking backups of user and admin clusters.
# Fill in the variables below before running the script.
BACKUP_DIR="" # path to store user and admin cluster backups
ADMIN_CLUSTER_KUBECONFIG="" # path to admin cluster kubeconfig
USER_CLUSTER_NAMESPACE="" # user cluster namespace
EXTERNAL_IP="" # admin control plane node external ip - follow steps in documentation
SSH_PRIVATE_KEY="" # path to vsphere_tmp ssh private key - follow steps in documentation
JUMP_IP="" # network jump server IP - leave empty string if not using private network.
mkdir -p $BACKUP_DIR
mkdir $BACKUP_DIR/pki
# USER CLUSTER BACKUP
# Snapshot user cluster etcd
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n ${USER_CLUSTER_NAMESPACE} kube-etcd-0 -c kube-etcd -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key snapshot save /tmp/${USER_CLUSTER_NAMESPACE}_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp ${USER_CLUSTER_NAMESPACE}/kube-etcd-0:tmp/${USER_CLUSTER_NAMESPACE}_snapshot.db $BACKUP_DIR/user-cluster_${USER_CLUSTER_NAMESPACE}_snapshot.db
# ADMIN CLUSTER BACKUP
# Set up ssh options
SSH_OPTS=(-oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY})
if [ "${JUMP_IP}" != "" ]; then
SSH_OPTS+=(-oProxyCommand="ssh -oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY} -W %h:%p ubuntu@${JUMP_IP}")
fi
# Copy admin certs
ssh "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP} 'sudo chmod -R a+rX /etc/kubernetes/pki/*'
scp -r "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP}:/etc/kubernetes/pki/* ${BACKUP_DIR}/pki/
# Snapshot admin cluster etcd
admin_etcd=$(kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} get pods -n kube-system -l component=etcd,tier=control-plane -ojsonpath='{$.items[*].metadata.name}{"\n"}')
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n kube-system ${admin_etcd} -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /tmp/admin_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp -n kube-system ${admin_etcd}:tmp/admin_snapshot.db $BACKUP_DIR/admin-cluster_snapshot.db
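To run this script on a schedule, one approach (not part of the documented procedure) is to give each run a unique backup directory and invoke the script from cron. The following is a minimal sketch, assuming the script is saved as /usr/local/bin/cluster-backup.sh:
# In the script, use a timestamped BACKUP_DIR so the path does not already exist:
BACKUP_DIR="/backups/cluster-$(date +%Y%m%d-%H%M%S)"
# Example crontab entry that runs the backup every day at 02:00:
0 2 * * * /usr/local/bin/cluster-backup.sh >> /var/log/cluster-backup.log 2>&1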
What's next
- Back up and restore a user cluster
- Diagnose cluster issues
- Learn about augur, an open-source tool for restoring individual objects from etcd backups.