This document shows how to back up and restore the etcd store for an admin cluster created with Google Distributed Cloud (software only) for VMware. This document also provides a script that you can use to automatically back up a cluster's etcd store. You can also back up and restore an admin cluster using the gkectl command-line tool.
You can create a backup file for recovery from unexpected disasters that might damage your cluster's etcd data. Store the backup file in a location that is outside of the cluster and isn't dependent on the cluster's operation.
Limitations
The backup and restore procedure described in this document has the following limitations:
This procedure doesn't back up application-specific data.
This procedure doesn't back up your PersistentVolumes.
Workloads scheduled after you create a backup aren't restored with that backup.
You can't restore a cluster after a failed upgrade.
This procedure isn't intended to restore a deleted cluster.
For more information about limitations, see Infrastructure incompatibility.
Backing up an admin cluster
An admin cluster backup contains the following:
- A snapshot of the admin cluster's etcd.
- Admin control plane's Secrets, which are required for authenticating to the admin and user clusters.
Complete the following steps before you create an admin cluster backup:
Find the admin cluster's external IP address, which is used to SSH in to the admin cluster control plane:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -n kube-system -o wide | grep master
where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
Create an SSH key called vsphere_tmp from the admin cluster's private key. You can find the private key in the admin cluster's Secrets:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets sshkeys -n kube-system -o yaml
In the command output, you can find the private key in the vsphere_tmp field.
Copy the private key to vsphere_tmp:
echo "[PRIVATE_KEY]" | base64 -d > vsphere_tmp; chmod 600 vsphere_tmp
Check that you can shell into the admin control plane using this private key:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Exit the admin control plane node:
exit
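The preparation steps above can be sketched as two small helpers. This is a minimal sketch, not part of gkectl or kubectl: the function names external_ip_from_wide and write_private_key are hypothetical, and the first one assumes the standard wide node listing, where EXTERNAL-IP is the seventh whitespace-separated column.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Read `kubectl get nodes -o wide` output on stdin and print the
# EXTERNAL-IP value for rows that mention "master".
# Assumption: EXTERNAL-IP is the seventh whitespace-separated column.
external_ip_from_wide() {
  awk '/master/ { print $7 }'
}

# Decode a base64-encoded private key into a file that only the owner
# can read. The umask is set in a subshell so that a newly created file
# is never group- or world-readable, even briefly.
write_private_key() {
  local encoded="$1" dest="$2"
  (
    umask 077
    printf '%s' "${encoded}" | base64 -d > "${dest}"
  ) && chmod 600 "${dest}"
}

# Example usage (placeholders as in the steps above):
#   kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -n kube-system -o wide | external_ip_from_wide
#   write_private_key "[PRIVATE_KEY]" vsphere_tmp
```

Setting the umask before the file is created avoids the short window in which a freshly written key would otherwise be readable by other users.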
Backing up an admin cluster's etcd store
To back up the admin cluster's etcd store:
Get the etcd Pod's name:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get pods \
    -n kube-system -l component=etcd,tier=control-plane -ojsonpath='{$.items[*].metadata.name}{"\n"}'
Shell into the Pod's kube-etcd container:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] exec -it \
    -n kube-system [ADMIN_ETCD_POD] -- bin/sh
where [ADMIN_ETCD_POD] is the name of the etcd Pod.
From the shell, use etcdctl to create a backup named snapshot.db in the container's /tmp directory:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /tmp/snapshot.db
Exit the container:
exit
Copy the backup out of the kube-etcd container using kubectl cp:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    kube-system/[ADMIN_ETCD_POD]:tmp/snapshot.db [RELATIVE_DIRECTORY]
where [RELATIVE_DIRECTORY] is a path where you want to store your backup.
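After the copy, it can help to confirm that the snapshot file is not empty and to record a checksum, so that later copies to off-cluster storage can be verified. The following is a minimal sketch under stated assumptions: record_snapshot and verify_snapshot are hypothetical helper names, and the sketch relies on the GNU coreutils sha256sum command being available on the workstation.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Fail if the snapshot is missing or empty; otherwise write a SHA-256
# checksum file (<name>.sha256) next to the snapshot.
record_snapshot() {
  local snapshot="$1" dir name
  if [ ! -s "${snapshot}" ]; then
    echo "snapshot missing or empty: ${snapshot}" >&2
    return 1
  fi
  dir="$(dirname "${snapshot}")"
  name="$(basename "${snapshot}")"
  # Run inside the directory so the checksum file stores a relative name
  # and stays valid when the directory is moved.
  ( cd "${dir}" && sha256sum "${name}" > "${name}.sha256" )
}

# Re-check a snapshot against its recorded checksum.
# Returns non-zero if the file changed or the checksum file is missing.
verify_snapshot() {
  local snapshot="$1" dir name
  dir="$(dirname "${snapshot}")"
  name="$(basename "${snapshot}")"
  ( cd "${dir}" && sha256sum --check --status "${name}.sha256" )
}
```

A non-empty file with a matching checksum only shows that the copy is intact. It does not prove that the snapshot can be restored.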
Backing up an admin cluster's Secrets
To back up the admin control plane's Secrets:
Use SSH to connect to the admin control plane node:
ssh -i vsphere_tmp ubuntu@EXTERNAL_IP
Replace EXTERNAL_IP with the admin control plane's external IP address, which you noted previously.
Optional but highly recommended: Create a local backup directory. You need to change the backup Secrets' permissions to copy them out of the node.
mkdir backup
Locally copy the Secrets to the local backup directory:
sudo cp -r /etc/kubernetes/pki/* backup/
Change the permissions of the backup Secrets:
sudo chmod -R a+rX backup/
Exit the admin control plane node:
exit
Run scp to copy the backup folder out of the admin control plane node:
sudo scp -r -i vsphere_tmp ubuntu@EXTERNAL_IP:backup/ RELATIVE_DIRECTORY
Replace RELATIVE_DIRECTORY with a path where you want to store your backup.
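Once the backup folder is on your workstation, a quick check can confirm that the copy is usable. The sketch below looks only for the three etcd files that the etcdctl command in this document reads from /etc/kubernetes/pki. Treating those three files as a minimum is an assumption, because a real backup contains more files. The function name check_pki_backup is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check_pki_backup is a hypothetical helper name.

# Report any expected file that is missing or empty in the backup
# directory. Returns 0 only if all listed files are present.
# Assumption: this short list is a minimum, not a full inventory.
check_pki_backup() {
  local dir="$1" missing=0 f
  for f in etcd/ca.crt etcd/healthcheck-client.crt etcd/healthcheck-client.key; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${dir}/${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage:
#   check_pki_backup RELATIVE_DIRECTORY/backup
```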
Restoring an admin cluster
The following procedure recreates a backed-up admin cluster and all of the user control planes it managed when its etcd snapshot was created.
Run scp to copy snapshot.db to the admin control plane:
sudo scp -i vsphere_tmp snapshot.db ubuntu@[EXTERNAL_IP]:
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Shell into the admin control plane:
sudo ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Copy snapshot.db to /mnt:
sudo cp snapshot.db /mnt/
Make a temporary directory, such as backup:
mkdir backup
Exit the admin control plane:
exit
Copy the certificates to backup/:
sudo scp -r -i vsphere_tmp [BACKUP_CERT_FILE] ubuntu@[EXTERNAL_IP]:backup/
Shell into the admin control plane node:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Stop kube-etcd and kube-apiserver:
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml
Copy the backup Secrets to /etc/kubernetes/pki/:
sudo cp -r backup/* /etc/kubernetes/pki/
Run etcdctl restore with Docker:
sudo docker run --rm \
    -v '/mnt:/backup' \
    -v '/var/lib/etcd:/var/lib/etcd' --env ETCDCTL_API=3 'gcr.io/gke-on-prem-release/etcd-util:GKE_ON_PREM_VERSION' /bin/sh -c "etcdctl snapshot restore '/backup/snapshot.db'; rm -r /var/lib/etcd/*; mv /default.etcd/member/ /var/lib/etcd/"
Restart kube-etcd and kube-apiserver:
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
Verify that kube-etcd and kube-apiserver have started:
sudo crictl ps -a
Copy /etc/kubernetes/admin.conf to a .kube folder so that it can be accessed from the admin workstation:
mkdir -p [HOME]/.kube
sudo cp -i /etc/kubernetes/admin.conf [HOME]/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Exit the admin control plane:
exit
Copy the newly generated kubeconfig file out of the admin node:
sudo scp -i vsphere_tmp ubuntu@[EXTERNAL_IP]:[HOME]/.kube/config kubeconfig
sudo chown $(id -u):$(id -g) kubeconfig
where:
- [EXTERNAL_IP] is the admin control plane's external IP address.
- [HOME] is the home directory on the admin node.
Now you can use this new kubeconfig file to access the restored cluster.
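The stop and restart steps in the procedure above both work by moving the etcd.yaml and kube-apiserver.yaml manifests out of, and back into, the manifests directory. A minimal sketch of that move as one reusable helper follows. The function name move_manifests is hypothetical, and on the admin control plane node the script would need to run with root privileges, as the sudo mv commands in the procedure do.

```shell
#!/usr/bin/env bash
# Sketch only: move_manifests is a hypothetical helper name.

# Move the named manifest files from one directory to another.
# All files are checked first, so a missing file cannot leave one
# component stopped and the other still running.
move_manifests() {
  local src="$1" dst="$2" name
  shift 2
  for name in "$@"; do
    if [ ! -f "${src}/${name}" ]; then
      echo "not found: ${src}/${name}" >&2
      return 1
    fi
  done
  for name in "$@"; do
    mv "${src}/${name}" "${dst}/${name}" || return 1
  done
}

# Example usage on the node (run as root):
#   move_manifests /etc/kubernetes/manifests /tmp etcd.yaml kube-apiserver.yaml   # stop
#   move_manifests /tmp /etc/kubernetes/manifests etcd.yaml kube-apiserver.yaml   # restart
```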
Troubleshooting an admin cluster restore
If you encounter an issue when restoring the admin cluster, you must contact Google Support to resolve the issue with the admin cluster.
In the meantime, you can check the following to further troubleshoot.
Find the etcd container ID:
sudo crictl ps -a | grep [ADMIN_ETCD_POD]
where [ADMIN_ETCD_POD] is the name of the etcd Pod.
Examine the logs from the etcd container:
sudo crictl logs [ETCD_CONTAINER_ID]
where [ETCD_CONTAINER_ID] is the ID of the etcd container.
Look for permission denied log messages like the following:
etcdserver: create snapshot directory error:mkdir /var/lib/etcd/member/snap: permission denied
If you find permission denied messages, update the ownership of /opt/data/var/lib/etcd/:
sudo chown -R 2001:2001 /opt/data/var/lib/etcd/
Verify that kube-etcd and kube-apiserver have started:
sudo crictl ps
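The log check described above can be sketched as a small filter that reads log text on standard input. This is an illustrative helper only: the function name etcd_permission_errors is hypothetical, and the match is a plain, case-insensitive search for the phrase "permission denied".

```shell
#!/usr/bin/env bash
# Sketch only: etcd_permission_errors is a hypothetical helper name.

# Print every log line that contains "permission denied".
# Returns 0 if at least one such line was found, non-zero otherwise.
etcd_permission_errors() {
  grep -i 'permission denied'
}

# Example usage on the admin control plane node:
#   sudo crictl logs [ETCD_CONTAINER_ID] 2>&1 | etcd_permission_errors
```

Redirecting standard error into the pipe means the filter sees messages regardless of which stream the container wrote them to.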
Automatic cluster backup
You can use the script given here as an example of how to automatically back up your clusters. Note that the following script is not supported, and should only be used as a reference for writing a better, more robust, and more complete script. Before you run the script, fill in values for the five variables at the beginning of the script:
- Set BACKUP_DIR to the path where you want to store the admin and user cluster backups. This path should not exist.
- Set ADMIN_CLUSTER_KUBECONFIG to the path of the admin cluster's kubeconfig file.
- Set USER_CLUSTER_NAMESPACE to the name of your user cluster. The name of your user cluster is a namespace in the admin cluster.
- Set EXTERNAL_IP to the VIP that you reserved for the admin control plane service.
- Set SSH_PRIVATE_KEY to the path of your SSH key.
- If you are using a private network, set JUMP_IP to your network's jump server's IP address.
#!/usr/bin/env bash
# Automates manual steps for taking backups of user and admin clusters.
# Fill in the variables below before running the script.
BACKUP_DIR="" # path to store user and admin cluster backups
ADMIN_CLUSTER_KUBECONFIG="" # path to admin cluster kubeconfig
USER_CLUSTER_NAMESPACE="" # user cluster namespace
EXTERNAL_IP="" # admin control plane node external ip - follow steps in documentation
SSH_PRIVATE_KEY="" # path to vsphere_tmp ssh private key - follow steps in documentation
JUMP_IP="" # network jump server IP - leave empty string if not using private network.
# Quote the path so that a BACKUP_DIR containing spaces is handled correctly.
mkdir -p "${BACKUP_DIR}"
mkdir "${BACKUP_DIR}/pki"
# USER CLUSTER BACKUP
# Snapshot user cluster etcd
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n ${USER_CLUSTER_NAMESPACE} kube-etcd-0 -c kube-etcd -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key snapshot save /tmp/${USER_CLUSTER_NAMESPACE}_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp ${USER_CLUSTER_NAMESPACE}/kube-etcd-0:tmp/${USER_CLUSTER_NAMESPACE}_snapshot.db $BACKUP_DIR/user-cluster_${USER_CLUSTER_NAMESPACE}_snapshot.db
# ADMIN CLUSTER BACKUP
# Set up ssh options
SSH_OPTS=(-oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY})
if [ "${JUMP_IP}" != "" ]; then
  SSH_OPTS+=(-oProxyCommand="ssh -oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY} -W %h:%p ubuntu@${JUMP_IP}")
fi
# Copy admin certs
ssh "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP} 'sudo chmod -R a+rX /etc/kubernetes/pki/*'
scp -r "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP}:/etc/kubernetes/pki/* ${BACKUP_DIR}/pki/
# Snapshot admin cluster etcd
admin_etcd=$(kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} get pods -n kube-system -l component=etcd,tier=control-plane -ojsonpath='{$.items[*].metadata.name}{"\n"}')
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n kube-system ${admin_etcd} -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /tmp/admin_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp -n kube-system ${admin_etcd}:tmp/admin_snapshot.db $BACKUP_DIR/admin-cluster_snapshot.db
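Because BACKUP_DIR should point to a path that doesn't exist yet, a scheduled run needs a fresh directory name each time. One way to get that, sketched here under assumptions, is to derive the name from a UTC timestamp. The function name new_backup_dir and the timestamp layout are hypothetical and not part of the script above.

```shell
#!/usr/bin/env bash
# Sketch only: new_backup_dir is a hypothetical helper name.

# Print a backup path of the form <base>/<timestamp> and fail if that
# path already exists. The directory itself is not created here, which
# leaves that to the backup script's own mkdir calls.
# The optional second argument overrides the timestamp.
new_backup_dir() {
  local base="$1"
  local stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  local dir="${base}/${stamp}"
  if [ -e "${dir}" ]; then
    echo "already exists: ${dir}" >&2
    return 1
  fi
  printf '%s\n' "${dir}"
}

# Example usage at the top of a backup script:
#   BACKUP_DIR="$(new_backup_dir /path/to/backups)" || exit 1
```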
What's next
- Back up and restore a user cluster
- Learn how to diagnose cluster issues
- Learn about augur, an open-source tool for restoring individual objects from etcd backups.