This document describes how to replace a failed etcd replica in a high availability (HA) user cluster for Google Distributed Cloud.
The instructions given here apply to an HA user cluster that uses kubeception; that is, a user cluster that does not have Controlplane V2 enabled. If you need to replace an etcd replica in a user cluster that has Controlplane V2 enabled, contact Cloud Customer Care.
Before you begin
Make sure the admin cluster is working correctly.
Make sure the other two etcd members in the user cluster are working correctly. If more than one etcd member has failed, see Recovery from etcd data corruption or loss.
Replacing a failed etcd replica
Back up a copy of the etcd PodDisruptionBudget (PDB) so you can restore it later.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
    get pdb kube-etcd-pdb -o yaml > PATH_TO_PDB_FILE
```

Where:

- ADMIN_CLUSTER_KUBECONFIG is the path to the kubeconfig file for the admin cluster.
- USER_CLUSTER_NAME is the name of the user cluster that contains the failed etcd replica.
- PATH_TO_PDB_FILE is the path where you want to save the etcd PDB file, for instance /tmp/etcdpdb.yaml.
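Before deleting the PDB in the next step, it is worth confirming that the backup file was actually written and is non-empty. A minimal sketch of that check; the `echo` stands in for the real `kubectl` output, since the command above cannot run without a live admin cluster, and the path is the example from above:

```shell
# Example path from the step above.
PATH_TO_PDB_FILE=/tmp/etcdpdb.yaml

# Stand-in for the real `kubectl ... get pdb ... -o yaml` output; with a live
# cluster, the redirect in the step above produces this file instead.
echo 'kind: PodDisruptionBudget' > "$PATH_TO_PDB_FILE"

# Fail fast if the backup is missing or empty before deleting the live PDB.
test -s "$PATH_TO_PDB_FILE" && echo "backup saved"
```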
Delete the etcd PodDisruptionBudget (PDB).
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
    delete pdb kube-etcd-pdb
```

Run the following command to open the kube-etcd StatefulSet in your text editor:
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
    edit statefulset kube-etcd
```

Change the value of the --initial-cluster-state flag to existing.

```
containers:
- name: kube-etcd
  ...
  args:
  - --initial-cluster-state=existing
  ...
```
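If you prefer to script this change rather than use the interactive editor, the flag can be rewritten in a saved copy of the manifest (exported with `kubectl get -o yaml`, edited, then reapplied). A sketch of just the substitution, assuming the flag is currently set to `new`:

```shell
# One line from a hypothetical saved copy of the StatefulSet manifest.
LINE='- --initial-cluster-state=new'

# Rewrite the flag value to "existing" (the same change the editor step makes).
UPDATED=$(printf '%s\n' "$LINE" | \
    sed 's/--initial-cluster-state=[a-z]*/--initial-cluster-state=existing/')

echo "$UPDATED"
```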
Drain the failed etcd replica node.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG drain NODE_NAME \
    --ignore-daemonsets --delete-local-data
```

Where NODE_NAME is the name of the failed etcd replica node.

Create a new shell in the container of one of the working kube-etcd pods.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it \
    KUBE_ETCD_POD --container kube-etcd --namespace USER_CLUSTER_NAME \
    -- bin/sh
```

Where KUBE_ETCD_POD is the name of the working kube-etcd pod. For example, kube-etcd-0.

From this new shell, run the following commands:
Remove the failed etcd replica node from the etcd cluster.
First, set the environment variables that etcdctl needs to reach the local etcd endpoint, then list all the members of the etcd cluster:

```
export ETCDCTL_CACERT=/etcd.local.config/certificates/etcdCA.crt
export ETCDCTL_CERT=/etcd.local.config/certificates/etcd.crt
export ETCDCTL_KEY=/etcd.local.config/certificates/etcd.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

etcdctl member list -w table
```

The output shows all the member IDs. Determine the member ID of the failed replica.

Next, remove the failed replica:

```
etcdctl member remove MEMBER_ID
```

Where MEMBER_ID is the hex member ID of the failed etcd replica pod.

Add a new member with the same name and peer URL as the failed replica node.
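If you script this recovery, the failed member's ID can be pulled out of the member list by name. A sketch assuming the comma-separated default output of `etcdctl member list` (without `-w table`); the IDs and names below are made-up sample data:

```shell
# Made-up sample of `etcdctl member list` output (default, comma-separated).
MEMBERS='8e9e05c52164694d, started, kube-etcd-0, https://kube-etcd-0.kube-etcd:2380, https://127.0.0.1:2379
91bc3c398fb3c146, started, kube-etcd-1, https://kube-etcd-1.kube-etcd:2380, https://127.0.0.1:2379'

MEMBER_NAME=kube-etcd-1   # example name of the failed replica

# Field 3 is the member name; field 1 is the hex member ID.
MEMBER_ID=$(printf '%s\n' "$MEMBERS" | \
    awk -F', ' -v name="$MEMBER_NAME" '$3 == name {print $1}')

echo "$MEMBER_ID"
```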
```
etcdctl member add MEMBER_NAME --peer-urls=https://MEMBER_NAME.kube-etcd:2380
```

Where MEMBER_NAME is the identifier of the failed kube-etcd replica node. For example, kube-etcd-1 or kube-etcd-2.
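The peer URL in the command above follows the StatefulSet DNS convention pod-name.service-name:peer-port, so it can be derived directly from the member name. A small sketch (kube-etcd-1 is an example name):

```shell
MEMBER_NAME=kube-etcd-1   # example name of the failed replica

# Peer URL convention used by the kube-etcd StatefulSet: <pod>.<service>:2380
PEER_URL="https://${MEMBER_NAME}.kube-etcd:2380"

echo "$PEER_URL"
```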
Follow steps 1-3 of Deploying the utility Pods to create a utility Pod in the admin cluster. This Pod is used to access the PersistentVolume (PV) of the failed etcd member in the user cluster.
Clean up the etcd data directory from within the utility Pod.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it -n USER_CLUSTER_NAME \
    etcd-utility-MEMBER_NUMBER -- /bin/bash -c 'rm -rf /var/lib/etcd/*'
```

Where MEMBER_NUMBER is the number of the failed etcd member. For example, for kube-etcd-1, MEMBER_NUMBER is 1.

Delete the utility Pod.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete pod -n USER_CLUSTER_NAME \
    etcd-utility-MEMBER_NUMBER
```

Uncordon the failed node.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG uncordon NODE_NAME
```

Open the kube-etcd StatefulSet in your text editor.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
    edit statefulset kube-etcd
```

Change the value of the --initial-cluster-state flag to existing.

```
containers:
- name: kube-etcd
  ...
  args:
  - --initial-cluster-state=existing
  ...
```
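The `rm -rf /var/lib/etcd/*` cleanup performed in the utility Pod earlier empties the data directory without removing the mount point itself, which matters because the directory is a mounted PersistentVolume. The effect can be seen safely in a scratch directory:

```shell
# Stand-in for the etcd data directory (a mounted PV in the real cluster).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/member/snap"
touch "$DATA_DIR/member/snap/db"

# Same pattern as the cleanup command: removes the contents, keeps the mount.
rm -rf "$DATA_DIR"/*

ls -A "$DATA_DIR"   # the directory is now empty but still exists
```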
Restore the etcd PDB which was deleted in step 1.
```
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f PATH_TO_PDB_FILE
```

Where PATH_TO_PDB_FILE is the path where you saved the etcd PDB file in step 1.