Migrate an admin cluster to HA

This document shows how to migrate from a non-high-availability (non-HA) admin cluster to a high-availability (HA) admin cluster.

1.29: Preview
1.28: Not available
1.16: Not available

An HA admin cluster has three control-plane nodes and no add-on nodes. A non-HA admin cluster has one control-plane node and two add-on nodes.
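To see your current topology, you can list the nodes in the admin cluster. A minimal check, assuming ADMIN_CLUSTER_KUBECONFIG is the path of your admin cluster kubeconfig file:

  # A non-HA admin cluster typically shows one control-plane node and two
  # add-on nodes; exact node names and role labels vary by version.
  kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes -o wide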

Procedure overview

These are the primary steps involved in a migration:

  1. Edit the admin cluster configuration file.

  2. Run gkectl update admin. This command does the following:

    • Brings up an external (kind) cluster and ensures that the current non-HA admin cluster is in a healthy state.

    • Creates a new admin cluster control plane using the HA spec and the new control-plane VIP.

    • Turns off the existing admin cluster control plane.

    • Takes an etcd snapshot of the existing admin cluster.

    • Restores the old admin cluster data in the new HA control plane.

    • Reconciles the restored admin cluster to meet the end state of an HA admin cluster.

Notes

  • During migration, there's no downtime for user cluster workloads.

  • During migration, there is some downtime for the admin cluster control plane. (In our testing, the downtime was less than 18 minutes, but the actual duration depends on your infrastructure environment.)

  • Requirements for HA admin clusters also apply to a non-HA to HA migration. That is, if your non-HA admin cluster uses the Seesaw load balancer, you must first migrate to MetalLB and then migrate to an HA admin cluster, because an HA admin cluster doesn't support Seesaw. (A quick way to check which load balancer your configuration uses follows these notes.)

  • After the migration succeeds, some leftover resources (for example, the non-HA admin master VM) are intentionally kept for failure recovery. You can manually clean them up if needed.
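To check which load balancer your admin cluster is configured to use, you can read loadBalancer.kind from the admin cluster configuration file. A minimal sketch, assuming the file is named admin-cluster.yaml and that yq is installed (both are assumptions, not part of this procedure):

  # Prints the configured load balancer kind, for example "MetalLB" or "Seesaw".
  # admin-cluster.yaml is a placeholder for your configuration file path.
  yq '.loadBalancer.kind' admin-cluster.yaml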

Before and after migration

These are the primary differences in the cluster before and after migration:

| | Before migration | After migration |
| --- | --- | --- |
| Control-plane node replicas | 1 | 3 |
| Add-on nodes | 2 | 0 |
| Control-plane Pod replicas (kube-apiserver, kube-etcd, etc.) | 1 | 3 |
| Data disk size | 100 GB × 1 | 25 GB × 3 |
| Data disk path | Set by vCenter.dataDisk in the admin cluster configuration file | Auto-generated under the directory /anthos/[ADMIN_CLUSTER_NAME]/default/[MACHINE_NAME]-data.vmdk |
| Load balancer for the control-plane VIP | Set by loadBalancer.kind in the admin cluster configuration file | keepalived + haproxy |
| Allocation of IP addresses for admin cluster control-plane nodes | DHCP or static, depending on network.ipMode.type | 3 static IP addresses |
| Allocation of IP addresses for kubeception user cluster control-plane nodes | DHCP or static, depending on network.ipMode.type | DHCP or static, depending on network.ipMode.type |
| Checkpoint file | Enabled by default | Not used |
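For example, after a successful migration you can list the auto-generated data disks with govc (credential setup for govc is shown at the end of this document; ADMIN_CLUSTER_NAME is a placeholder for your admin cluster name):

  # Lists the three [MACHINE_NAME]-data.vmdk files described in the table above.
  govc datastore.ls anthos/ADMIN_CLUSTER_NAME/default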

Edit the admin cluster configuration file

You need to specify four additional IP addresses:

  • Three IP addresses for the control-plane nodes of the admin cluster
  • A new control-plane VIP for the admin cluster load balancer

You also need to change a few other fields in your admin cluster configuration file.
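Before you assign the four new addresses, you may want to confirm that none of them are already in use. A rough sketch; the placeholders are yours to substitute, and a ping reply is only a hint of a conflict, not proof of one:

  # Replace the placeholders with your three new node IPs and the new VIP.
  for ip in NODE_IP_1 NODE_IP_2 NODE_IP_3 NEW_CONTROL_PLANE_VIP; do
    ping -c 1 -W 1 "$ip" > /dev/null 2>&1 && echo "$ip responds (possible conflict)" || echo "$ip appears free"
  done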

Specify IP addresses

  1. In the admin cluster configuration file, fill in the network.controlPlaneIPBlock section. For example:

    controlPlaneIPBlock:
     netmask: "255.255.255.0"
     gateway: "172.16.20.1"
     ips:
     - ip: "172.16.20.50"
       hostname: "admin-cp-node-1"
     - ip: "172.16.20.51"
       hostname: "admin-cp-node-2"
     - ip: "172.16.20.52"
       hostname: "admin-cp-node-3"
    
  2. Fill in the hostConfig section. If your admin cluster uses static IP addresses, this section is already filled in. For example:

    hostConfig:
     dnsServers:
     - 203.0.113.1
     - 198.51.100.1
     ntpServers:
     - 216.239.35.12
    
  3. Replace the value of loadBalancer.vips.controlPlaneVIP with a new VIP. For example:

    loadBalancer:
     vips:
       controlPlaneVIP: "172.16.20.59"
    

Update additional configuration fields

  1. Set adminMaster.replicas to 3:

    adminMaster:
     replicas: 3
     cpus: 4
     memoryMB: 8192
    
  2. Remove the vCenter.dataDisk field. For an HA admin cluster, the paths of the three data disks used by control-plane nodes are automatically generated under the root directory anthos in the datastore.

  3. If loadBalancer.manualLB.controlPlaneNodePort has a non-zero value, set it to 0.
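After making these edits, you can optionally validate the configuration file before starting the migration. A hedged sketch; exact flag support varies by gkectl version, so check your version's reference first:

  # Validates the edited admin cluster configuration file (paths are placeholders).
  gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG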

Adjust manual load balancer configuration

If your admin cluster uses manual load balancing, perform the steps in this section. Otherwise, skip this section.

For each of the three new control-plane node IP addresses that you specified in the network.controlPlaneIPBlock section, configure this mapping in your load balancer:

(old controlPlaneVIP:443) -> (NEW_NODE_IP_ADDRESS:old controlPlaneNodePort)

This ensures that the old control-plane VIP continues to work during the migration.
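Once the mapping is in place, you can spot-check it. A rough sketch; any HTTP status code (even 401 or 403) indicates that the VIP reaches an apiserver, while a timeout suggests a misconfigured mapping. OLD_CONTROL_PLANE_VIP is a placeholder for your current control-plane VIP:

  # Prints the HTTP status code returned through the load balancer.
  curl -sk -o /dev/null -w '%{http_code}\n' https://OLD_CONTROL_PLANE_VIP:443/healthz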

Update the admin cluster

  1. Start the migration:

    gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG
    

    Replace the following:

    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file

    • ADMIN_CLUSTER_CONFIG: the path of the admin cluster configuration file

  2. The command displays the progress of the migration.

    When prompted, enter Y to continue.

  3. When the migration is done, the admin cluster kubeconfig file is automatically updated to use the new control-plane VIP. Meanwhile, the old control-plane VIP still functions, and can also be used to access the new HA admin cluster.
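To confirm the end state, you can list the nodes again with the updated kubeconfig. Per the table earlier in this document, you should now see three control-plane nodes and no add-on nodes:

  # The kubeconfig now points at the new control-plane VIP.
  kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes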

Manually clean up leftover resources if needed

During the migration, gkectl does not delete the old admin cluster control-plane VM; it only shuts the VM down rather than deleting it from vSphere. If you want to delete the old control-plane VM after a successful migration, you must do so manually.

To manually delete the old control-plane VM and related resources:

  1. Ensure the non-HA admin master VM gke-admin-master-xxx is already powered off.
  2. Delete the non-HA admin master VM gke-admin-master-xxx from vSphere.
  3. Delete the non-HA admin master VM template gke-admin-master-xxx-tmpl from vSphere.
  4. Delete the non-HA admin data disk and the admin checkpoint file.
  5. Clean up the temporary files saved in /home/ubuntu/admin-ha-migration/[ADMIN_CLUSTER_NAME]/.

If you prefer to use the command line, the following govc commands perform these steps:

  # Configure govc credentials
  export GOVC_INSECURE=1
  export GOVC_URL=VCENTER_URL
  export GOVC_DATACENTER=DATACENTER
  export GOVC_DATASTORE=DATASTORE
  export GOVC_USERNAME=USERNAME
  export GOVC_PASSWORD=PASSWORD

  # Configure the admin master VM name (you can find the name in the "[HA Migration]" logs)
  export ADMIN_MASTER_VM=ADMIN_MASTER_VM_NAME

  # Configure the data disk path (remove the ".vmdk" suffix)
  export DATA_DISK_PATH=DATADISK_PATH_WITHOUT_VMDK

  # Check that the admin master VM is in the "poweredOff" state
  govc vm.info $ADMIN_MASTER_VM | grep Power

  # Delete the admin master VM
  govc vm.destroy $ADMIN_MASTER_VM

  # Delete the admin master VM template
  govc vm.destroy "$ADMIN_MASTER_VM"-tmpl

  # Delete the data disk
  govc datastore.ls $DATA_DISK_PATH
  govc datastore.rm $DATA_DISK_PATH

  # Delete the admin checkpoint file
  govc datastore.ls "$DATA_DISK_PATH"-checkpoint.yaml
  govc datastore.rm "$DATA_DISK_PATH"-checkpoint.yaml
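The govc commands above cover steps 1 through 4. For step 5, you can remove the temporary migration files from the admin workstation; ADMIN_CLUSTER_NAME is a placeholder for your admin cluster name:

  # Remove temporary files saved during the migration (step 5 above).
  rm -rf /home/ubuntu/admin-ha-migration/ADMIN_CLUSTER_NAME/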