Configuring node auto repair

This page shows how to enable node auto repair for Google Distributed Cloud clusters.

The node auto repair feature continuously detects and repairs unhealthy nodes in a cluster. The feature is disabled by default. You can enable the feature during admin or user cluster creation. You can also enable the feature for an existing user cluster, but not for an existing admin cluster.

Unhealthy conditions

The following conditions are indications that a node is unhealthy:

  • The node condition NotReady is true for approximately 10 minutes.

  • The machine state is Unavailable for approximately 10 minutes after successful creation.

  • The machine state is not Available for approximately 30 minutes after VM creation.

  • There is no node object (nodeRef is nil) corresponding to a machine in the Available state for approximately 10 minutes.

  • The node condition DiskPressure is true for approximately 30 minutes.
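You can observe these conditions yourself with standard kubectl commands. For the node commands, use the kubeconfig of the cluster whose nodes you are checking; Machine objects live in the admin cluster, as shown in the debugging section later on this page.

# An unhealthy node shows NotReady in the STATUS column.
kubectl --kubeconfig KUBECONFIG get nodes

# The Conditions section shows Ready and DiskPressure, among others.
kubectl --kubeconfig KUBECONFIG describe node NODE_NAME

# Machine objects and their states are visible from the admin cluster.
kubectl --kubeconfig ADMIN_KUBECONFIG get machines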

Repair strategy

Google Distributed Cloud initiates a repair on a node if the node meets at least one of the conditions in the preceding list.

The repair drains the unhealthy node and creates a new VM. If the node drain does not succeed within one hour, the repair forces the drain and safely detaches the attached Kubernetes managed disks.

If there are multiple unhealthy nodes in the same MachineDeployment, the repair is performed on only one of those nodes at a time.
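To observe a repair in progress, one approach (using standard kubectl options, not a feature specific to this page) is to watch the Machine objects in the admin cluster and review recent events:

# Watch Machine objects; the repaired node's machine is replaced by one with a new name.
kubectl --kubeconfig ADMIN_KUBECONFIG get machines --watch

# Recent events, sorted by creation time, show drain and re-creation activity.
kubectl --kubeconfig ADMIN_KUBECONFIG get events --sort-by=.metadata.creationTimestamp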

v1 cluster configuration files

To enable node auto repair, you must use a v1 configuration file, either the v1 admin cluster configuration file or the v1 user cluster configuration file. Node auto repair is not supported in the v0 configuration file.
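For orientation only, a v1 user cluster configuration file declares its version and kind at the top. The values below are placeholders, and a real file contains many more fields:

apiVersion: v1
kind: UserCluster
name: "my-user-cluster"
# ...other fields omitted...
autoRepair:
  enabled: true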

Enabling node auto repair for a new cluster

In your admin or user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true

Continue with the steps for creating your admin or user cluster.
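Cluster creation itself is documented separately; as a sketch, the creation commands generally look like the following (flags can vary by version):

# Create an admin cluster from its configuration file.
gkectl create admin --config ADMIN_CLUSTER_CONFIG

# Create a user cluster; the admin cluster kubeconfig is required.
gkectl create cluster --kubeconfig ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG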

Enabling node auto repair for an existing user cluster

In your user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true

Update the cluster:

gkectl update cluster --config USER_CLUSTER_CONFIG --kubeconfig ADMIN_KUBECONFIG

Replace the following:

  • USER_CLUSTER_CONFIG: the path of your user cluster configuration file

  • ADMIN_KUBECONFIG: the path of your admin cluster kubeconfig file
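To confirm the change took effect, you can check for the cluster-health-controller Deployment. This sketch assumes that, for a user cluster, the controller runs in the user cluster's namespace on the admin cluster; the exact namespace can vary by version.

# USER_CLUSTER_NAME is the name of your user cluster.
kubectl --kubeconfig ADMIN_KUBECONFIG get deployment cluster-health-controller --namespace USER_CLUSTER_NAME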

Disabling node auto repair for a user cluster

In your user cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false

Update the cluster:

gkectl update cluster --config USER_CLUSTER_CONFIG --kubeconfig ADMIN_KUBECONFIG

Disabling node auto repair for an admin cluster

To disable node auto repair for an admin cluster, delete the cluster-health-controller Deployment:

kubectl --kubeconfig ADMIN_KUBECONFIG delete deployment cluster-health-controller --namespace kube-system
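To verify that the Deployment was removed:

# Returns NotFound once the controller has been deleted.
kubectl --kubeconfig ADMIN_KUBECONFIG get deployment cluster-health-controller --namespace kube-system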

Debugging node auto repair

You can investigate issues with node auto repair by describing the Machine and Node objects in the admin cluster. Here's an example:

List the machine objects:

kubectl --kubeconfig kubeconfig get machines

Output:

default     gke-admin-master-wcbrj
default     gke-admin-node-7458969ff8-5cg8d
default     gke-admin-node-7458969ff8-svqj7
default     xxxxxx-user-cluster-41-25j8d-567f9c848f-fwjqt

Describe one of the Machine objects:

kubectl --kubeconfig kubeconfig describe machine gke-admin-master-wcbrj

In the output, look for events from cluster-health-controller.
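If the describe output is long, you can also filter events by the machine name with a standard kubectl field selector, which narrows the output to events whose involved object is that machine:

# List events that reference the machine, including those from cluster-health-controller.
kubectl --kubeconfig kubeconfig get events --field-selector involvedObject.name=gke-admin-master-wcbrj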

Similarly, you can list and describe node objects. For example:

kubectl --kubeconfig kubeconfig get nodes
...
kubectl --kubeconfig kubeconfig describe node gke-admin-master-wcbrj

Manual node repair

If a node has problems that the auto repair logic doesn't cover, or if you haven't enabled node auto repair, you can perform a manual repair. A manual repair deletes and re-creates the node.

Get the name of the Machine object that corresponds to the node:

kubectl --kubeconfig KUBECONFIG get machines

Replace KUBECONFIG with the path of your admin or user cluster kubeconfig file.

Add the repair annotation to the Machine object:

kubectl annotate --kubeconfig KUBECONFIG machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true

Replace MACHINE_NAME with the name of the Machine object.

Delete the Machine object:

kubectl delete --kubeconfig KUBECONFIG machine MACHINE_NAME
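After you delete the Machine object, the cluster controllers re-create the machine and its VM. You can watch for the replacement, for example:

# A replacement Machine object with a new name suffix should appear.
kubectl --kubeconfig KUBECONFIG get machines --watch

# The replacement node should eventually report Ready.
kubectl --kubeconfig KUBECONFIG get nodes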