This page shows how to enable node auto repair for Google Distributed Cloud clusters.
The node auto repair feature continuously detects and repairs unhealthy nodes in a cluster. The feature is disabled by default. You can enable the feature during admin or user cluster creation. You can also enable the feature for an existing user cluster, but not for an existing admin cluster.
Unhealthy conditions
The following conditions are indications that a node is unhealthy:
The node condition NotReady is true for approximately 10 minutes.
The machine state is Unavailable for approximately 10 minutes after successful creation.
The machine state is not Available for approximately 30 minutes after VM creation.
There is no node object (nodeRef is nil) corresponding to a machine in the Available state for approximately 10 minutes.
The node condition DiskPressure is true for approximately 30 minutes.
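If you want to inspect these conditions on a node yourself, you can list the nodes and describe one to see its Conditions section. The following commands are a suggested way to check manually; they are not part of the auto repair feature. Replace KUBECONFIG and NODE_NAME with your own values:

kubectl --kubeconfig KUBECONFIG get nodes
kubectl --kubeconfig KUBECONFIG describe node NODE_NAME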
Repair strategy
Google Distributed Cloud initiates a repair on a node if the node meets at least one of the conditions in the preceding list.
The repair drains the unhealthy node and creates a new VM. If draining the node does not succeed within one hour, the repair forces the drain and safely detaches the attached Kubernetes managed disks.
If there are multiple unhealthy nodes in the same MachineDeployment, the repair is performed on only one of those nodes at a time.
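For reference, the drain that the repair performs is roughly equivalent to draining the node manually. The controller does this internally; the following command is only an illustration of the operation, with KUBECONFIG and NODE_NAME as placeholders:

kubectl --kubeconfig KUBECONFIG drain NODE_NAME --ignore-daemonsets --timeout=1h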
v1 cluster configuration files
To enable node auto repair, you must use a v1 configuration file, either the v1 admin cluster configuration file or the v1 user cluster configuration file. Node auto repair is not supported in the v0 configuration file.
Enabling node auto repair for a new cluster
In your admin or user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true
Continue with the steps for creating your admin or user cluster.
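Depending on your Google Distributed Cloud version, the creation commands typically look like the following. They are shown here only as a reminder; see the cluster creation documentation for the exact steps. ADMIN_CLUSTER_CONFIG, USER_CLUSTER_CONFIG, and ADMIN_KUBECONFIG are placeholders for your own file paths:

gkectl create admin --config ADMIN_CLUSTER_CONFIG
gkectl create cluster --config USER_CLUSTER_CONFIG --kubeconfig ADMIN_KUBECONFIG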
Enabling node auto repair for an existing user cluster
In your user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true
Update the cluster:
gkectl update cluster --config USER_CLUSTER_CONFIG --kubeconfig ADMIN_KUBECONFIG
Replace the following:
USER_CLUSTER_CONFIG: the path of your user cluster configuration file
ADMIN_KUBECONFIG: the path of your admin cluster kubeconfig file
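For example, with a user cluster configuration file named user-cluster.yaml and an admin cluster kubeconfig file named kubeconfig (both hypothetical paths), the command would look like this:

gkectl update cluster --config user-cluster.yaml --kubeconfig kubeconfig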
Disabling node auto repair for a user cluster
In your user cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false
Update the cluster:
gkectl update cluster --config USER_CLUSTER_CONFIG --kubeconfig ADMIN_KUBECONFIG
Disabling node auto repair for an admin cluster
To disable node auto repair for an admin cluster, delete the cluster-health-controller Deployment:
kubectl --kubeconfig ADMIN_KUBECONFIG delete deployment cluster-health-controller --namespace kube-system
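If you want to confirm that the Deployment is gone, you can query for it. This verification step is a suggestion, not part of the documented procedure; the command returns a NotFound error once the Deployment has been deleted:

kubectl --kubeconfig ADMIN_KUBECONFIG get deployment cluster-health-controller --namespace kube-system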
Debugging node auto repair
You can investigate issues with node auto repair by describing the Machine and Node objects in the admin cluster. Here's an example:
List the machine objects:
kubectl --kubeconfig kubeconfig get machines
Output:

default    gke-admin-master-wc
default    gke-admin-node-7458969ff8-5cg8d
default    gke-admin-node-7458969ff8-svqj7
default    xxxxxx-user-cluster-41-25j8d-567f9c848f-fwjqt
Describe one of the Machine objects:
kubectl --kubeconfig kubeconfig describe machine gke-admin-master-wcbrj
In the output, look for events from cluster-health-controller.
Similarly, you can list and describe node objects. For example:
kubectl --kubeconfig kubeconfig get nodes
...
kubectl --kubeconfig kubeconfig describe node gke-admin-master-wcbrj
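You can also list events across namespaces and filter for those reported by the controller. The grep filter here is only an illustration; adjust it to your environment:

kubectl --kubeconfig kubeconfig get events --all-namespaces | grep cluster-health-controller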
Manual node repair
If a node has problems that are not covered by the auto repair logic, or if you have not enabled node auto repair, you can do a manual repair. A manual repair deletes and re-creates the node.
Get the name of the Machine object that corresponds to the node:
kubectl --kubeconfig KUBECONFIG get machines
Replace KUBECONFIG with the path of your admin or user cluster kubeconfig file.
Add the repair annotation to the Machine object:
kubectl annotate --kubeconfig KUBECONFIG machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true
Replace MACHINE_NAME with the name of the Machine object.
Delete the Machine object:
kubectl delete --kubeconfig KUBECONFIG machine MACHINE_NAME
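After you delete the Machine object, the cluster controllers re-create the Machine and its VM. If you want to watch the replacement appear, you can list the Machine objects again; the --watch flag is optional and shown only as a convenience:

kubectl --kubeconfig KUBECONFIG get machines --watch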