In Google Distributed Cloud, periodic health checking and automatic node repair are enabled by default.
The node auto repair feature continuously detects and repairs unhealthy nodes in a cluster.
Periodic health checks run every fifteen minutes. The checks are the same as the ones performed by gkectl diagnose cluster. The results are surfaced as logs and events on Cluster objects in the admin cluster.
Make sure that your admin and user clusters each have an extra IP address available for automatic node repair.
If advanced cluster is enabled, the periodic health checks are not run as part of auto repair.
Unhealthy node conditions when advanced cluster isn't enabled
The following conditions indicate that a node is unhealthy when enableAdvanceCluster is false:
- The node condition NotReady is true for approximately 10 minutes.
- The machine state is Unavailable for approximately 10 minutes after successful creation.
- The machine state is not Available for approximately 30 minutes after VM creation.
- There is no node object (nodeRef is nil) corresponding to a machine in the Available state for approximately 10 minutes.
- The node condition DiskPressure is true for approximately 30 minutes.
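The timing thresholds above can be sketched as a simple shell predicate. This is an illustration only: the function, and the shorthand labels NoNodeRef and NotAvailable, are invented for this sketch and are not part of any Google Distributed Cloud tooling.

```shell
# Illustrative only: encode the approximate unhealthy-node thresholds above.
# NoNodeRef and NotAvailable are shorthand labels invented for this sketch.
is_unhealthy() {
  local state=$1 minutes=$2
  case "$state" in
    NotReady|Unavailable|NoNodeRef) [ "$minutes" -ge 10 ] ;;  # ~10-minute thresholds
    NotAvailable|DiskPressure)      [ "$minutes" -ge 30 ] ;;  # ~30-minute thresholds
    *) return 1 ;;
  esac
}

is_unhealthy NotReady 12 && echo "repair candidate"
```

Here a node that has been NotReady for 12 minutes qualifies as a repair candidate, while a node under DiskPressure for only 15 minutes does not yet.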
Unhealthy node conditions when advanced cluster is enabled
The following conditions indicate that a node is unhealthy when enableAdvanceCluster is true:
- The node condition NotReady is true for approximately 10 minutes.
- The node condition DiskPressure is true for approximately 30 minutes.
Node repair strategy
Google Distributed Cloud initiates a repair on a node if the node meets at least one of the conditions in the preceding list.
The repair drains the unhealthy node and creates a new VM. If draining the node is unsuccessful after one hour, the repair forces the drain and safely detaches the attached Kubernetes managed disks.
If there are multiple unhealthy nodes in the same MachineDeployment, the repair is performed on only one of those nodes at a time.
The number of repairs per hour for a node pool is limited to the maximum of:
- Three
- Ten percent of the number of nodes in the node pool
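As a sketch of this throttle, the per-hour cap for a pool of N nodes is the larger of 3 and 10 percent of N. The rounding used here (rounding the percentage up) is an assumption made for illustration; the product may round differently.

```shell
# Illustrative only: per-hour repair cap = max(3, 10% of pool size).
# Rounding the percentage up is an assumption made for this sketch.
repair_cap() {
  local nodes=$1
  local pct=$(( (nodes + 9) / 10 ))   # ceil(10% of nodes)
  echo $(( pct > 3 ? pct : 3 ))
}

repair_cap 8    # small pool: the floor of 3 applies
repair_cap 50   # 10% of 50 = 5
```

So a 50-node pool allows at most 5 repairs per hour, while small pools still get up to 3.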
Enabling node repair and health checking for a new cluster
In your admin or user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true
Continue with the steps for creating your admin or user cluster.
Enabling node repair and health checking for an existing user cluster
In your user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true
Update the cluster:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG
Replace the following:
- ADMIN_CLUSTER_KUBECONFIG: the path of your admin cluster kubeconfig file
- USER_CLUSTER_CONFIG: the path of your user cluster configuration file
Enabling node repair and health checking for an existing admin cluster
In your admin cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true
Update the cluster:
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG
Replace ADMIN_CLUSTER_CONFIG with the path of your admin cluster configuration file.
Viewing logs from a health checker
List all of the health checker Pods in the admin cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep cluster-health-controller
The output is similar to this:
kube-system       cluster-health-controller-6c7df455cf-zlfh7   2/2   Running
my-user-cluster   cluster-health-controller-5d5545bb75-rtz7c   2/2   Running
To view the logs from a particular health checker, get the logs for the cluster-health-controller container in one of the Pods. For example, to get the logs for my-user-cluster shown in the preceding output:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster logs \
    cluster-health-controller-5d5545bb75-rtz7c cluster-health-controller
Viewing events from a health checker
List all of the Cluster objects in your admin cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get clusters --all-namespaces
The output is similar to this:
default           gke-admin-ldxh7   2d15h
my-user-cluster   my-user-cluster   2d12h
To view the events for a particular cluster, run kubectl describe cluster with the --show-events flag. For example, to see the events for my-user-cluster shown in the preceding output:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster \
    describe --show-events cluster my-user-cluster
Example output:
Events:
  Type     Reason             Age  From                                 Message
  ----     ------             ---  ----                                 -------
  Warning  ValidationFailure  17s  cluster-health-periodics-controller  validator for Pod returned with status: FAILURE, reason: 1 pod error(s).
Disabling node repair and health checking for a user cluster
In your user cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false
Update the cluster:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG
Disabling node repair and health checking for an admin cluster
In your admin cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false
Update the cluster:
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG
Debugging node auto repair when advanced cluster isn't enabled
You can investigate issues with node auto repair by describing the Machine and Node objects in the admin cluster. Here's an example:
List the machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get machines
Example output:
default   gke-admin-master-wcbrj
default   gke-admin-node-7458969ff8-5cg8d
default   gke-admin-node-7458969ff8-svqj7
default   xxxxxx-user-cluster-41-25j8d-567f9c848f-fwjqt
Describe one of the Machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe machine gke-admin-master-wcbrj
In the output, look for events from cluster-health-controller.
Similarly, you can list and describe node objects. For example:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes
...
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe node gke-admin-master-wcbrj
Debugging node auto repair when advanced cluster is enabled
You can investigate issues with node auto repair by describing the Machine objects in the admin cluster and the Node objects in the corresponding cluster. Here's an example:
List the machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get machines
Example output:
NAMESPACE                             NAME             NODEPOOL
ci-1f6861fe28cac8fb390bc798927c717b   10.251.172.47    ci-1f6861fe28cac8fb390bc798927c717b-np
ci-1f6861fe28cac8fb390bc798927c717b   10.251.173.64    ci-1f6861fe28cac8fb390bc798927c717b-cp
ci-1f6861fe28cac8fb390bc798927c717b   10.251.173.66    ci-1f6861fe28cac8fb390bc798927c717b-cp
ci-1f6861fe28cac8fb390bc798927c717b   10.251.174.19    ci-1f6861fe28cac8fb390bc798927c717b-np
ci-1f6861fe28cac8fb390bc798927c717b   10.251.175.15    ci-1f6861fe28cac8fb390bc798927c717b-np
ci-1f6861fe28cac8fb390bc798927c717b   10.251.175.30    ci-1f6861fe28cac8fb390bc798927c717b-cp
kube-system                           10.251.172.239   gke-admin-bnbp9-cp
kube-system                           10.251.173.39    gke-admin-bnbp9-cp
kube-system                           10.251.173.6     gke-admin-bnbp9-cp
Describe one of the Machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe machine -n ci-1f6861fe28cac8fb390bc798927c717b 10.251.172.47
In the output, look for events from auto-repair-controller.
Similarly, you can list and describe node objects. For example:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get nodes
...
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG describe node ci-1f6861fe28cac8fb390bc798927c717b-np
Manual node repair when advanced cluster isn't enabled
Admin control plane node
The admin control plane node has a dedicated repair command, because the normal manual repair doesn't work for it.
Use gkectl repair admin-master to repair the admin control plane node.
Controlplane V2 user cluster control plane node
The Controlplane V2 user cluster control plane nodes are managed differently from other nodes.
Similar to kubeception user clusters, the control plane Machine objects of Controlplane V2 user clusters are in the admin cluster, and node auto repair for these nodes is covered by admin cluster node auto repair.
If a node has problems that are not covered by the admin cluster node auto repair logic, or you have not enabled admin cluster node auto repair, you can perform a manual repair, which deletes and re-creates the node.
Get the name of the Machine object that corresponds to the node:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get machines
Replace the following:
- ADMIN_CLUSTER_KUBECONFIG: the path of your admin cluster kubeconfig file
- USER_CLUSTER_NAME: the name of the target user cluster
Add the repair annotation to the Machine object:
kubectl annotate --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true
Replace MACHINE_NAME with the name of the Machine object.
Delete the Machine object:
kubectl delete --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME machine MACHINE_NAME
For an HA control plane, re-create the nodes one at a time; otherwise, you might bring down the control plane unexpectedly.
Other nodes
If a node has problems that are not covered by the auto repair logic, or you have not enabled node auto repair, you can perform a manual repair, which deletes and re-creates the node.
Get the name of the Machine object that corresponds to the node:
kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
Replace CLUSTER_KUBECONFIG with the path of your admin or user cluster kubeconfig file.
Add the repair annotation to the Machine object:
kubectl annotate --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true
Replace MACHINE_NAME with the name of the Machine object.
Delete the Machine object:
kubectl delete --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME
Manual node repair when advanced cluster is enabled
Admin control plane node
Manual repair of the admin control plane node is not supported.
User cluster control plane nodes and worker nodes
Get the name of the Inventory Machine object that corresponds to the node, using the node's IP address to match the objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get inventorymachines
Replace the following:
- ADMIN_CLUSTER_KUBECONFIG: the path of your admin cluster kubeconfig file
- USER_CLUSTER_NAME: the name of the target user cluster
Add the force-remove annotation to the Inventory Machine object:
kubectl annotate --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME inventorymachine MACHINE_NAME baremetal.cluster.gke.io/force-remove=true
Replace MACHINE_NAME with the name of the Inventory Machine object.
Delete the Inventory Machine object:
kubectl delete --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME inventorymachine MACHINE_NAME
For an HA control plane, re-create the nodes one at a time; otherwise, you might bring down the control plane unexpectedly.