This page explains how node auto-repair works and how to use the feature for Standard Google Kubernetes Engine (GKE) clusters.
Node auto-repair helps keep the nodes in your GKE cluster in a healthy, running state. When enabled, GKE makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.
Settings for Autopilot and Standard
Autopilot clusters always automatically repair nodes. You can't disable this setting.
In Standard clusters, node auto-repair is enabled by default for new node pools. You can disable auto-repair for an existing node pool; however, we recommend keeping the default configuration.
Repair criteria
GKE uses the node's health status to determine if a node needs to be repaired. A node reporting a Ready status is considered healthy. GKE triggers a repair action if a node reports consecutive unhealthy statuses for a given time threshold.
An unhealthy status can mean:
- A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
- A node does not report any status at all over the given time threshold (approximately 10 minutes).
- A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
You can manually check your node's health signals at any time by using the kubectl get nodes command.
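For example, the following commands show a node's reported status and the underlying health conditions. The NODE_NAME placeholder is illustrative:

```shell
# List all nodes with their health status. A healthy node shows
# STATUS "Ready"; a persistent "NotReady" or "Unknown" status over
# the time threshold can trigger auto-repair.
kubectl get nodes

# Inspect the individual health conditions (Ready, DiskPressure,
# MemoryPressure, and so on) that feed into a node's status:
kubectl describe node NODE_NAME
```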
Node repair process
If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.
If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE repairs more nodes in parallel on a larger cluster, but fewer nodes in parallel as the number of unhealthy nodes grows.
If you disable node auto-repair at any time during the repair process, in-progress repairs are not canceled and continue for any node under repair.
Node repair history
GKE generates a log entry for automated repair events. You can check the logs by running the following command:
gcloud container operations list
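To narrow the output to repair events only, you can filter on the operation type. This sketch assumes the AUTO_REPAIR_NODES operation type that GKE records for automated repairs:

```shell
# List only automated node-repair operations. The --filter flag is
# the standard gcloud CLI filter expression syntax; AUTO_REPAIR_NODES
# is assumed to be the operation type logged for auto-repair events.
gcloud container operations list \
    --filter="operationType=AUTO_REPAIR_NODES"
```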
Node auto-repair in TPU slice nodes
If a TPU slice node in a multi-host TPU slice node pool is unhealthy and requires auto repair, the entire node pool is recreated. To learn more about the TPU slice node conditions, see TPU slice node auto repair.
Enable auto-repair for an existing Standard node pool
You enable node auto-repair on a per-node pool basis.
If auto-repair is disabled on an existing node pool in a Standard cluster, use the following instructions to enable it:
gcloud
gcloud container node-pools update POOL_NAME \
--cluster CLUSTER_NAME \
--region=COMPUTE_REGION \
--enable-autorepair
Replace the following:
POOL_NAME: the name of your node pool.
CLUSTER_NAME: the name of your Standard cluster.
COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use the --zone COMPUTE_ZONE option.
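For example, assuming a hypothetical node pool named default-pool in a regional cluster named my-cluster in us-central1:

```shell
# Enable auto-repair on an existing node pool. The pool, cluster,
# and region names below are illustrative placeholders.
gcloud container node-pools update default-pool \
    --cluster my-cluster \
    --region=us-central1 \
    --enable-autorepair
```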
Console
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click the Nodes tab.
Under Node Pools, click the name of the node pool you want to modify.
On the Node pool details page, click Edit.
Under Management, select the Enable auto-repair checkbox.
Click Save.
Verify node auto-repair is enabled for a Standard node pool
Node auto-repair is enabled on a per-node pool basis. You can verify that a node pool in your cluster has node auto-repair enabled with the Google Cloud CLI or the Google Cloud console.
gcloud
Describe the node pool:
gcloud container node-pools describe NODE_POOL_NAME \
--cluster=CLUSTER_NAME
If node auto-repair is enabled, the output of the command includes these lines:
management:
...
autoRepair: true
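Alternatively, you can print just the auto-repair setting by using the gcloud CLI's standard --format projection. The node pool and cluster names are placeholders:

```shell
# Print only the management.autoRepair field from the node pool
# description; the output is the field's boolean value when it is set.
gcloud container node-pools describe NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --format="value(management.autoRepair)"
```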
Console
Go to the Google Kubernetes Engine page in the Google Cloud console.
On the Google Kubernetes Engine page, click the name of the cluster that contains the node pool you want to inspect.
Click the Nodes tab.
Under Node Pools, click the name of the node pool you want to inspect.
Under Management, in the Auto-repair field, verify that auto-repair is enabled.
Disable node auto-repair
You can disable node auto-repair for an existing node pool in a Standard cluster by using the gcloud CLI or the Google Cloud console.
gcloud
gcloud container node-pools update POOL_NAME \
--cluster CLUSTER_NAME \
--region=COMPUTE_REGION \
--no-enable-autorepair
Replace the following:
POOL_NAME: the name of your node pool.
CLUSTER_NAME: the name of your Standard cluster.
COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use the --zone COMPUTE_ZONE option.
Console
Go to the Google Kubernetes Engine page in the Google Cloud console.
In the cluster list, click the name of the cluster you want to modify.
Click the Nodes tab.
Under Node Pools, click the name of the node pool you want to modify.
On the Node pool details page, click Edit.
Under Management, clear the Enable auto-repair checkbox.
Click Save.