Auto-repairing nodes

This page shows you how to configure node auto-repair in Google Kubernetes Engine (GKE).

Overview

Node auto-repair helps keep the nodes in your GKE cluster in a healthy, running state. When enabled, GKE periodically checks the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.

Nodes for Autopilot clusters are managed by Google, and have node auto-repair already configured.

For GKE clusters subscribed to release channels, node auto-repair is enabled by default and cannot be overridden.

Node auto-repair isn't available on Alpha clusters.

Repair criteria

GKE uses the node's health status to determine if a node needs to be repaired. A node reporting a Ready status is considered healthy. GKE triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold. An unhealthy status can mean any of the following:

  • A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
  • A node does not report any status at all over the given time threshold (approximately 10 minutes).
  • A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).

You can manually check your node's health signals at any time by running the kubectl get nodes command.
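As a rough sketch of how you might spot unhealthy nodes yourself, the following shell function filters the output of kubectl get nodes --no-headers and prints the names of nodes whose STATUS column does not start with Ready. The function name not_ready_nodes is hypothetical, and the check only approximates the signals described above; it does not cover the boot disk condition.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS does not start with "Ready".
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not printed.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires kubectl configured for your cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node currently reports Ready. It says nothing about how long a node has been in its current state.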

Node repair process

If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.

If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.

If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.

Node repair history

GKE generates a log entry for automated repair events. You can check the logs by running the following command:

gcloud container operations list
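The command above lists every cluster operation. To narrow the list to repair events, one option is to filter on the operation type. The sketch below assumes that auto-repair operations carry the type AUTO_REPAIR_NODES and wraps the filter in a hypothetical function named list_repair_operations.

```shell
# Hypothetical wrapper: lists only auto-repair operations.
# Assumption: repair events are recorded with operationType AUTO_REPAIR_NODES.
# Extra arguments (for example --region or --limit) are passed through to gcloud.
list_repair_operations() {
  gcloud container operations list \
    --filter="operationType=AUTO_REPAIR_NODES" \
    "$@"
}

# Usage:
#   list_repair_operations --region=COMPUTE_REGION
```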

Enabling node auto-repair

GKE clusters subscribed to release channels have node auto-repair enabled by default, and this setting cannot be overridden.

You enable node auto-repair on a per-node pool basis. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool.
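Because the setting is per node pool, it can help to check what a given pool is currently set to. The following is a minimal sketch, assuming the setting is exposed as the management.autoRepair field of the node pool resource; the function name autorepair_status is hypothetical.

```shell
# Hypothetical helper: prints the auto-repair setting of one node pool.
# Arguments: POOL_NAME CLUSTER_NAME COMPUTE_REGION
# Assumption: the setting is read from the management.autoRepair field.
autorepair_status() {
  gcloud container node-pools describe "$1" \
    --cluster="$2" \
    --region="$3" \
    --format="value(management.autoRepair)"
}

# Usage:
#   autorepair_status POOL_NAME CLUSTER_NAME COMPUTE_REGION
```

For a zonal cluster, you would swap --region for --zone in the function body.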

You can create a cluster or node pool with node auto-repair enabled by using the gcloud tool or the Google Cloud Console.

Create a cluster with node auto-repair enabled

Use the following instructions to create a Standard cluster with node auto-repair enabled:

gcloud

gcloud container clusters create CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --enable-autorepair

Replace the following:

  • CLUSTER_NAME: the name of your new Standard cluster.
  • COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use --zone=COMPUTE_ZONE instead of --region.

Console

  1. Go to the Google Kubernetes Engine page in Cloud Console.

    Go to Google Kubernetes Engine

  2. Click Create.

  3. In the Standard section, click Configure.

  4. Configure your cluster as desired.

  5. From the navigation pane, under Node Pools, click the name of the node pool you want to modify.

  6. Under Automation, select the Enable auto-repair checkbox.

  7. Click Create.

Create a node pool with auto-repair enabled

Use the following instructions to create a node pool in an existing Standard cluster with node auto-repair enabled:

gcloud

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --enable-autorepair

Replace the following:

  • POOL_NAME: the name of your new node pool.
  • CLUSTER_NAME: the name of your Standard cluster.
  • COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use --zone=COMPUTE_ZONE instead of --region.

Console

  1. Go to the Google Kubernetes Engine page in Cloud Console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add Node Pool.

  4. On the Add a node pool page, under Automation, select the Enable auto-repair checkbox.

  5. Click Create.

Enable auto-repair for an existing node pool

Use the following instructions to enable node auto-repair on an existing node pool in a Standard cluster:

gcloud

gcloud container node-pools update POOL_NAME \
    --cluster=CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --enable-autorepair

Replace the following:

  • POOL_NAME: the name of your node pool.
  • CLUSTER_NAME: the name of your Standard cluster.
  • COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use --zone=COMPUTE_ZONE instead of --region.

Console

  1. Go to the Google Kubernetes Engine page in Cloud Console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click the Nodes tab.

  4. Under Node Pools, click the name of the node pool you want to modify.

  5. On the Node pool details page, click Edit.

  6. Under Management, select the Enable auto-repair checkbox.

  7. Click Save.
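If a cluster has several node pools, you might prefer to enable auto-repair on all of them in one pass rather than repeating the update for each pool. The following sketch loops over the pool names returned by gcloud container node-pools list and runs the update command shown earlier for each one. The function name enable_autorepair_all_pools is hypothetical, and the sketch assumes a regional cluster.

```shell
# Hypothetical helper: enables auto-repair on every node pool in a cluster.
# Arguments: CLUSTER_NAME COMPUTE_REGION
enable_autorepair_all_pools() {
  gcloud container node-pools list \
    --cluster="$1" --region="$2" \
    --format="value(name)" |
  while read -r pool; do
    # Redirect stdin so the update command cannot consume the pool list.
    gcloud container node-pools update "$pool" \
      --cluster="$1" --region="$2" \
      --enable-autorepair </dev/null
  done
}

# Usage:
#   enable_autorepair_all_pools CLUSTER_NAME COMPUTE_REGION
```

The pools are updated one after another, in the order the list command returns them.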

Disable node auto-repair

You can disable node auto-repair for an existing node pool in a Standard cluster by using the gcloud tool or the Google Cloud Console.

gcloud

gcloud container node-pools update POOL_NAME \
    --cluster=CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --no-enable-autorepair

Replace the following:

  • POOL_NAME: the name of your node pool.
  • CLUSTER_NAME: the name of your Standard cluster.
  • COMPUTE_REGION: the Compute Engine region for the cluster. For zonal clusters, use --zone=COMPUTE_ZONE instead of --region.

Console

  1. Go to the Google Kubernetes Engine page in Cloud Console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click the Nodes tab.

  4. Under Node Pools, click the name of the node pool you want to modify.

  5. On the Node pool details page, click Edit.

  6. Under Management, clear the Enable auto-repair checkbox.

  7. Click Save.

What's next