Auto-repairing nodes

This page shows you how to configure node auto-repair in Google Kubernetes Engine (GKE).

Overview

GKE's node auto-repair feature helps you keep the nodes in your cluster in a healthy, running state. When auto-repair is enabled, GKE periodically checks the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, GKE initiates a repair process for that node.

Repair criteria

GKE uses the node's health status to determine if a node needs to be repaired. A node reporting a Ready status is considered healthy. GKE triggers a repair action if a node reports an unhealthy status on consecutive checks for a given time threshold. An unhealthy status can mean:

  • A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
  • A node does not report any status at all over the given time threshold (approximately 10 minutes).
  • A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
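
The criteria above can be read as a simple decision rule. The following is a minimal sketch of that rule, assuming the approximate thresholds quoted above. The function name needs_repair and the condition labels NoStatus and DiskFull are hypothetical and used only for illustration; they are not part of GKE or its API.

```shell
# Illustrative model of the repair criteria; not how GKE is implemented.
# Usage: needs_repair CONDITION MINUTES
#   CONDITION: Ready, NotReady, NoStatus, or DiskFull (hypothetical labels)
#   MINUTES:   how long the node has continuously been in that condition
# Returns 0 (true) if the node would qualify for repair, 1 otherwise.
needs_repair() {
  condition="$1"
  minutes="$2"
  case "$condition" in
    # NotReady or no status report: threshold of approximately 10 minutes
    NotReady|NoStatus) [ "$minutes" -ge 10 ] ;;
    # Boot disk out of space: threshold of approximately 30 minutes
    DiskFull)          [ "$minutes" -ge 30 ] ;;
    # A Ready node (or any other condition) is treated as healthy
    *)                 return 1 ;;
  esac
}
```

For example, needs_repair NotReady 12 succeeds, while needs_repair DiskFull 20 does not, because the disk condition has not yet lasted 30 minutes.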

Node repair process

If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.

If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.

If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.

Node repair history

GKE generates a log entry for automated repair events. You can check the logs by using the gcloud container operations list command.
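
To narrow the operations list to repair events, you can filter on the operation type. The following is a sketch that assumes repairs are recorded with the operation type AUTO_REPAIR_NODES. The wrapper function name list_repair_operations is hypothetical.

```shell
# Lists auto-repair operations recorded in the given compute zone.
# Assumption: GKE records repairs with operationType AUTO_REPAIR_NODES.
# Usage: list_repair_operations COMPUTE_ZONE
list_repair_operations() {
  zone="$1"
  gcloud container operations list \
    --zone "$zone" \
    --filter="operationType=AUTO_REPAIR_NODES"
}
```

Call it as list_repair_operations [COMPUTE_ZONE]. An empty result suggests that no repairs were recorded in that zone for the listed operations.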

Enabling node auto-repair

You enable node auto-repair for each node pool individually. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool.

gcloud

To create a cluster or node pool with node auto-repair enabled, specify the --enable-autorepair option when you create your cluster or node pool using the gcloud command-line tool.

To create a cluster with auto-repair enabled, run the following command:

gcloud container clusters create [CLUSTER_NAME] --zone [COMPUTE_ZONE] \
--enable-autorepair

To create a node pool with auto-repair enabled:

gcloud container node-pools create [POOL_NAME] --cluster [CLUSTER_NAME] \
--zone [COMPUTE_ZONE] --enable-autorepair

To enable auto-repair for an existing node pool:

gcloud container node-pools update [POOL_NAME] --cluster [CLUSTER_NAME] \
--zone [COMPUTE_ZONE] --enable-autorepair
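
After running any of the commands above, you might want to confirm the setting took effect. The following is a sketch that reads the node pool's management.autoRepair field with gcloud container node-pools describe. The wrapper function name autorepair_status is hypothetical, and the exact text printed for an enabled or disabled pool can vary by gcloud version.

```shell
# Prints the auto-repair setting of a node pool.
# Usage: autorepair_status POOL_NAME CLUSTER_NAME COMPUTE_ZONE
autorepair_status() {
  pool="$1"
  cluster="$2"
  zone="$3"
  # management.autoRepair is the node pool field that holds the setting
  gcloud container node-pools describe "$pool" \
    --cluster "$cluster" \
    --zone "$zone" \
    --format="value(management.autoRepair)"
}
```

Call it as autorepair_status [POOL_NAME] [CLUSTER_NAME] [COMPUTE_ZONE].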

Console

To create a cluster in which the default node pool has node auto-repair enabled, perform the following steps:

  1. Visit the Google Kubernetes Engine menu in Cloud Console.

  2. Click Create cluster.

  3. Choose the Standard cluster template or choose an appropriate template for your workload.

  4. Configure your cluster as desired.

  5. Click More options. Select Enable auto-repair.

  6. Click Create.

To create a node pool with node auto-repair enabled:

  1. Visit the Google Kubernetes Engine menu in Cloud Console.

  2. Click the cluster's Edit button, which looks like a pencil.

  3. From the Node pools menu, click Add node pool.

  4. Configure your node pool as desired. Then, click More options for the node pool.

  5. Select Enable auto-repair.

  6. Click Save to save the node pool configuration.

  7. Click Save again to modify the cluster.

To enable node auto-repair for an existing node pool:

  1. Visit the Google Kubernetes Engine menu in Cloud Console.

  2. Click the cluster's Edit button, which looks like a pencil.

  3. From the Node pools menu, click More options for the node pool you want to modify.

  4. Select Enable auto-repair.

  5. Click Save to save the node pool configuration.

  6. Click Save again to modify the cluster.

Disabling node auto-repair

gcloud

To disable auto-repair for a given node pool, run the following command:

gcloud container node-pools update [POOL_NAME] --cluster [CLUSTER_NAME] \
--zone [COMPUTE_ZONE] --no-enable-autorepair
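
Because auto-repair is set per node pool, disabling it for a whole cluster means repeating the command for each pool. The following is a sketch of one way to do that, combining gcloud container node-pools list with the update command above. The function name disable_autorepair_all_pools is hypothetical.

```shell
# Disables auto-repair on every node pool in a cluster.
# Usage: disable_autorepair_all_pools CLUSTER_NAME COMPUTE_ZONE
disable_autorepair_all_pools() {
  cluster="$1"
  zone="$2"
  # Print one node pool name per line, then update each pool in turn
  gcloud container node-pools list \
    --cluster "$cluster" --zone "$zone" \
    --format="value(name)" |
  while IFS= read -r pool; do
    # Skip blank lines in the listing
    [ -n "$pool" ] || continue
    # Redirect stdin so the update cannot consume the remaining pool names
    gcloud container node-pools update "$pool" \
      --cluster "$cluster" --zone "$zone" \
      --no-enable-autorepair </dev/null
  done
}
```

The pools are updated one at a time. Repairs that are already in progress are not cancelled, as described earlier.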

Console

To disable node auto-repair for an existing node pool, perform the following steps:

  1. Visit the Google Kubernetes Engine menu in Cloud Console.

  2. Click the cluster's Edit button, which looks like a pencil.

  3. From the Node pools menu, click More options for the node pool you want to modify.

  4. Deselect Enable auto-repair.

  5. Click Save to save the node pool configuration.

  6. Click Save again to modify the cluster.

What's next