Node Auto-Repair on Kubernetes Engine

This page explains how to configure node auto-repair in Kubernetes Engine.

Overview

Kubernetes Engine's node auto-repair feature helps keep the nodes in your cluster in a healthy, running state. When the feature is enabled, Kubernetes Engine periodically checks the health of each node in your cluster. If a node fails consecutive health checks over an extended period (approximately 10 minutes), Kubernetes Engine initiates a repair process for that node.

Repair criteria

Kubernetes Engine uses a node's health status to determine whether the node needs to be repaired. A node that reports a Ready status is considered healthy. Kubernetes Engine triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold (approximately 10 minutes). An unhealthy status can mean:

  • A node reports a NotReady status on consecutive checks over the given time threshold.
  • A node does not report any status at all over the given time threshold.
  • A node's boot disk is out of disk space for an extended time period.
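To see which nodes would currently match the first criterion, you can inspect node status yourself. The following is a minimal sketch, not the check that Kubernetes Engine runs internally. It assumes kubectl is configured for your cluster, and the helper name not_ready_nodes is hypothetical. It reads the output of kubectl get nodes --no-headers and prints the name of every node whose STATUS column does not begin with Ready.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name (column 1) of each node whose STATUS (column 2)
# does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (assumes kubectl points at your cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

A single NotReady reading does not trigger a repair. The node must stay unhealthy for the full time threshold.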

Node repair process

If Kubernetes Engine detects that a node requires repair, the node is drained and re-created. The drain might not succeed if the node is unresponsive or is too unhealthy to process the drain command.

If multiple nodes require repair, Kubernetes Engine repairs one node at a time, with each repair lasting approximately 5-10 minutes. If you disable node auto-repair while a repair is in progress, the in-progress repair is not cancelled and still completes for any node currently under repair.
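Because repairs run one node at a time, the total repair window grows linearly with the number of unhealthy nodes. As a rough worked example, assuming repairs are strictly sequential and each one falls within the 5-10 minute range stated above, three nodes would take about 15 to 30 minutes. The helper name repair_window below is hypothetical.

```shell
# Hypothetical helper: estimate the total repair window, in minutes, for a
# given number of nodes, assuming sequential repairs of 5-10 minutes each.
repair_window() {
  echo "$(( $1 * 5 ))-$(( $1 * 10 )) minutes"
}

# Usage:
#   repair_window 3
```

This is only an estimate. Actual repair times are approximate and can vary.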

Kubernetes Engine generates an entry in its operation logs for any automated repair event. You can check the logs by using the gcloud container operations list command.
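To narrow the operations list to repair events, you can pass a filter to the same command. The sketch below assumes that repair operations carry the operation type AUTO_REPAIR_NODES and that the field is exposed to the filter as operationType. Verify both against your own operations list output. The helper name list_repair_operations and the GCLOUD override variable are hypothetical and exist only to make the command easy to preview.

```shell
# Hypothetical helper: list only auto-repair operations.
# Assumption: repair events have operationType AUTO_REPAIR_NODES.
# Set GCLOUD=echo to print the command arguments instead of running gcloud.
list_repair_operations() {
  "${GCLOUD:-gcloud}" container operations list \
    --filter="operationType=AUTO_REPAIR_NODES"
}

# Usage:
#   list_repair_operations
```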

Enabling node auto-repair

You enable node auto-repair separately for each node pool. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool.

Console

To create a cluster in which the default node pool has node auto-repair enabled, perform the following steps:

  1. Visit the Kubernetes Engine menu in the Google Cloud Platform Console.

  2. Click Create cluster.

  3. Configure your cluster as desired. Then, from the Automatic node repair drop-down menu, select Enabled.
  4. Click Create.

To create a node pool with node auto-repair enabled:

  1. Visit the Kubernetes Engine menu in the Google Cloud Platform Console.

  2. Select the desired cluster.

  3. Click Edit.
  4. From the Node pools menu, click Add node pool.
  5. Configure your node pool as desired. Then, from the Automatic node repair drop-down menu, select Enabled.
  6. Click Save.

To enable node auto-repair for an existing node pool:

  1. Visit the Kubernetes Engine menu in the Google Cloud Platform Console.

  2. Select the desired cluster.

  3. Click Edit.
  4. From the Node pools menu, click the Edit icon beside the desired node pool.
  5. From the Automatic node repair drop-down menu, select Enabled.
  6. Click Save.

gcloud

To create a cluster or node pool with node auto-repair enabled, specify the --enable-autorepair option when you create your cluster or node pool using the gcloud command-line tool.

To create a cluster with auto-repair enabled, run the following command:

gcloud beta container clusters create [CLUSTER-NAME] --zone [COMPUTE-ZONE] --enable-autorepair

To create a node pool with auto-repair enabled:

gcloud beta container node-pools create [POOL-NAME] --cluster [CLUSTER-NAME] --zone [COMPUTE-ZONE] --enable-autorepair

To enable auto-repair for an existing node pool:

gcloud beta container node-pools update [POOL-NAME] --cluster [CLUSTER-NAME] --zone [COMPUTE-ZONE] --enable-autorepair
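After you run one of these commands, you might want to confirm the setting. The following is a minimal sketch that assumes the node pool resource exposes the setting as management.autoRepair. The helper name autorepair_status and the GCLOUD override variable are hypothetical. It describes one node pool and prints only the auto-repair value.

```shell
# Hypothetical helper: print the auto-repair setting of one node pool.
# Arguments: POOL-NAME CLUSTER-NAME COMPUTE-ZONE
# Assumption: the setting is exposed as management.autoRepair.
# Set GCLOUD=echo to print the command arguments instead of running gcloud.
autorepair_status() {
  "${GCLOUD:-gcloud}" container node-pools describe "$1" \
    --cluster "$2" --zone "$3" \
    --format="value(management.autoRepair)"
}

# Usage:
#   autorepair_status [POOL-NAME] [CLUSTER-NAME] [COMPUTE-ZONE]
```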

Disabling node auto-repair

Console

To disable node auto-repair for an existing node pool, perform the following steps:

  1. Visit the Kubernetes Engine menu in the Google Cloud Platform Console.

  2. Select the desired cluster.

  3. Click Edit.
  4. From the Node pools menu, click the Edit icon beside the desired node pool.
  5. From the Automatic node repair drop-down menu, select Disabled.
  6. Click Save.

gcloud

To disable auto-repair for a given node pool, run the following command:

gcloud beta container node-pools update [POOL-NAME] --cluster [CLUSTER-NAME] --zone [COMPUTE-ZONE] --no-enable-autorepair
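Because auto-repair is set per node pool, disabling it for a whole cluster means repeating the command for each pool. The sketch below wraps the command above in a loop. The helper name disable_autorepair and the GCLOUD override variable are hypothetical, and the function stops at the first pool whose update fails.

```shell
# Hypothetical helper: disable auto-repair on several node pools in one cluster.
# Arguments: CLUSTER-NAME COMPUTE-ZONE POOL-NAME [POOL-NAME ...]
# Set GCLOUD=echo to print the command arguments instead of running gcloud.
disable_autorepair() {
  cluster=$1
  zone=$2
  shift 2
  for pool in "$@"; do
    # Same command as shown above, run once per pool; stop on the first failure.
    "${GCLOUD:-gcloud}" beta container node-pools update "$pool" \
      --cluster "$cluster" --zone "$zone" --no-enable-autorepair || return 1
  done
}

# Usage:
#   disable_autorepair [CLUSTER-NAME] [COMPUTE-ZONE] [POOL-NAME] [POOL-NAME]
```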
