Node Auto-Repair on Container Engine

Container Engine's Node Auto-Repair feature helps you keep the nodes in your cluster in a healthy, running state. When the feature is enabled, Container Engine periodically checks the health of each node in your cluster. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), Container Engine initiates a repair process for that node.

Repair Criteria

Container Engine uses a node's health status to determine whether the node needs to be repaired. A node reporting a Ready status is considered healthy. Container Engine triggers a repair action if a node is unhealthy on consecutive checks for a given time threshold (approximately 10 minutes). An unhealthy status can mean:

  • A node reports a NotReady status on consecutive checks over the given time threshold.
  • A node does not report any status at all over the given time threshold.
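The criteria above can be sketched as a small shell function. This is only an illustration, not Container Engine's actual implementation: the function name needs_repair, the REPAIR_THRESHOLD_SECONDS variable, and the idea of tracking the time of the last Ready report are all assumptions made for this example. Both unhealthy cases (NotReady reports and no reports at all) reduce to "no Ready report for at least the threshold".

```shell
# Hypothetical sketch of the repair criteria; not Container Engine's code.

# Threshold in seconds. 600 mirrors the "approximately 10 minutes" figure.
REPAIR_THRESHOLD_SECONDS=600

# needs_repair NOW LAST_READY
#   NOW         current time, in seconds since the epoch
#   LAST_READY  time of the node's most recent Ready report, same units
# Prints "repair" if the node has gone without a Ready report for at least
# the threshold (it reported NotReady, or nothing at all, the whole time).
# Prints "no-repair" otherwise.
needs_repair() {
  if [ $(( $1 - $2 )) -ge "$REPAIR_THRESHOLD_SECONDS" ]; then
    echo "repair"
  else
    echo "no-repair"
  fi
}
```

For example, needs_repair 1000 400 prints repair, because 600 seconds have passed since the last Ready report.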

Node Repair Process

If Container Engine detects that a node requires repair, it first drains the node and then re-creates the node's VM. The drain might not succeed if the node is unresponsive or is too unhealthy to process the drain command.

If multiple nodes require repair, Container Engine repairs one node at a time, with each repair lasting approximately 5-10 minutes. If you disable node auto-repair at any time during the repair process, the in-progress repairs are not cancelled and will still complete for any node currently under repair.
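Because repairs run one node at a time, the total repair time grows linearly with the number of affected nodes. The following sketch turns the approximate 5-10 minute figure into a rough range; the function name estimate_repair_minutes is hypothetical, and the result is an estimate rather than a guarantee.

```shell
# estimate_repair_minutes NODE_COUNT
# Prints a rough "MIN-MAX" range, in minutes, for repairing NODE_COUNT
# nodes sequentially at approximately 5-10 minutes per node.
estimate_repair_minutes() {
  echo "$(( $1 * 5 ))-$(( $1 * 10 ))"
}
```

For example, estimate_repair_minutes 3 prints 15-30.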

Container Engine generates an entry in its operation logs for each automated repair event. You can view these entries by running the gcloud container operations list command.
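To narrow the operations list to repair events, you can wrap the command in a small helper. This is a sketch under assumptions: the helper name list_repair_operations is hypothetical, and the AUTO_REPAIR_NODES operation type used in the filter is assumed here rather than taken from this page, so check the output of an unfiltered listing in your own project before relying on it.

```shell
# list_repair_operations ZONE
# Lists operations in ZONE and keeps only automated node repairs.
# Assumption: repair operations carry the type AUTO_REPAIR_NODES.
list_repair_operations() {
  gcloud container operations list \
    --zone "$1" \
    --filter "operationType=AUTO_REPAIR_NODES"
}

# Usage (ZONE is a placeholder for your cluster's zone):
#   list_repair_operations ZONE
```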

Enabling Auto-Repair

You enable node auto-repair on a per-node-pool basis. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for each of them, independent of the auto-repair setting for the default node pool.

Creating a Cluster or Node Pool with Auto-Repair Enabled

To create a cluster or node pool with node auto-repair enabled, specify the --enable-autorepair option when you create your cluster or node pool using the gcloud command-line tool.

To create a cluster with auto-repair enabled, run the following command in your shell or terminal window:

gcloud beta container clusters create CLUSTER --zone ZONE --enable-autorepair

To create a node pool with auto-repair enabled, run the following command in your shell or terminal window:

gcloud beta container node-pools create NODEPOOL --cluster CLUSTER --zone ZONE --enable-autorepair

Enabling or Disabling Auto-Repair for an Existing Node Pool

To enable or disable auto-repair for an existing node pool, use the gcloud beta container node-pools update command and specify the --enable-autorepair or --no-enable-autorepair option, as appropriate.

To enable auto-repair for a given node pool:

gcloud beta container node-pools update NODEPOOL --cluster CLUSTER --zone ZONE --enable-autorepair

To disable auto-repair for a given node pool:

gcloud beta container node-pools update NODEPOOL --cluster CLUSTER --zone ZONE --no-enable-autorepair
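After changing the setting, you may want to confirm what the node pool currently reports. The helper below is a minimal sketch: the name autorepair_status is hypothetical, and the management.autoRepair field path in the --format expression is an assumption about how the node pool resource exposes the setting.

```shell
# autorepair_status NODEPOOL CLUSTER ZONE
# Prints the auto-repair setting that gcloud reports for the node pool.
# Assumption: the setting is exposed as management.autoRepair.
autorepair_status() {
  gcloud container node-pools describe "$1" \
    --cluster "$2" \
    --zone "$3" \
    --format "value(management.autoRepair)"
}

# Usage (all three arguments are placeholders):
#   autorepair_status NODEPOOL CLUSTER ZONE
```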
