Container Engine's Node Auto-Repair feature helps you keep the nodes in your cluster in a healthy, running state. When enabled, Container Engine makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), Container Engine initiates a repair process for that node.
Container Engine uses the node's health status to determine whether a node needs to be repaired. A node reporting a Ready status is considered healthy. Container Engine triggers a repair action if a node reports an unhealthy status on consecutive checks for a given time threshold (approximately 10 minutes). An unhealthy status can mean:
- A node reports a NotReady status on consecutive checks over the given time threshold.
- A node does not report any status at all over the given time threshold.
- A node's boot disk is out of disk space for an extended time period.
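The decision rule above can be modeled as a small function. This is a minimal sketch, not Container Engine's actual implementation. The StatusSample type, the needs_repair function, and the fixed 10-minute threshold are assumptions made for illustration. The model treats three conditions as unhealthy: a NotReady status, a missing status (None), or a full boot disk. It reports that a repair is due only when the most recent unbroken run of unhealthy checks spans the threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: a fixed threshold; the documentation only says "approximately 10 minutes".
UNHEALTHY_THRESHOLD = timedelta(minutes=10)


@dataclass
class StatusSample:
    """One health check result for a node (hypothetical model)."""
    at: datetime
    status: Optional[str]  # "Ready", "NotReady", or None if the node reported nothing
    disk_full: bool = False  # True if the boot disk is out of space


def is_unhealthy(sample: StatusSample) -> bool:
    # NotReady, no status at all, or a full boot disk all count as unhealthy.
    return sample.status != "Ready" or sample.disk_full


def needs_repair(samples: List[StatusSample],
                 threshold: timedelta = UNHEALTHY_THRESHOLD) -> bool:
    """True if the latest run of consecutive unhealthy checks spans at least `threshold`."""
    streak_start: Optional[datetime] = None
    for sample in sorted(samples, key=lambda s: s.at):
        if not is_unhealthy(sample):
            streak_start = None  # any healthy check resets the streak
        elif streak_start is None:
            streak_start = sample.at  # first unhealthy check of a new streak
    if streak_start is None:
        return False  # no samples, or the latest check was healthy
    latest = max(s.at for s in samples)
    return latest - streak_start >= threshold
```

Under this model any healthy check resets the streak. A node that briefly returns to Ready within the window would therefore not trigger a repair.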
Node Repair Process
If Container Engine detects that a node requires repair, that node will first be drained, and then Container Engine will re-create the node VM. The drain might not succeed if the node is unresponsive or is too unhealthy to process the drain command.
If multiple nodes require repair, Container Engine repairs one node at a time, with each repair lasting approximately 5-10 minutes. If you disable node auto-repair at any time during the repair process, the in-progress repairs are not cancelled and will still complete for any node currently under repair.
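The repair sequence can be sketched as a loop that handles one node at a time. The function repair_nodes_sequentially and its drain, recreate, and auto_repair_enabled callbacks are hypothetical. Two behaviors in this sketch are assumptions inferred from the text rather than documented guarantees. First, the VM is re-created even when the drain fails. Second, disabling auto-repair only prevents repairs that have not started yet.

```python
from typing import Callable, Iterable, List


def repair_nodes_sequentially(
    nodes: Iterable[str],
    drain: Callable[[str], bool],
    recreate: Callable[[str], None],
    auto_repair_enabled: Callable[[], bool],
) -> List[str]:
    """Repair nodes one at a time and return a log of the actions taken (hypothetical model)."""
    log: List[str] = []
    for node in nodes:  # strictly sequential: one repair at a time
        # Assumption: disabling auto-repair stops repairs that have not started;
        # a repair already in progress always runs to completion.
        if not auto_repair_enabled():
            log.append(f"skip {node}: auto-repair disabled")
            break
        # drain() returns False when the node is too unhealthy to be drained.
        drained = drain(node)
        log.append(f"drain {node}: {'ok' if drained else 'failed'}")
        # Assumption: the VM is re-created whether or not the drain succeeded.
        recreate(node)
        log.append(f"recreate {node}")
    return log
```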
Container Engine generates an entry in its operation logs for every automated repair event. You can check the logs by using the gcloud container operations commands.
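A script can pick out automated repair events from the operation logs by requesting JSON output and filtering it. This sketch makes three assumptions. It uses the gcloud container operations list subcommand with the global --format=json flag. It expects each operation in the output to carry the Container Engine API's operationType field. It also expects automated repairs to be reported with the type AUTO_REPAIR_NODES. The helper names are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, List


def auto_repair_operations(operations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the operations recorded for automated node repairs."""
    # Assumption: repair events carry operationType == "AUTO_REPAIR_NODES".
    return [op for op in operations if op.get("operationType") == "AUTO_REPAIR_NODES"]


def list_operations() -> List[Dict[str, Any]]:
    """Run `gcloud container operations list` and parse its JSON output."""
    result = subprocess.run(
        ["gcloud", "container", "operations", "list", "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, auto_repair_operations(list_operations()) returns only the repair entries. The list_operations helper needs an installed and authenticated gcloud tool, but the filter itself is plain Python and works on any list of parsed operations.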
You enable node auto-repair on a per-node-pool basis. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool.
Creating a Cluster or Node Pool with Auto-Repair Enabled
To create a cluster or node pool with node auto-repair enabled, specify the --enable-autorepair option when you create your cluster or node pool using the gcloud command-line tool.
To create a cluster with auto-repair enabled, run the following command in your shell or terminal window:
gcloud beta container clusters create CLUSTER --zone ZONE --enable-autorepair
To create a node pool with auto-repair enabled, run the following command in your shell or terminal window:
gcloud beta container node-pools create NODEPOOL --cluster CLUSTER --zone ZONE --enable-autorepair
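If you create clusters from a script rather than an interactive shell, you can assemble the two commands above as argument lists for Python's subprocess module. This is a sketch. The create_command helper is hypothetical, and it uses exactly the flags shown above.

```python
from typing import List, Optional


def create_command(cluster: str, zone: str, node_pool: Optional[str] = None) -> List[str]:
    """Build the gcloud argv that creates a cluster (or, if node_pool is set,
    a node pool in that cluster) with node auto-repair enabled."""
    if node_pool is None:
        args = ["gcloud", "beta", "container", "clusters", "create", cluster]
    else:
        args = ["gcloud", "beta", "container", "node-pools", "create", node_pool,
                "--cluster", cluster]
    return args + ["--zone", zone, "--enable-autorepair"]
```

For example, subprocess.run(create_command("my-cluster", "us-central1-a", "pool-2"), check=True) would create the node pool pool-2 in the cluster my-cluster. The cluster, zone, and pool names here are placeholders.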
Enabling or Disabling Auto-Repair for an Existing Node Pool
To enable or disable auto-repair for an existing node pool, use the gcloud beta container node-pools update command and specify the --enable-autorepair or --no-enable-autorepair option, as appropriate.
To enable auto-repair for a given node pool:
gcloud beta container node-pools update NODEPOOL --cluster CLUSTER --zone ZONE --enable-autorepair
To disable auto-repair for a given node pool:
gcloud beta container node-pools update NODEPOOL --cluster CLUSTER --zone ZONE --no-enable-autorepair
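In automation, it can be convenient to choose between the two update flags from a boolean setting. The helper below is a hypothetical sketch. It builds the same argument list as the commands above, which you could then pass to subprocess.run.

```python
from typing import List


def set_autorepair_command(node_pool: str, cluster: str, zone: str, enabled: bool) -> List[str]:
    """Build the gcloud argv that turns auto-repair on or off for an existing node pool."""
    # Pick the positive or negated form of the same flag.
    flag = "--enable-autorepair" if enabled else "--no-enable-autorepair"
    return ["gcloud", "beta", "container", "node-pools", "update", node_pool,
            "--cluster", cluster, "--zone", zone, flag]
```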