Automatic Node Repair

The node auto repair feature continuously monitors the health of each node in a node pool. If a node becomes unhealthy, the node auto repair feature repairs it automatically. This feature decreases the likelihood of cluster outages and performance degradation, and it minimizes the need for manual maintenance of your clusters.

You can enable node auto repair when creating or updating a node pool. Note that you enable or disable this feature on node pools rather than on individual nodes.

Unhealthy node conditions

Node auto repair examines the health status of each node to determine if it requires repair. A node is considered healthy if it reports a Ready status. Otherwise, if it consecutively reports an unhealthy status for a specific duration, repairs are initiated.

An unhealthy status can arise from a NotReady state, detected in consecutive checks over approximately 15 minutes. Alternatively, an unhealthy status may result from depleted boot disk space, identified over a period of approximately 30 minutes.

You can manually check your node's health signals at any time by running the kubectl get nodes command.
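
As a rough illustration of that check, the following sketch filters the default kubectl get nodes table for nodes whose STATUS column does not begin with Ready. The helper name list_unready_nodes is made up for this example, and the sketch assumes the default column layout in which the node name comes first and the status second. It only inspects the Ready condition at one moment in time; it does not reproduce the consecutive checks or the boot disk check that node auto repair performs.

```shell
# Hypothetical helper (not part of kubectl or gcloud): reads the output of
# "kubectl get nodes --no-headers" on stdin and prints the name of every
# node whose STATUS column does not start with "Ready".
# A status such as "Ready,SchedulingDisabled" still counts as Ready here.
list_unready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}
```

To use the helper, pipe the node table into it: kubectl get nodes --no-headers | list_unready_nodes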

Node repair strategies

Node auto repair follows certain strategies to ensure both the overall health of the cluster and the availability of applications during the repair process. This section describes how the node auto repair feature honors PodDisruptionBudget configurations, respects the Pod Termination Grace Period, and takes other measures that minimize cluster disruption when repairing nodes.

Honor PodDisruptionBudget for 30 minutes

If a node requires repair, it isn't instantly drained and re-created. Instead, the node auto repair feature honors PodDisruptionBudget (PDB) configurations for up to 30 minutes, after which all the Pods on the node are deleted. (A PDB configuration defines, among other things, the minimum number of replicas of a particular Pod that must be available at any given time.)

By honoring the PodDisruptionBudget for approximately 30 minutes, the node auto repair feature provides a window of opportunity for Pods to be safely rescheduled and redistributed across other healthy nodes in the cluster. This helps maintain the desired level of application availability during the repair process.

After the 30-minute time limit, node auto repair proceeds with the repair process, even if it means violating the PodDisruptionBudget. Without a time limit, the repair process could stall indefinitely if the PodDisruptionBudget configuration prevents the evictions necessary for a repair.
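
For readers who have not defined a PDB before, here is a minimal sketch of one. It writes a manifest that asks Kubernetes to keep at least two Pods labeled app: web available during voluntary disruptions. The helper name write_example_pdb, the PDB name web-pdb, the label, and the replica count are all placeholders chosen for this illustration and do not come from the GKE on Azure documentation.

```shell
# Hypothetical helper: writes an example PodDisruptionBudget manifest to the
# file given as the first argument (default: pdb.yaml).
# "web-pdb", "app: web", and minAvailable: 2 are illustrative values only.
write_example_pdb() {
  cat > "${1:-pdb.yaml}" <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
}
```

After writing the file, you could apply it with kubectl apply -f pdb.yaml. While node auto repair honors the budget, evictions that would drop the matching Pods below two are blocked; once the time limit passes, the repair proceeds regardless.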

Honor the Pod Termination Grace Period

The node auto repair feature also honors a Pod Termination Grace Period of approximately 30 minutes. The Pod Termination Grace Period provides Pods with a window of time for a graceful shutdown during termination. During the grace period, the kubelet on a node is responsible for executing cleanup tasks and freeing resources associated with the Pods on that node. The node auto repair feature allows up to 30 minutes for the kubelet to complete this cleanup. If the allotted 30 minutes elapse, the node is forced to terminate, regardless of whether the Pods have gracefully terminated.

Additional node repair strategies

Node auto repair also implements the following strategies:

  • If multiple nodes require repair, they are repaired one at a time to limit cluster disruption and to protect workloads.
  • If you disable node auto repair during the repair process, in-progress repairs nonetheless continue until the repair operation succeeds or fails.

How to enable and disable automatic node repair

You can enable or disable node auto repair when creating or updating a node pool. You enable or disable this feature on node pools rather than on individual nodes.

Enable auto repair for a new node pool

gcloud container azure node-pools create NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION \
   --node-version NODE_VERSION \
   --vm-size VM_SIZE \
   --max-pods-per-node 110 \
   --min-nodes MIN_NODES \
   --max-nodes MAX_NODES \
   --azure-availability-zone AZURE_ZONE \
   --ssh-public-key "SSH_PUBLIC_KEY" \
   --subnet-id SUBNET_ID \
   --enable-autorepair

Replace the following:

  • NODE_POOL_NAME: a unique name for your node pool — for example, node-pool-1
  • CLUSTER_NAME: the name of your GKE on Azure cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud location that manages your cluster
  • NODE_VERSION: the Kubernetes version to install on each node in the node pool (e.g., "1.28.3-gke.700")
  • VM_SIZE: a supported Azure VM size
  • MIN_NODES: the minimum number of nodes in the node pool — for more information, see Cluster autoscaler
  • MAX_NODES: the maximum number of nodes in the node pool
  • AZURE_ZONE: the Azure availability zone where GKE on Azure launches the node pool — for example, 3
  • SSH_PUBLIC_KEY: the text of your SSH public key
  • SUBNET_ID: the ID of the node pool's subnet

Enable auto repair for an existing node pool

To enable node auto repair on an existing node pool, run the following command:

gcloud container azure node-pools update NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION \
   --enable-autorepair

Replace the following:

  • NODE_POOL_NAME: a unique name for your node pool — for example, node-pool-1
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster

Disable auto repair for an existing node pool

gcloud container azure node-pools update NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION \
   --no-enable-autorepair

Replace the following:

  • NODE_POOL_NAME: a unique name for your node pool — for example, node-pool-1
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster

Note that GKE on Azure performs graceful node auto repair disablement. When disabling node auto repair for an existing node pool, GKE on Azure launches an update node pool operation. The operation waits for any existing node repairs to complete before it proceeds.

Check whether node auto repair is enabled

Run the following command to check whether or not node auto repair is enabled:

gcloud container azure node-pools describe NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION

Replace the following:

  • NODE_POOL_NAME: a unique name for your node pool — for example, node-pool-1
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster
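
The describe command prints the full node pool resource. If you only want the auto repair setting, a sketch like the following narrows the output with gcloud's --format flag. It assumes the setting appears in the describe output under management.autoRepair, which you should confirm against your own output; the wrapper name autorepair_status is hypothetical.

```shell
# Hypothetical wrapper: prints only the node pool's auto repair setting.
# Assumption: the describe output exposes the setting as
# management.autoRepair. If the field is absent, nothing is printed.
autorepair_status() {
  local node_pool="$1" cluster="$2" location="$3"
  gcloud container azure node-pools describe "$node_pool" \
    --cluster "$cluster" \
    --location "$location" \
    --format="value(management.autoRepair)"
}
```

For example: autorepair_status node-pool-1 CLUSTER_NAME GOOGLE_CLOUD_LOCATION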

Node repair history

You can view the history of repairs performed on a node pool by running the following command:

gcloud container azure operations list \
   --location GOOGLE_CLOUD_LOCATION \
   --filter="metadata.verb=repair AND metadata.target=projects/PROJECT_ID/locations/GOOGLE_CLOUD_LOCATION/azureClusters/CLUSTER_NAME/azureNodePools/NODE_POOL_NAME"

Replace the following:

  • GOOGLE_CLOUD_LOCATION: the supported Google Cloud region that manages your cluster — for example, us-west1
  • PROJECT_ID: your Google Cloud project
  • CLUSTER_NAME: the name of your cluster
  • NODE_POOL_NAME: a unique name for your node pool — for example, node-pool-1
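
Because the filter expression is long and easy to mistype, one option is to assemble it from its parts. The helper below is a sketch with a made-up name, repair_filter; it only builds the string shown in the command above and does not call gcloud itself.

```shell
# Hypothetical helper: builds the --filter expression for repair operations
# from the project ID, location, cluster name, and node pool name.
repair_filter() {
  local project="$1" location="$2" cluster="$3" node_pool="$4"
  printf 'metadata.verb=repair AND metadata.target=projects/%s/locations/%s/azureClusters/%s/azureNodePools/%s' \
    "$project" "$location" "$cluster" "$node_pool"
}
```

You could then pass the result to the list command, for example: gcloud container azure operations list --location us-west1 --filter="$(repair_filter PROJECT_ID us-west1 CLUSTER_NAME node-pool-1)"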

Node pool health summary

Once you've enabled node auto repair, you can generate a node pool health summary by running the following command:

gcloud container azure node-pools describe NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION

A node pool health summary looks similar to this sample:

{
  "name": "some-np-name",
  "version": "some-version",
  "state": "RUNNING",

  ...

  "errors": [
    {
      "message": "1 node(s) is/are identified as unhealthy among 2 total node(s) in the node pool. No node is under repair."
    }
  ]
}

The node pool health summary helps you understand the current state of the node pool. In this example, the summary contains an error message which states that one of the two nodes in the node pool is unhealthy. It also reports that no nodes are currently undergoing the repair process.
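
If you want to use the summary in a script, a sketch along these lines could turn it into an exit status. It assumes that the errors field is empty or missing when every node is healthy, which the sample above suggests but does not guarantee. The function name node_pool_is_healthy is hypothetical.

```shell
# Hypothetical check: succeeds when the node pool reports no errors.
# Assumption: a healthy node pool has an empty or missing "errors" field.
# Returns 0 if no error messages are found, 1 if any are found (printing
# them to stderr), and 2 if the gcloud command itself fails.
node_pool_is_healthy() {
  local node_pool="$1" cluster="$2" location="$3"
  local messages
  messages="$(gcloud container azure node-pools describe "$node_pool" \
    --cluster "$cluster" \
    --location "$location" \
    --format="value(errors[].message)")" || return 2
  if [ -n "$messages" ]; then
    printf '%s\n' "$messages" >&2
    return 1
  fi
}
```

For example: node_pool_is_healthy node-pool-1 CLUSTER_NAME GOOGLE_CLOUD_LOCATION || echo "node pool needs attention"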