Automatic Node Repair

The node auto repair feature continuously monitors the health of each node in a node pool. If a node becomes unhealthy, the node auto repair feature repairs it automatically. This feature decreases the likelihood of cluster outages and performance degradation, and it minimizes the need for manual maintenance of your clusters.

You can enable node auto repair when creating or updating a node pool. Note that you enable or disable this feature on node pools rather than on individual nodes.

Unhealthy node conditions

Node auto repair examines the health status of each node to determine if it requires repair. A node is considered healthy if it reports a Ready status. If a node instead reports an unhealthy status on consecutive checks for a specific duration, node auto repair initiates a repair.

An unhealthy status can arise from a NotReady state, detected in consecutive checks over approximately 15 minutes. Alternatively, an unhealthy status may result from depleted boot disk space, identified over a period of approximately 30 minutes.

You can manually check your node's health signals at any time by running the kubectl get nodes command.
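
As a rough illustration of that manual check, the following sketch filters the output of kubectl get nodes down to the nodes that aren't Ready. The helper name not_ready_nodes is hypothetical, and it assumes the default column layout (NAME, STATUS, ROLES, AGE, VERSION). It reports only the current status, not how long a node has been unhealthy.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the names of nodes whose STATUS column does not start with "Ready".
# A status such as "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```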

Node repair strategies

Node auto repair follows certain strategies to ensure both the overall health of the cluster and the availability of applications during the repair process. This section describes how the node auto repair feature honors PodDisruptionBudget configurations, respects the Pod Termination Grace Period, and takes other measures that minimize cluster disruption when repairing nodes.

Honor PodDisruptionBudget for 30 minutes

If a node requires repair, it isn't instantly drained and re-created. Instead, the node auto repair feature honors PodDisruptionBudget (PDB) configurations for up to 30 minutes, after which all the Pods on the node are deleted. (A PDB configuration defines, among other things, the minimum number of replicas of a particular Pod that must be available at any given time.)

By honoring the PodDisruptionBudget for approximately 30 minutes, the node auto repair feature provides a window of opportunity for Pods to be safely rescheduled and redistributed across other healthy nodes in the cluster. This helps maintain the desired level of application availability during the repair process.

After the 30-minute time limit, node auto repair proceeds with the repair process, even if it means violating the PodDisruptionBudget. Without a time limit, the repair process could stall indefinitely if the PodDisruptionBudget configuration prevents the evictions necessary for a repair.
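
For readers who haven't defined a PDB before, here is a minimal sketch of one. The helper name write_example_pdb, the PDB name web-pdb, the app: web label, and the minAvailable value of 2 are all assumptions chosen for illustration; adjust them to match your own workload.

```shell
# Hypothetical helper: writes an illustrative PodDisruptionBudget manifest
# to the path given as the first argument.
# The name "web-pdb", the label "app: web", and minAvailable: 2 are assumptions.
write_example_pdb() {
  cat > "$1" <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
}

# Usage (requires kubectl access to the cluster):
#   write_example_pdb pdb.yaml && kubectl apply -f pdb.yaml
```

With a PDB like this in place, evictions that would leave fewer than two matching Pods available are refused, and node auto repair honors that refusal for up to 30 minutes.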

Honor the Pod Termination Grace Period

The node auto repair feature also honors a Pod Termination Grace Period of approximately 30 minutes. The Pod Termination Grace Period provides Pods with a window of time for a graceful shutdown during termination. During the grace period, the kubelet on a Node is responsible for executing cleanup tasks and freeing resources associated with the Pods on that Node. The node auto repair feature allows up to 30 minutes for the kubelet to complete this cleanup. If the allotted 30 minutes elapse, the Node is forced to terminate, regardless of whether the Pods have gracefully terminated.
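
One way to reason about this limit is to compare a Pod's configured grace period against it. The sketch below is only an illustration: the helper name fits_repair_grace_limit and the variable REPAIR_GRACE_LIMIT_SECONDS are hypothetical, and the 1800-second value is simply the approximately 30 minutes described above expressed in seconds.

```shell
# 30 minutes, the limit described above, expressed in seconds (an approximation).
REPAIR_GRACE_LIMIT_SECONDS=1800

# Hypothetical helper: succeeds if the given grace period (in seconds)
# fits within the 30-minute limit, and fails otherwise.
fits_repair_grace_limit() {
  [ "$1" -le "$REPAIR_GRACE_LIMIT_SECONDS" ]
}

# Usage (POD_NAME is a placeholder; requires kubectl access to the cluster):
#   secs="$(kubectl get pod POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}')"
#   fits_repair_grace_limit "$secs" || echo "grace period exceeds 30 minutes"
```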

Additional node repair strategies

Node auto repair also implements the following strategies:

  • If multiple nodes require repair, they are repaired one at a time to limit cluster disruption and to protect workloads.
  • If you disable node auto repair during the repair process, in-progress repairs nonetheless continue until the repair operation succeeds or fails.

How to enable and disable automatic node repair

You can enable or disable node auto repair when creating or updating a node pool. You enable or disable this feature on node pools rather than on individual nodes.

Enable auto repair for a new node pool

gcloud container aws node-pools create NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --instance-type INSTANCE_TYPE \
   --root-volume-size ROOT_VOLUME_SIZE \
   --iam-instance-profile NODEPOOL_PROFILE \
   --node-version NODE_VERSION \
   --min-nodes MIN_NODES \
   --max-nodes MAX_NODES \
   --max-pods-per-node MAX_PODS_PER_NODE \
   --location GOOGLE_CLOUD_LOCATION \
   --subnet-id NODEPOOL_SUBNET \
   --ssh-ec2-key-pair SSH_KEY_PAIR_NAME \
   --config-encryption-kms-key-arn CONFIG_KMS_KEY_ARN \
   --tags "Name=CLUSTER_NAME-NODE_POOL_NAME" \
   --enable-autorepair

Replace the following:

  • NODE_POOL_NAME: a name you choose for your node pool. To get the names of your node pools, run the command gcloud container aws node-pools list --cluster CLUSTER_NAME --location GOOGLE_CLOUD_LOCATION
  • CLUSTER_NAME: the name of the cluster to attach the node pool to
  • INSTANCE_TYPE: the desired AWS machine instance type for this node pool (for example, m5.large)
  • ROOT_VOLUME_SIZE: the desired size for each node's root volume, in GiB
  • NODEPOOL_PROFILE: the IAM instance profile for node pool VMs
  • NODE_VERSION: the Kubernetes version to install on each node in the node pool (e.g., "1.31.1-gke.1800")
  • MIN_NODES: the minimum number of nodes the node pool can contain
  • MAX_NODES: the maximum number of nodes the node pool can contain
  • MAX_PODS_PER_NODE: the maximum number of pods that can be created on any single node in the pool
  • GOOGLE_CLOUD_LOCATION: the name of the Google Cloud location from which this node pool will be managed
  • NODEPOOL_SUBNET: the ID of the subnet the node pool will run on
    • There must not be any overlap between the cluster's Pod/Service IP ranges and the node pool subnet network. For more information on selecting Pod and Service IP ranges for your cluster, see Select CIDR ranges for your cluster.
    • If this subnet is outside of the VPC primary CIDR block, some additional steps are needed. For more information, see security groups.
  • SSH_KEY_PAIR_NAME: the name of the AWS SSH key pair created for SSH access (optional)
  • CONFIG_KMS_KEY_ARN: the Amazon Resource Name (ARN) of the AWS KMS key that encrypts user data

Enable auto repair for an existing node pool

To enable node auto repair on an existing node pool, run the following command:

gcloud container aws node-pools update NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION \
   --enable-autorepair

Replace the following:

  • NODE_POOL_NAME: the name of your existing node pool (for example, node-pool-1)
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster

Disable auto repair for an existing node pool

gcloud container aws node-pools update NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION \
   --no-enable-autorepair

Replace the following:

  • NODE_POOL_NAME: the name of your existing node pool (for example, node-pool-1)
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster

Note that GKE on AWS performs graceful node auto repair disablement. When disabling node auto repair for an existing node pool, GKE on AWS launches an update node pool operation. The operation waits for any existing node repairs to complete before it proceeds.

Check whether node auto repair is enabled

Run the following command to check whether or not node auto repair is enabled:

gcloud container aws node-pools describe NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION

Replace the following:

  • NODE_POOL_NAME: the name of your existing node pool (for example, node-pool-1)
  • CLUSTER_NAME: the name of your cluster
  • GOOGLE_CLOUD_LOCATION: the Google Cloud region that manages your cluster
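
The describe command prints the whole node pool resource. To narrow the output to the auto repair setting, a sketch like the following might help. It assumes the setting is exposed as the management.autoRepair field of the node pool resource; the wrapper name autorepair_status is hypothetical. An empty result most likely means the field is unset, that is, auto repair is not enabled.

```shell
# Hypothetical wrapper: prints the node pool's management.autoRepair value.
# Arguments: NODE_POOL_NAME CLUSTER_NAME GOOGLE_CLOUD_LOCATION
# Assumes the auto repair setting lives at management.autoRepair.
autorepair_status() {
  gcloud container aws node-pools describe "$1" \
    --cluster "$2" \
    --location "$3" \
    --format="value(management.autoRepair)"
}

# Usage (the names below are placeholders):
#   autorepair_status node-pool-1 my-cluster us-west1
```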

Node repair history

You can view the history of repairs performed on a node pool by running the following command:

gcloud container aws operations list \
   --location GOOGLE_CLOUD_LOCATION \
   --filter="metadata.verb=repair AND metadata.target=projects/PROJECT_ID/locations/GOOGLE_CLOUD_LOCATION/awsClusters/CLUSTER_NAME/awsNodePools/NODE_POOL_NAME"

Replace the following:

  • GOOGLE_CLOUD_LOCATION: the supported Google Cloud region that manages your cluster — for example, us-west1
  • PROJECT_ID: your Google Cloud project
  • CLUSTER_NAME: the name of your cluster
  • NODE_POOL_NAME: the name of your existing node pool (for example, node-pool-1)
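
Because the filter is long and easy to mistype, it can help to build it from its parts. The following is a minimal sketch; the helper name repair_filter and the example values in the usage comment are hypothetical.

```shell
# Hypothetical helper: prints the operations filter for one node pool.
# Arguments: PROJECT_ID GOOGLE_CLOUD_LOCATION CLUSTER_NAME NODE_POOL_NAME
repair_filter() {
  printf 'metadata.verb=repair AND metadata.target=projects/%s/locations/%s/awsClusters/%s/awsNodePools/%s' \
    "$1" "$2" "$3" "$4"
}

# Usage (the names below are placeholders):
#   gcloud container aws operations list \
#     --location us-west1 \
#     --filter="$(repair_filter my-project us-west1 my-cluster node-pool-1)"
```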

Node pool health summary

Once you've enabled node auto repair, you can generate a node pool health summary by running the following command:

gcloud container aws node-pools describe NODE_POOL_NAME \
   --cluster CLUSTER_NAME \
   --location GOOGLE_CLOUD_LOCATION

A node pool health summary looks similar to this sample:

{
  "name": "some-np-name",
  "version": "some-version",
  "state": "RUNNING",

  ...

  "errors": [
    {
      "message": "1 node(s) is/are identified as unhealthy among 2 total node(s) in the node pool. No node is under repair."
    }
  ]
}

The node pool health summary helps you understand the current state of the node pool. In this example, the summary contains an error message which states that one of the two nodes in the node pool is unhealthy. It also reports that no nodes are currently undergoing the repair process.
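
If you want to pull the counts out of a message like the one in the sample, a sketch along these lines could work. It assumes the message keeps exactly the wording shown in the sample, which is not guaranteed to be stable; the helper name unhealthy_counts is hypothetical.

```shell
# Hypothetical helper: reads a health summary message on stdin and prints
# "UNHEALTHY TOTAL". Prints nothing if the message does not match the
# wording of the sample message.
unhealthy_counts() {
  sed -n 's/^\([0-9][0-9]*\) node(s) is\/are identified as unhealthy among \([0-9][0-9]*\) total node(s).*/\1 \2/p'
}

# Usage:
#   echo "1 node(s) is/are identified as unhealthy among 2 total node(s) in the node pool. No node is under repair." | unhealthy_counts
```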