Autopilot cluster upgrades

This page discusses how automatic upgrades work on Google Kubernetes Engine (GKE) Autopilot clusters, including links to more information about related tasks and settings. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

Automatic control plane and node upgrades

Automatic upgrades are enabled on all Autopilot clusters. GKE initiates automatic upgrades when a GKE version is selected as an auto-upgrade target, monitors those upgrades across all clusters, and intervenes if problems such as unhealthy nodes occur.

To upgrade a cluster, GKE updates the version that the control plane and nodes run. Clusters are upgraded to either a newer minor version (for example, 1.24 to 1.25) or a newer patch version (for example, 1.24.2-gke.100 to 1.24.5-gke.200). For more information, see GKE versioning and support.
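
To check which version your control plane and nodes currently run, you can describe the cluster with the gcloud CLI. The following is a minimal sketch that assumes a regional Autopilot cluster with the placeholder name CLUSTER_NAME in us-central1:

    gcloud container clusters describe CLUSTER_NAME \
        --region=us-central1 \
        --format="value(currentMasterVersion,currentNodeVersion)"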

All Autopilot clusters are enrolled in a release channel, so GKE automatically upgrades the control plane and nodes to run the same GKE version.

GKE upgrades a cluster's control plane before upgrading nodes.

Automatic control plane upgrades

All Autopilot clusters are regional clusters. Regional clusters have multiple replicas of the control plane, and GKE upgrades only one replica at a time, in an undefined order. This ensures that the cluster remains highly available during automatic upgrades. Each control plane replica is unavailable only while it is being upgraded.

If you configure a maintenance window or exclusion, GKE honors the configuration if possible.

In both Autopilot and Standard clusters, GKE can't create new nodes while a control plane upgrade is in progress. If you deploy Autopilot Pods that require new node types while a control plane upgrade is in progress, you might experience delays until the upgrade completes.
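
One way to check whether a control plane upgrade is currently running is to list in-progress cluster operations. This sketch assumes that the returned operation resources expose operationType and status fields, which the client-side filter matches on:

    gcloud container operations list \
        --region=us-central1 \
        --filter="operationType=UPGRADE_MASTER AND status=RUNNING"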

Automatic node upgrades

After GKE upgrades your Autopilot cluster control plane, GKE upgrades the nodes to the same GKE version.

In Autopilot, GKE groups nodes that share similar characteristics. GKE uses surge upgrades for Autopilot nodes, upgrading up to 20 nodes in a group at the same time. The precise number of nodes that are upgraded at the same time varies to ensure continued high availability of nodes and workloads.

Node upgrades might take several hours depending on the number of nodes and the configuration of the workloads running on those nodes. For example, workloads with long termination grace periods or restrictive PodDisruptionBudgets can extend the time that GKE needs to drain each node.

If you configure a maintenance window or exclusion, GKE honors the configuration if possible.

When GKE upgrades a node, the following steps happen:

  1. GKE creates a new surge node with the new GKE version and waits for the surge node to register with the control plane.
  2. GKE selects an existing node, the target node, to upgrade.
  3. GKE cordons the target node, preventing new Pods from being placed on the target node.
  4. GKE drains the target node, evicting existing Pods from the target node.
  5. GKE reschedules Pods that are managed by a workload controller onto other available nodes. Pods that can't be rescheduled remain in the Pending state until GKE can reschedule them.
  6. GKE deletes the target node.
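
GKE performs all of these steps automatically, but steps 3 and 4 correspond to the standard Kubernetes cordon and drain operations. As an illustration only, their manual equivalents look like the following, where NODE_NAME is a placeholder:

    kubectl cordon NODE_NAME
    kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data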

If a significant number of automatic upgrades to a specific GKE version result in unhealthy nodes across the GKE fleet, GKE stops upgrades to that version while we investigate the issue.

How versions are selected for auto-upgrade

GKE releases new minor versions regularly, but a released version isn't immediately selected for automatic upgrades. To qualify as an auto-upgrade target, the GKE version must accumulate enough usage to prove stability over time.

Google Cloud then selects that version as an auto-upgrade target for clusters that run a specific subset of older GKE versions. For example, soon after a new minor version becomes available, the oldest available minor version typically becomes unsupported. GKE upgrades clusters that run unsupported minor versions to the auto-upgrade target version.

GKE announces new auto-upgrade target versions in the release notes. Occasionally, a version is selected for control plane auto-upgrades and node auto-upgrades during different weeks. GKE automatically upgrades to new patch releases within a minor version (such as 1.21.x).

For information about the version lifecycle and versioning scheme, refer to GKE versioning and support.
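
To see which versions are available, and which version is the default, in each release channel for a given location, you can query the GKE server configuration. The region in this sketch is a placeholder:

    gcloud container get-server-config \
        --region=us-central1 \
        --format="yaml(channels)"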

Factors that affect version rollout timing

To ensure the stability and reliability of clusters on new versions, GKE follows certain practices during version rollouts.

These practices include, but are not limited to:

  • GKE gradually rolls out changes across Google Cloud regions and zones.
  • GKE gradually rolls out patch versions across release channels. A patch version soaks in the Rapid release channel, then in the Regular release channel, before being promoted to the Stable release channel after it has accumulated usage and continued to demonstrate stability. If an issue is found with a patch version during its soak time in a channel, that version isn't promoted to the next channel, and the issue is fixed in a newer patch version.
  • GKE gradually rolls out minor versions, following a similar soaking process to patch versions. Minor versions have longer soaking periods as they introduce more significant changes.
  • GKE might delay automatic upgrades when a new version affects a group of clusters. For example, GKE pauses automatic upgrades for clusters that it detects are exposed to a deprecated API or feature that will be removed in the next minor version.
  • GKE might delay the rollout of new versions during peak times (for example, major holidays) to ensure business continuity.

Configuring when auto-upgrades can occur

By default, auto-upgrades can occur at any time. Auto-upgrades are minimally disruptive, especially for Autopilot clusters. However, some workloads might require finer-grained control. You can configure maintenance windows and exclusions to control when auto-upgrades can occur and when they can't.

If you configure maintenance windows and exclusions, an upgrade occurs only when the current time falls within a maintenance window. If a maintenance window ends before the upgrade completes, GKE attempts to pause the upgrade and resumes it during the next available maintenance window.
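
As a sketch, the following gcloud CLI commands configure a recurring weekend maintenance window and a maintenance exclusion. The cluster name, region, timestamps, and exclusion name are placeholder values:

    gcloud container clusters update CLUSTER_NAME \
        --region=us-central1 \
        --maintenance-window-start=2024-01-06T03:00:00Z \
        --maintenance-window-end=2024-01-06T08:00:00Z \
        --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA,SU"

    gcloud container clusters update CLUSTER_NAME \
        --region=us-central1 \
        --add-maintenance-exclusion-name=year-end-freeze \
        --add-maintenance-exclusion-start=2024-12-20T00:00:00Z \
        --add-maintenance-exclusion-end=2025-01-05T00:00:00Z \
        --add-maintenance-exclusion-scope=no_upgrades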

Manually upgrade an Autopilot cluster

You can manually upgrade the GKE version of your Autopilot cluster control plane. GKE automatically upgrades your nodes to match the control plane version as soon as possible, subject to any maintenance windows and exclusions that you configure. For instructions, refer to Manually upgrading the control plane. You can't manually manage the node version for Autopilot clusters.

You can upgrade the control plane version to a supported minor or patch version in the same release channel, or to a patch version of the same minor version as your cluster in a different release channel.

For example, consider an Autopilot cluster running GKE version 1.22.8-gke.202 in the Regular release channel. The following behavior applies:

  • You can upgrade to any version in Regular.
  • You can upgrade to any patch version of 1.22 in the Rapid channel.

For more information about upgrading outside your channel, refer to Running patch versions from a newer channel.
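
For example, a manual control plane upgrade might look like the following, where CLUSTER_NAME and the region are placeholders and VERSION is a valid version for your channel (you can list valid versions with gcloud container get-server-config):

    gcloud container clusters upgrade CLUSTER_NAME \
        --region=us-central1 \
        --master \
        --cluster-version=VERSION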

Surge upgrades

Autopilot clusters use surge upgrades to upgrade multiple nodes at the same time. Surge upgrades reduce how disruptive version upgrades are by maintaining enough compute capacity for your running workloads throughout the upgrade. Autopilot manages the number of surge nodes that are added to the cluster during the upgrade, which varies based on the total size of the cluster. GKE also manages the total number of target nodes that can be simultaneously unavailable during the upgrade.

The number of new surge nodes and unavailable target nodes varies to ensure that your cluster always has enough compute capacity for all running workloads. You might experience minor disruptions as GKE migrates workloads from target nodes to surge nodes during the upgrade.

For a description of how surge upgrades occur, refer to Automatic node upgrades.

Quota requirements for surge upgrades

Unlike node recreation, surge upgrades require additional Compute Engine resources. Resource allocation depends on your available Compute Engine quota. Depending on your configuration, this quota can limit the number of parallel upgrades or even cause the upgrade to fail. As a good practice, to avoid scaling issues and make upgrades more predictable, ensure that your Compute Engine instance quota usage doesn't exceed 90% of the quota limit.

For more information about quota, refer to Ensure resources for node upgrades.
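
To review your current Compute Engine quota usage in a region, you can describe the region and inspect the quotas list (for example, the INSTANCES and CPUS metrics); us-central1 is a placeholder:

    gcloud compute regions describe us-central1 \
        --format="yaml(quotas)"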

Receive upgrade notifications

GKE publishes upgrade notifications to Pub/Sub, providing you with a channel to receive information from GKE about your clusters.

For more information, see Receiving cluster notifications.
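
As a minimal sketch, the following commands create a Pub/Sub topic and enable cluster notifications on it. The topic name, cluster name, project ID, and region are placeholders:

    gcloud pubsub topics create gke-notifications

    gcloud container clusters update CLUSTER_NAME \
        --region=us-central1 \
        --notification-config=pubsub=ENABLED,pubsub-topic=projects/PROJECT_ID/topics/gke-notifications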

Component upgrades

GKE runs system workloads on worker nodes to support specific capabilities for clusters. For example, the gke-metadata-server system workload supports Workload Identity Federation for GKE. GKE is responsible for the health of these workloads. To learn more about these components, refer to the documentation for the associated capabilities.

When new features or fixes become available for a component, GKE indicates the patch version in which they are included. To obtain the latest version of a component, refer to the associated documentation or release notes for instructions on upgrading your control plane or nodes to the appropriate version.
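
To see which GKE version your nodes currently run, and which GKE-managed system workloads (such as gke-metadata-server) are deployed in the kube-system namespace, you can use kubectl, for example:

    kubectl get nodes
    kubectl get daemonsets --namespace=kube-system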

What's next