Autopilot cluster upgrades

This page discusses how automatic upgrades work on Google Kubernetes Engine (GKE) Autopilot clusters, including links to more information about related tasks and settings. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

How cluster and node pool upgrades work

This section discusses what happens in your cluster during automatic upgrades. Google initiates and observes automatic upgrades across all GKE clusters, and intervenes if problems are observed.

A cluster's control plane is upgraded before its nodes.

Cluster upgrades

This section discusses what to expect when Google auto-upgrades your cluster. Clusters created in Autopilot mode are regional clusters. Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order. During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded.
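One way to observe control plane upgrades after the fact is to list upgrade operations for the region. This is a minimal sketch; COMPUTE_REGION is a placeholder, and the filter uses standard gcloud filter syntax:

    # List control plane upgrade operations in a region.
    # COMPUTE_REGION is a placeholder for your cluster's region.
    gcloud container operations list \
        --region COMPUTE_REGION \
        --filter="operationType=UPGRADE_MASTER"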

If you configure a maintenance window or exclusion, GKE honors it whenever possible.

Node pool upgrades

Your Autopilot cluster and its node pools run the same version of GKE. This section discusses what to expect when Google auto-upgrades your node pool.

Node pools are upgraded one at a time. Depending on the size of the node pool, GKE upgrades up to 20 of its nodes at the same time.

This process might take several hours depending on the number of nodes and their workload configurations. Configurations that can slow the rate of node upgrades include restrictive PodDisruptionBudgets, which cap how many replicas of a workload can be unavailable at once, and long Pod termination grace periods, which extend how long draining each node takes.
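For illustration, the following minimal sketch applies a PodDisruptionBudget that allows at most one replica of a hypothetical workload to be evicted at a time; during a node drain, GKE waits until evicting a Pod would not violate this budget. The name and app label are assumptions, not values from this page:

    # Apply a PodDisruptionBudget that limits voluntary disruption during
    # node drains. The "my-app" name and label are hypothetical placeholders.
    kubectl apply -f - <<EOF
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      maxUnavailable: 1          # at most one replica evicted at a time
      selector:
        matchLabels:
          app: my-app
    EOF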

If you configure a maintenance window or exclusion, GKE honors it whenever possible.

When a node is upgraded, the following things happen:

  1. Because surge upgrades are enabled by default, GKE creates a new surge node with the upgraded version and waits for it to register with the control plane.
  2. GKE selects an existing node to upgrade (the target node), then cordons it and starts draining it. From this point, GKE can't schedule new Pods on the target node.
  3. Pods on the target node are rescheduled onto other nodes. If a Pod can't be rescheduled, it remains in the Pending state until it can be.
  4. The target node is deleted.

If a significant number of node auto-upgrades to a given version result in unhealthy nodes across the GKE fleet, upgrades to that version are halted while the problem is investigated.
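During an upgrade, you can watch for Pods that are waiting to be rescheduled by filtering on the Pending phase. This is standard kubectl, not a GKE-specific command:

    # List Pods in any namespace that are waiting to be scheduled, for
    # example because a drained Pod has nowhere to land yet.
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending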

Automatic upgrades

When you create an Autopilot cluster, auto-upgrade is enabled on the cluster and its node pools by default. Google upgrades your clusters when a new GKE version is selected for auto-upgrade.
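For reference, this is a minimal sketch of creating an Autopilot cluster, which enables auto-upgrade and enrolls the cluster in a release channel automatically; CLUSTER_NAME and COMPUTE_REGION are placeholders:

    # Create an Autopilot cluster; auto-upgrade is enabled automatically.
    # CLUSTER_NAME and COMPUTE_REGION are placeholders.
    gcloud container clusters create-auto CLUSTER_NAME \
        --region COMPUTE_REGION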

For more control over when an auto-upgrade can occur (or must not occur), you can configure maintenance windows and exclusions.

As your Autopilot cluster is automatically enrolled in a release channel, nodes always run the same version of GKE as the cluster itself, except during a brief period between completing the cluster's control plane upgrade and beginning to upgrade a given node pool.
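To confirm which channel a cluster is enrolled in, and the versions its control plane and nodes are currently running, you can query the cluster resource. The field names come from the clusters API; the cluster and region names are placeholders:

    # Show the release channel, control plane version, and node version.
    # CLUSTER_NAME and COMPUTE_REGION are placeholders.
    gcloud container clusters describe CLUSTER_NAME \
        --region COMPUTE_REGION \
        --format="value(releaseChannel.channel,currentMasterVersion,currentNodeVersion)"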

How versions are selected for auto-upgrade

New GKE versions are released regularly, but a version is not selected for auto-upgrade right away. When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions.

New auto-upgrade targets are announced in the release notes. Occasionally, a version is selected for cluster auto-upgrade and node auto-upgrade during different weeks. Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported. Clusters running minor versions that become unsupported are automatically upgraded to the next minor version.

Within a minor version (such as v1.14.x), clusters can be automatically upgraded to a new patch release.
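You can see which versions are currently the defaults and which are available in each release channel by querying the server config for your region (a placeholder below):

    # Show default versions, valid versions, and release channel
    # configurations for a region. COMPUTE_REGION is a placeholder.
    gcloud container get-server-config --region COMPUTE_REGION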

Factors that affect version rollout timing

To ensure the stability and reliability of clusters on new versions, GKE follows certain practices during version rollouts.

These practices include, but are not limited to:

  • GKE gradually rolls out changes across Google Cloud regions and zones.
  • GKE gradually rolls out patch versions across release channels. A patch is given soak time in the Rapid release channel, then in the Regular release channel, before being promoted to the Stable release channel once it has accumulated usage and continued to demonstrate stability. If an issue is found with a patch version during its soak time in a release channel, that version is not promoted to the next channel, and the issue is fixed in a newer patch version.
  • GKE gradually rolls out minor versions, following a similar soaking process to patch versions. Minor versions have longer soaking periods as they introduce more significant changes.
  • GKE might delay automatic upgrades when a new version impacts a group of clusters. For example, GKE pauses automatic upgrades for clusters that it detects are exposed to a deprecated API or feature that will be removed in the next minor version.
  • GKE might delay the rollout of new versions during peak times (for example, major holidays) to ensure business continuity.

Configuring when auto-upgrades can occur

By default, auto-upgrades can occur at any time. Auto-upgrades are minimally disruptive, especially for regional clusters. However, some workloads might require finer-grained control. You can configure maintenance windows and exclusions to manage when auto-upgrades can occur and when they must not.
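As a sketch, the following commands set a recurring weekend maintenance window and then add a maintenance exclusion for a specific period. All names and dates are placeholders:

    # Allow maintenance only on weekends. The recurrence uses RFC-5545
    # (iCalendar) syntax; CLUSTER_NAME and COMPUTE_REGION are placeholders.
    gcloud container clusters update CLUSTER_NAME \
        --region COMPUTE_REGION \
        --maintenance-window-start 2024-01-06T22:00:00Z \
        --maintenance-window-end 2024-01-07T06:00:00Z \
        --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"

    # Block upgrades entirely for a specific period, for example a
    # hypothetical peak season. The dates are placeholders.
    gcloud container clusters update CLUSTER_NAME \
        --region COMPUTE_REGION \
        --add-maintenance-exclusion-name peak-season \
        --add-maintenance-exclusion-start 2024-11-20T00:00:00Z \
        --add-maintenance-exclusion-end 2024-12-05T00:00:00Z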

Surge upgrades

Surge upgrades control how many nodes GKE can upgrade at a time and how disruptive upgrades are to your workloads. Autopilot clusters are automatically configured to use surge upgrades, and this configuration cannot be overridden.

Surge upgrade behavior is determined by two settings:

max-surge-upgrade
The number of additional nodes that can be added to the node pool during an upgrade. Defaults to 1.

max-unavailable-upgrade
The number of nodes that can be simultaneously unavailable during an upgrade. Defaults to 0.

The number of nodes upgraded simultaneously is the sum of max-surge-upgrade and max-unavailable-upgrade.

The default surge upgrade configuration is maxSurge=1, maxUnavailable=0. This means that only one surge node can be added to the node pool during an upgrade, so only one node is upgraded at a time.
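As a worked example of the sum rule above: with maxSurge=2 and maxUnavailable=1, up to three nodes are upgraded simultaneously (two replaced through surge nodes, and one taken offline directly). On Autopilot these values are fixed, but on Standard clusters they can be set per node pool; the names below are placeholders:

    # Standard clusters only: set surge upgrade parameters on a node pool.
    # Autopilot manages these values for you and they can't be changed.
    # NODE_POOL_NAME, CLUSTER_NAME, and COMPUTE_REGION are placeholders.
    gcloud container node-pools update NODE_POOL_NAME \
        --cluster CLUSTER_NAME \
        --region COMPUTE_REGION \
        --max-surge-upgrade 2 \
        --max-unavailable-upgrade 1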

Relationship with quota

While recreating nodes does not require additional Compute Engine resources, surge upgrading nodes does. Resource allocation is subject to Compute Engine quota. Depending on your configuration, this quota can limit the number of parallel upgrades or even cause an upgrade to fail.
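To sanity-check headroom before an upgrade, you can inspect regional Compute Engine quota usage. This sketch uses standard gcloud projection syntax; COMPUTE_REGION is a placeholder:

    # Show Compute Engine quota usage and limits for a region.
    # COMPUTE_REGION is a placeholder.
    gcloud compute regions describe COMPUTE_REGION \
        --flatten="quotas[]" \
        --format="table(quotas.metric,quotas.usage,quotas.limit)"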

For more information about quota, see Node upgrades and quota.

Receiving upgrade notifications

GKE publishes upgrade notifications to Pub/Sub, providing you with a channel to receive information from GKE about your clusters.
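Enabling notifications is a one-line cluster update; the Pub/Sub topic must already exist, and all names below are placeholders:

    # Send GKE cluster notifications (including upgrade events) to Pub/Sub.
    # CLUSTER_NAME, COMPUTE_REGION, PROJECT_ID, and TOPIC_NAME are placeholders.
    gcloud container clusters update CLUSTER_NAME \
        --region COMPUTE_REGION \
        --notification-config=pubsub=ENABLED,pubsub-topic=projects/PROJECT_ID/topics/TOPIC_NAME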

For more information, see Receiving cluster notifications.

What's next