This page discusses how automatic and manual upgrades work on Google Kubernetes Engine (GKE) Standard clusters, including links to more information about related tasks and settings. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.
For information on how cluster upgrades work for Autopilot, see Autopilot cluster upgrades.
How cluster and node pool upgrades work
This section discusses what happens in your cluster during automatic or manual upgrades. Google initiates auto-upgrades, observes automatic and manual upgrades across all GKE clusters, and intervenes if problems are observed.
If you enroll your cluster in a release channel, nodes always run the same version of GKE as the cluster, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and starting the node pool upgrade. Check the release notes for more information.
This section discusses what to expect when Google auto-upgrades your cluster or you initiate a manual upgrade.
Zonal clusters have only a single control plane. During the upgrade, your workloads continue to run, but you cannot deploy new workloads, modify existing workloads, or make other changes to the cluster's configuration until the upgrade is complete.
Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order. During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded.
If you configure a maintenance window or exclusion, it is honored if possible.
Node pool upgrades
Your cluster and its node pools do not necessarily run the same version of GKE. This section discusses what to expect when Google auto-upgrades your node pool or you initiate a manual node pool upgrade.
Node pools are upgraded one at a time. By default, nodes are upgraded one at a time, in an undefined order. You can change the number of nodes upgraded together with surge upgrades.
In a node pool spread across multiple zones, upgrades take place one zone at a time. Within a zone, the nodes will be upgraded in an undefined order.
During a node pool upgrade, you cannot make changes to the cluster configuration unless you cancel the upgrade.
This process might take several hours depending on the number of nodes and their workload configurations. Configurations that can slow the rate of node upgrades include:
- A high value of terminationGracePeriodSeconds in a Pod's configuration.
- A conservative Pod Disruption Budget.
- Node affinity interactions.
- Attached PersistentVolumes.
GKE honors maintenance windows or exclusions during automatic upgrades when possible. Manual upgrades bypass your configured maintenance windows and exclusions.
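The interaction between windows, exclusions, and manual upgrades can be sketched as a small decision function. This is a hypothetical model, not the GKE scheduler: the function name, the `(start, end)` pair representation, and the strict treatment of windows are all assumptions made for illustration, whereas GKE honors windows "when possible".

```python
from datetime import datetime, timezone


def upgrade_allowed(now, windows, exclusions, manual=False):
    """Return True if an upgrade may start at `now`.

    `windows` and `exclusions` are lists of (start, end) datetime pairs.
    Hypothetical model: windows are treated as strict here.
    """
    if manual:
        # Manual upgrades bypass maintenance windows and exclusions.
        return True
    if any(start <= now < end for start, end in exclusions):
        return False
    # No configured window means auto-upgrades can occur at any time.
    return not windows or any(start <= now < end for start, end in windows)


def hour(h):
    # Helper for the example: an hour on an arbitrary UTC day.
    return datetime(2024, 1, 1, h, tzinfo=timezone.utc)


window = [(hour(2), hour(6))]
exclusion = [(hour(0), hour(4))]
print(upgrade_allowed(hour(3), window, []))         # inside the window
print(upgrade_allowed(hour(3), window, exclusion))  # exclusion takes priority
print(upgrade_allowed(hour(3), window, exclusion, manual=True))
```

Real maintenance windows recur on a schedule. The fixed intervals here only illustrate the precedence: manual request first, then exclusions, then windows.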
When GKE upgrades a node, the following happens:
- If Surge upgrades are enabled, GKE creates a new surge node with the upgraded version and waits for it to be registered with the control plane.
- GKE selects an existing node (the target node) to upgrade. It cordons and starts draining the target node. At this point, GKE can't schedule new Pods on the target node.
- The control plane reschedules Pods managed by controllers onto other nodes. Pods that can't be rescheduled stay in the Pending phase until they can be scheduled.
- If a surge node was created, the target node is deleted. If a surge node wasn't created, GKE upgrades the target node once it is drained, then waits for the upgraded node to be registered with the control plane.
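The sequence above can be written out as an ordered list of steps. The following is only an illustrative sketch: the function name and step strings are hypothetical and do not correspond to any GKE API.

```python
def node_upgrade_steps(target, surge_enabled):
    """Return the ordered steps for upgrading one node, as described above."""
    steps = []
    if surge_enabled:
        # With surge upgrades, capacity is added before the target is touched.
        steps.append("create surge node and wait for it to register")
    steps.append(f"cordon {target}")
    steps.append(f"drain {target}")
    steps.append("reschedule controller-managed Pods onto other nodes")
    if surge_enabled:
        steps.append(f"delete {target}")
    else:
        steps.append(f"upgrade {target} in place and wait for it to register")
    return steps


for step in node_upgrade_steps("node-a", surge_enabled=True):
    print(step)
```

The two branches differ only at the ends: a surge upgrade adds a node first and deletes the target last, while a non-surge upgrade reuses the target after it is drained.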
When you create a Standard cluster, by default, auto-upgrade is enabled on the cluster and its node pools.
Google is responsible for securing your cluster's control plane, and upgrades your clusters when a new GKE version is selected for auto-upgrade. Infrastructure security is a high priority for GKE, so control planes are upgraded on a regular basis, and control plane auto-upgrades cannot be disabled. However, you can apply maintenance windows and exclusions to temporarily suspend upgrades for control planes and nodes.
Under the Shared Responsibility Model, you are responsible for securing your nodes, containers, and Pods. Node auto-upgrade is enabled by default. Although it is not recommended, you can disable node auto-upgrade. Opting out of node auto-upgrades does not block your cluster's control plane upgrade. If you opt out of node auto-upgrades, you are responsible for ensuring that the cluster's nodes run a version compatible with the cluster's version, and that the version adheres to the Kubernetes version and version skew support policy.
For more control over when an auto-upgrade can occur (or must not occur), you can configure maintenance windows and exclusions.
A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API. The node pool version also determines the versions of software packages installed on each node. It is recommended to keep node pools updated to the cluster version.
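As a rough illustration of the two-minor-version rule, the check below compares the minor components of two version strings. The function names and sample version strings are hypothetical. The sketch assumes both versions share the same major version, and it also rejects node pools newer than the control plane, which is an assumption drawn from the general Kubernetes skew policy rather than from this page.

```python
def minor_version(version):
    """Extract the minor number from a string such as '1.27.3-gke.100'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def node_pool_skew_ok(control_plane, node_pool, max_skew=2):
    """Return True if the node pool is at most `max_skew` minors behind."""
    skew = minor_version(control_plane) - minor_version(node_pool)
    # Assumption: a node pool newer than the control plane is not allowed.
    return 0 <= skew <= max_skew


print(node_pool_skew_ok("1.27.3-gke.100", "1.25.9-gke.200"))  # two minors behind
print(node_pool_skew_ok("1.27.3-gke.100", "1.24.1-gke.300"))  # three minors behind
```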
If you enroll your cluster in a release channel, nodes always run the same version of GKE as the cluster itself, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and beginning to upgrade a given node pool. Check the release notes for more information.
How versions are selected for auto-upgrade
New GKE versions are released regularly, but a version is not selected for auto-upgrade right away. When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions.
New auto-upgrade targets are announced in the release notes. Until an available version is selected for auto-upgrade, you can upgrade to it manually. Occasionally, a version is selected for cluster auto-upgrade and node auto-upgrade during different weeks.
Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported. Clusters running minor versions that become unsupported are automatically upgraded to the next minor version.
Within a minor version (such as v1.14.x), clusters can be automatically upgraded to a new patch release.
Release channels allow you to control your cluster and node pool version based on a version's stability rather than managing the version directly.
Factors that affect version rollout timing
To ensure the stability and reliability of clusters on new versions, GKE follows certain practices during version rollouts.
These practices include, but are not limited to:
- GKE gradually rolls out changes across Google Cloud regions and zones.
- GKE gradually rolls out patch versions across release channels. A patch is given soak time in the Rapid release channel, then the Regular release channel, before being promoted to the Stable release channel once it has accumulated usage and continued to demonstrate stability. If an issue is found with a patch version during the soaking time on a release channel, that version is not promoted to the next channel and the issue is fixed on a newer patch version.
- GKE gradually rolls out minor versions, following a similar soaking process to patch versions. Minor versions have longer soaking periods as they introduce more significant changes.
- GKE may delay automatic upgrades when a new version impacts a group of clusters. For example, GKE pauses automatic upgrades for clusters that it detects are exposed to a deprecated API or feature that will be removed in the next minor version.
- GKE might delay the rollout of new versions during peak times (for example, major holidays) to ensure business continuity.
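The channel-by-channel promotion of a patch version can be pictured as a walk along an ordered list. This is a deliberately simplified sketch with hypothetical names. The real promotion decision depends on accumulated usage and soak time, which are reduced here to a single `issue_found` flag.

```python
CHANNELS = ["Rapid", "Regular", "Stable"]


def next_channel(current, issue_found):
    """Return the channel a patch is promoted to, or None if it stays put."""
    if issue_found:
        # Not promoted; the fix ships in a newer patch version instead.
        return None
    idx = CHANNELS.index(current)
    return CHANNELS[idx + 1] if idx + 1 < len(CHANNELS) else None


print(next_channel("Rapid", issue_found=False))
print(next_channel("Regular", issue_found=True))
```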
Configuring when auto-upgrades can occur
By default, auto-upgrades can occur at any time to preserve infrastructure security. Auto-upgrades are minimally disruptive, especially for regional clusters. However, some workloads may require finer-grained control. You can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur.
You can request to manually upgrade your cluster or its node pools to an available and compatible version at any time. Manual upgrades bypass any configured maintenance windows and maintenance exclusions.
When you manually upgrade a cluster, its availability depends on whether the cluster is regional or not:
For zonal clusters, the control plane is unavailable while it is being upgraded. For the most part, workloads run normally but cannot be modified during the upgrade.
For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade.
You can manually initiate a node upgrade to a version compatible with the control plane.
Surge upgrades let you control the number of nodes GKE can upgrade at a time and control how disruptive upgrades are to your workloads.
Changing upgrade settings to balance speed and disruption
You can change how many nodes GKE attempts to upgrade at once by changing the surge upgrade parameters on a node pool. Surge upgrades reduce disruption to your workloads during cluster maintenance and also allow you to control the number of nodes upgraded in parallel. Surge upgrades also work with the Cluster Autoscaler to prevent changes to nodes that are being upgraded.
Surge upgrade behavior is determined by two settings:
- max-surge-upgrade: The number of additional nodes that can be added to the node pool during an upgrade. Increasing max-surge-upgrade raises the number of nodes that can be upgraded simultaneously. Default is 1. Can be set to 0 or greater.
- max-unavailable-upgrade: The number of nodes that can be simultaneously unavailable during an upgrade. Increasing max-unavailable-upgrade raises the number of nodes that can be upgraded in parallel. Default is 0.
The number of nodes upgraded simultaneously is the sum of max-surge-upgrade and max-unavailable-upgrade. The maximum number of nodes upgraded simultaneously is limited to 20.
For example, a 5-node pool is created with max-surge-upgrade set to 2 and max-unavailable-upgrade set to 1. During a node pool upgrade, GKE creates two upgraded nodes. GKE brings down at most three (the sum of max-surge-upgrade and max-unavailable-upgrade) existing nodes after the upgraded nodes are ready. GKE will only make a maximum of one node unavailable (max-unavailable-upgrade) at a time. During the upgrade process, the node pool will include between four and seven nodes.
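The arithmetic in this example can be reproduced with a short helper. This is a sketch under stated assumptions: the function name and the returned keys are hypothetical, and the cap of 20 is assumed to match the largest values used in the example configurations on this page.

```python
def surge_plan(pool_size, max_surge, max_unavailable, cap=20):
    """Estimate upgrade parallelism and node-count bounds for a node pool."""
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("surge settings must be non-negative")
    return {
        # Nodes upgraded at once: the sum of both settings, up to the cap.
        "parallel": min(max_surge + max_unavailable, cap),
        # Fewest nodes present: every allowed node is unavailable.
        "min_nodes": pool_size - max_unavailable,
        # Most nodes present: every allowed surge node has been added.
        "max_nodes": pool_size + max_surge,
    }


print(surge_plan(pool_size=5, max_surge=2, max_unavailable=1))
```

For the 5-node pool above, this yields three nodes in parallel and a pool size between four and seven nodes.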
You can configure surge upgrade parameters for node pools that use auto-upgrades and manual upgrades. You can learn more and try out Surge Upgrade by completing the tutorial "Use surge upgrades to decrease disruptions from GKE node upgrades".
Determining your optimal surge configuration
The following table describes three different upgrade settings as a demonstration to help you understand different configurations:
|Balanced (Default), slower but least disruptive|maxSurge=1 maxUnavailable=0|
|Fast, no surge resources, most disruptive|maxSurge=0 maxUnavailable=20|
|Fast, most surge resources and less disruptive|maxSurge=20 maxUnavailable=0|
The simplest way to take advantage of surge upgrade is to configure maxSurge=1 maxUnavailable=0. This means that only 1 surge node can be added to the node pool during an upgrade, so only 1 node will be upgraded at a time. This setting is superior to the existing upgrade configuration (maxSurge=0 maxUnavailable=1) because it speeds up Pod restarts during upgrades while progressing conservatively.
Fast and no surge resources
If your workload isn't sensitive to disruption, like most batch jobs, you can emphasize speed by using maxSurge=0 maxUnavailable=20. This configuration does not bring up additional surge nodes and allows 20 nodes to be upgraded at the same time.
Fast and less disruptive
If your workload is sensitive to disruption, you have already set up a Pod Disruption Budget (PDB), and you are not using externalTrafficPolicy: Local, which does not work with parallel node drains, you can increase the speed of the upgrade by using maxSurge=20 maxUnavailable=0. This configuration upgrades 20 nodes in parallel while the PDB limits the number of Pods that can be drained at a given time. Although the configurations of PDBs may vary, if you create a PDB with maxUnavailable: 1 for one or more workloads running on the node pool, then only one Pod of those workloads can be evicted at a time, limiting the parallelism of the entire upgrade.
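To see why a PDB with maxUnavailable: 1 throttles a highly parallel drain, it helps to compute how many evictions the budget permits at a given moment. The sketch below assumes an integer maxUnavailable and follows the general Kubernetes rule that allowed disruptions equal healthy Pods minus the desired healthy count. The function and parameter names are hypothetical.

```python
def disruptions_allowed(expected_pods, healthy_pods, max_unavailable):
    """Return how many Pods a PDB with an integer maxUnavailable lets go."""
    desired_healthy = max(expected_pods - max_unavailable, 0)
    return max(healthy_pods - desired_healthy, 0)


# Ten replicas, all healthy, PDB maxUnavailable: 1.
print(disruptions_allowed(expected_pods=10, healthy_pods=10, max_unavailable=1))
# One Pod is already being replaced, so no further eviction is permitted.
print(disruptions_allowed(expected_pods=10, healthy_pods=9, max_unavailable=1))
```

Even when 20 nodes are drained in parallel, each drain waits until the budget for that workload allows another eviction.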
Relationship with quota
While recreating nodes does not require additional Compute Engine resources, surge upgrading nodes does. Resource allocation is subject to Compute Engine quota. Depending on your configuration, this quota can limit the number of parallel upgrades or even cause the upgrade to fail.
For more information about quotas, see Node upgrades and quota.
How GKE responds to auto-upgrade failure
Node pool auto-upgrades can fail because of issues with the underlying Compute Engine instances, or because of issues with Kubernetes. For example, auto-upgrades fail in the following situations:
- Your configured maxSurge setting exceeds your Compute Engine resource quota.
- New surge nodes didn't register with the cluster control plane.
- Nodes took too long to drain, or took too long to delete.
When issues occur with individual node upgrades, GKE retries the upgrade a few times, with an increasing interval between retries. If nodes in the node pool fail to upgrade, GKE does not roll back the upgraded nodes. Instead, GKE tries the node pool auto-upgrade again until all the nodes are successfully upgraded.
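Retrying with an increasing interval is commonly implemented as a backoff loop. The sketch below is an illustration only: the attempt count, the 60-second base, and the doubling factor are assumptions, since the text says only that the interval increases. The `upgrade_node` callable is a hypothetical stand-in for a single upgrade attempt.

```python
import time


def upgrade_with_retries(upgrade_node, max_attempts=3, base_seconds=60.0,
                         factor=2.0, sleep=time.sleep):
    """Call `upgrade_node` until it returns True or attempts run out."""
    for attempt in range(max_attempts):
        if upgrade_node():
            return True
        if attempt < max_attempts - 1:
            # Assumed policy: wait longer after each failed attempt.
            sleep(base_seconds * factor ** attempt)
    return False


# Example with a fake sleep, so the waits are recorded instead of slept.
waits = []
results = iter([False, False, True])
print(upgrade_with_retries(lambda: next(results), sleep=waits.append), waits)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.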
If your node upgrades fail because your surge node requests exceed your Compute Engine quota, GKE reduces the number of concurrent surge nodes to attempt to meet the quota and continue the upgrade.
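The quota adjustment amounts to clamping the requested surge to whatever headroom remains. The following is a simplified sketch with hypothetical names: it assumes quota is counted in whole instances, whereas real Compute Engine quotas also cover resources such as CPUs, disks, and IP addresses.

```python
def effective_surge(max_surge, quota_limit, quota_used):
    """Return how many surge nodes fit within the remaining quota."""
    headroom = max(quota_limit - quota_used, 0)
    return min(max_surge, headroom)


# maxSurge=20 requested, but only 5 instances of quota remain.
print(effective_surge(max_surge=20, quota_limit=100, quota_used=95))
```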
Receiving upgrade notifications
GKE publishes notifications about events relevant to your cluster, such as version upgrades and security bulletins, to Pub/Sub, providing you with a channel to receive information from GKE about your clusters.
For more information, see Receiving cluster notifications.
- Learn more about the types of clusters.
- Learn more about upgrading a cluster or its nodes.
- Configure maintenance windows and exclusions.
- Best practices for upgrading clusters.