Standard cluster upgrades


This page discusses how automatic and manual upgrades work on Google Kubernetes Engine (GKE) Standard clusters, including links to more information about related tasks and settings. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

For information on how cluster upgrades work for Autopilot, see Autopilot cluster upgrades.

How cluster and node pool upgrades work

This section discusses what happens in your cluster during automatic or manual upgrades. Google initiates auto-upgrades. Google also observes automatic and manual upgrades across all GKE clusters, and intervenes if it observes problems.

If you enroll your cluster in a release channel, nodes run the same version of GKE as the cluster, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and starting the node pool upgrade, or if the control plane was manually upgraded. Check the release notes for more information.

Cluster upgrades

This section discusses what to expect when Google auto-upgrades your cluster or you initiate a manual upgrade.

  • Zonal clusters have only a single control plane. During the upgrade, your workloads continue to run, but you cannot deploy new workloads, modify existing workloads, or make other changes to the cluster's configuration until the upgrade is complete.

  • Regional clusters have multiple replicas of the control plane, and only one replica is upgraded at a time, in an undefined order. During the upgrade, the cluster remains highly available, and each control plane replica is unavailable only while it is being upgraded.

If you configure a maintenance window or exclusion, it is honored if possible.

Node pool upgrades

This section discusses what to expect when Google auto-upgrades your node pool or you initiate a manual node pool upgrade.

Node pools are upgraded one at a time. By default, nodes within a node pool are also upgraded one at a time, in an undefined order. In a node pool spread across multiple zones, upgrades take place zone by zone, and the order of nodes within each zone is likewise undefined.

GKE provides two built-in, configurable node pool upgrade strategies, surge upgrades and blue-green upgrades, which you can tune to your cluster environment's needs. To learn more about these strategies, see Upgrade strategies.

During a node pool upgrade, you cannot make changes to the cluster configuration unless you cancel the upgrade.

GKE honors maintenance windows or exclusions during automatic upgrades when possible. Manual upgrades bypass your configured maintenance windows and exclusions.

During a node pool upgrade, how the nodes are upgraded depends on the node pool upgrade strategy and how you configure it. However, the basic steps remain consistent. To upgrade a node, GKE removes Pods from the node so that it can be upgraded.

When a node is upgraded, the following happens with the Pods:

  1. The node is cordoned so that Kubernetes does not schedule new Pods on it.
  2. The node is then drained, meaning that the Pods are removed. For surge upgrades, GKE respects the Pod's PodDisruptionBudget and graceful termination period (terminationGracePeriodSeconds) settings for up to one hour. With blue-green upgrades, this can be extended if you configure a longer soaking time.
  3. The control plane reschedules Pods managed by controllers onto other nodes. Pods that cannot be rescheduled stay in the Pending phase until they can be rescheduled.
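
The three steps above can be sketched as a small in-memory model. This is only an illustration, not GKE's implementation: the Node class, the upgrade_node function, and the per-node capacity limit are hypothetical, and the sketch ignores PodDisruptionBudget and termination grace period handling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node."""
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)


def upgrade_node(target, others, capacity):
    """Cordon and drain `target`, then reschedule its Pods onto `others`.

    `capacity` is an assumed maximum number of Pods per node.
    Returns the Pods that could not be placed; they would stay Pending.
    """
    # Step 1: cordon, so that no new Pods are scheduled on the node.
    target.schedulable = False

    # Step 2: drain, removing every Pod from the node.
    evicted, target.pods = target.pods, []

    # Step 3: reschedule onto schedulable nodes that still have room.
    pending = []
    for pod in evicted:
        host = next(
            (n for n in others if n.schedulable and len(n.pods) < capacity),
            None,
        )
        if host is None:
            pending.append(pod)
        else:
            host.pods.append(pod)
    return pending
```

Calling upgrade_node on a node whose Pods do not all fit elsewhere returns the leftover Pods, mirroring how Pods that cannot be rescheduled stay in the Pending phase.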

The node pool upgrade process may take up to a few hours depending on the upgrade strategy, the number of nodes, and their workload configurations.

Configurations that can slow the rate of node upgrades include:

Node pool upgrade strategies

GKE offers built-in configurable strategies which determine how the node pool is upgraded.

Surge upgrades

By default, GKE uses the surge upgrade strategy for node pool upgrades. Surge upgrades replace nodes in a rolling window, which suits applications that can handle incremental, non-disruptive changes. You can configure how many nodes are upgraded at once and how disruptive the upgrade can be, to balance speed against disruption for your environment.

Blue-green upgrades

The alternative approach is blue-green upgrades, where two sets of environments (the original and new environments) are maintained at once, making rolling back as easy as possible. Blue-green is more resource intensive and better for applications that are more sensitive to changes. With this strategy, workloads are gradually migrated from the original "blue" environment to the new "green" environment, and given soak time to validate them with the new configuration. If needed, the workloads can be quickly rolled back to the existing "blue" environment.

To learn more about how the node pool upgrade strategies work, see Node pool upgrade strategies.

Upgrading automatically

When you create a Standard cluster, by default, auto-upgrade is enabled on the cluster and its node pools.

Google is responsible for securing your cluster's control plane, and upgrades your clusters when a new GKE version is selected for auto-upgrade. Infrastructure security is a high priority for GKE, so control planes are upgraded on a regular basis, and control plane auto-upgrades cannot be disabled. However, you can apply maintenance windows and exclusions to temporarily suspend upgrades for control planes and nodes.

Under the Shared Responsibility Model, you are responsible for securing your nodes, containers, and Pods. Node auto-upgrade is enabled by default. Although it is not recommended, you can disable node auto-upgrade. Opting out of node auto-upgrades does not block your cluster's control plane upgrade. If you opt out of node auto-upgrades, you are responsible for ensuring that the cluster's nodes run a version compatible with the cluster's version, and that the version adheres to the Kubernetes version skew support policy.

For more control over when an auto-upgrade can occur (or must not occur), you can configure maintenance windows and exclusions.

A cluster's node pools can be no more than two minor versions behind the control plane version, to maintain compatibility with the cluster API. The node pool version also determines the versions of software packages installed on each node. It is recommended to keep node pools updated to the cluster version.
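
The version constraint above can be checked with a short helper. This is a minimal sketch: the function names and the example version strings are hypothetical, and treating a node pool that is newer than the control plane as incompatible is an assumption taken from the Kubernetes version skew policy rather than from this page.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a version such as '1.29.4-gke.100'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def node_pool_within_skew(control_plane, node_pool, max_skew=2):
    """Return True if the node pool is at most `max_skew` minor versions
    behind the control plane."""
    cp_major, cp_minor = parse_major_minor(control_plane)
    np_major, np_minor = parse_major_minor(node_pool)
    if cp_major != np_major:
        return False
    # Assumption: a node pool newer than the control plane is not allowed.
    return 0 <= cp_minor - np_minor <= max_skew
```

For example, a node pool on a hypothetical 1.27 version passes the check against a 1.29 control plane, while a node pool on 1.26 does not.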

If you enroll your cluster in a release channel, nodes always run the same version of GKE as the cluster itself, except during a brief period (typically a few days, depending on the current release) between completing the cluster's control plane upgrade and beginning to upgrade a given node pool. Check the release notes for more information.

How versions are selected for auto-upgrade

New GKE versions are released regularly, but a version is not selected for auto-upgrade right away. When a GKE version has accumulated enough cluster usage to prove stability over time, Google selects it as an auto-upgrade target for clusters running a subset of older versions.

New auto-upgrade targets are announced in the release notes. Until an available version is selected for auto-upgrade, you can upgrade to it manually. Occasionally, a version is selected for cluster auto-upgrade and node auto-upgrade during different weeks.

Soon after a new minor version becomes generally available, the oldest available minor version typically becomes unsupported. Clusters running minor versions that become unsupported are automatically upgraded to the next minor version.

Within a minor version (such as v1.14.x), clusters can be automatically upgraded to a new patch release.

Release channels allow you to control your cluster and node pool version based on a version's stability rather than managing the version directly.

Factors that affect version rollout timing

To ensure the stability and reliability of clusters on new versions, GKE follows certain practices during version rollouts.

These practices include, but are not limited to:

  • GKE gradually rolls out changes across Google Cloud regions and zones.
  • GKE gradually rolls out patch versions across release channels. A patch is given soak time in the Rapid release channel, then the Regular release channel, before being promoted to the Stable release channel once it has accumulated usage and continued to demonstrate stability. If an issue is found with a patch version during the soaking time on a release channel, that version is not promoted to the next channel and the issue is fixed on a newer patch version.
  • GKE gradually rolls out minor versions, following a similar soaking process to patch versions. Minor versions have longer soaking periods as they introduce more significant changes.
  • GKE may delay automatic upgrades when a new version impacts a group of clusters. For example, GKE pauses automatic upgrades for clusters that it detects are exposed to a deprecated API or feature that will be removed in the next minor version.
  • GKE might delay the rollout of new versions during peak times (for example, major holidays) to ensure business continuity.
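
The patch promotion order described in the list above can be modeled as a simple lookup. This sketch is a deliberate simplification: the next_channel function is hypothetical, and it ignores soak duration, regional rollout, and the other factors listed.

```python
from typing import Optional

# Promotion order for patch versions, as described in the list above.
CHANNELS = ["Rapid", "Regular", "Stable"]


def next_channel(current: str, issue_found: bool) -> Optional[str]:
    """Return the channel a patch is promoted to after soaking in `current`.

    Returns None when the patch is not promoted, either because an issue
    was found during soaking or because it is already in the last channel.
    """
    if issue_found:
        return None
    index = CHANNELS.index(current)
    if index + 1 < len(CHANNELS):
        return CHANNELS[index + 1]
    return None
```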

Configuring when auto-upgrades can occur

By default, auto-upgrades can occur at any time to preserve infrastructure security. Auto-upgrades are minimally disruptive, especially for regional clusters. However, some workloads may require finer-grained control. You can configure maintenance windows and exclusions to manage when auto-upgrades can and must not occur.

Upgrading manually

You can request to manually upgrade your cluster or its node pools to an available and compatible version at any time. Manual upgrades bypass any configured maintenance windows and maintenance exclusions.

When you manually upgrade a cluster, its availability depends on whether the cluster is regional or not:

  • For zonal clusters, the control plane is unavailable while it is being upgraded. For the most part, workloads run normally but cannot be modified during the upgrade.

  • For regional clusters, one replica of the control plane is unavailable at a time while it is upgraded, but the cluster remains highly available during the upgrade.

You can manually initiate a node upgrade to a version compatible with the control plane.

Relationship with quota

Surge upgrades and blue-green upgrades might require additional compute resources. Surge upgrades create extra VMs (if maxSurge is set to greater than 0) and blue-green upgrades temporarily double the number of nodes in a node pool.

If you want to minimize the additional compute resources required for upgrades, use surge upgrades and set maxSurge to 0. With this configuration, GKE creates no additional nodes when a node configuration change requires the nodes to be recreated. To learn more about types of changes that use a node pool upgrade strategy, see When surge upgrades are used and When blue-green upgrades are used.

Resource allocation is subject to Compute Engine quota. Depending on your configuration, this quota can limit the number of parallel upgrades or even cause the upgrade to fail.

To learn more about how to ensure your project has enough resources for surge upgrades, see Verifying node upgrades and quota. For blue-green upgrades, your project requires double the number of resources used by the node pool.
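
As a rough planning aid, the peak number of nodes a node pool needs during an upgrade can be estimated from the rules in this section. The peak_node_count function below is a hypothetical sketch that counts nodes only; actual quota usage also depends on machine type, disks, and IP addresses.

```python
def peak_node_count(pool_size, strategy, max_surge=0):
    """Estimate the largest number of nodes that exist during an upgrade.

    Surge upgrades add up to `max_surge` extra nodes; blue-green upgrades
    temporarily double the node pool.
    """
    if strategy == "surge":
        return pool_size + max_surge
    if strategy == "blue-green":
        return pool_size * 2
    raise ValueError(f"unknown upgrade strategy: {strategy!r}")
```

For a 10-node pool, a surge upgrade with maxSurge set to 3 peaks at 13 nodes, and a blue-green upgrade peaks at 20.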

How GKE responds to auto-upgrade failure

Node pool auto-upgrades can fail because of issues with the underlying Compute Engine instances, or because of issues with Kubernetes. For example, auto-upgrades fail in the following situations:

  • Your configured maxSurge setting exceeds your Compute Engine resource quota.
  • New surge nodes didn't register with the cluster control plane.
  • Nodes took too long to drain, or took too long to delete.

When issues occur with individual node upgrades, GKE retries the upgrade a few times, with an increasing interval between retries. If nodes in the node pool fail to upgrade, GKE does not roll back the upgraded nodes. Instead, GKE tries the node pool auto-upgrade again until all the nodes are successfully upgraded.

If your node upgrades fail because your surge node requests exceed your Compute Engine quota, GKE reduces the number of concurrent surge nodes to attempt to meet the quota and continue the upgrade.
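
One way to picture this reduction is to clamp the configured surge to the remaining quota headroom. This page does not describe the algorithm GKE actually uses, so the effective_surge function below is only an assumed model, based on a single instance-count quota.

```python
def effective_surge(max_surge, quota_limit, instances_in_use):
    """Return how many surge nodes fit in the remaining quota.

    Assumed model: the surge is reduced to the quota headroom, and never
    goes below zero.
    """
    headroom = max(quota_limit - instances_in_use, 0)
    return min(max_surge, headroom)
```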

Receiving upgrade notifications

GKE publishes notifications about events relevant to your cluster, such as version upgrades and security bulletins, to Pub/Sub, providing you with a channel to receive information from GKE about your clusters.

For more information, see Receiving cluster notifications.

Check upgrade logs

GKE logs control plane and node pool upgrade events to Cloud Logging by default. The upgrade event log provides visibility into the upgrade process, and includes valuable information for troubleshooting if needed.

Control plane upgrade logs

Cluster upgrade events can be queried using the following filter:

resource.type="gke_cluster"
protoPayload.metadata.operationType=~"(UPDATE_CLUSTER|UPGRADE_MASTER)"
resource.labels.cluster_name="CLUSTER_NAME"
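
If you run this query from a script, the filter can be assembled for a given cluster name. The helper below is a small sketch; the function name is hypothetical, and the filter lines are copied from the query above.

```python
def control_plane_upgrade_filter(cluster_name):
    """Build the Cloud Logging filter for control plane upgrade events.

    Each line is a separate condition; Cloud Logging combines conditions
    on separate lines with an implicit AND.
    """
    return "\n".join([
        'resource.type="gke_cluster"',
        'protoPayload.metadata.operationType=~"(UPDATE_CLUSTER|UPGRADE_MASTER)"',
        f'resource.labels.cluster_name="{cluster_name}"',
    ])
```

The resulting string can be pasted into Logs Explorer or passed as the filter argument of gcloud logging read.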

These logs are recorded as structured logs. You can use the following fields for the details of the upgrade events:



Field: protoPayload.metadata.operationType
Description: There are two types of cluster upgrade events, UPGRADE_MASTER and UPDATE_CLUSTER. Both types can cause the loss of control plane availability for zonal clusters. To learn more, see How cluster and node pool upgrades work.
  • UPGRADE_MASTER: an upgrade that changes the Kubernetes control plane version.
  • UPDATE_CLUSTER: an update that does not change the Kubernetes control plane version.

Field: protoPayload.methodName
Description: This field shows which API triggered the cluster upgrade.
  • google.container.v1.ClusterManager.UpdateCluster: manual control plane upgrade.
  • google.container.internal.ClusterManagerInternal.UpdateClusterInternal: automatic control plane upgrade.
  • google.container.v1.ClusterManager.PatchCluster: cluster configuration change.

Field: protoPayload.metadata.previousMasterVersion
Description: This field is used only for the UPGRADE_MASTER operation type, and contains the control plane version used before the upgrade.

Field: protoPayload.metadata.currentMasterVersion
Description: This field is used only for the UPGRADE_MASTER operation type, and contains the new control plane version used after the upgrade.
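
The fields above can be combined to summarize a single log entry. The sketch below assumes the entry has been exported as JSON and loaded into a Python dict whose nesting mirrors the field paths; the describe_upgrade function and the "unknown trigger" fallback are hypothetical.

```python
# Method names and their meanings, taken from the field descriptions above.
TRIGGER_BY_METHOD = {
    "google.container.v1.ClusterManager.UpdateCluster":
        "manual control plane upgrade",
    "google.container.internal.ClusterManagerInternal.UpdateClusterInternal":
        "automatic control plane upgrade",
    "google.container.v1.ClusterManager.PatchCluster":
        "cluster configuration change",
}


def describe_upgrade(entry):
    """Summarize a cluster upgrade log entry given as a nested dict."""
    payload = entry.get("protoPayload", {})
    trigger = TRIGGER_BY_METHOD.get(payload.get("methodName"), "unknown trigger")
    metadata = payload.get("metadata", {})
    previous = metadata.get("previousMasterVersion")
    current = metadata.get("currentMasterVersion")
    # The version fields are present only for control plane version upgrades.
    if previous and current:
        return f"{trigger}: {previous} -> {current}"
    return trigger
```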

Node pool upgrade logs

Use the following query to view node pool upgrade events:

resource.type="gke_nodepool"
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name="CLUSTER_NAME"

Use the following field for details about the upgrade event:

The protoPayload.methodName field shows whether the upgrade was triggered manually or automatically.

What's next