Best practices for upgrading clusters

This page provides guidelines for keeping your Google Kubernetes Engine (GKE) cluster seamlessly up-to-date, and recommendations for creating an upgrade strategy that fits your needs and increases availability and reliability of your environments. You can use this information to keep your clusters updated for stability and security with minimal disruptions to your workloads.

Alternatively, to manage automatic cluster upgrades across production environments organized with fleets, see About cluster upgrades with rollout sequencing.

Set up multiple environments

As part of your workflow for delivering software updates, we recommend that you use multiple environments. Multiple environments help you minimize risk and unwanted downtime by testing software and infrastructure updates separately from your production environment. At minimum, you should have a production environment and a pre-production or test environment.

Consider the following recommended environments:

Environment	Description
Production	Used to serve live traffic to end users for mission critical business applications.
Staging	Used to ensure that all new changes deployed from previous environments are working as intended before the changes are deployed to production.
Testing	Used to performance benchmark, test and QA workloads against the GKE release you will use in production. In this environment, you can test the upgrade of the control plane and nodes before doing so in production.
Development	Used for active development that relies on the same version running in production. In this environment, you create fixes and incremental changes to be deployed in production.
Canary	Used as a secondary development environment for testing newer Kubernetes releases, GKE features and APIs to gain better time to market once these releases are promoted and become default.

Enroll clusters in release channels

Kubernetes frequently releases updates to deliver security fixes, address known issues, and introduce new features. GKE release channels let you balance stability against the feature set of the version deployed in the cluster. When you enroll a new cluster in a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools.

To keep clusters up-to-date with the latest GKE and Kubernetes updates, enroll the clusters in each recommended environment in the following release channels:

Environment	Release channel	Description
Production	Stable or regular	For stability and version maturity, use the stable or regular channel for production workloads.
Staging	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.
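Testing	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.
Development	Same as production	To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production.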
Canary	Rapid	To test the latest Kubernetes releases and to get ahead of the curve by testing new GKE features or APIs, use the rapid channel. You can improve your time to market when the version in rapid is promoted to the channel you're using for production.
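
The channel recommendations in the table above can be expressed as a small check over a cluster inventory. The following is a minimal sketch, not a GKE API: the inventory shape (an environment-to-channel mapping), the lowercase environment keys, and the function name `check_channels` are all assumptions made for illustration. It also assumes that testing and development follow the same channel as production, as the checklist summary on this page suggests for pre-production clusters.

```python
# Hypothetical inventory check; the data shape and names are illustrative assumptions.
PRODUCTION_CHANNELS = {"stable", "regular"}
PRE_PRODUCTION = ("development", "staging", "testing")


def check_channels(inventory):
    """Return a list of problems found in an {environment: channel} mapping."""
    problems = []
    prod = inventory.get("production")
    if prod not in PRODUCTION_CHANNELS:
        problems.append(f"production should use stable or regular, found {prod!r}")
    for env in PRE_PRODUCTION:
        # Pre-production environments are expected to mirror production.
        if env in inventory and inventory[env] != prod:
            problems.append(f"{env} should match production ({prod!r}), found {inventory[env]!r}")
    if "canary" in inventory and inventory["canary"] != "rapid":
        problems.append(f"canary should use rapid, found {inventory['canary']!r}")
    return problems


if __name__ == "__main__":
    for problem in check_channels({"production": "stable", "staging": "regular", "canary": "rapid"}):
        print(problem)
```

A check like this could run in a pipeline that already knows which channel each cluster is enrolled in; how that inventory is collected is outside the scope of the sketch.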

Cluster control planes are always upgraded on a regular basis, regardless of whether your cluster is enrolled in a release channel or not.

Create a continuous upgrade strategy

After you enroll your cluster in a release channel, the cluster is regularly upgraded to the version that meets the quality and stability bar for the channel. These updates include security and bug fixes, applied with increasing scrutiny at each channel:

Patches are pushed out to control plane and nodes in all channels gradually, accumulating soak time in rapid and regular channels before landing in the stable channel.
The control plane is upgraded first, followed by nodes to comply with the Kubernetes OSS policy (i.e., kubelet must not be newer than kube-apiserver).
GKE will automatically roll out patches to channels based on their criticality and importance.
The stable channel receives only critical patches.
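
The ordering rule in the list above (control plane first, because the kubelet must not be newer than kube-apiserver) can be illustrated with a short check. This is a simplified sketch that compares versions at minor-version granularity only; the function names and the version strings in the usage example are hypothetical.

```python
def minor_of(version):
    """Parse '1.29.4' or '1.29.4-gke.1000' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def kubelet_allowed(kubelet_version, apiserver_version):
    """True when the kubelet is not newer than the API server (minor-level check)."""
    return minor_of(kubelet_version) <= minor_of(apiserver_version)


if __name__ == "__main__":
    # Control plane upgraded first: nodes lag behind, which is allowed.
    print(kubelet_allowed("1.28.5", "1.29.1"))
    # Nodes upgraded first: the kubelet would be newer, which is not allowed.
    print(kubelet_allowed("1.29.1", "1.28.5"))
```

The check shows why upgrading nodes before the control plane is not a valid order: the intermediate state would put a newer kubelet behind an older API server.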

Receive updates about new GKE versions

Information about new versions is published to the main GKE release notes page, as well as to an RSS feed. Each release channel has a simplified and dedicated release notes page (example: Release notes for stable channel) with information about the recommended GKE version for that channel.

To proactively receive updates about GKE upgrades before the upgrades occur, use Pub/Sub and subscribe to upgrade notifications.

When a new version becomes available, plan an upgrade before that version becomes the default in the fleet. This approach provides more control and predictability, because GKE skips auto-upgrade for clusters that were already upgraded ahead of time.

Test and verify new patch and minor versions

All releases pass internal testing regardless of the channel they are released in. However, given the frequent updates and patches from upstream Kubernetes and GKE, we highly recommend testing new releases in testing or staging environments before the releases are rolled out to your production environment, especially for Kubernetes minor version upgrades.

Each release channel offers two versions: default and available.

New patch releases are available a week prior to becoming default.
New Kubernetes minor releases will be available four weeks prior to becoming default.

GKE automatically upgrades clusters to the default version gradually. If more control over the upgrade process is necessary, we recommend upgrading ahead of time to an available version. GKE auto-upgrade skips manually-upgraded clusters.

A recommended approach to automate and streamline upgrades would involve:

A pre-production environment using the available version.
Upgrade notifications set up on the cluster to inform your team about new available versions to test and certify.
A production environment subscribed to a release channel using the default version in the channel.
Gradual rollout of new available versions to production clusters. For example, if there are multiple production clusters, a gradual upgrade plan would start by upgrading a portion of these clusters to the available version while keeping the others on the default version, followed by additional small portion upgrades until 100% is upgraded.
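
The gradual rollout described in the last item can be sketched as splitting the production clusters into waves. This is one possible way to plan the portions, not a GKE feature; the function name `rollout_waves`, the `percent` parameter, and the cluster names are hypothetical.

```python
def rollout_waves(clusters, percent):
    """Split clusters into consecutive waves, each holding at most `percent` of the total."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer ceiling of len(clusters) * percent / 100, with at least one cluster per wave.
    size = max(1, (len(clusters) * percent + 99) // 100)
    return [clusters[i:i + size] for i in range(0, len(clusters), size)]


if __name__ == "__main__":
    production = ["prod-a", "prod-b", "prod-c", "prod-d", "prod-e"]
    for number, wave in enumerate(rollout_waves(production, 40), start=1):
        print(f"wave {number}: upgrade {wave} to the available version")
```

Each wave would be upgraded and observed before the next one starts, so that clusters in later waves stay on the default version until earlier waves look healthy.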

The following table summarizes the release events and recommended actions:

Event	Recommended action
New version X is made available in a channel.	Manually upgrade your testing cluster and qualify and test the new version.
Version X becomes the default version.	GKE starts auto-upgrading to the default version. Consider upgrading production ahead of the fleet.
GKE starts auto-upgrading clusters.	Allow clusters to get auto-upgraded, or postpone the upgrade using maintenance exclusion windows.

Upgrade strategy for patch releases

Here's a recommended upgrade strategy for patch releases, using a scenario where:

All clusters are subscribed to the stable channel.
New available versions are rolled out to the staging cluster first.
The production cluster is upgraded automatically with new default versions.
You regularly monitor new available versions for GKE.

Time	Event	What should I do?
T - 1 week	New patch version becomes available.	Upgrade staging environment.
T	Patch version becomes default.	Consider upgrading the production control plane ahead of time for better predictability.
T	GKE will start upgrading control planes to the default version.	Consider upgrading the production node pools ahead of time for better predictability.
T + 1 week	GKE will start upgrading cluster node pools to the default version.	GKE will auto-upgrade clusters, skipping the manually-upgraded clusters.
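
To turn the relative timeline above into calendar dates, you can compute each milestone from the date on which the patch becomes default (T). The sketch below only does date arithmetic on the offsets in the table; the function name and the example date are hypothetical, and actual rollout dates depend on the GKE release schedule.

```python
from datetime import date, timedelta


def patch_timeline(default_date):
    """Return (date, action) pairs for the patch strategy, relative to T = default_date."""
    week = timedelta(weeks=1)
    return [
        (default_date - week, "Patch becomes available: upgrade the staging environment"),
        (default_date, "Patch becomes default: consider upgrading production ahead of time"),
        (default_date + week, "Node pool auto-upgrades begin: manually upgraded clusters are skipped"),
    ]


if __name__ == "__main__":
    # Hypothetical value for T, used only to show the output format.
    for day, action in patch_timeline(date(2024, 3, 11)):
        print(day.isoformat(), action)
```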

Upgrade strategy for new minor releases

Here's a recommended upgrade strategy for new minor releases:

Time	Event	What should I do?
T - 3 weeks	New minor version becomes available	Upgrade testing control plane
T - 2 weeks		Given a successful control plane upgrade, consider upgrading the production control plane ahead of time. Upgrade testing node pools.
T - 1 week		Given a successful upgrade, consider upgrading production node pools ahead of time.
T	Minor version becomes default.
T	GKE will start upgrading cluster control planes to the default version.	Create an exclusion window if more testing or mitigation is needed before production rollout.
T + 1 week	GKE will start upgrading cluster node pools to the default version.	GKE will auto-upgrade clusters, skipping the manually-upgraded clusters.

Reduce disruption to existing workloads during an upgrade

Keeping your clusters up-to-date with security patches and bug fixes is critical for ensuring the vitality of your clusters, and for business continuity. Regular updates protect your workloads from vulnerabilities and failures.

Schedule maintenance windows and exclusions

To increase upgrade predictability and to align upgrades with off-peak business hours, you can control automatic upgrades of both the control plane and nodes by creating a maintenance window. GKE respects maintenance windows: if an upgrade runs beyond the defined maintenance window, GKE attempts to pause the operation and resumes it during the next maintenance window.
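
The window behavior described above can be modeled in a few lines. This is a simplified sketch under stated assumptions: it treats the window as a daily recurring interval, assumes an upgrade always pauses exactly at the end of the window (GKE only attempts to pause), and uses hypothetical function names and times.

```python
import math
from datetime import time


def in_window(now, start, end):
    """True when `now` falls in the daily window [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def windows_needed(upgrade_minutes, window_minutes):
    """Number of maintenance windows an upgrade spans if it pauses at the end of each window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return math.ceil(upgrade_minutes / window_minutes)


if __name__ == "__main__":
    # Hypothetical off-peak window from 22:00 to 02:00.
    print(in_window(time(23, 30), time(22, 0), time(2, 0)))
    # A 10-hour node pool upgrade with a 4-hour window spans three windows.
    print(windows_needed(600, 240))
```

A short window keeps disruption inside off-peak hours but can stretch a long upgrade across several days, which is worth weighing when you size the window.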

GKE follows a multi-day rollout schedule for making new versions available, as well as auto-upgrading cluster control planes and nodes in different regions. The rollout generally spans four or more days, and includes a buffer of time to observe and monitor for problems. In a multi-cluster environment, you can use a separate maintenance window for each cluster to sequence the rollout across your clusters. For example, you might want to control when clusters in different regions receive maintenance by setting different maintenance windows for each cluster.

Another tool to reduce disruption, especially during high-demand business periods, is maintenance exclusions. Use maintenance exclusions to prevent automatic maintenance from occurring during these periods; maintenance exclusions can be set on new or existing clusters. You can also use exclusions in conjunction with your upgrade strategy. For example, you might want to postpone an upgrade to a production cluster if a testing or staging environment fails because of an upgrade.
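
The combination of exclusions and upgrade strategy described above can be sketched as a simple decision function. This is an illustration of the reasoning only, not how GKE evaluates exclusions; the function name, the tuple layout for exclusions, and the example dates are hypothetical.

```python
from datetime import datetime


def production_upgrade_decision(moment, exclusions, staging_healthy):
    """Decide whether a production upgrade should proceed at `moment`.

    `exclusions` is a list of (name, start, end) tuples covering [start, end).
    """
    if not staging_healthy:
        # Mirrors the example in the text: a failed staging upgrade postpones production.
        return "postpone: staging failed after the upgrade"
    for name, start, end in exclusions:
        if start <= moment < end:
            return f"postpone: inside exclusion {name}"
    return "proceed"


if __name__ == "__main__":
    # Hypothetical high-demand period.
    freeze = [("holiday-freeze", datetime(2024, 11, 25), datetime(2024, 12, 2))]
    print(production_upgrade_decision(datetime(2024, 11, 28, 9, 0), freeze, True))
    print(production_upgrade_decision(datetime(2024, 12, 5, 9, 0), freeze, True))
```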

Set your tolerance for disruption

You might be familiar with the concept of replicas in Kubernetes. Replicas ensure redundancy of your workloads for better performance and responsiveness. When set, replicas govern the number of Pod replicas running at any given time. However, during maintenance, Kubernetes removes the underlying node VMs, which can reduce the number of replicas. To ensure your workloads have a sufficient number of replicas for your applications, even during maintenance, use a Pod Disruption Budget (PDB).

In a Pod Disruption Budget, you can define a number (or percentage) of Pods that can be terminated, even if terminating the Pods brings the current replica count below the desired value. This can speed up the node drain by removing the need to wait for migrated Pods to become fully operational. Instead, the drain evicts Pods from a node according to the PDB configuration, which lets the Deployment create the missing Pods on other available nodes. Once the PDB is set, GKE won't shut down Pods in your application if the number of Pods is equal to or less than the configured limit. GKE respects a PDB for up to 60 minutes.
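
To see how a budget translates into evictions during a drain, the following sketch computes how many Pods may be disrupted at a given moment. It is a simplified model of the arithmetic, assuming a PDB is configured with exactly one of a minimum-available or maximum-unavailable value and that percentages round up; the function and parameter names are hypothetical and the sketch does not call any Kubernetes API.

```python
def resolve(value, expected):
    """Resolve an int, or a percentage string such as '50%', against the expected Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling: percentages are rounded up to a whole number of Pods.
        return (expected * percent + 99) // 100
    return int(value)


def allowed_disruptions(expected, healthy, min_available=None, max_unavailable=None):
    """Number of Pods that can be evicted right now without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = resolve(min_available, expected)
    else:
        desired_healthy = max(0, expected - resolve(max_unavailable, expected))
    return max(0, healthy - desired_healthy)


if __name__ == "__main__":
    # 7 replicas, all healthy, at least 50% must stay available: 4 must remain.
    print(allowed_disruptions(expected=7, healthy=7, min_available="50%"))
    # 3 replicas, one already down, at least 2 must stay available: nothing can be evicted.
    print(allowed_disruptions(expected=3, healthy=2, min_available=2))
```

When the result is zero, a node drain has to wait for replacement Pods to become healthy, which is why a very strict budget can slow down upgrades.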

Control node pool upgrades

With GKE, you can choose a node upgrade strategy to determine how the nodes in your node pools are upgraded. By default, node pools use surge upgrades. With surge upgrades, the upgrade process for GKE node pools involves recreating every VM in the node pool. A new VM is created with the new version (upgraded image) in a rolling update fashion. In turn, that requires shutting down all the Pods running on the old node and shifting the Pods to the new node. Your workloads can run with sufficient redundancy (replicas), and you can rely on Kubernetes to move and restart Pods as needed. However, a temporarily reduced number of replicas can still be disruptive to your business, and might slow down the workload performance until Kubernetes is able to meet the desired state again (that is, meet the minimum number of needed replicas). You can avoid such a disruption by using surge upgrades.

During an upgrade with surge upgrade enabled, GKE first secures the resources (machines) needed for the upgrade, then creates a new upgraded node, and only then drains the old node, and finally shuts it down. This way, the expected capacity remains intact throughout the upgrade process.

For large clusters where the upgrade process might take longer, you can shorten the upgrade completion time by upgrading multiple nodes concurrently. Use surge upgrade with maxSurge=20, maxUnavailable=0 to instruct GKE to upgrade 20 nodes at a time, without using any existing capacity.
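
The effect of these two settings on upgrade duration can be estimated with simple arithmetic. The sketch below assumes that the number of nodes upgraded in parallel is the sum of maxSurge and maxUnavailable and that every batch takes about the same time; the function name and the node counts are hypothetical.

```python
def surge_plan(node_count, max_surge, max_unavailable=0):
    """Estimate parallelism, extra capacity, and batch count for a node pool upgrade."""
    parallel = max_surge + max_unavailable
    if parallel <= 0:
        raise ValueError("max_surge + max_unavailable must be at least 1")
    return {
        "nodes_in_parallel": parallel,
        # Surge nodes are additional VMs, so they need spare quota and capacity.
        "extra_nodes_needed": max_surge,
        # Integer ceiling of node_count / parallel.
        "batches": -(-node_count // parallel),
    }


if __name__ == "__main__":
    # The setting from the text applied to a hypothetical 100-node pool.
    print(surge_plan(100, max_surge=20, max_unavailable=0))
    # A conservative setting for comparison: one node at a time.
    print(surge_plan(100, max_surge=1, max_unavailable=0))
```

With maxUnavailable=0, existing capacity is never reduced, so the speed-up comes entirely from the extra surge nodes that must be available in the project.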

Checklist summary

The following table summarizes the tasks that are recommended for an upgrade strategy to keep your GKE clusters seamlessly up-to-date:

Best Practice	Tasks
Set up multiple environments	At minimum, create a production and pre-production environment.
Enroll clusters in release channels	Enroll production clusters in the stable or regular channel. Enroll pre-production clusters in the same channels as production. Enroll early development clusters (for example, canary) in the rapid channel.
Create a continuous upgrade strategy	Proactively receive updates about GKE upgrades and GKE versions. Test and verify new patch and minor versions.
Reduce disruption to existing workloads	Control timing of automatic upgrades by creating a maintenance window. Use maintenance exclusions to prevent automatic maintenance from occurring during high-demand business periods. Set the correct Pod Disruption Budget for your workloads. Use a strategy to control node pool upgrades.

What's next

Watch the Google Cloud Next 2020 video on Ensuring business continuity at times of uncertainty and digital-only business with GKE.
Watch Best practices for GKE upgrade.
Learn more about Release channels.
Learn about versioning and automatic upgrades in GKE.