Best practices for upgrading clusters

This page provides guidelines for keeping your Google Kubernetes Engine (GKE) clusters up to date, along with recommendations for creating an upgrade strategy that fits your needs and increases the availability and reliability of your environments. You can use this information to keep your clusters updated for stability and security with minimal disruption to your workloads.

Set up multiple environments

As part of your workflow for delivering software updates, we recommend that you use multiple environments. Multiple environments help you minimize risk and unwanted downtime by testing software and infrastructure updates separately from your production environment. At minimum, you should have a production environment and a pre-production or test environment.

Consider the following recommended environments:

| Environment | Description |
|---|---|
| Production | Used to serve live traffic to end users for mission-critical business applications. |
| Staging | Used to ensure that all new changes deployed from previous environments are working as intended before the changes are deployed to production. |
| Testing | Used to benchmark performance, test, and QA workloads against the GKE release you will use in production. In this environment, you can test the upgrade of the control plane and nodes before doing so in production. |
| Development | Used for active development that relies on the same version running in production. In this environment, you create fixes and incremental changes to be deployed in production. |
| Canary | Used as a secondary development environment for testing newer Kubernetes releases, GKE features, and APIs, so that you can improve time to market once these releases are promoted and become default. |

Enroll clusters in release channels

Kubernetes frequently releases updates to deliver security fixes, address known issues, and introduce new features. GKE release channels let you balance stability against the feature set of the version deployed in the cluster. When you enroll a new cluster in a release channel, Google automatically manages the version and upgrade cadence for the cluster and its node pools.

To keep clusters up to date with the latest GKE and Kubernetes updates, we recommend enrolling the clusters in each environment in the following release channels:

| Environment | Release channel | Description |
|---|---|---|
| Production | Stable or regular | For stability and version maturity, use the stable or regular channel for production workloads. |
| Staging, Testing, Development | Same as production | To ensure your tests are indicative of the version your production will be upgraded to, use the same release channel as production. |
| Canary | Rapid | To test the latest Kubernetes releases and to get ahead of the curve by testing new GKE features or APIs, use the rapid channel. You can improve your time to market when the version in rapid is promoted to the channel you're using for production. |
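
As a minimal sketch, the mapping above can be encoded as a small check that flags clusters enrolled in a channel other than the recommended one. The function names, the lowercase environment labels, and the channel strings are illustrative assumptions, not GKE API values. The sketch also assumes that testing and development clusters follow the production channel, like staging.

```python
# Channels the guidance above considers suitable for production.
PRODUCTION_CHANNELS = {"stable", "regular"}

# Environments assumed to track the same channel as production.
FOLLOWS_PRODUCTION = {"production", "staging", "testing", "development"}


def recommended_channel(environment: str, production_channel: str) -> str:
    """Return the channel recommended for an environment."""
    if production_channel not in PRODUCTION_CHANNELS:
        raise ValueError("production should use the stable or regular channel")
    env = environment.lower()
    if env in FOLLOWS_PRODUCTION:
        return production_channel
    if env == "canary":
        return "rapid"
    raise ValueError(f"unknown environment: {environment}")


def misenrolled(clusters: dict[str, tuple[str, str]], production_channel: str) -> list[str]:
    """List clusters whose channel differs from the recommendation.

    `clusters` maps a cluster name to (environment, enrolled channel).
    """
    return sorted(
        name
        for name, (env, channel) in clusters.items()
        if channel != recommended_channel(env, production_channel)
    )
```

For example, calling `misenrolled` on a fleet whose staging cluster is on the regular channel while production is on stable would report the staging cluster.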

Create a continuous upgrade strategy

After enrolling your cluster in a release channel, that cluster is regularly upgraded to the version that meets the quality and stability bar for the channel. These updates include security and bug fixes, applied with increasing scrutiny at each channel:

  • Patches are pushed out to control plane and nodes in all channels gradually, accumulating soak time in rapid and regular channels before landing in the stable channel.
  • The control plane is upgraded first, followed by nodes to comply with the Kubernetes OSS policy (i.e., kubelet must not be newer than kube-apiserver).
  • GKE will automatically roll out patches to channels based on their criticality and importance.
  • The stable channel receives only critical patches.
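
The ordering rule above (kubelet must not be newer than kube-apiserver) can be sketched as a pre-flight check before a manual node pool upgrade. This is a simplified illustration: it compares only the major and minor components, the function names are hypothetical, and the version strings in the usage note are made-up examples.

```python
import re


def parse_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '1.27.3-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def node_upgrade_allowed(control_plane: str, node_target: str) -> bool:
    """True if nodes at `node_target` would not be newer than the control plane.

    Assumption: only major.minor is compared; patch differences are ignored.
    """
    return parse_version(node_target)[:2] <= parse_version(control_plane)[:2]
```

With these definitions, `node_upgrade_allowed("1.27.8-gke.100", "1.28.2-gke.500")` is rejected because the nodes would run a newer minor version than the control plane.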

Receive updates about new GKE versions

Information about new versions is published to the main GKE release notes page, as well as to an RSS feed. Each release channel has a simplified and dedicated release notes page (example: Release notes for stable channel) with information about the recommended GKE version for that channel.

To proactively receive updates about GKE upgrades, we recommend using the following methods:

  1. Use the Kubernetes Engine API to orchestrate an upgrade workflow when an upgrade is required for your clusters.
  2. Use Pub/Sub and subscribe to upgrade notifications before the upgrades occur.

Once a new version becomes available, plan an upgrade before the version becomes the default in the fleet. This approach provides more control and predictability, because GKE skips auto-upgrade for clusters that have already been upgraded ahead of time.
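
To illustrate the Pub/Sub approach, the following sketch turns the attributes of a notification message into a one-line summary that a team could route to chat or a ticket queue. The attribute names (`cluster_name`, `type_url`, `payload`) and the payload fields (`version`, `currentVersion`, `targetVersion`) reflect our understanding of GKE cluster notifications and should be treated as assumptions to verify against the current documentation. The function name is hypothetical, and no Pub/Sub client is used here.

```python
import json

# Assumed trailing components of the notification type URL.
UPGRADE_AVAILABLE = "UpgradeAvailableEvent"
UPGRADE_STARTED = "UpgradeEvent"


def summarize_notification(attributes: dict[str, str]) -> str:
    """Summarize a notification from its Pub/Sub message attributes."""
    # Keep only the last dotted component of the type URL.
    kind = attributes.get("type_url", "").rsplit(".", 1)[-1]
    payload = json.loads(attributes.get("payload", "{}"))
    cluster = attributes.get("cluster_name", "unknown-cluster")
    if kind == UPGRADE_AVAILABLE:
        return f"{cluster}: version {payload.get('version')} is available; schedule testing"
    if kind == UPGRADE_STARTED:
        return (
            f"{cluster}: upgrade started "
            f"{payload.get('currentVersion')} -> {payload.get('targetVersion')}"
        )
    return f"{cluster}: unhandled notification type {kind or 'unknown'}"
```

In a real subscriber, this function would be called from the message callback with the message's attributes before the message is acknowledged.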

Test and verify new patch and minor versions

All releases pass internal testing regardless of the channel they are released in. However, because upstream Kubernetes and GKE ship frequent updates and patches, we highly recommend testing new releases in testing or staging environments before they are rolled out to your production environment, especially for Kubernetes minor version upgrades.

Each release channel offers two versions: default and available.

  • New patch releases are available a week prior to becoming default.
  • New Kubernetes minor releases will be available four weeks prior to becoming default.

GKE automatically upgrades clusters to the default version gradually. If more control over the upgrade process is necessary, we recommend upgrading ahead of time to an available version. GKE auto-upgrade skips manually-upgraded clusters.

A recommended approach to automate and streamline upgrades would involve:

  • A pre-production environment using the available version.
  • Upgrade notifications set up on the cluster to inform your team about new available versions to test and certify.
  • A production environment subscribed to a release channel using the default version in the channel.
  • Gradual rollout of new available versions to production clusters. For example, if there are multiple production clusters, a gradual upgrade plan would start by upgrading a portion of these clusters to the available version while keeping the others on the default version, followed by additional small portion upgrades until 100% is upgraded.
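
The gradual rollout in the last item can be sketched as splitting the production fleet into consecutive waves. This is only an illustration under the assumption that every wave takes the same share of the fleet; the function name and the `fraction` parameter are hypothetical.

```python
import math


def plan_waves(clusters: list[str], fraction: float) -> list[list[str]]:
    """Split production clusters into consecutive upgrade waves.

    Each wave holds about `fraction` of the fleet and at least one cluster.
    Clusters outside the current wave stay on the default version until
    their wave is reached.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not clusters:
        return []
    size = max(1, math.ceil(len(clusters) * fraction))
    return [clusters[i:i + size] for i in range(0, len(clusters), size)]
```

You would upgrade one wave to the available version, observe it, and only then proceed to the next wave until the whole fleet is upgraded.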

The following table summarizes the release events and recommended actions:

| Event | Recommended action |
|---|---|
| New version X is made available in a channel. | Manually upgrade your testing cluster and qualify and test the new version. |
| Version X becomes the default version. | GKE starts auto-upgrading to the default version. Consider upgrading production ahead of the fleet. |
| GKE starts auto-upgrading clusters. | Allow clusters to get auto-upgraded, or postpone the upgrade using maintenance exclusion windows. |

Upgrade strategy for patch releases

Here's a recommended upgrade strategy for patch releases, using a scenario where:

  • All clusters are subscribed to the stable channel.
  • New available versions are rolled out to the staging cluster first.
  • The production cluster is upgraded automatically with new default versions.
  • New available versions for GKE are monitored regularly (example script).

| Time | Event | What should I do? |
|---|---|---|
| T - 1 week | New patch version becomes available. | Upgrade staging environment. |
| T | Patch version becomes default. | Consider upgrading the production control plane ahead of time for better predictability. |
| T | GKE will start upgrading control planes to the default version. | Consider upgrading the production node pools ahead of time for better predictability. |
| T + 1 week | GKE will start upgrading cluster node pools to the default version. | GKE will auto-upgrade clusters, skipping the manually-upgraded clusters. |
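
To turn the relative timeline above into calendar dates, a team could use a small helper like the following sketch. It assumes you know (or estimate) the day T on which the patch becomes the channel default. The names are hypothetical and the action strings paraphrase the table.

```python
from datetime import date, timedelta

# Offsets in weeks relative to T, the day the patch becomes the channel default.
PATCH_SCHEDULE = [
    (-1, "Patch becomes available: upgrade staging"),
    (0, "Patch becomes default: consider upgrading the production control plane, then node pools"),
    (1, "GKE auto-upgrades node pools, skipping manually upgraded clusters"),
]


def patch_calendar(default_day: date) -> list[tuple[date, str]]:
    """Return (date, action) pairs for a patch release that becomes default on `default_day`."""
    return [
        (default_day + timedelta(weeks=offset), action)
        for offset, action in PATCH_SCHEDULE
    ]
```

The resulting pairs can be added to a team calendar or printed as a checklist for the release.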

Upgrade strategy for new minor releases

Here's a recommended upgrade strategy for new minor releases:

| Time | Event | What should I do? |
|---|---|---|
| T - 3 weeks | New minor version becomes available. | Upgrade testing control plane. |
| T - 2 weeks | | (1) Given a successful control plane upgrade, consider upgrading the production control plane ahead of time. (2) Upgrade testing node pools. |
| T - 1 week | | Given a successful upgrade, consider upgrading production node pools ahead of time. |
| T | Minor version becomes default. | |
| T | GKE will start upgrading cluster control planes to the default version. | Create an exclusion window if more testing or mitigation is needed before production rollout. |
| T + 1 week | GKE will start upgrading cluster node pools to the default version. | GKE will auto-upgrade clusters, skipping the manually-upgraded clusters. |
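
The minor release timeline above implies an ordering in which each step is gated on the success of earlier ones. The sketch below encodes one reading of that ordering as a prerequisite map. The step names and the exact dependencies are our interpretation of the table, not a GKE feature.

```python
# Each step lists the steps that must have succeeded first.
PREREQUISITES = {
    "testing control plane": [],
    "testing node pools": ["testing control plane"],
    "production control plane": ["testing control plane"],
    "production node pools": ["testing node pools", "production control plane"],
}


def next_steps(succeeded: set[str]) -> list[str]:
    """Return the steps not yet done whose prerequisites have all succeeded."""
    return sorted(
        step
        for step, needs in PREREQUISITES.items()
        if step not in succeeded and all(need in succeeded for need in needs)
    )
```

If a step fails, it never enters `succeeded`, so the later production steps stay blocked. That is a natural point at which to create a maintenance exclusion.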

Reduce disruption to existing workloads during an upgrade

Keeping your clusters up to date with security patches and bug fixes is critical for the health of your clusters and for business continuity. Regular updates protect your workloads from vulnerabilities and failures.

Schedule maintenance windows and exclusions

To increase upgrade predictability and to align upgrades with off-peak business hours, you can control automatic upgrades of both the control plane and nodes by creating a maintenance window. GKE respects maintenance windows: if the upgrade process runs beyond the defined maintenance window, GKE attempts to pause the operation and resumes it during the next maintenance window.

The GKE rollout process follows a four-day schedule which spreads the rollout process across four business days, and gradually upgrades clusters in different regions. In a multi-cluster environment, you can use the four-day rollout to predict changes applied geographically to different regions. Additionally, you can use maintenance windows to control and sequence disruption in different clusters. For example, you might want to control when clusters in different regions receive maintenance by setting different maintenance windows for each cluster.

Another tool to reduce disruption, especially during high-demand business periods, is maintenance exclusions. Use maintenance exclusions to prevent automatic maintenance from occurring during these periods; maintenance exclusions can be set on new or existing clusters. You can also use exclusions in conjunction with your upgrade strategy. For example, you might want to postpone an upgrade to a production cluster if a testing or staging environment fails because of an upgrade.
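
The interplay between maintenance windows and exclusions can be sketched as a simple predicate. This is a simplified model, not GKE's scheduler: it assumes a window that recurs daily at a fixed start time, uses naive datetimes in a single time zone, and gives exclusions precedence over windows. All names are hypothetical.

```python
from datetime import datetime, time, timedelta


def in_daily_window(now: datetime, start: time, duration: timedelta) -> bool:
    """True if `now` falls inside a daily window, including one that crosses midnight."""
    for day_offset in (0, -1):  # today's window, or yesterday's window still running
        opened = datetime.combine(now.date() + timedelta(days=day_offset), start)
        if opened <= now < opened + duration:
            return True
    return False


def auto_upgrade_may_run(
    now: datetime,
    start: time,
    duration: timedelta,
    exclusions: list[tuple[datetime, datetime]],
) -> bool:
    """True if automatic maintenance may run at `now`.

    `exclusions` is a list of (begin, end) datetimes; any match blocks maintenance.
    """
    if any(begin <= now < end for begin, end in exclusions):
        return False
    return in_daily_window(now, start, duration)
```

Giving each cluster a different `start` value is one way to model the sequencing of maintenance across regions described above.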

Set your tolerance for disruption

You might be familiar with the concept of replicas in Kubernetes. Replicas provide redundancy for your workloads, which improves performance and responsiveness. When set, the replica count governs the number of Pod replicas running at any given time. However, during maintenance, Kubernetes removes the underlying node VMs, which can reduce the number of replicas. To ensure your workloads have a sufficient number of replicas for your applications, even during maintenance, use a Pod Disruption Budget (PDB).

In a Pod Disruption Budget, you define the number (or percentage) of Pods that can be terminated, even if terminating them brings the current replica count below the desired value. This can speed up the node drain because there is no need to wait for migrated Pods to become fully operational. Instead, the drain evicts Pods from a node according to the PDB configuration, and the Deployment recreates the missing Pods on other available nodes. Once the PDB is set, GKE won't shut down Pods in your application if the number of Pods is equal to or less than the configured limit. GKE respects a PDB for up to 60 minutes.
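
To make the budget arithmetic concrete, the following sketch computes how many Pods may be evicted at a given moment for a budget expressed as either a minimum number available or a maximum number unavailable. The function and parameter names are hypothetical. The round-up behavior for percentages reflects our understanding of Kubernetes and is stated here as an assumption.

```python
def resolve(value, expected_pods: int) -> int:
    """Resolve an int or a percentage string such as '50%' to a Pod count.

    Assumption: percentages are rounded up to the next whole Pod.
    """
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (expected_pods * percent + 99) // 100  # integer ceiling
    return int(value)


def allowed_disruptions(expected_pods, healthy_pods, min_available=None, max_unavailable=None) -> int:
    """Return how many healthy Pods can be evicted without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required = resolve(min_available, expected_pods)
    else:
        required = expected_pods - resolve(max_unavailable, expected_pods)
    return max(0, healthy_pods - required)
```

When the result is zero, a drain has to wait for replacement Pods to become healthy before evicting more, which is where the 60-minute limit mentioned above becomes relevant.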

Control node pool upgrades

The upgrade process for GKE node pools involves recreating every VM in the node pool. A new VM is created with the new version (upgraded image) in a rolling update fashion. In turn, that requires shutting down all the Pods running on the old node and shifting the Pods to the new node. Your workloads can run with sufficient redundancy (replicas), and you can rely on Kubernetes to move and restart Pods as needed. However, a temporarily reduced number of replicas can still be disruptive to your business, and might slow down the workload performance until Kubernetes is able to meet the desired state again (that is, meet the minimum number of needed replicas). You can avoid such a disruption by using surge upgrades.

During an upgrade with surge upgrade enabled, GKE first secures the resources (machines) needed for the upgrade, then creates a new upgraded node, and only then drains the old node, and finally shuts it down. This way, the expected capacity remains intact throughout the upgrade process.

For large clusters where the upgrade process might take longer, you can shorten the upgrade by upgrading multiple nodes concurrently. Use surge upgrade with maxSurge=20, maxUnavailable=0 to instruct GKE to upgrade 20 nodes at a time, without using any existing capacity.
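
The effect of these two settings can be estimated with a short calculation. The sketch below assumes that the number of nodes upgraded concurrently equals maxSurge plus maxUnavailable and that every round takes a full batch. It ignores quotas and any platform limits, and the function name and result keys are hypothetical.

```python
import math


def surge_plan(node_count: int, max_surge: int, max_unavailable: int) -> dict[str, int]:
    """Estimate upgrade rounds and capacity bounds for a node pool upgrade."""
    parallel = max_surge + max_unavailable
    if parallel <= 0:
        raise ValueError("max_surge + max_unavailable must be positive")
    return {
        # Nodes being replaced at the same time.
        "nodes_upgraded_at_once": min(parallel, node_count),
        # Number of batches needed to cover the whole pool.
        "rounds": math.ceil(node_count / parallel),
        # Highest node count while surge nodes exist alongside old nodes.
        "peak_node_count": node_count + min(max_surge, node_count),
        # Lowest number of serving nodes during the upgrade.
        "min_available_nodes": node_count - min(max_unavailable, node_count),
    }
```

For a 100-node pool with maxSurge=20 and maxUnavailable=0, this model yields five rounds with capacity never dropping below 100 nodes, at the cost of temporarily needing quota for 20 extra nodes.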

Checklist summary

The following table summarizes the tasks that are recommended for an upgrade strategy to keep your GKE clusters seamlessly up-to-date:

Best Practice Tasks
Set up multiple environments
Enroll clusters in release channels
Create a continuous upgrade strategy
Reduce disruption to existing workloads

What's next