Maintenance windows and exclusions


This page describes maintenance windows and maintenance exclusions, which are policies that provide control over when some cluster maintenance, such as auto-upgrades, can and can't occur on your Google Kubernetes Engine (GKE) clusters. For example, a retail business could limit maintenance to only occur on weekday evenings, and could prevent automated maintenance during a key industry sales event.

About GKE maintenance policies

GKE maintenance policies, which include maintenance windows and exclusions, give you control over when certain automatic maintenance can occur on your clusters, including cluster upgrades and other changes to the node configuration or the cluster's network topology.

A maintenance window is a repeating window of time during which GKE automatic maintenance is permitted.

A maintenance exclusion is a non-repeating window of time during which GKE automatic maintenance is forbidden.

GKE makes automatic changes that respect your cluster's maintenance policies when there is an open maintenance window and no active maintenance exclusion. For each cluster, you can configure one recurring maintenance window, and multiple maintenance exclusions.
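The rule above can be sketched in a few lines of Python. This is only an illustrative model, not how GKE implements scheduling: the DailyWindow and Exclusion classes and the automatic_maintenance_allowed function are hypothetical names, and the window is assumed to recur every day in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass(frozen=True)
class DailyWindow:
    """Hypothetical model of a window that recurs every day, in UTC."""
    start: time          # UTC time at which each occurrence opens
    duration: timedelta  # length of each occurrence (assumed under 24 hours)

    def is_open(self, now: datetime) -> bool:
        now = now.astimezone(timezone.utc)
        # Check the occurrences that opened today and yesterday, so that
        # a window crossing midnight UTC is handled correctly.
        for days_back in (0, 1):
            day = (now - timedelta(days=days_back)).date()
            opens = datetime.combine(day, self.start, tzinfo=timezone.utc)
            if opens <= now < opens + self.duration:
                return True
        return False


@dataclass(frozen=True)
class Exclusion:
    """Hypothetical model of a non-repeating exclusion."""
    start: datetime
    end: datetime

    def is_active(self, now: datetime) -> bool:
        return self.start <= now < self.end


def automatic_maintenance_allowed(now, window, exclusions) -> bool:
    # Maintenance needs an open window AND no active exclusion.
    return window.is_open(now) and not any(e.is_active(now) for e in exclusions)
```

For example, with a window that opens at 22:00 UTC for four hours, 01:00 UTC the next day is inside the window, but an exclusion covering that instant still blocks maintenance.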

Other types of maintenance, including control plane repair operations and maintenance of services on which GKE depends, like Compute Engine, aren't dependent on GKE maintenance policies. To learn more, see Automatic maintenance that doesn't respect maintenance policies.

What changes do and don't respect GKE maintenance policies

Before configuring GKE maintenance policies—maintenance windows and exclusions—review the following sections to understand how GKE and related services do and don't respect them.

Automatic maintenance that respects GKE maintenance policies

With GKE maintenance policies, you can control the timing of the following types of events, which cause temporary disruption to your cluster:

Other types of automatic maintenance aren't dependent on maintenance policies. To learn more, see Automatic maintenance that doesn't respect maintenance policies.

Automatic maintenance that doesn't respect GKE maintenance policies

GKE maintenance windows and exclusions don't block all types of automatic maintenance. Before configuring your GKE cluster's maintenance policies, ensure that you understand what types of changes don't respect maintenance windows and exclusions.

Other Google Cloud maintenance

GKE maintenance windows and exclusions don't prevent automatic maintenance of underlying Google Cloud services, primarily Compute Engine, or services which install applications to the cluster, such as Cloud Deploy.

For example, GKE nodes are Compute Engine VMs that GKE manages for your cluster. Compute Engine VMs sometimes experience host events, which can include maintenance events or host errors. The way VMs behave during these events is determined by the VM's host maintenance policy, which, by default for most VMs, is to live migrate. This typically means little-to-no downtime for the nodes, and, for most workloads, the default policies are sufficient. For some VM machine families, you can monitor and plan for a host maintenance event, and you can trigger a host maintenance event to time it with your GKE maintenance policies.

Some VMs, including those with GPUs and TPUs, can't perform live migration. If you're using these accelerators, learn how to handle disruption due to node maintenance for GPUs or TPUs.

We recommend that you review information about host events and host maintenance policies, and confirm that your workloads are prepared for disruption, especially if they're running on nodes that can't perform a live migration.

Automated repairs and resizing

GKE performs automated repairs on control planes. This includes processes like upscaling the control plane to an appropriate size or restarting the control plane to resolve issues. Most repairs ignore maintenance windows and exclusions because failing to perform the repairs can result in non-functional clusters.

You can't disable control plane repairs. However, most types of clusters, including Autopilot clusters and Standard regional clusters, have multiple replicas of the control plane, which allows for high availability of the Kubernetes API server even during maintenance events. Standard zonal clusters, which have only a single control plane, can't be modified during control plane configuration changes and cluster maintenance. This includes deploying workloads.

Nodes also have auto-repair functionality, which you can disable for Standard clusters.

Critical security vulnerability patching

Maintenance windows and exclusions can cause security patches to be delayed. However, GKE reserves the right to override maintenance policies for critical security vulnerabilities.

Manual changes that respect GKE maintenance policies

Some changes to the nodes or networking configuration require the nodes to be recreated to apply the new configuration, including some of the following changes:

These changes respect GKE maintenance policies, meaning that GKE waits until a maintenance window is open and no active maintenance exclusion prevents node maintenance. To manually apply the changes to the nodes, use the Google Cloud CLI to call the gcloud container clusters upgrade command, passing the --cluster-version flag with the same GKE version that the node pool is already running.

Maintenance windows

Maintenance windows allow you to control when automatic upgrades of control planes and nodes can occur, to mitigate potential transient disruptions to your workloads. Maintenance windows are useful for the following types of scenarios, among others:

  • Off-peak hours: You want to minimize the chance of downtime by scheduling automatic upgrades during off-peak hours when traffic is reduced.
  • On-call: You want to ensure that upgrades happen during working hours so that someone can monitor the upgrades and manage any unanticipated issues.
  • Multi-cluster upgrades: You want to roll out upgrades across multiple clusters in different regions one at a time at specified intervals.
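As a rough illustration of the multi-cluster scenario, the following Python sketch offsets each cluster's window start by a fixed interval. The cluster names, the first start time, and the two-day interval are hypothetical values chosen for the example; the resulting times would still need to be applied to each cluster's maintenance window separately.

```python
from datetime import datetime, timedelta, timezone


def staggered_window_starts(clusters, first_start, interval):
    """Give each cluster a window start offset from the previous one."""
    return {name: first_start + i * interval for i, name in enumerate(clusters)}


starts = staggered_window_starts(
    ["cluster-us", "cluster-eu", "cluster-asia"],     # hypothetical cluster names
    datetime(2024, 3, 4, 2, 0, tzinfo=timezone.utc),  # hypothetical first window start
    timedelta(days=2),                                # hypothetical spacing
)

for name, start in starts.items():
    print(name, start.isoformat())
```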

In addition to automatic upgrades, Google may occasionally need to perform other maintenance tasks, and honors a cluster's maintenance window if possible.

If tasks run beyond the maintenance window, GKE attempts to pause the tasks, and attempts to resume those tasks during the next maintenance window.

GKE reserves the right to roll out unplanned emergency upgrades outside of maintenance windows. Additionally, mandatory upgrades from deprecated or outdated software might automatically occur outside of maintenance windows.

To learn how to set up a maintenance window for a new or existing cluster, see Configure a maintenance window.

Time zones for maintenance windows

When configuring and viewing maintenance windows, times are shown differently depending on the tool you are using:

When configuring maintenance windows

Times are always stored in UTC. However, when configuring the maintenance window, you use either UTC or your local time zone, depending on the tool and flags.

When configuring maintenance windows using the more generic --maintenance-window flag, you cannot specify a time zone. The gcloud CLI and the API interpret the time as UTC, and the Google Cloud console displays times in your local time zone.

When using more granular flags, such as --maintenance-window-start, you can specify the time zone as part of the value. If you omit the time zone, your local time zone is used.

When viewing maintenance windows

When viewing information about your cluster, timestamps for maintenance windows may be shown in UTC or in your local time zone, depending on how you are viewing the information:

  • When using the Google Cloud console to view information about your cluster, times are always displayed in your local time zone.
  • When using the gcloud CLI to view information about your cluster, times are always shown in UTC.

In both cases, the RRULE is always in UTC. For example, if the RRULE specifies days of the week, those days are UTC days, which can differ from the local day at the same instant.
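The day-of-week shift is easy to overlook, so here is a small Python sketch of it. It assumes a fixed UTC-8 offset as a stand-in for a local time zone, and rrule_day is a hypothetical helper that maps a timestamp to the two-letter day codes used in RRULE BYDAY values.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset stands in for the local time zone.
LOCAL = timezone(timedelta(hours=-8))

RRULE_DAYS = ("MO", "TU", "WE", "TH", "FR", "SA", "SU")


def rrule_day(moment: datetime, tz=timezone.utc) -> str:
    """Return the RRULE day code for a moment, as seen in the given zone."""
    return RRULE_DAYS[moment.astimezone(tz).weekday()]


# A window that starts Friday at 20:00 local time...
friday_evening = datetime(2024, 3, 1, 20, 0, tzinfo=LOCAL)

print(rrule_day(friday_evening, LOCAL))  # day as seen locally
print(rrule_day(friday_evening))         # day as seen in UTC
```

In this sketch the local day is Friday, but the same instant falls on Saturday in UTC, so a weekly recurrence meant for Friday evenings local time would need the Saturday day code.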

Maintenance exclusions

With maintenance exclusions, you can prevent automatic maintenance from occurring during a specific time period. For example, many retail businesses have business guidelines prohibiting infrastructure changes during the end-of-year holidays. As another example, if a company is using an API that is scheduled for deprecation, they can use maintenance exclusions to pause minor upgrades to give them time to migrate applications.

For known high-impact events, we recommend that you match any internal change restrictions with a maintenance exclusion that starts one week before the event and lasts for the duration of the event.

Exclusions have no recurrence. Instead, create each instance of a periodic exclusion separately.
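Because exclusions don't recur, a yearly event needs one exclusion per year. The following Python sketch computes those instances, applying the one-week lead time recommended above. The event dates and the exclusion_for_event helper are hypothetical; each resulting pair would still need to be created as its own exclusion on the cluster.

```python
from datetime import datetime, timedelta, timezone

LEAD_TIME = timedelta(weeks=1)  # start one week before the event


def exclusion_for_event(event_start, event_end):
    """Return (start, end) of an exclusion covering the lead time and the event."""
    return event_start - LEAD_TIME, event_end


utc = timezone.utc
# Hypothetical occurrences of the same yearly event.
yearly_events = [
    (datetime(2024, 11, 29, tzinfo=utc), datetime(2024, 12, 3, tzinfo=utc)),
    (datetime(2025, 11, 28, tzinfo=utc), datetime(2025, 12, 2, tzinfo=utc)),
]

exclusions = [exclusion_for_event(start, end) for start, end in yearly_events]

for start, end in exclusions:
    print(start.isoformat(), "to", end.isoformat())
```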

When exclusions and maintenance windows overlap, exclusions have precedence.

To learn how to set up maintenance exclusions for a new or existing cluster, see Configure a maintenance exclusion.

Scope of maintenance to exclude

In addition to specifying when to prevent automatic maintenance on your cluster, you can restrict the scope of automatic updates that might occur. Maintenance exclusion scopes are useful for the following types of scenarios, among others:

  • No upgrades - avoid any maintenance: You want to temporarily avoid any change to your cluster during a specific period of time. This is the default scope.
  • No minor upgrades - maintain current Kubernetes minor version: You want to temporarily maintain the minor version of a cluster to avoid API changes or validate the next minor version.
  • No minor or node upgrades - prevent node pool disruption: You want to temporarily avoid any eviction and rescheduling of your workloads because of node upgrades.

The following table lists the scope of automatic updates that you can restrict in a maintenance exclusion. The table also indicates which types of upgrades (minor or patch) can occur under each scope. When upgrades occur, VMs for the control plane and node pools restart. For control planes, VM restarts may temporarily decrease the Kubernetes API Server availability, especially in zonal cluster topology with a single control plane. For nodes, VM restarts trigger Pod rescheduling which can temporarily disrupt existing workloads. You can set your tolerance for workload disruption using a Pod Disruption Budget (PDB).

Scope | Control plane: minor upgrade | Control plane: patch upgrade | Control plane: VM disruption due to GKE maintenance | Node pools: minor upgrade | Node pools: patch upgrade | Node pools: VM disruption due to GKE maintenance
No upgrades (default) | No | No | No | No | No | No
No minor upgrades | No | Yes | Yes | No | Yes | Yes
No minor or node upgrades | No | Yes | Yes | No | No | No

For definitions of minor and patch versions, see Versioning scheme.
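The scope table can also be read as a lookup. The following Python sketch encodes the three rows as shown above; the COLUMNS and ROWS names and the can_occur function are hypothetical, and the labels are shortened versions of the table headers.

```python
# Column order of the scope table: (component, type of change).
COLUMNS = [
    ("control plane", "minor upgrade"),
    ("control plane", "patch upgrade"),
    ("control plane", "VM disruption"),
    ("node pools", "minor upgrade"),
    ("node pools", "patch upgrade"),
    ("node pools", "VM disruption"),
]

# One row per scope; True means the change can still occur (a "Yes" cell).
ROWS = {
    "No upgrades": (False, False, False, False, False, False),
    "No minor upgrades": (False, True, True, False, True, True),
    "No minor or node upgrades": (False, True, True, False, False, False),
}


def can_occur(scope: str, component: str, change: str) -> bool:
    """Look up one cell of the table."""
    return ROWS[scope][COLUMNS.index((component, change))]


print(can_occur("No minor upgrades", "node pools", "patch upgrade"))
print(can_occur("No minor or node upgrades", "node pools", "patch upgrade"))
```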

Multiple exclusions

You may set multiple exclusions on a cluster. These exclusions may have different scopes and may have overlapping time ranges. The end-of-year holiday season use case is an example of overlapping exclusions, where both the "No upgrades" and "No minor upgrades" scopes are in use.

When exclusions overlap, if any active exclusion (that is, current time is within the exclusion time period) blocks an upgrade, the upgrade will be postponed.

Using the end-of-year holiday season use case, a cluster has the following exclusions specified:

  • No minor upgrades: September 30 - January 15
  • No upgrades: November 19 - December 4
  • No upgrades: December 15 - January 5

As a result of these overlapping exclusions, the following upgrades will be blocked on the cluster:

  • Patch upgrade to the node pool on November 25 (rejected by the "No upgrades" exclusion)
  • Minor upgrade to the control plane on December 20 (rejected by the "No minor upgrades" and "No upgrades" exclusions)
  • Patch upgrade to the control plane on December 25 (rejected by the "No upgrades" exclusion)
  • Minor upgrade to the node pool on January 1 (rejected by the "No minor upgrades" and "No upgrades" exclusions)

The following maintenance would be permitted on the cluster:

  • Patch upgrade to the control plane on November 10 (permitted by "No minor upgrades" exclusion)
  • VM disruption due to GKE maintenance on December 10 (permitted by "No minor upgrades" exclusion)
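The outcomes listed above can be reproduced with a short Python sketch. It assumes hypothetical years (2024 into 2025) for the three exclusions, treats both the start and end dates as inclusive, and models only the two scopes used in this example. The blocking_scopes function is an illustrative helper, not part of any GKE API.

```python
from datetime import date

# The three exclusions from the example, with hypothetical years.
EXCLUSIONS = [
    ("No minor upgrades", date(2024, 9, 30), date(2025, 1, 15)),
    ("No upgrades", date(2024, 11, 19), date(2024, 12, 4)),
    ("No upgrades", date(2024, 12, 15), date(2025, 1, 5)),
]


def scope_blocks(scope: str, change: str) -> bool:
    """Whether an exclusion scope blocks a given type of change."""
    if scope == "No upgrades":
        return True  # blocks every change
    if scope == "No minor upgrades":
        return change == "minor upgrade"
    raise ValueError(f"scope not modeled in this sketch: {scope}")


def blocking_scopes(day: date, change: str) -> list:
    """Scopes of all active exclusions that block the change on that day."""
    return [
        scope
        for scope, start, end in EXCLUSIONS
        if start <= day <= end and scope_blocks(scope, change)
    ]


# An empty list means the change is permitted.
print(blocking_scopes(date(2024, 12, 20), "minor upgrade"))
print(blocking_scopes(date(2024, 11, 10), "patch upgrade"))
```

A change is postponed as soon as one active exclusion blocks it, which is why the December 20 minor upgrade is rejected twice over while the November 10 patch upgrade goes through.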

Exclusion expiration

When an exclusion expires (that is, the current time has moved beyond the end time specified for the exclusion), that exclusion will no longer prevent GKE updates. Other exclusions that are still valid (not expired) will continue to prevent GKE updates.

When no exclusions remain that prevent cluster upgrades, your cluster will gradually upgrade to the current default version in the cluster's release channel (or the static default for clusters in no release channel).

If your cluster is multiple minor versions behind the current default version after exclusion expiry, GKE will schedule one minor upgrade per month (upgrading both cluster control plane and nodes) until your cluster has reached the default version for the release channel. If you would like to return your cluster to the default version sooner, you can execute manual upgrades.
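To estimate how long the automatic catch-up might take, you can count the minor versions between the cluster and the default version. The following Python sketch does that arithmetic; the version strings are hypothetical, the function name is illustrative, and it assumes both versions share the same major version and that the default version doesn't move in the meantime.

```python
def monthly_minor_upgrades_needed(current: str, default: str) -> int:
    """Count minor upgrades (roughly one per month) to reach the default version."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    def_major, def_minor = (int(part) for part in default.split(".")[:2])
    if cur_major != def_major:
        raise ValueError("this sketch assumes the same major version")
    return max(0, def_minor - cur_minor)


# Hypothetical versions: a cluster on 1.27 with a default of 1.30.
print(monthly_minor_upgrades_needed("1.27", "1.30"))
```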

Limitations on configuring maintenance exclusions

Maintenance exclusions have the following limitations:

  • You can only restrict the scope of automatic upgrades in a maintenance exclusion for clusters that are enrolled in a release channel. For clusters not enrolled in a release channel, you can only create a maintenance exclusion with the default "No upgrades" scope.
  • You can add a maximum of three maintenance exclusions that exclude all upgrades (that is, a scope of "no upgrades"). These exclusions must be configured to allow for at least 48 hours of maintenance availability in a 32-day rolling window.
  • You can have a maximum of 20 maintenance exclusions for each cluster.
  • If you don't specify a scope in your exclusion, the scope defaults to "no upgrades".
  • You can set maintenance exclusions to different lengths of time, depending on the scope. Review the Maintenance exclusion length row in the Configure a maintenance exclusion section for more details.
  • You can't configure a maintenance exclusion to include or exceed the end of life date of the minor version. For example, with a cluster running a minor version where the GKE release schedule states that its end of life date is June 5, 2023, you must set the end time of the maintenance exclusion to 2023-06-05T00:00:00Z or earlier.
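Some of these limitations can be checked before you submit a configuration. The following Python sketch is a hypothetical pre-check, not a GKE API: it covers the exclusion counts, the release channel restriction, and the end-of-life rule, and it does not check the 48-hour availability requirement or the maximum exclusion length for each scope.

```python
from datetime import datetime, timezone

MAX_EXCLUSIONS = 20             # per cluster
MAX_NO_UPGRADES_EXCLUSIONS = 3  # exclusions with the "No upgrades" scope


def find_problems(exclusions, end_of_life, in_release_channel=True):
    """exclusions: list of (scope, start, end) tuples. Returns problem messages."""
    problems = []
    if len(exclusions) > MAX_EXCLUSIONS:
        problems.append("more than 20 exclusions")
    no_upgrades = [e for e in exclusions if e[0] == "No upgrades"]
    if len(no_upgrades) > MAX_NO_UPGRADES_EXCLUSIONS:
        problems.append("more than 3 'No upgrades' exclusions")
    for scope, start, end in exclusions:
        if end > end_of_life:
            problems.append(f"{scope}: ends after the minor version's end of life")
        if not in_release_channel and scope != "No upgrades":
            problems.append(f"{scope}: scope requires a release channel")
    return problems


utc = timezone.utc
eol = datetime(2023, 6, 5, tzinfo=utc)  # end of life date from the example above
ok = [("No minor upgrades", datetime(2023, 3, 1, tzinfo=utc), eol)]
late = [("No minor upgrades", datetime(2023, 3, 1, tzinfo=utc), datetime(2023, 6, 6, tzinfo=utc))]

print(find_problems(ok, eol))
print(find_problems(late, eol))
```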

Usage examples

Here are some example use cases for restricting the scope of updates that can occur.

Example: Retailer preparing for the end-of-year holiday season

In this example, the retail business does not want disruptions during the highest-volume sales periods, which are the four days encompassing Black Friday through Cyber Monday, and the month of December until the start of the new year. In preparation for the shopping season, the cluster administrator sets up the following exclusions:

  • No minor upgrades: Allow only patch updates on the control plane and nodes between September 30 and January 15.
  • No upgrades: Freeze all upgrades between November 19 and December 4.
  • No upgrades: Freeze all upgrades between December 15 and January 5.

If no other exclusion windows apply when the maintenance exclusion expires, the cluster is upgraded to a new GKE minor version if one was made available between September 30 and January 6.

Example: Company using a beta API in Kubernetes that's being removed

In this example, a company is using the CustomResourceDefinition apiextensions.k8s.io/v1beta1 API, which will be removed in version 1.22. While the company is running versions earlier than 1.22, the cluster administrator sets up the following exclusion:

  • No minor upgrades: Freeze minor upgrades for three months while migrating customer applications from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1.

Example: Company's legacy database not resilient to node pool upgrades

In this example, a company is running a database that does not respond well to Pod evictions and rescheduling that occurs during a node pool upgrade. The cluster administrator sets up the following exclusion:

  • No minor or node upgrades: Freeze node upgrades for three months. When the company is ready to accept downtime for the database, they trigger a manual node upgrade.
