Manage cluster lifecycle changes to minimize disruption


This page explains how you and Google Kubernetes Engine (GKE) manage changes during the lifecycle of a cluster to maximize performance and availability while minimizing workload disruption.

This page is intended for platform administrators who want to plan and optimize their cluster environment to minimize disruption for their workloads. You can read this page either before or after learning how to perform the basic cluster management tasks described in Managing clusters and Cluster administration overview.

A managed platform and shared responsibility

GKE is a Google-managed implementation of the Kubernetes open source container orchestration platform. As mentioned in How GKE works, a GKE cluster consists of a control plane, which includes management nodes running system components, and worker nodes, where you deploy workloads.

Creating an optimal cluster environment for your workloads, one with maximum performance and availability and minimal disruption, is a shared responsibility:

  • GKE's responsibility is to maintain a reliable, available, secure, and performant cluster environment. To do this, GKE manages the control plane, system components, and, for Autopilot mode, the worker nodes.
  • Your responsibility as a platform administrator is to configure your cluster and manage your workloads, including preparing them to handle disruption. With Standard mode, you also create and manage the worker nodes, which are grouped in node pools.

To learn more, see GKE shared responsibility.

How GKE manages changes during the lifecycle of a cluster

As an implementation of Kubernetes, a GKE cluster is a network of processes and systems acting together to maintain the optimal environment to run your workloads. To manage the cluster, GKE performs maintenance tasks, makes changes, initiates operations, updates components, and upgrades the version of the control plane and nodes.

Most of this day-to-day management happens quietly in the background, keeping your workloads running without disruption. Some critical changes, however, must be completed in ways that could temporarily disrupt your workloads, as described in the next section.

Some cluster changes can be disruptive to workloads

While GKE strives to keep your workloads running seamlessly, some essential types of changes can require temporary disruptions to your workloads—primarily changes that restart the nodes running your workloads. Using GKE and Kubernetes features, you can specify when and how you want disruption to take place, so that when it does, your workloads can gracefully handle the changes.

The following sections explain what types of changes GKE makes to clusters, what type of disruption they cause, and how you can prepare.

Upgrades and updates with GKE cluster lifecycle management

In GKE, cluster upgrades and cluster updates have distinct but related meanings.

The term cluster upgrades—or just upgrades—refers to updating the Kubernetes version of the control plane (control plane upgrades) or nodes (node upgrades), or both. With Standard clusters, node upgrades can also be referred to as node pool upgrades because GKE upgrades all of the nodes in a node pool with a single operation.

The term cluster updates—or just updates—is a more general term referring to any type of control plane or node changes, including updating their versions. GKE actively manages your cluster environment by performing upgrades, other types of updates, and necessary maintenance operations. These actions ensure your cluster remains performant, secure, and up-to-date with the latest features and bug fixes. GKE uses tools like node upgrade strategies and maintenance policies to minimize disruption during these processes.
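
For example, you can check whether a cluster's nodes lag behind its control plane version, which indicates that a node upgrade is pending. The following is a minimal sketch that assumes the google-cloud-container Python client library; the project, location, and cluster names are placeholders.

```python
# Minimal sketch: compare the control plane version with each node pool's
# version to see whether a node upgrade is pending. Assumes the
# google-cloud-container client library is installed and that the project,
# location, and cluster names below are placeholders you replace.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
name = "projects/my-project/locations/us-central1/clusters/my-cluster"

cluster = client.get_cluster(request={"name": name})
print(f"Control plane version: {cluster.current_master_version}")

for pool in cluster.node_pools:
    status = "matches" if pool.version == cluster.current_master_version else "lags"
    print(f"Node pool {pool.name}: {pool.version} ({status} the control plane)")
```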

Planning for node update disruptions

Certain types of cluster changes—mostly changes to nodes—can cause disruption.

GKE uses node upgrade strategies to update nodes, both Autopilot nodes and Standard cluster node pools, in a way that's optimized for your workload's needs. These strategies apply to version upgrades and also to some other types of node changes. The strategies let GKE minimize disruption while performing node updates, which are important for keeping clusters functional and performant.

Best practice:

Use maintenance windows and exclusions to choose when some cluster maintenance does and doesn't occur, and, for Standard clusters, pick a node upgrade strategy that best fits your workload profile and resource constraints.
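
As one way to confirm which strategy a Standard node pool currently uses, the following minimal sketch reads the pool's upgrade settings through the GKE API. It assumes the google-cloud-container Python client library; all resource names are placeholders, and the strategy field might be unset for pools that rely on the default surge behavior.

```python
# Minimal sketch: inspect the upgrade settings currently configured on a
# Standard node pool to confirm which node upgrade strategy applies.
# The project, location, cluster, and node pool names are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
name = (
    "projects/my-project/locations/us-central1/"
    "clusters/my-cluster/nodePools/default-pool"
)

pool = client.get_node_pool(request={"name": name})
settings = pool.upgrade_settings
print(f"Strategy: {settings.strategy}")            # e.g. SURGE or BLUE_GREEN
print(f"Max surge nodes: {settings.max_surge}")
print(f"Max unavailable nodes: {settings.max_unavailable}")
```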

Whether changes to nodes are initiated manually or automatically, GKE applies them with the following general characteristics:

  • Changes typically respect maintenance policies: When GKE makes changes to the nodes, these changes generally respect GKE maintenance policies. Consider the following if you initiate manual changes that require all the nodes in a node pool to be recreated:
    • For some changes, GKE respects maintenance policies and doesn't apply the change you submitted until there is maintenance availability. If GKE is waiting for maintenance availability and the change is urgent, you can manually apply the change to roll out the new configuration immediately.
    • For other manual changes, including manual upgrades, GKE doesn't respect maintenance policies. For these manual changes, ensure that your workloads are prepared for immediate disruption.
  • Changes generally use node upgrade strategies: When GKE applies most automatic or manually initiated changes to nodes, including node updates other than version upgrades, GKE chooses a node upgrade strategy: surge upgrades or blue-green upgrades. Autopilot always uses surge upgrades. Changes to Standard cluster node pools typically use surge upgrades, except when you've configured blue-green upgrades and make certain types of changes.
  • Changes require sufficient resources: When GKE applies a change using a node upgrade strategy, the change requires a certain amount of resources depending on the strategy and its configuration. Your cluster's project must have enough resource quota, resource availability, and reservation capacity (for node pools with specific reservation affinity); see the sketch after this list for a rough estimate. To learn more, see Ensure resources for node upgrades.
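
The following minimal sketch illustrates that last point: with surge upgrades, GKE temporarily creates up to the configured number of extra nodes, so the project needs headroom for that many additional machines. The surge value and machine shape below are placeholder assumptions.

```python
# Minimal sketch: rough estimate of the temporary extra capacity that a surge
# upgrade needs. With surge upgrades, GKE creates up to max_surge extra nodes
# while the upgrade runs, so the project needs quota and capacity for that
# many additional machines. The machine shape below (e2-standard-4) is a
# placeholder; substitute your node pool's machine type.
max_surge = 2              # extra nodes created during the upgrade
vcpus_per_node = 4         # e2-standard-4
memory_gb_per_node = 16    # e2-standard-4

extra_vcpus = max_surge * vcpus_per_node
extra_memory_gb = max_surge * memory_gb_per_node

print(f"Temporary extra capacity needed: {max_surge} nodes, "
      f"{extra_vcpus} vCPUs, {extra_memory_gb} GB of memory")
```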

For a detailed list of specific changes and their characteristics, see Types of changes to a GKE cluster on this page.

Maximize workload availability by preparing for disruptive changes

To maximize the availability of your workloads running on a GKE cluster, we recommend that you take the actions described in the following sections:

Choose your cluster availability

If control plane availability is a priority, choose an Autopilot cluster or regional Standard cluster rather than a zonal Standard cluster. To learn more, see About cluster configuration choices.
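
If you're auditing existing clusters, the following minimal sketch lists them and flags zonal control planes. It assumes the google-cloud-container Python client library; the project name is a placeholder, and the zone-suffix check is a simple naming heuristic rather than an API-provided flag.

```python
# Minimal sketch: list clusters in a project and flag ones whose control
# plane is zonal. A Google Cloud zone name ends in a single-letter suffix
# such as "-a", while a region name doesn't, so the check below is a naming
# heuristic. "my-project" is a placeholder.
import re

from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
response = client.list_clusters(
    request={"parent": "projects/my-project/locations/-"}  # "-" = all locations
)

for cluster in response.clusters:
    zonal = bool(re.search(r"-[a-z]$", cluster.location))
    kind = "zonal (single control plane)" if zonal else "regional"
    print(f"{cluster.name}: {cluster.location} -> {kind}")
```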

Control upgrades using GKE tools

You can use the following tools to control when and how GKE upgrades your cluster, making it possible to implement the best practices:

  • Release channels: Choose a release channel to get cluster versions with your chosen balance of feature availability and stability.
  • Maintenance windows: Specify a recurring window of time when certain types of GKE cluster maintenance, such as upgrades, can occur (see the sketch after this list).
  • Maintenance exclusions: Prevent cluster maintenance from occurring for a specific time period.
  • Node upgrade strategies: If using Standard clusters, choose how your nodes are updated (surge upgrades or blue-green upgrades) to minimize disruption to your workloads.
  • Rollout sequencing: Qualify upgrades in a pre-production environment before GKE upgrades your production clusters.
  • Manual upgrades: Manually upgrade your cluster, and perform such actions as canceling, resuming, rolling back, and completing automatic or manual in-progress upgrades.
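
As a hedged illustration of the maintenance window and exclusion items above, the following minimal sketch sets a recurring weekend window and a one-week exclusion through the GKE API. It assumes the google-cloud-container Python client library; the times, recurrence rule, exclusion name, and resource names are placeholders, and the current policy's resource_version is read first so that the API accepts the update.

```python
# Minimal sketch: set a recurring weekend maintenance window plus a one-week
# maintenance exclusion through the GKE API. Times, the recurrence rule, and
# all resource names are placeholders.
from datetime import datetime, timezone

from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
name = "projects/my-project/locations/us-central1/clusters/my-cluster"

# Read the existing policy so its resource_version can be passed back.
cluster = client.get_cluster(request={"name": name})

policy = {
    "resource_version": cluster.maintenance_policy.resource_version,
    "window": {
        # Allow maintenance only on Saturday and Sunday, 04:00-08:00 UTC.
        "recurring_window": {
            "window": {
                "start_time": datetime(2024, 1, 6, 4, 0, tzinfo=timezone.utc),
                "end_time": datetime(2024, 1, 6, 8, 0, tzinfo=timezone.utc),
            },
            "recurrence": "FREQ=WEEKLY;BYDAY=SA,SU",
        },
        # Block maintenance entirely during a sensitive week.
        "maintenance_exclusions": {
            "quarter-end-freeze": {
                "start_time": datetime(2024, 3, 25, tzinfo=timezone.utc),
                "end_time": datetime(2024, 4, 1, tzinfo=timezone.utc),
            }
        },
    },
}

operation = client.set_maintenance_policy(
    request={"name": name, "maintenance_policy": policy}
)
print(f"Started operation: {operation.name}")
```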

Manage and monitor your cluster

To manage potential disruption to your clusters, continuously perform the following tasks:

Prepare your workloads

Manage disruption by making your workloads as resilient as possible:

For a general discussion of these topics, see the Manage disruption section of the GKE best practices: Day 2 operations for business continuity blog post.
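
One concrete Kubernetes feature for bounding voluntary disruption is a PodDisruptionBudget, which GKE considers when draining nodes during upgrades. The following minimal sketch creates one with the Kubernetes Python client; the namespace, name, replica threshold, and app label are placeholders for your own workload.

```python
# Minimal sketch: create a PodDisruptionBudget so that node drains during
# upgrades never voluntarily evict the matching Deployment below two ready
# replicas. The namespace, name, and app label are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="web-pdb", namespace="default"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=2,  # keep at least 2 Pods running during voluntary evictions
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="default", body=pdb
)
```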

Types of changes to a GKE cluster

The following tables show the most common types of major changes to a cluster, including characteristics of these changes such as frequency and level of disruption.

Types of upgrades

Review the following list to understand how upgrades can disrupt a cluster environment.

  • Control plane upgrade
    • Automatic or manually initiated: Automatic or manual.
    • Respects maintenance policies: Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary. Manual upgrades aren't blocked by maintenance policies.
    • Frequency: Patch upgrades, as often as every week, depending on the release channel. Minor upgrades approximately every four months. For Extended channel clusters, minor upgrades only when the minor version nears the end of support.
    • Type of disruption: Control plane.
    • Level of disruption: For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, multiple minutes where you can't communicate with the control plane, meaning that you can't configure the cluster, nodes, and workloads during that time.
  • Node upgrade
    • Automatic or manually initiated: Automatic or manual.
    • Respects maintenance policies: Automatic upgrades respect maintenance policies until the end of support, except for extremely rare emergency fixes, as necessary. Manual upgrades aren't blocked by maintenance policies.
    • Frequency: Typically the same as the control plane upgrades. If your cluster isn't enrolled in a release channel and you disable node auto-upgrades, you're responsible for manually upgrading your cluster's node pools.
    • Type of disruption: All nodes for Autopilot clusters, or one or more Standard cluster node pools.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades for Autopilot, or the configured node upgrade strategy (surge or blue-green) for Standard clusters.
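
For the manual case, the following minimal sketch starts a control plane upgrade and then a node pool upgrade through the GKE API. It assumes the google-cloud-container Python client library; the version string and all resource names are placeholders, and, as noted above, manual upgrades aren't blocked by maintenance policies, so the node pool's workloads should be ready for immediate disruption.

```python
# Minimal sketch: manually upgrade the control plane, then a node pool, to a
# specific version. All names and the version string are placeholders.
import time

from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
cluster_name = "projects/my-project/locations/us-central1/clusters/my-cluster"
target_version = "1.30.5-gke.1000000"  # placeholder GKE version

# Upgrade the control plane first; nodes can't run a newer version than it.
op = client.update_master(
    request={"name": cluster_name, "master_version": target_version}
)
print(f"Control plane upgrade operation: {op.name}")

# Wait for the control plane operation to finish before touching node pools.
op_name = f"projects/my-project/locations/us-central1/operations/{op.name}"
while client.get_operation(request={"name": op_name}).status != container_v1.Operation.Status.DONE:
    time.sleep(30)

# Then upgrade a node pool to the same version; GKE recreates its nodes using
# the configured node upgrade strategy.
op = client.update_node_pool(
    request={
        "name": f"{cluster_name}/nodePools/default-pool",
        "node_version": target_version,
        "image_type": "COS_CONTAINERD",  # placeholder; keep your pool's image type
    }
)
print(f"Node pool upgrade operation: {op.name}")
```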

Manual changes that recreate the nodes using a node upgrade strategy and respecting maintenance policies

Review the following list to understand how these manual changes can disrupt a cluster environment. This list includes, among other changes, manual changes that respect GKE maintenance policies.

  • Rotating the cluster credentials
    • Automatic or manually initiated: Automatic if the cluster credentials are expiring within 30 days; can also be manually initiated.
    • Respects maintenance policies: Does respect maintenance policies; however, GKE might override maintenance policies within 30 days of credential expiration. Also, if you manually trigger specific operations after the first step, that operation doesn't respect maintenance policies.
    • Frequency: Once per manual change of this type, or depends on the cluster credential lifetime for automatic initiation. You can manually invoke operations for specific steps in the rotation process.
    • Type of disruption: For some steps, the control plane. For other steps, all nodes for Autopilot clusters, or all nodes in each Standard cluster node pool.
    • Level of disruption: When you initiate and complete the rotation, the control plane remains available for Autopilot and regional Standard clusters; for zonal Standard clusters, both operations cause brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, nodes must be shut down and Pods must be replaced, and GKE uses surge upgrades to recreate the nodes.
  • Rotating the control plane's IP address
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Does respect maintenance policies; however, if you manually trigger specific operations after the first step, that operation doesn't respect maintenance policies.
    • Frequency: Once per manual change of this type. You can manually invoke operations for specific steps in the rotation process.
    • Type of disruption: For some steps, the control plane. For other steps, all nodes for Autopilot clusters, or all nodes in each Standard cluster node pool.
    • Level of disruption: When you initiate and complete the rotation, the control plane remains available for Autopilot and regional Standard clusters; for zonal Standard clusters, both operations cause brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, nodes must be shut down and Pods must be replaced, and GKE uses surge upgrades to recreate the nodes.
  • Configuring shielded nodes
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Recreating the control plane doesn't respect maintenance policies and immediately makes the changes. Recreating the nodes does respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: The control plane is updated. After the control plane is updated, all nodes in each Standard cluster node pool must be recreated.
    • Level of disruption: When the control plane is recreated, it remains available for Autopilot and regional Standard clusters; for zonal Standard clusters, the operation causes brief downtime, meaning multiple minutes where you can't communicate with the control plane to perform operations like configuring the cluster, nodes, and workloads. When the nodes are recreated, nodes must be shut down and Pods must be replaced, and GKE uses surge upgrades to recreate the nodes.
  • Configuring network policies
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Does respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes for Autopilot clusters, or all nodes in each Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
  • Configuring intranode visibility
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Does respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes for Autopilot clusters, or all nodes in each Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
  • Configuring NodeLocal DNSCache
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Does respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in the Standard cluster node pool that's being updated.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses surge upgrades to recreate the nodes.
  • Enabling Image streaming
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: When updating at the cluster level, respects maintenance policies. When updating individual node pools, doesn't respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: If toggled at the node pool level, all nodes in the Standard cluster node pool. If toggled at the cluster level, nodes of any Standard cluster node pools where you haven't individually enabled or disabled the setting for the node pool.
    • Level of disruption: GKE uses surge upgrades to recreate the nodes of a node pool.
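
As a hedged illustration of the credential rotation entry above, the following minimal sketch shows the API calls that start and complete a rotation, assuming the google-cloud-container Python client library; all resource names are placeholders, and the node recreation step in the middle is intentionally not shown.

```python
# Minimal sketch: the two API calls that bracket a credential rotation. All
# names are placeholders. Between these calls you must recreate the nodes
# (for example, by upgrading each node pool) so they pick up the new
# credentials; completing the rotation before that cuts off the old nodes.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
name = "projects/my-project/locations/us-central1/clusters/my-cluster"

# Step 1: start the rotation. rotate_credentials=True rotates credentials,
# not just the control plane IP address.
op = client.start_ip_rotation(request={"name": name, "rotate_credentials": True})
print(f"Started rotation: {op.name}")

# Step 2 (not shown): recreate the nodes, for example by upgrading each
# node pool, so they use the new credentials.

# Step 3: complete the rotation only after every node uses the new credentials.
# op = client.complete_ip_rotation(request={"name": name})
```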

Automatic maintenance that doesn't respect maintenance policies

Review the following list to understand how automatic maintenance that doesn't respect maintenance policies can disrupt a cluster environment.

  • Control plane repair or resize
    • Automatic or manually initiated: Automatic.
    • Respects maintenance policies: Doesn't respect maintenance policies.
    • Frequency: Control plane repair frequency is random but has no impact for Autopilot and regional Standard clusters. Control plane resize is infrequent, but increases in frequency with cluster scaling events, and also has no impact for Autopilot and regional Standard clusters.
    • Type of disruption: Control plane.
    • Level of disruption: For Autopilot and regional Standard clusters, the control plane remains available. For zonal Standard clusters, multiple minutes where you can't communicate with the control plane, meaning that you can't configure the cluster, nodes, and workloads during that time.
  • Host maintenance event
    • Automatic or manually initiated: Automatic.
    • Respects maintenance policies: Doesn't respect maintenance policies.
    • Frequency: Refer to Maintenance events for approximate frequency.
    • Type of disruption: One node.
    • Level of disruption: For most types of nodes, minimal effect. Some nodes, including those with GPUs or TPUs, might experience greater disruption. To learn more, see Other Google Cloud maintenance.
  • Node auto-repair
    • Automatic or manually initiated: Automatic.
    • Respects maintenance policies: Doesn't respect maintenance policies.
    • Frequency: Node auto-repair frequency is random.
    • Type of disruption: One node.
    • Level of disruption: The node is restarted, so any Pods running on the node are disrupted.
  • Reclaim Spot VMs and Preemptible VMs
    • Automatic or manually initiated: Automatic.
    • Respects maintenance policies: Doesn't respect maintenance policies.
    • Frequency: For preemptible VMs, at least once every 24 hours. For Spot VMs, when Compute Engine needs the resources elsewhere.
    • Type of disruption: One node.
    • Level of disruption: See details about the termination and graceful shutdown of Spot VMs, and the termination and graceful shutdown of preemptible VMs.
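
Because Spot and preemptible VM reclamation, node auto-repair, and node drains all end with Pods receiving SIGTERM, workloads that trap the signal can finish in-flight work within their termination grace period. The following minimal sketch shows a generic Python worker loop that does this; it isn't GKE-specific, and the work loop is a placeholder for your own processing.

```python
# Minimal sketch: a worker that traps SIGTERM so it can stop cleanly when its
# Pod is evicted during a node drain or when a Spot/preemptible VM is
# reclaimed. The "work" here is a stand-in for your own processing loop.
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first, then SIGKILL after the Pod's
    # terminationGracePeriodSeconds elapses.
    global shutting_down
    shutting_down = True
    print("SIGTERM received; finishing current work and exiting")

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # Placeholder for one unit of work that is safe to stop between iterations.
    time.sleep(1)

print("Shut down cleanly")
```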

Manual changes that recreate the nodes using a node upgrade strategy without respecting maintenance policies

Review the following list to understand how these manual changes can disrupt a cluster environment. It includes changes where GKE uses surge upgrades or blue-green upgrades but that aren't included in the previous section because they don't respect maintenance policies.

  • Node pool label update
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: GKE immediately uses surge upgrades to recreate the node pool when you update the node labels on an existing node pool, regardless of any active maintenance policies.
  • Vertically scaling the nodes by changing the node machine attributes
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
  • Image type changes
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses the configured node upgrade strategy (surge or blue-green) for Standard clusters.
  • Add or replace storage pools in a Standard cluster node pool
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE uses the configured node upgrade strategy (surge or blue-green) for Standard clusters.
  • Enabling Image streaming
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: When updating at the cluster level, respects maintenance policies. When updating individual node pools, doesn't respect maintenance policies.
    • Frequency: Once per change of this type.
    • Type of disruption: If toggled at the node pool level, all nodes in the Standard cluster node pool. If toggled at the cluster level, nodes of any Standard cluster node pools where you haven't individually enabled or disabled the setting for the node pool.
    • Level of disruption: GKE uses surge upgrades to recreate the nodes of a node pool.
  • Network performance configuration updates
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
  • Enabling gVNIC
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
  • Node system configuration changes
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.
  • Confidential nodes
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All nodes in a Standard cluster node pool.
    • Level of disruption: Nodes must be shut down to be recreated, Pods must be replaced. GKE immediately uses surge upgrades to recreate the nodes on an existing node pool, regardless of any active maintenance policies.

Changes that don't require recreating the nodes

Review the following list to understand which changes to the node configuration don't require recreating the nodes. These changes aren't disruptive; however, disruption is still possible if the updated node configuration affects your workload.

  • Update the following settings:
    • Automatic or manually initiated: Manually initiated.
    • Respects maintenance policies: Doesn't respect maintenance policies, immediately makes the changes.
    • Frequency: Once per change of this type.
    • Type of disruption: All relevant nodes are updated.
    • Level of disruption: Pods don't have to be replaced because the node configuration is updated without recreating the nodes.

What's next