This document describes best practices and considerations to upgrade Google Distributed Cloud. You learn how to prepare for cluster upgrades, and the best practices to follow before the upgrade. These best practices help to reduce the risks associated with cluster upgrades.
If you have multiple environments such as test, development, and production, we recommend that you start with the least critical environment, such as test, and verify the upgrade functionality. After you verify that the upgrade was successful, move on to the next environment. Repeat this process until you upgrade your production environments. This approach lets you move from one critical point to the next, and verify that the upgrade and your workloads all run correctly.
Upgrade checklist
We recommend that you follow all of the best practices in this document. Use the following checklist to help you track your progress. Each item in the list links to a section in this document with more information:
After these checks are complete, you can start the upgrade process. Monitor the progress until all clusters are successfully upgraded.
Plan the upgrade
Upgrades can be disruptive. Before you start the upgrade, plan carefully to make sure that your environment and applications are ready.
Estimate the time commitment and plan a maintenance window
The amount of time it takes to upgrade a cluster varies depending on the number of nodes and the workload density that runs on them. To successfully complete a cluster upgrade, use a maintenance window with enough time.
To calculate a rough estimate for the upgrade, use `10 minutes * the number of nodes` for a single concurrent node upgrade.
For example, if you have 50 nodes in a cluster, the total upgrade time is about 500 minutes: `10 minutes * 50 nodes = 500 minutes`.
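For example, the following shell sketch derives the estimate from the current node count; it assumes `kubectl` access to the cluster that you plan to upgrade:

```bash
# Rough upgrade-time estimate: ~10 minutes per node for a single
# concurrent node upgrade.
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
echo "Estimated upgrade time: $(( NODE_COUNT * 10 )) minutes"
```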
Check compatibility of other GKE Enterprise components
If your cluster runs GKE Enterprise components like Cloud Service Mesh, Config Sync, Policy Controller, or Config Controller, check the GKE Enterprise compatibility matrix and verify the supported versions with Google Distributed Cloud before and after the upgrade.
The compatibility check is based on the admin or user cluster that Cloud Service Mesh, Config Sync, Policy Controller, or Config Controller is deployed into.
Check cluster resource utilization
To make sure that Pods can be evacuated when the node drains and that there are enough resources in the cluster being upgraded to manage the upgrade, check the current resource usage of the cluster. To check the resource usage for your cluster, use the custom dashboards in Google Cloud Observability.
You can use commands such as `kubectl top nodes` to get the current cluster resource usage, but dashboards can provide a more detailed view of resources consumed over time. This resource usage data can help indicate when an upgrade would cause the least disruption, such as during weekends or evenings, depending on the running workload and use cases.
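For a quick, point-in-time view from the command line (assuming the metrics server is running in the cluster), you can use:

```bash
# Point-in-time node resource usage (requires the metrics server).
kubectl top nodes

# Per-Pod usage across all namespaces, sorted by CPU consumption.
kubectl top pods --all-namespaces --sort-by=cpu
```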
The timing for the admin cluster upgrade might be less critical than for the user clusters, because an admin cluster upgrade usually does not introduce application downtime. However, it's still important to check for available resources before you begin an admin cluster upgrade. Also, upgrading the admin cluster might imply some risk, and therefore might be recommended during less active usage periods when management access to the cluster is less critical.
Admin cluster control plane resources
All of the upgrade controllers and jobs run on the admin cluster control plane nodes. Check the resource consumption of these control plane nodes for available compute resources. The upgrade process typically requires 1000 millicores of CPU (1000 mCPU) and 2-3 GiB RAM for each set of lifecycle controllers. Note that the CPU unit 'mCPU' stands for "thousandth of a core", so 1000 millicores is the equivalent of one core on each node for each set of lifecycle controllers. To reduce the additional compute resources required during an upgrade, try to keep user clusters at the same version.
In the following example deployment, the two user clusters are at different versions than the admin cluster:
Admin cluster | User cluster 1 | User cluster 2 |
---|---|---|
1.13.3 | 1.13.0 | 1.13.2 |
A set of lifecycle controllers is deployed in the admin cluster for each version in use. In this example, there are three sets of lifecycle controllers: `1.13.3`, `1.13.0`, and `1.13.2`. Each set of lifecycle controllers consumes a total of 1000 mCPU and 3 GiB RAM. The current total resource consumption of these lifecycle controllers is 3000 mCPU and 9 GiB RAM.

If user cluster 2 is upgraded to `1.13.3`, there are now two sets of lifecycle controllers: `1.13.3` and `1.13.0`:
Admin cluster | User cluster 1 | User cluster 2 |
---|---|---|
1.13.3 | 1.13.0 | 1.13.3 |
The lifecycle controllers now consume a total of 2000 mCPU and 6 GiB of RAM.
If user cluster 1 is upgraded to `1.13.3`, all clusters in the fleet now run the same version, `1.13.3`:
Admin cluster | User cluster 1 | User cluster 2 |
---|---|---|
1.13.3 | 1.13.3 | 1.13.3 |
There is now only one set of lifecycle controllers, which consumes a total of 1000 mCPU and 3 GiB of RAM.
In the following example, all the user clusters are the same version. If the admin cluster is upgraded, only two sets of lifecycle controllers are used, so the compute resource consumption is reduced:
Admin cluster | User cluster 1 | User cluster 2 |
---|---|---|
1.14.0 | 1.13.3 | 1.13.3 |
In this example, the lifecycle controllers again consume a total of 2000 mCPU and 6 GiB of RAM until all the user clusters are upgraded to the same version as the admin cluster.
If the control plane nodes don't have additional compute resources during the upgrade, you might see Pods such as `anthos-cluster-operator`, `capi-controller-manager`, `cap-controller-manager`, or `cap-kubeadm-bootstraper` in a `Pending` state. To resolve this problem, upgrade some of the user clusters to the same version to consolidate the versions and reduce the number of lifecycle controllers in use. If your upgrade is already stuck, you can also use `kubectl edit deployment` to edit the pending deployments to lower the CPU and RAM requests so they fit into the admin cluster control plane.
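As a starting point, the following sketch shows one way to find Pending lifecycle-controller Pods on the admin cluster and edit a pending Deployment; `ADMIN_KUBECONFIG`, `NAMESPACE`, and `DEPLOYMENT_NAME` are placeholders for your own values:

```bash
# Find Pending Pods on the admin cluster and filter for the
# lifecycle controllers listed above.
kubectl --kubeconfig ADMIN_KUBECONFIG get pods --all-namespaces \
    --field-selector=status.phase=Pending \
  | grep -E 'anthos-cluster-operator|capi-controller-manager|cap-controller-manager|cap-kubeadm-bootstraper'

# If the upgrade is already stuck, edit a pending Deployment to lower
# its CPU and RAM requests so it fits on the control plane.
kubectl --kubeconfig ADMIN_KUBECONFIG edit deployment DEPLOYMENT_NAME -n NAMESPACE
```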
The following table details the compute resource requirements for different upgrade scenarios:
Upgrade scenario | Admin cluster resources required |
---|---|
User cluster upgrade | Upgrade to the same version as other clusters: N/A. Upgrade to a different version than other admin or user clusters: 1000 mCPU and 3 GiB RAM. User clusters in a hybrid cluster have the same resource requirements. |
Admin cluster upgrade (with user cluster) | 1000 mCPU and 3 GiB RAM |
Hybrid cluster upgrade (without user cluster) | 1000 mCPU and 3 GiB RAM surge. Resources are returned after use. |
Standalone cluster upgrade | 200 mCPU and 1 GiB RAM surge. Resources are returned after use. |
Back up clusters
Before you start an upgrade, back up clusters using the `bmctl backup cluster` command.
Because the backup file contains sensitive information, store the backup file securely.
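A minimal example of the backup command follows; `CLUSTER_NAME` and `ADMIN_KUBECONFIG` are placeholders, and flag names can vary by `bmctl` version, so confirm them with `bmctl backup cluster --help`:

```bash
# Back up the cluster before the upgrade.
bmctl backup cluster -c CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG
```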
Verify clusters are configured and working properly
To check the health of a cluster before an upgrade, run `bmctl check cluster` on the cluster. The command runs advanced checks, such as identifying nodes that aren't configured properly or that have Pods in a stuck state.

When you run the `bmctl upgrade cluster` command to upgrade your clusters, some preflight checks run. The upgrade process stops if these checks aren't successful. It's best to proactively identify and fix these problems with the `bmctl check cluster` command, rather than relying on the preflight checks, which are there to protect clusters from any possible damage.
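For example, a typical pre-upgrade health check looks like the following; `CLUSTER_NAME` and `ADMIN_KUBECONFIG` are placeholders for your own values:

```bash
# Run the pre-upgrade health checks against the cluster.
bmctl check cluster -c CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG

# Review the check results that the command reports and fix any
# failures before you run `bmctl upgrade cluster`.
```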
Review user workload deployments
There are two areas to consider for user workloads: draining and API compatibility.
Workload draining
The user workload on a node is drained during an upgrade. If the workload has a single replica or all replicas run on the same node, draining might disrupt the services running in the cluster. Run your workloads with multiple replicas, and keep the number of replicas greater than the number of nodes that upgrade concurrently.
To avoid a stuck upgrade, the draining process for upgrades up to version 1.16 doesn't respect Pod disruption budgets (PDBs). Workloads might run in a degraded state, and the minimum number of serving replicas is `total replica number - concurrent upgrade number`.
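The following sketch shows one way to review replica counts and scale up a thinly replicated Deployment before the upgrade; `NAMESPACE`, `DEPLOYMENT_NAME`, and the target replica count are placeholders:

```bash
# Review replica counts to find workloads that can't tolerate a drain.
kubectl get deployments --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas'

# Scale up a single-replica Deployment before the upgrade.
kubectl scale deployment DEPLOYMENT_NAME -n NAMESPACE --replicas=3
```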
API compatibility
For API compatibility, check that your workloads are compatible with the newer minor version of Kubernetes when you do a minor version upgrade. If needed, upgrade the workloads to a compatible version. Where possible, the GKE Enterprise engineering team provides instructions to identify workloads that use incompatible APIs, such as removed Kubernetes APIs.
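One way to spot remaining callers of deprecated APIs (assuming you can query the API server metrics endpoint) is to check the `apiserver_requested_deprecated_apis` metric:

```bash
# Non-empty output indicates that clients still call deprecated APIs,
# which might be removed in the target Kubernetes minor version.
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
```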
If you use Cloud Service Mesh, Config Sync, Policy Controller, Config Controller, or other GKE Enterprise components, check whether the installed version is compatible with the new version of Google Distributed Cloud. For GKE Enterprise component version compatibility information, see GKE Enterprise version and upgrade support.
Audit the use of webhooks
Check if your cluster has any webhooks, especially webhooks on Pod resources that are used for auditing purposes, such as Policy Controller. The draining process during the cluster upgrade might disrupt the Policy Controller webhook service, which can cause the upgrade to become stuck or take a long time. We recommend that you temporarily disable these webhooks, or use a highly available (HA) deployment.
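For example, you can list the admission webhook configurations in the cluster and inspect which ones intercept Pod resources; `WEBHOOK_NAME` is a placeholder:

```bash
# List the admission webhook configurations in the cluster.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Inspect one configuration for rules that match Pod resources.
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml | grep -i -B 5 -A 5 'pods'
```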
Review the use of Preview features
Preview features are subject to change and are provided for testing and evaluation purposes only. Don't use Preview features on your production clusters. We don't guarantee that clusters that use Preview features can be upgraded. In some cases, we explicitly block upgrades for clusters that use Preview features.
For information about breaking changes related to upgrading, see the release notes.
Check SELinux status
If you want to enable SELinux to secure your containers, you must make sure that SELinux is enabled in `Enforced` mode on all your host machines. In Google Distributed Cloud release 1.9.0 and later, you can enable or disable SELinux before or after cluster creation or cluster upgrades. SELinux is enabled by default on Red Hat Enterprise Linux (RHEL) and CentOS. If SELinux is disabled on your host machines or you aren't sure, see Securing your containers using SELinux for instructions on how to enable it. Google Distributed Cloud supports SELinux only on RHEL and CentOS systems.
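The following commands, run on each host machine, show one way to check and enable SELinux enforcement; they assume a RHEL or CentOS host with the standard SELinux tooling installed:

```bash
# Check the current SELinux mode; expect "Enforcing".
getenforce
sestatus

# Switch to enforcing mode for the current boot only.
sudo setenforce 1

# To make the change persistent, set SELINUX=enforcing in
# /etc/selinux/config and reboot the host.
```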
Don't change the Pod density configuration
Google Distributed Cloud supports the configuration of up to 250 maximum pods per node with `nodeConfig.PodDensity.MaxPodsPerNode`. You can configure pod density during cluster creation only. You can't update pod density settings for existing clusters. Don't try to change the pod density configuration during an upgrade.
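To confirm the value that your cluster was created with, you can check the cluster configuration file; the path below assumes the default `bmctl` workspace layout and uses `CLUSTER_NAME` as a placeholder:

```bash
# Check the pod density value recorded in your cluster configuration file.
grep -A 2 'podDensity' bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME.yaml
```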
Make sure control plane and load balancer nodes aren't in maintenance mode
Make sure that control plane and load balancer nodes aren't in maintenance mode before starting an upgrade. If any node is in maintenance mode, the upgrade pauses to make sure that the control plane and load balancer node pools are sufficiently available.
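As a quick check before you start, the following command lists nodes that are cordoned or carry taints; exact maintenance-mode taint keys can vary by version, so review the output rather than matching a specific key:

```bash
# List nodes with their cordon status and taint keys.
kubectl get nodes -o custom-columns='NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable,TAINTS:.spec.taints[*].key'
```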
What's next
- Upgrade clusters
- Learn more about the lifecycle and stages of upgrades
- Troubleshoot cluster upgrade issues