Upgrading GKE on VMware

This page explains how to upgrade GKE on VMware.

Overview of the upgrade process

You can upgrade directly to any version that is in the same minor release or the next minor release. For example, you can upgrade from 1.11.0 to 1.11.1, or from 1.10.1 to 1.11.0.

If you are upgrading to a version that is not part of the next minor release, you must upgrade through one version of each minor release between your current version and your desired version. For example, if you are upgrading from version 1.9.2 to version 1.11.0, it is not possible to upgrade directly. You must first upgrade from version 1.9.2 to version 1.10.x, and then upgrade to version 1.11.0.

This topic discusses how to upgrade from version 1.10.x or 1.11.x to version 1.11.y.

Here is the general workflow for upgrading.

  1. Upgrade your admin workstation.

  2. From your admin workstation, install the bundle that you will use to upgrade the clusters.

  3. From your admin workstation, upgrade your user clusters.

  4. After all of the user clusters have been upgraded, you can upgrade your admin cluster from the admin workstation. This step is optional; you only need to upgrade the admin cluster if you require features that are available in the newer version.

Prepare for upgrade

Before you created your admin workstation, you filled in an admin workstation configuration file that was generated by gkeadm create config. The default name for this file is admin-ws-config.yaml.

In addition, your workstation has an information file. The default name of this file is the same as the name of your current admin workstation.

Locate your admin workstation configuration file and your information file. You need them to do the upgrade steps. If these files are in your current directory and have their default names, you don't need to specify them when you run the upgrade commands. If these files are in another directory, or if you have renamed them, specify them by using the --config and --info-file flags.
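
For example, assuming the default configuration file name, you can confirm that both files are present in your current directory before you start (replace ADMIN_WS_NAME with the name of your admin workstation):

ls -l admin-ws-config.yaml ADMIN_WS_NAME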

If your output information file is missing, you can re-create it. See Re-create an information file if missing.

You should also plan your strategy for any downtime that the upgrade will cause. See Downtime for upgrades.

Upgrade your admin workstation

  1. Run this command to download the specified gkeadm version:

    gkeadm upgrade gkeadm --target-version TARGET_VERSION
    

    Replace TARGET_VERSION with the version you want to download.

  2. Run this command to complete the upgrade:

    gkeadm upgrade admin-workstation --config AW_CONFIG_FILE --info-file INFO_FILE
    

    Replace the following:

    • AW_CONFIG_FILE: the path of your admin workstation configuration file. You can omit this flag if the file is in your current directory and has the name admin-ws-config.yaml.

    • INFO_FILE: the path of your information file. You can omit this flag if the file is in your current directory. The default name of this file is the same as the name of your admin workstation.
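
For example, assuming the files have their default names in your current directory, and assuming a hypothetical target version, the two commands might look like this:

# 1.11.0-gke.543 is a hypothetical version; substitute the version you want.
gkeadm upgrade gkeadm --target-version 1.11.0-gke.543
gkeadm upgrade admin-workstation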

The gkeadm upgrade admin-workstation command performs the following tasks:

  • Backs up all files in the home directory of your current admin workstation. These include:

    • Your admin cluster configuration file. The default name is admin-cluster.yaml.
    • Your user cluster configuration file. The default name is user-cluster.yaml.
    • The kubeconfig files for your admin cluster and your user clusters.
    • The root certificate for your vCenter server. Note that this file must have owner read and owner write permission.
    • The JSON key file for your component access service account. Note that this file must have owner read and owner write permission.
    • The JSON key files for your connect-register and logging-monitoring service accounts.
  • Creates a new admin workstation, and copies all the backed-up files to the new admin workstation.

  • Deletes the old admin workstation.

Verify that enough IP addresses are available

Before you upgrade your clusters, make sure that you have allocated enough IP addresses. You can set aside additional addresses as needed, whether you use DHCP or static IP addresses. If you have more than one node pool, you should also set aside an extra IP address for each node pool to facilitate the upgrade process.

See Manage node IP addresses to calculate how many IP addresses you need.
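
As a rough sanity check, you can count the nodes that a user cluster currently uses and compare that with the addresses you have set aside. This is a minimal sketch; it assumes USER_CLUSTER_KUBECONFIG is the path of that cluster's kubeconfig file:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get nodes --no-headers | wc -l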

Check bundle for upgrade

If you have upgraded your admin workstation, a bundle with the corresponding version for upgrading your clusters is already on the workstation.

If you want to use a different version from the admin workstation version, then you must manually install the corresponding bundle. See Install bundle for upgrade.

Upgrade a user cluster

Command line

Take note of the following before you proceed with the upgrade:

  • The gkectl upgrade command runs preflight checks. If the preflight checks fail, the command is blocked. You must fix the failures, or use the flag --skip-preflight-check-blocking with the command to unblock it. You should only skip the preflight checks if you are confident there are no failures.

  • As of version 1.10, GKE on VMware includes the konnectivityServerNodePort field for the manual load balancer. Before you upgrade, specify an appropriate value for this node port in the user cluster configuration file and configure your load balancer to use it. See Manual load balancing.
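
For example, if you use a manual load balancer, you can quickly check whether the node port is already present in your user cluster configuration file. This is a minimal sketch that assumes the default file name user-cluster.yaml:

grep -n 'konnectivityServerNodePort' user-cluster.yaml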

Proceed with these steps on your admin workstation:

  1. Make sure the bundlePath field in the admin cluster configuration file matches the path of the bundle to which you want to upgrade.

  2. Make sure the gkeOnPremVersion field in the user cluster configuration file matches the version to which you want to upgrade.

    If you make any other changes to the fields in the admin cluster configuration file or the user cluster configuration file, those changes are ignored during the upgrade. To make those changes take effect, first upgrade the cluster, and then run an update command with the changed configuration file.

  3. Run gkectl prepare to import OS images to vSphere:

    gkectl prepare \
     --bundle-path /var/lib/gke/bundles/gke-onprem-vsphere-TARGET_VERSION.tgz \
     --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    

    Replace the following:

    • TARGET_VERSION: the version to which you are upgrading.
    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster's kubeconfig file.
  4. Run the pre-upgrade tool to check the cluster health and configuration.

  5. Upgrade with the following command.

    gkectl upgrade cluster \
     --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
     --config USER_CLUSTER_CONFIG_FILE \
     FLAGS
    

    Replace the following:

    • USER_CLUSTER_CONFIG_FILE: the GKE on VMware user cluster configuration file on your new admin workstation.

    • FLAGS: an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking your vSphere infrastructure.
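
    For example, a sketch of the upgrade command with hypothetical file names, including the optional flag to skip the vSphere infrastructure check, might look like this:

    # "kubeconfig" and "user-cluster.yaml" are hypothetical paths; substitute your own.
    gkectl upgrade cluster \
        --kubeconfig kubeconfig \
        --config user-cluster.yaml \
        --skip-validation-infra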

Console

  1. On your admin workstation, run the following command:

    gkectl prepare \
        --bundle-path /var/lib/gke/bundles/gke-onprem-vsphere-TARGET_VERSION.tgz \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --upgrade-platform
    

    Replace the following:

    • TARGET_VERSION: the version to which you are upgrading.

    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster's kubeconfig file.

    This command upgrades the user cluster controller and the role-based access control (RBAC) policies that let the Google Cloud console manage the user cluster.

  2. Run the pre-upgrade tool to check the cluster health and configuration.

  3. In the Google Cloud console, go to the GKE Enterprise clusters page.

  4. Select the Google Cloud project that the user cluster is in.

  5. In the list of clusters, click the cluster that you want to upgrade.

  6. In the Details panel, if the Type is Anthos (VMware), do the following steps to upgrade the cluster using the Google Cloud console:

    1. In the Details panel, click More details.

    2. In the Cluster basics section, click Upgrade.

    3. In the GKE on VMware version list, select the version that you want to upgrade to.

    4. Click Upgrade.

    If the Type is external, this indicates that the cluster was created using gkectl. In this case, follow the steps in the Command line tab to upgrade the cluster.

Resume an upgrade

If a user cluster upgrade is interrupted, you can resume the user cluster upgrade by running the same upgrade command with the --skip-validation-all flag:

gkectl upgrade cluster \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --config USER_CLUSTER_CONFIG_FILE \
    --skip-validation-all

Upgrade the admin cluster

Do the steps in this section on your new admin workstation. Make sure your gkectl and clusters are the appropriate version for an upgrade, and that you have downloaded the appropriate bundle.

  1. Run the pre-upgrade tool to check the cluster health and configuration.

  2. Make sure the bundlePath field in the admin cluster configuration file matches the path of the bundle to which you want to upgrade.

    If you make any other changes to the fields in the admin cluster configuration file, those changes are ignored during the upgrade. To make those changes take effect, first upgrade the cluster, and then run an update command with the changed configuration file.

  3. Run the following command:

    gkectl upgrade admin \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --config ADMIN_CLUSTER_CONFIG_FILE \
        FLAGS
    

    Replace the following:

    • ADMIN_CLUSTER_KUBECONFIG: the admin cluster's kubeconfig file.

    • ADMIN_CLUSTER_CONFIG_FILE: the GKE on VMware admin cluster configuration file on your new admin workstation.

    • FLAGS: an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.

  4. After the admin cluster upgrade finishes, run the following command to determine if the component-access-sa-key field in the admin-cluster-creds secret has been wiped out:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds | grep 'component-access-sa-key'

    If the output is empty, run the next command to add the component-access-sa-key back:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -ojson | jq --arg casa "$(cat COMPONENT_ACCESS_SERVICE_ACCOUNT_KEY_PATH | base64 -w 0)" '.data["component-access-sa-key"]=$casa' | kubectl --kubeconfig ADMIN_KUBECONFIG apply -f -
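
    To confirm that the key is back, you can check that the field is now non-empty; a minimal sketch:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o jsonpath='{.data.component-access-sa-key}' | wc -c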

If you downloaded a full bundle, and you have successfully run the gkectl prepare and gkectl upgrade admin commands, you should now delete the full bundle to save disk space on the admin workstation. Use this command:

rm /var/lib/gke/bundles/gke-onprem-vsphere-${TARGET_VERSION}-full.tgz

Resuming an admin cluster upgrade

If an admin cluster upgrade is interrupted or fails, you can resume it if the admin cluster checkpoint contains the state required to restore the cluster to the point before the interruption.

Follow these steps:

  1. Check if the admin control plane is healthy before you begin the initial upgrade attempt. See Diagnosing cluster issues. As discussed in that topic, run the gkectl diagnose cluster command for the admin cluster.

  2. If the admin control plane is unhealthy prior to the initial upgrade attempt, repair the admin control plane with the gkectl repair admin-master command.

  3. When you rerun the upgrade command after an upgrade has been interrupted or has failed, use the same bundle and target version as you did in the previous upgrade attempt.

When you rerun the upgrade command, the resumed upgrade recreates the state in the kind cluster from the checkpoint and reruns the entire upgrade. If the admin control plane is unhealthy, it will first be restored before proceeding to upgrade again.

The upgrade will resume from the point where it failed or exited if the admin cluster checkpoint is available. If the checkpoint is unavailable, the upgrade will fall back to relying on the admin control plane, and therefore the admin control plane must be healthy in order to proceed with the upgrade. After a successful upgrade, the checkpoint is regenerated.

If gkectl exits unexpectedly during an admin cluster upgrade, the kind cluster is not cleaned up. Before you rerun the upgrade command to resume the upgrade, delete the kind cluster:

docker stop gkectl-control-plane && docker rm gkectl-control-plane

After deleting the kind cluster, rerun the upgrade command.

Roll back an admin workstation after an upgrade

You can roll back the admin workstation to the version used before the upgrade.

During the upgrade, gkeadm records the version in use before the upgrade in the output information file. During the rollback, gkeadm uses that recorded version to download the corresponding older version of gkeadm.

To roll back your admin workstation to the previous version:

gkeadm rollback admin-workstation --config=AW_CONFIG_FILE

You can omit --config=AW_CONFIG_FILE if your admin workstation configuration file is the default admin-ws-config.yaml. Otherwise, replace AW_CONFIG_FILE with the path to the admin workstation configuration file.

The rollback command performs these steps:

  1. Downloads the rollback version of gkeadm.
  2. Backs up the home directory of the current admin workstation.
  3. Creates a new admin workstation using the rollback version of gkeadm.
  4. Deletes the original admin workstation.

Install bundle with a different version for upgrade

If you upgrade your workstation, a bundle with a corresponding version is installed there for upgrading your clusters. If you want a different version, follow these steps to install a bundle for TARGET_VERSION, which is the version to which you want to upgrade.

  1. To check the current gkectl and cluster versions, run this command. Use the flag --details/-d for more detailed information.

    gkectl version --kubeconfig ADMIN_CLUSTER_KUBECONFIG --details
    

    The output provides information about your cluster versions.

  2. Based on the output you get, look for the following issues, and fix them as needed.

    • If the current admin cluster version is more than one minor version lower than the TARGET_VERSION, upgrade all your clusters to be one minor version lower than the TARGET_VERSION.

    • If the gkectl version is lower than 1.10, and you want to upgrade to 1.11.x, you will have to perform multiple upgrades. Upgrade one minor version at a time, until you get to 1.10.x, and then proceed with the instructions in this topic.

    • If the gkectl version is lower than the TARGET_VERSION, upgrade the admin workstation to the TARGET_VERSION.

  3. When you have determined that your gkectl and cluster versions are appropriate for an upgrade, download the bundle.

    Check whether the bundle tarball already exists on the admin workstation.

    stat /var/lib/gke/bundles/gke-onprem-vsphere-TARGET_VERSION.tgz

    If the bundle is not on the admin workstation, download it.

    gsutil cp gs://gke-on-prem-release/gke-onprem-bundle/TARGET_VERSION/gke-onprem-vsphere-TARGET_VERSION.tgz /var/lib/gke/bundles/
    

  4. Install the bundle.

    gkectl prepare --bundle-path /var/lib/gke/bundles/gke-onprem-vsphere-TARGET_VERSION.tgz --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    

    Replace ADMIN_CLUSTER_KUBECONFIG with the path of your kubeconfig file. You can omit this flag if the file is in your current directory and has the name kubeconfig.

  5. List available cluster versions, and make sure the target version is included in the available user cluster versions.

    gkectl version --kubeconfig ADMIN_CLUSTER_KUBECONFIG --details

You can now create a user cluster at the target version, or upgrade a user cluster to the target version.

Troubleshooting the upgrade process

If you experience an issue when following the recommended upgrade process, follow these recommendations to resolve it. These suggestions assume that you have begun with a version 1.10.x setup and are proceeding through the recommended upgrade process.

See also: Troubleshooting cluster creation and upgrade

Troubleshooting a user cluster upgrade issue

Suppose you find an issue with the upgrade version when upgrading a user cluster. You determine from Google Support that the issue will be fixed in an upcoming patch release. You can proceed as follows:

  1. Continue using the current version for production.
  2. Test the patch release in a non-production cluster when it is released.
  3. Upgrade all production user clusters to the patch release version when you are confident.
  4. Upgrade the admin cluster to the patch release version.

Troubleshooting an admin cluster upgrade issue

If you encounter an issue when upgrading the admin cluster, you must contact Google Support to resolve the issue with the admin cluster.

In the meantime, with the new upgrade flow, you can still benefit from new user cluster features without being blocked by the admin cluster upgrade. This also lets you reduce how often you upgrade the admin cluster, if you want. Your upgrade process can proceed as follows:

  1. Upgrade production user clusters to 1.11.x.
  2. Keep the admin cluster at its earlier version and continue receiving security patches.
  3. Test the admin cluster upgrade from 1.10.x to 1.11.x in a test environment, and report any issues you find.
  4. If your issue is solved by a 1.11.x patch release, you can then upgrade the production admin cluster to that patch release.

Known issues for recent versions

The following known issues might affect upgrades.

See also: Known issues

Upgrading the admin workstation might fail if the data disk is nearly full

If you upgrade the admin workstation with the gkeadm upgrade admin-workstation command, the upgrade might fail if the data disk is nearly full, because the system attempts to back up the current admin workstation locally while upgrading to a new admin workstation. If you cannot clear sufficient space on the data disk, run gkeadm upgrade admin-workstation with the additional flag --backup-to-local=false to prevent making a local backup of the current admin workstation.
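
For example, to upgrade without making the local backup (a sketch that assumes the configuration and information files have their default names in your current directory):

gkeadm upgrade admin-workstation --backup-to-local=false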

Disruption for workloads with PodDisruptionBudgets

Currently, upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).

Nodes fail to complete their upgrade process

If you have PodDisruptionBudget objects configured that cannot allow any additional disruptions, node upgrades might fail to reach the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler so that the node can drain while still respecting the PodDisruptionBudget configuration.

To see all PodDisruptionBudget objects that do not allow any disruptions:

kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
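
If one of the listed PodDisruptionBudgets belongs to a Deployment, one way to let the node drain, as recommended above, is to temporarily add a replica. This is a minimal sketch; DEPLOYMENT_NAME and NAMESPACE are hypothetical placeholders:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG scale deployment DEPLOYMENT_NAME --namespace NAMESPACE --replicas=2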

Appendix

About VMware DRS rules enabled in version 1.1.0-gke.6

As of version 1.1.0-gke.6, GKE on VMware automatically creates VMware Distributed Resource Scheduler (DRS) anti-affinity rules for your user cluster's nodes, causing them to be spread across at least three physical hosts in your datacenter. This feature is automatically enabled for both new and existing clusters.

Before you upgrade, be sure that your vSphere environment meets the following conditions:

  • VMware DRS is enabled. VMware DRS requires the vSphere Enterprise Plus license edition. To learn how to enable DRS, see Enabling VMware DRS in a cluster.

  • The vSphere username provided in your credentials configuration file has the Host.Inventory.EditCluster permission.

  • There are at least three physical hosts available.

If your vSphere environment does not meet the preceding conditions, you can still upgrade, but for upgrading a user cluster from 1.3.x to 1.4.x, you need to disable anti-affinity groups. For more information, see this known issue in the GKE on VMware release notes.

About downtime during upgrades

  • Admin cluster: When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime.

  • User cluster control plane: Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade.

  • User cluster nodes: If an upgrade requires a change to user cluster nodes, GKE on VMware recreates the nodes in a rolling fashion and reschedules the Pods running on those nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules.
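
For example, a sketch of creating a simple PodDisruptionBudget with kubectl, assuming a hypothetical app label and a workload that runs more than one replica, so that a rolling node upgrade can evict one Pod at a time:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG create poddisruptionbudget APP_NAME-pdb --selector=app=APP_NAME --min-available=1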

Re-create an information file if missing

If the output information file for your admin workstation is missing, you must re-create this file so you can then proceed with the upgrade. This file was created when you initially created your workstation, and if you have since done an upgrade, it was updated with new information.

The output information file has this format:

Admin workstation version: GKEADM_VERSION
Created using gkeadm version: GKEADM_VERSION
VM name: ADMIN_WS_NAME
IP: ADMIN_WS_IP
SSH key used: FULL_PATH_TO_ADMIN_WS_SSH_KEY
To access your admin workstation:
ssh -i FULL_PATH_TO_ADMIN_WS_SSH_KEY ubuntu@ADMIN_WS_IP

Here is a sample output information file:

Admin workstation version: v1.10.3-gke.49
Created using gkeadm version: v1.10.3-gke.49
VM name: admin-ws-janedoe
IP: 172.16.91.21
SSH key used: /usr/local/google/home/janedoe/.ssh/gke-admin-workstation
Upgraded from (rollback version): v1.10.0-gke.194
To access your admin workstation:
ssh -i /usr/local/google/home/janedoe/.ssh/gke-admin-workstation ubuntu@172.16.91.21

Create the file in an editor, substituting your own values. Save the file, in the directory from which you run gkeadm, with a filename that is the same as the VM name. For example, if the VM name is admin-ws-janedoe, save the file as admin-ws-janedoe.