Upgrading GKE On-Prem

This topic explains how to upgrade GKE On-Prem.

To upgrade GKE On-Prem, you upgrade your admin workstation. Then, you upgrade your clusters.

Before you begin

Read through the following considerations:

About downtime during upgrades

Admin cluster

When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by the failure that caused the downtime.

User cluster control plane

Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade.

User cluster nodes

If an upgrade requires a change to user cluster nodes, GKE On-Prem recreates the nodes in a rolling fashion and reschedules the Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules.
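
For example, here is a minimal sketch of creating a PodDisruptionBudget for a hypothetical workload labeled app=my-app, assuming [USER_CLUSTER_KUBECONFIG] is the user cluster's kubeconfig file; the name, label, and threshold are illustrative only:

# Keep at least one app=my-app Pod available during node recreation.
kubectl create poddisruptionbudget my-app-pdb \
    --kubeconfig [USER_CLUSTER_KUBECONFIG] \
    --selector=app=my-app \
    --min-available=1

With a budget like this in place, the rolling node recreation waits so that at least one matching Pod stays available at all times.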

Sequential upgrading

GKE On-Prem supports sequential upgrading. To upgrade a cluster to a new version, the cluster must already be running the version that immediately precedes the new version.

You cannot upgrade your clusters directly to the latest version from a version that is more than one version behind. If your cluster is more than one version behind, you need to upgrade it through each intermediate version in order.

Example

Suppose that the following versions are available, and suppose that your admin workstation and clusters are running the oldest version:

  • 1.0.1 (oldest version)
  • 1.0.2
  • 1.1 (latest version)

In this case, 1.1 is the latest version. To upgrade from 1.0.1 to 1.1, you'd follow these steps:

  1. Upgrade your admin workstation from 1.0.1 to 1.0.2.
  2. Upgrade your clusters from 1.0.1 to 1.0.2.
  3. Upgrade your admin workstation from 1.0.2 to 1.1.
  4. Upgrade your clusters from 1.0.2 to 1.1.

Back up your GKE On-Prem configuration file and kubeconfig files

When you upgrade your admin workstation, Terraform deletes the admin workstation VM and replaces it with an upgraded admin workstation. Before you upgrade your admin workstation, you need to back up your GKE On-Prem configuration file and your clusters' kubeconfig files. Later, you copy the files to your upgraded admin workstation.
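
For example, assuming your configuration file is named config.yaml, your kubeconfig files sit in the home directory of the admin workstation, and you connect as the ubuntu user with the vsphere_workstation key (these file names are illustrative; substitute your own), you could back up the files to your local machine like this:

# Run from your local machine; adjust the file names and the workstation IP address.
mkdir -p ~/gke-on-prem-backup
scp -i ~/.ssh/vsphere_workstation \
    "ubuntu@[IP_ADDRESS]:~/config.yaml" \
    "ubuntu@[IP_ADDRESS]:~/kubeconfig*" \
    ~/gke-on-prem-backup/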

Upgrading admin workstation

When you upgrade your admin workstation, the upgraded workstation includes the following components at the same version as the admin workstation's Open Virtualization Appliance (OVA) file:

  • gkectl
  • full bundle

After you upgrade your admin workstation, you upgrade your clusters.

Downloading the OVA

From Downloads, download the admin workstation OVA file for the version to which you're upgrading.

To download the latest OVA, run the following command:

gsutil cp gs://gke-on-prem-release/admin-appliance/1.1.2-gke.0/gke-on-prem-admin-appliance-vsphere-1.1.2-gke.0.{ova,ova.sig} ~/

Importing the OVA to vSphere and marking it as a VM template

In the following sections, you:

  1. Create some variables declaring elements of your vCenter Server and vSphere environment.
  2. Import the admin workstation OVA to vSphere and mark it as a VM template.

Creating variables for govc

Before you import the admin workstation OVA to vSphere, you need to provide govc with environment variables that declare elements of your vCenter Server and vSphere environment:

export GOVC_URL=https://[VCENTER_SERVER_ADDRESS]/sdk
export GOVC_USERNAME=[VCENTER_SERVER_USERNAME]
export GOVC_PASSWORD=[VCENTER_SERVER_PASSWORD]
export GOVC_DATASTORE=[VSPHERE_DATASTORE]
export GOVC_DATACENTER=[VSPHERE_DATACENTER]
export GOVC_INSECURE=true

You can choose to use vSphere's default resource pool or create your own:

# If you want to use a resource pool you've configured yourself, export this variable:
export GOVC_RESOURCE_POOL=[VSPHERE_CLUSTER]/Resources/[VSPHERE_RESOURCE_POOL]
# If you want to use vSphere's default resource pool, export this variable instead:
export GOVC_RESOURCE_POOL=[VSPHERE_CLUSTER]/Resources

where:

  • [VCENTER_SERVER_ADDRESS] is your vCenter Server's IP address or hostname.
  • [VCENTER_SERVER_USERNAME] is the username of an account that holds the Administrator role or equivalent privileges in vCenter Server.
  • [VCENTER_SERVER_PASSWORD] is the vCenter Server account's password.
  • [VSPHERE_DATASTORE] is the name of the datastore you've configured in your vSphere environment.
  • [VSPHERE_DATACENTER] is the name of the datacenter you've configured in your vSphere environment.
  • [VSPHERE_CLUSTER] is the name of the cluster you've configured in your vSphere environment.
  • [VSPHERE_RESOURCE_POOL], if you're using a non-default resource pool, is the name of the resource pool you've configured in your vSphere environment.
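
For example, with purely hypothetical values, the exports might look like this:

export GOVC_URL=https://vcenter.example.local/sdk
export GOVC_USERNAME=administrator@vsphere.local
export GOVC_PASSWORD='example-password'
export GOVC_DATASTORE=my-datastore
export GOVC_DATACENTER=my-datacenter
export GOVC_INSECURE=true
export GOVC_RESOURCE_POOL=my-cluster/Resources/my-resource-pool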

Importing the OVA to vSphere: Standard switch

If you are using a vSphere Standard Switch, import the OVA to vSphere using this command:

govc import.ova -options - ~/gke-on-prem-admin-appliance-vsphere-1.1.2-gke.0.ova <<EOF
{
  "DiskProvisioning": "thin",
  "MarkAsTemplate": true
}
EOF

Importing the OVA to vSphere: Distributed switch

If you are using a vSphere Distributed Switch, import the OVA to vSphere using this command, where [YOUR_DISTRIBUTED_PORT_GROUP_NAME] is the name of your distributed port group:

govc import.ova -options - ~/gke-on-prem-admin-appliance-vsphere-1.1.2-gke.0.ova <<EOF
{
  "DiskProvisioning": "thin",
  "MarkAsTemplate": true,
  "NetworkMapping": [
      {
          "Name": "VM Network",
          "Network": "[YOUR_DISTRIBUTED_PORT_GROUP_NAME]"
      }
  ]
}
EOF

Setting the Terraform template variable for the new admin workstation VM

In your admin workstation's TFVARS file, set vm_template to the version to which you're upgrading. The value of vm_template looks like this, where [VERSION] is the OVA's version:

gke-on-prem-admin-appliance-vsphere-[VERSION]
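
For example, if you're upgrading to the 1.1.2-gke.0 OVA downloaded earlier, the relevant line in your TFVARS file would look something like this (the other variables in the file are omitted here):

vm_template = "gke-on-prem-admin-appliance-vsphere-1.1.2-gke.0"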

Using Terraform to upgrade your admin workstation

To upgrade your admin workstation, run the following command. This command deletes the current admin workstation VM and replaces it with an upgraded VM:

terraform init && terraform apply -auto-approve -input=false

Connecting to your admin workstation

  1. SSH in to your admin workstation:

    ssh -i ~/.ssh/vsphere_workstation ubuntu@[IP_ADDRESS]
    
  2. If you are using a proxy, you need to configure Google Cloud CLI for the proxy, so that you can run gcloud and gsutil commands. For instructions, see Configuring gcloud CLI for use behind a proxy/firewall.

  3. Log in to Google Cloud using your account credentials:

    gcloud auth login
  4. Register gcloud as a Docker credential helper. (Read more about this command):

    gcloud auth configure-docker
  5. Create a private key for your allowlisted service account.

    Copy the service account's email address:

    gcloud iam service-accounts list

    Create the service account's private key, where [KEY_FILE] is a name you choose for the key file. If you provide a file name without a path, the command saves the file to the current working directory:

    gcloud iam service-accounts keys create [KEY_FILE] \
    --project [PROJECT_ID] --iam-account [ALLOWLISTED_SERVICE_ACCOUNT_EMAIL]

    where:

    • [PROJECT_ID] is your project ID.
    • [KEY_FILE] is a name and path in which to save the service account's private key, such as /home/ubuntu/key.json.
    • [ALLOWLISTED_SERVICE_ACCOUNT_EMAIL] is the allowlisted service account's email address.
  6. Activate your allowlisted service account:

    gcloud auth activate-service-account --project [PROJECT_ID] \
    --key-file [KEY_FILE]
    

Copy your backed up configuration and kubeconfig files

Earlier, you backed up your GKE On-Prem configuration file and your clusters' kubeconfig files. Now, copy those files to your upgraded admin workstation.
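
For example, continuing the hypothetical backup from earlier (file names and paths are illustrative), you could copy the files from your local machine to the upgraded workstation like this:

# Run from your local machine; adjust the file names and the new workstation's IP address.
scp -i ~/.ssh/vsphere_workstation \
    ~/gke-on-prem-backup/config.yaml \
    ~/gke-on-prem-backup/kubeconfig* \
    ubuntu@[IP_ADDRESS]:~/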

Upgrading clusters

After upgrading your admin workstation and connecting to it, perform the following steps:

Verify that enough IP addresses are available

Before you upgrade, be sure that you have enough IP addresses available for your clusters.

DHCP

If the cluster has its IP addresses assigned by a DHCP server, check that the DHCP server in the network in which the nodes are created has enough IP addresses. There should be more IP addresses than there are nodes running in the user cluster.

Static IPs

If the cluster has static IP addresses, check that you've allocated enough IP addresses in the cluster:

kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
-n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME] -o yaml

where:

  • [ADMIN_CLUSTER_KUBECONFIG] tells kubectl to use the admin cluster's kubeconfig, which is used to view and/or change user cluster configurations.
  • -n [USER_CLUSTER_NAME] tells kubectl to look in a namespace named after the user cluster.
  • [USER_CLUSTER_NAME] -o yaml tells kubectl which user cluster you're running the command against. -o yaml displays the user cluster's configuration.

In the command's output, look for the reservedAddresses field. There should be more IP addresses in the field than there are nodes running in the user cluster.
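
As a rough sketch of comparing the two counts, you could count the reserved address blocks and the user cluster's nodes like this, assuming each block contains exactly one hostname field and that [USER_CLUSTER_KUBECONFIG] is the user cluster's kubeconfig file:

# Number of reserved address blocks
kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME] -o yaml | grep -c "hostname:"

# Number of nodes in the user cluster
kubectl get nodes --kubeconfig [USER_CLUSTER_KUBECONFIG] --no-headers | wc -l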

If you need to add more addresses to the reservedAddresses field, perform the following steps:

  1. Open the user cluster's configuration file for editing:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] edit cluster [USER_CLUSTER_NAME] \
    -n [USER_CLUSTER_NAME] --validate=false
    

    The cluster configuration is opened in your shell's default editor.

  2. Add as many additional static IP blocks as required. An IP block is composed of gateway, hostname, ip, and netmask fields.

Below is an example reservedAddresses field with four static IP blocks:

...
networkSpec:
  dns:
  - 172.x.x.x
  ntp: 129.x.x.x
  reservedAddresses:
  - gateway: 100.x.x.x
    hostname: host-1
    ip: 100.x.x.x
    netmask: x
  - gateway: 100.x.x.x
    hostname: host-2
    ip: 100.x.x.x
    netmask: x
  - gateway: 100.x.x.x
    hostname: host-3
    ip: 100.x.x.x
    netmask: x
  - gateway: 100.x.x.x
    hostname: host-4
    ip: 100.x.x.x
    netmask: x
...

Modifying the configuration file

On your admin workstation VM, edit your configuration file. Set the value of bundlepath, where [VERSION] is the GKE On-Prem version to which you're upgrading your clusters:

bundlepath: /var/lib/gke/bundles/gke-onprem-vsphere-[VERSION].tgz

About automatically-enabled features

A new GKE On-Prem version might include new features or support for specific VMware vSphere features. Sometimes, upgrading to a GKE On-Prem version automatically enables such features. You learn about new features in GKE On-Prem's Release notes. New features are sometimes surfaced in the GKE On-Prem configuration file.

Disabling new features via the configuration file

If you need to disable a new feature that is automatically enabled in a new GKE On-Prem version and driven by the configuration file, perform the following steps before you upgrade your cluster:

  1. From your upgraded admin workstation, create a new configuration file with a different name from your current configuration file:

    gkectl create-config --config [CONFIG_NAME]
    
  2. Open the new configuration file and locate the feature's field. Note the field's name and default value, then close the file.

  3. Open your current configuration file and add the new feature's field in the appropriate specification.

  4. Set the field to false or an equivalent disabling value.

  5. Save the configuration file. Proceed with upgrading your clusters.

You should always review the Release notes before you upgrade your clusters. You cannot declaratively change an existing cluster's configuration after you upgrade it.

Running gkectl prepare

Run the following command:

gkectl prepare --config [CONFIG_FILE]

The gkectl prepare command performs the following tasks:

  • If necessary, copies a new node OS image to your vSphere environment and marks the OS image as a template.

  • If you have configured a private Docker registry, pushes the updated Docker images specified in the new bundle to that registry.

Upgrading your admin cluster

Run the following command:

gkectl upgrade admin \
--kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
--config [CONFIG_FILE]

where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file, and [CONFIG_FILE] is the GKE On-Prem configuration file you're using to perform the upgrade.

Upgrading your user cluster

To upgrade a user cluster, your admin cluster must be running a version that is greater than or equal to the user cluster's target version. If your admin cluster's version is lower, upgrade your admin cluster before you upgrade your user cluster.

gkectl

From your admin workstation, run the following command:

gkectl upgrade cluster \
--kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
--config [CONFIG_FILE] \
--cluster-name [CLUSTER_NAME]

where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file, [CLUSTER_NAME] is the name of the user cluster you're upgrading, and [CONFIG_FILE] is the GKE On-Prem configuration file you're using to perform the upgrade.

Console

You can choose to register your user clusters with Google Cloud console during installation or after you've created them. You can view and log in to your registered GKE On-Prem clusters and your Google Kubernetes Engine clusters from Google Cloud console's GKE menu.

When an upgrade becomes available for GKE On-Prem user clusters, a notification appears in Google Cloud console. Clicking this notification displays a list of available versions and a gkectl command you can run to upgrade the cluster:

  1. Visit the GKE menu in Google Cloud console.

    Visit the GKE On-Prem menu

  2. Under the Notifications column for the user cluster, click Upgrade available, if available.

  3. Copy the gkectl upgrade cluster command.

  4. From your admin workstation, run the gkectl upgrade cluster command, where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file, [CLUSTER_NAME] is the name of the user cluster you're upgrading, and [CONFIG_FILE] is the GKE On-Prem configuration file you're using to perform the upgrade.

Resuming an upgrade

If a user cluster upgrade is interrupted, but the admin cluster was upgraded successfully, you can resume the upgrade by running gkectl upgrade cluster again with the same GKE On-Prem configuration file and admin cluster kubeconfig.
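
In other words, re-run the original upgrade command with the same arguments:

gkectl upgrade cluster \
--kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
--config [CONFIG_FILE] \
--cluster-name [CLUSTER_NAME]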

About resuming an admin cluster upgrade

You shouldn't interrupt an admin cluster upgrade. Currently, admin cluster upgrades aren't always resumable. If an admin cluster upgrade is interrupted for any reason, contact support for assistance.

Known issues

The following known issues affect upgrading clusters.

Disruption for workloads with PodDisruptionBudgets

Currently, upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).

Version 1.1.1-gke.2: Data disk in vSAN datastore folder can be deleted

If you're using a vSAN datastore, you need to create a folder in which to save the VMDK. Currently, a known issue requires that you provide the folder's universally unique identifier (UUID) path, rather than its file path, to vcenter.datadisk. This mismatch can cause upgrades to fail.

A fix is targeted for version 1.1.2. Before you upgrade, perform these steps for the admin control plane node as a workaround:

  1. From the vCenter interface, get the UUID of the folder in your vSAN data store.
  2. List the Machine resources in your clusters. These Machines correspond to the nodes in the clusters:

    kubectl get machines --all-namespaces
  3. For the admin control plane's Machine (gke-admin-master), open its configuration for editing:

    kubectl edit machine [MACHINE_NAME]
    
  4. Change the spec.providerSpec.value.machineVariables.data_disk_path field. Replace the path to the VMDK file with the UUID. For example:

    spec:
      providerSpec:
        value:
          apiVersion: vsphereproviderconfig.k8s.io/v1alpha1
          kind: VsphereMachineProviderConfig
          machineVariables:
            data_disk_path: 14159b5d-4265-a2ba-386b-246e9690c588/my-disk.vmdk
            datacenter: datacenter
            datastore: datastore
  5. Save the file.

  6. Open your GKE On-Prem configuration file.

  7. From vcenter.datadisk, replace the folder in the file path with the folder's UUID. For example:

    vcenter:
      ...
      datadisk: "14159b5d-4265-a2ba-386b-246e9690c588/my-disk.vmdk"
    
  8. Proceed with upgrading your clusters.

Upgrading to version 1.1.0-gke.6 from version 1.0.2-gke.3: OIDC issue

Version 1.0.11, 1.0.1-gke.5, and 1.0.2-gke.3 clusters that have OpenID Connect (OIDC) configured cannot be upgraded to version 1.1.0-gke.6. This issue is fixed in version 1.1.1-gke.2.

If you configured a version 1.0.11, 1.0.1-gke.5, or 1.0.2-gke.3 cluster with OIDC during installation, you are not able to upgrade it. Instead, you should create new clusters.

Upgrading to version 1.0.2-gke.3 from version 1.0.11

Version 1.0.2-gke.3 introduces the following OIDC fields (usercluster.oidc). These fields enable logging in to a cluster from Google Cloud console:

  • usercluster.oidc.kubectlredirecturl
  • usercluster.oidc.clientsecret
  • usercluster.oidc.usehttpproxy

If you use OIDC, the clientsecret field is required even if you don't intend to log in to a cluster from Google Cloud console. In that case, you can provide a placeholder value for clientsecret:

oidc:
  clientsecret: "secret"

Appendix

About VMware DRS rules enabled in version 1.1.0-gke.6

As of version 1.1.0-gke.6, GKE On-Prem automatically creates VMware Distributed Resource Scheduler (DRS) anti-affinity rules for your user cluster's nodes, causing them to be spread across at least three physical hosts in your datacenter. This feature is automatically enabled for both new and existing clusters.

Before you upgrade, be sure that your vSphere environment can support this feature: VMware DRS must be enabled, and your datacenter must have enough physical hosts (at least three) across which the user cluster nodes can be spread. If it can't, disable the feature before upgrading, as described in the next section.
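
As a quick sketch, one way to see how many physical hosts your vSphere cluster contains is to list its HostSystem objects with govc, assuming the govc variables exported earlier are still set; each line of output corresponds to one host:

govc find "/[VSPHERE_DATACENTER]/host/[VSPHERE_CLUSTER]" -type h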

Disabling VMware DRS before upgrading to 1.1.0-gke.6

If you do not want to enable this feature for your existing user clusters—for example, if you don't have enough hosts to accommodate the feature—perform the following steps before you upgrade your user clusters:

  1. Open your existing GKE On-Prem configuration file.
  2. Under the usercluster specification, add the antiaffinitygroups field as described in the antiaffinitygroups documentation:

    usercluster:
      ...
      antiaffinitygroups:
        enabled: false

  3. Save the file.
  4. Use the configuration file to upgrade. Your clusters are upgraded, but the feature is not enabled.

Alternate upgrade scenario

This topic describes the simplest way to upgrade GKE On-Prem. The following alternate scenario upgrades only gkectl and your clusters, without upgrading the admin workstation:

Scenario: The release has no security updates for the admin workstation.

Steps:
  1. Download gkectl.
  2. Download the bundle.
  3. Follow the instructions on this page.

Troubleshooting

For more information, refer to Troubleshooting.

New nodes created but not healthy

Symptoms

New nodes don't register themselves with the user cluster control plane when you use manual load balancing mode.

Possible causes

In-node Ingress validation might be enabled, which blocks the nodes' boot-up process.

Resolution

To disable the validation, run:

kubectl patch machinedeployment [MACHINE_DEPLOYMENT_NAME] -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"machineVariables":{"net_validation_ports": null}}}}}}}' --type=merge
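
To confirm the change took effect, you can inspect the MachineDeployment again; this quick check (a sketch, not an official verification step) prints nothing once net_validation_ports has been cleared:

kubectl get machinedeployment [MACHINE_DEPLOYMENT_NAME] -o yaml | grep net_validation_ports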

Diagnosing cluster issues using gkectl

Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.

Default logging behavior

For gkectl and gkeadm, it is sufficient to use the default logging settings:

  • By default, log entries are saved as follows:

    • For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.
    • For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.
  • All log entries are saved in the log file, even if they are not printed in the terminal (when --alsologtostderr is false).
  • The -v5 verbosity level (default) covers all the log entries needed by the support team.
  • The log file also contains the command executed and the failure message.

We recommend that you send the log file to the support team when you need help.

Specifying a non-default location for the log file

To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify will not be symlinked with the local directory.
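
For example, a hypothetical invocation that writes the gkectl log to a custom path might look like this (the path is illustrative):

gkectl upgrade cluster \
--kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
--config [CONFIG_FILE] \
--cluster-name [CLUSTER_NAME] \
--log_file /home/ubuntu/logs/gkectl-upgrade.log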

To specify a non-default location for the gkeadm log file, use the --log_file flag.

Locating Cluster API logs in the admin cluster

If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:

  1. Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
  2. Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
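
    For example, to surface only lines that mention errors (a rough filter; adjust the pattern as needed):

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager | grep -i error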