This page explains how to upgrade GKE on-prem.
Target versions
Starting with GKE on-prem version 1.3.2, you can upgrade directly to any version that is in the same minor release or the next minor release. For example, you can upgrade from 1.3.2 to 1.3.5, or from 1.5.2 to 1.6.1.
If your current version is lower than 1.3.2, then you must do sequential upgrades to reach version 1.3.2 first. For example, to upgrade from 1.3.0 to 1.3.2, you must first upgrade from 1.3.0 to 1.3.1, and then from 1.3.1 to 1.3.2.
If you are upgrading from version 1.3.2 or later to a version that is not part of the next minor release, you must upgrade through one version of each minor release between your current version and your desired version. For example, if you are upgrading from version 1.3.2 to version 1.6.1, it is not possible to upgrade directly. You must first upgrade from version 1.3.2 to version 1.4.x, where x represents any patch release under that minor release. You can then upgrade to version 1.5.x, and finally to version 1.6.1.
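The version rules above can be sketched as a small planning helper. This is only an illustration and not part of GKE on-prem: the names parse_version and plan_upgrade are hypothetical, and the sketch assumes a current version of 1.3.2 or later and an upgrade within one major version. Intermediate hops are reported as MINOR.x because any patch release of that minor release is acceptable.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', ignoring any suffix such as '-gke.6'."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def plan_upgrade(current: str, target: str) -> List[str]:
    """Return the ordered list of versions to upgrade through (hypothetical helper)."""
    cur = parse_version(current)
    tgt = parse_version(target)
    if cur < (1, 3, 2):
        # Below 1.3.2 the page requires sequential upgrades, which this sketch does not model.
        raise ValueError("upgrade sequentially to 1.3.2 first")
    if tgt[0] != cur[0]:
        raise ValueError("this sketch only covers upgrades within one major version")
    if tgt <= cur:
        raise ValueError("target version must be newer than the current version")
    # One hop per minor release strictly between current and target, then the target itself.
    hops = [f"{cur[0]}.{minor}.x" for minor in range(cur[1] + 1, tgt[1])]
    hops.append(target)
    return hops
```

For example, plan_upgrade("1.3.2", "1.6.1") returns ["1.4.x", "1.5.x", "1.6.1"], and plan_upgrade("1.5.2", "1.6.1") returns ["1.6.1"].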
Overview of the upgrade process
Download the gkeadm tool. The version of gkeadm must be the same as the target version of your upgrade.
Use gkeadm to upgrade your admin workstation.
From your admin workstation, upgrade your admin cluster.
From your admin workstation, upgrade your user clusters.
Upgrade policy
After you upgrade your admin cluster:
Any new user clusters that you create must have the same version as your admin cluster.
If you upgrade an existing user cluster, you must upgrade to the same version as your admin cluster.
Before you upgrade your admin cluster again, you must upgrade all of your user clusters to the same version as your current admin cluster.
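The last rule can be checked mechanically before you start another admin cluster upgrade. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the version strings are supplied by you (for example from your own inventory), and versions are compared as exact strings.

```python
from typing import Dict, List


def clusters_blocking_admin_upgrade(
    admin_version: str, user_cluster_versions: Dict[str, str]
) -> List[str]:
    """Return the names of user clusters that are not yet at the admin cluster's version."""
    return sorted(
        name
        for name, version in user_cluster_versions.items()
        if version != admin_version
    )
```

An empty result means every user cluster already matches the admin cluster, so the policy allows the next admin cluster upgrade.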
Locating your configuration and information files
When you created your current admin workstation, you filled in an admin workstation configuration file that was generated by gkeadm create config. The default name for this file is admin-ws-config.yaml.
When you created your current admin workstation, gkeadm created an information file for you. The default name of this file is the same as the name of your current admin workstation.
Locate your admin workstation configuration file and your information file. You need them to do the steps in this guide. If these files are in your current directory and they have their default names, then you won't need to specify them when you run gkeadm upgrade admin-workstation. If these files are in another directory, or if you have changed the filenames, then you specify them by using the --config and --info-file flags.
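The flag rules above can be sketched as a helper that assembles the command line. This is an illustrative sketch only: the function name is hypothetical, and it builds the argument list without running anything. The only flags it uses are --config and --info-file, as described above.

```python
from typing import List, Optional


def gkeadm_upgrade_command(
    config_file: Optional[str] = None, info_file: Optional[str] = None
) -> List[str]:
    """Build the argument list for gkeadm upgrade admin-workstation.

    Pass None for a file that is in the current directory under its default
    name; the corresponding flag is then omitted.
    """
    command = ["gkeadm", "upgrade", "admin-workstation"]
    if config_file is not None:
        command += ["--config", config_file]
    if info_file is not None:
        command += ["--info-file", info_file]
    return command
```

The returned list can be passed to a process runner of your choice, or joined with spaces to print the command for review.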
Upgrading your admin workstation
To upgrade your admin workstation, first download a new version of the gkeadm tool, and then use it to upgrade the configuration of your admin workstation. The version of gkeadm must match the target version of your upgrade.
Downloading gkeadm
To download the current version of gkeadm, follow the gkeadm download instructions on the GKE on-prem page.
Upgrading your admin workstation configuration
gkeadm upgrade admin-workstation --config [AW_CONFIG_FILE] --info-file [INFO_FILE]
where:
[AW_CONFIG_FILE] is the path of your admin workstation configuration file. You can omit this flag if the file is in your current directory and has the name admin-ws-config.yaml.
[INFO_FILE] is the path of your information file. You can omit this flag if the file is in your current directory. The default name of this file is the same as the name of your admin workstation.
The preceding command performs the following tasks:
Back up all files in the home directory of your current admin workstation. These include:
Your GKE on-prem configuration file. The default name of this file is config.yaml.
The kubeconfig files for your admin cluster and your user clusters.
The root certificate for your vCenter server. Note that this file must have owner read and owner write permission.
The JSON key file for your allowlisted service account. Note that this file must have owner read and owner write permission.
The JSON key files for your connect-register, connect-agent, and logging-monitoring service accounts.
Create a new admin workstation, and copy all the backed-up files to the new admin workstation.
Delete the old admin workstation.
Removing the old admin workstation from known_hosts
If your admin workstation has a static IP address, you need to remove your old admin workstation from the known_hosts file after upgrading your admin workstation.
To remove the old admin workstation from known_hosts:
ssh-keygen -R [ADMIN_WS_IP]
where [ADMIN_WS_IP] is the IP address of your admin workstation.
Setting the bundle path in your GKE on-prem configuration file
On your new admin workstation, open your GKE on-prem configuration file. Set the value of bundlepath to the path of the bundle file on your new admin workstation:
bundlepath: "/var/lib/gke/bundles/gke-onprem-vsphere-[VERSION]-full.tgz"
where [VERSION] is the target version of your upgrade.
Updating the node OS image and Docker images
On your new admin workstation, run the following command:
gkectl prepare --config [CONFIG_FILE] [FLAGS]
where:
[CONFIG_FILE] is the GKE on-prem configuration file on your new admin workstation.
[FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
The preceding command performs the following tasks:
If necessary, copy a new node OS image to your vSphere environment, and mark the OS image as a template.
If you have configured a private Docker registry, push updated Docker images to your private Docker registry.
Verify that enough IP addresses are available
Do the steps in this section on your new admin workstation.
Before you upgrade, be sure that you have enough IP addresses available for your clusters.
DHCP
During an upgrade, GKE on-prem creates one temporary node in the admin cluster and one temporary node in each associated user cluster. Make sure that your DHCP server can provide enough IP addresses for these temporary nodes. For more information, see IP addresses needed for admin and user clusters.
Static IPs
During an upgrade, GKE on-prem creates one temporary node in the admin cluster and one temporary node in each associated user cluster. For your admin cluster and each of your user clusters, verify that you have reserved enough IP addresses. For each cluster, you need to have reserved at least one more IP address than the number of cluster nodes. For more information, see Configuring static IP addresses.
Determine the number of nodes in your admin cluster:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes
where [ADMIN_CLUSTER_KUBECONFIG] is the path of your admin cluster's kubeconfig file.
Next, view the addresses reserved for your admin cluster:
kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -o yaml
In the output, in the reservedAddresses field, you can see the number of IP addresses that are reserved for the admin cluster nodes. For example, the following output shows that there are five IP addresses reserved for the admin cluster nodes:
...
reservedAddresses:
- gateway: 21.0.135.254
  hostname: admin-node-1
  ip: 21.0.133.41
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-2
  ip: 21.0.133.50
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-3
  ip: 21.0.133.56
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-4
  ip: 21.0.133.47
  netmask: 21
- gateway: 21.0.135.254
  hostname: admin-node-5
  ip: 21.0.133.44
  netmask: 21
The number of reserved IP addresses should be at least one more than the number of nodes in the admin cluster. If this is not the case, you can reserve an additional address by editing the Cluster object.
Open the Cluster object for editing:
kubectl edit cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG]
Under reservedAddresses, add an additional block that has gateway, hostname, ip, and netmask.
Go through the same procedure for each of your user clusters.
To determine the number of nodes in a user cluster:
kubectl --kubeconfig [USER_CLUSTER_KUBECONFIG] get nodes
where [USER_CLUSTER_KUBECONFIG] is the path of your user cluster's kubeconfig file.
To view the addresses reserved for a user cluster:
kubectl get cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME] -o yaml
where:
[ADMIN_CLUSTER_KUBECONFIG] is the path of your admin cluster's kubeconfig file.
[USER_CLUSTER_NAME] is the name of the user cluster.
To edit the Cluster object of a user cluster:
kubectl edit cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] [USER_CLUSTER_NAME]
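The static IP rule in this section (at least one more reserved address than the number of nodes) comes down to simple arithmetic. The following minimal sketch uses a hypothetical function name and assumes you have already counted the nodes and the reservedAddresses entries yourself from the command output shown above.

```python
def additional_addresses_needed(node_count: int, reserved_count: int) -> int:
    """Return how many more IP addresses to reserve so that one is free for the temporary upgrade node."""
    if node_count < 0 or reserved_count < 0:
        raise ValueError("counts must be non-negative")
    # The cluster needs node_count + 1 reserved addresses in total.
    return max(0, node_count + 1 - reserved_count)
```

With the five reserved addresses from the sample output, a cluster of four nodes needs no change, while a cluster of five nodes needs one more reservedAddresses block.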
(Optional) Disabling new vSphere features
A new GKE on-prem version might include new features or support for specific VMware vSphere features. Sometimes, upgrading to a GKE on-prem version automatically enables such features. You learn about new features in GKE on-prem's Release notes. New features are sometimes surfaced in the GKE on-prem configuration file.
If you need to disable a new feature that is automatically enabled in a new GKE on-prem version and driven by the configuration file, perform the following steps before you upgrade your cluster:
From your upgraded admin workstation, create a new configuration file with a different name from your current configuration file:
gkectl create-config --config [CONFIG_NAME]
Open the new configuration file and make a note of the feature's field. Close the file.
Open your current configuration file and add the new feature's field. Set the value of the field to false or equivalent.
Save the configuration file.
Review the Release notes before you upgrade your clusters. You cannot declaratively change an existing cluster's configuration after you upgrade it.
Upgrading your admin cluster
Do the steps in this section on your new admin workstation.
Recall that the target version of your upgrade must be the same as your gkeadm version.
Run the following command:
gkectl upgrade admin \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [ADMIN_CLUSTER_CONFIG_FILE] \
    [FLAGS]
where:
[ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
[ADMIN_CLUSTER_CONFIG_FILE] is the GKE on-prem admin cluster configuration file on your new admin workstation.
[FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
Upgrading a user cluster
Do the steps in this section on your new admin workstation.
Recall that the target version of your upgrade must be the same as your gkeadm version.
gkectl
gkectl upgrade cluster \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [USER_CLUSTER_CONFIG_FILE] \
    --cluster-name [CLUSTER_NAME] \
    [FLAGS]
where:
[ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
[CLUSTER_NAME] is the name of the user cluster you're upgrading.
[USER_CLUSTER_CONFIG_FILE] is the GKE on-prem user cluster configuration file on your new admin workstation.
[FLAGS] is an optional set of flags. For example, you could include the --skip-validation-infra flag to skip checking of your vSphere infrastructure.
Console
You can choose to register your user clusters with Google Cloud console during installation or after you've created them. You can view and log in to your registered GKE on-prem clusters and your Google Kubernetes Engine clusters from Google Cloud console's GKE menu.
When an upgrade becomes available for GKE on-prem user clusters, a notification appears in Google Cloud console. Clicking this notification displays a list of available versions and a gkectl command you can run to upgrade the cluster:
Visit the GKE menu in Google Cloud console.
Under the Notifications column for the user cluster, click Upgrade available, if available.
Copy the gkectl upgrade cluster command.
From your admin workstation, run the gkectl upgrade cluster command, where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file, [CLUSTER_NAME] is the name of the user cluster you're upgrading, and [USER_CLUSTER_CONFIG_FILE] is the GKE on-prem user cluster configuration file on your new admin workstation.
Resuming an upgrade
If a user cluster upgrade is interrupted after the admin cluster is successfully upgraded, you can resume the user cluster upgrade by running the same upgrade command with the --skip-validation-all flag:
gkectl upgrade cluster \
    --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    --config [USER_CLUSTER_CONFIG_FILE] \
    --cluster-name [CLUSTER_NAME] \
    --skip-validation-all
Resuming an admin cluster upgrade
You shouldn't interrupt an admin cluster upgrade. Currently, admin cluster upgrades aren't always resumable. If an admin cluster upgrade is interrupted for any reason, you should contact support for assistance.
Creating a new user cluster after an upgrade
After you upgrade your admin workstation and your admin cluster, any new user clusters that you create must have the same version as the upgrade target version.
Known issues
The following known issues affect upgrading clusters.
Version 1.1.0-gke.6, 1.2.0-gke.6: stackdriver.proxyconfigsecretname field removed
The stackdriver.proxyconfigsecretname field was removed in version 1.1.0-gke.6. GKE on-prem's preflight checks will return an error if the field is present in your configuration file.
To work around this, before you upgrade to 1.2.0-gke.6, delete the proxyconfigsecretname field from your configuration file.
Stackdriver references old version
Before version 1.2.0-gke.6, a known issue prevents Stackdriver from updating its configuration after cluster upgrades. Stackdriver still references an old version, which prevents Stackdriver from receiving the latest features of its telemetry pipeline. This issue can make it difficult for Google Support to troubleshoot clusters.
After you upgrade clusters to 1.2.0-gke.6, run the following command against admin and user clusters:
kubectl --kubeconfig=[KUBECONFIG] \
    -n kube-system --type=json patch stackdrivers stackdriver \
    -p '[{"op":"remove","path":"/spec/version"}]'
where [KUBECONFIG] is the path to the cluster's kubeconfig file.
Disruption for workloads with PodDisruptionBudgets
Currently, upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).
Version 1.2.0-gke.6: Prometheus and Grafana disabled after upgrading
In user clusters, Prometheus and Grafana get automatically disabled during upgrade. However, the configuration and metrics data are not lost. In admin clusters, Prometheus and Grafana stay enabled.
For instructions, refer to the GKE on-prem release notes.
Version 1.1.2-gke.0: Deleted user cluster nodes aren't removed from vSAN datastore
For instructions, refer to the GKE on-prem release notes.
Version 1.1.1-gke.2: Data disk in vSAN datastore folder can be deleted
If you're using a vSAN datastore, you need to create a folder in which to save the VMDK. A known issue requires that you provide the folder's universally unique identifier (UUID) path, rather than its file path, to vcenter.datadisk. This mismatch can cause upgrades to fail.
For instructions, refer to the GKE on-prem release notes.
Upgrading to version 1.1.0-gke.6 from version 1.0.2-gke.3: OIDC issue
Version 1.0.11, 1.0.1-gke.5, and 1.0.2-gke.3 clusters that have OpenID Connect (OIDC) configured cannot be upgraded to version 1.1.0-gke.6. This issue is fixed in version 1.1.1-gke.2.
If you configured a version 1.0.11, 1.0.1-gke.5, or 1.0.2-gke.3 cluster with OIDC during installation, you are not able to upgrade it. Instead, you should create new clusters.
Upgrading to version 1.0.2-gke.3 from version 1.0.11
Version 1.0.2-gke.3 introduces the following OIDC fields (usercluster.oidc). These fields enable logging in to a cluster from Google Cloud console:
usercluster.oidc.kubectlredirecturl
usercluster.oidc.clientsecret
usercluster.oidc.usehttpproxy
If you want to use OIDC, the clientsecret field is required even if you don't want to log in to a cluster from Google Cloud console. To use OIDC, you might need to provide a placeholder value for clientsecret:
oidc:
  clientsecret: "secret"
Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, nodes might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration.
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
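The same filter can be applied in a script to the JSON form of the list. The following is a minimal sketch with a hypothetical function name. It assumes the input is the text produced by kubectl get poddisruptionbudget --all-namespaces -o json, and it reports each match as namespace/name rather than the name/namespace order printed by the jsonpath command.

```python
import json
from typing import List


def pdbs_allowing_no_disruptions(pdb_list_json: str) -> List[str]:
    """Return 'namespace/name' for every PodDisruptionBudget whose status.disruptionsAllowed is 0."""
    data = json.loads(pdb_list_json)
    blocked = []
    for item in data.get("items", []):
        # Objects without a status (not yet reconciled) are skipped, as in the jsonpath filter.
        if item.get("status", {}).get("disruptionsAllowed") == 0:
            meta = item["metadata"]
            blocked.append(f'{meta["namespace"]}/{meta["name"]}')
    return blocked
```

Each returned entry identifies a workload to scale up before the node upgrade so that its Pods can be drained.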
Appendix
About VMware DRS rules enabled in version 1.1.0-gke.6
As of version 1.1.0-gke.6, GKE on-prem automatically creates VMware Distributed Resource Scheduler (DRS) anti-affinity rules for your user cluster's nodes, causing them to be spread across at least three physical hosts in your datacenter. As of version 1.1.0-gke.6, this feature is automatically enabled for new clusters and existing clusters.
Before you upgrade, be sure that your vSphere environment meets the following conditions:
- VMware DRS is enabled. VMware DRS requires vSphere Enterprise Plus license edition. To learn how to enable DRS, see Enabling VMware DRS in a cluster.
- The vSphere user account provided in the vcenter field has the Host.Inventory.EditCluster permission.
- There are at least three physical hosts available.
If your vSphere environment does not meet the preceding conditions, you can still upgrade, but for upgrading a user cluster from 1.3.x to 1.4.x, you need to disable anti-affinity groups. For more information, see this known issue in the GKE on-prem release notes.
Downtime
About downtime during upgrades
Resource | Description |
---|---|
Admin cluster | When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime. |
User cluster control plane | Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade. |
User cluster nodes | If an upgrade requires a change to user cluster nodes, GKE on-prem recreates the nodes in a rolling fashion, and reschedules Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules. |
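The retry advice for Kubernetes API callers in the table above can be sketched as a generic wrapper. This is an illustration under stated assumptions, not a client library API: call_with_retry is a hypothetical name, the wrapped call is assumed to raise Python's built-in ConnectionError when the connection breaks, and the 60-second default simply mirrors the worst-case downtime mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    max_wait_seconds: float = 60.0,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() with exponential backoff until it succeeds or the time budget runs out."""
    deadline = time.monotonic() + max_wait_seconds
    delay = initial_delay
    while True:
        try:
            return call()
        except ConnectionError:
            # Give up if the next wait would exceed the budget; re-raise the last error.
            if time.monotonic() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, 10.0)  # cap the backoff at 10 seconds
```

The sleep parameter is injectable so that the wrapper can be exercised in tests without real waiting.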
Troubleshooting
For more information, refer to Troubleshooting.
New nodes created but not healthy
- Symptoms
New nodes don't register themselves to the user cluster control plane when using manual load balancing mode.
- Possible causes
In-node Ingress validation might be enabled, which blocks the boot process of the nodes.
- Resolution
To disable the validation, run:
kubectl patch machinedeployment [MACHINE_DEPLOYMENT_NAME] -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"machineVariables":{"net_validation_ports": null}}}}}}}' --type=merge
Diagnosing cluster issues using gkectl
Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.
Running gkectl commands verbosely
-v5
Logging gkectl errors to stderr
--alsologtostderr
Locating gkectl logs in the admin workstation
Even if you don't pass in its debugging flags, you can view gkectl logs in the following admin workstation directory:
/home/ubuntu/.config/gke-on-prem/logs
Locating Cluster API logs in the admin cluster
If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:
Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
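If you save the Pod's logs to a file or capture them in a script, the optional error search can also be done without grep. The following is a minimal sketch with a hypothetical function name. It assumes that a case-insensitive match on the word "error" is a useful first filter, which might not catch every failure message.

```python
from typing import List


def lines_mentioning(log_text: str, needle: str = "error") -> List[str]:
    """Return the log lines that contain needle, ignoring case."""
    lowered = needle.lower()
    return [line for line in log_text.splitlines() if lowered in line.lower()]
```

Pass a different needle, such as a VM name, to narrow the search to one machine.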