Anthos clusters on VMware known issues

This page lists all known issues for Anthos clusters on VMware.

Category Identified version(s) Issue and workaround
Storage 1.14, 1.15, 1.16

Data corruption on NFSv3 when parallel appends to a shared file are done from multiple hosts

If you use Nutanix storage arrays to provide NFSv3 shares to your hosts, you might experience data corruption, or Pods might fail to run successfully. This issue is caused by a known compatibility issue between certain versions of VMware and Nutanix. For more information, see the associated VMware KB article.


Workaround:

The VMware KB article is out of date in noting that there is no current resolution. To resolve this issue, update to the latest version of ESXi on your hosts and to the latest Nutanix version on your storage arrays.

Operating system 1.13.10, 1.14.6, 1.15.3

Version mismatch between the kubelet and the Kubernetes control plane

For certain Anthos clusters on VMware releases, the kubelet running on the nodes uses a different version than the Kubernetes control plane. There is a mismatch because the kubelet binary preloaded on the OS image is using a different version.

The following table lists the identified version mismatches:

Anthos version    kubelet version      Kubernetes version
1.13.10           v1.24.11-gke.1200    v1.24.14-gke.2100
1.14.6            v1.25.8-gke.1500     v1.25.10-gke.1200
1.15.3            v1.26.2-gke.1001     v1.26.5-gke.2100

Workaround:

No action is needed. The inconsistency is only between Kubernetes patch versions and no problems have been caused by this version skew.
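
To confirm that a skew you observe is limited to the patch level, you can compare the major.minor portion of the two version strings. The following is a minimal sketch; the helper names minor_of and same_minor are hypothetical and are not part of any Anthos tooling.

```shell
# Print the MAJOR.MINOR part of a version string such as v1.24.11-gke.1200.
# minor_of and same_minor are illustrative helper names, not product commands.
minor_of() {
  v="${1#v}"                 # drop the leading "v", if any
  major="${v%%.*}"           # text before the first dot
  rest="${v#*.}"             # text after the first dot
  printf '%s.%s\n' "$major" "${rest%%.*}"
}

# Succeed when both versions share the same MAJOR.MINOR.
same_minor() {
  [ "$(minor_of "$1")" = "$(minor_of "$2")" ]
}

# Example with the 1.13.10 row from the table above.
if same_minor v1.24.11-gke.1200 v1.24.14-gke.2100; then
  echo "patch-level skew only"
fi
```

The kubelet version of each node appears in the VERSION column of kubectl get nodes, and kubectl version reports the control plane (server) version.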

Upgrades and updates 1.15.0-1.15.4

Upgrading or updating an admin cluster with a CA version greater than 1 fails

When an admin cluster has a certificate authority (CA) version greater than 1, an update or upgrade fails due to the CA version validation in the webhook. The output of gkectl upgrade/update contains the following error message:

    CAVersion must start from 1
    

Workaround:

  • Scale down the auto-resize-controller deployment in the admin cluster to disable node auto-resizing. This is necessary because a new field introduced to the admin cluster Custom Resource in 1.15 can cause a nil pointer error in the auto-resize-controller.
     kubectl scale deployment auto-resize-controller -n kube-system --replicas=0 --kubeconfig KUBECONFIG
          
  • Run gkectl commands with the --disable-admin-cluster-webhook flag. For example:
            gkectl upgrade admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG --disable-admin-cluster-webhook
            
Operation 1.13, 1.14.0-1.14.8, 1.15.0-1.15.4, 1.16.0-1.16.1

Non-HA Controlplane V2 cluster deletion stuck until timeout

When a non-HA Controlplane V2 cluster is deleted, it is stuck at node deletion until it times out.

Workaround:

If the cluster contains a StatefulSet with critical data, contact Cloud Customer Care to resolve this issue.

Otherwise, do the following steps:

  • Delete all cluster VMs from vSphere. You can delete the VMs through the vSphere UI, or run the following command:
          govc vm.destroy
  • Force delete the cluster again:
         gkectl delete cluster --cluster USER_CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG --force
         

Storage 1.15.0+, 1.16.0+

Constant CNS attachvolume tasks appear every minute for in-tree PVC/PV after upgrading to Anthos 1.15+

When a cluster contains in-tree vSphere persistent volumes (for example, PVCs created with the standard StorageClass), you will observe com.vmware.cns.tasks.attachvolume tasks triggered every minute from vCenter.


Workaround:

Edit the vSphere CSI feature ConfigMap and set list-volumes to false:

     kubectl edit configmap internal-feature-states.csi.vsphere.vmware.com -n kube-system --kubeconfig KUBECONFIG
     

Restart the vSphere CSI controller pods:

     kubectl rollout restart deployment vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
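
If you prefer a non-interactive change over kubectl edit, a merge patch can set the same key. The sketch below assumes that list-volumes is stored as a string value directly under data in the ConfigMap; verify this against your cluster before applying it. The function name feature_state_patch is hypothetical.

```shell
# Build a JSON merge patch that sets one key under "data" in a ConfigMap.
# Assumption: the feature flag lives at .data["list-volumes"] as a string.
# feature_state_patch is an illustrative name, not a kubectl feature.
feature_state_patch() {
  printf '{"data":{"%s":"%s"}}\n' "$1" "$2"
}

# Print the patch body for inspection.
feature_state_patch list-volumes false

# To apply the patch (not run here):
#   kubectl patch configmap internal-feature-states.csi.vsphere.vmware.com \
#     -n kube-system --kubeconfig KUBECONFIG --type merge \
#     -p "$(feature_state_patch list-volumes false)"
```

Printing the patch before applying it makes it easy to review the exact change that will be sent to the API server.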
    
Storage 1.16.0

False warnings against PVCs

When a cluster contains in-tree vSphere persistent volumes, the commands gkectl diagnose and gkectl upgrade might raise false warnings against their persistent volume claims (PVCs) when validating the cluster storage settings. The warning message looks like the following:

    CSIPrerequisites pvc/pvc-name: PersistentVolumeClaim pvc-name bounds to an in-tree vSphere volume created before CSI migration enabled, but it doesn't have the annotation pv.kubernetes.io/migrated-to set to csi.vsphere.vmware.com after CSI migration is enabled
    

Workaround:

Run the following command to check the annotations of a PVC with the above warning:

    kubectl get pvc PVC_NAME -n PVC_NAMESPACE -oyaml --kubeconfig KUBECONFIG
    

If the annotations field in the output contains the following, you can safely ignore the warning:

      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
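
The annotation check can also be scripted. The following sketch reads the PVC YAML on standard input and succeeds only when all three annotations listed above are present; the function name pvc_warning_is_benign is hypothetical, and the match is a plain text search rather than a YAML parse.

```shell
# Succeed when the PVC YAML on stdin contains all three annotations listed above.
# pvc_warning_is_benign is an illustrative name, not a gkectl or kubectl feature.
pvc_warning_is_benign() {
  yaml="$(cat)"
  for needle in \
    'pv.kubernetes.io/bind-completed: "yes"' \
    'pv.kubernetes.io/bound-by-controller: "yes"' \
    'volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com'
  do
    # -F treats the annotation as a fixed string, not a regular expression.
    printf '%s\n' "$yaml" | grep -qF -- "$needle" || return 1
  done
}

# Usage (not run here):
#   kubectl get pvc PVC_NAME -n PVC_NAMESPACE -oyaml --kubeconfig KUBECONFIG \
#     | pvc_warning_is_benign && echo "warning can be ignored"
```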
    
Upgrades and updates 1.15.0+, 1.16.0+

Service account key rotation fails when multiple keys are expired

If your cluster doesn't use a private registry, and both your component access service account key and your Logging-monitoring (or Connect-register) service account keys have expired, gkectl update credentials fails when you rotate the service account keys, with an error similar to the following:

Error: reconciliation failed: failed to update platform: ...

Workaround:

First, rotate the component access service account key. Although the same error message is displayed, you should be able to rotate the other keys after the component access service account key rotation.

If the update is still not successful, contact Cloud Customer Care to resolve this issue.

Upgrades and updates 1.16.0

Control plane node fails to be created

During an upgrade or update of an admin cluster, a race condition might cause the vSphere cloud controller manager to unexpectedly delete a new control plane node. This causes the clusterapi-controller to be stuck waiting for the node to be created, and eventually the upgrade or update times out. In this case, the output of the gkectl upgrade/update command is similar to the following:

    controlplane 'default/gke-admin-hfzdg' is not ready: condition "Ready": condition is not ready with reason "MachineInitializing", message "Wait for the control plane machine "gke-admin-hfzdg-6598459f9zb647c8-0\" to be rebooted"...
    

To identify the symptom, run the following commands to get the logs of the vSphere cloud controller manager in the admin cluster:

    kubectl get pods --kubeconfig ADMIN_KUBECONFIG -n kube-system | grep vsphere-cloud-controller-manager
    kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n kube-system
    

Here is a sample error message from the above command:

    node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.
    

Workaround:

  1. Reboot the failed machine to recreate the deleted node object.
  2. SSH into each control plane node and restart the vSphere cloud controller manager static pod:
          sudo crictl ps | grep vsphere-cloud-controller-manager | awk '{print $1}'
          sudo crictl stop PREVIOUS_COMMAND_OUTPUT
          
  3. Rerun the upgrade or update command.
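
The two commands in step 2 can be combined so that the container ID doesn't have to be copied by hand. The helper below is a sketch that extracts the first column of matching lines from crictl ps style output; the name first_column_matching is hypothetical.

```shell
# Print the first column (the container ID) of every line on stdin that
# mentions the given name. first_column_matching is a hypothetical helper name.
first_column_matching() {
  awk -v name="$1" 'index($0, name) { print $1 }'
}

# Usage on a control plane node (not run here; -r is a GNU xargs option that
# skips the stop command when nothing matches):
#   sudo crictl ps | first_column_matching vsphere-cloud-controller-manager \
#     | xargs -r sudo crictl stop
```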
Storage 1.11+, 1.12+, 1.13+, 1.14+, 1.15+, 1.16

PVC creation failure after node is recreated with the same name

After a node is deleted and then recreated with the same node name, there is a slight chance that a subsequent PersistentVolumeClaim (PVC) creation fails with an error like the following:

    The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created

This is caused by a race condition in which the vSphere CSI controller does not delete a removed machine from its cache.


Workaround:

Restart the vSphere CSI controller pods:

    kubectl rollout restart deployment vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
    
Operation 1.16.0

gkectl repair admin-master returns kubeconfig unmarshal error

When you run the gkectl repair admin-master command on an HA admin cluster, gkectl returns the following error message:

  Exit with error: Failed to repair: failed to select the template: failed to get cluster name from kubeconfig, please contact Google support. failed to decode kubeconfig data: yaml: unmarshal errors:
    line 3: cannot unmarshal !!seq into map[string]*api.Cluster
    line 8: cannot unmarshal !!seq into map[string]*api.Context
  

Workaround:

Add the --admin-master-vm-template= flag to the command and provide the VM template of the machine to repair:

  gkectl repair admin-master --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
      --config ADMIN_CLUSTER_CONFIG_FILE \
      --admin-master-vm-template=/DATA_CENTER/vm/VM_TEMPLATE_NAME
  

To find the VM template of the machine:

  1. Go to the Hosts and Clusters page in the vSphere client.
  2. Click VM Templates and filter by the admin cluster name.

    You should see the three VM templates for the admin cluster.

  3. Copy the name of the VM template that matches the name of the machine you're repairing, and use the template name in the repair command. For example:
  gkectl repair admin-master \
      --config=/home/ubuntu/admin-cluster.yaml \
      --kubeconfig=/home/ubuntu/kubeconfig \
      --admin-master-vm-template=/atl-qual-vc07/vm/gke-admin-98g94-zx...7vx-0-tmpl
Networking 1.10.0+, 1.11.0+, 1.12.0+, 1.13.0+, 1.14.0-1.14.7, 1.15.0-1.15.3, 1.16.0

Seesaw VM broken due to low disk space

If you use Seesaw as the load balancer type for your cluster and you see that a Seesaw VM is down or keeps failing to boot, you might see the following error message in the vSphere console:

    GRUB_FORCE_PARTUUID set, initrdless boot failed. Attempting with initrd
    

This error indicates that disk space is low on the VM because the fluent-bit instance running on the Seesaw VM is not configured with the correct log rotation.


Workaround:

Locate the log files that consume the most disk space by using du -sh -- /var/lib/docker/containers/* | sort -rh. Clean up the largest log file and reboot the VM.

Note: If the VM is completely inaccessible, attach the disk to a working VM (for example, the admin workstation), remove the file from the attached disk, and then reattach the disk to the original Seesaw VM.

To prevent the issue from happening again, connect to the VM and modify the /etc/systemd/system/docker.fluent-bit.service file. Add --log-opt max-size=10m --log-opt max-file=5 to the Docker command, and then run systemctl restart docker.fluent-bit.service.
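
To pick out the largest entry without reading the sorted list by eye, the du output can be piped through a small filter. This sketch uses kilobyte counts (du -sk) instead of human-readable sizes so that a plain numeric sort is enough; the function name largest_from_du is hypothetical.

```shell
# Read "SIZE_KB<TAB>PATH" lines (the format printed by du -sk) on stdin and
# print the path with the largest size. largest_from_du is an illustrative name.
largest_from_du() {
  sort -rn | head -n 1 | cut -f2-
}

# Usage on the Seesaw VM (not run here):
#   sudo du -sk -- /var/lib/docker/containers/* | largest_from_du
```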

Operation 1.13, 1.14.0-1.14.6, 1.15

Admin SSH public key error after admin cluster upgrade or update

When you try to upgrade (gkectl upgrade admin) or update (gkectl update admin) a non-High-Availability admin cluster with checkpoint enabled, the upgrade or update may fail with errors like the following:

Checking admin cluster certificates...FAILURE
    Reason: 20 admin cluster certificates error(s).
Unhealthy Resources:
    AdminMaster clusterCA bundle: failed to get clusterCA bundle on admin master, command [ssh -o IdentitiesOnly=yes -i admin-ssh-key -o StrictHostKeyChecking=no -o ConnectTimeout=30 ubuntu@AdminMasterIP -- sudo cat /etc/kubernetes/pki/ca-bundle.crt] failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
    ubuntu@AdminMasterIP: Permission denied (publickey).
failed to ssh AdminMasterIP, failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
    ubuntu@AdminMasterIP: Permission denied (publickey)
error dialing ubuntu@AdminMasterIP: failed to establish an authenticated SSH connection: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey]...


Workaround:

If you're unable to upgrade to a patch version of Anthos clusters on VMware with the fix, contact Google Support for assistance.

Upgrades 1.13.0-1.13.9, 1.14.0-1.14.6, 1.15.1-1.15.2

Upgrading an admin cluster enrolled in the Anthos On-Prem API could fail

When an admin cluster is enrolled in the Anthos On-Prem API, upgrading the admin cluster to the affected versions could fail because the fleet membership couldn't be updated. When this failure happens, you see the following error when trying to upgrade the cluster:

    failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.on_prem_cluster.resource_link: field cannot be updated
    

An admin cluster is enrolled in the API when you explicitly enroll the cluster, or when you upgrade a user cluster using an Anthos On-Prem API client.


Workaround:

Unenroll the admin cluster:
    gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
    
and resume upgrading the admin cluster. You might see the stale `failed to register cluster` error temporarily. After a while, it should be updated automatically.

Upgrades and updates 1.13.0-1.13.9, 1.14.0-1.14.4, 1.15.0

When an admin cluster is enrolled in the Anthos On-Prem API, its resource link annotation is applied to the OnPremAdminCluster custom resource, which is not preserved during later admin cluster updates due to the wrong annotation key being used. This can cause the admin cluster to be enrolled in the Anthos On-Prem API again by mistake.

An admin cluster is enrolled in the API when you explicitly enroll the cluster, or when you upgrade a user cluster using an Anthos On-Prem API client.


Workaround:

Unenroll the admin cluster:
    gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
    
and re-enroll the admin cluster.

Networking 1.15.0-1.15.2

CoreDNS orderPolicy not recognized

OrderPolicy doesn't get recognized as a parameter and isn't used. Instead, Anthos clusters on VMware always uses Random.

This issue occurs because the CoreDNS template was not updated, which causes orderPolicy to be ignored.


Workaround:

Update the CoreDNS template and apply the fix. This fix persists until an upgrade.

  1. Edit the existing template:
    kubectl edit cm -n kube-system coredns-template
    
    Replace the contents of the template with the following:
    coredns-template: |-
      .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
    {{- if .PrivateGoogleAccess }}
        import zones/private.Corefile
    {{- end }}
    {{- if .RestrictedGoogleAccess }}
        import zones/restricted.Corefile
    {{- end }}
        prometheus :9153
        forward . {{ .UpstreamNameservers }} {
          max_concurrent 1000
          {{- if ne .OrderPolicy "" }}
          policy {{ .OrderPolicy }}
          {{- end }}
        }
        cache 30
    {{- if .DefaultDomainQueryLogging }}
        log
    {{- end }}
        loop
        reload
        loadbalance
    }{{ range $i, $stubdomain := .StubDomains }}
    {{ $stubdomain.Domain }}:53 {
      errors
    {{- if $stubdomain.QueryLogging }}
      log
    {{- end }}
      cache 30
      forward . {{ $stubdomain.Nameservers }} {
        max_concurrent 1000
        {{- if ne $.OrderPolicy "" }}
        policy {{ $.OrderPolicy }}
        {{- end }}
      }
    }
    {{- end }}
    
Upgrades and updates 1.10, 1.11, 1.12, 1.13.0-1.13.7, 1.14.0-1.14.3

OnPremAdminCluster status inconsistent between checkpoint and actual CR

Certain race conditions could cause the OnPremAdminCluster status to be inconsistent between the checkpoint and the actual CR. When the issue happens, you could encounter the following error when you update the admin cluster after upgrading it:

Exit with error:
E0321 10:20:53.515562  961695 console.go:93] Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated
Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated

Workaround:

To work around this issue, you need to either edit the checkpoint or disable the checkpoint for the upgrade or update. Contact the support team to proceed with the workaround.

Operation 1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1

Reconciliation process changes admin certificates on admin clusters

Anthos clusters on VMware changes the admin certificates on admin cluster control planes with every reconciliation process, such as during a cluster upgrade. This behavior increases the possibility of getting invalid certificates for your admin cluster, especially for version 1.15 clusters.

If you're affected by this issue, you may encounter problems like the following:

  • Invalid certificates may cause the following commands to time out and return errors:
    • gkectl create admin
    • gkectl upgrade admin
    • gkectl update admin

    These commands may return authorization errors like the following:

    Failed to reconcile admin cluster: unable to populate admin clients: failed to get admin controller runtime client: Unauthorized
    
  • The kube-apiserver logs for your admin cluster may contain errors like the following:
    Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid...
    

Workaround:

Upgrade to a version of Anthos clusters on VMware with the fix: 1.13.10+, 1.14.6+, 1.15.2+. If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.
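
The fixed versions listed above can be turned into a quick check. The following sketch reports whether a given version string is at or above the fixed patch release for its minor version; the function name has_cert_fix is hypothetical, and minor versions other than 1.13, 1.14, and 1.15 are reported as unknown (exit status 2) because this issue doesn't list them.

```shell
# Succeed when VERSION is at or above the fixed patch release named in the
# workaround (1.13.10+, 1.14.6+, 1.15.2+). Returns 2 for other minor versions.
# has_cert_fix is an illustrative name, not a gkectl command.
has_cert_fix() {
  ver="${1%%-*}"          # drop any "-gke.N" suffix
  minor="${ver%.*}"       # for example, 1.14
  patch="${ver##*.}"      # for example, 6
  case "$minor" in
    1.13) [ "$patch" -ge 10 ] ;;
    1.14) [ "$patch" -ge 6 ] ;;
    1.15) [ "$patch" -ge 2 ] ;;
    *)    return 2 ;;
  esac
}

# Example: a 1.14.5 cluster is still affected.
if has_cert_fix 1.14.5; then echo "fixed"; else echo "affected or unknown"; fi
```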

Networking, Operation 1.10, 1.11, 1.12, 1.13, 1.14

Anthos Network Gateway components evicted or pending due to missing priority class

Network gateway Pods in kube-system might show a status of Pending or Evicted, as shown in the following condensed example output:

$ kubectl -n kube-system get pods | grep ang-node
ang-node-bjkkc     2/2     Running     0     5d2h
ang-node-mw8cq     0/2     Evicted     0     6m5s
ang-node-zsmq7     0/2     Pending     0     7h

These errors indicate eviction events or an inability to schedule Pods due to node resources. As Anthos Network Gateway Pods have no PriorityClass, they have the same default priority as other workloads. When nodes are resource-constrained, the network gateway Pods might be evicted. This behavior is particularly bad for the ang-node DaemonSet, as those Pods must be scheduled on a specific node and can't migrate.


Workaround:

Upgrade to 1.15 or later.

As a short-term fix, you can manually assign a PriorityClass to the Anthos Network Gateway components. The Anthos clusters on VMware controller overwrites these manual changes during a reconciliation process, such as during a cluster upgrade.

  • Assign the system-cluster-critical PriorityClass to the ang-controller-manager and autoscaler cluster controller Deployments.
  • Assign the system-node-critical PriorityClass to the ang-daemon node DaemonSet.
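
One way to make the manual assignment is a merge patch against the Pod template of each workload. The sketch below only builds the patch body; priority_class_patch is a hypothetical helper name, and DAEMONSET_NAME and DEPLOYMENT_NAME are placeholders for the resource names in your cluster.

```shell
# Build a JSON merge patch that sets priorityClassName on a workload's
# Pod template. priority_class_patch is an illustrative name.
priority_class_patch() {
  printf '{"spec":{"template":{"spec":{"priorityClassName":"%s"}}}}\n' "$1"
}

# Print the patch body for the DaemonSet case.
priority_class_patch system-node-critical

# To apply (not run here; replace the placeholder names):
#   kubectl -n kube-system patch daemonset DAEMONSET_NAME --type merge \
#     -p "$(priority_class_patch system-node-critical)"
#   kubectl -n kube-system patch deployment DEPLOYMENT_NAME --type merge \
#     -p "$(priority_class_patch system-cluster-critical)"
```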
Upgrades and updates 1.12, 1.13, 1.14, 1.15.0-1.15.2

Admin cluster upgrade fails after registering the cluster with gcloud

After you use gcloud to register an admin cluster with a non-empty gkeConnect section, you might see the following error when trying to upgrade the cluster:

failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.on_prem_cluster.admin_cluster: field cannot be updated

Workaround:

  1. Delete the gke-connect namespace:
    kubectl delete ns gke-connect --kubeconfig=ADMIN_KUBECONFIG
  2. Get the admin cluster name:
    kubectl get onpremadmincluster -n kube-system --kubeconfig=ADMIN_KUBECONFIG
  3. Delete the fleet membership:
    gcloud container fleet memberships delete ADMIN_CLUSTER_NAME
  4. Resume upgrading the admin cluster.

Operation 1.13.0-1.13.8, 1.14.0-1.14.5, 1.15.0-1.15.1

gkectl diagnose snapshot --log-since fails to limit the time window for journalctl commands running on the cluster nodes

This does not affect the functionality of taking a snapshot of the cluster, as the snapshot still includes all logs that are collected by default by running journalctl on the cluster nodes. Therefore, no debugging information is missed.

Installation, Upgrades and Updates 1.9+, 1.10+, 1.11+, 1.12+

gkectl prepare windows fails

gkectl prepare windows fails to install Docker on Anthos clusters on VMware versions earlier than 1.13 because MicrosoftDockerProvider is deprecated.


Workaround:

To work around this issue, upgrade to Anthos clusters on VMware 1.13, use the 1.13 gkectl to create a Windows VM template, and then create Windows node pools. There are two options to get to Anthos clusters on VMware 1.13 from your current version, as described below.

Note: There are options to work around this issue in your current version without upgrading all the way to 1.13, but they require more manual steps. Contact the support team if you would like to consider this option.


Option 1: Blue/Green upgrade

You can create a new cluster using Anthos clusters on VMware 1.13 or later with Windows node pools, migrate your workloads to the new cluster, and then tear down the current cluster. It's recommended to use the latest Anthos minor version.

Note: This will require extra resources to provision the new cluster, but less downtime and disruption for existing workloads.


Option 2: Delete Windows node pools and add them back when upgrading to Anthos clusters on VMware 1.13

Note: For this option, the Windows workloads will not be able to run until the cluster is upgraded to 1.13 and Windows node pools are added back.

  1. Delete the existing Windows node pools by removing the Windows node pool configuration from the user-cluster.yaml file, and then run the following command:
    gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
  2. Upgrade the Linux-only admin+user clusters to 1.12 following the upgrade user guide for the corresponding target minor version.
  3. (Make sure to perform this step before upgrading to 1.13.) Ensure that enableWindowsDataplaneV2: true is configured in the OnPremUserCluster CR. Otherwise, the cluster keeps using Docker for Windows node pools, which isn't compatible with the newly created 1.13 Windows VM template that doesn't have Docker installed. If the field isn't configured or is set to false, set it to true in user-cluster.yaml, and then run:
    gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
  4. Upgrade the Linux-only admin+user clusters to 1.13 following the upgrade user guide.
  5. Prepare Windows VM template using 1.13 gkectl:
    gkectl prepare windows --base-vm-template BASE_WINDOWS_VM_TEMPLATE_NAME --bundle-path 1.13_BUNDLE_PATH --kubeconfig=ADMIN_KUBECONFIG
  6. Add back the Windows node pool configuration to user-cluster.yaml with the OSImage field set to the newly created Windows VM template.
  7. Update the cluster to add the Windows node pools:
    gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
Installation, Upgrades and Updates 1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1

RootDistanceMaxSec configuration not taking effect for Ubuntu nodes

The 5-second default value for RootDistanceMaxSec is used on the nodes, instead of the expected 20 seconds. If you SSH into the VM and check the node startup log at /var/log/startup.log, you can find the following error:

+ has_systemd_unit systemd-timesyncd
/opt/bin/master.sh: line 635: has_systemd_unit: command not found

Using a 5-second RootDistanceMaxSec might cause the system clock to be out of sync with the NTP server when the clock drift is larger than 5 seconds.


Workaround:

SSH into the nodes and configure the RootDistanceMaxSec:

mkdir -p /etc/systemd/timesyncd.conf.d
cat > /etc/systemd/timesyncd.conf.d/90-gke.conf <<EOF
[Time]
RootDistanceMaxSec=20
EOF
systemctl restart systemd-timesyncd
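
After applying the workaround, you can confirm that the drop-in file carries the expected value. The following is a small sketch that prints the RootDistanceMaxSec setting from a timesyncd configuration file; the function name root_distance_of is hypothetical.

```shell
# Print the RootDistanceMaxSec value set in a timesyncd configuration file.
# root_distance_of is a hypothetical helper name.
root_distance_of() {
  awk -F= '$1 == "RootDistanceMaxSec" { print $2 }' "$1"
}

# Usage on the node (not run here):
#   root_distance_of /etc/systemd/timesyncd.conf.d/90-gke.conf
```

If the command prints nothing, the file doesn't set the option and the default applies.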
Upgrades and updates 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2

gkectl update admin fails because of empty osImageType field

When you use version 1.13 gkectl to update a version 1.12 admin cluster, you might see the following error:

Failed to update the admin cluster: updating OS image type in admin cluster
is not supported in "1.12.x-gke.x"

When you use gkectl update admin for version 1.13 or 1.14 clusters, you might see the following message in the response:

Exit with error:
Failed to update the cluster: the update contains multiple changes. Please
update only one feature at a time

If you check the gkectl log, you might see that the multiple changes include setting osImageType from an empty string to ubuntu_containerd.

These update errors are due to improper backfilling of the osImageType field in the admin cluster config since it was introduced in version 1.9.


Workaround:

Upgrade to a version of Anthos clusters on VMware with the fix. If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.

Installation, Security 1.13, 1.14, 1.15, 1.16

SNI doesn't work on user clusters with Controlplane V2

The ability to provide an additional serving certificate for the Kubernetes API server of a user cluster with authentication.sni doesn't work when Controlplane V2 is enabled (enableControlplaneV2: true).


Workaround:

Until an Anthos clusters on VMware patch with the fix is available, if you need to use SNI, disable Controlplane V2 (enableControlplaneV2: false).

Installation 1.0-1.11, 1.12, 1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1

$ in the private registry username causes admin control plane machine startup failure

The admin control plane machine fails to start up when the private registry username contains $. When checking the /var/log/startup.log on the admin control plane machine, you see the following error:

++ REGISTRY_CA_CERT=xxx
++ REGISTRY_SERVER=xxx
/etc/startup/startup.conf: line 7: anthos: unbound variable

Workaround:

Use a private registry username without $, or use a version of Anthos clusters on VMware with the fix.
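
Before creating the admin cluster, you can screen the username for the problematic character. This is a minimal sketch; the function name registry_username_ok and the sample usernames are hypothetical.

```shell
# Succeed only when the username contains no "$" character.
# registry_username_ok is an illustrative name, not part of gkectl.
registry_username_ok() {
  case "$1" in
    *'$'*) return 1 ;;
    *)     return 0 ;;
  esac
}

# Example with a made-up username; single quotes keep the shell from
# expanding "$anthos" before the check runs.
if ! registry_username_ok 'robot$anthos'; then
  echo "username contains \$ and would trigger this issue"
fi
```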

Upgrades and updates 1.12.0-1.12.4

False-positive warnings about unsupported changes during admin cluster update

When you update admin clusters, you will see the following false-positive warnings in the log, and you can ignore them.

    console.go:47] detected unsupported changes: &v1alpha1.OnPremAdminCluster{
      ...
      - 		CARotation:        &v1alpha1.CARotationConfig{Generated: &v1alpha1.CARotationGenerated{CAVersion: 1}},
      + 		CARotation:        nil,
      ...
    }
Upgrades and updates 1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1

Update user cluster failed after KSA signing key rotation

After you rotate KSA signing keys and subsequently update a user cluster, gkectl update might fail with the following error message:

Failed to apply OnPremUserCluster 'USER_CLUSTER_NAME-gke-onprem-mgmt/USER_CLUSTER_NAME':
admission webhook "vonpremusercluster.onprem.cluster.gke.io" denied the request:
requests must not decrement *v1alpha1.KSASigningKeyRotationConfig Version, old version: 2, new version: 1"


Workaround:

Change your KSA signing key version back to 1, but retain the latest key data:
  1. Check the secrets in the admin cluster under the USER_CLUSTER_NAME namespace, and get the name of the ksa-signing-key secret:
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secrets | grep ksa-signing-key
  2. Copy the ksa-signing-key secret, and name the copied secret as service-account-cert:
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secret KSA-KEY-SECRET-NAME -oyaml | \
    sed 's/ name: .*/ name: service-account-cert/' | \
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME apply -f -
  3. Delete the previous ksa-signing-key secret:
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME delete secret KSA-KEY-SECRET-NAME
  4. Update the data.data field in ksa-signing-key-rotation-stage configmap to '{"tokenVersion":1,"privateKeyVersion":1,"publicKeyVersions":[1]}':
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME \
    edit configmap ksa-signing-key-rotation-stage
  5. Disable the validation webhook to edit the version information in the OnPremUserCluster custom resource:
    kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p '
    webhooks:
    - name: vonpremnodepool.onprem.cluster.gke.io
      rules:
      - apiGroups:
        - onprem.cluster.gke.io
        apiVersions:
        - v1alpha1
        operations:
        - CREATE
        resources:
        - onpremnodepools
    - name: vonpremusercluster.onprem.cluster.gke.io
      rules:
      - apiGroups:
        - onprem.cluster.gke.io
        apiVersions:
        - v1alpha1
        operations:
        - CREATE
        resources:
        - onpremuserclusters
    '
  6. Update the spec.ksaSigningKeyRotation.generated.ksaSigningKeyRotation field to 1 in your OnPremUserCluster custom resource:
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
    edit onpremusercluster USER_CLUSTER_NAME
  7. Wait until the target user cluster is ready. You can check the status by running:
    kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
    get onpremusercluster
  8. Restore the validation webhook for the user cluster:
    kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p '
    webhooks:
    - name: vonpremnodepool.onprem.cluster.gke.io
      rules:
      - apiGroups:
        - onprem.cluster.gke.io
        apiVersions:
        - v1alpha1
        operations:
        - CREATE
        - UPDATE
        resources:
        - onpremnodepools
    - name: vonpremusercluster.onprem.cluster.gke.io
      rules:
      - apiGroups:
        - onprem.cluster.gke.io
        apiVersions:
        - v1alpha1
        operations:
        - CREATE
        - UPDATE
        resources:
        - onpremuserclusters
    '
  9. Avoid another KSA signing key rotation until the cluster is upgraded to the version with the fix.
Operation 1.13.1+, 1.14, 1., 1.16

F5 BIG-IP virtual servers aren't cleaned up when Terraform deletes user clusters

When you use Terraform to delete a user cluster with an F5 BIG-IP load balancer, the F5 BIG-IP virtual servers aren't removed after the cluster deletion.


Workaround:

To remove the F5 resources, follow the steps to clean up a user cluster F5 partition.

Installation, Upgrades and Updates 1.13.8, 1.14.4

kind cluster pulls container images from docker.io

If you create a version 1.13.8 or version 1.14.4 admin cluster, or upgrade an admin cluster to version 1.13.8 or 1.14.4, the kind cluster pulls the following container images from docker.io:

  • docker.io/kindest/kindnetd
  • docker.io/kindest/local-path-provisioner
  • docker.io/kindest/local-path-helper

If docker.io isn't accessible from your admin workstation, the admin cluster creation or upgrade fails to bring up the kind cluster. Running the following command on the admin workstation shows the corresponding containers pending with ErrImagePull:

    docker exec gkectl-control-plane kubectl get pods -A
    

    The response contains entries like the following:

    ...
    kube-system         kindnet-xlhmr                             0/1   ErrImagePull  0    3m12s
    ...
    local-path-storage  local-path-provisioner-86666ffff6-zzqtp   0/1   Pending       0    3m12s
    ...
    

    These container images should be preloaded in the kind cluster container image. However, kind v0.18.0 has an issue with the preloaded container images, which causes them to be pulled from the internet by mistake.


    Workaround:

    Run the following commands on the admin workstation, while your admin cluster is pending on creation or upgrade:

    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd:v20230330-48f316cd
    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af
    
    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper:v20230330-48f316cd
    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270
    
    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner:v0.0.23-kind.0
    docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501
    
    Operation 1.13.0-1.13.7, 1.14.0-1.14.4, 1.15.0

    Unsuccessful failover on HA Controlplane V2 user cluster and admin cluster when the network filters out duplicate GARP requests

    If your cluster VMs are connected with a switch that filters out duplicate GARP (gratuitous ARP) requests, the keepalived leader election might encounter a race condition, which causes some nodes to have incorrect ARP table entries.

    The affected nodes can ping the control plane VIP, but a TCP connection to the control plane VIP will time out.


    Workaround:

    Run the following command on each control plane node of the affected cluster:
        iptables -I FORWARD -i ens192 --destination CONTROL_PLANE_VIP -j DROP
        
    Upgrades and Updates 1.13.0-1.13.7, 1.14.0-1.14.4, 1.15.0

    vsphere-csi-controller needs to be restarted after the vCenter certificate rotation

    vsphere-csi-controller should refresh its vCenter secret after vCenter certificate rotation. However, the current system does not properly restart the pods of vsphere-csi-controller, causing vsphere-csi-controller to crash after the rotation.

    Workaround:

    For clusters created at 1.13 and later versions, run the following command to restart vsphere-csi-controller:

    kubectl --kubeconfig=ADMIN_KUBECONFIG rollout restart deployment vsphere-csi-controller -n kube-system
    Installation 1.10.3-1.10.7, 1.11, 1.12, 1.13.0-1.13.1

    Admin cluster creation does not fail on cluster registration errors

    Even when cluster registration fails during admin cluster creation, the command gkectl create admin does not fail on the error and might succeed. In other words, the admin cluster creation could "succeed" without being registered to a fleet.

    To identify the symptom, look for the following error message in the log of gkectl create admin:
    Failed to register admin cluster

    You can also check whether the cluster appears among the registered clusters in the Google Cloud console.

    Workaround:

    For clusters created at 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters created at earlier versions:

    1. Append a fake key-value pair like "foo: bar" to your connect-register SA key file.
    2. Run gkectl update admin to re-register the admin cluster.

    Upgrades and Updates 1.10, 1.11, 1.12, 1.13.0-1.13.1

    Admin cluster re-registration might be skipped during admin cluster upgrade

    During admin cluster upgrade, if upgrading user control plane nodes times out, the admin cluster will not be re-registered with the updated connect agent version.


    Workaround:

    Check whether the cluster appears among the registered clusters. As an optional step, log in to the cluster after setting up authentication. If the cluster is still registered, you can skip the following instructions for re-attempting the registration.

    For clusters upgraded to 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters upgraded to earlier versions:
    1. Append a fake key-value pair like "foo: bar" to your connect-register SA key file.
    2. Run gkectl update admin to re-register the admin cluster.

    Configuration 1.15.0

    False error message about vCenter.dataDisk

    For a high-availability admin cluster, gkectl prepare shows this false error message:

    vCenter.dataDisk must be present in the AdminCluster spec

    Workaround:

    You can safely ignore this error message.

    VMware 1.15.0

    Node pool creation fails because of redundant VM-Host affinity rules

    During creation of a node pool that uses VM-Host affinity, a race condition might result in multiple VM-Host affinity rules being created with the same name. This can cause node pool creation to fail.


    Workaround:

    Remove the old redundant rules so that node pool creation can proceed. These rules are named [USER_CLUSTER_NAME]-[HASH].

    Operation 1.15.0

    gkectl repair admin-master may fail due to failed to delete the admin master node object and reboot the admin master VM

    The gkectl repair admin-master command might fail because of a race condition, with the following error:

    Failed to repair: failed to delete the admin master node object and reboot the admin master VM


    Workaround:

    This command is idempotent. You can safely rerun it until it succeeds.

    Upgrades and updates 1.15.0

    Pods remain in Failed state after re-creation or update of a control-plane node

    After you re-create or update a control-plane node, certain Pods might be left in the Failed state due to NodeAffinity predicate failure. These failed Pods don't affect normal cluster operations or health.


    Workaround:

    You can safely ignore the failed Pods or manually delete them.

    Security, Configuration 1.15.0-1.15.1

    OnPremUserCluster not ready because of private registry credentials

    If you use prepared credentials and a private registry, but you haven't configured prepared credentials for your private registry, the OnPremUserCluster might not become ready, and you might see the following error message:

    failed to check secret reference for private registry …


    Workaround:

    Prepare the private registry credentials for the user cluster according to the instructions in Configure prepared credentials.

    Upgrades and updates 1.15.0

    gkectl upgrade admin fails with StorageClass standard sets the parameter diskformat which is invalid for CSI Migration

    During gkectl upgrade admin, the storage preflight check for CSI Migration verifies that the StorageClasses don't have parameters that are ignored after CSI Migration. For example, if there's a StorageClass with the parameter diskformat, then gkectl upgrade admin flags the StorageClass and reports a failure in the preflight validation. Admin clusters created in Anthos 1.10 and earlier have a StorageClass with diskformat: thin, which fails this validation. However, this StorageClass still works fine after CSI Migration, so these failures should be interpreted as warnings instead.

    For more information, check the StorageClass parameter section in Migrating In-Tree vSphere Volumes to vSphere Container Storage Plug-in.


    Workaround:

    After confirming that your cluster has a StorageClass with parameters that are ignored after CSI Migration, run gkectl upgrade admin with the flag --skip-validation-cluster-health.

    Storage 1.15, 1.16

    Migrated in-tree vSphere volumes using the Windows file system can't be used with vSphere CSI driver

    Under certain conditions, disks can be attached as read-only to Windows nodes. This results in the corresponding volume being read-only inside a Pod. This problem is more likely to occur when a new set of nodes replaces an old set of nodes (for example, during a cluster upgrade or node pool update). Stateful workloads that previously worked fine might be unable to write to their volumes on the new set of nodes.


    Workaround:

    1. Get the UID of the Pod that is unable to write to its volume:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pod \
          POD_NAME --namespace POD_NAMESPACE \
          -o=jsonpath='{.metadata.uid}{"\n"}'
    2. Use the PersistentVolumeClaim to get the name of the PersistentVolume:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pvc \
          PVC_NAME --namespace POD_NAMESPACE \
          -o jsonpath='{.spec.volumeName}{"\n"}'
    3. Determine the name of the node where the Pod is running:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pod \
          POD_NAME --namespace POD_NAMESPACE \
          -o jsonpath='{.spec.nodeName}{"\n"}'
    4. Obtain PowerShell access to the node, either through SSH or the vSphere web interface.
    5. Set variables for the PersistentVolume name and the Pod UID that you obtained in the previous steps:
      PS C:\Users\administrator> $pvname="PV_NAME"
      PS C:\Users\administrator> $podid="POD_UID"
    6. Identify the disk number for the disk associated with the PersistentVolume:
      PS C:\Users\administrator> $disknum=(Get-Partition -Volume (Get-Volume -UniqueId ("\\?\"+(Get-Item (Get-Item "C:\var\lib\kubelet\pods\$podid\volumes\kubernetes.io~csi\$pvname\mount").Target).Target))).DiskNumber
    7. Verify that the disk is readonly:
      PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly
      The result should be True.
    8. Set readonly to false, and then verify the change:
      PS C:\Users\administrator> Set-Disk -Number $disknum -IsReadonly $false
      PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly
    9. Delete the Pod so that it will get restarted:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pod POD_NAME \
          --namespace POD_NAMESPACE
    10. The Pod should get scheduled to the same node. But in case the Pod gets scheduled to a new node, you might need to repeat the preceding steps on the new node.

    Upgrades and updates 1.12, 1.13.0-1.13.7, 1.14.0-1.14.4

    vsphere-csi-secret is not updated after gkectl update credentials vsphere --admin-cluster

    If you update the vSphere credentials for an admin cluster by following the steps in Updating cluster credentials, you might find that vsphere-csi-secret in the kube-system namespace of the admin cluster still uses the old credential.


    Workaround:

    1. Get the vsphere-csi-secret secret name:
      kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets | grep vsphere-csi-secret
    2. Update the data of the vsphere-csi-secret secret you got from the above step:
      kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system patch secret CSI_SECRET_NAME -p \
        "{\"data\":{\"config\":\"$( \
          kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets CSI_SECRET_NAME -ojsonpath='{.data.config}' \
            | base64 -d \
            | sed -e '/user/c user = \"VSPHERE_USERNAME_TO_BE_UPDATED\"' \
            | sed -e '/password/c password = \"VSPHERE_PASSWORD_TO_BE_UPDATED\"' \
            | base64 -w 0 \
          )\"}}"
    3. Restart vsphere-csi-controller:
      kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout restart deployment vsphere-csi-controller
    4. You can track the rollout status with:
      kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout status deployment vsphere-csi-controller
      After the deployment is successfully rolled out, the updated vsphere-csi-secret should be used by the controller.
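
    Before patching the secret, it can help to preview what the two sed expressions in step 2 do to the decoded config. The following is a minimal sketch, assuming GNU sed (which the command in step 2 also relies on). The function name preview_csi_config and the values NEW_USER and NEW_PASSWORD are hypothetical placeholders.

```shell
# Preview the line replacements that step 2 applies to the decoded config.
# Reads a config on stdin and writes the edited config to stdout.
# NEW_USER and NEW_PASSWORD are placeholder values for illustration only.
preview_csi_config() {
  # The "c" command replaces every line that matches the pattern with
  # the given text (GNU sed one-line form).
  sed -e '/user/c user = "NEW_USER"' \
      -e '/password/c password = "NEW_PASSWORD"'
}

# Example (does not modify the cluster):
#   kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets CSI_SECRET_NAME \
#     -ojsonpath='{.data.config}' | base64 -d | preview_csi_config
```

    Because the patterns match any line that contains the word user or password, check the preview for unintended replacements, for example when a password value itself contains the string user.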
    Upgrades and updates 1.10, 1.11, 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2

    audit-proxy crashloop when enabling Cloud Audit Logs with gkectl update cluster

    audit-proxy might crashloop because of empty --cluster-name. This behavior is caused by a bug in the update logic, where the cluster name is not propagated to the audit-proxy pod / container manifest.


    Workaround:

    For a control plane v2 user cluster with enableControlplaneV2: true, connect to the user control plane machine using SSH, and update /etc/kubernetes/manifests/audit-proxy.yaml with --cluster_name=USER_CLUSTER_NAME.

    For a control plane v1 user cluster, edit the audit-proxy container in the kube-apiserver statefulset to add --cluster_name=USER_CLUSTER_NAME:

    kubectl edit statefulset kube-apiserver -n USER_CLUSTER_NAME --kubeconfig=ADMIN_CLUSTER_KUBECONFIG
    Upgrades and updates 1.11, 1.12, 1.13.0-1.13.5, 1.14.0-1.14.1

    An additional control plane redeployment right after gkectl upgrade cluster

    Right after gkectl upgrade cluster, the control plane pods might be re-deployed again. The cluster state from gkectl list clusters changes from RUNNING to RECONCILING. Requests to the user cluster might time out.

    This behavior occurs because the control plane certificate rotation happens automatically after gkectl upgrade cluster.

    This issue only happens to user clusters that do NOT use control plane v2.


    Workaround:

    Wait for the cluster state to change back to RUNNING in gkectl list clusters, or upgrade to versions with the fix: 1.13.6+, 1.14.2+ or 1.15+.

    Upgrades and updates 1.12.7

    Bad release 1.12.7-gke.19 has been removed

    Anthos clusters on VMware 1.12.7-gke.19 is a bad release and you should not use it. The artifacts have been removed from the Cloud Storage bucket.

    Workaround:

    Use the 1.12.7-gke.20 release instead.

    Upgrades and updates 1.12.0+, 1.13.0-1.13.7, 1.14.0-1.14.3

    gke-connect-agent continues to use the older image after the registry credential is updated

    If you update the registry credential using one of the following methods:

    • gkectl update credentials componentaccess if not using private registry
    • gkectl update credentials privateregistry if using private registry

    you might find that gke-connect-agent continues to use the older image, or that the gke-connect-agent pods cannot start because of ImagePullBackOff.

    This issue will be fixed in Anthos clusters on VMware releases 1.13.8, 1.14.4, and subsequent releases.


    Workaround:

    Option 1: Redeploy gke-connect-agent manually:

    1. Delete the gke-connect namespace:
      kubectl --kubeconfig=KUBECONFIG delete namespace gke-connect
    2. Redeploy gke-connect-agent with the original register service account key (no need to update the key):

      For admin cluster:
      gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE --admin-cluster
      For user cluster:
      gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE

    Option 2: You can manually change the data of the image pull secret regcred which is used by gke-connect-agent deployment:

    kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch secrets regcred -p "{\"data\":{\".dockerconfigjson\":\"$(kubectl --kubeconfig=KUBECONFIG -n=kube-system get secrets private-registry-creds -ojsonpath='{.data.\.dockerconfigjson}')\"}}"

    Option 3: You can add the default image pull secret for your cluster in the gke-connect-agent deployment by:

    1. Copy the default secret to gke-connect namespace:
      kubectl --kubeconfig=KUBECONFIG -n=kube-system get secret private-registry-creds -oyaml | sed 's/ namespace: .*/ namespace: gke-connect/' | kubectl --kubeconfig=KUBECONFIG -n=gke-connect apply -f -
    2. Get the gke-connect-agent deployment name:
      kubectl --kubeconfig=KUBECONFIG -n=gke-connect get deployment | grep gke-connect-agent
    3. Add the default secret to gke-connect-agent deployment:
      kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch deployment DEPLOYMENT_NAME -p '{"spec":{"template":{"spec":{"imagePullSecrets": [{"name": "private-registry-creds"}, {"name": "regcred"}]}}}}'
    Installation 1.13, 1.14

    Manual LB configuration check failure

    When you run gkectl check-config to validate the configuration before creating a cluster with a Manual load balancer, the command fails with the following error message:

     - Validation Category: Manual LB    Running validation check for "Network 
    configuration"...panic: runtime error: invalid memory address or nil pointer 
    dereference
    

    Workaround:

    Option 1: Use patch version 1.13.7 or 1.14.4, which will include the fix.

    Option 2: You can also run the same command to validate the configuration but skip the load balancer validation.

    gkectl check-config --skip-validation-load-balancer
    
    Operation 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, and 1.14

    etcd watch starvation

    Clusters running etcd version 3.4.13 or earlier may experience watch starvation and non-operational resource watches, which can lead to the following problems:

    • Pod scheduling is disrupted
    • Nodes are unable to register
    • kubelet doesn't observe pod changes

    These problems can make the cluster non-functional.

    This issue is fixed in Anthos clusters on VMware releases 1.12.7, 1.13.6, 1.14.3, and subsequent releases. These newer releases use etcd version 3.4.21. All prior versions of Anthos clusters on VMware are affected by this issue.

    Workaround:

    If you can't upgrade immediately, you can mitigate the risk of cluster failure by reducing the number of nodes in your cluster. Remove nodes until the etcd_network_client_grpc_sent_bytes_total metric is less than 300 MBps.

    To view this metric in Metrics Explorer:

    1. Go to the Metrics Explorer in the Google Cloud console:

      Go to Metrics Explorer

    2. Select the Configuration tab.
    3. Expand Select a metric, enter Kubernetes Container in the filter bar, and then use the submenus to select the metric:
      1. In the Active resources menu, select Kubernetes Container.
      2. In the Active metric categories menu, select Anthos.
      3. In the Active metrics menu, select etcd_network_client_grpc_sent_bytes_total.
      4. Click Apply.
    Upgrades and updates 1.10, 1.11, 1.12, 1.13, and 1.14

    Anthos Identity Service can cause control plane latencies

    At cluster restarts or upgrades, Anthos Identity Service can get overwhelmed with traffic consisting of expired JWT tokens forwarded from the kube-apiserver to Anthos Identity Service over the authentication webhook. Although Anthos Identity Service doesn't crashloop, it becomes unresponsive and ceases to serve further requests. This problem ultimately leads to higher control plane latencies.

    This issue is fixed in the following Anthos clusters on VMware releases:

    • 1.12.6+
    • 1.13.6+
    • 1.14.2+

    To determine if you're affected by this issue, perform the following steps:

    1. Check whether the Anthos Identity Service endpoint can be reached externally:
      curl -s -o /dev/null -w "%{http_code}" \
          -X POST https://CLUSTER_ENDPOINT/api/v1/namespaces/anthos-identity-service/services/https:ais:https/proxy/authenticate -d '{}'

      Replace CLUSTER_ENDPOINT with the control plane VIP and control plane load balancer port for your cluster (for example, 172.16.20.50:443).

      If you're affected by this issue, the command returns a 400 status code. If the request times out, restart the ais Pod and rerun the curl command to see if that resolves the problem. If you get a status code of 000, the problem has been resolved and you are done. If you still get a 400 status code, the Anthos Identity Service HTTP server isn't starting. In this case, continue.

    2. Check the Anthos Identity Service and kube-apiserver logs:
      1. Check the Anthos Identity Service log:
        kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
            --kubeconfig KUBECONFIG

        If the log contains an entry like the following, then you are affected by this issue:

        I0811 22:32:03.583448      32 authentication_plugin.cc:295] Stopping OIDC authentication for ???. Unable to verify the OIDC ID token: JWT verification failed: The JWT does not appear to be from this identity provider. To match this provider, the 'aud' claim must contain one of the following audiences:
        
      2. Check the kube-apiserver logs for your clusters:

        In the following commands, KUBE_APISERVER_POD is the name of the kube-apiserver Pod on the given cluster.

        Admin cluster:

        kubectl --kubeconfig ADMIN_KUBECONFIG logs \
            -n kube-system KUBE_APISERVER_POD kube-apiserver

        User cluster:

        kubectl --kubeconfig ADMIN_KUBECONFIG logs \
            -n USER_CLUSTER_NAME KUBE_APISERVER_POD kube-apiserver

        If the kube-apiserver logs contain entries like the following, then you are affected by this issue:

        E0811 22:30:22.656085       1 webhook.go:127] Failed to make webhook authenticator request: error trying to reach service: net/http: TLS handshake timeout
        E0811 22:30:22.656266       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, error trying to reach service: net/http: TLS handshake timeout]"
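
        When the logs are long, counting the matching entries can be quicker than reading them. The following is a minimal sketch that counts webhook authenticator timeouts in log text read from stdin; the function name count_webhook_timeouts is hypothetical, and the pattern is taken from the sample entries above.

```shell
# Count kube-apiserver log lines that report a TLS handshake timeout
# from the webhook authenticator. Reads log text on stdin.
count_webhook_timeouts() {
  # grep -c prints the number of matching lines; "|| true" keeps the
  # exit status at zero when there are no matches.
  grep -c 'webhook authenticator request.*TLS handshake timeout' || true
}

# Example:
#   kubectl --kubeconfig ADMIN_KUBECONFIG logs \
#       -n kube-system KUBE_APISERVER_POD kube-apiserver | count_webhook_timeouts
```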
        

    Workaround:

    If you can't upgrade your clusters immediately to get the fix, you can identify and restart the offending pods as a workaround:

    1. Increase the Anthos Identity Service verbosity level to 9:
      kubectl patch deployment ais -n anthos-identity-service --type=json \
          -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value":"--vmodule=cloud/identity/hybrid/charon/*=9"}]' \
          --kubeconfig KUBECONFIG
    2. Check the Anthos Identity Service log for the invalid token context:
      kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
          --kubeconfig KUBECONFIG
    3. To get the token payload associated with each invalid token context, parse each related service account secret with the following command:
      kubectl -n kube-system get secret SA_SECRET \
          --kubeconfig KUBECONFIG \
          -o jsonpath='{.data.token}' | base64 --decode
      
    4. To decode the token and see the source pod name and namespace, copy the token to the debugger at jwt.io.
    5. Restart the pods identified from the tokens.
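
    If you prefer to inspect a token without copying it to a website, the payload can also be decoded locally. The following is a minimal sketch, assuming the token is a standard three-part JWT and that the base64 utility supports --decode, as used in step 3. The function name jwt_payload is hypothetical.

```shell
# Print the JSON payload (the second dot-separated segment) of a JWT
# that is read from stdin. The signature is not verified.
jwt_payload() {
  # Extract the payload and convert base64url characters to standard base64.
  payload=$(cut -d '.' -f 2 | tr '_-' '/+' | tr -d '\n')
  # base64url omits padding; restore it so that base64 accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 --decode
}

# Example, using the command from step 3:
#   kubectl -n kube-system get secret SA_SECRET --kubeconfig KUBECONFIG \
#       -o jsonpath='{.data.token}' | base64 --decode | jwt_payload
```

    The output is the raw JSON payload, which you can read directly or pass to a JSON formatter.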
    Operation 1.8, 1.9, 1.10

    Memory usage increase in etcd-maintenance pods

    The etcd maintenance pods that use the etcddefrag:gke_master_etcddefrag_20210211.00_p0 image are affected. The etcddefrag container opens a new connection to the etcd server during each defrag cycle, and the old connections are not cleaned up.


    Workaround:

    Option 1: Upgrade to the latest patch version of 1.8, 1.9, 1.10, or 1.11, which contain the fix.

    Option 2: If you are using a patch version earlier than 1.9.6 or 1.10.3, scale down the etcd-maintenance pod for the admin and user clusters:

    kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n kube-system --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
    Operation 1.9, 1.10, 1.11, 1.12, 1.13

    Health checks of user cluster control plane pods are skipped

    Both the cluster health controller and the gkectl diagnose cluster command perform a set of health checks, including pod health checks across namespaces. However, they mistakenly skip the user control plane pods. If you use the control plane v2 mode, this issue doesn't affect your cluster.


    Workaround:

    This issue doesn't affect any workload or cluster management. To check the health of the control plane pods, run the following command:

    kubectl get pods -owide -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
    Upgrades and updates 1.6+, 1.7+

    1.6 and 1.7 admin cluster upgrades may be affected by the k8s.gcr.io -> registry.k8s.io redirect

    Kubernetes redirected the traffic from k8s.gcr.io to registry.k8s.io on March 20, 2023. In Anthos clusters on VMware 1.6.x and 1.7.x, the admin cluster upgrades use the container image k8s.gcr.io/pause:3.2. If you use a proxy for your admin workstation, the proxy doesn't allow registry.k8s.io, and the container image k8s.gcr.io/pause:3.2 is not cached locally, the admin cluster upgrades fail when pulling the container image.


    Workaround:

    Add registry.k8s.io to the allowlist of the proxy for your admin workstation.

    Networking 1.10, 1.11, 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2

    Seesaw validation failure on load balancer creation

    gkectl create loadbalancer fails with the following error message:

    - Validation Category: Seesaw LB - [FAILURE] Seesaw validation: xxx cluster lb health check failed: LB "xxx.xxx.xxx.xxx" is not healthy: Get "http://xxx.xxx.xxx.xxx:xxx/healthz": dial tcp xxx.xxx.xxx.xxx:xxx: connect: no route to host
    

    This failure occurs because the seesaw group file already exists, so the preflight check tries to validate a non-existent seesaw load balancer.

    Workaround:

    Remove the existing seesaw group file for this cluster. The file name is seesaw-for-gke-admin.yaml for the admin cluster, and seesaw-for-{CLUSTER_NAME}.yaml for a user cluster.

    Networking 1.14

    Application timeouts caused by conntrack table insertion failures

    Anthos clusters on VMware version 1.14 is susceptible to netfilter connection tracking (conntrack) table insertion failures when using Ubuntu or COS operating system images. Insertion failures lead to random application timeouts and can occur even when the conntrack table has room for new entries. The failures are caused by changes in kernel 5.15 and higher that restrict table insertions based on chain length.

    To see if you are affected by this issue, you can check the in-kernel connection tracking system statistics on each node with the following command:

    sudo conntrack -S

    The response looks like this:

    cpu=0       found=0 invalid=4 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
    cpu=1       found=0 invalid=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
    cpu=2       found=0 invalid=16 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
    cpu=3       found=0 invalid=13 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
    cpu=4       found=0 invalid=9 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
    cpu=5       found=0 invalid=1 insert=0 insert_failed=0 drop=0 early_drop=0 error=519 search_restart=0 clash_resolve=126 chaintoolong=0 
    ...
    

    If a chaintoolong value in the response is a non-zero number, you're affected by this issue.
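
    On nodes with many CPUs, scanning every line for a non-zero value is error-prone. The following is a minimal sketch that sums the chaintoolong counters across all CPUs; it assumes the output format shown above, and the function name chaintoolong_total is hypothetical.

```shell
# Sum the chaintoolong counters from `conntrack -S` output read on stdin.
# Prints the total; any value greater than 0 indicates the issue.
chaintoolong_total() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^chaintoolong=[0-9]+$/) {
        split($i, kv, "=")
        total += kv[2]
      }
    }
  } END { print total + 0 }'
}

# Example (run on a cluster node):
#   sudo conntrack -S | chaintoolong_total
```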

    Workaround:

    The short-term mitigation is to increase the size of both the netfilter hash table (nf_conntrack_buckets) and the netfilter connection tracking table (nf_conntrack_max). Use the following commands on each cluster node to increase the size of the tables:

    sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
    sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE

    Replace TABLE_SIZE with new size in bytes. The default table size value is 262144. We suggest that you set a value equal to 65,536 times the number of cores on the node. For example, if your node has eight cores, set the table size to 524288.
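
    The suggested value can be computed from the core count of the node. The following is a minimal sketch of that arithmetic; the function name suggested_table_size is hypothetical, and the usage comment assumes that nproc is available on the node.

```shell
# Compute the suggested table size: 65,536 times the number of cores.
suggested_table_size() {
  cores="$1"
  echo $((65536 * cores))
}

# Example (run on a cluster node):
#   TABLE_SIZE=$(suggested_table_size "$(nproc)")
#   sudo sysctl -w net.netfilter.nf_conntrack_buckets="$TABLE_SIZE"
#   sudo sysctl -w net.netfilter.nf_conntrack_max="$TABLE_SIZE"
```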

    Networking 1.13.0-1.13.2

    calico-typha or anetd-operator crash loop on Windows nodes with Controlplane v2

    With Controlplane v2 or a new installation model, calico-typha or anetd-operator might be scheduled to Windows nodes and enter a crash loop.

    This happens because the two deployments tolerate all taints, including the Windows node taint.


    Workaround:

    Either upgrade to 1.13.3+, or run the following commands to edit the calico-typha or anetd-operator deployment:

        # If dataplane v2 is not used.
        kubectl edit deployment -n kube-system calico-typha --kubeconfig USER_CLUSTER_KUBECONFIG
        # If dataplane v2 is used.
        kubectl edit deployment -n kube-system anetd-operator --kubeconfig USER_CLUSTER_KUBECONFIG
        

    Remove the following spec.template.spec.tolerations:

        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          operator: Exists
        

    And add the following toleration:

        - key: node-role.kubernetes.io/master
          operator: Exists
        
    Configuration 1.14.0-1.14.2

    User cluster private registry credential file cannot be loaded

    You might not be able to create a user cluster if you specify the privateRegistry section with credential fileRef. Preflight might fail with the following message:

    [FAILURE] Docker registry access: Failed to login.
    


    Workaround:

    • If you did not intend to specify the field or you want to use the same private registry credential as the admin cluster, you can remove or comment out the privateRegistry section in your user cluster config file.
    • If you want to use a specific private registry credential for your user cluster, you may temporarily specify the privateRegistry section this way:
      privateRegistry:
        address: PRIVATE_REGISTRY_ADDRESS
        credentials:
          username: PRIVATE_REGISTRY_USERNAME
          password: PRIVATE_REGISTRY_PASSWORD
        caCertPath: PRIVATE_REGISTRY_CACERT_PATH
      
      (NOTE: This is only a temporary fix, and these fields are already deprecated. Consider using the credential file when upgrading to 1.14.3+.)

    Operations 1.10+

    Anthos Service Mesh and other service meshes not compatible with Dataplane v2

    Dataplane V2 takes over load balancing and creates a kernel socket instead of a packet-based DNAT. This means that Anthos Service Mesh cannot do packet inspection, because the pod is bypassed and never uses IPTables.

    In kube-proxy free mode, this manifests as loss of connectivity or incorrect traffic routing for services with Anthos Service Mesh, because the sidecar cannot do packet inspection.

    This issue is present on all versions of Anthos clusters on bare metal 1.10; however, some newer versions of 1.10 (1.10.2+) have a workaround.


    Workaround:

    Either upgrade to 1.11 for full compatibility or, if you are running 1.10.2 or later, run:

        kubectl edit cm -n kube-system cilium-config --kubeconfig USER_CLUSTER_KUBECONFIG
        

    Add bpf-lb-sock-hostns-only: true to the configmap and then restart the anetd daemonset:

          kubectl rollout restart ds anetd -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG
        

    Storage 1.12+, 1.13.3

    kube-controller-manager might detach persistent volumes forcefully after 6 minutes

    kube-controller-manager might time out when detaching PV/PVCs after 6 minutes, and then forcefully detach the PV/PVCs. Detailed logs from kube-controller-manager show events similar to the following:

    $ cat kubectl_logs_kube-controller-manager-xxxx | grep "DetachVolume started" | grep expired
    
    kubectl_logs_kube-controller-manager-gke-admin-master-4mgvr_--container_kube-controller-manager_--kubeconfig_kubeconfig_--request-timeout_30s_--namespace_kube-system_--timestamps:2023-01-05T16:29:25.883577880Z W0105 16:29:25.883446       1 reconciler.go:224] attacherDetacher.DetachVolume started for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f"
    This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
    

    To verify the issue, log in to the node and run the following commands:

    # See all the mounting points with disks
    lsblk -f
    
    # See some ext4 errors
    sudo dmesg -T
    

    In the kubelet log, errors like the following are displayed:

    Error: GetDeviceMountRefs check failed for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" :
    the device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount
    

    Workaround:

    Connect to the affected node using SSH and reboot the node.

    Upgrades and updates 1.12+, 1.13+, 1.14+

    Cluster upgrade is stuck if 3rd party CSI driver is used

    You might not be able to upgrade a cluster if you use a third-party CSI driver. The gkectl diagnose cluster command might return the following error:

    "virtual disk "kubernetes.io/csi/csi.netapp.io^pvc-27a1625f-29e3-4e4f-9cd1-a45237cc472c" IS NOT attached to machine "cluster-pool-855f694cc-cjk5c" but IS listed in the Node.Status"
    


    Workaround:

    Perform the upgrade using the --skip-validation-all option.

    Operation 1.10+, 1.11+, 1.12+, 1.13+, 1.14+

    gkectl repair admin-master creates the admin master VM without upgrading its VM hardware version

    The admin master node created via gkectl repair admin-master may use a lower VM hardware version than expected. When the issue happens, the gkectl diagnose cluster report shows the following error:

    CSIPrerequisites [VM Hardware]: The current VM hardware versions are lower than vmx-15 which is unexpected. Please contact Anthos support to resolve this issue.


    Workaround:

    Shut down the admin master node, follow https://kb.vmware.com/s/article/1003746 to upgrade the node to the expected version described in the error message, and then start the node.

    Operating system 1.10+, 1.11+, 1.12+, 1.13+, 1.14+, 1.15+, 1.16+

    VM releases DHCP lease on shutdown/reboot unexpectedly, which may result in IP changes

    In systemd v244, systemd-networkd has a default behavior change on the KeepConfiguration configuration. Before this change, VMs did not send a DHCP lease release message to the DHCP server on shutdown or reboot. After this change, VMs send such a message and return the IPs to the DHCP server. As a result, the released IP may be reallocated to a different VM and/or a different IP may be assigned to the VM, resulting in IP conflict (at Kubernetes level, not vSphere level) and/or IP change on the VMs, which can break the clusters in various ways.

    For example, you may see the following symptoms.

    • vCenter UI shows that no VMs use the same IP, but kubectl get nodes -o wide returns nodes with duplicate IPs.
      NAME   STATUS    AGE  VERSION          INTERNAL-IP    EXTERNAL-IP    OS-IMAGE            KERNEL-VERSION    CONTAINER-RUNTIME
      node1  Ready     28h  v1.22.8-gke.204  10.180.85.130  10.180.85.130  Ubuntu 20.04.4 LTS  5.4.0-1049-gkeop  containerd://1.5.13
      node2  NotReady  71d  v1.22.8-gke.204  10.180.85.130  10.180.85.130  Ubuntu 20.04.4 LTS  5.4.0-1049-gkeop  containerd://1.5.13
    • New nodes fail to start due to calico-node error
      2023-01-19T22:07:08.817410035Z 2023-01-19 22:07:08.817 [WARNING][9] startup/startup.go 1135: Calico node 'node1' is already using the IPv4 address 10.180.85.130.
      2023-01-19T22:07:08.817514332Z 2023-01-19 22:07:08.817 [INFO][9] startup/startup.go 354: Clearing out-of-date IPv4 address from this node IP="10.180.85.130/24"
      2023-01-19T22:07:08.825614667Z 2023-01-19 22:07:08.825 [WARNING][9] startup/startup.go 1347: Terminating
      2023-01-19T22:07:08.828218856Z Calico node failed to start
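
    The duplicate-IP symptom can be spotted with a short filter over the node listing. The following is a minimal sketch and not part of the product: the function name find_duplicate_node_ips is hypothetical, and it assumes the column layout of kubectl get nodes -o wide shown above, with an INTERNAL-IP column in the header row.

    ```shell
    # Hypothetical helper: reads `kubectl get nodes -o wide` output on stdin and
    # prints each INTERNAL-IP that more than one node reports, with the node names.
    find_duplicate_node_ips() {
      awk '
        NR == 1 {
          # Locate the INTERNAL-IP column from the header row.
          for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i
          next
        }
        col > 0 {
          count[$col]++
          nodes[$col] = nodes[$col] " " $1
        }
        END {
          for (ip in count) if (count[ip] > 1) print ip ":" nodes[ip]
        }
      '
    }

    # Example (USER_CLUSTER_KUBECONFIG is a placeholder):
    #   kubectl get nodes -o wide --kubeconfig USER_CLUSTER_KUBECONFIG | find_duplicate_node_ips
    ```

    An empty result means that no two nodes report the same internal IP.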


    Workaround:

    Deploy the following DaemonSet on the cluster to revert the systemd-networkd default behavior change. The VMs that run this DaemonSet will not release the IPs to the DHCP server on shutdown/reboot. The IPs will be freed automatically by the DHCP server when the leases expire.

          apiVersion: apps/v1
          kind: DaemonSet
          metadata:
            name: set-dhcp-on-stop
          spec:
            selector:
              matchLabels:
                name: set-dhcp-on-stop
            template:
              metadata:
                labels:
                  name: set-dhcp-on-stop
              spec:
                hostIPC: true
                hostPID: true
                hostNetwork: true
                containers:
                - name: set-dhcp-on-stop
                  image: ubuntu
                  tty: true
                  command:
                  - /bin/bash
                  - -c
                  - |
                    set -x
                    date
                    while true; do
                      export CONFIG=/host/run/systemd/network/10-netplan-ens192.network;
                      grep KeepConfiguration=dhcp-on-stop "${CONFIG}" > /dev/null
                      if (( $? != 0 )) ; then
                        echo "Setting KeepConfiguration=dhcp-on-stop"
                        sed -i '/\[Network\]/a KeepConfiguration=dhcp-on-stop' "${CONFIG}"
                        cat "${CONFIG}"
                        chroot /host systemctl restart systemd-networkd
                      else
                        echo "KeepConfiguration=dhcp-on-stop has already been set"
                      fi;
                      sleep 3600
                    done
                  volumeMounts:
                  - name: host
                    mountPath: /host
                  resources:
                    requests:
                      memory: "10Mi"
                      cpu: "5m"
                  securityContext:
                    privileged: true
                volumes:
                - name: host
                  hostPath:
                    path: /
                tolerations:
                - operator: Exists
                  effect: NoExecute
                - operator: Exists
                  effect: NoSchedule
          

    Operation, upgrades and updates 1.12.0-1.12.5, 1.13.0-1.13.5, 1.14.0-1.14.1

    Component access service account key wiped out after admin cluster upgraded from 1.11.x

    This issue will only affect admin clusters which are upgraded from 1.11.x, and won't affect admin clusters which are newly created after 1.12.

    After upgrading a 1.11.x cluster to 1.12.x, the component-access-sa-key field in the admin-cluster-creds secret is wiped out. You can check this by running the following command:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'
    If the output is empty, the key has been wiped out.

    After the component access service account key has been deleted, installing new user clusters or upgrading existing user clusters fails. The following lists some error messages you might encounter:

    • Slow validation preflight failure with error message: "Failed to create the test VMs: failed to get service account key: service account is not configured."
    • gkectl prepare fails with error message: "Failed to prepare OS images: dialing: unexpected end of JSON input"
    • If you are upgrading a 1.13 user cluster using the Google Cloud Console or the gcloud CLI, when you run gkectl update admin --enable-preview-user-cluster-central-upgrade to deploy the upgrade platform controller, the command fails with the message: "failed to download bundle to disk: dialing: unexpected end of JSON input" (You can see this message in the status field in the output of kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get onprembundle -oyaml).


    Workaround:

    Add the component access service account key back into the secret manually by running the following command:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -ojson | jq --arg casa "$(cat COMPONENT_ACCESS_SERVICE_ACCOUNT_KEY_PATH | base64 -w 0)" '.data["component-access-sa-key"]=$casa' | kubectl --kubeconfig ADMIN_KUBECONFIG apply -f -

    Operation 1.13.0+, 1.14.0+

    Cluster autoscaler does not work when Controlplane V2 is enabled

    For user clusters created with Controlplane V2 or a new installation model, node pools with autoscaling enabled always use their autoscaling.minReplicas in the user-cluster.yaml. The log of the cluster-autoscaler pod also shows that they are unhealthy.

      > kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
      logs $CLUSTER_AUTOSCALER_POD --container cluster-autoscaler
     TIMESTAMP  1 gkeonprem_provider.go:73] error getting onpremusercluster ready status: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
     TIMESTAMP 1 static_autoscaler.go:298] Failed to get node infos for groups: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
      
    The cluster autoscaler pod can be found by running the following commands.
      > kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
       get pods | grep cluster-autoscaler
    cluster-autoscaler-5857c74586-txx2c                          4648017n    48076Ki    30s
      


    Workaround:

    Disable autoscaling in all the node pools with `gkectl update cluster` until you upgrade to a version with the fix.

    Installation 1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0

    CIDR is not allowed in the IP block file

    If you use CIDR notation in the IP block file, config validation fails with the following error:

    - Validation Category: Config Check
        - [FAILURE] Config: AddressBlock for admin cluster spec is invalid: invalid IP:
    172.16.20.12/30
      


    Workaround:

    Include individual IPs in the IP block file until upgrading to a version with the fix: 1.12.5, 1.13.4, 1.14.1+.
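
    Listing the individual IPs of a block by hand is error-prone, so the expansion can be scripted. The following is a minimal sketch: expand_cidr is a hypothetical helper, not a gkectl feature. It prints every address in the block, so remove any that are not assignable on your network before adding the rest to the IP block file.

    ```shell
    # Hypothetical helper: prints every IPv4 address in a CIDR block, one per line.
    # The body runs in a subshell, so its variables do not leak into the caller.
    expand_cidr() (
      ip=${1%/*}
      prefix=${1#*/}
      # Split the dotted quad into the positional parameters $1..$4.
      old_ifs=$IFS
      IFS=.
      set -- $ip
      IFS=$old_ifs
      start=$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
      count=$(( 1 << (32 - prefix) ))
      # Align to the first address of the block.
      start=$(( start & ~(count - 1) ))
      i=0
      while [ "$i" -lt "$count" ]; do
        n=$(( start + i ))
        echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
        i=$(( i + 1 ))
      done
    )

    # Prints 172.16.20.12 through 172.16.20.15, one address per line.
    expand_cidr 172.16.20.12/30
    ```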

    Upgrades and updates 1.14.0-1.14.1

    OS image type update in the admin-cluster.yaml doesn't wait for user control plane machines to be re-created

    When you update the control plane OS image type in the admin-cluster.yaml, and the corresponding user cluster was created with Controlplane V2, the user control plane machines may not have finished their re-creation when the gkectl command finishes.


    Workaround:

    After the update is finished, keep waiting for the user control plane machines to also finish their re-creation by monitoring their node OS image types using kubectl --kubeconfig USER_KUBECONFIG get nodes -owide. For example, if updating from Ubuntu to COS, wait for all the control plane machines to completely change from Ubuntu to COS even after the update command is complete.
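
    The check can also be scripted instead of reading the node table by eye. The following is a minimal sketch: nodes_pending_os_change is a hypothetical helper, and the expected OS image text in the example ("Container-Optimized OS") is an assumption that you should compare against what your nodes actually report. It lists every node it is given, not only control plane nodes.

    ```shell
    # Hypothetical helper: reads "NODE_NAME<TAB>OS_IMAGE" lines on stdin and prints
    # the names of nodes whose OS image does not contain the expected text ($1).
    nodes_pending_os_change() {
      awk -F '\t' -v want="$1" 'index($2, want) == 0 { print $1 }'
    }

    # Example (USER_KUBECONFIG is a placeholder; the expected text is an assumption):
    #   kubectl --kubeconfig USER_KUBECONFIG get nodes \
    #     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' \
    #     | nodes_pending_os_change "Container-Optimized OS"
    ```

    When the function prints nothing, every listed node reports the expected OS image.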

    Operation 1.14.0

    Pod create or delete errors due to Calico CNI service account auth token issue

    An issue with Calico in Anthos clusters on VMware 1.14.0 causes Pod creation and deletion to fail with the following error message in the output of kubectl describe pods:

      error getting ClusterInformation: connection is unauthorized: Unauthorized
      

    This issue is only observed 24 hours after the cluster is created or upgraded to 1.14 using Calico.

    Admin clusters always use Calico. For user clusters, the `enableDataPlaneV2` field in user-cluster.yaml determines the CNI: if that field is set to `false` or is not specified, the user cluster uses Calico.

    The nodes' install-cni container creates a kubeconfig with a token that is valid for 24 hours. This token needs to be periodically renewed by the calico-node Pod. The calico-node Pod is unable to renew the token as it doesn't have access to the directory that contains the kubeconfig file on the node.


    Workaround:

    To mitigate the issue, apply the following patch on the calico-node DaemonSet in your admin and user cluster:

      kubectl -n kube-system get daemonset calico-node \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG -o json \
        | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
        | kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f -
    
      kubectl -n kube-system get daemonset calico-node \
        --kubeconfig USER_CLUSTER_KUBECONFIG -o json \
        | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
        | kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f -
      
    Replace the following:
    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
    • USER_CLUSTER_KUBECONFIG: the path of the user cluster kubeconfig file.
    Installation 1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0

    IP block validation fails when using CIDR

    Cluster creation fails even though the configuration is correct. The failure reports that the cluster does not have enough IPs.


    Workaround:

    Split CIDRs into several smaller CIDR blocks. For example, 10.0.0.0/30 becomes 10.0.0.0/31 and 10.0.0.2/31. As long as there are N+1 CIDRs, where N is the number of nodes in the cluster, this should suffice.
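
    The split can be computed rather than worked out by hand. The following is a minimal sketch: split_cidr is a hypothetical helper, not a gkectl feature, and it only handles IPv4 blocks.

    ```shell
    # Hypothetical helper: splits an IPv4 CIDR block into equal sub-blocks with the
    # given longer prefix length and prints them one per line.
    split_cidr() (
      ip=${1%/*}
      prefix=${1#*/}
      new_prefix=$2
      # A sub-block cannot be larger than the block it comes from.
      [ "$new_prefix" -ge "$prefix" ] || exit 1
      old_ifs=$IFS
      IFS=.
      set -- $ip
      IFS=$old_ifs
      total=$(( 1 << (32 - prefix) ))
      size=$(( 1 << (32 - new_prefix) ))
      # Integer form of the first address in the original block.
      base=$(( (($1 << 24) + ($2 << 16) + ($3 << 8) + $4) & ~(total - 1) ))
      offset=0
      while [ "$offset" -lt "$total" ]; do
        n=$(( base + offset ))
        echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))/$new_prefix"
        offset=$(( offset + size ))
      done
    )

    # Prints 10.0.0.0/31 and 10.0.0.2/31, matching the example above.
    split_cidr 10.0.0.0/30 31
    ```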

    Operation, Upgrades and updates 1.11.0 - 1.11.1, 1.10.0 - 1.10.4, 1.9.0 - 1.9.6

    Admin cluster backup does not include the always-on secrets encryption keys and configuration

    When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by the always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup causes the following error:

    Validating admin master VM xxx ...
    Waiting for kube-apiserver to be accessible via LB VIP (timeout "8m0s")...  ERROR
    Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
    Waiting for kube-apiserver to be accessible via LB VIP (timeout "13m0s")...  ERROR
    Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
    Waiting for kube-apiserver to be accessible via LB VIP (timeout "18m0s")...  ERROR
    Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
    

    Operation, Upgrades and updates 1.10+

    Recreating the admin master VM with a new boot disk (e.g., gkectl repair admin-master) will fail if the always-on secrets encryption feature is enabled using `gkectl update` command.

    If the always-on secrets encryption feature is not enabled at cluster creation, but is enabled later using the gkectl update operation, then gkectl repair admin-master fails to repair the admin cluster control plane node. It is recommended that the always-on secrets encryption feature is enabled at cluster creation. There is no current mitigation.

    Upgrades and updates 1.10

    Upgrading the first user cluster from 1.9 to 1.10 recreates nodes in other user clusters

    Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.

    The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables, which triggered an update on all MachineDeployments unexpectedly.


    Workaround:

    Upgrades and updates 1.10.0

    Docker restarts frequently after cluster upgrade

    Upgrading a user cluster to 1.10.0 might cause Docker to restart frequently.

    You can detect this issue by running kubectl describe node NODE_NAME --kubeconfig USER_CLUSTER_KUBECONFIG.

    A node condition shows whether Docker restarts frequently. Here is an example output:

    Normal   FrequentDockerRestart    41m (x2 over 141m)     systemd-monitor  Node condition FrequentDockerRestart is now: True, reason: FrequentDockerRestart
    

    To understand the root cause, you need to SSH to the node that has the symptom and run commands like sudo journalctl --utc -u docker or sudo journalctl -x.


    Workaround:

    Upgrades and updates 1.11, 1.12

    Self-deployed GMP components not preserved after upgrading to version 1.12

    If you are using an Anthos clusters on VMware version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system namespace for your cluster, the components are not preserved when you upgrade to version 1.12.x.

    From version 1.12, GMP components in the gmp-system namespace and CRDs are managed by the stackdriver object, with the enableGMPForApplications flag set to false by default. If you manually deploy GMP components in the namespace prior to upgrading to 1.12, the resources will be deleted by stackdriver.


    Workaround:

    Operation 1.11, 1.12, 1.13.0 - 1.13.1

    Missing ClusterAPI objects in cluster snapshot system scenario

    In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.

    However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.


    Workaround:

    You can manually run the following commands to collect the debugging information.

    export KUBECONFIG=USER_CLUSTER_KUBECONFIG
    kubectl get clusters.cluster.k8s.io -o yaml
    kubectl get controlplanes.cluster.k8s.io -o yaml
    kubectl get machineclasses.cluster.k8s.io -o yaml
    kubectl get machinedeployments.cluster.k8s.io -o yaml
    kubectl get machines.cluster.k8s.io -o yaml
    kubectl get machinesets.cluster.k8s.io -o yaml
    kubectl get services -o yaml
    kubectl describe clusters.cluster.k8s.io
    kubectl describe controlplanes.cluster.k8s.io
    kubectl describe machineclasses.cluster.k8s.io
    kubectl describe machinedeployments.cluster.k8s.io
    kubectl describe machines.cluster.k8s.io
    kubectl describe machinesets.cluster.k8s.io
    kubectl describe services
    
    where:

    USER_CLUSTER_KUBECONFIG is the user cluster's kubeconfig file.

    Upgrades and updates 1.11.0-1.11.4, 1.12.0-1.12.3, 1.13.0-1.13.1

    User cluster deletion stuck at node drain for vSAN setup

    When deleting, updating or upgrading a user cluster, node drain may be stuck in the following scenarios:

    • The admin cluster has been using vSphere CSI driver on vSAN since version 1.12.x, and
    • There are no PVC/PV objects created by in-tree vSphere plugins in the admin and user cluster.

    To identify the symptom, run the command below:

    kubectl logs clusterapi-controllers-POD_NAME_SUFFIX  --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAMESPACE
    

    Here is a sample error message from the above command:

    E0920 20:27:43.086567 1 machine_controller.go:250] Error deleting machine object [MACHINE]; Failed to delete machine [MACHINE]: failed to detach disks from VM "[MACHINE]": failed to convert disk path "kubevols" to UUID path: failed to convert full path "ds:///vmfs/volumes/vsan:[UUID]/kubevols": ServerFaultCode: A general system error occurred: Invalid fault
    

    kubevols is the default directory for the vSphere in-tree driver. When there are no PVC/PV objects created, you may hit a bug where node drain is stuck at finding kubevols, since the current implementation assumes that kubevols always exists.


    Workaround:

    Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.

    Configuration 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14

    Cluster Autoscaler clusterrolebinding and clusterrole are deleted after deleting a user cluster.

    On user cluster deletion, the corresponding clusterrole and clusterrolebinding for cluster-autoscaler are also deleted. This affects all other user clusters on the same admin cluster with cluster autoscaler enabled. This is because the same clusterrole and clusterrolebinding are used for all cluster autoscaler pods within the same admin cluster.

    The symptoms are the following:

    • cluster-autoscaler logs
      kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
      cluster-autoscaler
      
      where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
      2023-03-26T10:45:44.866600973Z W0326 10:45:44.866463       1 reflector.go:424] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
      2023-03-26T10:45:44.866646815Z E0326 10:45:44.866494       1 reflector.go:140] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
      

    Workaround:

    Configuration 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

    admin cluster cluster-health-controller and vsphere-metrics-exporter do not work after deleting user cluster

    On user cluster deletion, the corresponding clusterrole is also deleted, which results in auto repair and the vSphere metrics exporter not working.

    The symptoms are the following:

    • cluster-health-controller logs
      kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
      cluster-health-controller
      
      where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
      error retrieving resource lock default/onprem-cluster-health-leader-election: configmaps "onprem-cluster-health-leader-election" is forbidden: User "system:serviceaccount:kube-system:cluster-health-controller" cannot get resource "configmaps" in API group "" in the namespace "default": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-health-controller-role" not found
      
    • vsphere-metrics-exporter logs
      kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
      vsphere-metrics-exporter
      
      where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
      vsphere-metrics-exporter/cmd/vsphere-metrics-exporter/main.go:68: Failed to watch *v1alpha1.Cluster: failed to list *v1alpha1.Cluster: clusters.cluster.k8s.io is forbidden: User "system:serviceaccount:kube-system:vsphere-metrics-exporter" cannot list resource "clusters" in API group "cluster.k8s.io" in the namespace "default"
      

    Workaround:

    Configuration 1.12.1-1.12.3, 1.13.0-1.13.2

    gkectl check-config fails at OS image validation

    The gkectl check-config command can fail if gkectl prepare has not been run first. This is confusing because we suggest running gkectl check-config before running gkectl prepare.

    The symptom is that the gkectl check-config command will fail with the following error message:

    Validator result: {Status:FAILURE Reason:os images [OS_IMAGE_NAME] don't exist, please run `gkectl prepare` to upload os images. UnhealthyResources:[]}
    

    Workaround:

    Option 1: run gkectl prepare to upload the missing OS images.

    Option 2: use gkectl check-config --skip-validation-os-images to skip the OS images validation.

    Upgrades and updates 1.11, 1.12, 1.13

    gkectl update admin/cluster fails at updating anti affinity groups

    The gkectl update admin/cluster command can fail when updating anti-affinity groups.

    The symptom is that the gkectl update command will fail with the following error message:

    Waiting for machines to be re-deployed...  ERROR
    Exit with error:
    Failed to update the cluster: timed out waiting for the condition
    

    Workaround:

    Installation, Upgrades and updates 1.13.0-1.13.8, 1.14.0-1.14.4, 1.15.0

    Nodes fail to register if configured hostname contains a period

    Node registration fails during cluster creation, upgrade, update and node auto repair, when ipMode.type is static and the configured hostname in the IP block file contains one or more periods. In this case, Certificate Signing Requests (CSR) for a node are not automatically approved.

    To see pending CSRs for a node, run the following command:

    kubectl get csr -A -o wide
    

    Check the following logs for error messages:

    • View the logs in the admin cluster for the clusterapi-controller-manager container in the clusterapi-controllers Pod:
      kubectl logs clusterapi-controllers-POD_NAME \
          -c clusterapi-controller-manager -n kube-system \
          --kubeconfig ADMIN_CLUSTER_KUBECONFIG
      
    • To view the same logs in the user cluster, run the following command:
      kubectl logs clusterapi-controllers-POD_NAME \
          -c clusterapi-controller-manager -n USER_CLUSTER_NAME \
          --kubeconfig ADMIN_CLUSTER_KUBECONFIG
      
      where:
      • ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file.
      • USER_CLUSTER_NAME is the name of the user cluster.
      Here is an example of error messages you might see: "msg"="failed to validate token id" "error"="failed to find machine for node node-worker-vm-1" "validate"="csr-5jpx9"
    • View the kubelet logs on the problematic node:
      journalctl -u kubelet
      
      Here is an example of error messages you might see: "Error getting node" err="node \"node-worker-vm-1\" not found"

    If you specify a domain name in the hostname field of an IP block file, any characters following the first period will be ignored. For example, if you specify the hostname as bob-vm-1.bank.plc, the VM hostname and node name will be set to bob-vm-1.

    When node ID verification is enabled, the CSR approver compares the node name with the hostname in the Machine spec, and fails to reconcile the name. The approver rejects the CSR, and the node fails to bootstrap.
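
    The truncation rule described above is easy to check against the hostnames in an IP block file before creating or updating a cluster. The following is a minimal sketch: both function names are hypothetical, and the logic only mirrors the documented behavior that everything after the first period is ignored.

    ```shell
    # Hypothetical helpers illustrating the hostname truncation described above.

    # Prints the node name derived from a configured hostname: everything from the
    # first period onward is dropped.
    effective_node_name() {
      printf '%s\n' "${1%%.*}"
    }

    # Succeeds (exit status 0) when the configured hostname contains a period, so
    # the node name would differ from the hostname in the Machine spec.
    hostname_is_affected() {
      [ "$(effective_node_name "$1")" != "$1" ]
    }

    effective_node_name bob-vm-1.bank.plc   # prints bob-vm-1
    ```

    A hostname for which hostname_is_affected succeeds triggers this issue on the identified versions when node ID verification is enabled.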


    Workaround:

    User cluster

    Disable node ID verification by completing the following steps:

    1. Add the following fields in your user cluster configuration file:
      disableNodeIDVerification: true
      disableNodeIDVerificationCSRSigning: true
      
    2. Save the file, and update the user cluster by running the following command:
      gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          --config USER_CLUSTER_CONFIG_FILE
      
      Replace the following:
      • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
      • USER_CLUSTER_CONFIG_FILE: the path of your user cluster configuration file.

    Admin cluster

    1. Open the OnPremAdminCluster custom resource for editing:
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          edit onpremadmincluster -n kube-system
      
    2. Add the following annotation to the custom resource:
      features.onprem.cluster.gke.io/disable-node-id-verification: enabled
      
    3. Edit the kube-controller-manager manifest in the admin cluster control plane:
      1. SSH into the admin cluster control plane node.
      2. Open the kube-controller-manager manifest for editing:
        sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
        
      3. Find the list of controllers:
        --controllers=*,bootstrapsigner,tokencleaner,-csrapproving,-csrsigning
        
      4. Update this section as shown below:
        --controllers=*,bootstrapsigner,tokencleaner
        
    4. Open the Cluster API controller Deployment for editing:
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          edit deployment clusterapi-controllers -n kube-system
      
    5. Change the values of node-id-verification-enabled and node-id-verification-csr-signing-enabled to false:
      --node-id-verification-enabled=false
      --node-id-verification-csr-signing-enabled=false
      
    Installation, Upgrades and updates 1.11.0-1.11.4

    Admin control plane machine startup failure caused by private registry certificate bundle

    The admin cluster creation/upgrade is stuck at the following log message and eventually times out:

    Waiting for Machine gke-admin-master-xxxx to become ready...
    

    The Cluster API controller log in the external cluster snapshot includes the following log:

    Invalid value 'XXXX' specified for property startup-data
    

    Here is an example file path for the Cluster API controller log:

    kubectlCommands/kubectl_logs_clusterapi-controllers-c4fbb45f-6q6g6_--container_vsphere-controller-manager_--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_--request-timeout_30s_--namespace_kube-system_--timestamps
        

    VMware has a 64k vApp property size limit. In the identified versions, the data passed via vApp property is close to the limit. When the private registry certificate contains a certificate bundle, it may cause the final data to exceed the 64k limit.


    Workaround:

    Only include the required certificates in the private registry certificate file configured in privateRegistry.caCertPath in the admin cluster config file.

    Or upgrade to a version with the fix when available.

    Networking 1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0

    NetworkGatewayNodes marked unhealthy from concurrent status update conflict

    In networkgatewaygroups.status.nodes, some nodes switch between NotHealthy and Up.

    Logs for the ang-daemon Pod running on that node reveal repeated errors:

    2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}
    

    The NotHealthy status prevents the controller from assigning additional floating IPs to the node. This can result in higher burden on other nodes or a lack of redundancy for high availability.

    Dataplane activity is otherwise not affected.

    Contention on the networkgatewaygroup object causes some status updates to fail due to a fault in retry handling. If too many status updates fail, ang-controller-manager sees the node as past its heartbeat time limit and marks the node NotHealthy.

    The fault in retry handling has been fixed in later versions.


    Workaround:

    Upgrade to a fixed version, when available.

    Upgrades and updates 1.12.0-1.12.2, 1.13.0

    Race condition blocks machine object deletion during an update or upgrade

    A known issue can cause a cluster upgrade or update to be stuck waiting for the old machine object to be deleted. This is because the finalizer cannot be removed from the machine object. This affects any rolling update operation for node pools.

    The symptom is that the gkectl command times out with the following error message:

    E0821 18:28:02.546121   61942 console.go:87] Exit with error:
    E0821 18:28:02.546184   61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated
    Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
    

    The clusterapi-controller Pod logs show errors like the following:

    $ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1 \
        -c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG] \
        | grep "Error removing finalizer from machine object"
    [...]
    E0821 23:19:45.114993       1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again
    

    Even in successful runs without this issue, the error can repeat for the same machine for several minutes. In most cases the finalizer removal goes through quickly, but in rare cases it can be stuck in this race condition for several hours.

    The underlying VM is already deleted in vCenter, but the corresponding machine object can't be removed because the finalizer removal is blocked by very frequent updates from other controllers. This can cause the gkectl command to time out, but the controller keeps reconciling the cluster, so the upgrade or update process eventually completes.


    Workaround:

    Several mitigation options are available for this issue. Which one to choose depends on your environment and requirements.

    • Option 1: Wait for the upgrade to eventually complete by itself.

      Based on analysis and reproduction of this issue, the upgrade can eventually finish by itself without any manual intervention. The caveat is that it's uncertain how long the finalizer removal takes for each machine object. It can go through immediately, or it can take several hours if the machineset controller reconciles too quickly and the machine controller never gets a chance to remove the finalizer between reconciliations.

      The advantage of this option is that it requires no action on your side and doesn't disrupt workloads. The upgrade just takes longer to finish.
    • Option 2: Apply auto repair annotation to all the old machine objects.

      The machineset controller filters out machines that have the auto repair annotation and a non-zero deletion timestamp, and stops issuing delete calls on those machines. This helps avoid the race condition.

      The downside is that the Pods on the machines are deleted directly instead of evicted, so the PodDisruptionBudget (PDB) configuration isn't respected. This might cause downtime for your workloads.

      The command for getting all machine names:
      kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
      
      The command for applying auto repair annotation for each machine:
      kubectl annotate --kubeconfig CLUSTER_KUBECONFIG \
          machine MACHINE_NAME \
          onprem.cluster.gke.io/repair-machine=true
      

    If you encounter this issue and the upgrade or update still can't complete after a long time, contact our support team for mitigations.
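
    For Option 2, the annotate command has to be repeated for each old machine. The following sketch builds those command lines from the output of kubectl get machines -o json. It assumes that the old machines are the ones with a deletion timestamp set, which follows from the description above but should be confirmed against your cluster; the function names are hypothetical.

```python
import json

# Annotation taken from the workaround above.
ANNOTATION = "onprem.cluster.gke.io/repair-machine=true"


def machines_pending_deletion(machine_list_json: str) -> list:
    """Return names of machines that have a deletion timestamp set."""
    items = json.loads(machine_list_json).get("items", [])
    return [
        m["metadata"]["name"]
        for m in items
        if m["metadata"].get("deletionTimestamp")
    ]


def annotate_commands(names, kubeconfig="CLUSTER_KUBECONFIG"):
    """Build one kubectl annotate command line per machine name."""
    return [
        f"kubectl annotate --kubeconfig {kubeconfig} machine {name} {ANNOTATION}"
        for name in names
    ]
```

    The sketch only prints candidates. Review the list before running any of the generated commands, because annotated machines have their Pods deleted without eviction.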

    Installation, Upgrades and updates 1.10.2, 1.11, 1.12, 1.13

    gkectl prepare OS image validation preflight failure

    gkectl prepare command failed with:

    - Validation Category: OS Images
        - [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.
    

    The preflight checks of gkectl prepare included an incorrect validation.


    Workaround:

    Run the same command with an additional flag --skip-validation-os-images.

    Installation 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

    vCenter URL with https:// or http:// prefix may cause cluster startup failure

    Admin cluster creation failed with:

    Exit with error:
    Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message:
    Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid:
    [data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
    (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]:
    Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
    (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]
    

    The URL is used as part of a Secret key, which doesn't support "/" or ":".


    Workaround:

    Remove the https:// or http:// prefix from the vCenter.Address field in the admin cluster or user cluster configuration YAML file.
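
    The following sketch shows why the prefix breaks the Secret key and what a cleaned address looks like. It reuses the validation regex quoted in the error message; the function names and the example host name are hypothetical.

```python
import re

# Same character set as the regex quoted in the error message above.
VALID_KEY = re.compile(r"[-._a-zA-Z0-9]+")


def strip_scheme(address: str) -> str:
    """Remove a leading http:// or https:// and any trailing slash."""
    cleaned = re.sub(r"^https?://", "", address.strip(), flags=re.IGNORECASE)
    return cleaned.rstrip("/")


def is_valid_key_prefix(address: str) -> bool:
    """Check that the address only uses characters allowed in a Secret key."""
    return VALID_KEY.fullmatch(address) is not None


print(is_valid_key_prefix("https://vcenter.example.com"))                # False
print(is_valid_key_prefix(strip_scheme("https://vcenter.example.com")))  # True
```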

    Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

    gkectl prepare panic on util.CheckFileExists

    gkectl prepare can panic with the following stacktrace:

    panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xde0dfa]
    
    goroutine 1 [running]:
    gke-internal.googlesource.com/syllogi/cluster-management/pkg/util.CheckFileExists(0xc001602210, 0x2b, 0xc001602210, 0x2b) pkg/util/util.go:226 +0x9a
    gke-internal.googlesource.com/syllogi/cluster-management/gkectl/pkg/config/util.SetCertsForPrivateRegistry(0xc000053d70, 0x10, 0xc000f06f00, 0x4b4, 0x1, 0xc00015b400) gkectl/pkg/config/util/utils.go:75 +0x85
    ...
    

    The issue is that gkectl prepare created the private registry certificate directory with a wrong permission.


    Workaround:

    To fix this issue, please run the following commands on the admin workstation:

    sudo mkdir -p /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
    sudo chmod 0755 /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
    
    Upgrades and updates 1.10, 1.11, 1.12, 1.13

    gkectl repair admin-master and resumable admin upgrade do not work together

    After a failed admin cluster upgrade attempt, don't run gkectl repair admin-master. Doing so may cause subsequent admin upgrade attempts to fail with issues such as admin master power on failure or the VM being inaccessible.


    Workaround:

    If you've already encountered this failure scenario, contact support.

    Upgrades and updates 1.10, 1.11

    Resumed admin cluster upgrade can lead to missing admin control plane VM template

    If the admin control plane machine isn't recreated after a resumed admin cluster upgrade attempt, the admin control plane VM template is deleted. The admin control plane VM template is the template of the admin master that is used to recover the control plane machine with gkectl repair admin-master.


    Workaround:

    The admin control plane VM template will be regenerated during the next admin cluster upgrade.

    Operating system 1.12, 1.13

    cgroup v2 could affect workloads

    In version 1.12.0, cgroup v2 (unified) is enabled by default for Container Optimized OS (COS) nodes. This could potentially cause instability for your workloads in a COS cluster.


    Workaround:

    We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are using COS nodes, we recommend that you upgrade to version 1.12.1 as soon as it is released.

    Identity 1.10, 1.11, 1.12, 1.13

    ClientConfig custom resource

    gkectl update reverts any manual changes that you have made to the ClientConfig custom resource.


    Workaround:

    We strongly recommend that you back up the ClientConfig resource after every manual change.

    Installation 1.10, 1.11, 1.12, 1.13

    gkectl check-config validation fails: can't find F5 BIG-IP partitions

    Validation fails because F5 BIG-IP partitions can't be found, even though they exist.

    An issue with the F5 BIG-IP API can cause validation to fail.


    Workaround:

    Try running gkectl check-config again.

    Installation 1.12

    User cluster installation failed because of cert-manager/ca-injector's leader election issue

    You might see an installation failure due to cert-manager-cainjector in crashloop, when the apiserver/etcd is slow:

    # These are logs from `cert-manager-cainjector`, from the command
    # `kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system \
    #   logs cert-manager-cainjector-xxx`
    
    I0923 16:19:27.911174       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
    
    E0923 16:19:27.911110       1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core:
      Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
    
    I0923 16:19:27.911593       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
    
    E0923 16:19:27.911629       1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
    

    Workaround:

    VMware 1.10, 1.11, 1.12, 1.13

    Restarting or upgrading vCenter for versions lower than 7.0U2

    If vCenter, for versions lower than 7.0U2, is restarted (after an upgrade or otherwise), the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired, which creates new ones.

    Related govmomi bug.


    Workaround:

    This workaround is provided by VMware support:

    1. The issue is fixed in vCenter versions 7.0U2 and above.
    2. For lower versions, right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.
    Operating system 1.10, 1.11, 1.12, 1.13

    SSH connection closed by remote host

    For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

    To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:

    ClientAliveInterval 300
    ClientAliveCountMax 0
    

    The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command. Your command could then be terminated with the following message:

    Connection to [IP] closed by remote host.
    Connection to [IP] closed.
    

    Workaround:

    You can either:

    • Use nohup to prevent your command from being terminated on SSH disconnection:
      nohup gkectl upgrade admin --config admin-cluster.yaml \
          --kubeconfig kubeconfig
      
    • Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value less than 3:
      sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' \
          /etc/ssh/sshd_config
      sudo systemctl restart sshd
      

    Make sure you reconnect your SSH session.

    Installation 1.10, 1.11, 1.12, 1.13

    Conflicting cert-manager installation

    In 1.13 releases, monitoring-operator installs cert-manager in the cert-manager namespace. If you need to install your own cert-manager, use the following instructions to avoid conflicts.

    You only need to apply this workaround once for each cluster, and the changes are preserved across cluster upgrades.

    Note: One common symptom of installing your own cert-manager is that the cert-manager version or image (for example v1.7.2) may revert back to its older version. This is caused by monitoring-operator trying to reconcile the cert-manager, and reverting the version in the process.

    Workaround:

    Avoid conflicts during upgrade

    1. Uninstall your version of cert-manager. If you defined your own resources, you might want to back them up.
    2. Perform the upgrade.
    3. Use the following instructions to restore your own cert-manager.

    Restore your own cert-manager in user clusters

    • Scale the monitoring-operator Deployment to 0:
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          -n USER_CLUSTER_NAME \
          scale deployment monitoring-operator --replicas=0
      
    • Scale the cert-manager deployments managed by monitoring-operator to 0:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager --replicas=0
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager-cainjector \
          --replicas=0
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager-webhook --replicas=0
      
    • Reinstall your version of cert-manager. Restore your customized resources, if you have any.
    • You can skip this step if you are using the upstream default cert-manager installation, or if you are sure that your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
      relevant_fields='
      {
        apiVersion: .apiVersion,
        kind: .kind,
        metadata: {
          name: .metadata.name,
          namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
        },
        spec: .spec
      }
      '
      f1=$(mktemp)
      f2=$(mktemp)
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          get issuer -n cert-manager metrics-pki.cluster.local -o json \
          | jq "${relevant_fields}" > $f1
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          get certificate -n cert-manager metrics-ca -o json \
          | jq "${relevant_fields}" > $f2
      kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
      kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2
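
    The jq filter in relevant_fields keeps only apiVersion, kind, metadata.name, and spec, and sets the namespace to the target namespace. Dropping the remaining metadata and status lets the copied resource be applied as a new object. The following sketch shows the same projection on a parsed JSON document. The function name is hypothetical, and the sample Issuer spec is illustrative only, not taken from a real cluster.

```python
import json


def project_resource(resource: dict, namespace: str) -> dict:
    """Keep apiVersion, kind, name, and spec; set the target namespace."""
    return {
        "apiVersion": resource.get("apiVersion"),
        "kind": resource.get("kind"),
        "metadata": {
            "name": resource.get("metadata", {}).get("name"),
            "namespace": namespace,
        },
        "spec": resource.get("spec"),
    }


# Hypothetical input, shaped like the output of `kubectl get issuer ... -o json`.
issuer = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Issuer",
    "metadata": {
        "name": "metrics-pki.cluster.local",
        "namespace": "cert-manager",
        "resourceVersion": "12345",
        "uid": "hypothetical-uid",
    },
    "spec": {"ca": {"secretName": "metrics-ca"}},
    "status": {"conditions": []},
}
print(json.dumps(project_resource(issuer, "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"), indent=2))
```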
      

    Restore your own cert-manager in admin clusters

    In general, you shouldn't need to reinstall cert-manager in admin clusters because admin clusters only run Anthos clusters on VMware control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, use the following instructions to avoid conflicts. If you are an Apigee customer and you only need cert-manager for Apigee, you don't need to run the admin cluster commands.

    • Scale the monitoring-operator deployment to 0.
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          -n kube-system scale deployment monitoring-operator --replicas=0
      
    • Scale the cert-manager deployments managed by monitoring-operator to 0.
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager \
          --replicas=0
      
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager-cainjector \
          --replicas=0
      
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          -n cert-manager scale deployment cert-manager-webhook \
          --replicas=0
      
    • Reinstall your version of cert-manager. Restore your customized resources, if you have any.
    • You can skip this step if you are using the upstream default cert-manager installation, or if you are sure that your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
      relevant_fields='
      {
        apiVersion: .apiVersion,
        kind: .kind,
        metadata: {
          name: .metadata.name,
          namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
        },
        spec: .spec
      }
      '
      f3=$(mktemp)
      f4=$(mktemp)
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          get issuer -n cert-manager metrics-pki.cluster.local -o json \
          | jq "${relevant_fields}" > $f3
      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
          get certificate -n cert-manager metrics-ca -o json \
          | jq "${relevant_fields}" > $f4
      kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
      kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4
      
    Operating system 1.10, 1.11, 1.12, 1.13

    False positives in docker, containerd, and runc vulnerability scanning

    The Docker, containerd, and runc in the Ubuntu OS images shipped with Anthos clusters on VMware are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes will be qualified by Anthos clusters on VMware before each release.

    However, the special versions are unknown to the Ubuntu CVE Tracker, which is used as the vulnerability feed by various CVE scanning tools. Therefore, you will see false positives in Docker, containerd, and runc vulnerability scanning results.

    For example, you might see the following false positives from your CVE scanning results. These CVEs are already fixed in the latest patch versions of Anthos clusters on VMware.

    Refer to the release notes for any CVE fixes.


    Workaround:

    Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.

    Upgrades and updates 1.10, 1.11, 1.12, 1.13

    Network connection between admin and user cluster might be unavailable for a short time during non-HA cluster upgrade

    If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice that kubectl exec, kubectl logs, and webhooks against user clusters are unavailable for a short time. This downtime can be up to one minute. It happens because the incoming requests (kubectl exec, kubectl logs, and webhooks) are handled by the kube-apiserver for the user cluster. The user kube-apiserver is a StatefulSet, and in a non-HA cluster the StatefulSet has only one replica. During the upgrade, the old kube-apiserver might therefore be unavailable while the new kube-apiserver is not yet ready.


    Workaround:

    This downtime only happens during the upgrade process. If you want a shorter downtime during upgrades, we recommend that you switch to HA clusters.

    Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

    Konnectivity readiness check failed in HA cluster diagnose after cluster creation or upgrade

    If you are creating or upgrading an HA cluster and notice that the konnectivity readiness check failed in cluster diagnose, in most cases this doesn't affect the functionality of Anthos clusters on VMware (kubectl exec, kubectl logs, and webhooks). This happens because one or two of the konnectivity replicas might be unready for a period of time due to unstable networking or other issues.


    Workaround:

    Konnectivity recovers by itself. Wait 30 minutes to 1 hour, and then rerun cluster diagnose.

    Operating system 1.7, 1.8, 1.9, 1.10, 1.11

    /etc/cron.daily/aide CPU and memory spike issue

    Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

    As a result, the cron script /etc/cron.daily/aide is installed to schedule an aide check, which ensures that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.

    The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.


    Workaround:

    If the spikes are affecting your workload, you can disable the daily cron job:

    sudo chmod -x /etc/cron.daily/aide
    
    Networking 1.10, 1.11, 1.12, 1.13

    Load balancers and NSX-T stateful distributed firewall rules interact unpredictably

    When you deploy Anthos clusters on VMware version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap and cause gke-connect-agent Pods to be in a crash loop.

    The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Anthos clusters on VMware releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K.


    Workaround:

    Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.

    If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.

    Logging and monitoring 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16

    Unexpected monitoring billing

    For Anthos clusters on VMware versions 1.10 to latest, some customers have found unexpectedly high billing for Metrics volume on the Billing page. This issue affects you only when all of the following circumstances apply:

    • Application monitoring is enabled (enableStackdriverForApplications=true)
    • Managed Service for Prometheus is not enabled (enableGMPForApplications)
    • Application Pods have the prometheus.io/scrape=true annotation. (Installing Anthos Service Mesh can also add this annotation.)

    To confirm whether you are affected by this issue, list your user-defined metrics. If you see billing for unwanted metrics, then this issue applies to you.


    Workaround:

    If you are affected by this issue, we recommend that you upgrade your clusters to version 1.12 and switch to the new application monitoring solution, managed-service-for-prometheus, which addresses this issue. The new solution provides:

    • Separate flags to control the collection of application logs versus application metrics
    • Bundled Google Cloud Managed Service for Prometheus

    If you can't upgrade to version 1.12, use the following steps:

    1. Find the source Pods and Services that have the unwanted billed metrics:
      kubectl --kubeconfig KUBECONFIG \
        get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
      kubectl --kubeconfig KUBECONFIG get \
        services -A -o yaml | grep 'prometheus.io/scrape: "true"'
      
    2. Remove the prometheus.io/scrape=true annotation from the Pod or Service. If the annotation is added by Anthos Service Mesh, consider configuring Anthos Service Mesh without the Prometheus option, or turning off the Istio Metrics Merging feature.
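
    The grep commands in step 1 show only the matching annotation lines, not the objects that carry them. As a rough sketch, the following parses the JSON output of kubectl get pods -A -o json (or the equivalent command for services) and lists the namespace and name of each annotated object. The function name and sample data are hypothetical.

```python
import json

ANNOTATION_KEY = "prometheus.io/scrape"


def annotated_objects(list_json: str) -> list:
    """Return namespace/name for every item annotated with scrape "true"."""
    found = []
    for item in json.loads(list_json).get("items", []):
        meta = item.get("metadata", {})
        annotations = meta.get("annotations") or {}
        if annotations.get(ANNOTATION_KEY) == "true":
            found.append(f'{meta.get("namespace")}/{meta.get("name")}')
    return found


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "shop", "name": "cart-0",
                  "annotations": {"prometheus.io/scrape": "true"}}},
    {"metadata": {"namespace": "shop", "name": "db-0"}},
]})
print(annotated_objects(sample))
```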
    Installation 1.11, 1.12, 1.13

    Installer fails when creating vSphere datadisk

    The Anthos clusters on VMware installer can fail if custom roles are bound at the wrong permissions level.

    When the role binding is incorrect, creating a vSphere datadisk with govc hangs and the disk is created with a size equal to 0. To fix the issue, you should bind the custom role at the vSphere vCenter level (root).


    Workaround:

    If you want to bind the custom role at the DC level (or lower than root), you also need to bind the read-only role to the user at the root vCenter level.

    For more information on role creation, see vCenter user account privileges.

    Logging and monitoring 1.9.0-1.9.4, 1.10.0-1.10.1

    High network traffic to monitoring.googleapis.com

    You might see high network traffic to monitoring.googleapis.com, even in a new cluster that has no user workloads.

    This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.


    Workaround:

    Logging and monitoring 1.10, 1.11

    gke-metrics-agent has frequent CrashLoopBackOff errors

    For Anthos clusters on VMware version 1.10 and above, the gke-metrics-agent DaemonSet has frequent CrashLoopBackOff errors when enableStackdriverForApplications is set to true in the stackdriver object.


    Workaround:

    To mitigate this issue, disable application metrics collection by running the following commands. These commands will not disable application logs collection.

    1. To prevent the following changes from reverting, scale down stackdriver-operator:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system scale deploy stackdriver-operator \
          --replicas=0
      
      Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
    2. Open the gke-metrics-agent-conf ConfigMap for editing:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system edit configmap gke-metrics-agent-conf
      
    3. Under services.pipelines, comment out the entire metrics/app-metrics section:
      services:
        pipelines:
          #metrics/app-metrics:
          #  exporters:
          #  - googlecloud/app-metrics
          #  processors:
          #  - resource
          #  - metric_to_resource
          #  - infer_resource
          #  - disk_buffer/app-metrics
          #  receivers:
          #  - prometheus/app-metrics
          metrics/metrics:
            exporters:
            - googlecloud/metrics
            processors:
            - resource
            - metric_to_resource
            - infer_resource
            - disk_buffer/metrics
            receivers:
            - prometheus/metrics
      
    4. Close the editing session.
    5. Restart the gke-metrics-agent DaemonSet:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system rollout restart daemonset gke-metrics-agent
      
    Logging and monitoring 1.11, 1.12, 1.13

    Replace deprecated metrics in dashboard

    If deprecated metrics are used in your out-of-the-box (OOTB) dashboards, you will see some empty charts. To find deprecated metrics in the Monitoring dashboards, run the following commands:

    gcloud monitoring dashboards list > all-dashboard.json
    
    # find deprecated metrics
    grep \
      -e 'kube_daemonset_updated_number_scheduled' \
      -e 'kube_node_status_allocatable_cpu_cores' \
      -e 'kube_node_status_allocatable_pods' \
      -e 'kube_node_status_capacity_cpu_cores' \
      all-dashboard.json
    

    The following deprecated metrics should be migrated to their replacements.

    Deprecated Replacement
    kube_daemonset_updated_number_scheduled kube_daemonset_status_updated_number_scheduled
    kube_node_status_allocatable_cpu_cores kube_node_status_allocatable
    kube_node_status_allocatable_memory_bytes kube_node_status_allocatable
    kube_node_status_allocatable_pods kube_node_status_allocatable
    kube_node_status_capacity_cpu_cores kube_node_status_capacity
    kube_node_status_capacity_memory_bytes kube_node_status_capacity
    kube_node_status_capacity_pods kube_node_status_capacity
    kube_hpa_status_current_replicas kube_horizontalpodautoscaler_status_current_replicas
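
    For custom dashboards or alerting policies, the table above can be turned into a lookup that reports which deprecated names appear in an exported definition. This is a minimal sketch with a hypothetical function name. It only reports names; the merged kube_node_status_allocatable and kube_node_status_capacity metrics typically distinguish CPU, memory, and Pods by label, so queries need manual review rather than a blind text substitution.

```python
import re

# Deprecated name -> replacement, copied from the table above.
REPLACEMENTS = {
    "kube_daemonset_updated_number_scheduled": "kube_daemonset_status_updated_number_scheduled",
    "kube_node_status_allocatable_cpu_cores": "kube_node_status_allocatable",
    "kube_node_status_allocatable_memory_bytes": "kube_node_status_allocatable",
    "kube_node_status_allocatable_pods": "kube_node_status_allocatable",
    "kube_node_status_capacity_cpu_cores": "kube_node_status_capacity",
    "kube_node_status_capacity_memory_bytes": "kube_node_status_capacity",
    "kube_node_status_capacity_pods": "kube_node_status_capacity",
    "kube_hpa_status_current_replicas": "kube_horizontalpodautoscaler_status_current_replicas",
}


def find_deprecated(text: str) -> dict:
    """Map each deprecated metric found in the text to its replacement."""
    hits = {}
    for old, new in REPLACEMENTS.items():
        # Word boundaries avoid matching a longer metric name that only
        # contains the deprecated name as a prefix or suffix.
        if re.search(rf"\b{re.escape(old)}\b", text):
            hits[old] = new
    return hits
```

    For example, pass the contents of all-dashboard.json to find_deprecated to get the list of names to replace.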

    Workaround:

    To replace the deprecated metrics:

    1. Delete "GKE on-prem node status" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node status" following these instructions.
    2. Delete "GKE on-prem node utilization" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node utilization" following these instructions.
    3. Delete "GKE on-prem vSphere vm health" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem vSphere vm health" following these instructions.
    4. This deprecation is due to the upgrade of kube-state-metrics agent from v1.9 to v2.4, which is required for Kubernetes 1.22. You can replace all deprecated kube-state-metrics metrics, which have the prefix kube_, in your custom dashboards or alerting policies.

    Logging and monitoring 1.10, 1.11, 1.12, 1.13

    Unknown metric data in Cloud Monitoring

    For Anthos clusters on VMware version 1.10 and above, the data for clusters in Cloud Monitoring may contain irrelevant summary metrics entries such as the following:

    Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile
    

    Other metrics types that may have irrelevant summary metrics include:

    • apiserver_admission_step_admission_duration_seconds_summary
    • go_gc_duration_seconds
    • scheduler_scheduling_duration_seconds
    • gkeconnect_http_request_duration_seconds_summary
    • alertmanager_nflog_snapshot_duration_seconds_summary

    While these summary type metrics are in the metrics list, they are not supported by gke-metrics-agent at this time.

    Logging and monitoring 1.10, 1.11, 1.12, 1.13

    Missing metrics on some nodes

    You might find that the following metrics are missing on some, but not all, nodes:

    • kubernetes.io/anthos/container_memory_working_set_bytes
    • kubernetes.io/anthos/container_cpu_usage_seconds_total
    • kubernetes.io/anthos/container_network_receive_bytes_total

    Workaround:

    To fix this issue, perform the following steps as a workaround. For versions 1.9.5 and later, 1.10.2 and later, and 1.11.0, increase the CPU for gke-metrics-agent by following steps 1 to 4:

    1. Open your stackdriver resource for editing:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system edit stackdriver stackdriver
      
    2. To increase the CPU request for gke-metrics-agent from 10m to 50m and the CPU limit from 100m to 200m, add the following resourceAttrOverride section to the stackdriver manifest:
      spec:
        resourceAttrOverride:
          gke-metrics-agent/gke-metrics-agent:
            limits:
              cpu: 200m
              memory: 4608Mi
            requests:
              cpu: 50m
              memory: 200Mi
      
      Your edited resource should look similar to the following:
      spec:
        anthosDistribution: on-prem
        clusterLocation: us-west1-a
        clusterName: my-cluster
        enableStackdriverForApplications: true
        gcpServiceAccountSecretName: ...
        optimizedMetrics: true
        portable: true
        projectID: my-project-191923
        proxyConfigSecretName: ...
        resourceAttrOverride:
          gke-metrics-agent/gke-metrics-agent:
            limits:
              cpu: 200m
              memory: 4608Mi
            requests:
              cpu: 50m
              memory: 200Mi
      
    3. Save your changes and close the text editor.
    4. To verify your changes have taken effect, run the following command:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system get daemonset gke-metrics-agent -o yaml \
          | grep "cpu: 50m"
      
      The command finds cpu: 50m if your edits have taken effect.
    Logging and monitoring 1.11.0-1.11.2, 1.12.0

    Missing scheduler and controller-manager metrics in admin cluster

    If your admin cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

    # scheduler metric example
    scheduler_pending_pods
    # controller-manager metric example
    replicaset_controller_rate_limiter_use
    

    Workaround:

    Upgrade to v1.11.3+, v1.12.1+, or v1.13+.

    Logging and monitoring 1.11.0-1.11.2, 1.12.0

    Missing scheduler and controller-manager metrics in user cluster

    If your user cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

    # scheduler metric example
    scheduler_pending_pods
    # controller-manager metric example
    replicaset_controller_rate_limiter_use
    

    Workaround:

    This issue is fixed in Anthos clusters on VMware version 1.13.0 and later. Upgrade your cluster to a version with the fix.

    Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

    Failure to register admin cluster during creation

    If you create an admin cluster for version 1.9.x or 1.10.0, and the admin cluster fails to register with the provided gkeConnect spec during its creation, you get the following error:

    Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
    

    You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.

    failed to migrate to first admin trust chain: failed to parse current version "": invalid version: "" failed to migra