Anthos clusters on VMware


Category Identified version(s) Issue and workaround
Operation 1.8, 1.9, 1.10

Increased memory usage of etcd maintenance Pods

The etcd maintenance pods that use the etcddefrag:gke_master_etcddefrag_20210211.00_p0 image are affected. The `etcddefrag` container opens a new connection to the etcd server during each defrag cycle, and the old connections are not cleaned up.


Workaround:

Option 1: Upgrade to the latest patch version from 1.8 to 1.11, which contains the fix.

Option 2: If you are using a patch version earlier than 1.9.6 or 1.10.3, scale down the etcd-maintenance Pod for the admin and user clusters:

kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n kube-system --kubeconfig ADMIN_CLUSTER_KUBECONFIG
Operation 1.9, 1.10, 1.11, 1.12, 1.13

Missing health checks for user cluster control plane Pods

Both the cluster health controller and the gkectl diagnose cluster command perform a set of health checks, including Pod health checks across namespaces. However, they mistakenly skip the user control plane Pods. If you use control plane v2 mode, your cluster is not affected.


Workaround:

This won't affect any workload or cluster management. If you want to check the health of the control plane Pods, you can run the following command:

kubectl get pods -owide -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
Upgrades and updates 1.6+, 1.7+

1.6 and 1.7 admin cluster upgrades may be affected by the k8s.gcr.io -> registry.k8s.io redirect

Kubernetes redirected traffic from k8s.gcr.io to registry.k8s.io on 3/20/2023. In Anthos clusters on VMware 1.6.x and 1.7.x, admin cluster upgrades use the container image k8s.gcr.io/pause:3.2. If your admin workstation uses a proxy that doesn't allow registry.k8s.io, and the container image k8s.gcr.io/pause:3.2 is not cached locally, admin cluster upgrades fail when pulling the container image.


Workaround:

Add registry.k8s.io to the allowlist of the proxy for your admin workstation.
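
To confirm from the admin workstation that the proxy now allows the new registry, you can run a quick check. This is a minimal sketch; it assumes curl is available and that HTTPS_PROXY points at the proxy used by the workstation:

    # Expect an HTTP status line back from the registry rather than a proxy denial.
    curl -x "${HTTPS_PROXY}" -sI https://registry.k8s.io/v2/ | head -n 1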

Networking 1.10, 1.11, 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2

Seesaw validation failure on load balancer creation

gkectl create loadbalancer fails with the following error message:

- Validation Category: Seesaw LB - [FAILURE] Seesaw validation: xxx cluster lb health check failed: LB "xxx.xxx.xxx.xxx" is not healthy: Get "http://xxx.xxx.xxx.xxx:xxx/healthz": dial tcp xxx.xxx.xxx.xxx:xxx: connect: no route to host

This happens because the Seesaw group file already exists, and the preflight check tries to validate a non-existent Seesaw load balancer.

Workaround:

Remove the existing seesaw group file for this cluster. The file name is seesaw-for-gke-admin.yaml for the admin cluster, and seesaw-for-{CLUSTER_NAME}.yaml for a user cluster.
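
For example, assuming the group file sits in the directory on the admin workstation where you ran gkectl (the exact location depends on your setup), a sketch:

    # Admin cluster: remove the stale Seesaw group file, then re-run gkectl create loadbalancer.
    rm seesaw-for-gke-admin.yaml
    # User cluster: substitute your cluster name.
    rm seesaw-for-CLUSTER_NAME.yaml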

Networking 1.14

Application timeouts caused by conntrack table insertion failures

Anthos clusters on VMware version 1.14 is susceptible to netfilter connection tracking (conntrack) table insertion failures when using Ubuntu or COS operating system images. Insertion failures lead to random application timeouts and can occur even when the conntrack table has room for new entries. The failures are caused by changes in kernel 5.15 and higher that restrict table insertions based on chain length.

To see if you are affected by this issue, you can check the in-kernel connection tracking system statistics on each node with the following command:

sudo conntrack -S

The response looks like this:

cpu=0       found=0 invalid=4 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
cpu=1       found=0 invalid=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
cpu=2       found=0 invalid=16 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
cpu=3       found=0 invalid=13 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
cpu=4       found=0 invalid=9 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0 
cpu=5       found=0 invalid=1 insert=0 insert_failed=0 drop=0 early_drop=0 error=519 search_restart=0 clash_resolve=126 chaintoolong=0 
...

If a chaintoolong value in the response is a non-zero number, you're affected by this issue.

Workaround

The short-term mitigation is to increase the size of both the netfilter hash table (nf_conntrack_buckets) and the netfilter connection tracking table (nf_conntrack_max). Use the following commands on each cluster node to increase the size of the tables:

sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE

Replace TABLE_SIZE with the new table size. The default table size value is 262144. We suggest that you set a value equal to 65,536 times the number of cores on the node. For example, if your node has eight cores, set the table size to 524288.
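
For example, the following sketch derives the value from the node's core count and applies it; run it as root on each affected node:

    # Size the tables to 65,536 times the number of cores on this node.
    TABLE_SIZE=$(( $(nproc) * 65536 ))
    sysctl -w net.netfilter.nf_conntrack_buckets="${TABLE_SIZE}"
    sysctl -w net.netfilter.nf_conntrack_max="${TABLE_SIZE}"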

Networking 1.13.0-1.13.2

calico-typha or anetd-operator crash loop on Windows nodes with Controlplane v2

With Controlplane V2 or a new installation model, calico-typha or anetd-operator might be scheduled to Windows nodes and get into a crash loop.

The reason is that the two deployments tolerate all taints, including the Windows node taint.


Workaround:

Either upgrade to 1.13.3+, or run the following commands to edit the `calico-typha` or `anetd-operator` deployment:

    # If dataplane v2 is not used.
    kubectl edit deployment -n kube-system calico-typha --kubeconfig USER_CLUSTER_KUBECONFIG
    # If dataplane v2 is used.
    kubectl edit deployment -n kube-system anetd-operator --kubeconfig USER_CLUSTER_KUBECONFIG
    

Remove the following spec.template.spec.tolerations:

    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      operator: Exists
    

And add the following toleration:

    - key: node-role.kubernetes.io/master
      operator: Exists
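
After editing the deployment, you can verify that the Pods are no longer scheduled on Windows nodes. This check is a simple sketch that inspects the NODE column:

    kubectl get pods -n kube-system -o wide --kubeconfig USER_CLUSTER_KUBECONFIG \
        | grep -E 'calico-typha|anetd-operator'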
    
Configuration 1.14.0-1.14.2

User cluster private registry credential file cannot be loaded

You might not be able to create a user cluster if you specify the privateRegistry section with the credential fileRef. The preflight check might fail with the following message:

[FAILURE] Docker registry access: Failed to login.


Workaround:

  • If you did not intend to specify the field, or if you want to use the same private registry credential as the admin cluster, you can remove or comment out the privateRegistry section in your user cluster config file.
  • If you want to use a specific private registry credential for your user cluster, you can temporarily specify the privateRegistry section this way:
    privateRegistry:
      address: PRIVATE_REGISTRY_ADDRESS
      credentials:
        username: PRIVATE_REGISTRY_USERNAME
        password: PRIVATE_REGISTRY_PASSWORD
      caCertPath: PRIVATE_REGISTRY_CACERT_PATH
    
    (NOTE: This is only a temporary fix, and these fields are already deprecated. Consider using the credential file when upgrading to 1.14.3+.)

Operations 1.10+

Anthos Service Mesh and other service meshes not compatible with Dataplane v2

Dataplane V2 takes over load balancing and creates a kernel socket instead of a packet-based DNAT. This means that Anthos Service Mesh cannot do packet inspection, because the Pod is bypassed and never uses iptables.

In kube-proxy-free mode, this manifests as loss of connectivity or incorrect traffic routing for services that use Anthos Service Mesh, because the sidecar cannot do packet inspection.

This issue is present on all 1.10 versions of Anthos clusters on VMware; however, some newer 1.10 versions (1.10.2+) have a workaround.


Workaround:

Either upgrade to 1.11 for full compatibility or if running 1.10.2 or later, run:

    kubectl edit cm -n kube-system cilium-config --kubeconfig USER_CLUSTER_KUBECONFIG
    

Add bpf-lb-sock-hostns-only: true to the configmap and then restart the anetd daemonset:

      kubectl rollout restart ds anetd -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG
    

Storage 1.12+, 1.13.3

kube-controller-manager might detach persistent volumes forcefully after 6 minutes

kube-controller-manager might time out when detaching PVs/PVCs after 6 minutes and forcefully detach them. Detailed logs from kube-controller-manager show events similar to the following:

$ cat kubectl_logs_kube-controller-manager-xxxx | grep "DetachVolume started" | grep expired

kubectl_logs_kube-controller-manager-gke-admin-master-4mgvr_--container_kube-controller-manager_--kubeconfig_kubeconfig_--request-timeout_30s_--namespace_kube-system_--timestamps:2023-01-05T16:29:25.883577880Z W0105 16:29:25.883446       1 reconciler.go:224] attacherDetacher.DetachVolume started for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f"
This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching

To verify the issue, log into the node and run the following commands:

# See all the mounting points with disks
lsblk -f

# See some ext4 errors
sudo dmesg -T

In the kubelet log, errors like the following are displayed:

Error: GetDeviceMountRefs check failed for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" :
the device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount

Workaround:

Connect to the affected node using SSH and reboot the node.
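
For example (the SSH user, key, and NODE_IP placeholders depend on your environment; this is only a sketch):

    ssh NODE_USER@NODE_IP
    sudo reboot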

Upgrades and updates 1.12+, 1.13+, 1.14+

Cluster upgrade is stuck if 3rd party CSI driver is used

You might not be able to upgrade a cluster if you use a 3rd-party CSI driver. The gkectl diagnose cluster command might return the following error:

"virtual disk "kubernetes.io/csi/csi.netapp.io^pvc-27a1625f-29e3-4e4f-9cd1-a45237cc472c" IS NOT attached to machine "cluster-pool-855f694cc-cjk5c" but IS listed in the Node.Status"


Workaround:

Perform the upgrade using the --skip-validation-all option.
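
For example, for a user cluster upgrade this might look like the following sketch (adjust the command and config file for an admin cluster upgrade):

    gkectl upgrade cluster \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --config USER_CLUSTER_CONFIG_FILE \
        --skip-validation-all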

Operation 1.10+, 1.11+, 1.12+, 1.13+, 1.14+

gkectl repair admin-master creates the admin master VM without upgrading its VM hardware version

The admin master node created via gkectl repair admin-master may use a lower VM hardware version than expected. When the issue happens, you see the following error in the gkectl diagnose cluster report:

CSIPrerequisites [VM Hardware]: The current VM hardware versions are lower than vmx-15 which is unexpected. Please contact Anthos support to resolve this issue.


Workaround:

Shut down the admin master node, follow https://kb.vmware.com/s/article/1003746 to upgrade the node to the expected version described in the error message, and then start the node.
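
If you prefer the CLI to the vSphere UI and have govc configured against your vCenter, the upgrade step can be sketched as follows. This assumes the govc vm.upgrade command and its -version/-vm flags are available in your govc version, and that the VM is powered off:

    # ADMIN_MASTER_VM_NAME is the name of the admin master VM in vCenter.
    # -version=15 corresponds to the vmx-15 hardware version mentioned in the error message.
    govc vm.upgrade -version=15 -vm ADMIN_MASTER_VM_NAME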

Operating system 1.10+, 1.11+, 1.12+, 1.13+, 1.14+

VM releases DHCP lease on shutdown/reboot unexpectedly, which may result in IP changes

In systemd v244, systemd-networkd changed the default behavior of the KeepConfiguration setting. Before this change, VMs did not send a DHCP lease release message to the DHCP server on shutdown or reboot. After this change, VMs send such a message and return their IPs to the DHCP server. As a result, the released IP may be reallocated to a different VM, and/or a different IP may be assigned to the VM. This causes IP conflicts (at the Kubernetes level, not the vSphere level) and/or IP changes on the VMs, which can break the clusters in various ways.

For example, you may see the following symptoms.

  • vCenter UI shows that no VMs use the same IP, but kubectl get nodes -o wide returns nodes with duplicate IPs.
    NAME   STATUS    AGE  VERSION          INTERNAL-IP    EXTERNAL-IP    OS-IMAGE            KERNEL-VERSION    CONTAINER-RUNTIME
    node1  Ready     28h  v1.22.8-gke.204  10.180.85.130  10.180.85.130  Ubuntu 20.04.4 LTS  5.4.0-1049-gkeop  containerd://1.5.13
    node2  NotReady  71d  v1.22.8-gke.204  10.180.85.130  10.180.85.130  Ubuntu 20.04.4 LTS  5.4.0-1049-gkeop  containerd://1.5.13
  • New nodes fail to start due to calico-node error
    2023-01-19T22:07:08.817410035Z 2023-01-19 22:07:08.817 [WARNING][9] startup/startup.go 1135: Calico node 'node1' is already using the IPv4 address 10.180.85.130.
    2023-01-19T22:07:08.817514332Z 2023-01-19 22:07:08.817 [INFO][9] startup/startup.go 354: Clearing out-of-date IPv4 address from this node IP="10.180.85.130/24"
    2023-01-19T22:07:08.825614667Z 2023-01-19 22:07:08.825 [WARNING][9] startup/startup.go 1347: Terminating
    2023-01-19T22:07:08.828218856Z Calico node failed to start


Workaround:

Deploy the following DaemonSet on the cluster to revert the systemd-networkd default behavior change. The VMs that run this DaemonSet won't release their IPs to the DHCP server on shutdown or reboot. The IPs are freed automatically by the DHCP server when the leases expire.

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: set-dhcp-on-stop
      spec:
        selector:
          matchLabels:
            name: set-dhcp-on-stop
        template:
          metadata:
            labels:
              name: set-dhcp-on-stop
          spec:
            hostIPC: true
            hostPID: true
            hostNetwork: true
            containers:
            - name: set-dhcp-on-stop
              image: ubuntu
              tty: true
              command:
              - /bin/bash
              - -c
              - |
                set -x
                date
                while true; do
                  export CONFIG=/host/run/systemd/network/10-netplan-ens192.network;
                  grep KeepConfiguration=dhcp-on-stop "${CONFIG}" > /dev/null
                  if (( $? != 0 )) ; then
                    echo "Setting KeepConfiguration=dhcp-on-stop"
                    sed -i '/\[Network\]/a KeepConfiguration=dhcp-on-stop' "${CONFIG}"
                    cat "${CONFIG}"
                    chroot /host systemctl restart systemd-networkd
                  else
                    echo "KeepConfiguration=dhcp-on-stop has already been set"
                  fi;
                  sleep 3600
                done
              volumeMounts:
              - name: host
                mountPath: /host
              resources:
                requests:
                  memory: "10Mi"
                  cpu: "5m"
              securityContext:
                privileged: true
            volumes:
            - name: host
              hostPath:
                path: /
            tolerations:
            - operator: Exists
              effect: NoExecute
            - operator: Exists
              effect: NoSchedule
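
For example, assuming you saved the manifest above as set-dhcp-on-stop.yaml, you can apply it to the relevant cluster; CLUSTER_KUBECONFIG is the kubeconfig of the cluster whose nodes you want to patch:

    kubectl apply -f set-dhcp-on-stop.yaml --kubeconfig CLUSTER_KUBECONFIG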
      

Operation, upgrades and updates 1.12.0+, 1.13.0+, 1.14.0+

Component access service account key wiped out after admin cluster upgraded from 1.11.x

This issue only affects admin clusters that are upgraded from 1.11.x; it doesn't affect admin clusters that are newly created on 1.12 or later.

After upgrading a 1.11.x admin cluster to 1.12.x, the component-access-sa-key field in the admin-cluster-creds Secret is wiped out. You can check this by running the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'
If the output is empty, the key has been wiped out.

After the component access service account key has been deleted, installing new user clusters or upgrading existing user clusters fails. The following are some error messages you might encounter:

  • A failure in the slow validation preflight checks, with the error message: "Failed to create the test VMs: failed to get service account key: service account is not configured."
  • A gkectl prepare failure, with the error message: "Failed to prepare OS images: dialing: unexpected end of JSON input"
  • If you are upgrading a 1.13 user cluster using the Google Cloud Console or the gcloud CLI, when you run gkectl update admin --enable-preview-user-cluster-central-upgrade to deploy the upgrade platform controller, the command fails with the message: "failed to download bundle to disk: dialing: unexpected end of JSON input" (You can see this message in the status field in the output of kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get onprembundle -oyaml).


Workaround:

Add the component access service account key back into the secret manually by running the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -ojson | jq --arg casa "$(cat COMPONENT_ACCESS_SERVICE_ACCOUNT_KEY_PATH | base64 -w 0)" '.data["component-access-sa-key"]=$casa' | kubectl --kubeconfig ADMIN_KUBECONFIG apply -f -
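
After applying the patch, you can re-run the earlier check to confirm that the component-access-sa-key field is now populated:

kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'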

Operation 1.13.0+, 1.14.0+

Cluster autoscaler does not work when Controlplane V2 is enabled

For user clusters created with Controlplane V2 or a new installation model, node pools with autoscaling enabled always use their autoscaling.minReplicas from user-cluster.yaml. The logs of the cluster-autoscaler Pod also show that they are unhealthy:

  > kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
    logs $CLUSTER_AUTOSCALER_POD --container cluster-autoscaler
  TIMESTAMP 1 gkeonprem_provider.go:73] error getting onpremusercluster ready status: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
  TIMESTAMP 1 static_autoscaler.go:298] Failed to get node infos for groups: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster

The cluster autoscaler Pod can be found by running the following command:

  > kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
    get pods | grep cluster-autoscaler
  cluster-autoscaler-5857c74586-txx2c    4648017n    48076Ki    30s


Workaround:

Disable autoscaling in all the node pools with `gkectl update cluster` until you upgrade to a version with the fix.
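
A minimal sketch, assuming autoscaling is disabled by removing the autoscaling block from each node pool in user-cluster.yaml before running the update:

    # After removing the autoscaling section from every node pool in the config file:
    gkectl update cluster \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --config USER_CLUSTER_CONFIG_FILE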

Installation 1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0

CIDR is not allowed in the IP block file

If you use CIDR notation in the IP block file, the config validation fails with the following error:

- Validation Category: Config Check
    - [FAILURE] Config: AddressBlock for admin cluster spec is invalid: invalid IP:
172.16.20.12/30
  


Workaround:

Include individual IPs in the IP block file until upgrading to a version with the fix: 1.12.5, 1.13.4, 1.14.1+.
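
For example, a /30 block could be expanded into its individual addresses in the IP block file. The sketch below assumes the usual blocks/netmask/gateway/ips layout of the IP block file; the filename and addresses are illustrative only:

    cat > ipblock.yaml <<'EOF'
    blocks:
      - netmask: 255.255.255.0
        gateway: 172.16.20.1
        ips:
        - ip: 172.16.20.12
        - ip: 172.16.20.13
    EOF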

Upgrades and updates 1.14.0-1.14.1

OS image type update in the admin-cluster.yaml doesn't wait for user control plane machines to be re-created

When you update the control plane OS image type in admin-cluster.yaml, and the corresponding user cluster was created via Controlplane V2, the user control plane machines may not have finished their re-creation when the gkectl command finishes.


Workaround:

After the update is finished, keep waiting for the user control plane machines to finish their re-creation by monitoring their node OS image types with kubectl --kubeconfig USER_KUBECONFIG get nodes -owide. For example, when updating from Ubuntu to COS, wait for all the control plane machines to completely change from Ubuntu to COS even after the update command completes.

Operation 1.14.0

Pod create or delete errors due to Calico CNI service account auth token issue

An issue with Calico in Anthos clusters on VMware 1.14.0 causes Pod creation and deletion to fail with the following error message in the output of kubectl describe pods:

  error getting ClusterInformation: connection is unauthorized: Unauthorized
  

This issue is only observed 24 hours after the cluster is created or upgraded to 1.14 using Calico.

Admin clusters always use Calico. For user clusters, there is a config field `enableDataPlaneV2` in user-cluster.yaml; if that field is set to `false` or not specified, the user cluster uses Calico.

The nodes' install-cni container creates a kubeconfig with a token that is valid for 24 hours. This token needs to be periodically renewed by the calico-node Pod. The calico-node Pod is unable to renew the token as it doesn't have access to the directory that contains the kubeconfig file on the node.


Workaround:

To mitigate the issue, apply the following patch on the calico-node DaemonSet in your admin and user cluster:

  kubectl -n kube-system get daemonset calico-node \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG -o json \
    | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
    | kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f -

  kubectl -n kube-system get daemonset calico-node \
    --kubeconfig USER_CLUSTER_KUBECONFIG -o json \
    | jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
    | kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f -
  
Replace the following:
  • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
  • USER_CLUSTER_KUBECONFIG: the path of the user cluster kubeconfig file.
Installation 1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0

IP block validation fails when using CIDR

Cluster creation fails despite the user having the proper configuration; creation fails with an error that the cluster doesn't have enough IP addresses.


Workaround:

Split CIDRs into several smaller CIDR blocks; for example, 10.0.0.0/30 becomes 10.0.0.0/31 and 10.0.0.2/31. As long as there are N+1 CIDRs, where N is the number of nodes in the cluster, this should suffice.

Operation, Upgrades and updates 1.11.0 - 1.11.1, 1.10.0 - 1.10.4, 1.9.0 - 1.9.6

Admin cluster backup does not include the always-on secrets encryption keys and configuration

When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup causes the following error:

Validating admin master VM xxx ...
Waiting for kube-apiserver to be accessible via LB VIP (timeout "8m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "13m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "18m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master

Operation, Upgrades and updates 1.10+

Recreating the admin master VM with a new boot disk (for example, gkectl repair admin-master) fails if the always-on secrets encryption feature was enabled using the `gkectl update` command.

If the always-on secrets encryption feature is not enabled at cluster creation, but is enabled later using the gkectl update operation, then gkectl repair admin-master fails to repair the admin cluster control plane node. It is recommended that you enable the always-on secrets encryption feature at cluster creation. There is no current mitigation.

Upgrades and updates 1.10

Upgrading the first user cluster from 1.9 to 1.10 recreates nodes in other user clusters

Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.

The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables, which triggered an update on all MachineDeployments unexpectedly.


Workaround:

Upgrades and updates 1.10.0

Docker restarts frequently after cluster upgrade

Upgrading a user cluster to 1.10.0 might cause Docker to restart frequently.

You can detect this issue by running kubectl describe node NODE_NAME --kubeconfig USER_CLUSTER_KUBECONFIG.

A node condition shows whether Docker is restarting frequently. Here is an example output:

Normal   FrequentDockerRestart    41m (x2 over 141m)     systemd-monitor  Node condition FrequentDockerRestart is now: True, reason: FrequentDockerRestart

To understand the root cause, SSH to the node that has the symptom and run commands such as sudo journalctl --utc -u docker or sudo journalctl -x.


Workaround:

Upgrades and updates 1.11, 1.12

Self-deployed GMP components not preserved after upgrading to version 1.12

If you are using an Anthos clusters on VMware version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system namespace for your cluster, the components are not preserved when you upgrade to version 1.12.x.

From version 1.12, GMP components in the gmp-system namespace and CRDs are managed by stackdriver object, with the enableGMPForApplications flag set to false by default. If you manually deploy GMP components in the namespace prior to upgrading to 1.12, the resources will be deleted by stackdriver.


Workaround:

Operation 1.11, 1.12, 1.13.0 - 1.13.1

Missing ClusterAPI objects in cluster snapshot system scenario

In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.

However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.


Workaround:

You can manually run the following commands to collect the debugging information.

export KUBECONFIG=USER_CLUSTER_KUBECONFIG
kubectl get clusters.cluster.k8s.io -o yaml
kubectl get controlplanes.cluster.k8s.io -o yaml
kubectl get machineclasses.cluster.k8s.io -o yaml
kubectl get machinedeployments.cluster.k8s.io -o yaml
kubectl get machines.cluster.k8s.io -o yaml
kubectl get machinesets.cluster.k8s.io -o yaml
kubectl get services -o yaml
kubectl describe clusters.cluster.k8s.io
kubectl describe controlplanes.cluster.k8s.io
kubectl describe machineclasses.cluster.k8s.io
kubectl describe machinedeployments.cluster.k8s.io
kubectl describe machines.cluster.k8s.io
kubectl describe machinesets.cluster.k8s.io
kubectl describe services
where:

USER_CLUSTER_KUBECONFIG is the user cluster's kubeconfig file.

Upgrades and updates 1.11.0-1.11.4, 1.12.0-1.12.3, 1.13.0-1.13.1

User cluster deletion stuck at node drain for vSAN setup

When deleting, updating or upgrading a user cluster, node drain may be stuck in the following scenarios:

  • The admin cluster has been using vSphere CSI driver on vSAN since version 1.12.x, and
  • There are no PVC/PV objects created by in-tree vSphere plugins in the admin and user cluster.

To identify the symptom, run the command below:

kubectl logs clusterapi-controllers-POD_NAME_SUFFIX  --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAMESPACE

Here is a sample error message from the above command:

E0920 20:27:43.086567 1 machine_controller.go:250] Error deleting machine object [MACHINE]; Failed to delete machine [MACHINE]: failed to detach disks from VM "[MACHINE]": failed to convert disk path "kubevols" to UUID path: failed to convert full path "ds:///vmfs/volumes/vsan:[UUID]/kubevols": ServerFaultCode: A general system error occurred: Invalid fault

kubevols is the default directory for the vSphere in-tree driver. When there are no PVC/PV objects created, you may hit a bug where node drain gets stuck finding kubevols, because the current implementation assumes that kubevols always exists.


Workaround:

Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.
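
If govc is configured with your vCenter credentials (GOVC_URL, GOVC_USERNAME, and so on), one way to create the directory is sketched below; DATASTORE_NAME is the vCenter.datastore value from your cluster config file:

    govc datastore.mkdir -ds DATASTORE_NAME kubevols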

Configuration 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

Admin cluster cluster-health-controller and vsphere-metrics-exporter do not work after deleting the user cluster

On user cluster deletion, the corresponding clusterrole is also deleted, which results in auto repair and the vSphere metrics exporter not working.

The symptoms are the following:

  • cluster-health-controller logs
  • kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    cluster-health-controller
    
    where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
    error retrieving resource lock default/onprem-cluster-health-leader-election: configmaps "onprem-cluster-health-leader-election" is forbidden: User "system:serviceaccount:kube-system:cluster-health-controller" cannot get resource "configmaps" in API group "" in the namespace "default": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-health-controller-role" not found
    
  • vsphere-metrics-exporter logs
  • kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    vsphere-metrics-exporter
    
    where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
    vsphere-metrics-exporter/cmd/vsphere-metrics-exporter/main.go:68: Failed to watch *v1alpha1.Cluster: failed to list *v1alpha1.Cluster: clusters.cluster.k8s.io is forbidden: User "system:serviceaccount:kube-system:vsphere-metrics-exporter" cannot list resource "clusters" in API group "cluster.k8s.io" in the namespace "default"
    

Workaround:

Configuration 1.12.1-1.12.3, 1.13.0-1.13.2

gkectl check-config fails at OS image validation

A known issue that could cause gkectl check-config to fail if gkectl prepare has not been run. This is confusing because we suggest running gkectl check-config before gkectl prepare.

The symptom is that the gkectl check-config command fails with the following error message:

Validator result: {Status:FAILURE Reason:os images [OS_IMAGE_NAME] don't exist, please run `gkectl prepare` to upload os images. UnhealthyResources:[]}

Workaround:

Option 1: run gkectl prepare to upload the missing OS images.

Option 2: use gkectl check-config --skip-validation-os-images to skip the OS images validation.

Upgrades and updates 1.11, 1.12, 1.13

gkectl update admin/cluster fails at updating anti affinity groups

A known issue that could cause gkectl update admin/cluster to fail when updating anti-affinity groups.

The symptom is that the gkectl update command fails with the following error message:

Waiting for machines to be re-deployed...  ERROR
Exit with error:
Failed to update the cluster: timed out waiting for the condition

Workaround:

Installation, Upgrades and updates 1.13.0

Nodes fail to register if configured hostname contains a period

Node registration fails during cluster creation, upgrade, update and node auto repair, when ipMode.type is static and the configured hostname in the IP block file contains one or more periods. In this case, Certificate Signing Requests (CSR) for a node are not automatically approved.

To see pending CSRs for a node, run the following command:

kubectl get csr -A -o wide

Check the following logs for error messages:

  • View the logs in the admin cluster for the clusterapi-controller-manager container in the clusterapi-controllers Pod:
    kubectl logs clusterapi-controllers-POD_NAME \
        -c clusterapi-controller-manager -n kube-system \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
  • To view the same logs in the user cluster, run the following command:
    kubectl logs clusterapi-controllers-POD_NAME \
        -c clusterapi-controller-manager -n USER_CLUSTER_NAME \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
    where:
    • ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file.
    • USER_CLUSTER_NAME is the name of the user cluster.
    Here is an example of error messages you might see: "msg"="failed to validate token id" "error"="failed to find machine for node node-worker-vm-1" "validate"="csr-5jpx9"
  • View the kubelet logs on the problematic node:
    journalctl -u kubelet
    
    Here is an example of error messages you might see: "Error getting node" err="node \"node-worker-vm-1\" not found"

If you specify a domain name in the hostname field of an IP block file, any characters following the first period will be ignored. For example, if you specify the hostname as bob-vm-1.bank.plc, the VM hostname and node name will be set to bob-vm-1.

When node ID verification is enabled, the CSR approver compares the node name with the hostname in the Machine spec, and fails to reconcile the name. The approver rejects the CSR, and the node fails to bootstrap.


Workaround:

User cluster

Disable node ID verification by completing the following steps:

  1. Add the following fields in your user cluster configuration file:
    disableNodeIDVerification: true
    disableNodeIDVerificationCSRSigning: true
    
  2. Save the file, and update the user cluster by running the following command:
    gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --config USER_CLUSTER_CONFIG_FILE
    
    Replace the following:
    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
    • USER_CLUSTER_CONFIG_FILE: the path of your user cluster configuration file.

Admin cluster

  1. Open the OnPremAdminCluster custom resource for editing:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        edit onpremadmincluster -n kube-system
    
  2. Add the following annotation to the custom resource:
    features.onprem.cluster.gke.io/disable-node-id-verification: enabled
    
  3. Edit the kube-controller-manager manifest in the admin cluster control plane:
    1. SSH into the admin cluster control plane node.
    2. Open the kube-controller-manager manifest for editing:
      sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
      
    3. Find the list of controllers:
      --controllers=*,bootstrapsigner,tokencleaner,-csrapproving,-csrsigning
      
    4. Update this section as shown below:
      --controllers=*,bootstrapsigner,tokencleaner
      
  4. Open the Deployment Cluster API controller for editing:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        edit deployment clusterapi-controllers -n kube-system
    
  5. Change the values of node-id-verification-enabled and node-id-verification-csr-signing-enabled to false:
    --node-id-verification-enabled=false
    --node-id-verification-csr-signing-enabled=false
    
Installation, Upgrades and updates 1.11.0-1.11.4

Admin control plane machine startup failure caused by private registry certificate bundle

The admin cluster creation/upgrade is stuck at the following log forever and eventually times out:

Waiting for Machine gke-admin-master-xxxx to become ready...

The Cluster API controller log in the external cluster snapshot includes the following log:

Invalid value 'XXXX' specified for property startup-data

Here is an example file path for the Cluster API controller log:

kubectlCommands/kubectl_logs_clusterapi-controllers-c4fbb45f-6q6g6_--container_vsphere-controller-manager_--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_--request-timeout_30s_--namespace_kube-system_--timestamps
    

VMware has a 64k vApp property size limit. In the identified versions, the data passed via vApp property is close to the limit. When the private registry certificate contains a certificate bundle, it may cause the final data to exceed the 64k limit.


Workaround:

Only include the required certificates in the private registry certificate file configured in privateRegistry.caCertPath in the admin cluster config file.
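
To see how many certificates the configured file currently contains (a large bundle is the likely cause of exceeding the 64k limit), a quick check; the placeholder stands for the file referenced by privateRegistry.caCertPath:

    grep -c 'BEGIN CERTIFICATE' PRIVATE_REGISTRY_CACERT_FILE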

Or upgrade to a version with the fix when available.

Networking 1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0

NetworkGatewayNodes marked unhealthy from concurrent status update conflict

In networkgatewaygroups.status.nodes, some nodes switch between NotHealthy and Up.

Logs for the ang-daemon Pod running on that node reveal repeated errors:

2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}

The NotHealthy status prevents the controller from assigning additional floating IPs to the node. This can result in higher burden on other nodes or a lack of redundancy for high availability.

Dataplane activity is otherwise not affected.

Contention on the networkgatewaygroup object causes some status updates to fail due to a fault in retry handling. If too many status updates fail, ang-controller-manager sees the node as past its heartbeat time limit and marks the node NotHealthy.

The fault in retry handling has been fixed in later versions.


Workaround:

Upgrade to a fixed version, when available.

Upgrades and updates 1.12.0-1.12.2, 1.13.0

Race condition blocks machine object deletion during an update or upgrade

A known issue that could cause the cluster upgrade or update to be stuck waiting for the old machine object to be deleted, because the finalizer cannot be removed from the machine object. This affects any rolling update operation for node pools.

The symptom is that the gkectl command times out with the following error message:

E0821 18:28:02.546121   61942 console.go:87] Exit with error:
E0821 18:28:02.546184   61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated
Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.

In clusterapi-controller Pod logs, the errors are like below:

$ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1
    -c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG]
    | grep "Error removing finalizer from machine object"
[...]
E0821 23:19:45.114993       1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again

Even without this bug, the error can repeat for the same machine for several minutes during successful runs; most of the time the finalizer removal goes through quickly, but in some rare cases it can be stuck in this race condition for several hours.

The issue is that the underlying VM has already been deleted in vCenter, but the corresponding machine object cannot be removed; it is stuck at the finalizer removal because of very frequent updates from other controllers. This can cause the gkectl command to time out, but the controller keeps reconciling the cluster, so the upgrade or update process eventually completes.


Workaround:

We have prepared several different mitigation options for this issue, which depends on your environment and requirements.

  • Option 1: Wait for the upgrade to eventually complete by itself.

    Based on the analysis and reproduction in your environment, the upgrade can eventually finish by itself without any manual intervention. The caveat of this option is that it's uncertain how long the finalizer removal will take for each machine object. It can go through immediately if you are lucky, or it could last for several hours if the machineset controller reconciles too quickly and the machine controller never gets a chance to remove the finalizer between reconciliations.

    The good thing is that this option doesn't need any action from your side, and the workloads won't be disrupted. It just needs a longer time for the upgrade to finish.
  • Option 2: Apply auto repair annotation to all the old machine objects.

    The machineset controller filters out machines that have the auto repair annotation and a non-zero deletion timestamp, and won't keep issuing delete calls on those machines. This helps avoid the race condition.

    The downside is that the Pods on the machines are deleted directly instead of evicted, which means the PDB configuration is not respected. This might cause downtime for your workloads.

    The command for getting all machine names:
    kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
    
    The command for applying auto repair annotation for each machine:
    kubectl annotate --kubeconfig CLUSTER_KUBECONFIG \
        machine MACHINE_NAME \
        onprem.cluster.gke.io/repair-machine=true
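
    To apply the annotation to every machine at once, a minimal loop sketch over the output of the command above:

    for machine in $(kubectl --kubeconfig CLUSTER_KUBECONFIG get machines -o name); do
      kubectl annotate --kubeconfig CLUSTER_KUBECONFIG \
          "${machine}" onprem.cluster.gke.io/repair-machine=true
    done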
    

If you encounter this issue and the upgrade or update still can't complete after a long time, contact our support team for mitigations.

Installation, Upgrades and updates 1.10.2, 1.11, 1.12, 1.13

gkectl prepare OS image validation preflight failure

The gkectl prepare command fails with:

- Validation Category: OS Images
    - [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.

The preflight checks of gkectl prepare included an incorrect validation.


Workaround:

Run the same command with an additional flag --skip-validation-os-images.
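
For example (ADMIN_CLUSTER_CONFIG_FILE here stands for whichever config file you originally passed to gkectl prepare):

    gkectl prepare --config ADMIN_CLUSTER_CONFIG_FILE --skip-validation-os-images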

Installation 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

vCenter URL with https:// or http:// prefix may cause cluster startup failure

Admin cluster creation failed with:

Exit with error:
Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message:
Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid:
[data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]:
Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]

The URL is used as part of a Secret key, which doesn't support "/" or ":".


Workaround:

Remove https:// or http:// prefix from the vCenter.Address field in the admin cluster or user cluster config yaml.

Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

gkectl prepare panic on util.CheckFileExists

gkectl prepare can panic with the following stacktrace:

panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xde0dfa]

goroutine 1 [running]:
gke-internal.googlesource.com/syllogi/cluster-management/pkg/util.CheckFileExists(0xc001602210, 0x2b, 0xc001602210, 0x2b) pkg/util/util.go:226 +0x9a
gke-internal.googlesource.com/syllogi/cluster-management/gkectl/pkg/config/util.SetCertsForPrivateRegistry(0xc000053d70, 0x10, 0xc000f06f00, 0x4b4, 0x1, 0xc00015b400)gkectl/pkg/config/util/utils.go:75 +0x85
...

The issue is that gkectl prepare created the private registry certificate directory with the wrong permissions.


Workaround:

To fix this issue, please run the following commands on the admin workstation:

sudo mkdir -p /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
sudo chmod 0755 /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
Upgrades and updates 1.10, 1.11, 1.12, 1.13

gkectl repair admin-master and resumable admin upgrade do not work together

After a failed admin cluster upgrade attempt, don't run gkectl repair admin-master. Doing so may cause subsequent admin upgrade attempts to fail with issues such as admin master power on failure or the VM being inaccessible.


Workaround:

If you've already encountered this failure scenario, contact support.

Upgrades and updates 1.10, 1.11

Resumed admin cluster upgrade can lead to missing admin control plane VM template

If the admin control plane machine isn't recreated after a resumed admin cluster upgrade attempt, the admin control plane VM template is deleted. The admin control plane VM template is the template of the admin master that is used to recover the control plane machine with gkectl repair admin-master.


Workaround:

The admin control plane VM template will be regenerated during the next admin cluster upgrade.

Operating system 1.12, 1.13

cgroup v2 could affect workloads

In version 1.12.0, cgroup v2 (unified) is enabled by default for Container Optimized OS (COS) nodes. This could potentially cause instability for your workloads in a COS cluster.


Workaround:

We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are using COS nodes, we recommend that you upgrade to version 1.12.1 as soon as it is released.

Identity 1.10, 1.11, 1.12, 1.13

ClientConfig custom resource

gkectl update reverts any manual changes that you have made to the ClientConfig custom resource.


Workaround:

We strongly recommend that you back up the ClientConfig resource after every manual change.

Installation 1.10, 1.11, 1.12, 1.13

gkectl check-config validation fails: can't find F5 BIG-IP partitions

Validation fails because F5 BIG-IP partitions can't be found, even though they exist.

An issue with the F5 BIG-IP API can cause validation to fail.


Workaround:

Try running gkectl check-config again.

Installation 1.12

User cluster installation failed because of cert-manager/ca-injector's leader election issue

You might see an installation failure due to cert-manager-cainjector being in a crash loop when the apiserver or etcd is slow:

# These are logs from `cert-manager-cainjector`, from the command
# `kubectl logs --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system
#   cert-manager-cainjector-xxx`

I0923 16:19:27.911174       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition

E0923 16:19:27.911110       1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core:
  Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded

I0923 16:19:27.911593       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition

E0923 16:19:27.911629       1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"

Workaround:

Security, Upgrades and updates 1.10, 1.11, 1.12, 1.13

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

If you have begun the upgrade process and discovered an error with certificate expiry, contact Google Support for assistance.

Note: This guidance is strictly for admin cluster certificates renewal.

Workaround:

VMware 1.10, 1.11, 1.12, 1.13

Restarting or upgrading vCenter for versions lower than 7.0U2

For vCenter versions lower than 7.0U2, if vCenter is restarted (after an upgrade or otherwise), the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired to create new ones.

Related govmomi bug.


Workaround:

This workaround is provided by VMware support:

  1. The issue is fixed in vCenter versions 7.0U2 and above.
  2. For lower versions, right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.
Operating system 1.10, 1.11, 1.12, 1.13

SSH connection closed by remote host

For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:

ClientAliveInterval 300
ClientAliveCountMax 0

The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:

Connection to [IP] closed by remote host.
Connection to [IP] closed.

Workaround:

You can either:

  • Use nohup to prevent your command being terminated on SSH disconnection,
    nohup gkectl upgrade admin --config admin-cluster.yaml \
        --kubeconfig kubeconfig
    
  • Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value less than 3:
    sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' \
        /etc/ssh/sshd_config
    sudo systemctl restart sshd
    

Make sure you reconnect your SSH session.

Installation 1.10, 1.11, 1.12, 1.13

Conflicting cert-manager installation

In 1.13 releases, monitoring-operator installs cert-manager in the cert-manager namespace. If for some reason you need to install your own cert-manager, follow these instructions to avoid conflicts:

You only need to apply this workaround once for each cluster, and the changes are preserved across cluster upgrades.

Note: One common symptom of installing your own cert-manager is that the cert-manager version or image (for example v1.7.2) may revert to its older version. This is caused by monitoring-operator trying to reconcile cert-manager, and reverting the version in the process.

Workaround:

Avoid conflicts during upgrade

  1. Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
  2. Perform the upgrade.
  3. Follow the instructions below to restore your own cert-manager.

Restore your own cert-manager in user clusters

  • Scale the monitoring-operator Deployment to 0:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n USER_CLUSTER_NAME \
        scale deployment monitoring-operator --replicas=0
    
  • Scale the cert-manager deployments managed by monitoring-operator to 0:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager --replicas=0
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-cainjector\
        --replicas=0
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-webhook --replicas=0
    
  • Reinstall your version of cert-manager. Restore your customized resources if you have any.
  • You can skip this step if you are using the upstream default cert-manager installation, or if you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
    relevant_fields='
    {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
        name: .metadata.name,
        namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
    }
    '
    f1=$(mktemp)
    f2=$(mktemp)
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        get issuer -n cert-manager metrics-pki.cluster.local -o json \
        | jq "${relevant_fields}" > $f1
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        get certificate -n cert-manager metrics-ca -o json \
        | jq "${relevant_fields}" > $f2
    kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
    kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2
    

Restore your own cert-manager in admin clusters

In general, you shouldn't need to re-install cert-manager in admin clusters, because admin clusters only run Anthos clusters on VMware control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, follow the instructions below to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.

  • Scale the monitoring-operator deployment to 0.
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n kube-system scale deployment monitoring-operator --replicas=0
    
  • Scale the cert-manager deployments managed by monitoring-operator to 0.
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager \
        --replicas=0
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
         -n cert-manager scale deployment cert-manager-cainjector \
         --replicas=0
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-webhook \
        --replicas=0
    
  • Reinstall your version of cert-manager. Restore your customized resources if you have any.
  • You can skip this step if you are using the upstream default cert-manager installation, or if you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
    relevant_fields='
    {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
        name: .metadata.name,
        namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
    }
    '
    f3=$(mktemp)
    f4=$(mktemp)
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        get issuer -n cert-manager metrics-pki.cluster.local -o json \
        | jq "${relevant_fields}" > $f3
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        get certificate -n cert-manager metrics-ca -o json \
        | jq "${relevant_fields}" > $f4
    kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
    kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4
    
Operating system 1.10, 1.11, 1.12, 1.13

False positives in docker, containerd, and runc vulnerability scanning

The Docker, containerd, and runc in the Ubuntu OS images shipped with Anthos clusters on VMware are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes will be qualified by Anthos clusters on VMware before each release.

However, the special versions are unknown to the Ubuntu CVE Tracker, which various CVE scanning tools use as their vulnerability feed. Therefore, you will see false positives in Docker, containerd, and runc vulnerability scanning results.

For example, you might see false positives in your CVE scanning results for CVEs that are already fixed in the latest patch versions of Anthos clusters on VMware.

Refer to the release notes for any CVE fixes.


Workaround:

Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.

Upgrades and updates 1.10, 1.11, 1.12, 1.13

Network connection between admin and user cluster might be unavailable for a short time during non-HA cluster upgrade

If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice that kubectl exec, kubectl logs, and webhooks against user clusters are unavailable for a short time. This downtime can be up to one minute. This happens because the incoming requests (kubectl exec, kubectl logs, and webhooks) are handled by the kube-apiserver for the user cluster. The user kube-apiserver is a StatefulSet. In a non-HA cluster, there is only one replica for the StatefulSet, so during the upgrade there is a chance that the old kube-apiserver is unavailable while the new kube-apiserver is not yet ready.


Workaround:

This downtime only happens during the upgrade process. If you want shorter downtime during upgrades, we recommend that you switch to HA clusters.

Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

Konnectivity readiness check failed in HA cluster diagnose after cluster creation or upgrade

If you are creating or upgrading an HA cluster and notice that the konnectivity readiness check failed in the cluster diagnosis, in most cases it will not affect the functionality of Anthos clusters on VMware (kubectl exec, kubectl logs, and webhooks). This happens because one or two of the konnectivity replicas might be unready for a period of time due to unstable networking or other issues.


Workaround:

Konnectivity will recover by itself. Wait 30 minutes to 1 hour and then rerun the cluster diagnosis.
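As a minimal sketch of that recheck (flag names assume a recent gkectl version), you can rerun the diagnosis with:

gkectl diagnose cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name USER_CLUSTER_NAME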

Operating system 1.7, 1.8, 1.9, 1.10, 1.11

/etc/cron.daily/aide CPU and memory spike issue

Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

As a result, the cron script /etc/cron.daily/aide is installed to schedule an aide check, which ensures that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.

The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.


Workaround:

If the spikes are affecting your workload, you can disable the daily cron job:

sudo chmod -x /etc/cron.daily/aide
Networking 1.10, 1.11, 1.12, 1.13

Load balancers and NSX-T stateful distributed firewall rules interact unpredictably

When you deploy Anthos clusters on VMware version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap, causing gke-connect-agent Pods to be in a crash loop.

The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Anthos clusters on VMware releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K.


Workaround:

Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.

If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.

Logging and monitoring 1.10, 1.11, 1.12, 1.13, 1.14

Unexpected monitoring billing

For Anthos clusters on VMware versions 1.10 and later, some customers have found unexpectedly high billing for Metrics volume on the Billing page. This issue affects you only when all of the following circumstances apply:

  • Application monitoring is enabled (enableStackdriverForApplications=true)
  • Managed Service for Prometheus is not enabled (enableGMPForApplications)
  • Application Pods have the prometheus.io/scrape=true annotation

To confirm whether you are affected by this issue, list your user-defined metrics. If you see billing for unwanted metrics, then this issue applies to you.
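As a hedged sketch of that check, you can list the metric descriptors in your project with the Cloud Monitoring metricDescriptors.list API; the prometheus substring used in the grep is an assumption about how your application metrics are named, so adjust it for your environment:

curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors" \
    | grep '"type"' | grep prometheus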


Workaround

If you are affected by this issue, we recommend that you upgrade your clusters to version 1.12 and switch to the new application monitoring solution, managed-service-for-prometheus, which addresses this issue:

  • Separate flags to control the collection of application logs versus application metrics
  • Bundled Google Cloud Managed Service for Prometheus
  • If you can't upgrade to version 1.12, use the following steps:

    1. Find the source Pods and Services that have the unwanted billed metrics:
      kubectl --kubeconfig KUBECONFIG \
        get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
      kubectl --kubeconfig KUBECONFIG get \
        services -A -o yaml | grep 'prometheus.io/scrape: "true"'
      
    2. Remove the prometheus.io/scrape=true annotation from the Pod or Service, for example as shown in the sketch below.
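      A minimal sketch of step 2, assuming the annotation is set in a Deployment's Pod template; NAMESPACE and DEPLOYMENT_NAME are placeholders for the objects you found in step 1 (for Services, edit the Service metadata annotations directly instead):
      kubectl --kubeconfig KUBECONFIG -n NAMESPACE patch deployment DEPLOYMENT_NAME \
          --type json \
          -p '[{"op": "remove", "path": "/spec/template/metadata/annotations/prometheus.io~1scrape"}]'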
    Installation 1.11, 1.12, 1.13

    Installer fails when creating vSphere datadisk

    The Anthos clusters on VMware installer can fail if custom roles are bound at the wrong permissions level.

    When the role binding is incorrect, creating a vSphere datadisk with govc hangs and the disk is created with a size equal to 0. To fix the issue, you should bind the custom role at the vSphere vCenter level (root).


    Workaround:

    If you want to bind the custom role at the DC level (or lower than root), you also need to bind the read-only role to the user at the root vCenter level.

    For more information on role creation, see vCenter user account privileges.

    Logging and monitoring 1.9.0-1.9.4, 1.10.0-1.10.1

    High network traffic to monitoring.googleapis.com

    You might see high network traffic to monitoring.googleapis.com, even in a new cluster that has no user workloads.

    This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.


    Workaround:

    Upgrade to version 1.9.5 or later, or version 1.10.2 or later.

    Logging and monitoring 1.10, 1.11

    gke-metrics-agent has frequent CrashLoopBackOff errors

    For Anthos clusters on VMware version 1.10 and above, `gke-metrics-agent` DaemonSet has frequent CrashLoopBackOff errors when `enableStackdriverForApplications` is set to `true` in the `stackdriver` object.


    Workaround:

    To mitigate this issue, disable application metrics collection by running the following commands. These commands will not disable application logs collection.

    1. To prevent the following changes from reverting, scale down stackdriver-operator:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system scale deploy stackdriver-operator \
          --replicas=0
      
      Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
    2. Open the gke-metrics-agent-conf ConfigMap for editing:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system edit configmap gke-metrics-agent-conf
      
    3. Under services.pipelines, comment out the entire metrics/app-metrics section:
      services:
        pipelines:
          #metrics/app-metrics:
          #  exporters:
          #  - googlecloud/app-metrics
          #  processors:
          #  - resource
          #  - metric_to_resource
          #  - infer_resource
          #  - disk_buffer/app-metrics
          #  receivers:
          #  - prometheus/app-metrics
          metrics/metrics:
            exporters:
            - googlecloud/metrics
            processors:
            - resource
            - metric_to_resource
            - infer_resource
            - disk_buffer/metrics
            receivers:
            - prometheus/metrics
      
    4. Close the editing session.
    5. Restart the gke-metrics-agent DaemonSet:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system rollout restart daemonset gke-metrics-agent
      
    Logging and monitoring 1.11, 1.12, 1.13

    Replace deprecated metrics in dashboard

    If deprecated metrics are used in your OOTB dashboards, you will see some empty charts. To find deprecated metrics in the Monitoring dashboards, run the following commands:

    gcloud monitoring dashboards list > all-dashboard.json
    
    # find deprecated metrics
    cat all-dashboard.json | grep -E \
      'kube_daemonset_updated_number_scheduled\
        |kube_node_status_allocatable_cpu_cores\
        |kube_node_status_allocatable_pods\
        |kube_node_status_capacity_cpu_cores'
    

    The following deprecated metrics should be migrated to their replacements.

    Deprecated metric                           Replacement
    kube_daemonset_updated_number_scheduled     kube_daemonset_status_updated_number_scheduled
    kube_node_status_allocatable_cpu_cores      kube_node_status_allocatable
    kube_node_status_allocatable_memory_bytes   kube_node_status_allocatable
    kube_node_status_allocatable_pods           kube_node_status_allocatable
    kube_node_status_capacity_cpu_cores         kube_node_status_capacity
    kube_node_status_capacity_memory_bytes      kube_node_status_capacity
    kube_node_status_capacity_pods              kube_node_status_capacity
    kube_hpa_status_current_replicas            kube_horizontalpodautoscaler_status_current_replicas

    Workaround:

    To replace the deprecated metrics:

    1. Delete "GKE on-prem node status" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node status" following these instructions.
    2. Delete "GKE on-prem node utilization" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node utilization" following these instructions.
    3. Delete "GKE on-prem vSphere vm health" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem vSphere vm health" following these instructions.
    4. This deprecation is due to the upgrade of kube-state-metrics agent from v1.9 to v2.4, which is required for Kubernetes 1.22. You can replace all deprecated kube-state-metrics metrics, which have the prefix kube_, in your custom dashboards or alerting policies.
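    As a hedged sketch for checking alerting policies (this assumes the gcloud alpha monitoring component is available in your gcloud version), you can export the policies and search them the same way as the dashboards:
    gcloud alpha monitoring policies list --format=json > all-policies.json
    grep -E 'kube_daemonset_updated_number_scheduled|kube_hpa_status_current_replicas' all-policies.json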

    Logging and monitoring 1.10, 1.11, 1.12, 1.13

    Unknown metric data in Cloud Monitoring

    For Anthos clusters on VMware version 1.10 and above, the data for clusters in Cloud Monitoring may contain irrelevant summary metrics entries such as the following:

    Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile
    

    Other metric types that may have irrelevant summary metrics include:
    • apiserver_admission_step_admission_duration_seconds_summary
    • go_gc_duration_seconds
    • scheduler_scheduling_duration_seconds
    • gkeconnect_http_request_duration_seconds_summary
    • alertmanager_nflog_snapshot_duration_seconds_summary

    While these summary type metrics are in the metrics list, they are not supported by gke-metrics-agent at this time.

    Logging and monitoring 1.10, 1.11, 1.12, 1.13

    Missing metrics on some nodes

    You might find that the following metrics are missing on some, but not all, nodes:

    • kubernetes.io/anthos/container_memory_working_set_bytes
    • kubernetes.io/anthos/container_cpu_usage_seconds_total
    • kubernetes.io/anthos/container_network_receive_bytes_total

    Workaround:

    To fix this issue, perform the following steps as a workaround. For versions 1.9.5+, 1.10.2+, and 1.11.0+, increase the CPU for gke-metrics-agent by following steps 1 through 4:

    1. Open your stackdriver resource for editing:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system edit stackdriver stackdriver
      
    2. To increase the CPU request for gke-metrics-agent from 10m to 50m and the CPU limit from 100m to 200m, add the following resourceAttrOverride section to the stackdriver manifest:
      spec:
        resourceAttrOverride:
          gke-metrics-agent/gke-metrics-agent:
            limits:
              cpu: 200m
              memory: 4608Mi
            requests:
              cpu: 50m
              memory: 200Mi
      
      Your edited resource should look similar to the following:
      spec:
        anthosDistribution: on-prem
        clusterLocation: us-west1-a
        clusterName: my-cluster
        enableStackdriverForApplications: true
        gcpServiceAccountSecretName: ...
        optimizedMetrics: true
        portable: true
        projectID: my-project-191923
        proxyConfigSecretName: ...
        resourceAttrOverride:
          gke-metrics-agent/gke-metrics-agent:
            limits:
              cpu: 200m
              memory: 4608Mi
            requests:
              cpu: 50m
              memory: 200Mi
      
    3. Save your changes and close the text editor.
    4. To verify your changes have taken effect, run the following command:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          --namespace kube-system get daemonset gke-metrics-agent -o yaml \
          | grep "cpu: 50m"
      
      The command finds cpu: 50m if your edits have taken effect.
    Logging and monitoring 1.11.0-1.11.2, 1.12.0

    Missing scheduler and controller-manager metrics in admin cluster

    If your admin cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

    # scheduler metric example
    scheduler_pending_pods
    # controller-manager metric example
    replicaset_controller_rate_limiter_use
    

    Workaround:

    Upgrade to v1.11.3+, v1.12.1+, or v1.13+.

    Logging and monitoring 1.11.0-1.11.2, 1.12.0

    Missing scheduler and controller-manager metrics in user cluster

    If your user cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

    # scheduler metric example
    scheduler_pending_pods
    # controller-manager metric example
    replicaset_controller_rate_limiter_use
    

    Workaround:

    Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

    Failure to register admin cluster during creation

    If you create an admin cluster for version 1.9.x or 1.10.0, and if the admin cluster fails to register with the provided gkeConnect spec during its creation, you will get the following error.

    Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
    

    You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.

    failed to migrate to first admin trust chain: failed to parse current version "": invalid version: "" failed to migrate to first admin trust chain: failed to parse current version "": invalid version: ""
    

    Workaround:

    Identity 1.10, 1.11, 1.12, 1.13

    Using Anthos Identity Service can cause the Connect Agent to restart unpredictably

    If you are using the Anthos Identity Service feature to manage Anthos Identity Service ClientConfig, the Connect Agent might restart unexpectedly.


    Workaround:

    If you have experienced this issue with an existing cluster, you can do one of the following:

    • Disable Anthos Identity Service (AIS). Disabling AIS does not remove the deployed AIS binary or the AIS ClientConfig. To disable AIS, run this command:
      gcloud beta container hub identity-service disable \
          --project PROJECT_NAME
      
      Replace PROJECT_NAME with the name of the cluster's fleet host project.
    • Update the cluster to version 1.9.3 or later, or version 1.10.1 or later, so as to upgrade the Connect Agent version.
    Networking 1.10, 1.11, 1.12, 1.13

    Cisco ACI doesn't work with Direct Server Return (DSR)

    Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning.


    Workaround:

    A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).

    You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.

    VMware 1.10, 1.11, 1.12, 1.13

    vSphere 7.0 Update 3 issues

    VMware has recently identified critical issues with the following vSphere 7.0 Update 3 releases:

    • vSphere ESXi 7.0 Update 3 (build 18644231)
    • vSphere ESXi 7.0 Update 3a (build 18825058)
    • vSphere ESXi 7.0 Update 3b (build 18905247)
    • vSphere vCenter 7.0 Update 3b (build 18901211)

    Workaround:

    VMware has since removed these releases. You should upgrade ESXi and vCenter Server to a newer version.

    Operating system 1.10, 1.11, 1.12, 1.13

    Failure to mount emptyDir volume as exec into Pod running on COS nodes

    For Pods running on nodes that use Container-Optimized OS (COS) images, you cannot mount emptyDir volume as exec. It mounts as noexec and you will get the following error: exec user process caused: permission denied. For example, you will see this error message if you deploy the following test Pod:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        run: test
      name: test
    spec:
      containers:
      - args:
        - sleep
        - "5000"
        image: gcr.io/google-containers/busybox:latest
        name: test
        volumeMounts:
          - name: test-volume
            mountPath: /test-volume
        resources:
          limits:
            cpu: 200m
            memory: 512Mi
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      volumes:
        - emptyDir: {}
          name: test-volume
    

    In the test Pod, if you run mount | grep test-volume, the output shows the noexec option:

    /dev/sda1 on /test-volume type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30)
    

    Workaround:

    Upgrades and updates 1.10, 1.11, 1.12, 1.13

    Cluster node pool replica update does not work after autoscaling has been disabled on the node pool

    Node pool replicas do not update once autoscaling has been enabled and disabled on a node pool.


    Workaround:

    Remove the cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size and cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size annotations from the MachineDeployment of the corresponding node pool, for example as shown in the sketch below.
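    A minimal sketch of that step, assuming the MachineDeployment is named after the node pool and lives in the user cluster's namespace in the admin cluster; the trailing dash removes an annotation:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
        annotate machinedeployment NODE_POOL_NAME \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size- \
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size-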

    Logging and monitoring 1.11, 1.12, 1.13

    Windows monitoring dashboards show data from Linux clusters

    From version 1.11, on the out-of-the-box monitoring dashboards, the Windows Pod status dashboard and Windows node status dashboard also show data from Linux clusters. This is because the Windows node and Pod metrics are also exposed on Linux clusters.

    Logging and monitoring 1.10, 1.11, 1.12

    stackdriver-log-forwarder in constant CrashLoopBackOff

    For Anthos clusters on VMware version 1.10, 1.11, and 1.12, stackdriver-log-forwarder DaemonSet might have CrashLoopBackOff errors when there are broken buffered logs on the disk.


    Workaround:

    To mitigate this issue, you need to clean up the buffered logs on the nodes.

    1. To prevent the unexpected behaviour, scale down stackdriver-log-forwarder:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          -n kube-system patch daemonset stackdriver-log-forwarder -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
      
      Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
    2. Deploy the clean-up DaemonSet to clean up broken chunks:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
          -n kube-system apply -f - << EOF
      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: fluent-bit-cleanup
        namespace: kube-system
      spec:
        selector:
          matchLabels:
            app: fluent-bit-cleanup
        template:
          metadata:
            labels:
              app: fluent-bit-cleanup
          spec:
            containers:
            - name: fluent-bit-cleanup
              image: debian:10-slim
              command: ["bash", "-c"]
              args:
              - |
                rm -rf /var/log/fluent-bit-buffers/
                echo "Fluent Bit local buffer is cleaned up."
                sleep 3600
              volumeMounts:
              - name: varlog
                mountPath: /var/log
              securityContext:
                privileged: true
            tolerations:
            - key: "CriticalAddonsOnly"
              operator: "Exists"
            - key: node-role.kubernetes.io/master
              effect: NoSchedule
            - key: node-role.gke.io/observability
              effect: NoSchedule
            volumes:
            - name: varlog
              hostPath:
                path: /var/log
      EOF
      
    3. To make sure the clean-up DaemonSet has cleaned up all the chunks, you can run the following commands. The output of the two commands should equal the number of nodes in the cluster:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        logs -n kube-system -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l
      
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n kube-system get pods -l app=fluent-bit-cleanup --no-headers | wc -l
      
    4. Delete the clean-up DaemonSet:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n kube-system delete ds fluent-bit-cleanup
      
    5. Resume stackdriver-log-forwarder:
      kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n kube-system patch daemonset stackdriver-log-forwarder --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
      
    Security 1.13

    Kubelet service will be temporarily unavailable after NodeReady

    There is a short period during which the node is ready but the kubelet server certificate is not. kubectl exec and kubectl logs are unavailable during these tens of seconds. This is because it takes time for the new server certificate approver to see the updated valid IPs of the node.

    This issue affects the kubelet server certificate only; it does not affect Pod scheduling.
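    If you want to observe this window, a hedged check (assuming Kubernetes 1.19+ signer names) is to watch for the node's serving-certificate CSR being approved and issued:

    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get csr \
        | grep kubernetes.io/kubelet-serving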

    Upgrades and updates 1.12

    Partial admin cluster upgrade does not block later user cluster upgrade

    User cluster upgrade failed with:

    .LBKind in body is required (Check the status of OnPremUserCluster 'cl-stg-gdl-gke-onprem-mgmt/cl-stg-gdl' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
    

    The admin cluster is not fully upgraded, and its status version is still 1.10. The user cluster upgrade to 1.12 is not blocked by any preflight check, and it fails with a version skew issue.


    Workaround:

    Complete the admin cluster upgrade to 1.11 first, and then upgrade the user cluster to 1.12.

    Storage 1.10.0-1.10.5, 1.11.0-1.11.2, 1.12.0

    Datastore incorrectly reports insufficient free space

    gkectl diagnose cluster command failed with:

    Checking VSphere Datastore FreeSpace...FAILURE
        Reason: vCenter datastore: [DATASTORE_NAME] insufficient FreeSpace, requires at least [NUMBER] GB
    

    The validation of datastore free space should not be used for existing cluster node pools, and was added in gkectl diagnose cluster by mistake.


    Workaround:

    You can ignore the error message or skip the validation using --skip-validation-infra.
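    For example, a sketch of skipping the validation (flag names other than --skip-validation-infra assume a recent gkectl version):

    gkectl diagnose cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --cluster-name CLUSTER_NAME --skip-validation-infra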

    Operation, Networking 1.11, 1.12.0-1.12.1

    Failure to add new user cluster when admin cluster is using MetalLB load balancer

    You may not be able to add a new user cluster if your admin cluster is set up with a MetalLB load balancer configuration.

    The user cluster deletion process may get stuck for some reason, which results in an invalidation of the MetalLB ConfigMap. It won't be possible to add a new user cluster in this state.


    Workaround:

    You can force delete your user cluster.
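    A minimal sketch of a force delete; confirm the exact flags for your gkectl version:

    gkectl delete cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --cluster USER_CLUSTER_NAME --force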

    Installation, Operating system 1.10, 1.11, 1.12, 1.13

    Failure when using Container-Optimized OS (COS) for user cluster

    If osImageType is set to cos for the admin cluster, and gkectl check-config is executed after admin cluster creation and before user cluster creation, it fails with:

    Failed to create the test VMs: VM failed to get IP addresses on the network.
    

    The test VM created for the user cluster check-config by default uses the same osImageType as the admin cluster, and the test VM is not yet compatible with COS.


    Workaround:

    To avoid the slow preflight check that creates the test VM, use gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --fast.

    Logging and monitoring 1.12.0-1.12.1

    Grafana in the admin cluster unable to reach user clusters

    This issue affects customers using Grafana in the admin cluster to monitor user clusters in Anthos clusters on VMware versions 1.12.0 and 1.12.1. It comes from a mismatch of pushprox-client certificates in user clusters and the allowlist in the pushprox-server in the admin cluster. The symptom is pushprox-client in user clusters printing error logs like the following:

    level=error ts=2022-08-02T13:34:49.41999813Z caller=client.go:166 msg="Error reading request:" err="invalid method \"RBAC:\""
    

    Workaround:

    Other 1.11.3

    gkectl repair admin-master does not provide the VM template to be used for recovery

    gkectl repair admin-master command failed with:

    Failed to repair: failed to select the template: no VM templates is available for repairing the admin master (check if the admin cluster version >= 1.4.0 or contact support
    

    gkectl repair admin-master is not able to fetch the VM template to be used for repairing the admin control plane VM if the name of the admin control plane VM ends with the characters t, m, p, or l.


    Workaround:

    Rerun the command with --skip-validation.

    Logging and monitoring 1.11

    Cloud Audit Logging failure due to permission denied

    Anthos Cloud Audit Logging needs a special permission setup that is currently only performed automatically for user clusters through GKE Hub. It is recommended to have at least one user cluster that uses the same project ID and service account as the admin cluster for Cloud Audit Logging, so that the admin cluster has the permission needed for Cloud Audit Logging.

    However, in cases where the admin cluster uses a different project ID or a different service account from every user cluster, audit logs from the admin cluster fail to be sent to Google Cloud. The symptom is a series of Permission Denied errors in the audit-proxy Pod in the admin cluster.


    Workaround:

    Operation, Security 1.11

    gkectl diagnose checking certificates failure

    If your workstation does not have access to user cluster worker nodes, you will get the following failures when running gkectl diagnose:

    Checking user cluster certificates...FAILURE
        Reason: 3 user cluster certificates error(s).
        Unhealthy Resources:
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    

    If your workstation does not have access to admin cluster worker nodes, you will get the following failures when running gkectl diagnose:

    Checking admin cluster certificates...FAILURE
        Reason: 3 admin cluster certificates error(s).
        Unhealthy Resources:
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
        Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    

    Workaround:

    It is safe to ignore these messages.

    Operating system 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

    /var/log/audit/ filling up disk space on Admin workstation

    /var/log/audit/ is filled with audit logs. You can check the disk usage by running sudo du -h -d 1 /var/log/audit.

    Certain gkectl commands on the admin workstation, for example gkectl diagnose snapshot, contribute to disk space usage.

    Since Anthos v1.8, the Ubuntu image is hardened with the CIS Level 2 Benchmark. One of the compliance rules, "4.1.2.2 Ensure audit logs are not automatically deleted", enforces the auditd setting max_log_file_action = keep_logs. This results in all the audit logs being kept on the disk.


    Workaround:
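    As a hedged sketch (not the official procedure), one way to keep /var/log/audit/ bounded on the admin workstation is to switch auditd from keep_logs to rotation; this assumes the default /etc/audit/auditd.conf layout, and the retention of 250 rotated files is only illustrative:

    # Rotate audit logs instead of keeping all of them (illustrative values).
    sudo sed -i 's/^max_log_file_action = .*/max_log_file_action = rotate/' /etc/audit/auditd.conf
    sudo sed -i 's/^num_logs = .*/num_logs = 250/' /etc/audit/auditd.conf
    # Signal auditd to re-read its configuration.
    sudo pkill -HUP auditd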

    Networking 1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0

    NetworkGatewayGroup Floating IP conflicts with node address

    Users are unable to create or update NetworkGatewayGroup objects because of the following validating webhook error:

    [1] admission webhook "vnetworkgatewaygroup.kb.io" denied the request: NetworkGatewayGroup.networking.gke.io "default" is invalid: [Spec.FloatingIPs: Invalid value: "10.0.0.100": IP address conflicts with node address with name: "my-node-name"
    

    In affected versions, the kubelet can erroneously bind to a floating IP address assigned to the node and report it as a node address in node.status.addresses. The validating webhook checks NetworkGatewayGroup floating IP addresses against all node.status.addresses in the cluster and sees this as a conflict.


    Workaround:

    In the same cluster where create or update of NetworkGatewayGroup objects is failing, temporarily disable the ANG validating webhook and submit your change:

    1. Save the webhook config so it can be restored at the end:
      kubectl -n kube-system get validatingwebhookconfiguration \
          ang-validating-webhook-configuration -o yaml > webhook-config.yaml
      
    2. Edit the webhook config:
      kubectl -n kube-system edit validatingwebhookconfiguration \
          ang-validating-webhook-configuration
      
    3. Remove the vnetworkgatewaygroup.kb.io item from the webhook config list, then close the editor to apply the changes.
    4. Create or edit your NetworkGatewayGroup object.
    5. Reapply the original webhook config:
      kubectl -n kube-system apply -f webhook-config.yaml
      
    Installation, Upgrades and updates 1.10.0-1.10.2

    Creating or upgrading admin cluster timeout

    During an admin cluster upgrade attempt, the admin control plane VM might get stuck during creation. The admin control plane VM goes into an infinite waiting loop during the boot up, and you will see the following infinite loop error in the /var/log/cloud-init-output.log file:

    + echo 'waiting network configuration is applied'
    waiting network configuration is applied
    ++ get-public-ip
    +++ ip addr show dev ens192 scope global
    +++ head -n 1
    +++ grep -v 192.168.231.1
    +++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
    +++ awk '{print $2}'
    ++ echo
    + '[' -n '' ']'
    + sleep 1
    + echo 'waiting network configuration is applied'
    waiting network configuration is applied
    ++ get-public-ip
    +++ ip addr show dev ens192 scope global
    +++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
    +++ awk '{print $2}'
    +++ grep -v 192.168.231.1
    +++ head -n 1
    ++ echo
    + '[' -n '' ']'
    + sleep 1
    

    This is because when Anthos clusters on VMware tries to get the node IP address in the startup script, it uses grep -v ADMIN_CONTROL_PLANE_VIP to skip the admin cluster control-plane VIP which can be assigned to the NIC too. However, the command also skips over any IP address that has a prefix of the control-plane VIP, which causes the startup script to hang.

    For example, suppose that the admin cluster control-plane VIP is 192.168.1.25. If the IP address of the admin cluster control-plane VM has the same prefix, for example 192.168.1.254, then the control-plane VM will get stuck during creation. This issue can also be triggered if the broadcast address has the same prefix as the control-plane VIP, for example 192.168.1.255.


    Workaround:

    • If the reason for the admin cluster creation timeout is due to the broadcast IP address, run the following command on the admin cluster control-plane VM:
      ip addr add ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192
      
      This will create a line without a broadcast address, and unblock the boot up process. After the startup script is unblocked, remove this added line by running the following command:
      ip addr del ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192
      
    • However, if the reason for the admin cluster creation timeout is due to the IP address of the control-plane VM, you cannot unblock the startup script. Switch to a different IP address, and recreate or upgrade to version 1.10.3 or later.
    Operating system, Upgrades and updates 1.10.0-1.10.2

    The state of the admin cluster using COS image will get lost upon admin cluster upgrade or admin master repair

    The DataDisk can't be mounted correctly to the admin cluster master node when using the COS image, and the state of the admin cluster will be lost upon admin cluster upgrade or admin master repair. (An admin cluster using the COS image is a preview feature.)


    Workaround:

    Re-create the admin cluster with osImageType set to ubuntu_containerd.

    To check whether you are affected: after you create the admin cluster with osImageType set to cos, grab the admin cluster SSH key and SSH into the admin master node. The df -h result contains /dev/sdb1 98G 209M 93G 1% /opt/data, and the lsblk result contains -sdb1 8:17 0 100G 0 part /opt/data.

    Operating system 1.10

    systemd-resolved failed DNS lookup on .local domains

    In Anthos clusters on VMware version 1.10.0, name resolutions on Ubuntu are routed to local systemd-resolved listening on 127.0.0.53 by default. The reason is that on the Ubuntu 20.04 image used in version 1.10.0, /etc/resolv.conf is sym-linked to /run/systemd/resolve/stub-resolv.conf, which points to the 127.0.0.53 localhost DNS stub.

    As a result, the localhost DNS name resolution refuses to check the upstream DNS servers (specified in /run/systemd/resolve/resolv.conf) for names with a .local suffix, unless the names are specified as search domains.

    This causes any lookups for .local names to fail. For example, during node startup, kubelet fails on pulling images from a private registry with a .local suffix. Specifying a vCenter address with a .local suffix will not work on an admin workstation.


    Workaround:

    You can avoid this issue for cluster nodes if you specify the searchDomainsForDNS field in your admin cluster configuration file and the user cluster configuration file to include the domains.

    Currently, gkectl update doesn't support updating the searchDomainsForDNS field.

    Therefore, if you haven't set up this field before cluster creation, you must SSH into the nodes and bypass the local systemd-resolved stub by changing the symlink of /etc/resolv.conf from /run/systemd/resolve/stub-resolv.conf (which contains the 127.0.0.53 local stub) to /run/systemd/resolve/resolv.conf (which points to the actual upstream DNS):

    sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
    

    As for the admin workstation, gkeadm doesn't support specifying search domains, so you must work around this issue with this manual step.

    This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.

    Installation, Operating system 1.10

    Docker bridge IP uses 172.17.0.1/16 instead of 169.254.123.1/24

    Anthos clusters on VMware specifies a dedicated subnet for the Docker bridge IP address that uses --bip=169.254.123.1/24, so that it won't reserve the default 172.17.0.1/16 subnet. However, in version 1.10.0, there is a bug in Ubuntu OS image that caused the customized Docker config to be ignored.

    As a result, Docker picks the default 172.17.0.1/16 as its bridge IP address subnet. This might cause an IP address conflict if you already have workload running within that IP address range.


    Workaround:

    To work around this issue, you must rename the following systemd config file for dockerd, and then restart the service:

    sudo mv /etc/systemd/system/docker.service.d/50-cloudimg-settings.cfg \
        /etc/systemd/system/docker.service.d/50-cloudimg-settings.conf
    
    sudo systemctl daemon-reload
    
    sudo systemctl restart docker
    

    Verify that Docker picks the correct bridge IP address:

    ip a | grep docker0
    

    This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.

    Upgrades and updates 1.11

    Upgrade to 1.11 blocked by stackdriver readiness

    In Anthos clusters on VMware version 1.11.0, there are changes in the definition of custom resources related to logging and monitoring:

    • Group name of the stackdriver custom resource changed from addons.sigs.k8s.io to addons.gke.io;
    • Group name of the monitoring and metricsserver custom resources changed from addons.k8s.io to addons.gke.io;
    • The specs of the above resources are now validated against their schemas. In particular, the resourceAttrOverride and storageSizeOverride specs in the stackdriver custom resource must use string values for the cpu, memory, and storage size requests and limits.

    The group name changes are made to comply with CustomResourceDefinition updates in Kubernetes 1.22.

    There is no action required if you do not have additional logic that applies or edits the affected custom resources. The Anthos clusters on VMware upgrade process will take care of the migration of the affected resources and keep their existing specs after the group name change.

    However if you run any logic that applies or edits the affected resources, special attention is needed. First, they need to be referenced with the new group name in your manifest file. For example:

    apiVersion: addons.gke.io/v1alpha1  ## instead of `addons.sigs.k8s.io/v1alpha1`
    kind: Stackdriver
    

    Secondly, make sure the resourceAttrOverride and storageSizeOverride spec values are of string type. For example:

    spec:
      resourceAttrOverride:
        stackdriver-log-forwarder/stackdriver-log-forwarder:
          limits:
            cpu: 1000m # or "1"
            # cpu: 1 # integer value like this would not work 
            memory: 3000Mi
    

    Otherwise, the apply and edit operations will not take effect and may lead to unexpected status in logging and monitoring components. Potential symptoms may include:

    • Reconciliation error logs in onprem-user-cluster-controller, for example:
      potential reconciliation error: Apply bundle components failed, requeue after 10s, error: failed to apply addon components: failed to apply bundle objects from stackdriver-operator-addon 1.11.2-gke.53 to cluster my-cluster: failed to create typed live object: .spec.resourceAttrOverride.stackdriver-log-forwarder/stackdriver-log-forwarder.limits.cpu: expected string, got &value.valueUnstructured{Value:1}
    • Failure in kubectl edit stackdriver stackdriver, for example:
      Error from server (NotFound): stackdrivers.addons.gke.io "stackdriver" not found

    If you encounter the above errors, it means an unsupported type under the stackdriver CR spec was already present before the upgrade. As a workaround, you can manually edit the stackdriver CR under the old group name (kubectl edit stackdrivers.addons.sigs.k8s.io stackdriver) and do the following:

    1. Change the resource requests and limits to string type;
    2. Remove any addons.gke.io/migrated-and-deprecated: true annotation if present.
    Then resume or restart the upgrade process.
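    To confirm which API group your stackdriver resource currently uses, a quick hedged check is to list the relevant CRDs in the affected cluster:

    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get crd | grep stackdrivers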

    If you need additional assistance, reach out to Google support.