Anthos clusters on VMware

Category Identified version(s) Issue and workaround
Operation, Upgrades and updates 1.11.0 - 1.11.1, 1.10.0 - 1.10.4, 1.9.0 - 1.9.6

Admin cluster backup does not include the always-on secrets encryption keys and configuration

When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by the always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup causes the following error:

Validating admin master VM xxx ...
Waiting for kube-apiserver to be accessible via LB VIP (timeout "8m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "13m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "18m0s")...  ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master

Operation, Upgrades and updates 1.10+

Recreating the admin master VM with a new boot disk (for example, gkectl repair admin-master) will fail if the always-on secrets encryption feature was enabled using the `gkectl update` command.

If the always-on secrets encryption feature is not enabled at cluster creation, but is enabled later using the gkectl update operation, then gkectl repair admin-master fails to repair the admin cluster control plane node. It is recommended that the always-on secrets encryption feature be enabled at cluster creation, as sketched below. There is no current mitigation.
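
For reference, a minimal sketch of enabling the feature at cluster creation in the admin cluster configuration file (field names follow the documented secretsEncryption schema; verify them against your config version):

secretsEncryption:
  mode: GeneratedKey
  generatedKey:
    keyVersion: 1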

Upgrades and updates 1.10

Upgrading the first user cluster from 1.9 to 1.10 recreates nodes in other user clusters

Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.

The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables, which triggered an update on all MachineDeployments unexpectedly.


Workaround:

Upgrades and updates 1.10.0

Docker restarts frequently after cluster upgrade

Upgrading a user cluster to 1.10.0 might cause Docker to restart frequently.

You can detect this issue by running kubectl describe node NODE_NAME --kubeconfig USER_CLUSTER_KUBECONFIG.

A node condition shows whether Docker is restarting frequently. Here is an example output:

Normal   FrequentDockerRestart    41m (x2 over 141m)     systemd-monitor  Node condition FrequentDockerRestart is now: True, reason: FrequentDockerRestart

To understand the root cause, SSH to the node that has the symptom and run commands such as sudo journalctl --utc -u docker or sudo journalctl -x.


Workaround:

Upgrades and updates 1.11, 1.12

Self-deployed GMP components not preserved after upgrading to version 1.12

If you are using an Anthos clusters on VMware version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system namespace for your cluster, the components are not preserved when you upgrade to version 1.12.x.

From version 1.12, GMP components in the gmp-system namespace and CRDs are managed by the stackdriver object, with the enableGMPForApplications flag set to false by default. If you manually deployed GMP components in the namespace prior to upgrading to 1.12, the resources will be deleted by stackdriver. You can check the flag with the sketch below.
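
A quick way to inspect the flag is to read it from the stackdriver object (a sketch; this assumes the flag appears in the spec of the stackdriver resource in kube-system, as it does for the other fields shown later on this page):

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system \
    get stackdriver stackdriver -o yaml | grep enableGMPForApplications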


Workaround:

Operation 1.11, 1.12, 1.13.0 - 1.13.1

Missing ClusterAPI objects in cluster snapshot system scenario

In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.

However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.


Workaround:

You can manually run the following commands to collect the debugging information.

export KUBECONFIG=USER_CLUSTER_KUBECONFIG
kubectl get clusters.cluster.k8s.io -o yaml
kubectl get controlplanes.cluster.k8s.io -o yaml
kubectl get machineclasses.cluster.k8s.io -o yaml
kubectl get machinedeployments.cluster.k8s.io -o yaml
kubectl get machines.cluster.k8s.io -o yaml
kubectl get machinesets.cluster.k8s.io -o yaml
kubectl get services -o yaml
kubectl describe clusters.cluster.k8s.io
kubectl describe controlplanes.cluster.k8s.io
kubectl describe machineclasses.cluster.k8s.io
kubectl describe machinedeployments.cluster.k8s.io
kubectl describe machines.cluster.k8s.io
kubectl describe machinesets.cluster.k8s.io
kubectl describe services
where:

USER_CLUSTER_KUBECONFIG is the user cluster's kubeconfig file.

Upgrades and updates 1.11.0-1.11.4, 1.12.0-1.12.3, 1.13.0-1.13.1

User cluster deletion stuck at node drain for vSAN setup

When deleting, updating or upgrading a user cluster, node drain may be stuck in the following scenarios:

  • The admin cluster has been using the vSphere CSI driver on vSAN since version 1.12.x, and
  • There are no PVC/PV objects created by in-tree vSphere plugins in the admin and user cluster.

To identify the symptom, run the command below:

kubectl logs clusterapi-controllers-POD_NAME_SUFFIX  --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAMESPACE

Here is a sample error message from the above command:

E0920 20:27:43.086567 1 machine_controller.go:250] Error deleting machine object [MACHINE]; Failed to delete machine [MACHINE]: failed to detach disks from VM "[MACHINE]": failed to convert disk path "kubevols" to UUID path: failed to convert full path "ds:///vmfs/volumes/vsan:[UUID]/kubevols": ServerFaultCode: A general system error occurred: Invalid fault

kubevols is the default directory for the vSphere in-tree driver. When there are no PVC/PV objects created, you may hit a bug where node drain gets stuck finding kubevols, because the current implementation assumes that the kubevols directory always exists.


Workaround:

Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.
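
For example, a minimal sketch using govc (the datastore name and the govc connection variables are placeholders for your environment):

# Assumes GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD point at your vCenter.
# DATASTORE_NAME is the value of vCenter.datastore from the cluster config file.
govc datastore.mkdir -ds DATASTORE_NAME kubevols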

Configuration 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

Admin cluster cluster-health-controller and vsphere-metrics-exporter do not work after deleting a user cluster

On user cluster deletion, the corresponding clusterrole is also deleted, which results in auto repair and the vSphere metrics exporter not working.

The symptoms are the following:

  • cluster-health-controller logs
  • kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    cluster-health-controller
    
    where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
    error retrieving resource lock default/onprem-cluster-health-leader-election: configmaps "onprem-cluster-health-leader-election" is forbidden: User "system:serviceaccount:kube-system:cluster-health-controller" cannot get resource "configmaps" in API group "" in the namespace "default": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-health-controller-role" not found
    
  • vsphere-metrics-exporter logs
  • kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
    vsphere-metrics-exporter
    
    where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file. Here is an example of error messages you might see:
    vsphere-metrics-exporter/cmd/vsphere-metrics-exporter/main.go:68: Failed to watch *v1alpha1.Cluster: failed to list *v1alpha1.Cluster: clusters.cluster.k8s.io is forbidden: User "system:serviceaccount:kube-system:vsphere-metrics-exporter" cannot list resource "clusters" in API group "cluster.k8s.io" in the namespace "default"
    

Workaround:

Configuration 1.12.1-1.12.3, 1.13.0-1.13.2

gkectl check-config fails at OS image validation

A known issue can cause gkectl check-config to fail if gkectl prepare has not been run. This is confusing because we suggest running gkectl check-config before running gkectl prepare.

The symptom is that the gkectl check-config command will fail with the following error message:

Validator result: {Status:FAILURE Reason:os images [OS_IMAGE_NAME] don't exist, please run `gkectl prepare` to upload os images. UnhealthyResources:[]}

Workaround:

Option 1: run gkectl prepare to upload the missing OS images.

Option 2: use gkectl check-config --skip-validation-os-images to skip the OS images validation.

Upgrades and updates 1.11, 1.12, 1.13

gkectl update admin/cluster fails at updating anti-affinity groups

A known issue can cause gkectl update admin/cluster to fail when updating anti-affinity groups.

The symptom is that the gkectl update command will fail with the following error message:

Waiting for machines to be re-deployed...  ERROR
Exit with error:
Failed to update the cluster: timed out waiting for the condition

Workaround:

Installation, Upgrades and updates 1.13.0

Nodes fail to register if configured hostname contains a period

Node registration fails during cluster creation, upgrade, update and node auto repair, when ipMode.type is static and the configured hostname in the IP block file contains one or more periods. In this case, Certificate Signing Requests (CSR) for a node are not automatically approved.

To see pending CSRs for a node, run the following command:

kubectl get csr -A -o wide

Check the following logs for error messages:

  • View the logs in the admin cluster for the clusterapi-controller-manager container in the clusterapi-controllers Pod:
    kubectl logs clusterapi-controllers-POD_NAME \
        -c clusterapi-controller-manager -n kube-system \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
  • To view the same logs in the user cluster, run the following command:
    kubectl logs clusterapi-controllers-POD_NAME \
        -c clusterapi-controller-manager -n USER_CLUSTER_NAME \
        --kubeconfig ADMIN_CLUSTER_KUBECONFIG
    
    where:
    • ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file.
    • USER_CLUSTER_NAME is the name of the user cluster.
    Here is an example of error messages you might see: "msg"="failed to validate token id" "error"="failed to find machine for node node-worker-vm-1" "validate"="csr-5jpx9"
  • View the kubelet logs on the problematic node:
    journalctl -u kubelet
    
    Here is an example of error messages you might see: "Error getting node" err="node \"node-worker-vm-1\" not found"

If you specify a domain name in the hostname field of an IP block file, any characters following the first period will be ignored. For example, if you specify the hostname as bob-vm-1.bank.plc, the VM hostname and node name will be set to bob-vm-1.

When node ID verification is enabled, the CSR approver compares the node name with the hostname in the Machine spec, and fails to reconcile the name. The approver rejects the CSR, and the node fails to bootstrap.


Workaround:

User cluster

Disable node ID verification by completing the following steps:

  1. Add the following fields in your user cluster configuration file:
    disableNodeIDVerification: true
    disableNodeIDVerificationCSRSigning: true
    
  2. Save the file, and update the user cluster by running the following command:
    gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        --config USER_CLUSTER_CONFIG_FILE
    
    Replace the following:
    • ADMIN_CLUSTER_KUBECONFIG: the path of the admin cluster kubeconfig file.
    • USER_CLUSTER_CONFIG_FILE: the path of your user cluster configuration file.

Admin cluster

  1. Open the OnPremAdminCluster custom resource for editing:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        edit onpremadmincluster -n kube-system
    
  2. Add the following annotation to the custom resource:
    features.onprem.cluster.gke.io/disable-node-id-verification: enabled
    
  3. Edit the kube-controller-manager manifest in the admin cluster control plane:
    1. SSH into the admin cluster control plane node.
    2. Open the kube-controller-manager manifest for editing:
      sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
      
    3. Find the list of controllers:
      --controllers=*,bootstrapsigner,tokencleaner,-csrapproving,-csrsigning
      
    4. Update this section as shown below:
      --controllers=*,bootstrapsigner,tokencleaner
      
  4. Open the Deployment Cluster API controller for editing:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        edit deployment clusterapi-controllers -n kube-system
    
  5. Change the values of node-id-verification-enabled and node-id-verification-csr-signing-enabled to false:
    --node-id-verification-enabled=false
    --node-id-verification-csr-signing-enabled=false
    
Installation, Upgrades and updates 1.11.0-1.11.4

Admin control plane machine startup failure caused by private registry certificate bundle

The admin cluster creation/upgrade appears stuck at the following log and eventually times out:

Waiting for Machine gke-admin-master-xxxx to become ready...

The Cluster API controller log in the external cluster snapshot includes the following log:

Invalid value 'XXXX' specified for property startup-data

Here is an example file path for the Cluster API controller log:

kubectlCommands/kubectl_logs_clusterapi-controllers-c4fbb45f-6q6g6_--container_vsphere-controller-manager_--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_--request-timeout_30s_--namespace_kube-system_--timestamps
    

VMware has a 64k vApp property size limit. In the identified versions, the data passed via vApp property is close to the limit. When the private registry certificate contains a certificate bundle, it may cause the final data to exceed the 64k limit.


Workaround:

Only include the required certificates in the private registry certificate file configured in privateRegistry.caCertPath in the admin cluster config file.
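
To gauge whether the bundle might be the culprit, you can check the size of the certificate file (a rough check only; the 64k limit applies to the combined vApp property data, not to this file alone):

wc -c PRIVATE_REGISTRY_CA_CERT_FILE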

Or upgrade to a version with the fix when available.

Networking 1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0

NetworkGatewayNodes marked unhealthy from concurrent status update conflict

In networkgatewaygroups.status.nodes, some nodes switch between NotHealthy and Up.

Logs for the ang-daemon Pod running on that node reveal repeated errors:

2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}

The NotHealthy status prevents the controller from assigning additional floating IPs to the node. This can result in higher burden on other nodes or a lack of redundancy for high availability.

Dataplane activity is otherwise not affected.

Contention on the networkgatewaygroup object causes some status updates to fail due to a fault in retry handling. If too many status updates fail, ang-controller-manager sees the node as past its heartbeat time limit and marks the node NotHealthy.

The fault in retry handling has been fixed in later versions.


Workaround:

Upgrade to a fixed version, when available.

Upgrades and updates 1.12.0-1.12.2, 1.13.0

Race condition blocks machine object deletion during an update or upgrade

A known issue can cause the cluster upgrade or update to be stuck waiting for the old machine object to be deleted; the finalizer cannot be removed from the machine object. This affects any rolling update operation for node pools.

The symptom is that the gkectl command times out with the following error message:

E0821 18:28:02.546121   61942 console.go:87] Exit with error:
E0821 18:28:02.546184   61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated
Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.

In clusterapi-controller Pod logs, the errors are like below:

$ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1 \
    -c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG] \
    | grep "Error removing finalizer from machine object"
[...]
E0821 23:19:45.114993       1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again

The error repeats for the same machine for several minutes, even for successful runs without this bug. Most of the time it goes through quickly, but in some rare cases it can be stuck at this race condition for several hours.

The issue is that the underlying VM is already deleted in vCenter, but the corresponding machine object cannot be removed, which is stuck at the finalizer removal due to very frequent updates from other controllers. This can cause the gkectl command to timeout, but the controller keeps reconciling the cluster so the upgrade or update process eventually completes.


Workaround:

We have prepared several mitigation options for this issue; which one fits depends on your environment and requirements.

  • Option 1: Wait for the upgrade to eventually complete by itself.

    Based on the analysis and reproduction, the upgrade can eventually finish by itself without any manual intervention. The caveat of this option is that it's uncertain how long the finalizer removal will take for each machine object. It can go through immediately if you are lucky, or it can last for several hours if the machineset controller reconcile is too fast and the machine controller never gets a chance to remove the finalizer between reconciliations.

    The good thing is that this option doesn't need any action from your side, and the workloads won't be disrupted. It just needs a longer time for the upgrade to finish.
  • Option 2: Apply auto repair annotation to all the old machine objects.

    The machineset controller will filter out the machines that have the auto repair annotation and a non-zero deletion timestamp, and will stop issuing delete calls on those machines. This helps avoid the race condition.

    The downside is that the Pods on the machines are deleted directly instead of evicted, which means the PDB configuration is not respected. This might cause downtime for your workloads.

    The command for getting all machine names:
    kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
    
    The command for applying the auto repair annotation to each machine (see the loop sketch after this list to annotate all machines at once):
    kubectl annotate --kubeconfig CLUSTER_KUBECONFIG \
        machine MACHINE_NAME \
        onprem.cluster.gke.io/repair-machine=true
    

If you encounter this issue and the upgrade or update still can't complete after a long time, contact our support team for mitigations.
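
If you choose Option 2, a loop like the following applies the annotation to every machine in one pass (a sketch assuming a bash shell; CLUSTER_KUBECONFIG is a placeholder for the cluster's kubeconfig file):

for machine in $(kubectl --kubeconfig CLUSTER_KUBECONFIG get machines -o name); do
  # "-o name" prints machine.cluster.k8s.io/NAME, which kubectl annotate accepts.
  kubectl annotate --kubeconfig CLUSTER_KUBECONFIG "$machine" \
      onprem.cluster.gke.io/repair-machine=true
done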

Installation, Upgrades and updates 1.10.2, 1.11, 1.12, 1.13

gkectl prepare OS image validation preflight failure

gkectl prepare command failed with:

- Validation Category: OS Images
    - [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.

The preflight checks of gkectl prepare included an incorrect validation.


Workaround:

Run the same command with an additional flag --skip-validation-os-images.
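
For example (a sketch; CONFIG_FILE stands for the same configuration file you passed to the failing command):

gkectl prepare --config CONFIG_FILE --skip-validation-os-images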

Installation 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

vCenter URL with https:// or http:// prefix may cause cluster startup failure

Admin cluster creation failed with:

Exit with error:
Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message:
Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid:
[data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]:
Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]

The URL is used as part of a Secret key, which doesn't support "/" or ":".


Workaround:

Remove the https:// or http:// prefix from the vCenter.Address field in the admin cluster or user cluster config YAML file.
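
For example (the hostname is a placeholder; only the prefix changes):

# Before (causes the Secret key validation failure):
vCenter:
  address: "https://vcenter01.example.com"

# After:
vCenter:
  address: "vcenter01.example.com"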

Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

gkectl prepare panic on util.CheckFileExists

gkectl prepare can panic with the following stacktrace:

panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xde0dfa]

goroutine 1 [running]:
gke-internal.googlesource.com/syllogi/cluster-management/pkg/util.CheckFileExists(0xc001602210, 0x2b, 0xc001602210, 0x2b) pkg/util/util.go:226 +0x9a
gke-internal.googlesource.com/syllogi/cluster-management/gkectl/pkg/config/util.SetCertsForPrivateRegistry(0xc000053d70, 0x10, 0xc000f06f00, 0x4b4, 0x1, 0xc00015b400)gkectl/pkg/config/util/utils.go:75 +0x85
...

The issue is that gkectl prepare created the private registry certificate directory with the wrong permissions.


Workaround:

To fix this issue, please run the following commands on the admin workstation:

sudo mkdir -p /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
sudo chmod 0755 /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS

Upgrades and updates 1.10, 1.11, 1.12, 1.13

gkectl repair admin-master and resumable admin upgrade do not work together

After a failed admin cluster upgrade attempt, don't run gkectl repair admin-master. Doing so may cause subsequent admin upgrade attempts to fail with issues such as admin master power on failure or the VM being inaccessible.


Workaround:

If you've already encountered this failure scenario, contact support.

Upgrades and updates 1.10, 1.11

Resumed admin cluster upgrade can lead to missing admin control plane VM template

If the admin control plane machine isn't recreated after a resumed admin cluster upgrade attempt, the admin control plane VM template is deleted. The admin control plane VM template is the template of the admin master that is used to recover the control plane machine with gkectl repair admin-master.


Workaround:

The admin control plane VM template will be regenerated during the next admin cluster upgrade.

Operating system 1.12, 1.13

cgroup v2 could affect workloads

In version 1.12.0, cgroup v2 (unified) is enabled by default for Container-Optimized OS (COS) nodes. This could potentially cause instability for your workloads in a COS cluster.


Workaround:

We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are using COS nodes, we recommend that you upgrade to version 1.12.1 as soon as it is released.
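
To check which cgroup mode a node is currently using, you can run the following on the node (a generic Linux check, not specific to Anthos):

stat -fc %T /sys/fs/cgroup/
# Prints "cgroup2fs" for cgroup v2 (unified) or "tmpfs" for cgroup v1 (hybrid)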

Identity 1.10, 1.11, 1.12, 1.13

ClientConfig custom resource

gkectl update reverts any manual changes that you have made to the ClientConfig custom resource.


Workaround:

We strongly recommend that you back up the ClientConfig resource after every manual change.
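
For example, a minimal backup sketch (this assumes the default ClientConfig, which lives in the kube-public namespace):

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-public \
    get clientconfig default -o yaml > clientconfig-backup.yaml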

Installation 1.10, 1.11, 1.12, 1.13

gkectl check-config validation fails: can't find F5 BIG-IP partitions

Validation fails because F5 BIG-IP partitions can't be found, even though they exist.

An issue with the F5 BIG-IP API can cause validation to fail.


Workaround:

Try running gkectl check-config again.

Installation 1.12

User cluster installation failed because of cert-manager/ca-injector's leader election issue

You might see an installation failure due to cert-manager-cainjector in a crash loop, when the API server or etcd is slow:

# These are logs from `cert-manager-cainjector`, from the command
# `kubectl logs --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system
#   cert-manager-cainjector-xxx`

I0923 16:19:27.911174       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition

E0923 16:19:27.911110       1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core:
  Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded

I0923 16:19:27.911593       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition

E0923 16:19:27.911629       1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"

Workaround:

Security, Upgrades and updates 1.10, 1.11, 1.12, 1.13

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

If you have begun the upgrade process and discovered an error with certificate expiry, contact Google Support for assistance.

Note: This guidance is strictly for admin cluster certificates renewal.
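
One way to spot-check the expiry of the admin cluster API server serving certificate from the admin workstation (a sketch only; ADMIN_CONTROL_PLANE_VIP is a placeholder, and this does not cover every certificate in the cluster):

echo | openssl s_client -connect ADMIN_CONTROL_PLANE_VIP:443 2>/dev/null \
    | openssl x509 -noout -enddate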

Workaround:

VMware 1.10, 1.11, 1.12, 1.13

Restarting or upgrading vCenter for versions lower than 7.0U2

If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired and new ones created.

Related govmomi bug.


Workaround:

This workaround is provided by VMware support:

  1. The issue is fixed in vCenter versions 7.0U2 and above.
  2. For lower versions, right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.

Operating system 1.10, 1.11, 1.12, 1.13

SSH connection closed by remote host

For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:

ClientAliveInterval 300
ClientAliveCountMax 0

The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:

Connection to [IP] closed by remote host.
Connection to [IP] closed.

Workaround:

You can either:

  • Use nohup to prevent your command from being terminated on SSH disconnection:
    nohup gkectl upgrade admin --config admin-cluster.yaml \
        --kubeconfig kubeconfig
    
  • Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value of less than 3:
    sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' \
        /etc/ssh/sshd_config
    sudo systemctl restart sshd
    

Make sure you reconnect your SSH session.

Installation 1.10, 1.11, 1.12, 1.13

Conflicting cert-manager installation

In 1.13 releases, monitoring-operator installs cert-manager in the cert-manager namespace. If for certain reasons you need to install your own cert-manager, follow the instructions below to avoid conflicts:

You only need to apply this workaround once for each cluster, and the changes will be preserved across cluster upgrades.

Note: One common symptom of installing your own cert-manager is that the cert-manager version or image (for example v1.7.2) may revert to an older version. This is caused by monitoring-operator trying to reconcile cert-manager and reverting the version in the process.

Workaround:

Avoid conflicts during upgrade

  1. Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up first.
  2. Perform the upgrade.
  3. Follow the instructions below to restore your own cert-manager.

Restore your own cert-manager in user clusters

  • Scale the monitoring-operator Deployment to 0:
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n USER_CLUSTER_NAME \
        scale deployment monitoring-operator --replicas=0
    
  • Scale the cert-manager deployments managed by monitoring-operator to 0:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager --replicas=0
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-cainjector\
        --replicas=0
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-webhook --replicas=0
    
  • Reinstall your version of cert-manager. Restore your customized resources if you have any.
  • You can skip this step if you are using the upstream default cert-manager installation, or you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
    relevant_fields='
    {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
        name: .metadata.name,
        namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
    }
    '
    f1=$(mktemp)
    f2=$(mktemp)
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        get issuer -n cert-manager metrics-pki.cluster.local -o json \
        | jq "${relevant_fields}" > $f1
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        get certificate -n cert-manager metrics-ca -o json \
        | jq "${relevant_fields}" > $f2
    kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
    kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2
    

Restore your own cert-manager in admin clusters

In general, you shouldn't need to re-install cert-manager in admin clusters, because admin clusters only run Anthos clusters on VMware control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, follow the instructions below to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.

  • Scale the monitoring-operator deployment to 0.
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n kube-system scale deployment monitoring-operator --replicas=0
    
  • Scale the cert-manager deployments managed by monitoring-operator to 0.
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager \
        --replicas=0
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
         -n cert-manager scale deployment cert-manager-cainjector \
         --replicas=0
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        -n cert-manager scale deployment cert-manager-webhook \
        --replicas=0
    
  • Reinstall your version of cert-manager. Restore your customized resources if you have any.
  • You can skip this step if you are using the upstream default cert-manager installation, or you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.
    relevant_fields='
    {
      apiVersion: .apiVersion,
      kind: .kind,
      metadata: {
        name: .metadata.name,
        namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
      },
      spec: .spec
    }
    '
    f3=$(mktemp)
    f4=$(mktemp)
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        get issuer -n cert-manager metrics-pki.cluster.local -o json \
        | jq "${relevant_fields}" > $f3
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
        get certificate -n cert-manager metrics-ca -o json \
        | jq "${relevant_fields}" > $f4
    kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
    kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4
    
Operating system 1.10, 1.11, 1.12, 1.13

False positives in docker, containerd, and runc vulnerability scanning

The Docker, containerd, and runc packages in the Ubuntu OS images shipped with Anthos clusters on VMware are pinned to special versions using the Ubuntu PPA. This ensures that any container runtime changes are qualified by Anthos clusters on VMware before each release.

However, the special versions are unknown to the Ubuntu CVE Tracker, which is used as the vulnerability feed by various CVE scanning tools. Therefore, you will see false positives in Docker, containerd, and runc vulnerability scanning results.

For example, you might see the following false positives from your CVE scanning results. These CVEs are already fixed in the latest patch versions of Anthos clusters on VMware.

Refer to the release notes for any CVE fixes.


Workaround:

Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.

Upgrades and updates 1.10, 1.11, 1.12, 1.13

Network connection between admin and user cluster might be unavailable for a short time during non-HA cluster upgrade

If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice that kubectl exec, kubectl logs, and webhooks against user clusters are unavailable for a short time. This downtime can be up to one minute. This happens because the incoming requests (kubectl exec, kubectl logs, and webhooks) are handled by the kube-apiserver for the user cluster. The user kube-apiserver is a StatefulSet, and in a non-HA cluster there is only one replica. So during the upgrade, there is a chance that the old kube-apiserver is unavailable while the new kube-apiserver is not yet ready.


Workaround:

This downtime only happens during the upgrade process. If you want shorter downtime during upgrades, we recommend switching to HA clusters.

Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

Konnectivity readiness check failed in HA cluster diagnose after cluster creation or upgrade

If you are creating or upgrading an HA cluster and notice that the konnectivity readiness check failed in cluster diagnose, in most cases it will not affect the functionality of Anthos clusters on VMware (kubectl exec, kubectl logs, and webhooks). This happens because one or two of the konnectivity replicas might be unready for a period of time due to unstable networking or other issues.


Workaround:

Konnectivity will recover by itself. Wait for 30 minutes to 1 hour and rerun cluster diagnose.

Operating system 1.7, 1.8, 1.9, 1.10, 1.11

/etc/cron.daily/aide CPU and memory spike issue

Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

As a result, the cron script /etc/cron.daily/aide has been installed to schedule an aide check, which ensures that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.

The cron job runs daily at 6:00 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.


Workaround:

If the spikes are affecting your workload, you can disable the daily cron job:

sudo chmod -x /etc/cron.daily/aide

Networking 1.10, 1.11, 1.12, 1.13

Load balancers and NSX-T stateful distributed firewall rules interact unpredictably

When deploying Anthos clusters on VMware version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap, causing gke-connect-agent Pods to be in a crash loop.

The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Anthos clusters on VMware releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K.


Workaround:

Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.

If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.

Logging and monitoring 1.9.0-1.9.4, 1.10.0-1.10.1

High network traffic to monitoring.googleapis.com

You might see high network traffic to monitoring.googleapis.com, even in a new cluster that has no user workloads.

This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.


Workaround:

Logging and monitoring 1.10, 1.11

gke-metrics-agent has frequent CrashLoopBackOff errors

For Anthos clusters on VMware version 1.10 and above, the gke-metrics-agent DaemonSet has frequent CrashLoopBackOff errors when enableStackdriverForApplications is set to true in the stackdriver object.


Workaround:

To mitigate this issue, disable application metrics collection by running the following commands. These commands will not disable application logs collection.

  1. To prevent the following changes from reverting, scale down stackdriver-operator:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        --namespace kube-system scale deploy stackdriver-operator \
        --replicas=0
    
    Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
  2. Open the gke-metrics-agent-conf ConfigMap for editing:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        --namespace kube-system edit configmap gke-metrics-agent-conf
    
  3. Under services.pipelines, comment out the entire metrics/app-metrics section:
    services:
      pipelines:
        #metrics/app-metrics:
        #  exporters:
        #  - googlecloud/app-metrics
        #  processors:
        #  - resource
        #  - metric_to_resource
        #  - infer_resource
        #  - disk_buffer/app-metrics
        #  receivers:
        #  - prometheus/app-metrics
        metrics/metrics:
          exporters:
          - googlecloud/metrics
          processors:
          - resource
          - metric_to_resource
          - infer_resource
          - disk_buffer/metrics
          receivers:
          - prometheus/metrics
    
  4. Close the editing session.
  5. Restart the gke-metrics-agent DaemonSet:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        --namespace kube-system rollout restart daemonset gke-metrics-agent
    
Logging and monitoring 1.11, 1.12, 1.13

Replace deprecated metrics in dashboard

If deprecated metrics are used in your out-of-the-box (OOTB) dashboards, you will see some empty charts. To find deprecated metrics in the Monitoring dashboards, run the following commands:

gcloud monitoring dashboards list > all-dashboard.json

# find deprecated metrics
cat all-dashboard.json | grep -E \
  'kube_daemonset_updated_number_scheduled\
    |kube_node_status_allocatable_cpu_cores\
    |kube_node_status_allocatable_pods\
    |kube_node_status_capacity_cpu_cores'

The following deprecated metrics should be migrated to their replacements.

Deprecated                                     Replacement
kube_daemonset_updated_number_scheduled        kube_daemonset_status_updated_number_scheduled
kube_node_status_allocatable_cpu_cores         kube_node_status_allocatable
kube_node_status_allocatable_memory_bytes      kube_node_status_allocatable
kube_node_status_allocatable_pods              kube_node_status_allocatable
kube_node_status_capacity_cpu_cores            kube_node_status_capacity
kube_node_status_capacity_memory_bytes         kube_node_status_capacity
kube_node_status_capacity_pods                 kube_node_status_capacity
kube_hpa_status_current_replicas               kube_horizontalpodautoscaler_status_current_replicas

Workaround:

To replace the deprecated metrics:

  1. Delete "GKE on-prem node status" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node status" following these instructions.
  2. Delete "GKE on-prem node utilization" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem node utilization" following these instructions.
  3. Delete "GKE on-prem vSphere vm health" in the Google Cloud Monitoring dashboard. Reinstall "GKE on-prem vSphere vm health" following these instructions.
  4. This deprecation is due to the upgrade of kube-state-metrics agent from v1.9 to v2.4, which is required for Kubernetes 1.22. You can replace all deprecated kube-state-metrics metrics, which have the prefix kube_, in your custom dashboards or alerting policies.

Logging and monitoring 1.10, 1.11, 1.12, 1.13

Unknown metric data in Cloud Monitoring

For Anthos clusters on VMware version 1.10 and above, the data for clusters in Cloud Monitoring may contain irrelevant summary metrics entries such as the following:

Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile

Other metric types that may have irrelevant summary metrics include:
  • apiserver_admission_step_admission_duration_seconds_summary
  • go_gc_duration_seconds
  • scheduler_scheduling_duration_seconds
  • gkeconnect_http_request_duration_seconds_summary
  • alertmanager_nflog_snapshot_duration_seconds_summary

While these summary type metrics are in the metrics list, they are not supported by gke-metrics-agent at this time.

Logging and monitoring 1.10, 1.11, 1.12, 1.13

Missing metrics on some nodes

You might find that the following metrics are missing on some, but not all, nodes:

  • kubernetes.io/anthos/container_memory_working_set_bytes
  • kubernetes.io/anthos/container_cpu_usage_seconds_total
  • kubernetes.io/anthos/container_network_receive_bytes_total

Workaround:

To fix this issue, perform the following steps as a workaround. For versions 1.9.5+, 1.10.2+, and 1.11.0: increase the CPU for gke-metrics-agent by following steps 1 - 4.

  1. Open your stackdriver resource for editing:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        --namespace kube-system edit stackdriver stackdriver
    
  2. To increase the CPU request for gke-metrics-agent from 10m to 50m and the CPU limit from 100m to 200m, add the following resourceAttrOverride section to the stackdriver manifest:
    spec:
      resourceAttrOverride:
        gke-metrics-agent/gke-metrics-agent:
          limits:
            cpu: 200m
            memory: 4608Mi
          requests:
            cpu: 50m
            memory: 200Mi
    
    Your edited resource should look similar to the following:
    spec:
      anthosDistribution: on-prem
      clusterLocation: us-west1-a
      clusterName: my-cluster
      enableStackdriverForApplications: true
      gcpServiceAccountSecretName: ...
      optimizedMetrics: true
      portable: true
      projectID: my-project-191923
      proxyConfigSecretName: ...
      resourceAttrOverride:
        gke-metrics-agent/gke-metrics-agent:
          limits:
            cpu: 200m
            memory: 4608Mi
          requests:
            cpu: 50m
            memory: 200Mi
    
  3. Save your changes and close the text editor.
  4. To verify your changes have taken effect, run the following command:
    kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
        --namespace kube-system get daemonset gke-metrics-agent -o yaml \
        | grep "cpu: 50m"
    
    The command finds cpu: 50m if your edits have taken effect.

Logging and monitoring 1.11.0-1.11.2, 1.12.0

Missing scheduler and controller-manager metrics in admin cluster

If your admin cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use

Workaround:

Upgrade to v1.11.3+, v1.12.1+, or v1.13+.

Logging and monitoring 1.11.0-1.11.2, 1.12.0

Missing scheduler and controller-manager metrics in user cluster

If your user cluster is affected by this issue, scheduler and controller-manager metrics are missing. For example, these two metrics are missing:

# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use

Workaround:

Installation 1.11, 1.12, 1.13

Installer fails when creating vSphere datadisk

The Anthos clusters on VMware installer can fail if custom roles are bound at the wrong permissions level.

When the role binding is incorrect, creating a vSphere datadisk with govc hangs, and the disk is created with a size of 0. To fix the issue, bind the custom role at the vSphere vCenter level (root).


Workaround:

If you want to bind the custom role at the DC level (or lower than root), you also need to bind the read-only role to the user at the root vCenter level.

For more information on role creation, see vCenter user account privileges.

Installation, Upgrades and updates 1.10, 1.11, 1.12, 1.13

Failure to register admin cluster during creation

If you create an admin cluster for version 1.9.x or 1.10.0, and if the admin cluster fails to register with the provided gkeConnect spec during its creation, you will get the following error.

Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH

You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.

failed to migrate to first admin trust chain: failed to parse current version "": invalid version: ""

Workaround:

Identity 1.10, 1.11, 1.12, 1.13

Using Anthos Identity Service can cause the Connect Agent to restart unpredictably

If you are using Anthos Identity Service to manage the Anthos Identity Service ClientConfig, the Connect Agent might restart unexpectedly.


Workaround:

If you have experienced this issue with an existing cluster, you can do one of the following:

  • Disable Anthos Identity Service (AIS). If you disable AIS, that will not remove the deployed AIS binary or remove AIS ClientConfig. To disable AIS, run this command:
    gcloud beta container hub identity-service disable \
        --project PROJECT_NAME
    
    Replace PROJECT_NAME with the name of the cluster's fleet host project.
  • Update the cluster to version 1.9.3 or later, or version 1.10.1 or later, so as to upgrade the Connect Agent version.
Networking 1.10, 1.11, 1.12, 1.13

Cisco ACI doesn't work with Direct Server Return (DSR)

Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning.


Workaround:

A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).

You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.

VMware 1.10, 1.11, 1.12, 1.13

vSphere 7.0 Update 3 issues

VMware has identified critical issues with the following vSphere 7.0 Update 3 releases:

  • vSphere ESXi 7.0 Update 3 (build 18644231)
  • vSphere ESXi 7.0 Update 3a (build 18825058)
  • vSphere ESXi 7.0 Update 3b (build 18905247)
  • vSphere vCenter 7.0 Update 3b (build 18901211)

Workaround:

VMware has since removed these releases. You should upgrade your ESXi hosts and vCenter Server to a newer version.

Operating system 1.10, 1.11, 1.12, 1.13

Failure to mount emptyDir volume as exec into Pod running on COS nodes

For Pods running on nodes that use Container-Optimized OS (COS) images, you cannot mount an emptyDir volume as exec. It mounts as noexec, and you will get the following error: exec user process caused: permission denied. For example, you will see this error message if you deploy the following test Pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
spec:
  containers:
  - args:
    - sleep
    - "5000"
    image: gcr.io/google-containers/busybox:latest
    name: test
    volumeMounts:
      - name: test-volume
        mountPath: /test-volume
    resources:
      limits:
        cpu: 200m
        memory: 512Mi
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
    - emptyDir: {}
      name: test-volume

In the test Pod, if you run mount | grep test-volume, it shows the noexec option:

/dev/sda1 on /test-volume type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30)

Workaround:

Upgrades and updates 1.10, 1.11, 1.12, 1.13

Cluster node pool replica update does not work after autoscaling has been disabled on the node pool

Node pool replicas do not update once autoscaling has been enabled and disabled on a node pool.


Workaround:

Remove the cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size and cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size annotations from the machine deployment of the corresponding node pool, as sketched below.
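
A sketch of the removal (this assumes the node pool's MachineDeployment lives in the user cluster's namespace in the admin cluster; the names are placeholders):

# The trailing "-" on each annotation key removes that annotation.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
    annotate machinedeployment MACHINE_DEPLOYMENT_NAME \
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size- \
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size-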

Logging and monitoring 1.11, 1.12, 1.13

Windows monitoring dashboards show data from Linux clusters

From version 1.11, on the out-of-the-box monitoring dashboards, the Windows Pod status dashboard and Windows node status dashboard also show data from Linux clusters. This is because the Windows node and Pod metrics are also exposed on Linux clusters.

Security 1.13

Kubelet service will be temporarily unavailable after NodeReady

There is a short period where the node is ready but the kubelet server certificate is not. kubectl exec and kubectl logs are unavailable during these tens of seconds. This is because it takes time for the new server certificate approver to see the updated valid IP addresses of the node.

This issue affects the kubelet server certificate only; it does not affect Pod scheduling.

Upgrades and updates 1.12

Partial admin cluster upgrade does not block later user cluster upgrade

User cluster upgrade failed with:

.LBKind in body is required (Check the status of OnPremUserCluster 'cl-stg-gdl-gke-onprem-mgmt/cl-stg-gdl' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.

The admin cluster is not fully upgraded, and the status version is still 1.10. The user cluster upgrade to 1.12 isn't blocked by any preflight check, and fails with a version skew issue.


Workaround:

Complete the admin cluster upgrade to 1.11 first, and then upgrade the user cluster to 1.12.

Storage 1.10.0-1.10.5, 1.11.0-1.11.2, 1.12.0

Datastore incorrectly reports insufficient free space

gkectl diagnose cluster command failed with:

Checking VSphere Datastore FreeSpace...FAILURE
    Reason: vCenter datastore: [DATASTORE_NAME] insufficient FreeSpace, requires at least [NUMBER] GB

The validation of datastore free space should not be used for existing cluster node pools, and was added in gkectl diagnose cluster by mistake.


Workaround:

You can ignore the error message or skip the validation using --skip-validation-infra.
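
For example (a sketch; the kubeconfig path and cluster name are placeholders):

gkectl diagnose cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --cluster-name CLUSTER_NAME --skip-validation-infra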

Operation, Networking 1.11, 1.12.0-1.12.1

Failure to add new user cluster when admin cluster is using MetalLB load balancer

You may not be able to add a new user cluster if your admin cluster is set up with a MetalLB load balancer configuration.

The user cluster deletion process may get stuck for some reason, which results in an invalidation of the MetalLB ConfigMap. It won't be possible to add a new user cluster in this state.


Workaround:

You can force delete your user cluster.

Installation, Operating system 1.10, 1.11, 1.12, 1.13

Failure when using Container-Optimized OS (COS) for user cluster

If osImageType is set to cos for the admin cluster, and gkectl check-config is executed after admin cluster creation and before user cluster creation, it fails on:

Failed to create the test VMs: VM failed to get IP addresses on the network.

The test VM created for the user cluster check-config uses the same osImageType as the admin cluster by default, and currently the test VM is not compatible with COS.


Workaround:

To avoid the slow preflight check that creates the test VM, use gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --fast.

Logging and monitoring 1.12.0-1.12.1

Grafana in the admin cluster unable to reach user clusters

This issue affects customers using Grafana in the admin cluster to monitor user clusters in Anthos clusters on VMware versions 1.12.0 and 1.12.1. It comes from a mismatch of pushprox-client certificates in user clusters and the allowlist in the pushprox-server in the admin cluster. The symptom is pushprox-client in user clusters printing error logs like the following:

level=error ts=2022-08-02T13:34:49.41999813Z caller=client.go:166 msg="Error reading request:" err="invalid method \"RBAC:\""

Workaround:

Other 1.11.3

gkectl repair admin-master does not provide the VM template to be used for recovery

gkectl repair admin-master command failed with:

Failed to repair: failed to select the template: no VM templates is available for repairing the admin master (check if the admin cluster version >= 1.4.0 or contact support

gkectl repair admin-master is not able to fetch the VM template to be used for repairing the admin control plane VM if the name of the admin control plane VM ends with the characters t, m, p, or l.


Workaround:

Rerun the command with --skip-validation.

Logging and monitoring 1.11

Cloud Audit Logging failure due to permission denied

Anthos Cloud Audit Logging needs a special permission setup that is currently only performed automatically for user clusters through GKE Hub. It is recommended to have at least one user cluster that uses the same project ID and service account as the admin cluster for Cloud Audit Logging, so that the admin cluster has the permission needed for Cloud Audit Logging.

However, in cases where the admin cluster uses a different project ID or a different service account from every user cluster, audit logs from the admin cluster fail to be injected into the cloud. The symptom is a series of Permission Denied errors in the audit-proxy Pod in the admin cluster.


Workaround:

Operation, Security 1.11

gkectl diagnose checking certificates failure

If your workstation does not have access to user cluster worker nodes, you will see the following failures when running gkectl diagnose:

Checking user cluster certificates...FAILURE
    Reason: 3 user cluster certificates error(s).
    Unhealthy Resources:
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out

If your workstation does not have access to admin cluster worker nodes, you will see the following failures when running gkectl diagnose:

Checking admin cluster certificates...FAILURE
    Reason: 3 admin cluster certificates error(s).
    Unhealthy Resources:
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out

Workaround:

It is safe to ignore these messages.

Operating system 1.8, 1.9, 1.10, 1.11, 1.12, 1.13

/var/log/audit/ filling up disk space on Admin workstation

/var/log/audit/ is filled with audit logs. You can check the disk usage by running sudo du -h -d 1 /var/log/audit.

Certain gkectl commands on the admin workstation, for example, gkectl diagnose snapshot, contribute to disk space usage.

Since Anthos v1.8, the Ubuntu image is hardened with the CIS Level 2 Benchmark. One of the compliance rules, "4.1.2.2 Ensure audit logs are not automatically deleted", enforces the auditd setting max_log_file_action = keep_logs, which keeps all the audit logs on the disk.


Workaround:
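
Given the stated root cause, one possible mitigation on the admin workstation is to switch auditd from keep_logs to log rotation. This is a sketch, not the official procedure, and it intentionally relaxes CIS control 4.1.2.2:

# Assumption: auditd.conf still contains the hardened keep_logs setting.
sudo sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/' /etc/audit/auditd.conf
# On some distributions you may need "sudo service auditd restart" instead.
sudo systemctl restart auditd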

Networking 1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0

NetworkGatewayGroup Floating IP conflicts with node address

Users are unable to create or update NetworkGatewayGroup objects because of the following validating webhook error:

[1] admission webhook "vnetworkgatewaygroup.kb.io" denied the request: NetworkGatewayGroup.networking.gke.io "default" is invalid: [Spec.FloatingIPs: Invalid value: "10.0.0.100": IP address conflicts with node address with name: "my-node-name"

In affected versions, the kubelet can erroneously bind to a floating IP address assigned to the node and report it as a node address in node.status.addresses. The validating webhook checks NetworkGatewayGroup floating IP addresses against all node.status.addresses in the cluster and sees this as a conflict.


Workaround:

In the same cluster where create or update of NetworkGatewayGroup objects is failing, temporarily disable the ANG validating webhook and submit your change:

  1. Save the webhook config so it can be restored at the end:
    kubectl -n kube-system get validatingwebhookconfiguration \
        ang-validating-webhook-configuration -o yaml > webhook-config.yaml
    
  2. Edit the webhook config:
    kubectl -n kube-system edit validatingwebhookconfiguration \
        ang-validating-webhook-configuration
    
  3. Remove the vnetworkgatewaygroup.kb.io item from the webhook config list, then save and close the editor to apply the changes.
  4. Create or edit your NetworkGatewayGroup object.
  5. Reapply the original webhook config:
    kubectl -n kube-system apply -f webhook-config.yaml
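
Alternatively, instead of the interactive edit in steps 2 and 3, you can remove the webhook entry with a JSON patch. This is a sketch that assumes vnetworkgatewaygroup.kb.io is the first (index 0) entry in the webhooks list; check its index in your saved webhook-config.yaml before running it:

kubectl patch validatingwebhookconfiguration \
    ang-validating-webhook-configuration --type=json \
    -p='[{"op": "remove", "path": "/webhooks/0"}]'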
    
Installation, Upgrades and updates 1.10.0-1.10.2

Creating or upgrading admin cluster timeout

During an admin cluster creation or upgrade attempt, the admin control plane VM might get stuck during creation. The admin control plane VM goes into an infinite waiting loop during boot, and you will see the following repeating error in the /var/log/cloud-init-output.log file:

+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ head -n 1
+++ grep -v 192.168.231.1
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
++ echo
+ '[' -n '' ']'
+ sleep 1
+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
+++ grep -v 192.168.231.1
+++ head -n 1
++ echo
+ '[' -n '' ']'
+ sleep 1

This is because when Anthos clusters on VMware tries to get the node IP address in the startup script, it uses grep -v ADMIN_CONTROL_PLANE_VIP to skip the admin cluster control-plane VIP, which can also be assigned to the NIC. However, the command also skips over any IP address that has the control-plane VIP as a prefix, which causes the startup script to hang.

For example, suppose that the admin cluster control-plane VIP is 192.168.1.25. If the IP address of the admin cluster control-plane VM has the same prefix, for example, 192.168.1.254, then the control-plane VM will get stuck during creation. This issue can also be triggered if the broadcast address has the same prefix as the control-plane VIP, for example, 192.168.1.255.
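
The overly broad match is easy to reproduce in a shell. This illustrative snippet (not taken from the startup script itself) shows how grep -v on the VIP string also drops addresses that merely start with it:

# Both addresses contain the substring "192.168.1.25", so grep -v
# filters them out along with the VIP itself:
printf '192.168.1.254\n192.168.1.255\n' | grep -v 192.168.1.25
# (prints nothing)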


Workaround:

  • If the admin cluster creation timeout is caused by the broadcast IP address, run the following command on the admin cluster control-plane VM:
    ip addr add ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192
    
    This adds an address entry without a broadcast address, which unblocks the boot process. After the startup script is unblocked, remove the added address by running the following command:
    ip addr del ${ADMIN_CONTROL_PLANE_NODE_IP}/32 dev ens192
    
  • However, if the admin cluster creation timeout is caused by the IP address of the control-plane VM itself, you cannot unblock the startup script. Switch to a different IP address and recreate the admin cluster, or upgrade to version 1.10.3 or later.
Operating system, Upgrades and updates 1.10.0-1.10.2

The state of the admin cluster using COS image will get lost upon admin cluster upgrade or admin master repair

When the admin cluster uses a COS image (a preview feature), the data disk is not mounted correctly on the admin cluster master node, and the state of the admin cluster is lost upon admin cluster upgrade or admin master repair.


Workaround:

Re-create the admin cluster with osImageType set to ubuntu_containerd.

After you create an admin cluster with osImageType set to cos, you can check whether the data disk is mounted correctly: get the admin cluster SSH key, SSH into the admin master node, and verify that the df -h output contains /dev/sdb1 98G 209M 93G 1% /opt/data and that the lsblk output contains -sdb1 8:17 0 100G 0 part /opt/data.

Operating system 1.10

systemd-resolved failed DNS lookup on .local domains

In Anthos clusters on VMware version 1.10.0, name resolutions on Ubuntu are routed to the local systemd-resolved listening on 127.0.0.53 by default. The reason is that on the Ubuntu 20.04 image used in version 1.10.0, /etc/resolv.conf is symlinked to /run/systemd/resolve/stub-resolv.conf, which points to the 127.0.0.53 localhost DNS stub.

As a result, the localhost DNS stub refuses to forward queries for names with a .local suffix to the upstream DNS servers (specified in /run/systemd/resolve/resolv.conf), unless the names are listed as search domains.

This causes any lookups for .local names to fail. For example, during node startup, kubelet fails on pulling images from a private registry with a .local suffix. Specifying a vCenter address with a .local suffix will not work on an admin workstation.


Workaround:

You can avoid this issue for cluster nodes by specifying the searchDomainsForDNS field in your admin cluster configuration file and user cluster configuration file to include the domains.

gkectl update doesn't yet support updating the searchDomainsForDNS field.

Therefore, if you haven't set up this field before cluster creation, you must SSH into the nodes and bypass the local systemd-resolved stub by changing the symlink of /etc/resolv.conf from /run/systemd/resolve/stub-resolv.conf (which contains the 127.0.0.53 local stub) to /run/systemd/resolve/resolv.conf (which points to the actual upstream DNS):

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
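
As an optional sanity check (not part of the original steps), confirm that the symlink now resolves to the upstream configuration:

readlink -f /etc/resolv.conf
# Expected output: /run/systemd/resolve/resolv.conf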

As for the admin workstation, gkeadm doesn't support specifying search domains, so you must work around this issue with this manual step.

This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.

Installation, Operating system 1.10

Docker bridge IP uses 172.17.0.1/16 instead of 169.254.123.1/24

Anthos clusters on VMware specifies a dedicated subnet for the Docker bridge IP address with --bip=169.254.123.1/24, so that it won't reserve the default 172.17.0.1/16 subnet. However, in version 1.10.0, there is a bug in the Ubuntu OS image that causes the customized Docker config to be ignored.

As a result, Docker picks the default 172.17.0.1/16 as its bridge IP address subnet. This might cause an IP address conflict if you already have workload running within that IP address range.


Workaround:

To work around this issue, rename the systemd drop-in config file for dockerd (systemd only applies drop-in files that end in .conf), and then restart the service:

sudo mv /etc/systemd/system/docker.service.d/50-cloudimg-settings.cfg \
    /etc/systemd/system/docker.service.d/50-cloudimg-settings.conf

sudo systemctl daemon-reload

sudo systemctl restart docker

Verify that Docker picks the correct bridge IP address:

ip a | grep docker0
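
If the workaround took effect, the output shows an address in the dedicated 169.254.123.1/24 subnet rather than 172.17.0.1/16, similar to this illustrative line:

inet 169.254.123.1/24 brd 169.254.123.255 scope global docker0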

This solution does not persist across VM re-creations. You must reapply this workaround whenever VMs are re-created.

If you need additional assistance, reach out to Google support.