Version 1.8. This version is supported as outlined in the Anthos version support policy, offering the latest patches and updates for security vulnerabilities, exposures, and issues impacting Anthos clusters on VMware (GKE on-prem). Refer to the release notes for more details. This is the most recent version.

Known issues

This document describes known issues for version 1.8 of Anthos clusters on VMware (GKE on-prem).

ClientConfig custom resource

gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.
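
One possible way to take such a backup is sketched below. It assumes the ClientConfig resource is the one named default in the kube-public namespace (the same resource that is patched later in this document). The function name backup_clientconfig and the backup directory are hypothetical.

```shell
# Save a timestamped YAML copy of the ClientConfig named "default" in kube-public.
# backup_clientconfig, KUBECONFIG_PATH and BACKUP_DIR are illustrative names only.
backup_clientconfig() {
  local kubeconfig="$1"
  local backup_dir="$2"
  local out
  out="${backup_dir}/clientconfig-$(date +%Y%m%d-%H%M%S).yaml"
  mkdir -p "${backup_dir}" || return 1
  # Remove the partial file if kubectl fails, so that no empty backup is left behind.
  if ! kubectl --kubeconfig "${kubeconfig}" get clientconfig default \
      -n kube-public -o yaml > "${out}"; then
    rm -f "${out}"
    return 1
  fi
  # Print the path of the backup that was written.
  echo "${out}"
}

# Example (placeholders):
# backup_clientconfig KUBECONFIG_PATH BACKUP_DIR
```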

kubectl describe CSINode and gkectl diagnose snapshot

kubectl describe CSINode and gkectl diagnose snapshot sometimes fail because of an OSS Kubernetes issue with dereferencing nil pointer fields.

OIDC and the CA certificate

The OIDC provider doesn't use the common CA by default. You must explicitly supply the CA certificate.

Upgrading the admin cluster from 1.5 to 1.6.0 breaks 1.5 user clusters that use an OIDC provider and have no value for authentication.oidc.capath in the user cluster configuration file.

To work around this issue, run the following script:

USER_CLUSTER_KUBECONFIG=YOUR_USER_CLUSTER_KUBECONFIG

IDENTITY_PROVIDER=YOUR_OIDC_PROVIDER_ADDRESS

openssl s_client -showcerts -verify 5 -connect $IDENTITY_PROVIDER:443 < /dev/null | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/{ if(/BEGIN CERTIFICATE/){i++}; out="tmpcert"i".pem"; print >out}'

ROOT_CA_ISSUED_CERT=$(ls tmpcert*.pem | tail -1)

ROOT_CA_CERT="/etc/ssl/certs/$(openssl x509 -in $ROOT_CA_ISSUED_CERT -noout -issuer_hash).0"

cat tmpcert*.pem $ROOT_CA_CERT > certchain.pem

CERT=$(echo $(base64 certchain.pem) | sed 's\ \\g')

rm tmpcert1.pem tmpcert2.pem

kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG patch clientconfig default -n kube-public --type json -p "[{ \"op\": \"replace\", \"path\": \"/spec/authentication/0/oidc/certificateAuthorityData\", \"value\":\"${CERT}\"}]"

Replace the following:

  • YOUR_OIDC_PROVIDER_ADDRESS: The address of your OIDC provider.

  • YOUR_USER_CLUSTER_KUBECONFIG: The path of your user cluster kubeconfig file.
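
Before you apply the patch, it can help to confirm that the base64 value in CERT decodes to a PEM chain with the number of certificates you expect. The following is a minimal sketch of such a check. The function name count_certs_in_b64 is hypothetical, and the check only counts PEM headers; it does not validate the chain.

```shell
# Count the PEM certificates contained in a base64-encoded chain.
# count_certs_in_b64 is an illustrative helper, not part of any tool.
count_certs_in_b64() {
  # Decode the value and count the "BEGIN CERTIFICATE" header lines.
  printf '%s' "$1" | base64 --decode | grep -c 'BEGIN CERTIFICATE'
}

# Example: count_certs_in_b64 "${CERT}"
```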

gkectl check-config validation fails: can't find F5 BIG-IP partitions

Symptoms

Validation fails because F5 BIG-IP partitions can't be found, even though they exist.

Potential causes

An issue with the F5 BIG-IP API can cause validation to fail.

Resolution

Try running gkectl check-config again.
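
If the failure is intermittent, a small retry wrapper can save repeated manual runs. The sketch below is generic shell and is not specific to gkectl. The function name retry, the attempt count, and the delay are assumptions chosen for illustration.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
    attempt=$((attempt + 1))
  done
}

# Example (pass the same flags that you normally use):
# retry 3 10 gkectl check-config YOUR_USUAL_FLAGS
```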

Disruption for workloads with PodDisruptionBudgets

Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).

Nodes fail to complete their upgrade process

If you have Anthos Service Mesh or OSS Istio installed on your cluster, depending on your PodDisruptionBudget settings for the Istio components, user nodes might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you increase the Horizontal Pod Autoscaling minReplicas setting from 1 to 2 for the components in the istio-system namespace before you upgrade. This will ensure that you always have an instance of the ASM control plane running.

If you have Anthos Service Mesh 1.5+ or OSS Istio 1.5+:

kubectl patch hpa -n istio-system istio-ingressgateway -p '{"spec":{"minReplicas": 2}}' --type=merge
kubectl patch hpa -n istio-system istiod -p '{"spec":{"minReplicas": 2}}' --type=merge

If you have Anthos Service Mesh 1.4.x or OSS Istio 1.4.x:

kubectl patch hpa -n istio-system istio-galley -p '{"spec":{"minReplicas": 2}}' --type=merge
kubectl patch hpa -n istio-system istio-ingressgateway -p '{"spec":{"minReplicas": 2}}' --type=merge
kubectl patch hpa -n istio-system istio-nodeagent -p '{"spec":{"minReplicas": 2}}' --type=merge
kubectl patch hpa -n istio-system istio-pilot -p '{"spec":{"minReplicas": 2}}' --type=merge
kubectl patch hpa -n istio-system istio-sidecar-injector -p '{"spec":{"minReplicas": 2}}' --type=merge
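
To see which HorizontalPodAutoscalers still need the change, before or after patching, you could list the ones whose minReplicas is below 2. The following is a rough sketch. The function name hpas_below_min is hypothetical, and an unset minReplicas (printed by kubectl as <none>) is treated as below the threshold.

```shell
# Print "NAME MIN" for each HPA in istio-system whose minReplicas is below
# the given threshold (default 2). hpas_below_min is an illustrative name.
hpas_below_min() {
  local threshold="${1:-2}"
  kubectl get hpa -n istio-system --no-headers \
    -o custom-columns=NAME:.metadata.name,MIN:.spec.minReplicas \
  | awk -v t="${threshold}" '$2 + 0 < t { print $1, $2 }'
}

# Example: hpas_below_min 2
```

No output means that every HPA in the namespace already has minReplicas of at least 2.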

Log Forwarder makes an excessive number of OAuth 2.0 requests

With Anthos clusters on VMware, version 1.7.1, you might experience issues with Log Forwarder consuming memory by making excessive OAuth 2.0 requests. Here is a workaround, in which you downgrade the stackdriver-operator version, clean up the disk, and restart Log Forwarder.

Step 0: Download images to your private registry if appropriate

If you use a private registry, follow these steps to download these images to your private registry before proceeding. Omit this step if you do not use a private registry.

Replace PRIVATE_REGISTRY_HOST with the hostname or IP address of your private Docker registry.

stackdriver-operator

docker pull gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440

docker tag gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440 \
    PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440

docker push PRIVATE_REGISTRY_HOST/stackdriver-operator:v0.0.440

fluent-bit

docker pull gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3

docker tag gcr.io/gke-on-prem-release/fluent-bit:v1.6.10-gke.3 \
    PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3

docker push PRIVATE_REGISTRY_HOST/fluent-bit:v1.6.10-gke.3

prometheus

docker pull gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0

docker tag gcr.io/gke-on-prem-release/prometheus:2.18.1-gke.0 \
    PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0

docker push PRIVATE_REGISTRY_HOST/prometheus:2.18.1-gke.0
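
The three pull, tag, and push sequences above follow the same pattern, so they can be expressed as one loop. This is a sketch only. The function name mirror_images is hypothetical, and the image names and tags are the ones listed above.

```shell
# Pull each image from gcr.io/gke-on-prem-release, retag it for the private
# registry, and push it. mirror_images is an illustrative helper name.
mirror_images() {
  local registry="$1"
  shift
  local image
  for image in "$@"; do
    docker pull "gcr.io/gke-on-prem-release/${image}" || return 1
    docker tag "gcr.io/gke-on-prem-release/${image}" "${registry}/${image}" || return 1
    docker push "${registry}/${image}" || return 1
  done
}

# Example:
# mirror_images PRIVATE_REGISTRY_HOST \
#   stackdriver-operator:v0.0.440 fluent-bit:v1.6.10-gke.3 prometheus:2.18.1-gke.0
```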

Step 1: Downgrade the stackdriver-operator version

Run the following command to downgrade your version of stackdriver-operator.

kubectl  --kubeconfig  -n kube-system patch deployment stackdriver-operator -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"stackdriver-operator","image":"gcr.io/gke-on-prem-release/stackdriver-operator:v0.0.440"}]}}}}'
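
To confirm that the patch took effect, you could read back the image of the stackdriver-operator container. The following is a minimal sketch. The function name operator_image and the kubeconfig argument are assumptions for illustration.

```shell
# Print the image currently set on the stackdriver-operator container.
# operator_image is an illustrative name; $1 is the path to a kubeconfig file.
operator_image() {
  kubectl --kubeconfig "$1" -n kube-system get deployment stackdriver-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="stackdriver-operator")].image}'
}

# Example: operator_image KUBECONFIG_PATH
```

After the downgrade, the printed image should end in v0.0.440.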

Step 2: Clean up the disk buffer for Log Forwarder

  • Deploy the DaemonSet in the cluster to clean up the buffer.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluent-bit-cleanup
  template:
    metadata:
      labels:
        app: fluent-bit-cleanup
    spec:
      containers:
      - name: fluent-bit-cleanup
        image: debian:10-slim
        command: ["bash", "-c"]
        args:
        - |
          rm -rf /var/log/fluent-bit-buffers/
          echo "Fluent Bit local buffer is cleaned up."
          sleep 3600
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        securityContext:
          privileged: true
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.gke.io/observability
        effect: NoSchedule
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
  • Verify the disk buffer is cleaned up.
kubectl --kubeconfig  logs -n kube-system -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l

The output shows the number of nodes in the cluster.

kubectl --kubeconfig  -n kube-system get pods -l app=fluent-bit-cleanup --no-headers | wc -l

The output shows the number of nodes in the cluster.

  • Delete the cleanup DaemonSet.
kubectl --kubeconfig  -n kube-system delete ds fluent-bit-cleanup
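
The verification step compares two counts by eye. Before you delete the cleanup DaemonSet, you could script that comparison along these lines. This is a sketch only. The function name cleanup_finished and the kubeconfig argument are hypothetical.

```shell
# Succeed only if the number of "cleaned up" log lines equals the number of
# nodes in the cluster. cleanup_finished is an illustrative name.
cleanup_finished() {
  local kubeconfig="$1"
  local nodes cleaned
  nodes="$(kubectl --kubeconfig "${kubeconfig}" get nodes --no-headers \
    | awk 'END { print NR }')"
  cleaned="$(kubectl --kubeconfig "${kubeconfig}" logs -n kube-system \
    -l app=fluent-bit-cleanup | grep -c "cleaned up")"
  echo "nodes=${nodes} cleaned=${cleaned}"
  [ "${nodes}" -gt 0 ] && [ "${nodes}" -eq "${cleaned}" ]
}

# Example: cleanup_finished KUBECONFIG_PATH && echo "safe to delete the DaemonSet"
```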

Step 3: Restart Log Forwarder

kubectl --kubeconfig  -n kube-system rollout restart ds/stackdriver-log-forwarder

Logs and metrics are not sent to project specified by stackdriver.projectID

In Anthos clusters on VMware 1.7, logs are sent to the parent project of the service account specified in the stackdriver.serviceAccountKeyPath field of your cluster configuration file. The value of stackdriver.projectID is ignored. This issue will be fixed in an upcoming release.

As a workaround, view logs in the parent project of your logging-monitoring service account.
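
If you are unsure which project that is, the service account JSON key file records the project in which the service account was created, in its project_id field. The sketch below extracts that value with sed. The function name key_project_id is hypothetical, and the sketch assumes a standard JSON key file.

```shell
# Print the project_id recorded in a service account JSON key file.
# key_project_id is an illustrative name; $1 is the path to the key file.
key_project_id() {
  local key_file="$1"
  sed -n 's/.*"project_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "${key_file}" \
    | head -n 1
}

# Example: key_project_id PATH_TO_LOGGING_MONITORING_KEY
```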

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

Admin cluster certificate renewal process

  1. Make sure that OpenSSL is installed on the admin workstation before you begin.

  2. Set the KUBECONFIG variable:

    KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
    

    Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.

  3. Get the IP address and SSH keys for the admin master node:

    kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
    -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
    ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key
    
    export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
    jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
    --selector='node-role.kubernetes.io/master')
    
  4. Check if the certificates are expired:

    ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
    "sudo kubeadm alpha certs check-expiration"
    

    If the certificates are expired, you must renew them before upgrading the admin cluster.

  5. Back up old certificates:

    This is an optional, but recommended, step.

    # ssh into admin master
    ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
    
    # on admin master
    sudo tar -czvf backup.tar.gz /etc/kubernetes
    logout
    
    # on admin workstation
    sudo scp -i ~/.ssh/admin-cluster.key \
    ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
    
  6. Renew the certificates with kubeadm:

     # ssh into admin master
     ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
     # on admin master
     sudo kubeadm alpha certs renew all
     

  7. Restart the admin master node:

      # on admin master
      cd /etc/kubernetes
      sudo mkdir tempdir
      sudo mv manifests/*.yaml tempdir/
      sleep 5
      echo "remove pods"
      # make sure that kubelet detects the change and removes the pods
      # wait until the result of this command is empty
      sudo docker ps | grep kube-apiserver
    
      # move the manifests back so that kubelet starts the pods again
      echo "start pods again"
      sudo mv tempdir/*.yaml manifests/
      sleep 30
      # make sure that kubelet has started the pods again
      # should show some results
      sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd
    
      # clean up
      sudo rm -rf tempdir
    
      logout
     
  8. Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.

    • Back up the admin cluster kubeconfig file:

      ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
      "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
      vi "${KUBECONFIG}"

    • Replace client-certificate-data and client-key-data in kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.

  9. You must validate the renewed certificates, and validate the certificate of kube-apiserver.

    • Check certificates expiration:

      ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
      "sudo kubeadm alpha certs check-expiration"

    • Check certificate of kube-apiserver:

      # Get the IP address of kube-apiserver
      cat $KUBECONFIG | grep server
      # Get the current kube-apiserver certificate
      openssl s_client -showcerts -connect : \
      | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
      > current-kube-apiserver.crt
      # check expiration date of this cert
      openssl x509 -in current-kube-apiserver.crt -noout -enddate
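
As a complement to reading the expiration date by eye, openssl can report whether a certificate file remains valid for a given number of days by using the -checkend option. The following is a minimal sketch. The function name cert_valid_for_days and the 30-day window in the example are assumptions.

```shell
# Succeed if the certificate in file $1 is still valid $2 days from now.
# cert_valid_for_days is an illustrative name. openssl x509 -checkend takes a
# number of seconds and exits non-zero if the certificate expires within it.
cert_valid_for_days() {
  local cert_file="$1" days="$2"
  openssl x509 -in "${cert_file}" -noout -checkend "$((days * 86400))" > /dev/null
}

# Example:
# cert_valid_for_days current-kube-apiserver.crt 30 || echo "renew before upgrading"
```

A non-zero result can also mean that the file could not be read, so check that the certificate file exists before you act on the result.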

/etc/cron.daily/aide script uses up all space in /run, causing a crashloop in Pods

Starting from Anthos clusters on VMware 1.7.2, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark. As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled, which satisfies the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked".

The script uses /run/aide as a temporary directory to save its cron logs, and over time it could use up all the space in /run. See /etc/cron.daily/aide script uses all space in /run for a workaround.

If you see one or more Pods crashlooping on a node, run df -h /run on the node. If the command output shows 100% space usage, then you are likely experiencing this issue.

This issue is fixed in version 1.8.1. For the 1.7.2 and 1.8.0 versions, you can resolve this issue manually with either of the following two workarounds:

  1. Periodically remove the log files at /run/aide/cron.daily.old* (recommended).
  2. Follow the steps mentioned in /etc/cron.daily/aide script uses all space in /run. (Note: this workaround could potentially affect the node compliance state).
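
The check and the first workaround can be sketched as two small helpers that you run on the affected node. The function names usage_percent and remove_old_aide_logs are hypothetical. Removing the files under /run/aide is likely to require root privileges.

```shell
# Print the used-space percentage (without the % sign) of the filesystem
# that holds the given path. usage_percent is an illustrative name.
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Remove the old aide cron logs. The directory defaults to /run/aide.
# remove_old_aide_logs is an illustrative name.
remove_old_aide_logs() {
  local dir="${1:-/run/aide}"
  rm -f "${dir}"/cron.daily.old*
}

# Example (run as root on the node):
# [ "$(usage_percent /run)" -ge 90 ] && remove_old_aide_logs
```

The 90 percent threshold in the example is arbitrary. To run the cleanup periodically, you could call it from a scheduled job of your choice.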

Upgrading Seesaw load balancer with version 1.8.0

If you use gkectl upgrade loadbalancer to attempt to update some parameters of the Seesaw load balancer in version 1.8.0, the update does not work in either DHCP or IPAM mode. If your setup includes this configuration, do not upgrade to version 1.8.0; upgrade to version 1.8.1 or later instead.

Cannot log in to admin workstation due to password expiry issue

You might experience this issue if you are using one of the following versions of Anthos clusters on VMware.

  • 1.7.2-gke.2
  • 1.7.3-gke.2
  • 1.8.0-gke.21
  • 1.8.0-gke.24
  • 1.8.0-gke.25
  • 1.8.1-gke.7
  • 1.8.2-gke.8

You might get the following error when you attempt to SSH into your Anthos VMs, including the admin workstation, cluster nodes, and Seesaw nodes:

WARNING: Your password has expired.

This error occurs because the ubuntu user password on the VMs has expired. You must manually reset the user password's expiration time to a large value before logging into the VMs.

Prevention of password expiry error

If you are running the affected versions listed above, and the user password hasn't expired yet, you should extend the expiration time before seeing the SSH error.

Run the following command on each Anthos VM:

sudo chage -M 99999 ubuntu
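
If you have many VMs, you could run the command over SSH in a loop instead of logging in to each VM. The sketch below assumes that one SSH key works for the ubuntu user on every VM that you list, which might not hold in your environment. The function name extend_password_expiry is hypothetical.

```shell
# Run "sudo chage -M 99999 ubuntu" on each listed VM over SSH.
# extend_password_expiry is an illustrative name; $1 is the SSH private key
# and the remaining arguments are VM IP addresses.
extend_password_expiry() {
  local key="$1"
  shift
  local ip failed=0
  for ip in "$@"; do
    if ssh -i "${key}" "ubuntu@${ip}" "sudo chage -M 99999 ubuntu"; then
      echo "ok ${ip}"
    else
      echo "failed ${ip}" >&2
      failed=1
    fi
  done
  # Return non-zero if any VM could not be updated.
  return "${failed}"
}

# Example: extend_password_expiry SSH_KEY_PATH VM_IP_1 VM_IP_2
```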

Mitigation of password expiry error

If the user password has already expired and you can't log in to the VMs to extend the expiration time, perform the following mitigation steps for each component.

Admin workstation

Use a temporary VM to perform the following steps. You can create an admin workstation using the 1.7.1-gke.4 version to use as the temporary VM.

  1. Ensure the temporary VM and the admin workstation are in a power off state.

  2. Attach the boot disk of the admin workstation to the temporary VM. The boot disk is the one with the label "Hard disk 1".

  3. Mount the boot disk inside the VM by running these commands. Substitute your own boot disk identifier for /dev/sdc1.

    sudo mkdir -p /mnt/boot-disk
    sudo mount /dev/sdc1 /mnt/boot-disk
    
  4. Set the ubuntu user expiration date to a large value such as 99999 days.

    sudo chroot /mnt/boot-disk chage -M 99999 ubuntu
    
  5. Shut down the temporary VM.

  6. Power on the admin workstation. You should now be able to SSH as usual.

  7. As cleanup, delete the temporary VM.

Admin cluster control plane VM

Follow the instructions to recreate the admin cluster control plane VM.

Admin cluster addon VMs

Run the following command from the admin workstation to recreate the VM:

  kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch machinedeployment gke-admin-node --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"
  

After you run this command, wait for the admin cluster addon VMs to finish recreation and to be ready before you continue with the next steps.
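
One rough way to wait is kubectl wait, sketched below. The function name wait_for_nodes_ready and the default timeout are assumptions. Note that this only checks that the nodes currently registered in the cluster report Ready; it does not by itself prove that the old VMs have been replaced, so also confirm the recreation in your usual way.

```shell
# Block until all registered nodes report the Ready condition, or time out.
# wait_for_nodes_ready is an illustrative name; $1 is a kubeconfig path and
# $2 is an optional timeout such as 600s.
wait_for_nodes_ready() {
  local kubeconfig="$1" timeout="${2:-600s}"
  kubectl --kubeconfig "${kubeconfig}" wait --for=condition=Ready nodes --all \
    --timeout="${timeout}"
}

# Example: wait_for_nodes_ready ADMIN_CLUSTER_KUBECONFIG 900s
```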

User cluster control plane VMs

Run the following command from the admin workstation to recreate the VMs:

usermaster=`kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG get machinedeployments -l set=user-master -o name` && kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch $usermaster --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"

After you run this command, wait for the user cluster control plane VMs to finish recreation and to be ready before you continue with the next steps.

User cluster worker VMs

Run the following command from the admin workstation to recreate the VMs.

for md in `kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG get machinedeployments -l set=node -o name`; do kubectl patch --kubeconfig=USER_CLUSTER_KUBECONFIG $md --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"; done

Seesaw VMs

Run the following commands from the admin workstation to recreate the Seesaw VMs. There will be some downtime. If HA is enabled for the load balancer, the maximum downtime is two seconds.

gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG --admin-cluster --no-diff
gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --no-diff

Restarting or upgrading vCenter for versions lower than 7.0U2

If vCenter, for versions lower than 7.0U2, is restarted (after an upgrade or otherwise), the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired, which creates new nodes.

Related govmomi bug: https://github.com/vmware/govmomi/issues/2552

This workaround is provided by VMware support:

1. The issue is fixed in vCenter versions 7.0U2 and above.

2. For lower versions:
Right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.