This document describes known issues for version 1.8 of Anthos clusters on VMware (GKE on-prem).
ClientConfig custom resource
gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.
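As a minimal backup sketch (the ClientConfig resource is typically named default in the kube-public namespace; adjust the name, namespace, and kubeconfig path to match your cluster):
# Back up the ClientConfig custom resource before and after manual edits.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-public get clientconfig default -o yaml > clientconfig-backup.yaml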
gkectl check-config validation fails: can't find F5 BIG-IP partitions
- Symptoms
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
- Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
- Resolution
Try running gkectl check-config again.
Disruption for workloads with PodDisruptionBudgets
Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).
Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that cannot allow any additional disruptions, nodes might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration (a sketch follows the command below).
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
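As a sketch of the scale-up workaround (the namespace, Deployment name, and replica count here are only placeholders), you might temporarily add a replica so that the PodDisruptionBudget still allows one disruption while the node drains:
# Hypothetical example: scale a one-replica Deployment up to two so that
# evicting one Pod no longer violates its PodDisruptionBudget.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n NAMESPACE scale deployment DEPLOYMENT_NAME --replicas=2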
User cluster installation failed because of cert-manager/ca-injector's leader election issue
You might see an installation failure in which cert-manager-cainjector is in a crashloop because the apiserver or etcd is slow:
# These are logs from `cert-manager-cainjector`, from the command
# `kubectl --kubeconfig USER_CLUSTER_KUBECONFIG logs -n kube-system cert-manager-cainjector-xxx`
I0923 16:19:27.911174 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
E0923 16:19:27.911110 1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
I0923 16:19:27.911593 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
E0923 16:19:27.911629 1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
Run the following commands to mitigate the problem.
First, scale down the monitoring-operator so that it does not revert the changes to the cert-manager Deployment:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
Second, edit the cert-manager-cainjector Deployment to disable leader election. Because only one replica is running, leader election is not required.
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit -n kube-system deployment cert-manager-cainjector
The relevant YAML snippet for the cert-manager-cainjector Deployment should look like this:
...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager-cainjector
  namespace: kube-system
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: cert-manager
        image: "gcr.io/gke-on-prem-staging/cert-manager-cainjector:v1.0.3-gke.0"
        args:
        ...
        - --leader-elect=false
...
Keep monitoring-operator replicas at 0 as a mitigation until the installation is finished. Otherwise, it will revert the change.
After the installation is finished and the cluster is up and running, turn on the monitoring-operator for day-2 operations:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=1
After each upgrade, the changes will be reverted. Perform the same steps again to mitigate the issue until this is fixed in a future release.
Renewal of certificates might be required before an admin cluster upgrade
Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process
Make sure that OpenSSL is installed on the admin workstation before you begin.
Set the KUBECONFIG variable:
KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.
Get the IP address and SSH keys for the admin master node:
kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
  -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
  ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key
export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
  jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
  --selector='node-role.kubernetes.io/master')
Check if the certificates are expired:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs check-expiration"
If the certificates are expired, you must renew them before upgrading the admin cluster.
Back up old certificates:
This is an optional, but recommended, step.
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout

# on admin workstation
sudo scp -i ~/.ssh/admin-cluster.key \
  ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
Renew the certificates with kubeadm:
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo kubeadm alpha certs renew all
Restart the admin master node:
# on admin master
cd /etc/kubernetes
sudo mkdir tempdir
sudo mv manifests/*.yaml tempdir/
sleep 5
echo "remove pods"

# ensure kubelet detects the change and removes those pods
# wait until the result of this command is empty
sudo docker ps | grep kube-apiserver

# ensure kubelet starts those pods again
echo "start pods again"
sudo mv tempdir/*.yaml manifests/
sleep 30

# ensure kubelet has started those pods again
# should show some results
sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd

# clean up
sudo rm -rf tempdir

logout
Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.
Back up the admin cluster kubeconfig file:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi "${KUBECONFIG}"
Replace client-certificate-data and client-key-data in the kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.
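If you prefer to pull the two fields out of new_admin.conf before editing (a small convenience, not part of the documented steps), you can print them with grep:
# Print the renewed credential fields so they can be pasted into ${KUBECONFIG}.
grep 'client-certificate-data' new_admin.conf
grep 'client-key-data' new_admin.conf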
Renew the certificates of admin cluster worker nodes
Check node certificates expiration date
kubectl get nodes -o wide

# find the oldest node, fill NODE_IP with the internal IP of that node
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${NODE_IP}"

openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem
logout
If the certificate is about to expire, renew node certificates by manual node repair.
You must validate the renewed certificates, and validate the certificate of kube-apiserver.
Check certificates expiration:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
  "sudo kubeadm alpha certs check-expiration"

Check the certificate of kube-apiserver:
# Get the IP address of kube-apiserver
cat $KUBECONFIG | grep server

# Get the current kube-apiserver certificate
# Replace APISERVER_IP:APISERVER_PORT with the server address found above
openssl s_client -showcerts -connect APISERVER_IP:APISERVER_PORT \
  | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
  > current-kube-apiserver.crt

# check expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
/etc/cron.daily/aide script uses up all space in /run, causing a crashloop in Pods
Starting from Anthos clusters on VMware 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled to satisfy the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked".
The script uses /run/aide as a temporary directory to save its cron logs, and over time it could use up all the space in /run. See /etc/cron.daily/aide script uses all space in /run for a workaround.
If you see one or more Pods crashlooping on a node, run df -h /run on the node. If the command output shows 100% space usage, then you are likely experiencing this issue.
This issue is fixed in version 1.8.1. For the 1.7.2 and 1.8.0 versions, you can resolve this issue manually with either of the following two workarounds:
- Periodically remove the log files at /run/aide/cron.daily.old* (recommended); see the cleanup sketch after this list.
- Follow the steps mentioned in /etc/cron.daily/aide script uses all space in /run. (Note: this workaround could potentially affect the node compliance state.)
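A minimal cleanup sketch for the first workaround, run on the affected node (assumes the default log location described in this issue):
# Remove the accumulated aide cron logs that fill up /run.
sudo rm -f /run/aide/cron.daily.old*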
Upgrading Seesaw load balancer with version 1.8.0
If you use the gkectl upgrade loadbalancer command to attempt to update some parameters of the Seesaw load balancer in version 1.8.0, this does not work in either DHCP or IPAM mode. If your setup includes this configuration, do not upgrade to version 1.8.0; instead, upgrade to version 1.8.1 or later.
Cannot log in to admin workstation due to password expiry issue
You might experience this issue if you are using one of the following versions of Anthos clusters on VMware.
- 1.7.2-gke.2
- 1.7.3-gke.2
- 1.8.0-gke.21
- 1.8.0-gke.24
- 1.8.0-gke.25
- 1.8.1-gke.7
- 1.8.2-gke.8
You might get the following error when you attempt to SSH into your Anthos VMs, including the admin workstation, cluster nodes, and Seesaw nodes:
WARNING: Your password has expired.
This error occurs because the ubuntu user password on the VMs has expired. You must manually reset the user password's expiration time to a large value before logging into the VMs.
Prevention of password expiry error
If you are running the affected versions listed above, and the user password hasn't expired yet, you should extend the expiration time before seeing the SSH error.
Run the following command on each Anthos VM:
sudo chage -M 99999 ubuntu
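To confirm that the new password aging policy took effect (an optional check, not part of the documented steps), you can list the account's aging information:
# Show password aging details for the ubuntu user, including the new maximum age.
sudo chage -l ubuntu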
Mitigation of password expiry error
If the user password has already expired and you can't log in to the VMs to extend the expiration time, perform the following mitigation steps for each component.
Admin workstation
Use a temporary VM to perform the following steps. You can create an admin workstation using the 1.7.1-gke.4 version to use as the temporary VM.
Ensure the temporary VM and the admin workstation are in a power off state.
Attach the boot disk of the admin workstation to the temporary VM. The boot disk is the one with the label "Hard disk 1".
Mount the boot disk inside the temporary VM by running these commands. Substitute your own boot disk identifier for /dev/sdc1.
sudo mkdir -p /mnt/boot-disk
sudo mount /dev/sdc1 /mnt/boot-disk
Set the ubuntu user password expiration to a large value, such as 99999 days.
sudo chroot /mnt/boot-disk chage -M 99999 ubuntu
Shut down the temporary VM.
Power on the admin workstation. You should now be able to SSH as usual.
As cleanup, delete the temporary VM.
Admin cluster control plane VM
Follow the instructions to recreate the admin cluster control plane VM.
Admin cluster addon VMs
Run the following command from the admin workstation to recreate the VM:
kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch machinedeployment gke-admin-node --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"
After you run this command, wait for the admin cluster addon VMs to finish recreation and to be ready before you continue with the next steps.
User cluster control plane VMs
Run the following command from the admin workstation to recreate the VMs:
usermaster=`kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG get machinedeployments -l set=user-master -o name` && kubectl --kubeconfig=ADMIN_CLUSTER_KUBECONFIG patch $usermaster --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"
After you run this command, wait for the user cluster control plane VMs to finish recreation and to be ready before you continue with the next steps.
User cluster worker VMs
Run the following command from the admin workstation to recreate the VMs:
for md in `kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG get machinedeployments -l set=node -o name`; do kubectl patch --kubeconfig=USER_CLUSTER_KUBECONFIG $md --type=json -p="[{'op': 'add', 'path': '/spec/template/spec/metadata/annotations', 'value': {"kubectl.kubernetes.io/restartedAt": "version1"}}]"; done
Seesaw VMs
Run the following commands from the admin workstation to recreate the Seesaw VMs. There will be some downtime. If HA is enabled for the load balancer, the maximum downtime is two seconds.
gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG --admin-cluster --no-diff
gkectl upgrade loadbalancer --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG --no-diff
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, and this results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired to create new ones.
Related govmomi bug: https://github.com/vmware/govmomi/issues/2552
This workaround is provided by VMware support:
1. The issue is fixed in vCenter versions 7.0U2 and above.
2. For lower versions: Right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.
gkectl create-config admin and gkectl create-config cluster panic
In versions 1.8.0-1.8.3, the gkectl create-config admin/cluster command panics with the message panic: invalid version: "latest".
As a workaround, use gkectl create-config admin/cluster --gke-on-prem-version=DESIRED_CLUSTER_VERSION. Replace DESIRED_CLUSTER_VERSION with the desired version, such as 1.8.2-gke.8.
Creating/upgrading admin cluster timeout
This issue affects 1.8.0-1.8.3.
Your admin cluster creation or admin cluster upgrade might time out with the following error:
Error getting kubeconfig: error running remote command 'sudo cat /etc/kubernetes/admin.conf': error: Process exited with status 1, stderr: 'cat: /etc/kubernetes/admin.conf: No such file or directory
In addition, the log at nodes/ADMIN_MASTER_NODE/files/var/log/startup.log
in the external cluster snapshot ends with this message:
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
This error happens when the network is slow between the admin control-plane VM and the container registry. Make sure to inspect your network or proxy setup to reduce the latency and increase the bandwidth.
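To get a rough sense of the pull latency, you can time a test image pull from the admin control-plane VM (a quick check, not part of the documented fix; the pause image is only an example of a small public image):
# On the admin control-plane VM, time a pull of a small public image as a rough
# check of network bandwidth and latency toward the container registry.
time sudo docker pull gcr.io/google-containers/pause:3.2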
SSH connection closed by remote host
For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config
has the following settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:
Connection to [IP] closed by remote host. Connection to [IP] closed.
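To confirm which values are currently in effect on a node (an optional check), you can dump the effective sshd configuration:
# Print the effective ClientAlive settings from the running sshd configuration.
sudo sshd -T | grep -i clientalive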
As a workaround, you can either:
Use nohup to prevent your command from being terminated on SSH disconnection:
nohup gkectl upgrade admin --config admin-cluster.yaml --kubeconfig kubeconfig
Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value of less than 3.
sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' /etc/ssh/sshd_config
sudo systemctl restart sshd
Make sure you reconnect your ssh session.
Conflict with cert-manager when upgrading to version 1.8.2 or above
If you have your own cert-manager installation with Anthos clusters on VMware, you might experience a failure when you attempt to upgrade to versions 1.8.2 or above. This is a result of a conflict between your version of cert-manager, which is likely installed in the cert-manager namespace, and the monitoring-operator version.
If you try to install another copy of cert-manager after upgrading to Anthos clusters on VMware version 1.8.2 or above, the installation might fail due to a conflict with the existing one managed by monitoring-operator.
The metrics-ca cluster issuer, which control-plane and observability components rely on for creation and rotation of cert secrets, requires a metrics-ca cert secret to be stored in the cluster resource namespace. This namespace is kube-system for the monitoring-operator installation, and likely to be cert-manager for your installation.
If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.8.2 or later:
Uninstall your version of cert-manager.
Perform the upgrade.
If you want to restore your own installation of cert-manager, follow these steps:
Scale the monitoring-operator deployment to 0. For the admin cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
For a user cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0
Scale the cert-manager deployments managed by monitoring-operator to 0. For the admin cluster, run these commands:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
For a user cluster, run these commands:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
Reinstall your cert-manager.
Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. This namespace might be cert-manager if you are using the upstream cert-manager release, such as v1.0.3, but that depends on your installation. A sketch of this copy step follows the list.
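A minimal sketch of that copy step (this assumes your cert-manager lives in the cert-manager namespace; adjust names, namespaces, and the kubeconfig to match your installation):
# Export the resources that monitoring-operator created in kube-system.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get certificate metrics-ca -o yaml > metrics-ca.yaml
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get issuer metrics-pki.cluster.local -o yaml > metrics-pki.yaml

# Edit both files: change metadata.namespace to cert-manager and remove
# server-generated fields (uid, resourceVersion, creationTimestamp, status).

# Re-create the resources in the namespace used by your cert-manager.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f metrics-ca.yaml -f metrics-pki.yaml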
False positives in docker, containerd, and runc vulnerability scanning
The docker, containerd, and runc packages in the Ubuntu OS images shipped with Anthos clusters on VMware are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes will be qualified by Anthos clusters on VMware before each release.
However, the special versions are unknown to the Ubuntu CVE Tracker, which is used as the vulnerability feed by various CVE scanning tools. Therefore, you will see false positives in docker, containerd, and runc vulnerability scanning results.
For example, you might see false positives in your CVE scanning results for CVEs that are already fixed in the latest patch versions of Anthos clusters on VMware.
Refer to the release notes for any CVE fixes.
Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.
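If you want to compare a scan report against the versions actually installed on a node (an optional check), you can list the pinned container runtime packages:
# On a cluster node, list the installed docker, containerd, and runc package versions.
dpkg -l | grep -Ei 'docker|containerd|runc'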
/etc/cron.daily/aide CPU and memory spike issue
Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide has been installed so that an aide check is scheduled to ensure that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
The cron job runs daily at 6:00 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.
If the spikes are affecting your workload, you can disable the daily cron job:
`sudo chmod -x /etc/cron.daily/aide`.
Cisco ACI doesn't work with Direct Server Return (DSR)
Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning. A possible workaround is to disable IP learning by adding the Seesaw IP address as a L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).
You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.
A service account bearer token that is too long can break Seesaw load balancer logs
If your logging-monitoring service account bearer token is larger than 512 KB, it can break the Seesaw load balancer logs. To fix this issue, upgrade to version 1.9 or later.
Connectivity issues between Pods due to anetd daemons in software deadlock
Clusters with enableDataplaneV2 set to true can experience connectivity issues between Pods due to anetd daemons (running as a DaemonSet) entering a software deadlock. While in this state, anetd daemons see stale nodes (previously deleted nodes) as peers and miss newly added nodes as new peers.
If you have experienced this issue, complete the following steps to restart the anetd daemons and refresh the peer nodes; connectivity should then be restored.
Find all anetd daemons in the cluster:
kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system get pods -o wide | grep anetd
Check whether anetd daemons currently see stale peers:
kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system exec -it ANETD_XYZ -- cilium-health status
Replace ANETD_XYZ with the name of an anetd Pod.
Restart all affected Pods:
kubectl --kubeconfig=USER_CLUSTER_KUBECONFIG -n kube-system delete pod ANETD_XYZ
gkectl diagnose checking certificates failure
If your workstation does not have access to user cluster worker nodes, you will see the following failures when running gkectl diagnose. It is safe to ignore them.
Checking user cluster certificates...FAILURE
Reason: 3 user cluster certificates error(s).
Unhealthy Resources:
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out