Storage |
1.14, 1.15, 1.16 |
Data corruption on NFSv3 when parallel appends to a shared file are
done from multiple hosts
If you use Nutanix storage arrays to provide NFSv3 shares to your
hosts, you might experience data corruption or Pods failing to run
successfully. This issue is caused by a known compatibility issue
between certain versions of VMware and Nutanix. For more
information, see the associated
VMware KB
article.
Workaround:
The VMware KB article is out of date in noting that there is no
current resolution. To resolve this issue, update to the latest version
of ESXi on your hosts and to the latest Nutanix version on your storage
arrays.
|
Operating system |
1.13.10, 1.14.6, 1.15.3 |
Version mismatch between the kubelet and the Kubernetes control plane
For certain Anthos clusters on VMware releases, the kubelet running on the
nodes uses a different version than the Kubernetes control plane. There is a
mismatch because the kubelet binary preloaded on the OS image is using a
different version.
The following table lists the identified version mismatches:
Anthos version |
kubelet version |
Kubernetes version |
1.13.10 |
v1.24.11-gke.1200 |
v1.24.14-gke.2100 |
1.14.6 |
v1.25.8-gke.1500 |
v1.25.10-gke.1200 |
1.15.3 |
v1.26.2-gke.1001 |
v1.26.5-gke.2100 |
Workaround:
No action is needed. The inconsistency is only between Kubernetes patch
versions and no problems have been caused by this version skew.
|
Upgrades and updates |
1.15.0-1.15.4 |
Upgrading or updating an admin cluster with a CA version greater than 1 fails
When an admin cluster has a certificate authority (CA) version greater
than 1, an update or upgrade fails due to the CA version validation in the
webhook. The output of
gkectl upgrade/update contains the following error message:
CAVersion must start from 1
Workaround:
-
Scale down the
auto-resize-controller deployment in the
admin cluster to disable node auto-resizing. This is necessary
because a new field introduced to the admin cluster Custom Resource in
1.15 can cause a nil pointer error in the auto-resize-controller .
kubectl scale deployment auto-resize-controller -n kube-system --replicas=0 --kubeconfig KUBECONFIG
-
Run
gkectl commands with the --disable-admin-cluster-webhook flag. For example:
gkectl upgrade admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG --disable-admin-cluster-webhook
|
Operation |
1.13, 1.14.0-1.14.8, 1.15.0-1.15.4, 1.16.0-1.16.1 |
Non-HA Controlplane V2 cluster deletion stuck until timeout
When a non-HA Controlplane V2 cluster is deleted, it is stuck at node
deletion until it times out.
Workaround:
If the cluster contains a StatefulSet with critical data, contact
Cloud Customer Care to resolve this issue.
Otherwise, do the following steps:
|
Storage |
1.15.0+, 1.16.0+ |
Constant CNS attachvolume tasks appear every minute for in-tree PVC/PV after upgrading to Anthos 1.15+
When a cluster contains in-tree vSphere persistent volumes (for example, PVCs created with the standard StorageClass), you will observe com.vmware.cns.tasks.attachvolume tasks triggered every minute from vCenter.
Workaround:
Edit the vSphere CSI feature configMap and set list-volumes to false:
kubectl edit configmap internal-feature-states.csi.vsphere.vmware.com -n kube-system --kubeconfig KUBECONFIG
Restart the vSphere CSI controller pods:
kubectl rollout restart vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
|
Storage |
1.16.0 |
False warnings against PVCs
When a cluster contains in-tree vSphere persistent volumes, the commands
gkectl diagnose and gkectl upgrade might raise
false warnings against their persistent volume claims (PVCs) when
validating the cluster storage settings. The warning message looks like
the following:
CSIPrerequisites pvc/pvc-name: PersistentVolumeClaim pvc-name bounds to an in-tree vSphere volume created before CSI migration enabled, but it doesn't have the annotation pv.kubernetes.io/migrated-to set to csi.vsphere.vmware.com after CSI migration is enabled
Workaround:
Run the following command to check the annotations of a PVC with the
above warning:
kubectl get pvc PVC_NAME -n PVC_NAMESPACE -oyaml --kubeconfig KUBECONFIG
If the annotations field in the
output contains the following, you can safely ignore the warning:
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
|
Upgrades and updates |
1.15.0+, 1.16.0+ |
Service account key rotation fails when multiple keys are expired
If your cluster is not using a private registry, and your component
access service account key and Logging-monitoring (or Connect-register)
service account keys are expired, when you
rotate the
service account keys, gkectl update credentials
fails with an error similar to the following:
Error: reconciliation failed: failed to update platform: ...
Workaround:
First, rotate the component access service account key. Although the
same error message is displayed, you should be able to rotate the other
keys after the component access service account key rotation.
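For reference, a hedged sketch of the component access key rotation step for an admin cluster, assuming the same flag form as the register subcommand shown later on this page (run it after updating the key file referenced by your credential configuration):
gkectl update credentials componentaccess --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE --admin-cluster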
If the update is still not successful, contact Cloud Customer Care
to resolve this issue.
|
Upgrades and updates |
1.16.0 |
Control plane node fails to be created
During an upgrade or update of an admin cluster, a race condition might
cause the vSphere cloud controller manager to unexpectedly delete a new
control plane node. This causes the clusterapi-controller to be stuck
waiting for the node to be created, and eventually the upgrade/update
times out. In this case, the output of the gkectl
upgrade/update command is similar to the following:
controlplane 'default/gke-admin-hfzdg' is not ready: condition "Ready": condition is not ready with reason "MachineInitializing", message "Wait for the control plane machine "gke-admin-hfzdg-6598459f9zb647c8-0\" to be rebooted"...
To identify the symptom, run the following commands to get the logs of the vSphere cloud controller manager in the admin cluster:
kubectl get pods --kubeconfig ADMIN_KUBECONFIG -n kube-system | grep vsphere-cloud-controller-manager
kubectl logs -f vsphere-cloud-controller-manager-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n kube-system
Here is a sample error message from the above command:
node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.
Workaround:
-
Reboot the failed machine to recreate the deleted node object.
-
SSH into each control plane node and restart the vSphere cloud controller manager static pod:
sudo crictl ps | grep vsphere-cloud-controller-manager | awk '{print $1}'
sudo crictl stop PREVIOUS_COMMAND_OUTPUT
-
Rerun upgrade/update command.
|
Storage |
1.11+, 1.12+, 1.13+, 1.14+, 1.15+, 1.16 |
PVC creation failure after node is recreated with the same name
After a node is deleted and then recreated with the same node name,
there is a slight chance that a subsequent PersistentVolumeClaim (PVC)
creation fails with an error like the following:
The object 'vim.VirtualMachine:vm-988369' has already been deleted or has not been completely created
This is caused by a race condition where the vSphere CSI controller does not delete a removed machine from its cache.
Workaround:
Restart the vSphere CSI controller pods:
kubectl rollout restart vsphere-csi-controller -n kube-system --kubeconfig KUBECONFIG
|
Operation |
1.16.0 |
gkectl repair admin-master returns kubeconfig unmarshall error
When you run the gkectl repair admin-master command on an HA
admin cluster, gkectl returns the following error message:
Exit with error: Failed to repair: failed to select the template: failed to get cluster name from kubeconfig, please contact Google support. failed to decode kubeconfig data: yaml: unmarshal errors:
line 3: cannot unmarshal !!seq into map[string]*api.Cluster
line 8: cannot unmarshal !!seq into map[string]*api.Context
Workaround:
Add the --admin-master-vm-template= flag to the command and
provide the VM template of the machine to repair:
gkectl repair admin-master --kubeconfig=ADMIN_CLUSTER_KUBECONFIG \
--config ADMIN_CLUSTER_CONFIG_FILE \
--admin-master-vm-template=/DATA_CENTER/vm/VM_TEMPLATE_NAME
To find the VM template of the machine:
- Go to the Hosts and Clusters page in the vSphere client.
- Click VM Templates and filter by the admin cluster name.
You should see the three VM templates for the admin cluster.
Copy the name of the VM template that matches the name of the machine
you're repairing and use the template name in the repair command.
gkectl repair admin-master \
--config=/home/ubuntu/admin-cluster.yaml \
--kubeconfig=/home/ubuntu/kubeconfig \
--admin-master-vm-template=/atl-qual-vc07/vm/gke-admin-98g94-zx...7vx-0-tmpl
|
Networking |
1.10.0+, 1.11.0+, 1.12.0+, 1.13.0+, 1.14.0-1.14.7, 1.15.0-1.15.3, 1.16.0 |
Seesaw VM broken due to low disk space
If you use Seesaw as the load balancer type for your cluster and you see that
a Seesaw VM is down or keeps failing to boot, you might see the following error
message in the vSphere console:
GRUB_FORCE_PARTUUID set, initrdless boot failed. Attempting with initrd
This error indicates that the disk space is low on the VM because the fluent-bit
running on the Seesaw VM is not configured with correct log rotation.
Workaround:
Locate the log files that consume most of the disk space using du -sh -- /var/lib/docker/containers/* | sort -rh . Clean up the log file with the largest size and reboot the VM.
Note: If the VM is completely inaccessible, attach the disk to a working VM (for example, the admin workstation), remove the file from the attached disk, then reattach the disk to the original Seesaw VM.
To prevent the issue from happening again, connect to the VM and modify the /etc/systemd/system/docker.fluent-bit.service file. Add --log-opt max-size=10m --log-opt max-file=5 to the Docker command, then run systemctl restart docker.fluent-bit.service.
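For illustration only, a hypothetical sketch of that edit; the actual docker run arguments and image in your unit file will differ, so keep them as-is and only add the two --log-opt flags:
# Hypothetical excerpt of /etc/systemd/system/docker.fluent-bit.service
ExecStart=/usr/bin/docker run --log-opt max-size=10m --log-opt max-file=5 EXISTING_ARGUMENTS_AND_IMAGE
# Reload unit definitions after editing the file, then restart the service
systemctl daemon-reload
systemctl restart docker.fluent-bit.service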
|
Operation |
1.13, 1.14.0-1.14.6, 1.15 |
Admin SSH public key error after admin cluster upgrade or update
When you try to upgrade (gkectl upgrade admin ) or update
(gkectl update admin ) a non-High-Availability admin cluster
with checkpoint enabled, the upgrade or update may fail with errors like the
following:
Checking admin cluster certificates...FAILURE
Reason: 20 admin cluster certificates error(s).
Unhealthy Resources:
AdminMaster clusterCA bundle: failed to get clusterCA bundle on admin master, command [ssh -o IdentitiesOnly=yes -i admin-ssh-key -o StrictHostKeyChecking=no -o ConnectTimeout=30 ubuntu@AdminMasterIP -- sudo cat /etc/kubernetes/pki/ca-bundle.crt] failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
ubuntu@AdminMasterIP: Permission denied (publickey).
failed to ssh AdminMasterIP, failed with error: exit status 255, stderr: Authorized uses only. All activity may be monitored and reported.
ubuntu@AdminMasterIP: Permission denied (publickey)
error dialing ubuntu@AdminMasterIP: failed to establish an authenticated SSH connection: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey]...
Workaround:
If you're unable to upgrade to a patch version of Anthos clusters on VMware with the fix,
contact Google Support for assistance.
|
Upgrades |
1.13.0-1.13.9, 1.14.0-1.14.6, 1.15.1-1.15.2 |
Upgrading an admin cluster enrolled in the Anthos On-Prem API could fail
When an admin cluster is enrolled in the Anthos On-Prem API, upgrading the
admin cluster to the affected versions could fail because the fleet membership
couldn't be updated. When this failure happens, you see the
following error when trying to upgrade the cluster:
failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.on_prem_cluster.resource_link: field cannot be updated
An admin cluster is enrolled in the API when you
explicitly enroll the
cluster, or when you upgrade
a user cluster using an Anthos On-Prem API client.
Workaround:
Unenroll the admin cluster:
gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
and resume
upgrading the admin cluster. You might see the stale `failed to
register cluster` error temporarily. After a while, it should be updated
automatically.
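For example, a sketch of resuming the upgrade, using the command form shown earlier on this page:
gkectl upgrade admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG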
|
Upgrades and updates |
1.13.0-1.13.9, 1.14.0-1.14.4, 1.15.0 |
Enrolled admin cluster's resource link annotation is not preserved
When an admin cluster is enrolled in the Anthos On-Prem API, its resource
link annotation is applied to the OnPremAdminCluster custom
resource, which is not preserved during later admin cluster updates due to
the wrong annotation key being used. This can cause the admin cluster to be
enrolled in the Anthos On-Prem API again by mistake.
An admin cluster is enrolled in the API when you
explicitly enroll the
cluster, or when you upgrade
a user cluster using an Anthos On-Prem API client.
Workaround:
Unenroll the admin cluster:
gcloud alpha container vmware admin-clusters unenroll ADMIN_CLUSTER_NAME --project CLUSTER_PROJECT --location=CLUSTER_LOCATION --allow-missing
and re-enroll
the admin cluster again.
|
Networking |
1.15.0-1.15.2 |
CoreDNS orderPolicy not recognized
OrderPolicy doesn't get recognized as a parameter and
isn't used. Instead, Anthos clusters on VMware always uses Random .
This issue occurs because the CoreDNS template was not updated, which
causes orderPolicy to be ignored.
Workaround:
Update the CoreDNS template and apply the fix. This fix persists until
an upgrade.
- Edit the existing template:
kubectl edit cm -n kube-system coredns-template
Replace the contents of the template with the following:
coredns-template: |-
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
{{- if .PrivateGoogleAccess }}
import zones/private.Corefile
{{- end }}
{{- if .RestrictedGoogleAccess }}
import zones/restricted.Corefile
{{- end }}
prometheus :9153
forward . {{ .UpstreamNameservers }} {
max_concurrent 1000
{{- if ne .OrderPolicy "" }}
policy {{ .OrderPolicy }}
{{- end }}
}
cache 30
{{- if .DefaultDomainQueryLogging }}
log
{{- end }}
loop
reload
loadbalance
}{{ range $i, $stubdomain := .StubDomains }}
{{ $stubdomain.Domain }}:53 {
errors
{{- if $stubdomain.QueryLogging }}
log
{{- end }}
cache 30
forward . {{ $stubdomain.Nameservers }} {
max_concurrent 1000
{{- if ne $.OrderPolicy "" }}
policy {{ $.OrderPolicy }}
{{- end }}
}
}
{{- end }}
|
Upgrades and updates |
1.10, 1.11, 1.12, 1.13.0-1.13.7, 1.14.0-1.14.3 |
OnPremAdminCluster status inconsistent between checkpoint and actual CR
Certain race conditions could cause the OnPremAdminCluster status to be inconsistent between the checkpoint and the actual CR. When the issue happens, you could encounter the following error when you update the admin cluster after upgrading it:
Exit with error:
E0321 10:20:53.515562 961695 console.go:93] Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated
Failed to update the admin cluster: OnPremAdminCluster "gke-admin-rj8jr" is in the middle of a create/upgrade ("" -> "1.15.0-gke.123"), which must be completed before it can be updated
To work around this issue, you need to either edit the checkpoint or disable the checkpoint for the upgrade or update. Reach out to our support team to proceed with the workaround.
|
Operation |
1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1 |
Reconciliation process changes admin certificates on admin clusters
Anthos clusters on VMware changes the admin certificates on admin cluster control planes
with every reconciliation process, such as during a cluster upgrade. This behavior
increases the possibility of getting invalid certificates for your admin cluster,
especially for version 1.15 clusters.
If you're affected by this issue, you may encounter problems like the
following:
- Invalid certificates may cause the following commands to time out and
return errors:
gkectl create admin
gkectl upgrade admin
gkectl update admin
These commands may return authorization errors like the following:
Failed to reconcile admin cluster: unable to populate admin clients: failed to get admin controller runtime client: Unauthorized
- The
kube-apiserver logs for your admin cluster may contain errors
like the following:
Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid...
Workaround:
Upgrade to a version of Anthos clusters on VMware with the fix:
1.13.10+, 1.14.6+, 1.15.2+.
If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.
|
Networking, Operation |
1.10, 1.11, 1.12, 1.13, 1.14 |
Anthos Network Gateway components evicted or pending due to missing
priority class
Network gateway Pods in kube-system might show a status of
Pending or Evicted , as shown in the following
condensed example output:
$ kubectl -n kube-system get pods | grep ang-node
ang-node-bjkkc 2/2 Running 0 5d2h
ang-node-mw8cq 0/2 Evicted 0 6m5s
ang-node-zsmq7 0/2 Pending 0 7h
These errors indicate eviction events or an inability to schedule Pods
due to node resources. As Anthos Network Gateway Pods have no
PriorityClass, they have the same default priority as other workloads.
When nodes are resource-constrained, the network gateway Pods might be
evicted. This behavior is particularly bad for the ang-node
DaemonSet, as those Pods must be scheduled on a specific node and can't
migrate.
Workaround:
Upgrade to 1.15 or later.
As a short-term fix, you can manually assign a
PriorityClass
to the Anthos Network Gateway components. The Anthos clusters on VMware controller
overwrites these manual changes during a reconciliation process, such as
during a cluster upgrade.
- Assign the
system-cluster-critical PriorityClass to the
ang-controller-manager and autoscaler cluster
controller Deployments.
- Assign the
system-node-critical PriorityClass to the
ang-daemon node DaemonSet.
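A minimal sketch of assigning these PriorityClasses with kubectl patch, assuming the resource names above (the autoscaler Deployment name is a placeholder); as noted, the controller overwrites these changes during reconciliation:
kubectl -n kube-system patch deployment ang-controller-manager --kubeconfig KUBECONFIG -p '{"spec":{"template":{"spec":{"priorityClassName":"system-cluster-critical"}}}}'
kubectl -n kube-system patch deployment AUTOSCALER_DEPLOYMENT_NAME --kubeconfig KUBECONFIG -p '{"spec":{"template":{"spec":{"priorityClassName":"system-cluster-critical"}}}}'
kubectl -n kube-system patch daemonset ang-daemon --kubeconfig KUBECONFIG -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'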
|
Upgrades and updates |
1.12, 1.13, 1.14, 1.15.0-1.15.2 |
admin cluster upgrade fails after registering the cluster with gcloud
After you use gcloud to register an admin cluster with a non-empty
gkeConnect section, you might see the following error when trying to upgrade the cluster:
failed to register cluster: failed to apply Hub Mem\
bership: Membership API request failed: rpc error: code = InvalidArgument desc = InvalidFieldError for field endpoint.o\
n_prem_cluster.admin_cluster: field cannot be updated
Workaround:
Delete the gke-connect namespace:
kubectl delete ns gke-connect --kubeconfig=ADMIN_KUBECONFIG
Get the admin cluster name:
kubectl get onpremadmincluster -n kube-system --kubeconfig=ADMIN_KUBECONFIG
Delete the fleet membership:
gcloud container fleet memberships delete ADMIN_CLUSTER_NAME
and resume upgrading the admin cluster.
|
Operation |
1.13.0-1.13.8, 1.14.0-1.14.5, 1.15.0-1.15.1 |
gkectl diagnose snapshot --log-since fails to limit the time window for
journalctl commands running on the cluster nodes
This does not affect the functionality of taking a snapshot of the
cluster, as the snapshot still includes all logs that are collected by
default by running journalctl on the cluster nodes. Therefore,
no debugging information is missed.
|
Installation, Upgrades and Updates |
1.9+, 1.10+, 1.11+, 1.12+ |
gkectl prepare windows fails
gkectl prepare windows fails to install Docker on
Anthos clusters on VMware versions earlier than 1.13 because
MicrosoftDockerProvider
is deprecated.
Workaround:
The general idea to work around this issue is to upgrade to Anthos clusters on VMware 1.13,
use the 1.13 gkectl to create a Windows VM template, and then create
Windows node pools. There are two options to get to Anthos clusters on VMware 1.13 from your
current version, as shown below.
Note: There are options to work around this issue in your current version
without needing to upgrade all the way to 1.13, but they require more manual
steps. Reach out to our support team if you would like to consider
this option.
Option 1: Blue/Green upgrade
You can create a new cluster using an Anthos clusters on VMware 1.13+ version with Windows node pools,
migrate your workloads to the new cluster, and then tear down the current
cluster. It's recommended to use the latest Anthos minor version.
Note: This requires extra resources to provision the new cluster, but causes
less downtime and disruption for existing workloads.
Option 2: Delete Windows node pools and add them back when
upgrading to Anthos clusters on VMware 1.13
Note: For this option, the Windows workloads will not be able to run until
the cluster is upgraded to 1.13 and Windows node pools are added back.
- Delete existing Windows node pools by removing the Windows node pool
config from the user-cluster.yaml file, then run the command:
gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
- Upgrade the Linux-only admin+user clusters to 1.12 following the
upgrade user guide for the corresponding target minor version.
- (Make sure to perform this step before upgrading to 1.13.) Ensure that
enableWindowsDataplaneV2: true is configured in the OnPremUserCluster CR; otherwise the cluster will keep using Docker for Windows node pools, which is not compatible with the newly created 1.13 Windows VM template that does not have Docker installed. If it is not configured or is set to false, update your cluster to set it to true in user-cluster.yaml, then run:
gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
- Upgrade the Linux-only admin+user clusters to 1.13 following the
upgrade user guide.
- Prepare Windows VM template using 1.13 gkectl:
gkectl prepare windows --base-vm-template BASE_WINDOWS_VM_TEMPLATE_NAME --bundle-path 1.13_BUNDLE_PATH --kubeconfig=ADMIN_KUBECONFIG
- Add back the Windows node pool configuration to user-cluster.yaml with the
OSImage field set to the newly created Windows VM template.
- Update the cluster to add Windows node pools
gkectl update cluster --kubeconfig=ADMIN_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
|
Installation, Upgrades and Updates |
1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1 |
RootDistanceMaxSec configuration not taking effect for
ubuntu nodes
The default value of 5 seconds for RootDistanceMaxSec is
used on the nodes, instead of the expected configuration of 20 seconds.
If you SSH into the VM and check the node startup log at
`/var/log/startup.log`, you can find the following
error:
+ has_systemd_unit systemd-timesyncd
/opt/bin/master.sh: line 635: has_systemd_unit: command not found
Using a RootDistanceMaxSec of 5 seconds might cause the system
clock to be out of sync with the NTP server when the clock drift is larger than
5 seconds.
Workaround:
SSH into the nodes and configure the RootDistanceMaxSec :
mkdir -p /etc/systemd/timesyncd.conf.d
cat > /etc/systemd/timesyncd.conf.d/90-gke.conf <<EOF
[Time]
RootDistanceMaxSec=20
EOF
systemctl restart systemd-timesyncd
|
Upgrades and updates |
1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2 |
gkectl update admin fails because of empty osImageType field
When you use version 1.13 gkectl to update a version 1.12
admin cluster, you might see the following error:
Failed to update the admin cluster: updating OS image type in admin cluster
is not supported in "1.12.x-gke.x"
When you use gkectl update admin for version 1.13 or 1.14
clusters, you might see the following message in the response:
Exit with error:
Failed to update the cluster: the update contains multiple changes. Please
update only one feature at a time
If you check the gkectl log, you might see that the multiple
changes include setting osImageType from an empty string to
ubuntu_containerd .
These update errors are due to improper backfilling of the
osImageType field in the admin cluster config since it was
introduced in version 1.9.
Workaround:
Upgrade to a version of Anthos clusters on VMware with the fix. If upgrading
isn't feasible for you, contact Cloud Customer Care to resolve this issue.
|
Installation, Security |
1.13, 1.14, 1.15, 1.16 |
SNI doesn't work on user clusters with Controlplane V2
The ability to provide an additional serving certificate for the
Kubernetes API server of a user cluster with
authentication.sni doesn't work when the Controlplane V2 is
enabled (
enableControlplaneV2: true ).
Workaround:
Until an Anthos clusters on VMware patch is available with the fix, if you
need to use SNI, disable Controlplane V2 (enableControlplaneV2: false ).
|
Installation |
1.0-1.11, 1.12, 1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1 |
$ in the private registry username causes admin control plane machine startup failure
The admin control plane machine fails to start up when the private registry username contains $ .
When checking the /var/log/startup.log on the admin control plane machine, you see the
following error:
++ REGISTRY_CA_CERT=xxx
++ REGISTRY_SERVER=xxx
/etc/startup/startup.conf: line 7: anthos: unbound variable
Workaround:
Use a private registry username without $ , or use a version of Anthos clusters on VMware with
the fix.
|
Upgrades and updates |
1.12.0-1.12.4 |
False-positive warnings about unsupported changes during admin cluster update
When you
update admin clusters, you will see the following false-positive warnings in the log, and you can ignore them.
console.go:47] detected unsupported changes: &v1alpha1.OnPremAdminCluster{
...
- CARotation: &v1alpha1.CARotationConfig{Generated: &v1alpha1.CARotationGenerated{CAVersion: 1}},
+ CARotation: nil,
...
}
|
Upgrades and updates |
1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1 |
Update user cluster failed after KSA signing key rotation
After you rotate
KSA signing keys and subsequently
update a user cluster, gkectl update might fail with the
following error message:
Failed to apply OnPremUserCluster 'USER_CLUSTER_NAME-gke-onprem-mgmt/USER_CLUSTER_NAME':
admission webhook "vonpremusercluster.onprem.cluster.gke.io" denied the request:
requests must not decrement *v1alpha1.KSASigningKeyRotationConfig Version, old version: 2, new version: 1"
Workaround:
Change your KSA signing key version back to 1, but retain the latest key data:
- Check the secret in the admin cluster under the
USER_CLUSTER_NAME namespace, and get the name of the ksa-signing-key secret:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secrets | grep ksa-signing-key
- Copy the ksa-signing-key secret, and name the copied secret as service-account-cert:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME get secret KSA-KEY-SECRET-NAME -oyaml | \
sed 's/ name: .*/ name: service-account-cert/' | \
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME apply -f -
- Delete the previous ksa-signing-key secret:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME delete secret KSA-KEY-SECRET-NAME
- Update the
data.data field in ksa-signing-key-rotation-stage configmap to '{"tokenVersion":1,"privateKeyVersion":1,"publicKeyVersions":[1]}' :
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME \
edit configmap ksa-signing-key-rotation-stage
- Disable the validation webhook to edit the version information in the OnPremUserCluster custom resource:
kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p '
webhooks:
- name: vonpremnodepool.onprem.cluster.gke.io
rules:
- apiGroups:
- onprem.cluster.gke.io
apiVersions:
- v1alpha1
operations:
- CREATE
resources:
- onpremnodepools
- name: vonpremusercluster.onprem.cluster.gke.io
rules:
- apiGroups:
- onprem.cluster.gke.io
apiVersions:
- v1alpha1
operations:
- CREATE
resources:
- onpremuserclusters
'
- Update the
spec.ksaSigningKeyRotation.generated.ksaSigningKeyRotation field to 1 in your OnPremUserCluster custom resource:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
edit onpremusercluster USER_CLUSTER_NAME
- Wait for the target user cluster to be ready. You can check the status with:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=USER_CLUSTER_NAME-gke-onprem-mgmt \
get onpremusercluster
- Restore the validation webhook for the user cluster:
kubectl --kubeconfig=ADMIN_KUBECONFIG patch validatingwebhookconfiguration onprem-user-cluster-controller -p '
webhooks:
- name: vonpremnodepool.onprem.cluster.gke.io
rules:
- apiGroups:
- onprem.cluster.gke.io
apiVersions:
- v1alpha1
operations:
- CREATE
- UPDATE
resources:
- onpremnodepools
- name: vonpremusercluster.onprem.cluster.gke.io
rules:
- apiGroups:
- onprem.cluster.gke.io
apiVersions:
- v1alpha1
operations:
- CREATE
- UPDATE
resources:
- onpremuserclusters
'
- Avoid another KSA signing key rotation until the cluster is
upgraded to the version with the fix.
|
Operation |
1.13.1+, 1.14, 1.15, 1.16 |
When you use Terraform to delete a user cluster with an F5 BIG-IP load
balancer, the F5 BIG-IP virtual servers aren't removed after the cluster
deletion.
Workaround:
To remove the F5 resources, follow the steps to
clean up a user cluster F5 partition.
|
Installation, Upgrades and Updates |
1.13.8, 1.14.4 |
kind cluster pulls container images from docker.io
If you create a version 1.13.8 or version 1.14.4 admin cluster, or
upgrade an admin cluster to version 1.13.8 or 1.14.4, the kind cluster pulls
the following container images from docker.io :
docker.io/kindest/kindnetd
docker.io/kindest/local-path-provisioner
docker.io/kindest/local-path-helper
If docker.io isn't accessible from your admin workstation,
the admin cluster creation or upgrade fails to bring up the kind cluster.
Running the following command on the admin workstation shows the
corresponding containers pending with ErrImagePull :
docker exec gkectl-control-plane kubectl get pods -A
The response contains entries like the following:
...
kube-system kindnet-xlhmr 0/1
ErrImagePull 0 3m12s
...
local-path-storage local-path-provisioner-86666ffff6-zzqtp 0/1
Pending 0 3m12s
...
These container images should be preloaded in the kind cluster container
image. However, kind v0.18.0 has
an issue with the preloaded container images,
which causes them to be pulled from the internet by mistake.
Workaround:
Run the following commands on the admin workstation, while your admin cluster
is pending on creation or upgrade:
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd:v20230330-48f316cd
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af docker.io/kindest/kindnetd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper:v20230330-48f316cd
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270 docker.io/kindest/local-path-helper@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner:v0.0.23-kind.0
docker exec gkectl-control-plane ctr -n k8s.io images tag docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501 docker.io/kindest/local-path-provisioner@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501
|
Operation |
1.13.0-1.13.7, 1.14.0-1.14.4, 1.15.0 |
Unsuccessful failover on HA Controlplane V2 user cluster and admin cluster when the network filters out duplicate GARP requests
If your cluster VMs are connected with a switch that filters out duplicate GARP (gratuitous ARP) requests, the
keepalived leader election might encounter a race condition, which causes some nodes to have incorrect ARP table entries.
The affected nodes can ping the control plane VIP, but a TCP connection to the control plane VIP
will time out.
Workaround:
Run the following command on each control plane node of the affected cluster:
iptables -I FORWARD -i ens192 --destination CONTROL_PLANE_VIP -j DROP
|
Upgrades and Updates |
1.13.0-1.13.7, 1.14.0-1.14.4, 1.15.0 |
vsphere-csi-controller needs to be restarted after vCenter certificate rotation
vsphere-csi-controller should refresh its vCenter secret after vCenter certificate rotation. However, the current system does not properly restart the pods of vsphere-csi-controller , causing vsphere-csi-controller to crash after the rotation.
Workaround:
For clusters created at 1.13 and later versions, follow the instructions below to restart vsphere-csi-controller :
kubectl --kubeconfig=ADMIN_KUBECONFIG rollout restart deployment vsphere-csi-controller -n kube-system
|
Installation |
1.10.3-1.10.7, 1.11, 1.12, 1.13.0-1.13.1 |
Admin cluster creation does not fail on cluster registration errors
Even when
cluster registration fails during admin cluster creation, the command gkectl create admin does not fail on the error and might succeed. In other words, the admin cluster creation could "succeed" without being registered to a fleet.
To identify the symptom, you can look for the following error message in the log of `gkectl create admin`:
Failed to register admin cluster
You can also check whether the cluster appears among registered clusters in the Google Cloud console.
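For example, to check from the command line whether the cluster appears as a fleet membership (assuming the project that the cluster should be registered to):
gcloud container fleet memberships list --project=CLUSTER_PROJECT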
Workaround:
For clusters created at 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters created at earlier versions,
-
Append a fake key-value pair like "foo: bar" to your connect-register SA key file (see the example after this list)
-
Run
gkectl update admin to re-register the admin cluster.
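A hedged sketch of the first step using jq, assuming the connect-register key file is JSON and named register-key.json (substitute your actual file name):
jq '. + {"foo": "bar"}' register-key.json > register-key.tmp && mv register-key.tmp register-key.json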
|
Upgrades and Updates |
1.10, 1.11, 1.12, 1.13.0-1.13.1 |
Admin cluster re-registration might be skipped during admin cluster upgrade
During admin cluster upgrade, if upgrading user control plane nodes times out, the admin cluster will not be re-registered with the updated connect agent version.
Workaround:
Check whether the cluster shows among registered clusters.
As an optional step, log in to the cluster after setting up authentication. If the cluster is still registered, you might skip the following instructions for re-attempting the registration.
For clusters upgraded to 1.12 and later versions, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters upgraded to earlier versions,
-
Append a fake key-value pair like "foo: bar" to your connect-register SA key file
-
Run
gkectl update admin to re-register the admin cluster.
|
Configuration |
1.15.0 |
False error message about vCenter.dataDisk
For a high-availability admin cluster, gkectl prepare shows
this false error message:
vCenter.dataDisk must be present in the AdminCluster spec
Workaround:
You can safely ignore this error message.
|
VMware |
1.15.0 |
Node pool creation fails because of redundant VM-Host affinity rules
During creation of a node pool that uses
VM-Host affinity,
a race condition might result in multiple
VM-Host affinity rules
being created with the same name. This can cause node pool creation to fail.
Workaround:
Remove the old redundant rules so that node pool creation can proceed.
These rules are named [USER_CLUSTER_NAME]-[HASH].
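If you use the govc CLI against vCenter, a hedged sketch for finding and removing the duplicate rules (the subcommand and flag names are assumptions; you can also remove the rules in the vSphere client):
govc cluster.rule.ls -cluster VSPHERE_CLUSTER_NAME
govc cluster.rule.remove -cluster VSPHERE_CLUSTER_NAME -name USER_CLUSTER_NAME-HASH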
|
Operation |
1.15.0 |
gkectl repair admin-master may fail with "failed
to delete the admin master node object and reboot the admin master VM"
The gkectl repair admin-master command may fail due to a
race condition with the following error.
Failed to repair: failed to delete the admin master node object and reboot the admin master VM
Workaround:
This command is idempotent. You can rerun it safely until the command
succeeds.
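For example, rerun the command using the form shown elsewhere on this page until it succeeds:
gkectl repair admin-master --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE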
|
Upgrades and updates |
1.15.0 |
Pods remain in Failed state after re-creation or update of a
control-plane node
After you re-create or update a control-plane node, certain Pods might
be left in the Failed state due to NodeAffinity predicate
failure. These failed Pods don't affect normal cluster operations or health.
Workaround:
You can safely ignore the failed Pods or manually delete them.
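If you prefer to clean them up, a minimal sketch that lists and then deletes all Pods in the Failed phase (review the list before deleting in your environment):
kubectl get pods --all-namespaces --field-selector=status.phase=Failed --kubeconfig KUBECONFIG
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed --kubeconfig KUBECONFIG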
|
Security, Configuration |
1.15.0-1.15.1 |
OnPremUserCluster not ready because of private registry credentials
If you use
prepared credentials
and a private registry, but you haven't configured prepared credentials for
your private registry, the OnPremUserCluster might not become ready, and
you might see the following error message:
failed to check secret reference for private registry …
Workaround:
Prepare the private registry credentials for the user cluster according
to the instructions in
Configure prepared credentials.
|
Upgrades and updates |
1.15.0 |
During gkectl upgrade admin , the storage preflight check for CSI Migration verifies
that the StorageClasses don't have parameters that are ignored after CSI Migration.
For example, if there's a StorageClass with the parameter diskformat then
gkectl upgrade admin flags the StorageClass and reports a failure in the preflight validation.
Admin clusters created in Anthos 1.10 and earlier have a StorageClass with diskformat: thin ,
which fails this validation; however, this StorageClass still works
fine after CSI migration. These failures should be interpreted as warnings instead.
For more information, check the StorageClass parameter section in Migrating In-Tree vSphere Volumes to vSphere Container Storage Plug-in.
Workaround:
After confirming that your cluster has a StorageClass with parameters ignored after CSI migration,
run gkectl upgrade admin with the flag --skip-validation-cluster-health .
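For example, using the command form shown earlier on this page:
gkectl upgrade admin --config ADMIN_CLUSTER_CONFIG_FILE --kubeconfig KUBECONFIG --skip-validation-cluster-health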
|
Storage |
1.15, 1.16 |
Migrated in-tree vSphere volumes using the Windows file system can't be used with vSphere CSI driver
Under certain conditions disks can be attached as readonly to Windows
nodes. This results in the corresponding volume being readonly inside a Pod.
This problem is more likely to occur when a new set of nodes replaces an old
set of nodes (for example, cluster upgrade or node pool update). Stateful
workloads that previously worked fine might be unable to write to their
volumes on the new set of nodes.
Workaround:
-
Get the UID of the Pod that is unable to write to its volume:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pod \
POD_NAME --namespace POD_NAMESPACE \
-o=jsonpath='{.metadata.uid}{"\n"}'
-
Use the PersistentVolumeClaim to get the name of the PersistentVolume:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pvc \
PVC_NAME --namespace POD_NAMESPACE \
-o jsonpath='{.spec.volumeName}{"\n"}'
-
Determine the name of the node where the Pod is running:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get pods \
--namespace POD_NAMESPACE \
-o jsonpath='{.spec.nodeName}{"\n"}'
-
Obtain PowerShell access to the node, either through SSH or the vSphere
web interface.
-
Set environment variables:
PS C:\Users\administrator> $pvname = "PV_NAME"
PS C:\Users\administrator> $podid = "POD_UID"
- Identify the disk number for the disk associated with the
PersistentVolume:
PS C:\Users\administrator> $disknum = (Get-Partition -Volume (Get-Volume -UniqueId ("\\?\"+(Get-Item (Get-Item
"C:\var\lib\kubelet\pods\$podid\volumes\kubernetes.io~csi\$pvname\mount").Target).Target))).DiskNumber
-
Verify that the disk is
readonly :
PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly
The result should be True .
- Set
readonly to false .
PS C:\Users\administrator> Set-Disk -Number $disknum -IsReadonly $false
PS C:\Users\administrator> (Get-Disk -Number $disknum).IsReadonly
-
Delete the Pod so that it will get restarted:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pod POD_NAME \
--namespace POD_NAMESPACE
-
The Pod should get scheduled to the same node. But in case the Pod gets
scheduled to a new node, you might need to repeat the preceding steps on
the new node.
|
Upgrades and updates |
1.12, 1.13.0-1.13.7, 1.14.0-1.14.4 |
vsphere-csi-secret is not updated after gkectl update credentials vsphere --admin-cluster
If you update the vSphere credentials for an admin cluster by following
updating cluster credentials,
you might find that the vsphere-csi-secret in the kube-system namespace of the admin cluster still uses the old credentials.
Workaround:
- Get the
vsphere-csi-secret secret name:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets | grep vsphere-csi-secret
- Update the data of the
vsphere-csi-secret secret you got from the above step:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system patch secret CSI_SECRET_NAME -p \
"{\"data\":{\"config\":\"$( \
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system get secrets CSI_SECRET_NAME -ojsonpath='{.data.config}' \
| base64 -d \
| sed -e '/user/c user = \"VSPHERE_USERNAME_TO_BE_UPDATED\"' \
| sed -e '/password/c password = \"VSPHERE_PASSWORD_TO_BE_UPDATED\"' \
| base64 -w 0 \
)\"}}"
- Restart
vsphere-csi-controller :
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout restart deployment vsphere-csi-controller
- You can track the rollout status with:
kubectl --kubeconfig=ADMIN_KUBECONFIG -n=kube-system rollout status deployment vsphere-csi-controller
After the deployment is successfully rolled out, the updated vsphere-csi-secret should be used by the controller.
|
Upgrades and updates |
1.10, 1.11, 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2 |
audit-proxy crashloop when enabling Cloud Audit Logs with gkectl update cluster
audit-proxy might crashloop because of an empty --cluster-name value.
This behavior is caused by a bug in the update logic, where the cluster name is not propagated to the
audit-proxy pod / container manifest.
Workaround:
For a control plane v2 user cluster with enableControlplaneV2: true , connect to the user control plane machine using SSH,
and update /etc/kubernetes/manifests/audit-proxy.yaml with --cluster_name=USER_CLUSTER_NAME .
For a control plane v1 user cluster, edit the audit-proxy container in
the kube-apiserver statefulset to add --cluster_name=USER_CLUSTER_NAME :
kubectl edit statefulset kube-apiserver -n USER_CLUSTER_NAME --kubeconfig=ADMIN_CLUSTER_KUBECONFIG
|
Upgrades and updates |
1.11, 1.12, 1.13.0-1.13.5, 1.14.0-1.14.1 |
An additional control plane redeployment right after gkectl upgrade cluster
Right after gkectl upgrade cluster , the control plane pods might be re-deployed.
The cluster state from gkectl list clusters changes from RUNNING to RECONCILING .
Requests to the user cluster might time out.
This behavior occurs because the control plane certificate rotation happens automatically after
gkectl upgrade cluster .
This issue only happens to user clusters that do NOT use control plane v2.
Workaround:
Wait for the cluster state to change back to RUNNING again in gkectl list clusters , or
upgrade to versions with the fix: 1.13.6+, 1.14.2+ or 1.15+.
|
Upgrades and updates |
1.12.7 |
Bad release 1.12.7-gke.19 has been removed
Anthos clusters on VMware 1.12.7-gke.19 is a bad release
and you should not use it. The artifacts have been removed
from the Cloud Storage bucket.
Workaround:
Use the 1.12.7-gke.20 release instead.
|
Upgrades and updates |
1.12.0+, 1.13.0-1.13.7, 1.14.0-1.14.3 |
gke-connect-agent
continues to use the older image after the registry credential is updated
If you update the registry credential using one of the following methods:
gkectl update credentials componentaccess if not using private registry
gkectl update credentials privateregistry if using private registry
you might find that gke-connect-agent continues to use the older
image or that the gke-connect-agent pods cannot be brought up due
to ImagePullBackOff .
This issue will be fixed in Anthos clusters on VMware releases 1.13.8,
1.14.4, and subsequent releases.
Workaround:
Option 1: Redeploy gke-connect-agent manually:
- Delete the
gke-connect namespace:
kubectl --kubeconfig=KUBECONFIG delete namespace gke-connect
- Redeploy
gke-connect-agent with the original register
service account key (no need to update the key):
For admin cluster:
gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE --admin-cluster
For user cluster:
gkectl update credentials register --kubeconfig=ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE
Option 2: You can manually change the data of the image pull secret
regcred which is used by gke-connect-agent
deployment:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch secrets regcred -p "{\"data\":{\".dockerconfigjson\":\"$(kubectl --kubeconfig=KUBECONFIG -n=kube-system get secrets private-registry-creds -ojsonpath='{.data.\.dockerconfigjson}')\"}}"
Option 3: You can add the default image pull secret for your cluster in
the gke-connect-agent deployment by:
- Copy the default secret to
gke-connect namespace:
kubectl --kubeconfig=KUBECONFIG -n=kube-system get secret private-registry-creds -oyaml | sed 's/ namespace: .*/ namespace: gke-connect/' | kubectl --kubeconfig=KUBECONFIG -n=gke-connect apply -f -
- Get the
gke-connect-agent deployment name:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect get deployment | grep gke-connect-agent
- Add the default secret to
gke-connect-agent deployment:
kubectl --kubeconfig=KUBECONFIG -n=gke-connect patch deployment DEPLOYMENT_NAME -p '{"spec":{"template":{"spec":{"imagePullSecrets": [{"name": "private-registry-creds"}, {"name": "regcred"}]}}}}'
|
Installation |
1.13, 1.14 |
Manual LB configuration check failure
When you validate the configuration before creating a cluster with a manual load balancer by running gkectl check-config , the command fails with the following error message:
- Validation Category: Manual LB Running validation check for "Network
configuration"...panic: runtime error: invalid memory address or nil pointer
dereference
Workaround:
Option 1: Use patch version 1.13.7 or 1.14.4, which include the fix.
Option 2: Run the same command to validate the configuration but skip the load balancer validation:
gkectl check-config --skip-validation-load-balancer
|
Operation |
1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, and 1.14 |
etcd watch starvation
Clusters running etcd version 3.4.13 or earlier may experience watch
starvation and non-operational resource watches, which can lead to the
following problems:
- Pod scheduling is disrupted
- Nodes are unable to register
- kubelet doesn't observe pod changes
These problems can make the cluster non-functional.
This issue is fixed in Anthos clusters on VMware releases 1.12.7, 1.13.6,
1.14.3, and subsequent releases. These newer releases use etcd version
3.4.21. All prior versions of Anthos clusters on VMware are affected by
this issue.
Workaround:
If you can't upgrade immediately, you can mitigate the risk of
cluster failure by reducing the number of nodes in your cluster. Remove
nodes until the etcd_network_client_grpc_sent_bytes_total
metric is less than 300 MBps.
To view this metric in Metrics Explorer:
- Go to the Metrics Explorer in the Google Cloud console:
Go to Metrics Explorer
- Select the Configuration tab.
- Expand the Select a metric menu, enter
Kubernetes Container
in the filter bar, and then use the submenus to select the metric:
- In the Active resources menu, select Kubernetes Container.
- In the Active metric categories menu, select Anthos.
- In the Active metrics menu, select
etcd_network_client_grpc_sent_bytes_total .
- Click Apply.
|
Upgrades and updates |
1.10, 1.11, 1.12, 1.13, and 1.14 |
Anthos Identity Service can cause control plane latencies
At cluster restarts or upgrades, Anthos Identity Service can get
overwhelmed with traffic consisting of expired JWT tokens forwarded from
the kube-apiserver to Anthos Identity Service over the
authentication webhook. Although Anthos Identity Service doesn't
crashloop, it becomes unresponsive and ceases to serve further requests.
This problem ultimately leads to higher control plane latencies.
This issue is fixed in the following Anthos clusters on VMware releases:
To determine if you're affected by this issue, perform the following steps:
- Check whether the Anthos Identity Service endpoint can be reached externally:
curl -s -o /dev/null -w "%{http_code}" \
-X POST https://CLUSTER_ENDPOINT/api/v1/namespaces/anthos-identity-service/services/https:ais:https/proxy/authenticate -d '{}'
Replace CLUSTER_ENDPOINT
with the control plane VIP and control plane load balancer port for your
cluster (for example, 172.16.20.50:443 ).
If you're affected by this issue, the command returns a 400
status code. If the request times out, restart the ais Pod and
rerun the curl command to see if that resolves the problem. If
you get a status code of 000 , the problem has been resolved and
you are done. If you still get a 400 status code, the
Anthos Identity Service HTTP server isn't starting. In this case, continue.
- Check the Anthos Identity Service and kube-apiserver logs:
- Check the Anthos Identity Service log:
kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
--kubeconfig KUBECONFIG
If the log contains an entry like the following, then you are affected by this issue:
I0811 22:32:03.583448 32 authentication_plugin.cc:295] Stopping OIDC authentication for ???. Unable to verify the OIDC ID token: JWT verification failed: The JWT does not appear to be from this identity provider. To match this provider, the 'aud' claim must contain one of the following audiences:
- Check the
kube-apiserver logs for your clusters:
In the following commands, KUBE_APISERVER_POD is the name of the kube-apiserver Pod on the given cluster.
Admin cluster:
kubectl --kubeconfig ADMIN_KUBECONFIG logs \
-n kube-system KUBE_APISERVER_POD kube-apiserver
User cluster:
kubectl --kubeconfig ADMIN_KUBECONFIG logs \
-n USER_CLUSTER_NAME KUBE_APISERVER_POD kube-apiserver
If the kube-apiserver logs contain entries like the following,
then you are affected by this issue:
E0811 22:30:22.656085 1 webhook.go:127] Failed to make webhook authenticator request: error trying to reach service: net/http: TLS handshake timeout
E0811 22:30:22.656266 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, error trying to reach service: net/http: TLS handshake timeout]"
Workaround:
If you can't upgrade your clusters immediately to get the fix, you can
identify and restart the offending pods as a workaround:
- Increase the Anthos Identity Service verbosity level to 9:
kubectl patch deployment ais -n anthos-identity-service --type=json \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", \
"value":"--vmodule=cloud/identity/hybrid/charon/*=9"}]' \
--kubeconfig KUBECONFIG
- Check the Anthos Identity Service log for the invalid token context:
kubectl logs -f -l k8s-app=ais -n anthos-identity-service \
--kubeconfig KUBECONFIG
- To get the token payload associated with each invalid token context,
parse each related service account secret with the following command:
kubectl -n kube-system get secret SA_SECRET \
--kubeconfig KUBECONFIG \
-o jsonpath='{.data.token}' | base64 --decode
- To decode the token and see the source pod name and namespace, copy
the token to the debugger at jwt.io.
- Restart the pods identified from the tokens.
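For example, restart an offending Pod by deleting it so that its controller recreates it, using the name and namespace decoded from the token:
kubectl delete pod POD_NAME -n POD_NAMESPACE --kubeconfig KUBECONFIG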
|
Operation |
1.8, 1.9, 1.10 |
Increased memory usage of etcd maintenance pods
The etcd maintenance pods that use the etcddefrag:gke_master_etcddefrag_20210211.00_p0 image are affected. The `etcddefrag` container opens a new connection to the etcd server during each defrag cycle, and the old connections are not cleaned up.
Workaround:
Option 1: Upgrade to the latest patch version from 1.8 to 1.11, which contains the fix.
Option 2: If you are using a patch version earlier than 1.9.6 or 1.10.3, you need to scale down the etcd-maintenance pod for the admin and user clusters:
kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
kubectl scale --replicas 0 deployment/gke-master-etcd-maintenance -n kube-system --kubeconfig ADMIN_CLUSTER_KUBECONFIG
|
Operation |
1.9, 1.10, 1.11, 1.12, 1.13 |
Missing health checks of user cluster control plane pods
Both the cluster health controller and the gkectl diagnose cluster command perform a set of health checks, including pod health checks across namespaces. However, they skip the user control plane pods by mistake. If you use the control plane v2 mode, this doesn't affect your cluster.
Workaround:
This won't affect any workload or cluster management. If you want to check the health of the control plane pods, you can run the following command:
kubectl get pods -owide -n USER_CLUSTER_NAME --kubeconfig ADMIN_CLUSTER_KUBECONFIG
|
Upgrades and updates |
1.6+, 1.7+ |
1.6 and 1.7 admin cluster upgrades may be affected by the k8s.gcr.io -> registry.k8s.io redirect
Kubernetes redirected traffic from k8s.gcr.io to registry.k8s.io on 3/20/2023. In Anthos clusters on VMware 1.6.x and 1.7.x, the admin cluster upgrades use the container image k8s.gcr.io/pause:3.2 . If you use a proxy for your admin workstation, the proxy doesn't allow registry.k8s.io , and the container image k8s.gcr.io/pause:3.2 is not cached locally, the admin cluster upgrades fail when pulling the container image.
Workaround:
Add registry.k8s.io to the allowlist of the proxy for your admin workstation.
|
Networking |
1.10, 1.11, 1.12.0-1.12.6, 1.13.0-1.13.6, 1.14.0-1.14.2 |
Seesaw validation failure on load balancer creation
gkectl create loadbalancer fails with the following error message:
- Validation Category: Seesaw LB - [FAILURE] Seesaw validation: xxx cluster lb health check failed: LB"xxx.xxx.xxx.xxx" is not healthy: Get "http://xxx.xxx.xxx.xxx:xxx/healthz": dial tcpxxx.xxx.xxx.xxx:xxx: connect: no route to host
This is due to the Seesaw group file already existing, and the preflight check
trying to validate a non-existent Seesaw load balancer.
Workaround:
Remove the existing seesaw group file for this cluster. The file name
is seesaw-for-gke-admin.yaml for the admin cluster, and
seesaw-for-{CLUSTER_NAME}.yaml for a user cluster.
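For example, assuming the group file sits in the directory on your admin workstation from which you run gkectl (adjust the path for your environment):
rm seesaw-for-gke-admin.yaml
rm seesaw-for-USER_CLUSTER_NAME.yaml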
|
Networking |
1.14 |
Application timeouts caused by conntrack table insertion failures
Anthos clusters on VMware version 1.14 is susceptible to netfilter
connection tracking (conntrack) table insertion failures when using
Ubuntu or COS operating system images. Insertion failures lead to random
application timeouts and can occur even when the conntrack table has room
for new entries. The failures are caused by changes in
kernel 5.15 and higher that restrict table insertions based on chain
length.
To see if you are affected by this issue, you can check the in-kernel
connection tracking system statistics on each node with the following
command:
sudo conntrack -S
The response looks like this:
cpu=0 found=0 invalid=4 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=1 found=0 invalid=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=2 found=0 invalid=16 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=3 found=0 invalid=13 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=4 found=0 invalid=9 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0 clash_resolve=0 chaintoolong=0
cpu=5 found=0 invalid=1 insert=0 insert_failed=0 drop=0 early_drop=0 error=519 search_restart=0 clash_resolve=126 chaintoolong=0
...
If a chaintoolong value in the response is a non-zero
number, you're affected by this issue.
Workaround:
The short-term mitigation is to increase the size of both the netfilter
hash table (nf_conntrack_buckets ) and the netfilter
connection tracking table (nf_conntrack_max ). Use the
following commands on each cluster node to increase the size of the
tables:
sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE
Replace TABLE_SIZE with the new table size. The
default table size value is 262144 . We suggest that you set a
value equal to 65,536 times the number of cores on the node. For example,
if your node has eight cores, set the table size to 524288 .
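For example, a minimal sketch that derives the suggested value from the node's core count and applies it:
TABLE_SIZE=$(( $(nproc) * 65536 ))
sysctl -w net.netfilter.nf_conntrack_buckets=$TABLE_SIZE
sysctl -w net.netfilter.nf_conntrack_max=$TABLE_SIZE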
|
Networking |
1.13.0-1.13.2 |
calico-typha or anetd-operator crash loop on Windows nodes with Controlplane v2
With Controlplane v2 or a new installation model, calico-typha or anetd-operator might be scheduled to Windows nodes and get into a crash loop.
The reason is that the two deployments tolerate all taints, including the Windows node taint.
Workaround:
Either upgrade to 1.13.3+, or run the following commands to edit the `calico-typha` or `anetd-operator` deployment:
# If dataplane v2 is not used.
kubectl edit deployment -n kube-system calico-typha --kubeconfig USER_CLUSTER_KUBECONFIG
# If dataplane v2 is used.
kubectl edit deployment -n kube-system anetd-operator --kubeconfig USER_CLUSTER_KUBECONFIG
Remove the following spec.template.spec.tolerations :
- effect: NoSchedule
operator: Exists
- effect: NoExecute
operator: Exists
And add the following toleration:
- key: node-role.kubernetes.io/master
operator: Exists
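After editing the deployment, you can confirm that the pods are rescheduled onto Linux nodes only. This is a minimal sketch; check whichever of the two deployments applies to your dataplane:
# The NODE column should no longer list any Windows nodes.
kubectl get pods -n kube-system -o wide --kubeconfig USER_CLUSTER_KUBECONFIG | grep -E 'calico-typha|anetd-operator'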
|
Configuration |
1.14.0-1.14.2 |
User cluster private registry credential file cannot be loaded
You might not be able to create a user cluster if you specify the
privateRegistry section with credential fileRef .
Preflight might fail with the following message:
[FAILURE] Docker registry access: Failed to login.
Workaround:
|
Operations |
1.10+ |
Anthos Service Mesh and other service meshes not compatible with Dataplane v2
Dataplane V2 takes over load balancing and creates a kernel socket instead of performing packet-based DNAT. This means that Anthos Service Mesh
cannot do packet inspection, because the Pod is bypassed and never uses iptables.
In kube-proxy free mode, this manifests as loss of connectivity or incorrect traffic routing for services that use Anthos Service Mesh, because the sidecar cannot do packet inspection.
This issue is present on all 1.10 versions of Anthos clusters on VMware; however, some newer 1.10 versions (1.10.2 and later) have a workaround.
Workaround:
Either upgrade to 1.11 for full compatibility or, if running 1.10.2 or later, run:
kubectl edit cm -n kube-system cilium-config --kubeconfig USER_CLUSTER_KUBECONFIG
Add bpf-lb-sock-hostns-only: true to the configmap and then restart the anetd daemonset:
kubectl rollout restart ds anetd -n kube-system --kubeconfig USER_CLUSTER_KUBECONFIG
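Before restarting, you can verify that the ConfigMap change is in place; a minimal sketch:
# The flag should be present and set to "true".
kubectl get cm cilium-config -n kube-system -o yaml --kubeconfig USER_CLUSTER_KUBECONFIG | grep bpf-lb-sock-hostns-only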
|
Storage |
1.12+, 1.13.3 |
kube-controller-manager might detach persistent volumes
forcefully after 6 minutes
kube-controller-manager might time out when detaching
PVs/PVCs after 6 minutes and forcefully detach them. Detailed logs
from kube-controller-manager show events similar to the
following:
$ cat kubectl_logs_kube-controller-manager-xxxx | grep "DetachVolume started" | grep expired
kubectl_logs_kube-controller-manager-gke-admin-master-4mgvr_--container_kube-controller-manager_--kubeconfig_kubeconfig_--request-timeout_30s_--namespace_kube-system_--timestamps:2023-01-05T16:29:25.883577880Z W0105 16:29:25.883446 1 reconciler.go:224] attacherDetacher.DetachVolume started for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f"
This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
To verify the issue, log into the node and run the following commands:
# See all the mounting points with disks
lsblk -f
# See some ext4 errors
sudo dmesg -T
In the kubelet log, errors like the following are displayed:
Error: GetDeviceMountRefs check failed for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" :
the device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount
Workaround:
Connect to the affected node using SSH and reboot the node.
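A minimal sketch of that step, assuming a hypothetical SSH key path and the default node user for your OS image (adjust both for your environment):
# NODE_IP, the user name, and the key path are placeholders, not fixed values.
ssh -i ~/.ssh/NODE_SSH_KEY ubuntu@NODE_IP "sudo reboot"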
|
Upgrades and updates |
1.12+, 1.13+, 1.14+ |
Cluster upgrade is stuck if 3rd party CSI driver is used
You might not be able to upgrade a cluster if you use a 3rd-party CSI
driver. The gkectl diagnose cluster command might return the
following error:
"virtual disk "kubernetes.io/csi/csi.netapp.io^pvc-27a1625f-29e3-4e4f-9cd1-a45237cc472c" IS NOT attached to machine "cluster-pool-855f694cc-cjk5c" but IS listed in the Node.Status"
Workaround:
Perform the upgrade using the --skip-validation-all
option.
|
Operation |
1.10+, 1.11+, 1.12+, 1.13+, 1.14+ |
gkectl repair admin-master creates the admin master VM
without upgrading its VM hardware version
The admin master node created via gkectl repair admin-master
may use a lower VM hardware version than expected. When the issue happens,
you see the following error in the gkectl diagnose cluster
report:
CSIPrerequisites [VM Hardware]: The current VM hardware versions are lower than vmx-15 which is unexpected. Please contact Anthos support to resolve this issue.
Workaround:
Shut down the admin master node, follow
https://kb.vmware.com/s/article/1003746
to upgrade the node to the expected version described in the error
message, and then start the node.
|
Operating system |
1.10+, 1.11+, 1.12+, 1.13+, 1.14+, 1.15+, 1.16+ |
VM releases DHCP lease on shutdown/reboot unexpectedly, which may
result in IP changes
In systemd v244, systemd-networkd has a
default behavior change
on the KeepConfiguration configuration. Before this change,
VMs did not send a DHCP lease release message to the DHCP server on
shutdown or reboot. After this change, VMs send such a message and
return the IPs to the DHCP server. As a result, the released IP may be
reallocated to a different VM, or a different IP may be assigned to the
VM. This results in an IP conflict (at the Kubernetes level, not the vSphere level)
or an IP change on the VMs, which can break the clusters in various ways.
For example, you may see the following symptoms:
- vCenter UI shows that no VMs use the same IP, but
kubectl get
nodes -o wide returns nodes with duplicate IPs.
NAME STATUS AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready 28h v1.22.8-gke.204 10.180.85.130 10.180.85.130 Ubuntu 20.04.4 LTS 5.4.0-1049-gkeop containerd://1.5.13
node2 NotReady 71d v1.22.8-gke.204 10.180.85.130 10.180.85.130 Ubuntu 20.04.4 LTS 5.4.0-1049-gkeop containerd://1.5.13
- New nodes fail to start due to
calico-node error
2023-01-19T22:07:08.817410035Z 2023-01-19 22:07:08.817 [WARNING][9] startup/startup.go 1135: Calico node 'node1' is already using the IPv4 address 10.180.85.130.
2023-01-19T22:07:08.817514332Z 2023-01-19 22:07:08.817 [INFO][9] startup/startup.go 354: Clearing out-of-date IPv4 address from this node IP="10.180.85.130/24"
2023-01-19T22:07:08.825614667Z 2023-01-19 22:07:08.825 [WARNING][9] startup/startup.go 1347: Terminating
2023-01-19T22:07:08.828218856Z Calico node failed to start
Workaround:
Deploy the following DaemonSet on the cluster to revert the
systemd-networkd default behavior change. The VMs that run
this DaemonSet will not release the IPs to the DHCP server on
shutdown/reboot. The IPs will be freed automatically by the DHCP server
when the leases expire.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: set-dhcp-on-stop
spec:
selector:
matchLabels:
name: set-dhcp-on-stop
template:
metadata:
labels:
name: set-dhcp-on-stop
spec:
hostIPC: true
hostPID: true
hostNetwork: true
containers:
- name: set-dhcp-on-stop
image: ubuntu
tty: true
command:
- /bin/bash
- -c
- |
set -x
date
while true; do
export CONFIG=/host/run/systemd/network/10-netplan-ens192.network;
grep KeepConfiguration=dhcp-on-stop "${CONFIG}" > /dev/null
if (( $? != 0 )) ; then
echo "Setting KeepConfiguration=dhcp-on-stop"
sed -i '/\[Network\]/a KeepConfiguration=dhcp-on-stop' "${CONFIG}"
cat "${CONFIG}"
chroot /host systemctl restart systemd-networkd
else
echo "KeepConfiguration=dhcp-on-stop has already been set"
fi;
sleep 3600
done
volumeMounts:
- name: host
mountPath: /host
resources:
requests:
memory: "10Mi"
cpu: "5m"
securityContext:
privileged: true
volumes:
- name: host
hostPath:
path: /
tolerations:
- operator: Exists
effect: NoExecute
- operator: Exists
effect: NoSchedule
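A minimal sketch for deploying the manifest, assuming you save it to a file named set-dhcp-on-stop.yaml (the file name is arbitrary); repeat with the admin cluster kubeconfig if the admin cluster is also affected:
kubectl apply -f set-dhcp-on-stop.yaml --kubeconfig USER_CLUSTER_KUBECONFIG
kubectl get pods -l name=set-dhcp-on-stop -o wide --kubeconfig USER_CLUSTER_KUBECONFIG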
|
Operation, upgrades and updates |
1.12.0-1.12.5, 1.13.0-1.13.5, 1.14.0-1.14.1 |
Component access service account key wiped out after admin cluster
upgraded from 1.11.x
This issue only affects admin clusters that were upgraded
from 1.11.x; admin clusters newly created on 1.12 or later are not affected.
After a 1.11.x cluster is upgraded to 1.12.x, the
component-access-sa-key field in the
admin-cluster-creds secret is wiped out (left empty).
You can check this by running the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -o yaml | grep 'component-access-sa-key'
If the output is empty, the key has been wiped out.
After the component access service account key has been deleted,
installing new user clusters or upgrading existing user clusters
fails. The following are some error messages you might encounter:
- Slow validation preflight failure with error message:
"Failed
to create the test VMs: failed to get service account key: service
account is not configured."
- Prepare by
gkectl prepare failed with error message:
"Failed to prepare OS images: dialing: unexpected end of JSON
input"
- If you are upgrading a 1.13 user cluster using the Google Cloud
Console or the gcloud CLI, when you run
gkectl update admin --enable-preview-user-cluster-central-upgrade
to deploy the upgrade platform controller, the command fails
with the message: "failed to download bundle to disk: dialing:
unexpected end of JSON input" (You can see this message
in the status field in
the output of kubectl --kubeconfig
ADMIN_KUBECONFIG -n kube-system get onprembundle -oyaml ).
Workaround:
Add the component access service account key back into the secret
manually by running the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get secret admin-cluster-creds -ojson | jq --arg casa "$(cat COMPONENT_ACCESS_SERVICE_ACCOUNT_KEY_PATH | base64 -w 0)" '.data["component-access-sa-key"]=$casa' | kubectl --kubeconfig ADMIN_KUBECONFIG apply -f -
|
Operation |
1.13.0+, 1.14.0+ |
Cluster autoscaler does not work when Controlplane V2 is enabled
For user clusters created with Controlplane V2 or a new installation model, node pools with autoscaling enabled always use their autoscaling.minReplicas value from the user-cluster.yaml. The logs of the cluster-autoscaler pod also show that the node pools are unhealthy.
> kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
logs $CLUSTER_AUTOSCALER_POD --container cluster-autoscaler
TIMESTAMP 1 gkeonprem_provider.go:73] error getting onpremusercluster ready status: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
TIMESTAMP 1 static_autoscaler.go:298] Failed to get node infos for groups: Expected to get a onpremusercluster with id foo-user-cluster-gke-onprem-mgmt/foo-user-cluster
The cluster autoscaler pod can be found by running the following commands.
> kubectl --kubeconfig $USER_CLUSTER_KUBECONFIG -n kube-system \
get pods | grep cluster-autoscaler
cluster-autoscaler-5857c74586-txx2c 4648017n 48076Ki 30s
Workaround:
Disable autoscaling in all the node pools with `gkectl update cluster` until you upgrade to a version with the fix.
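A minimal sketch of that update, assuming you have already removed or disabled the autoscaling block for each node pool in user-cluster.yaml:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE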
|
Installation |
1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0 |
CIDR is not allowed in the IP block file
If you use CIDR notation in the IP block file, config validation fails with the following error:
- Validation Category: Config Check
- [FAILURE] Config: AddressBlock for admin cluster spec is invalid: invalid IP:
172.16.20.12/30
Workaround:
Include individual IPs in the IP block file until you upgrade to a version with the fix: 1.12.5, 1.13.4, or 1.14.1 and later.
|
Upgrades and updates |
1.14.0-1.14.1 |
OS image type update in the admin-cluster.yaml doesn't wait for user control plane machines to be re-created
When you update the control plane OS image type in the admin-cluster.yaml, and the corresponding user cluster was created via Controlplane V2, the user control plane machines may not have finished their re-creation when the gkectl command finishes.
Workaround:
After the update finishes, keep waiting for the user control plane machines to also finish their re-creation by monitoring their node OS image types with kubectl --kubeconfig USER_KUBECONFIG get nodes -owide . For example, when updating from Ubuntu to COS, wait for all the control plane machines to change completely from Ubuntu to COS, even after the update command has completed.
|
Operation |
1.14.0 |
Pod create or delete errors due to Calico CNI service account auth token
issue
An issue with Calico in Anthos clusters on VMware 1.14.0
causes Pod creation and deletion to fail with the following error message in
the output of kubectl describe pods :
error getting ClusterInformation: connection is unauthorized: Unauthorized
This issue is only observed 24 hours after the cluster is
created or upgraded to 1.14 using Calico.
Admin clusters always use Calico. For a user cluster, check the
`enableDataPlaneV2` field in user-cluster.yaml: if that field is
set to `false` or not specified, the user cluster uses
Calico.
The nodes' install-cni container creates a kubeconfig with a
token that is valid for 24 hours. This token needs to be periodically
renewed by the calico-node Pod. The calico-node
Pod is unable to renew the token as it doesn't have access to the directory
that contains the kubeconfig file on the node.
Workaround:
To mitigate the issue, apply the following patch on the
calico-node DaemonSet in your admin and user cluster:
kubectl -n kube-system get daemonset calico-node \
--kubeconfig ADMIN_CLUSTER_KUBECONFIG -o json \
| jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
| kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f -
kubectl -n kube-system get daemonset calico-node \
--kubeconfig USER_CLUSTER_KUBECONFIG -o json \
| jq '.spec.template.spec.containers[0].volumeMounts += [{"name":"cni-net-dir","mountPath":"/host/etc/cni/net.d"}]' \
| kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f -
Replace the following:
ADMIN_CLUSTER_KUBECONFIG : the path
of the admin cluster kubeconfig file.
USER_CLUSTER_KUBECONFIG : the path
of the user cluster kubeconfig file.
|
Installation |
1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0 |
IP block validation fails when using CIDR
Cluster creation fails even though the user has a proper configuration. Creation fails because the cluster does not have enough IP addresses.
Workaround:
Split CIDRs into several smaller CIDR blocks; for example, 10.0.0.0/30 becomes 10.0.0.0/31, 10.0.0.2/31 . As long as there are N+1 CIDRs, where N is the number of nodes in the cluster, this should suffice.
|
Operation, Upgrades and updates |
1.11.0 - 1.11.1, 1.10.0 - 1.10.4, 1.9.0 - 1.9.6 |
Admin cluster backup does not include the always-on secrets encryption keys and configuration
When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup causes the following error:
Validating admin master VM xxx ...
Waiting for kube-apiserver to be accessible via LB VIP (timeout "8m0s")... ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "13m0s")... ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Waiting for kube-apiserver to be accessible via LB VIP (timeout "18m0s")... ERROR
Failed to access kube-apiserver via LB VIP. Trying to fix the problem by rebooting the admin master
Workaround:
- Use the gkectl binary of the latest available patch version for the corresponding minor version to perform the admin cluster backup after critical cluster operations. For example, if the cluster is running a 1.10.2 version, use the 1.10.5 gkectl binary to perform a manual admin cluster backup as described in Backup and Restore an admin cluster with gkectl.
|
Operation, Upgrades and updates |
1.10+ |
Recreating the admin master VM with a new boot disk (e.g., gkectl repair admin-master ) will fail if the always-on secrets encryption feature is enabled using the `gkectl update` command.
If the always-on secrets encryption feature is not enabled at cluster creation but enabled later using the gkectl update operation, then gkectl repair admin-master fails to repair the admin cluster control plane node. It is recommended that you enable the always-on secrets encryption feature at cluster creation. There is no current mitigation.
|
Upgrades and updates |
1.10 |
Upgrading the first user cluster from 1.9 to 1.10 recreates nodes in other user clusters
Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.
The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables , which triggered an update on all MachineDeployment s unexpectedly.
Workaround:
- Scale down the replica of
clusterapi-controllers to 0 for all user clusters.
kubectl scale --replicas=0 -n=USER_CLUSTER_NAME deployment/clusterapi-controllers --kubeconfig ADMIN_CLUSTER_KUBECONFIG
Upgrade each user cluster one by one.
|
Upgrades and updates |
1.10.0 |
Docker restarts frequently after cluster upgrade
Upgrading a user cluster to 1.10.0 might cause Docker to restart frequently.
You can detect this issue by running kubectl describe node NODE_NAME --kubeconfig USER_CLUSTER_KUBECONFIG
A node condition shows whether Docker is restarting frequently. Here is an example output:
Normal FrequentDockerRestart 41m (x2 over 141m) systemd-monitor Node condition FrequentDockerRestart is now: True, reason: FrequentDockerRestart
To understand the root cause, SSH into the node that has the symptom and run commands like sudo journalctl --utc -u docker or sudo journalctl -x .
Workaround:
|
Upgrades and updates |
1.11, 1.12 |
Self-deployed GMP components not preserved after upgrading to version 1.12
If you are using an Anthos clusters on VMware version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system
namespace for your cluster, the components are not preserved when you
upgrade to version 1.12.x.
From version 1.12, GMP components in the gmp-system namespace and the related CRDs are managed by the stackdriver
object, with the enableGMPForApplications flag set to false by
default. If you manually deployed GMP components in that namespace prior to upgrading to 1.12, the resources are deleted by stackdriver .
Workaround:
|
Operation |
1.11, 1.12, 1.13.0 - 1.13.1 |
Missing ClusterAPI objects in cluster snapshot system scenario
In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.
However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.
Workaround:
You can manually run the following commands to collect the debugging information.
export KUBECONFIG=USER_CLUSTER_KUBECONFIG
kubectl get clusters.cluster.k8s.io -o yaml
kubectl get controlplanes.cluster.k8s.io -o yaml
kubectl get machineclasses.cluster.k8s.io -o yaml
kubectl get machinedeployments.cluster.k8s.io -o yaml
kubectl get machines.cluster.k8s.io -o yaml
kubectl get machinesets.cluster.k8s.io -o yaml
kubectl get services -o yaml
kubectl describe clusters.cluster.k8s.io
kubectl describe controlplanes.cluster.k8s.io
kubectl describe machineclasses.cluster.k8s.io
kubectl describe machinedeployments.cluster.k8s.io
kubectl describe machines.cluster.k8s.io
kubectl describe machinesets.cluster.k8s.io
kubectl describe services
where:
USER_CLUSTER_KUBECONFIG is the user cluster's
kubeconfig file.
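If you want to capture the same information into files, for example to attach to a support case, a minimal sketch:
export KUBECONFIG=USER_CLUSTER_KUBECONFIG
for r in clusters.cluster.k8s.io controlplanes.cluster.k8s.io machineclasses.cluster.k8s.io machinedeployments.cluster.k8s.io machines.cluster.k8s.io machinesets.cluster.k8s.io services; do
  # One YAML dump and one describe output per resource type.
  kubectl get "${r}" -o yaml > "${r}.yaml"
  kubectl describe "${r}" > "${r}.txt"
done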
|
Upgrades and updates |
1.11.0-1.11.4, 1.12.0-1.12.3, 1.13.0-1.13.1 |
User cluster deletion stuck at node drain for vSAN setup
When deleting, updating or upgrading a user cluster, node drain may be stuck in the following scenarios:
- The admin cluster has been using vSphere CSI driver on vSAN since version 1.12.x, and
- There are no PVC/PV objects created by in-tree vSphere plugins in the admin and user cluster.
To identify the symptom, run the command below:
kubectl logs clusterapi-controllers-POD_NAME_SUFFIX --kubeconfig ADMIN_KUBECONFIG -n USER_CLUSTER_NAMESPACE
Here is a sample error message from the above command:
E0920 20:27:43.086567 1 machine_controller.go:250] Error deleting machine object [MACHINE]; Failed to delete machine [MACHINE]: failed to detach disks from VM "[MACHINE]": failed to convert disk path "kubevols" to UUID path: failed to convert full path "ds:///vmfs/volumes/vsan:[UUID]/kubevols": ServerFaultCode: A general system error occurred: Invalid fault
kubevols is the default directory for the vSphere in-tree driver. When there are no PVC/PV objects created, you may hit a bug where node drain gets stuck looking for kubevols , because the current implementation assumes that the kubevols directory always exists.
Workaround:
Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.
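A minimal sketch of creating the directory with govc, assuming govc is installed and its GOVC_* environment variables point at your vCenter; DATASTORE_NAME is the value of vCenter.datastore:
# Creates an empty kubevols folder at the root of the datastore.
govc datastore.mkdir -ds DATASTORE_NAME kubevols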
|
Configuration |
1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14 |
Cluster Autoscaler clusterrolebinding and clusterrole are deleted after deleting a user cluster.
On user cluster deletion, the corresponding clusterrole and clusterrolebinding for cluster-autoscaler are also deleted. This affects all other user clusters on the same admin cluster with cluster autoscaler enabled. This is because the same clusterrole and clusterrolebinding are used for all cluster autoscaler pods within the same admin cluster.
The symptoms are the following:
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
cluster-autoscaler
where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's
kubeconfig file.
Here is an example of error messages you might see:
2023-03-26T10:45:44.866600973Z W0326 10:45:44.866463 1 reflector.go:424] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
2023-03-26T10:45:44.866646815Z E0326 10:45:44.866494 1 reflector.go:140] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: onpremuserclusters.onprem.cluster.gke.io is forbidden: User "..." cannot list resource "onpremuserclusters" in API group "onprem.cluster.gke.io" at the cluster scope
Workaround:
Verify whether the clusterrole and clusterrolebinding are missing on the admin cluster
-
kubectl get clusterrolebindings --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system | grep cluster-autoscaler
-
kubectl get clusterrole --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system | grep cluster-autoscaler
Apply the following clusterrole and clusterrolebinding to the admin cluster if they are missing. Add the service account subjects to the clusterrolebinding for each user cluster.
-
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-autoscaler
rules:
- apiGroups: ["cluster.k8s.io"]
resources: ["clusters"]
verbs: ["get", "list", "watch"]
- apiGroups: ["cluster.k8s.io"]
resources: ["machinesets","machinedeployments", "machinedeployments/scale","machines"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["onprem.cluster.gke.io"]
resources: ["onpremuserclusters"]
verbs: ["get", "list", "watch"]
- apiGroups:
- coordination.k8s.io
resources:
- leases
resourceNames: ["cluster-autoscaler"]
verbs:
- get
- list
- watch
- create
- update
- patch
- apiGroups:
- ""
resources:
- nodes
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups:
- ""
resources:
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- ""
resources:
- pods/eviction
verbs: ["create"]
# read-only access to cluster state
- apiGroups: [""]
resources: ["services", "replicationcontrollers", "persistentvolumes", "persistentvolumeclaims"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["daemonsets", "replicasets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["statefulsets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
resources: ["jobs"]
verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
resources: ["poddisruptionbudgets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses", "csinodes"]
verbs: ["get", "list", "watch"]
# misc access
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "update", "patch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create"]
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["cluster-autoscaler-status"]
verbs: ["get", "update", "patch", "delete"]
-
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
k8s-app: cluster-autoscaler
name: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: NAMESPACE_OF_USER_CLUSTER_1
- kind: ServiceAccount
name: cluster-autoscaler
namespace: NAMESPACE_OF_USER_CLUSTER_2
...
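A minimal sketch for applying the manifests, assuming you save the ClusterRole and ClusterRoleBinding above to a file named cluster-autoscaler-rbac.yaml (the file name is arbitrary):
kubectl apply -f cluster-autoscaler-rbac.yaml --kubeconfig ADMIN_CLUSTER_KUBECONFIG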
|
Configuration |
1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13 |
admin cluster cluster-health-controller and vsphere-metrics-exporter do not work after deleting user cluster
On user cluster deletion, the corresponding clusterrole is also deleted, which results in auto repair and the vSphere metrics exporter not working.
The symptoms are the following:
cluster-health-controller logs
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
cluster-health-controller
where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's
kubeconfig file.
Here is an example of error messages you might see:
error retrieving resource lock default/onprem-cluster-health-leader-election: configmaps "onprem-cluster-health-leader-election" is forbidden: User "system:serviceaccount:kube-system:cluster-health-controller" cannot get resource "configmaps" in API group "" in the namespace "default": RBAC: clusterrole.rbac.authorization.k8s.io "cluster-health-controller-role" not found
vsphere-metrics-exporter logs
kubectl logs --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system \
vsphere-metrics-exporter
where ADMIN_CLUSTER_KUBECONFIG is the admin cluster's
kubeconfig file.
Here is an example of error messages you might see:
vsphere-metrics-exporter/cmd/vsphere-metrics-exporter/main.go:68: Failed to watch *v1alpha1.Cluster: failed to list *v1alpha1.Cluster: clusters.cluster.k8s.io is forbidden: User "system:serviceaccount:kube-system:vsphere-metrics-exporter" cannot list resource "clusters" in API group "cluster.k8s.io" in the namespace "default"
Workaround:
Apply the following yaml to the admin cluster
- For vsphere-metrics-exporter
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vsphere-metrics-exporter
rules:
- apiGroups:
- cluster.k8s.io
resources:
- clusters
verbs: [get, list, watch]
- apiGroups:
- ""
resources:
- nodes
verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
k8s-app: vsphere-metrics-exporter
name: vsphere-metrics-exporter
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: vsphere-metrics-exporter
subjects:
- kind: ServiceAccount
name: vsphere-metrics-exporter
namespace: kube-system
- For cluster-health-controller
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-health-controller-role
rules:
- apiGroups:
- "*"
resources:
- "*"
verbs:
- "*"
|
Configuration |
1.12.1-1.12.3, 1.13.0-1.13.2 |
gkectl check-config fails at OS image validation
gkectl check-config can fail if you have not yet run gkectl prepare . This is confusing because we suggest running gkectl check-config before gkectl prepare .
The symptom is that the gkectl check-config command will fail with the
following error message:
Validator result: {Status:FAILURE Reason:os images [OS_IMAGE_NAME] don't exist, please run `gkectl prepare` to upload os images. UnhealthyResources:[]}
Workaround:
Option 1: run gkectl prepare to upload the missing OS images.
Option 2: use gkectl check-config --skip-validation-os-images to skip the OS images validation.
|
Upgrades and updates |
1.11, 1.12, 1.13 |
gkectl update admin/cluster fails at updating anti affinity groups
A known issue can cause gkectl update admin/cluster to fail when updating anti affinity groups .
The symptom is that the gkectl update command will fail with the
following error message:
Waiting for machines to be re-deployed... ERROR
Exit with error:
Failed to update the cluster: timed out waiting for the condition
Workaround:
For the update to take effect, the machines need to be recreated after the failed update.
For an admin cluster update, the user master and admin addon nodes need to be recreated.
For a user cluster update, the user worker nodes need to be recreated.
To recreate user worker nodes
Option 1: Follow update a node pool and change the CPU or memory to trigger a rolling recreation of the nodes.
Option 2: Use kubectl delete to recreate the machines one at a time:
kubectl delete machines MACHINE_NAME --kubeconfig USER_KUBECONFIG
To recreate user master nodes
Option 1: Follow resize control plane and change the CPU or memory to trigger a rolling recreation of the nodes.
Option 2: Use kubectl delete to recreate the machines one at a time:
kubectl delete machines MACHINE_NAME --kubeconfig ADMIN_KUBECONFIG
To recreate admin addon nodes
Use kubectl delete to recreate the machines one at a time:
kubectl delete machines MACHINE_NAME --kubeconfig ADMIN_KUBECONFIG
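To pick the MACHINE_NAME values for the delete commands above, you can list the machine objects first; a minimal sketch:
# Worker machines, listed with the same kubeconfig used for deletion.
kubectl get machines --kubeconfig USER_KUBECONFIG
# User master and admin addon machines, listed from the admin cluster.
kubectl get machines --kubeconfig ADMIN_KUBECONFIG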
|
Installation, Upgrades and updates |
1.13.0-1.13.8, 1.14.0-1.14.4, 1.15.0 |
Node registration fails during cluster creation, upgrade, update and
node auto repair, when ipMode.type is static and
the configured hostname in the
IP block file contains one
or more periods. In this case, Certificate Signing Requests (CSR) for a
node are not automatically approved.
To see pending CSRs for a node, run the following command:
kubectl get csr -A -o wide
Check the following logs for error messages:
- View the logs in the admin cluster for the
clusterapi-controller-manager container in the
clusterapi-controllers Pod:
kubectl logs clusterapi-controllers-POD_NAME \
-c clusterapi-controller-manager -n kube-system \
--kubeconfig ADMIN_CLUSTER_KUBECONFIG
- To view the same logs in the user cluster, run the following
command:
kubectl logs clusterapi-controllers-POD_NAME \
-c clusterapi-controller-manager -n USER_CLUSTER_NAME \
--kubeconfig ADMIN_CLUSTER_KUBECONFIG
where:
- ADMIN_CLUSTER_KUBECONFIG is the admin cluster's
kubeconfig file.
- USER_CLUSTER_NAME is the name of the user cluster.
Here is an example of error messages you might see: "msg"="failed
to validate token id" "error"="failed to find machine for node
node-worker-vm-1" "validate"="csr-5jpx9"
- View the
kubelet logs on the problematic node:
journalctl -u kubelet
Here is an example of error messages you might see: "Error getting
node" err="node \"node-worker-vm-1\" not found"
If you specify a domain name in the hostname field of an IP block file,
any characters following the first period will be ignored. For example, if
you specify the hostname as bob-vm-1.bank.plc , the VM
hostname and node name will be set to bob-vm-1 .
When node ID verification is enabled, the CSR approver compares the
node name with the hostname in the Machine spec, and fails to reconcile
the name. The approver rejects the CSR, and the node fails to
bootstrap.
Workaround:
User cluster
Disable node ID verification by completing the following steps:
- Add the following fields in your user cluster configuration file:
disableNodeIDVerification: true
disableNodeIDVerificationCSRSigning: true
- Save the file, and update the user cluster by running the following
command:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
--config USER_CLUSTER_CONFIG_FILE
Replace the following:
ADMIN_CLUSTER_KUBECONFIG : the path
of the admin cluster kubeconfig file.
USER_CLUSTER_CONFIG_FILE : the path
of your user cluster configuration file.
Admin cluster
- Open the
OnPremAdminCluster custom resource for
editing:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
edit onpremadmincluster -n kube-system
- Add the following annotation to the custom resource:
features.onprem.cluster.gke.io/disable-node-id-verification: enabled
- Edit the
kube-controller-manager manifest in the admin
cluster control plane:
- SSH into the
admin cluster control plane node.
- Open the
kube-controller-manager manifest for
editing:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
- Find the list of
controllers :
--controllers=*,bootstrapsigner,tokencleaner,-csrapproving,-csrsigning
- Update this section as shown below:
--controllers=*,bootstrapsigner,tokencleaner
- Open the Deployment Cluster API controller for editing:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
edit deployment clusterapi-controllers -n kube-system
- Change the values of
node-id-verification-enabled and
node-id-verification-csr-signing-enabled to
false :
--node-id-verification-enabled=false
--node-id-verification-csr-signing-enabled=false
|
Installation, Upgrades and updates |
1.11.0-1.11.4 |
Admin control plane machine startup failure caused by private registry
certificate bundle
The admin cluster creation/upgrade is stuck at the following log forever
and eventually times out:
Waiting for Machine gke-admin-master-xxxx to become ready...
The Cluster API controller log in the
external cluster snapshot includes the following log:
Invalid value 'XXXX' specified for property startup-data
Here is an example file path for the Cluster API controller log:
kubectlCommands/kubectl_logs_clusterapi-controllers-c4fbb45f-6q6g6_--container_vsphere-controller-manager_--kubeconfig_.home.ubuntu..kube.kind-config-gkectl_--request-timeout_30s_--namespace_kube-system_--timestamps
VMware has a 64k vApp property size limit. In the identified versions,
the data passed via vApp property is close to the limit. When the private
registry certificate contains a certificate bundle, it may cause the final
data to exceed the 64k limit.
Workaround:
Only include the required certificates in the private registry
certificate file configured in privateRegistry.caCertPath in
the admin cluster config file.
Or upgrade to a version with the fix when available.
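To see whether the file configured in privateRegistry.caCertPath contains a full bundle, a minimal sketch (CA_CERT_PATH is the path from your admin cluster config file):
# A large multi-certificate bundle is what pushes the vApp property data over the 64k limit.
grep -c 'BEGIN CERTIFICATE' CA_CERT_PATH
wc -c CA_CERT_PATH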
|
Networking |
1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0 |
NetworkGatewayNodes marked unhealthy from concurrent
status update conflict
In networkgatewaygroups.status.nodes , some nodes switch
between NotHealthy and Up .
Logs for the ang-daemon Pod running on that node reveal
repeated errors:
2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}
The NotHealthy status prevents the controller from
assigning additional floating IPs to the node. This can result in higher
burden on other nodes or a lack of redundancy for high availability.
Dataplane activity is otherwise not affected.
Contention on the networkgatewaygroup object causes some
status updates to fail due to a fault in retry handling. If too many
status updates fail, ang-controller-manager sees the node as
past its heartbeat time limit and marks the node NotHealthy .
The fault in retry handling has been fixed in later versions.
Workaround:
Upgrade to a fixed version, when available.
|
Upgrades and updates |
1.12.0-1.12.2, 1.13.0 |
Race condition blocks machine object deletion during an update or
upgrade
A known issue that could cause the cluster upgrade or update to be
stuck at waiting for the old machine object to be deleted. This is because
the finalizer cannot be removed from the machine object. This affects any
rolling update operation for node pools.
The symptom is that the gkectl command times out with the
following error message:
E0821 18:28:02.546121 61942 console.go:87] Exit with error:
E0821 18:28:02.546184 61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated
Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
In clusterapi-controller Pod logs, the errors are like
below:
$ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1
-c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG]
| grep "Error removing finalizer from machine object"
[...]
E0821 23:19:45.114993 1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again
The error can repeat for the same machine for several minutes even in
successful runs that don't hit this issue; most of the time it resolves
quickly, but in rare cases it can be stuck at this race
condition for several hours.
The issue is that the underlying VM is already deleted in vCenter, but
the corresponding machine object cannot be removed, which is stuck at the
finalizer removal due to very frequent updates from other controllers.
This can cause the gkectl command to timeout, but the
controller keeps reconciling the cluster so the upgrade or update process
eventually completes.
Workaround:
We have prepared several different mitigation options for this issue,
which depends on your environment and requirements.
If you encounter this issue and the upgrade or update still can't
complete after a long time,
contact
our support team for mitigations.
|
Installation, Upgrades and updates |
1.10.2, 1.11, 1.12, 1.13 |
gkectl prepare OS image validation preflight failure
gkectl prepare command failed with:
- Validation Category: OS Images
- [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.
The preflight checks of gkectl prepare included an
incorrect validation.
Workaround:
Run the same command with an additional flag
--skip-validation-os-images .
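For example, a sketch of the same gkectl prepare invocation with the flag added (adjust the config path for your environment):
gkectl prepare --config ADMIN_CLUSTER_CONFIG_FILE --skip-validation-os-images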
|
Installation |
1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13 |
vCenter URL with https:// or http:// prefix
may cause cluster startup failure
Admin cluster creation failed with:
Exit with error:
Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message:
Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid:
[data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]:
Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]
The URL is used as part of a Secret key, which doesn't
support "/" or ":".
Workaround:
Remove https:// or http:// prefix from the
vCenter.Address field in the admin cluster or user cluster
config yaml.
|
Installation, Upgrades and updates |
1.10, 1.11, 1.12, 1.13 |
gkectl prepare panic on util.CheckFileExists
gkectl prepare can panic with the following
stacktrace:
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xde0dfa]
goroutine 1 [running]:
gke-internal.googlesource.com/syllogi/cluster-management/pkg/util.CheckFileExists(0xc001602210, 0x2b, 0xc001602210, 0x2b) pkg/util/util.go:226 +0x9a
gke-internal.googlesource.com/syllogi/cluster-management/gkectl/pkg/config/util.SetCertsForPrivateRegistry(0xc000053d70, 0x10, 0xc000f06f00, 0x4b4, 0x1, 0xc00015b400)gkectl/pkg/config/util/utils.go:75 +0x85
...
The issue is that gkectl prepare created the private
registry certificate directory with the wrong permissions.
Workaround:
To fix this issue, please run the following commands on the admin
workstation:
sudo mkdir -p /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
sudo chmod 0755 /etc/docker/certs.d/PRIVATE_REGISTRY_ADDRESS
|
Upgrades and updates |
1.10, 1.11, 1.12, 1.13 |
gkectl repair admin-master and resumable admin upgrade do
not work together
After a failed admin cluster upgrade attempt, don't run gkectl
repair admin-master . Doing so may cause subsequent admin upgrade
attempts to fail with issues such as admin master power on failure or the
VM being inaccessible.
Workaround:
If you've already encountered this failure scenario,
contact support.
|
Upgrades and updates |
1.10, 1.11 |
Resumed admin cluster upgrade can lead to missing admin control plane
VM template
If the admin control plane machine isn't recreated after a resumed
admin cluster upgrade attempt, the admin control plane VM template is
deleted. The admin control plane VM template is the template of the admin
master that is used to recover the control plane machine with
gkectl
repair admin-master .
Workaround:
The admin control plane VM template will be regenerated during the next
admin cluster upgrade.
|
Operating system |
1.12, 1.13 |
cgroup v2 could affect workloads
In version 1.12.0, cgroup v2 (unified) is enabled by default for
Container Optimized OS (COS) nodes. This could potentially cause
instability for your workloads in a COS cluster.
Workaround:
We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are
using COS nodes, we recommend that you upgrade to version 1.12.1 as soon
as it is released.
|
Identity |
1.10, 1.11, 1.12, 1.13 |
ClientConfig custom resource
gkectl update reverts any manual changes that you have
made to the ClientConfig custom resource.
Workaround:
We strongly recommend that you back up the ClientConfig resource after
every manual change.
|
Installation |
1.10, 1.11, 1.12, 1.13 |
gkectl check-config validation fails: can't find F5
BIG-IP partitions
Validation fails because F5 BIG-IP partitions can't be found, even
though they exist.
An issue with the F5 BIG-IP API can cause validation to fail.
Workaround:
Try running gkectl check-config again.
|
Installation |
1.12 |
User cluster installation failed because of cert-manager/ca-injector's
leader election issue
You might see an installation failure due to
cert-manager-cainjector being in a crash loop when the apiserver/etcd
is slow:
# These are logs from `cert-manager-cainjector`, from the command
# `kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system logs
# cert-manager-cainjector-xxx`
I0923 16:19:27.911174 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
E0923 16:19:27.911110 1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core:
Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
I0923 16:19:27.911593 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
E0923 16:19:27.911629 1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
Workaround:
Run the following commands to mitigate the problem.
First scale down the monitoring-operator so it won't
revert the changes to the cert-manager Deployment:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system \
scale deployment monitoring-operator --replicas=0
Edit the cert-manager-cainjector Deployment to disable
leader election, because we only have one replica running. It isn't
required for a single replica:
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit \
-n kube-system deployment cert-manager-cainjector
The relevant YAML snippet for the cert-manager-cainjector
deployment should look like the following example:
...
apiVersion: apps/v1
kind: Deployment
metadata:
name: cert-manager-cainjector
namespace: kube-system
...
spec:
...
template:
...
spec:
...
containers:
- name: cert-manager
image: "gcr.io/gke-on-prem-staging/cert-manager-cainjector:v1.0.3-gke.0"
args:
...
- --leader-elect=false
...
Keep monitoring-operator replicas at 0 as a mitigation
until the installation is finished. Otherwise it will revert the change.
After the installation is finished and the cluster is up and running,
turn on the monitoring-operator for day-2 operations:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system \
scale deployment monitoring-operator --replicas=1
After each upgrade, the changes are reverted. Perform the same
steps again to mitigate the issue until this is fixed in a future
release.
|
VMware |
1.10, 1.11, 1.12, 1.13 |
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter (versions lower than 7.0U2) is restarted, after an
upgrade or otherwise, the network name in the VM information from vCenter is
incorrect, which results in the machine being in an Unavailable
state. This eventually leads to the nodes being auto-repaired to create
new ones.
Related govmomi
bug.
Workaround:
This workaround is provided by VMware support:
- The issue is fixed in vCenter versions 7.0U2 and above.
- For lower versions, right-click the host, and then select
Connection > Disconnect. Next, reconnect, which forces an update
of the VM's portgroup.
|
Operating system |
1.10, 1.11, 1.12, 1.13 |
SSH connection closed by remote host
For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS
images are hardened with
CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is
configured", /etc/ssh/sshd_config has the following
settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5
minutes of idle time. However, the ClientAliveCountMax 0
value causes unexpected behavior. When you use an SSH session on the
admin workstation or a cluster node, the SSH connection might be
disconnected even if your SSH client is not idle, such as when running a
time-consuming command, and your command could get terminated with the
following message:
Connection to [IP] closed by remote host.
Connection to [IP] closed.
Workaround:
You can either:
Make sure you reconnect your SSH session.
|
Installation |
1.10, 1.11, 1.12, 1.13 |
Conflicting cert-manager installation
In 1.13 releases, monitoring-operator will install
cert-manager in the cert-manager namespace. If, for certain
reasons, you need to install your own cert-manager, follow these
instructions to avoid conflicts:
You only need to apply this workaround once for each cluster, and the
changes are preserved across cluster upgrades.
Note: One common symptom of installing your own cert-manager
is that the cert-manager version or image (for example
v1.7.2) may revert back to its older version. This is caused by
monitoring-operator trying to reconcile the
cert-manager , and reverting the version in the process.
Workaround:
Avoid conflicts during upgrade
- Uninstall your version of
cert-manager . If you defined
your own resources, you may want to
backup
them.
- Perform the upgrade.
- Follow the instructions below to restore your own
cert-manager .
Restore your own cert-manager in user clusters
- Scale the
monitoring-operator Deployment to 0:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
-n USER_CLUSTER_NAME \
scale deployment monitoring-operator --replicas=0
- Scale the
cert-manager deployments managed by
monitoring-operator to 0:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager-cainjector\
--replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager-webhook --replicas=0
- Reinstall your version of
cert-manager .
Restore
your customized resources, if you have any.
- You can skip this step if you are using
upstream default cert-manager installation, or you are sure your
cert-manager is installed in the
cert-manager namespace.
Otherwise, copy the metrics-ca cert-manager.io/v1
Certificate and the metrics-pki.cluster.local Issuer
resources from cert-manager to the cluster resource
namespace of your installed cert-manager.
relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f1=$(mktemp)
f2=$(mktemp)
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
get issuer -n cert-manager metrics-pki.cluster.local -o json \
| jq "${relevant_fields}" > $f1
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
get certificate -n cert-manager metrics-ca -o json \
| jq "${relevant_fields}" > $f2
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2
Restore your own cert-manager in admin clusters
In general, you shouldn't need to re-install cert-manager in admin
clusters because admin clusters only run Anthos clusters on VMware control
plane workloads. In the rare cases that you also need to install your own
cert-manager in admin clusters, follow the instructions below
to avoid conflicts. Note that if you are an Apigee customer and you
only need cert-manager for Apigee, you do not need to run the admin
cluster commands.
- Scale the
monitoring-operator deployment to 0.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
-n kube-system scale deployment monitoring-operator --replicas=0
- Scale the
cert-manager deployments managed by
monitoring-operator to 0.
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager \
--replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager-cainjector \
--replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
-n cert-manager scale deployment cert-manager-webhook \
--replicas=0
- Reinstall your version of
cert-manager .
Restore
your customized resources, if you have any.
- You can skip this step if you are using
upstream default cert-manager installation, or you are sure your
cert-manager is installed in the
cert-manager namespace.
Otherwise, copy the metrics-ca cert-manager.io/v1
Certificate and the metrics-pki.cluster.local Issuer
resources from cert-manager to the cluster resource
namespace of your installed cert-manager.
relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f3=$(mktemp)
f4=$(mktemp)
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
get issuer -n cert-manager metrics-pki.cluster.local -o json \
| jq "${relevant_fields}" > $f3
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
get certificate -n cert-manager metrics-ca -o json \
| jq "${relevant_fields}" > $f4
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4
|
Operating system |
1.10, 1.11, 1.12, 1.13 |
False positives in docker, containerd, and runc vulnerability scanning
The Docker, containerd, and runc in the Ubuntu OS images shipped with
Anthos clusters on VMware are pinned to special versions using
Ubuntu PPA. This ensures
that any container runtime changes will be qualified by
Anthos clusters on VMware before each release.
However, the special versions are unknown to the
Ubuntu CVE
Tracker, which is used as the vulnerability feeds by various CVE
scanning tools. Therefore, you will see false positives in Docker,
containerd, and runc vulnerability scanning results.
For example, you might see the following false positives from your CVE
scanning results. These CVEs are already fixed in the latest patch
versions of Anthos clusters on VMware.
Refer to the release notes
for any CVE fixes.
Workaround:
Canonical is aware of this issue, and the fix is tracked at
https://github.com/canonical/sec-cvescan/issues/73.
|
Upgrades and updates |
1.10, 1.11, 1.12, 1.13 |
Network connection between admin and user cluster might be unavailable
for a short time during non-HA cluster upgrade
If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice
that kubectl exec , kubectl logs , and webhook requests
against user clusters might be unavailable for a short time. This downtime
can be up to one minute. This happens because the incoming requests
(kubectl exec, kubectl logs, and webhooks) are handled by the kube-apiserver of
the user cluster. The user kube-apiserver is a
StatefulSet. In a non-HA cluster, there is only one replica for the
StatefulSet, so during the upgrade there is a chance that the old
kube-apiserver is unavailable while the new kube-apiserver is not yet
ready.
Workaround:
This downtime only happens during the upgrade process. If you want
shorter downtime during upgrades, we recommend that you switch to
HA
clusters.
|
Installation, Upgrades and updates |
1.10, 1.11, 1.12, 1.13 |
Konnectivity readiness check failed in HA cluster diagnose after
cluster creation or upgrade
If you are creating or upgrading an HA cluster and notice that the konnectivity
readiness check failed in cluster diagnose, in most cases this does not
affect the functionality of Anthos clusters on VMware (kubectl exec, kubectl
logs, and webhooks). This happens because sometimes one or two of the
konnectivity replicas might be unready for a period of time due to
unstable networking or other issues.
Workaround:
Konnectivity recovers by itself. Wait 30 minutes to 1 hour
and rerun cluster diagnose.
|
Operating system |
1.7, 1.8, 1.9, 1.10, 1.11 |
/etc/cron.daily/aide CPU and memory spike issue
Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS
images are hardened with
CIS L1 Server
Benchmark.
As a result, the cron script /etc/cron.daily/aide has been
installed, which schedules an aide check to ensure
that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is
regularly checked" is followed.
The cron job runs daily at 6:25 AM UTC. Depending on the number of
files on the filesystem, you may experience CPU and memory usage spikes
around that time that are caused by this aide process.
Workaround:
If the spikes are affecting your workload, you can disable the daily
cron job:
sudo chmod -x /etc/cron.daily/aide
|
Networking |
1.10, 1.11, 1.12, 1.13 |
Load balancers and NSX-T stateful distributed firewall rules interact
unpredictably
When deploying Anthos clusters on VMware version 1.9 or later, when the
deployment has the Seesaw bundled load balancer in an environment that
uses NSX-T stateful distributed firewall rules,
stackdriver-operator might fail to create
gke-metrics-agent-conf ConfigMap and cause
gke-connect-agent Pods to be in a crash loop.
The underlying issue is that the stateful NSX-T distributed firewall
rules terminate the connection from a client to the user cluster API
server through the Seesaw load balancer because Seesaw uses asymmetric
connection flows. The integration issues with NSX-T distributed firewall
rules affect all Anthos clusters on VMware releases that use Seesaw. You
might see similar connection problems on your own applications when they
create large Kubernetes objects whose sizes are bigger than 32K.
Workaround:
Follow
these instructions to disable NSX-T distributed firewall rules, or to
use stateless distributed firewall rules for Seesaw VMs.
If your clusters use a manual load balancer, follow
these instructions to configure your load balancer to reset client
connections when it detects a backend node failure. Without this
configuration, clients of the Kubernetes API server might stop responding
for several minutes when a server instance goes down.
|
Logging and monitoring |
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16 |
Unexpected monitoring billing
For Anthos clusters on VMware versions 1.10 to latest, some customers have
found unexpectedly high billing for Metrics volume on the
Billing page. This issue affects you only when all of the
following circumstances apply:
- Application monitoring is enabled (
enableStackdriverForApplications=true )
- Managed Service for Prometheus is not enabled (
enableGMPForApplications is not set to true )
- Application Pods have the
prometheus.io/scrape=true
annotation. (Installing Anthos Service Mesh can also add this annotation.)
To confirm whether you are affected by this issue,
list your
user-defined metrics. If you see billing for unwanted metrics, then
this issue applies to you.
Workaround:
If you are affected by this issue, we recommend that you upgrade your
clusters to version 1.12 and switch to the new application monitoring
solution, managed-service-for-prometheus, which addresses this issue:
- Separate flags to control the collection of application logs versus application metrics
- Bundled Google Cloud Managed Service for Prometheus
If you can't upgrade to version 1.12, use the following steps:
- Find the source Pods and Services that have the unwanted billed metrics:
kubectl --kubeconfig KUBECONFIG \
get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
kubectl --kubeconfig KUBECONFIG get \
services -A -o yaml | grep 'prometheus.io/scrape: "true"'
- Remove the
prometheus.io/scrape=true annotation from the
Pod or Service, as shown in the example after this list. If the annotation is added by Anthos Service Mesh, consider
configuring Anthos Service Mesh without the Prometheus option,
or turning off the Istio Metrics Merging feature.
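The following sketch shows one way to remove the annotation with kubectl. NAMESPACE, POD_NAME, and SERVICE_NAME are placeholders for the objects found in the previous step, and the trailing dash removes the annotation. If a Pod is managed by a controller such as a Deployment, remove the annotation from the Pod template instead so that it is not re-added:
kubectl --kubeconfig KUBECONFIG --namespace NAMESPACE \
    annotate pod POD_NAME prometheus.io/scrape-
kubectl --kubeconfig KUBECONFIG --namespace NAMESPACE \
    annotate service SERVICE_NAME prometheus.io/scrape-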
|
Installation |
1.11, 1.12, 1.13 |
Installer fails when creating vSphere datadisk
The Anthos clusters on VMware installer can fail if custom roles are bound
at the wrong permissions level.
When the role binding is incorrect, creating a vSphere datadisk with
govc hangs and the disk is created with a size of 0. To
fix the issue, bind the custom role at the vSphere vCenter
level (root).
Workaround:
If you want to bind the custom role at the datacenter level (or lower than
root), you also need to bind the read-only role to the user at the root
vCenter level.
For more information on role creation, see
vCenter user account privileges.
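To check whether you hit this failure mode, you can list the contents of the datadisk folder and look for a zero-size file (an illustrative check; DATASTORE_NAME and DATA_DISK_FOLDER are placeholders):
# Lists files with their sizes; a datadisk created under this issue shows a size of 0.
govc datastore.ls -l -ds DATASTORE_NAME DATA_DISK_FOLDER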
|
Logging and monitoring |
1.9.0-1.9.4, 1.10.0-1.10.1 |
High network traffic to monitoring.googleapis.com
You might see high network traffic to
monitoring.googleapis.com , even in a new cluster that has no
user workloads.
This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This
issue is fixed in version 1.10.2 and 1.9.5.
Workaround:
Upgrade to version 1.10.2/1.9.5 or later.
To mitigate this issue for an earlier version:
- Scale down `stackdriver-operator`:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system \
scale deployment stackdriver-operator --replicas=0
Replace USER_CLUSTER_KUBECONFIG with the path of the user
cluster kubeconfig file.
- Open the
gke-metrics-agent-conf ConfigMap for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system \
edit configmap gke-metrics-agent-conf
- Increase the probe interval from 0.1 seconds to 13 seconds:
processors:
  disk_buffer/metrics:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-metrics
    probe_interval: 13s
    retention_size_mib: 6144
  disk_buffer/self:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-self
    probe_interval: 13s
    retention_size_mib: 200
  disk_buffer/uptime:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-uptime
    probe_interval: 13s
    retention_size_mib: 200
- Close the editing session.
- Change
gke-metrics-agent DaemonSet version to
1.1.0-anthos.8:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system set image daemonset/gke-metrics-agent \
gke-metrics-agent=gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8
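To confirm that the DaemonSet is now running the pinned image, you can run this optional check (not part of the original mitigation steps):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    --namespace kube-system get daemonset gke-metrics-agent \
    -o yaml | grep "gke-metrics-agent:1.1.0-anthos.8"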
|
Logging and monitoring |
1.10, 1.11 |
gke-metrics-agent has frequent CrashLoopBackOff errors
For Anthos clusters on VMware version 1.10 and above, the `gke-metrics-agent`
DaemonSet has frequent CrashLoopBackOff errors when
`enableStackdriverForApplications` is set to `true` in the `stackdriver`
object.
Workaround:
To mitigate this issue, disable application metrics collection by
running the following commands. These commands will not disable
application logs collection.
- To prevent the following changes from reverting, scale down
stackdriver-operator :
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system scale deploy stackdriver-operator \
--replicas=0
Replace USER_CLUSTER_KUBECONFIG with the path of the user
cluster kubeconfig file.
- Open the
gke-metrics-agent-conf ConfigMap for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system edit configmap gke-metrics-agent-conf
- Under
services.pipelines , comment out the entire
metrics/app-metrics section:
services:
  pipelines:
    #metrics/app-metrics:
    #  exporters:
    #  - googlecloud/app-metrics
    #  processors:
    #  - resource
    #  - metric_to_resource
    #  - infer_resource
    #  - disk_buffer/app-metrics
    #  receivers:
    #  - prometheus/app-metrics
    metrics/metrics:
      exporters:
      - googlecloud/metrics
      processors:
      - resource
      - metric_to_resource
      - infer_resource
      - disk_buffer/metrics
      receivers:
      - prometheus/metrics
- Close the editing session.
- Restart the
gke-metrics-agent DaemonSet:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system rollout restart daemonset gke-metrics-agent
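To confirm that the restart completed and the Pods are healthy again, you can run the following optional checks (not part of the original steps):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    --namespace kube-system rollout status daemonset gke-metrics-agent
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
    --namespace kube-system get pods | grep gke-metrics-agent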
|
Logging and monitoring |
1.11, 1.12, 1.13 |
Replace deprecated metrics in dashboard
If deprecated metrics are used in your out-of-the-box (OOTB) dashboards, you will see
some empty charts. To find deprecated metrics in the Monitoring
dashboards, run the following commands:
gcloud monitoring dashboards list > all-dashboard.json
# find deprecated metrics
cat all-dashboard.json | grep -E \
'kube_daemonset_updated_number_scheduled\
|kube_node_status_allocatable_cpu_cores\
|kube_node_status_allocatable_pods\
|kube_node_status_capacity_cpu_cores'
The following deprecated metrics should be migrated to their
replacements.
Deprecated | Replacement |
kube_daemonset_updated_number_scheduled |
kube_daemonset_status_updated_number_scheduled |
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
kube_node_status_allocatable_pods |
kube_node_status_allocatable |
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_capacity_pods |
kube_node_status_capacity |
kube_hpa_status_current_replicas |
kube_horizontalpodautoscaler_status_current_replicas |
Workaround:
To replace the deprecated metrics:
- Delete "GKE on-prem node status" in the Google Cloud Monitoring
dashboard. Reinstall "GKE on-prem node status" following
these instructions.
- Delete "GKE on-prem node utilization" in the Google Cloud Monitoring
dashboard. Reinstall "GKE on-prem node utilization" following
these instructions.
- Delete "GKE on-prem vSphere vm health" in the Google Cloud
Monitoring dashboard. Reinstall "GKE on-prem vSphere vm health"
following
these instructions.
This deprecation is due to the upgrade of
kube-state-metrics agent from v1.9 to v2.4, which is required for
Kubernetes 1.22. You can replace all deprecated
kube-state-metrics metrics, which have the prefix
kube_ , in your custom dashboards or alerting policies.
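You can run a similar search against your alerting policies (a sketch using the gcloud alpha monitoring surface; adjust the metric list as needed):
gcloud alpha monitoring policies list --format=json > all-policies.json
# Look for deprecated kube-state-metrics metric names in alerting policies.
grep -E 'kube_daemonset_updated_number_scheduled|kube_node_status_allocatable_|kube_node_status_capacity_|kube_hpa_status_current_replicas' all-policies.json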
|
Logging and monitoring |
1.10, 1.11, 1.12, 1.13 |
Unknown metric data in Cloud Monitoring
For Anthos clusters on VMware version 1.10 and above, the data for
clusters in Cloud Monitoring may contain irrelevant summary metrics
entries such as the following:
Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile
Other metric types that might have irrelevant summary metrics
include:
apiserver_admission_step_admission_duration_seconds_summary
go_gc_duration_seconds
scheduler_scheduling_duration_seconds
gkeconnect_http_request_duration_seconds_summary
alertmanager_nflog_snapshot_duration_seconds_summary
While these summary type metrics are in the metrics list, they are not
supported by gke-metrics-agent at this time.
|
Logging and monitoring |
1.10, 1.11, 1.12, 1.13 |
Missing metrics on some nodes
You might find that the following metrics are missing on some, but not
all, nodes:
kubernetes.io/anthos/container_memory_working_set_bytes
kubernetes.io/anthos/container_cpu_usage_seconds_total
kubernetes.io/anthos/container_network_receive_bytes_total
Workaround:
To fix this issue, perform the following steps as a workaround. For
versions 1.9.5+, 1.10.2+, and 1.11.0+, increase the CPU for gke-metrics-agent
by following steps 1 through 4.
- Open your
stackdriver resource for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system edit stackdriver stackdriver
- To increase the CPU request for
gke-metrics-agent from
10m to 50m and the CPU limit from 100m
to 200m , add the following resourceAttrOverride
section to the stackdriver manifest:
spec:
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 200m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
Your edited resource should look similar to the following:
spec:
  anthosDistribution: on-prem
  clusterLocation: us-west1-a
  clusterName: my-cluster
  enableStackdriverForApplications: true
  gcpServiceAccountSecretName: ...
  optimizedMetrics: true
  portable: true
  projectID: my-project-191923
  proxyConfigSecretName: ...
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 200m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
- Save your changes and close the text editor.
- To verify your changes have taken effect, run the following command:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
--namespace kube-system get daemonset gke-metrics-agent -o yaml \
| grep "cpu: 50m"
The command finds cpu: 50m if your edits have taken effect.
|
Logging and monitoring |
1.11.0-1.11.2, 1.12.0 |
Missing scheduler and controller-manager metrics in admin cluster
If your admin cluster is affected by this issue, scheduler and
controller-manager metrics are missing. For example, these two metrics are
missing:
# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
Upgrade to v1.11.3+, v1.12.1+, or v1.13+.
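For example, a minimal admin cluster upgrade invocation looks like the following (file paths are placeholders):
gkectl upgrade admin \
    --config ADMIN_CLUSTER_CONFIG_FILE \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG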
|
Logging and monitoring |
1.11.0-1.11.2, 1.12.0 |
Missing scheduler and controller-manager metrics in user cluster
If your user cluster is affected by this issue, scheduler and
controller-manager metrics are missing. For example, these two metrics are
missing:
# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
This issue is fixed in Anthos clusters on VMware version 1.13.0 and later.
Upgrade your cluster to a version with the fix.
|
Installation, Upgrades and updates |
1.10, 1.11, 1.12, 1.13 |
Failure to register admin cluster during creation
If you create an admin cluster for version 1.9.x or 1.10.0, and if the
admin cluster fails to register with the provided gkeConnect
spec during its creation, you will get the following error:
Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
You will still be able to use this admin cluster, but you will get the
following error if you later attempt to upgrade the admin cluster to
version 1.10.y.
failed to migrate to first admin trust chain: failed to parse current version "": invalid version: "" failed to migra |