This page lists all known issues for Google Distributed Cloud on VMware. This page is
for IT administrators and operators who manage the lifecycle of the
underlying tech infrastructure, and respond to alerts and pages when service
level objectives (SLOs) aren't met or applications fail. To learn more about
common roles and example tasks that we reference in Google Cloud content, see
Common GKE Enterprise user roles and tasks.
Bundled ingress is not compatible with gateway.networking.k8s.io resources
The Istiod Pods for bundled ingress can't become ready if gateway.networking.k8s.io
resources are installed in the user cluster. The following example error message can be found
in the Pod logs:
failed to list *v1beta1.Gateway: gateways.gateway.networking.k8s.io is forbidden: User "system:serviceaccount:gke-system:istiod-service-account" cannot list resource "gateways" in API group "gateway.networking.k8s.io" at the cluster scope
Workaround:
Apply the following ClusterRole and ClusterRoleBinding to your user cluster:
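The exact manifest depends on your cluster, but a minimal sketch that grants the istiod service account read access to the Gateway API resources named in the error looks like the following (the ClusterRole and ClusterRoleBinding names are illustrative):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: istiod-gateway-api-reader
rules:
- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: istiod-gateway-api-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: istiod-gateway-api-reader
subjects:
- kind: ServiceAccount
  name: istiod-service-account
  namespace: gke-system
Apply the file with kubectl apply -f FILE --kubeconfig USER_CLUSTER_KUBECONFIG.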
Admin cluster control plane nodes keep rebooting after running gkectl create admin
If hostnames in the
ipblocks
field contain uppercase letters, the admin cluster control plane nodes might
reboot over and over.
Workaround:
Use lowercase hostnames only.
Installation, Upgrades
1.30.0-1.30.500, 1.31.0-1.31.100
runtime: out of memory error after running gkeadm create or upgrade
When you create or upgrade an admin workstation with gkeadm
commands, you might get an out of memory (OOM) error when the downloaded OS image is verified.
For example,
Downloading OS image
"gs://gke-on-prem-release/admin-appliance/1.30.400-gke.133/gke-on-prem-admin-appliance-vsphere-1.30.400-gke.133.ova"...
[==================================================>] 10.7GB/10.7GB
Image saved to
/anthos/gke-on-prem-admin-appliance-vsphere-1.30.400-gke.133.ova
Verifying image gke-on-prem-admin-appliance-vsphere-1.30.400-gke.133.ova...
|
runtime: out of memory
Workaround:
Increase the memory available to the OS on the machine where you run the gkeadm command.
Upgrades
1.30.0-1.30.400
Non-HA admin cluster upgrade stuck at Creating or updating cluster control plane workloads
When upgrading a non-HA admin cluster, the upgrade might get stuck at the
Creating or updating cluster control plane workloads stage.
This issue happens if, on the admin master VM, ip a | grep cali returns a non-empty result.
For example,
ubuntu@abf8a975479b-qual-342-0afd1d9c ~ $ ip a | grep cali
4: cali2251589245f@if3: mtu 1500 qdisc noqueue state UP group default
This field is not needed and can
be safely removed from the configuration file.
Migration
1.29.0-1.29.800, 1.30.0-1.30.400, 1.31.0
Admin add-on nodes stuck at NotReady during non-HA to HA
admin cluster migration
When migrating a non-HA admin cluster that uses MetalLB to HA, admin
add-on nodes might get stuck at a NotReady status, preventing
the migration from completing.
This issue only affects admin clusters configured with MetalLB, where
auto-repair isn't enabled.
This issue is caused by a race condition during migration where MetalLB
speakers are still using the old metallb-memberlist secret. As
a result of the race condition, the old control plane VIP becomes
inaccessible, which causes the migration to stall.
Cluster backup for non-HA admin cluster fails due to long datastore and datadisk names
When attempting to back up a non-HA admin cluster, the backup fails because the combined length of the datastore and datadisk names exceeds the maximum character length.
The maximum character length for a datastore name is 80. The backup path for a non-HA admin cluster has the naming syntax "__". So if the concatenated name exceeds the maximum length, backup folder creation fails.
Workaround:
Rename the datastore or datadisk to a shorter name. Ensure that the combined length of the datastore and datadisk names does not exceed the maximum character length.
Upgrades
1.28, 1.29, 1.30
HA admin control plane node shows older version after running gkectl repair admin-master
After running the gkectl repair admin-master command, an
admin control plane node might show an older version than the expected version.
This issue occurs because the backed-up VM template used for the HA
admin control plane node repair isn't refreshed in vCenter after an
upgrade: the VM template isn't re-cloned during machine creation if the
machine name remains unchanged.
Workaround:
Find out the machine name that is using the older Kubernetes version:
kubectl get machine -o wide --kubeconfig=ADMIN_KUBECONFIG
Remove the onprem.cluster.gke.io/prevented-deletion
annotation:
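One way to remove the annotation (a sketch; substitute the machine name found in the previous step):
kubectl annotate machine MACHINE_NAME onprem.cluster.gke.io/prevented-deletion- --kubeconfig=ADMIN_KUBECONFIG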
A new machine will be created with the correct version.
Configuration
1.30.0
Terraform removes vCenter fields unexpectedly
When updating a user cluster or nodepool using Terraform, Terraform
might attempt to set vCenter fields to empty values.
This issue can occur if the cluster wasn't originally created using
Terraform.
Workaround:
To prevent the unexpected update, confirm that the update is safe before
running terraform apply, as described in the following steps:
Run terraform plan.
In the output, check whether any vCenter fields are set to
nil.
If any vCenter field is set to an empty value,
add vcenter to the ignore_changes list in the
Terraform configuration, following
the Terraform documentation. This prevents updates to these fields.
Run terraform plan again and check the output to confirm that
the update is as expected.
Updates
1.13, 1.14, 1.15, 1.16
User cluster control plane nodes always get rebooted during the first admin cluster update operation
After the control plane nodes of kubeception user clusters are created, updated, or upgraded, they are rebooted one by one during the first admin cluster operation after the admin cluster is created at, or upgraded to, one of the affected versions. For kubeception clusters with 3 control plane nodes, this shouldn't cause control plane downtime; the only impact is that the admin cluster operation takes longer.
Installation, Upgrades and updates
1.31
Errors creating custom resources
In version 1.31 of Google Distributed Cloud, you might get errors when
you try to create custom resources, such as clusters (all types) and
workloads. The issue is caused by a breaking change introduced in
Kubernetes 1.31 that prevents the caBundle field in a custom
resource definition from transitioning from a valid to an invalid state.
For more information about the change, see the
Kubernetes 1.31 changelog.
Prior to Kubernetes 1.31, the caBundle field was often set
to a makeshift value of \n, because in earlier Kubernetes
versions the API server didn't allow empty CA bundle content. Using
\n was a reasonable workaround to avoid confusion, as the
cert-manager typically updates the caBundle
later.
If the caBundle has been patched once from an invalid to a
valid state, there shouldn't be issues. However, if the custom resource
definition is reconciled back to \n (or another invalid
value), you might encounter the following error:
...Invalid value: []byte{0x5c, 0x6e}: unable to load root certificates: unable to parse bytes as PEM block]
Workaround
If you have a custom resource definition in which caBundle
is set to an invalid value, you can safely remove the caBundle
field entirely. This should resolve the issue.
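For example, if the caBundle is set in the custom resource definition's conversion webhook configuration (an assumption; the field may sit elsewhere in your definition), a hypothetical removal could look like this:
kubectl patch crd CRD_NAME --type=json \
  -p='[{"op":"remove","path":"/spec/conversion/webhook/clientConfig/caBundle"}]'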
OS
1.31
cloud-init status always returns error
When upgrading a cluster that uses a Container-Optimized OS (COS)
image to 1.31, the cloud-init status command fails even though
cloud-init finished without errors.
Workaround:
Run the following command to check the status of cloud-init:
systemctl show -p Result cloud-final.service
If the output is similar to the following, then cloud-init finished
successfully:
Result=success
Upgrades
1.28
Admin workstation preflight check fails when upgrading to 1.28 with
disk size less than 100 GB
When upgrading a cluster to 1.28, the gkectl prepare
command fails while running admin workstation preflight checks if the
admin workstation disk size is less than 100 GB. In this case, the
command displays an error message similar to the following:
Workstation Hardware: Workstation hardware requirements are not satisfied
In 1.28, the admin workstation disk size prerequisite was increased
from 50 GB to 100 GB.
The gkectl upgrade command returns an incorrect error
about the netapp storageclass. The error message is similar to the following:
detected unsupported drivers:
csi.trident.netapp.io
Workaround:
Run gkectl upgrade with the `--skip-pre-upgrade-checks` flag.
Identity
all versions
Invalid CA certificate after cluster CA rotation in ClientConfig
prevents cluster authentication
After you rotate the certificate authority (CA) certificates on a user
cluster, the spec.certificateAuthorityData field in the
ClientConfig contains an invalid CA certificate, which prevents
authentication to the cluster.
Workaround:
Before the next gcloud CLI authentication, manually update the
spec.certificateAuthorityData field in the
ClientConfig with the correct CA certificate.
Copy the cluster CA certificate from the
certificate-authority-data field in the admin cluster
kubeconfig.
Edit the ClientConfig and paste the CA certificate in the
spec.certificateAuthorityData field.
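For example (a sketch, assuming the ClientConfig object is named default in the kube-public namespace; this may differ in your cluster):
# Print the CA certificate from the admin cluster kubeconfig.
grep certificate-authority-data ADMIN_CLUSTER_KUBECONFIG
# Edit the ClientConfig and paste the value into spec.certificateAuthorityData.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit clientconfig default -n kube-public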
Preflight check fails when disabling bundled ingress
When you disable bundled ingress by removing the
loadBalancer.vips.ingressVIP field in the cluster
configuration file, a bug in the MetalLB preflight check causes the cluster
update to fail with the "invalid user ingress vip: invalid IP" error
message.
Workaround:
Ignore the error message. Skip the preflight check using one of the
following methods:
Add the --skip-validation-load-balancer flag to the
gkectl update cluster command.
Annotate the onpremusercluster object with
onprem.cluster.gke.io/server-side-preflight-skip: skip-validation-load-balancer.
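One possible form of the annotate command (a sketch; it assumes the OnPremUserCluster object lives in the USER_CLUSTER_NAME-gke-onprem-mgmt namespace of the admin cluster):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME-gke-onprem-mgmt \
  annotate onpremusercluster USER_CLUSTER_NAME \
  onprem.cluster.gke.io/server-side-preflight-skip=skip-validation-load-balancer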
VMware, Upgrades
1.16
Cluster upgrade fails due to missing anti-affinity group rule in vCenter
During a cluster upgrade, the machine objects may get stuck in the `Creating` phase and fail to link to the node objects due to a missing anti-affinity group (AAG) rule in vCenter.
If you describe the problematic machine objects, you can see recurring messages like "Reconfigure DRS rule task "task-xxxx" complete".
Workaround:
Disable the anti-affinity group setting in both the admin cluster configuration file and the user cluster configuration file, then trigger the force update command to unblock the cluster upgrade:
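A sketch of the configuration change and the standard update commands (the exact flag for forcing the update, if any, may differ by version):
# In both admin-cluster.yaml and user-cluster.yaml:
antiAffinityGroups:
  enabled: false
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG_FILE
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG_FILE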
Migrating a user cluster to Controlplane V2 fails if secrets encryption has ever been enabled
When migrating a user cluster to Controlplane V2, if
always-on secrets encryption has ever been enabled, the migration
process fails to properly handle the secret encryption key. Because of this
issue, the new Controlplane V2 cluster is unable to decrypt secrets. If the
output of the following command isn't empty, then always-on secrets
encryption has been enabled at some point and the cluster is affected by
this issue:
Migrating an admin cluster from non-HA to HA fails if secrets encryption is enabled
If the admin cluster enabled always-on secrets encryption at 1.14 or earlier, and was upgraded all the way from those versions to the affected 1.29 and 1.30 versions, then when migrating the admin cluster from non-HA to HA, the migration process fails to properly handle the secret encryption key. Because of this issue, the new HA admin cluster is unable to decrypt secrets.
To check whether the cluster could be using the old-format key:
credential.yaml regenerated incorrectly during admin
workstation upgrade
When upgrading the admin workstation using the gkeadm upgrade
admin-workstation command, the credential.yaml file
is regenerated incorrectly. The username and password fields are empty.
Additionally, the privateRegistry key contains a typo.
The same misspelling of the privateRegistry key is also in
the admin-cluster.yaml file. Since the
credential.yaml file is regenerated during the admin cluster
upgrade process, the typo is present even if you corrected it previously.
Workaround:
Update the private registry key name in credential.yaml to
match the privateRegistry.credentials.fileRef.entry in the
admin-cluster.yaml.
Update the private registry username and password in the
credential.yaml.
Upgrades
1.16+
User cluster upgrade fails due to pre-upgrade reconcile timeout
When upgrading a user cluster, the pre-upgrade reconcile operation might
take longer than the defined timeout, resulting in an upgrade failure.
The error message looks like the following:
Failed to reconcile the user cluster before upgrade: the pre-upgrade reconcile failed, error message:
failed to wait for reconcile to complete: error: timed out waiting for the condition,
message: Cluster reconcile hasn't finished yet, please fix that before
rerun the upgrade.
The timeout for the pre-upgrade reconcile operation is 5 minutes plus 1 minute per node pool in the user cluster.
Workaround:
Ensure that the
gkectl diagnose cluster
command passes without errors. Skip the pre-upgrade reconcile operation by adding the --skip-reconcile-before-preflight flag to the gkectl upgrade cluster command. For example:
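For example (a sketch; adjust the kubeconfig and config file paths to your environment):
gkectl upgrade cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --config USER_CLUSTER_CONFIG_FILE --skip-reconcile-before-preflight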
When you update the user cluster
dataplaneV2.forwardMode
field using gkectl update cluster, the change is only updated
in the ConfigMap. The anetd DaemonSet won't pick up the config change until it's restarted, so your changes aren't applied.
Workaround:
When the gkectl update cluster command is done, you see
output of Done updating the user cluster. After you see that
message, run the following command to restart the anetd
DaemonSet to pick up the config change:
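A sketch of the restart (it assumes anetd runs as a DaemonSet in the kube-system namespace):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system rollout restart daemonset anetd
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get daemonset anetd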
In the output of the preceding command, verify that the number in the DESIRED column matches the number in the READY column.
Upgrades
1.16
etcdctl command not found during cluster upgrade at the admin cluster backup stage
During a 1.16 to 1.28 user cluster upgrade, the admin cluster is backed
up. The admin cluster backup process displays the error message
"etcdctl: command not found". The user cluster upgrade succeeds, and the
admin cluster remains in a healthy state. The only issue is that the
metadata file on the admin cluster isn't backed up.
The cause of the issue is that the etcdctl binary
was added in 1.28, and isn't available on 1.16 nodes.
The admin cluster backup involves several steps, including taking an etcd
snapshot and then writing the metadata file for the admin cluster.
The etcd backup still succeeds because etcdctl can still be
triggered after an exec into the etcd Pod. But writing the metadata file
fails as it still relies on the etcdctl binary to be
installed on the node. However, the metadata file backup isn't a blocker
for taking a backup, so the backup process still succeeds, as does the
user cluster upgrade.
Workaround:
If you want to take a backup of the metadata file, follow
Back
up and restore an admin cluster with gkectl to trigger a separate
admin cluster backup using the version of gkectl that matches
the version of your admin cluster.
Installation
1.16-1.29
User cluster creation failure with manual load balancing
When creating a user cluster configured for manual load balancing, a
gkectl check-config failure occurs indicating that the
ingressHTTPNodePort value must be at least 30000, even when
bundled ingress is disabled.
This issue occurs regardless of whether the ingressHTTPNodePort
and ingressHTTPSNodePort fields are set or left blank.
Workaround:
To work around this issue, ignore the result returned by
gkectl check-config. To disable bundled ingress, see
Disable bundled ingress.
Updates
1.29.0
After migrating a user cluster to Controlplane V2, the admin cluster
metrics-server PDB might become incorrectly configured
The issue with the PodDisruptionBudget (PDB) occurs when
using high availability (HA) admin clusters, and there is 0 or 1 admin
cluster node without a role after the migration. To check if there are node
objects without a role, run the following command:
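One way to check (the ROLES column shows <none> for nodes without a role):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes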
If the cluster is affected, the gkectl diagnose cluster output includes an error similar to the following:
Checking all poddisruptionbudgets...FAILURE
Reason: 1 pod disruption budget error(s).
Unhealthy Resources:
PodDisruptionBudget metrics-server: gke-managed-metrics-server/metrics-server might be configured incorrectly: the total replicas(1) should be larger than spec.MinAvailable(1).
Binary Authorization webhook blocks the CNI plugin from starting, causing a node pool to fail to come up
Under rare race conditions, an incorrect installation sequence of the Binary Authorization webhook and the gke-connect Pod can cause user cluster creation to stall because a node fails to reach a ready state. If this occurs, the following message is displayed:
Node pool is not ready: ready condition is not true: CreateOrUpdateNodePool: 2/3 replicas are ready
To unblock an unhealthy node during the current cluster creation process, temporarily remove the Binary Authorization webhook configuration in user cluster using the following command.
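A sketch of the removal command (the webhook configuration name matches the manifest shown below):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete validatingwebhookconfiguration binauthz-validating-webhook-configuration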
Once the bootstrap process is complete, you can re-add the following webhook configuration.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: binauthz-validating-webhook-configuration
webhooks:
- name: "binaryauthorization.googleapis.com"
  namespaceSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist
  objectSelector:
    matchExpressions:
    - key: "image-policy.k8s.io/break-glass"
      operator: NotIn
      values: ["true"]
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
    - pods/ephemeralcontainers
  admissionReviewVersions:
  - "v1beta1"
  clientConfig:
    service:
      name: binauthz
      namespace: binauthz-system
      path: /binauthz
    # CA Bundle will be updated by the cert rotator.
    caBundle: Cg==
  timeoutSeconds: 10
  # Fail Open
  failurePolicy: "Ignore"
  sideEffects: None
Upgrades
1.16, 1.28, 1.29
CPV2 user cluster upgrade stuck due to mirrored machine with deletionTimestamp
During a user cluster upgrade, the upgrade operation might get stuck
if the mirrored machine object in the user cluster contains a
deletionTimestamp. The following error message is displayed
if the upgrade is stuck:
machine is still in the process of being drained and subsequently removed
This issue can occur if you previously attempted to repair the user
control plane node by running gkectl delete machine against the
mirrored machine in the user cluster.
Workaround:
Get the mirrored machine object and save it to a local file for backup
purposes.
Run the following command to delete the finalizer from the mirrored
machine and wait for it to be deleted from the user cluster.
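A sketch of the commands (an assumption: the mirrored machine is in the default namespace of the user cluster; adjust the namespace if needed):
# Back up the mirrored machine object.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get machine MACHINE_NAME -o yaml > machine-backup.yaml
# Remove the finalizers so that the machine can be deleted.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG patch machine MACHINE_NAME --type=json \
  -p='[{"op":"remove","path":"/metadata/finalizers"}]'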
Follow the steps in
Controlplane V2 user cluster control plane node to trigger node repair
on the control plane nodes, so that the correct source machine spec will be
re-synced into the user cluster.
Rerun gkectl upgrade cluster to resume the upgrade.
Configuration, Installation
1.15, 1.16, 1.28, 1.29
Cluster creation failure due to control plane VIP in different subnet
For HA admin cluster or ControlPlane V2 user cluster, the control plane
VIP needs to be in the same subnet as other cluster nodes. Otherwise, cluster
creation fails because kubelet can't communicate with the API server using
the control plane VIP.
Workaround:
Before cluster creation, ensure that the control plane VIP is configured
in the same subnet as the other cluster nodes.
Installation, Upgrades, Updates
1.29.0 - 1.29.100
Cluster creation or upgrade failure due to non-FQDN vCenter username
Cluster creation or upgrade fails with an error in the vSphere CSI Pods indicating that the vCenter username is invalid. This occurs because the username used is not a fully qualified domain name. The error message in the vsphere-csi-controller Pod is similar to the following:
GetCnsconfig failed with err: username is invalid, make sure it is a fully qualified domain username
This issue only occurs in version 1.29 and later, as a validation was added to the vSphere CSI driver to enforce the use of fully qualified domain usernames.
Workaround:
Use a fully qualified domain name for the vCenter username in the credentials configuration file. For example, instead of using "username1", use "username1@example.com".
Upgrades, Updates
1.28.0 - 1.28.500
Admin cluster upgrade fails for clusters created on versions 1.10 or
earlier
When upgrading an admin cluster from 1.16 to 1.28, the bootstrap of the
new admin master machine might fail to generate the control-plane
certificate. The issue is caused by changes in how certificates are
generated for the Kubernetes API server in version 1.28 and later. The
issue reproduces for clusters created on versions 1.10 and earlier that
have been upgraded all the way to 1.16 and the leaf certificate was not
rotated before the upgrade.
To determine if the admin cluster upgrade failure is caused by this
issue, do the following steps:
Connect to the failed admin master machine by using SSH.
Open /var/log/startup.log and search for an error like the
following:
Error adding extensions from section apiserver_ext
801B3213B57F0000:error:1100007B:X509 V3 routines:v2i_AUTHORITY_KEYID:unable to get issuer keyid:../crypto/x509/v3_akid.c:177:
801B3213B57F0000:error:11000080:X509 V3 routines:X509V3_EXT_nconf_int:error in extension:../crypto/x509/v3_conf.c:48:section=apiserver_ext, name=authorityKeyIdentifier, value=keyid>
Make a copy of /etc/startup/pki-yaml.sh and name it /etc/startup/pki-yaml-copy.sh.
Edit /etc/startup/pki-yaml-copy.sh. Find
authorityKeyIdentifier=keyidset and change it to
authorityKeyIdentifier=keyid,issuer in the sections for
the following extensions:
apiserver_ext, client_ext,
etcd_server_ext, and kubelet_server_ext. For
example:
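A sketch of what the edited line might look like in one of those sections (an assumption: the file embeds openssl-style extension sections; the surrounding contents of pki-yaml-copy.sh on your machine may differ):
[apiserver_ext]
authorityKeyIdentifier=keyid,issuer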
Save the changes to /etc/startup/pki-yaml-copy.sh.
Using a text editor, open /opt/bin/master.sh, find and replace all occurrences of /etc/startup/pki-yaml.sh with /etc/startup/pki-yaml-copy.sh, then save the changes.
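For example, one way to do the replacement from the command line instead of a text editor:
sed -i 's|/etc/startup/pki-yaml.sh|/etc/startup/pki-yaml-copy.sh|g' /opt/bin/master.sh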
Run /opt/bin/master.sh to generate the certificate and
complete the machine startup.
Run the gkectl upgrade admin command again to upgrade the admin
cluster.
After the upgrade completes, rotate the leaf certificate for both admin
and user clusters, as described in Start the rotation.
After the certificate rotation completes, make the same edits to
/etc/startup/pki-yaml-copy.sh as you did previously, and run
/opt/bin/master.sh.
Configuration
1.29.0
Incorrect warning message for clusters with Dataplane V2 enabled
The following incorrect warning message is output when you run
gkectl to create, update, or upgrade a cluster that already has
Dataplane V2 enabled:
WARNING: Your user cluster is currently running our original architecture with
[DataPlaneV1(calico)]. To enable new and advanced features we strongly recommend
to update it to the newer architecture with [DataPlaneV2] once our migration
tool is available.
There's a bug in gkectl which causes it to always show this warning as
long as the dataplaneV2.forwardMode is not being used, even if
you already have set enableDataplaneV2: true in your cluster
configuration file.
Workaround:
You can safely ignore this warning.
Configuration
1.28.0-1.28.400, 1.29.0
HA admin cluster installation preflight check reports wrong number of
required static IPs
When you create an HA admin cluster, the preflight check displays the
following incorrect error message:
- Validation Category: Network Configuration
- [FAILURE] CIDR, VIP and static IP (availability and overlapping): needed
at least X+1 IP addresses for admin cluster with X nodes
The requirement is incorrect for 1.28 and higher HA admin clusters
because they no longer have add-on nodes. Additionally, because the 3
admin cluster control plane node IPs are specified in the
network.controlPlaneIPBlock section in the admin cluster
configuration file, the IPs in the IP block file are only needed for
kubeception user cluster control plane nodes.
Workaround:
To skip the incorrect preflight check in a non-fixed release, add --skip-validation-net-config to the gkectl
command.
Operation
1.29.0-1.29.100
Connect Agent loses connection to Google Cloud after non-HA to HA
admin cluster migration
If you migrated
from a non-HA admin cluster to an HA admin cluster, the Connect Agent
in the admin cluster loses the connection to
gkeconnect.googleapis.com with the error "Failed to verify JWT
signature". This is because during the migration, the KSA signing key is
changed, thus a re-registration is needed to refresh the OIDC JWKs.
Workaround:
To reconnect the admin cluster to Google Cloud, do the following steps
to trigger a re-registration:
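One possible way to trigger the re-registration (an assumption: the Connect Agent Deployment runs in the gke-connect namespace of the admin cluster) is to delete the existing deployment so that the controller re-creates and re-registers it:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n gke-connect delete deployment --all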
The idea is that the onprem-admin-cluster-controller will
always redeploy the gke-connect deployment and re-register
the cluster if it finds no existing gke-connect deployment
available.
After the workaround (it may take a few minutes for the controller to
finish the reconcile), you can verify that the "Failed to
verify JWT signature" 400 error is gone from the gke-connect-agent logs:
Docker bridge IP uses 172.17.0.1/16 for COS cluster control plane nodes
Google Distributed Cloud specifies a dedicated subnet,
--bip=169.254.123.1/24, for the Docker bridge IP in the
Docker configuration to prevent reserving the default
172.17.0.1/16 subnet. However, in version 1.28.0-1.28.500 and
1.29.0, the Docker service wasn't restarted after Google Distributed Cloud
customized the Docker configuration because of a regression in the COS OS
image. As a result, Docker picks the default 172.17.0.1/16 as
its bridge IP address subnet. This might cause an IP address conflict if you already
have a workload running within that IP address range.
Workaround:
To work around this issue, you must restart the docker service:
sudo systemctl restart docker
Verify that Docker picks the correct bridge IP address:
ip a | grep docker0
This solution does not persist across VM re-creations. You must reapply
this workaround whenever VMs are re-created.
update
1.28.0-1.28.400, 1.29.0-1.29.100
Using multiple network interfaces with standard CNI does not work
The standard CNI binaries bridge, ipvlan, vlan, macvlan, dhcp, tuning,
host-local, ptp, portmap are not included in the OS images in the affected
versions. These CNI binaries are not used by data plane v2, but can be used
for additional network interfaces in the multiple network interface feature.
Multiple network interfaces with these CNI plugins won't work.
Workaround:
Upgrade to the version with the fix if you are using this feature.
update
1.15, 1.16, 1.28
Netapp trident dependencies interfere with vSphere CSI driver
Installing multipathd on cluster nodes interferes with the vSphere CSI driver, resulting in user workloads being unable to start.
Workaround:
Disable multipathd
Updates
1.15, 1.16
The admin cluster webhook might block updates when you
add required configurations
If some required configurations are empty in the admin cluster
because validations were skipped, adding them might be blocked by the admin
cluster webhook. For example, if the gkeConnect field wasn't
set in an existing admin cluster, adding it with the
gkectl update admin command might get the following error
message:
admission webhook "vonpremadmincluster.onprem.cluster.gke.io" denied the request: connect: Required value: GKE connect is required for user clusters
Workaround:
For 1.15 admin clusters, run the gkectl update admin command with the --disable-admin-cluster-webhook flag. For example:
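A sketch (the kubeconfig and config file paths are placeholders):
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --config ADMIN_CLUSTER_CONFIG_FILE --disable-admin-cluster-webhook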
controlPlaneNodePort field defaults to 30968 when
manualLB spec is empty
If you will be using a manual load balancer
(loadBalancer.kind is set to "ManualLB"),
you shouldn't need to configure the loadBalancer.manualLB
section in the configuration file for a high availability (HA) admin
cluster in versions 1.16 and higher. But when this section is empty,
Google Distributed Cloud assigns default values to all NodePorts including
manualLB.controlPlaneNodePort, which causes cluster
creation to fail with the following error message:
Storage policy field is missing in the admin cluster configuration template
SPBM in admin clusters is supported in 1.28.0 and later versions. But the field
vCenter.storagePolicyName is missing in the configuration file template.
Workaround:
Add the `vCenter.storagePolicyName` field in your admin cluster configuration file if
you want to configure the storage policy for the admin cluster, and then follow the instructions for configuring a storage policy.
Logging and monitoring
1.28.0-1.28.100
Kubernetes Metadata API does not support VPC-SC
The recently added API kubernetesmetadata.googleapis.com does not support VPC-SC.
This causes the metadata collection agent to fail to reach this API under VPC-SC. As a result, metric metadata labels will be missing.
Workaround:
In the `kube-system` namespace, set the `featureGates.disableExperimentalMetadataAgent` field to `true` in the `stackdriver` custom resource by running the following command:
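A sketch of one way to do this (assumptions: the Stackdriver custom resource is named stackdriver and the field sits under spec.featureGates):
kubectl --kubeconfig CLUSTER_KUBECONFIG -n kube-system patch stackdriver stackdriver \
  --type merge -p '{"spec":{"featureGates":{"disableExperimentalMetadataAgent":true}}}'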
The clusterapi-controller may crash when the admin cluster and any user cluster with ControlPlane V2 enabled use different vSphere credentials
When an admin cluster and any user cluster with ControlPlane V2 enabled use
different vSphere credentials, for example after updating the vSphere credentials for the
admin cluster, the clusterapi-controller may fail to connect to vCenter after a restart. To confirm, view the log of the clusterapi-controller running in the admin cluster's
`kube-system` namespace.
To check if this alert is a false positive that can be ignored,
complete the following steps:
Check the raw grpc_server_handled_total metric against
the grpc_method given in the alert message. In this
example, check the grpc_code label for
Watch.
You can check this metric using Cloud Monitoring with the following
MQL query:
If silencing the alert isn't an option, review the following steps
to suppress the false positives:
Scale down the monitoring operator to 0 replicas so
that the modifications can persist.
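For example (an assumption: the monitoring operator runs as the monitoring-operator Deployment in the kube-system namespace):
kubectl --kubeconfig CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0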
Modify the prometheus-config configmap, and add
grpc_method!="Watch" to the
etcdHighNumberOfFailedGRPCRequests alert config as shown
in the following example:
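A sketch of the rule after the edit (the full alert expression in your prometheus-config ConfigMap may differ; the key change is the added grpc_method!="Watch" matcher):
- alert: etcdHighNumberOfFailedGRPCRequests
  expr: |
    100 * sum(rate(grpc_server_handled_total{cluster="CLUSTER_NAME", grpc_code!="OK", grpc_method!="Watch", job=~".*etcd.*"}[5m])) BY (grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{cluster="CLUSTER_NAME", grpc_method!="Watch", job=~".*etcd.*"}[5m])) BY (grpc_service, grpc_method)
    > 5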
Replace the following CLUSTER_NAME with
the name of your cluster.
Restart the Prometheus and Alertmanager StatefulSets to pick up the
new configuration.
If the code falls into one of the problematic cases, check the etcd
logs and the kube-apiserver logs to debug further.
Networking
1.16.0-1.16.2, 1.28.0
Egress NAT long lived connections are dropped
Egress NAT connections might be dropped after 5 to 10 minutes of a
connection being established if there's no traffic.
As the conntrack only matters in the inbound direction (external
connections to the cluster), this issue only happens if the connection
doesn't transmit any information for a while and then the destination side
transmits something. If the egress NAT'd Pod always instantiates the
messaging, then this issue won't be seen.
This issue occurs because the anetd garbage collection inadvertently
removes conntrack entries that the daemon thinks are unused.
An upstream fix
was recently integrated into anetd to correct the behavior.
Workaround:
There is no easy workaround, and we haven't seen issues in version 1.16
from this behavior. If you notice long lived connections dropped due to
this issue, workarounds would be to use a workload on the same node as the
egress IP address, or to consistently send messages on the TCP
connection.
Operation
1.14, 1.15, 1.16
The CSR signer ignores spec.expirationSeconds when signing
certificates
If you create a CertificateSigningRequest (CSR) with
expirationSeconds set, the expirationSeconds
is ignored.
Workaround:
If you're affected by this issue, you can update your user cluster by
adding disableNodeIDVerificationCSRSigning: true in the user
cluster configuration file and run the gkectl update cluster
command to update the cluster with this configuration.
Networking, Upgrades, Updates
1.16.0-1.16.3
User cluster load balancer validation fails for
disable_bundled_ingress
[FAILURE] Config: ingress IP is required in user cluster spec
This error happens because gkectl checks for a load
balancer ingress IP address during preflight checks. Although this check
isn't required when disabling bundled ingress, the gkectl
preflight check fails when disableBundledIngress is set to
true.
Workaround:
Use the --skip-validation-load-balancer parameter when you
update the cluster, as shown in the following example:
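For example, the update command might look like the following (a sketch; file paths are placeholders):
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --config USER_CLUSTER_CONFIG_FILE --skip-validation-load-balancer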
If you rotate admin cluster certificate authority (CA) certificates,
subsequent attempts to run the gkectl update admin command fail.
The error returned is similar to the following:
failed to get last CARotationStage: configmaps "ca-rotation-stage" not found
Workaround:
If you're affected by this issue, you can update your admin cluster by
using the --disable-update-from-checkpoint flag with the
gkectl update admin command:
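A sketch of the command (paths are placeholders):
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
  --config ADMIN_CLUSTER_CONFIG_FILE --disable-update-from-checkpoint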
When you use the --disable-update-from-checkpoint flag, the
update command doesn't use the checkpoint file as the source of truth during the
cluster update. The checkpoint file is still updated for future use.
Storage
1.15.0-1.15.6, 1.16.0-1.16.2
CSI Workload preflight check fails due to Pod startup failure
During preflight checks, the CSI Workload validation check installs a
Pod in the default namespace. The CSI Workload Pod validates
that the vSphere CSI Driver is installed and can do dynamic volume
provisioning. If this Pod doesn't start, the CSI Workload validation check
fails.
There are a few known issues that can prevent this Pod from starting:
If the Pod doesn't have resource limits specified, which is the case
for some clusters with admission webhooks installed, the Pod doesn't start.
If Cloud Service Mesh is installed in the cluster with
automatic Istio sidecar injection enabled in the default
namespace, the CSI Workload Pod doesn't start.
If the CSI Workload Pod doesn't start, you see a timeout error like the
following during preflight validations:
If the CSI Workload Pod doesn't start because of Istio sidecar injection,
you can temporarily disable the automatic Istio sidecar injection in the
default namespace. Check the labels of the namespace and use
the following command to delete the label that starts with istio.io/rev:
kubectl label namespace default istio.io/rev-
If the Pod is misconfigured, manually verify that dynamic volume
provisioning with the vSphere CSI Driver works (an example manifest follows these steps):
Create a PVC that uses the standard-rwo StorageClass.
Create a Pod that uses the PVC.
Verify that the Pod can read/write data to the volume.
Remove the Pod and the PVC after you've verified proper operation.
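A minimal sketch of the verification objects (the resource names and the busybox image are illustrative):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-test-pod
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo test > /data/out && cat /data/out && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: csi-test-pvc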
If dynamic volume provisioning with the vSphere CSI Driver works, run
gkectl diagnose or gkectl upgrade
with the --skip-validation-csi-workload flag to skip the CSI
Workload check.
Operation
1.16.0-1.16.2
User cluster update timeouts when admin cluster version is 1.15
When you are logged on to a
user-managed admin workstation, the gkectl update cluster
command might timeout and fail to update the user cluster. This happens if
the admin cluster version is 1.15 and you run gkectl update admin
before you run the gkectl update cluster.
When this failure happens, you see the following error when trying to update the cluster:
Preflight check failed with failed to run server-side preflight checks: server-side preflight checks failed: timed out waiting for the condition
During the update of a 1.15 admin cluster, the validation-controller
that triggers the preflight checks is removed from the cluster. If you then
try to update the user cluster, the preflight check hangs until the
timeout is reached.
Workaround:
Run the following command to redeploy the validation-controller:
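One possible form of the command (an assumption; check gkectl prepare --help on your version for the required flags):
gkectl prepare --kubeconfig ADMIN_CLUSTER_KUBECONFIG --bundle-path BUNDLE_PATH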
After the prepare completes, run gkectl update cluster again to update the user cluster.
Operation
1.16.0-1.16.2
User cluster create timeouts when admin cluster version is 1.15
When you are logged on to a
user-managed admin workstation, the gkectl create cluster
command might timeout and fail to create the user cluster. This happens if
the admin cluster version is 1.15.
When this failure happens, you see the following error when trying to create the cluster:
Preflight check failed with failed to run server-side preflight checks: server-side preflight checks failed: timed out waiting for the condition
The validation-controller was added in 1.16, so when using a
1.15 admin cluster, the validation-controller that is responsible for triggering the preflight checks is missing. Therefore, when trying to create the user cluster, the preflight checks
hang until the timeout is reached.
Workaround:
Run the following command to deploy the validation-controller:
After the prepare completes, run gkectl create cluster again to create the user cluster.
Upgrades, Updates
1.16.0-1.16.2
Admin cluster update or upgrade fails if the projects or locations of
add-on services don't match each other
When you upgrade an admin cluster from version 1.15.x to 1.16.x, or
add a connect, stackdriver,
cloudAuditLogging, or gkeOnPremAPI configuration
when you update an admin cluster, the operation might be rejected by admin
cluster webhook. One of the following error messages might be displayed:
"projects for connect, stackdriver and cloudAuditLogging must be the
same when specified during cluster creation."
"locations for connect, gkeOnPremAPI, stackdriver and
cloudAuditLogging must be in the same region when specified during
cluster creation."
"locations for stackdriver and cloudAuditLogging must be the same
when specified during cluster creation."
An admin cluster update or upgrade requires the
onprem-admin-cluster-controller to reconcile the admin
cluster in a kind cluster. When the admin cluster state is restored in the
kind cluster, the admin cluster webhook can't distinguish if the
OnPremAdminCluster object is for an admin cluster creation,
or to resume operations for an update or upgrade. Some create-only
validations are invoked on updating and upgrading unexpectedly.
Workaround:
Add the
onprem.cluster.gke.io/skip-project-location-sameness-validation: true
annotation to the OnPremAdminCluster object:
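For example (a sketch; it assumes the OnPremAdminCluster object is in the kube-system namespace of the admin cluster):
kubectl edit onpremadmincluster ADMIN_CLUSTER_NAME -n kube-system \
  --kubeconfig=ADMIN_CLUSTER_KUBECONFIG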
ADMIN_CLUSTER_NAME: the name of
the admin cluster.
ADMIN_CLUSTER_KUBECONFIG: the path
of the admin cluster kubeconfig file.
Add the
onprem.cluster.gke.io/skip-project-location-sameness-validation: true
annotation and save the custom resource.
Depending on the type of admin clusters, complete one of the
following steps:
For non-HA admin clusters with a checkpoint file: add the
--disable-update-from-checkpoint flag to the
update command, or add the
--disable-upgrade-from-checkpoint flag to the upgrade command. These
flags are only needed for the next time that you run the
update or upgrade command:
For HA admin clusters, or if the checkpoint file is disabled:
update or upgrade admin cluster as normal. No additional parameters
are needed on the update or upgrade commands.
Operation
1.16.0-1.16.2
User cluster deletion fails when using a user-managed admin workstation
When you are logged on to a
user-managed admin workstation, the gkectl delete cluster
command might timeout and fail to delete the user cluster. This happens if
you have first run gkectl on the user-managed workstation to
create, update, or upgrade the user cluster. When this failure happens,
you see the following error when trying to delete the cluster:
failed to wait for user cluster management namespace "USER_CLUSTER_NAME-gke-onprem-mgmt"
to be deleted: timed out waiting for the condition
During deletion, a cluster first deletes all of its objects. The
deletion of the Validation objects (that were created during the create,
update, or upgrade) are stuck at the deleting phase. This happens
because a finalizer blocks the object's deletion, which causes
cluster deletion to fail.
Workaround:
Get the names of all the Validation objects:
kubectl --kubeconfig ADMIN_KUBECONFIG get validations \
-n USER_CLUSTER_NAME-gke-onprem-mgmt
For each Validation object, run the following command to remove the
finalizer from the Validation object:
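A sketch of the finalizer removal (VALIDATION_NAME comes from the output of the previous command):
kubectl --kubeconfig ADMIN_KUBECONFIG patch validation VALIDATION_NAME \
  -n USER_CLUSTER_NAME-gke-onprem-mgmt --type=json \
  -p='[{"op":"remove","path":"/metadata/finalizers"}]'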
After removing the finalizer from all Validation objects, the objects
are removed and the user cluster delete operation completes
automatically. You don't need to take additional action.
Networking
1.15, 1.16
Egress NAT gateway traffic to external server fails
If the source Pod and egress NAT gateway Pod are on two different
worker nodes, traffic from the source Pod can't reach any external
services. If the Pods are located on the same host, the connection to
external service or application is successful.
This issue is caused by vSphere dropping VXLAN packets when tunnel
aggregation is enabled. There's a known issue with NSX and VMware that
only sends aggregated traffic on known VXLAN ports (4789).
This workaround reverts every time the cluster is upgraded. You must
reconfigure after each upgrade. VMware must resolve their issue in vSphere
for a permanent fix.
Upgrades
1.15.0-1.15.4
Upgrading an admin cluster with always-on secrets encryption enabled fails
The admin cluster upgrade from 1.14.x to 1.15.x with
always-on
secrets encryption enabled fails due to a mismatch between the
controller-generated encryption key with the key that persists on the
admin master data disk. The output of gkectl upgrade
admin contains the following error message:
E0926 14:42:21.796444 40110 console.go:93] Exit with error:
E0926 14:42:21.796491 40110 console.go:93] Failed to upgrade the admin cluster: failed to create admin cluster: failed to wait for OnPremAdminCluster "admin-cluster-name" to become ready: failed to wait for OnPremAdminCluster "admin-cluster-name" to be ready: error: timed out waiting for the condition, message: failed to wait for OnPremAdminCluster "admin-cluster-name" to stay in ready status for duration "2m0s": OnPremAdminCluster "non-prod-admin" is not ready: ready condition is not true: CreateOrUpdateControlPlane: Creating or updating credentials for cluster control plane
Running kubectl get secrets -A --kubeconfig KUBECONFIG fails with the following error:
Internal error occurred: unable to transform key "/registry/secrets/anthos-identity-service/ais-secret": rpc error: code = Internal desc = failed to decrypt: unknown jwk
Workaround
If you have a backup of the admin cluster, do the following steps to
work around the upgrade failure:
When the new admin master VM is created, SSH to the admin master VM,
replace the new key on the data disk with the old one from the
backup. The key is located at /opt/data/gke-k8s-kms-plugin/generatedkeys
on the admin master.
Update the kms-plugin.yaml static Pod manifest in /etc/kubernetes/manifests
to update the --kek-id to match the kid
field in the original encryption key.
Restart the kms-plugin static Pod by moving the
/etc/kubernetes/manifests/kms-plugin.yaml to another
directory then move it back.
Resume the admin upgrade by running gkectl upgrade admin again.
Preventing the upgrade failure
If you haven't already upgraded, we recommend that you don't upgrade
to 1.15.0-1.15.4. If you must upgrade to an affected version, do
the following steps before upgrading the admin cluster:
Disk errors and attach failures when using Changed Block Tracking
(CBT)
Google Distributed Cloud does not support Changed Block Tracking (CBT) on
disks. Some backup software uses the CBT feature to track disk state and
perform backups, which causes the disk to be unable to connect to a VM
that runs Google Distributed Cloud. For more information, see the
VMware KB
article.
Workaround:
Don't back up the Google Distributed Cloud VMs, as 3rd party backup software
might cause CBT to be enabled on their disks. It's not necessary to back
up these VMs.
Don't enable CBT on the node, as this change won't persist across
updates or upgrades.
If you already have disks with CBT enabled, follow the
Resolution steps in the
VMware KB
article to disable CBT on the First Class Disk.
Storage
1.14, 1.15, 1.16
Data corruption on NFSv3 when parallel appends to a shared file are
done from multiple hosts
If you use Nutanix storage arrays to provide NFSv3 shares to your
hosts, you might experience data corruption or the inability for Pods to
run successfully. This issue is caused by a known compatibility issue
between certain VMware and Nutanix versions. For more
information, see the associated
VMware KB
article.
Workaround:
The VMware KB article is out of date in noting that there is no
current resolution. To resolve this issue, update to the latest version
of ESXi on your hosts and to the latest Nutanix version on your storage
arrays.
Operating system
1.13.10, 1.14.6, 1.15.3
Version mismatch between the kubelet and the Kubernetes control plane
For certain Google Distributed Cloud releases, the kubelet running on the
nodes uses a different version than the Kubernetes control plane. There is a
mismatch because the kubelet binary preloaded on the OS image is using a
different version.
The following table lists the identified version mismatches:
Google Distributed Cloud version | kubelet version   | Kubernetes version
1.13.10                          | v1.24.11-gke.1200 | v1.24.14-gke.2100
1.14.6                           | v1.25.8-gke.1500  | v1.25.10-gke.1200
1.15.3                           | v1.26.2-gke.1001  | v1.26.5-gke.2100
Workaround:
No action is needed. The inconsistency is only between Kubernetes patch
versions and no problems have been caused by this version skew.
Upgrades, Updates
1.15.0-1.15.4
Upgrading or updating an admin cluster with a CA version greater than 1 fails
When an admin cluster has a certificate authority (CA) version greater
than 1, an update or upgrade fails due to the CA version validation in the
webhook. The output of
gkectl upgrade/update contains the following error message:
CAVersion must start from 1
Workaround:
Scale down the auto-resize-controller deployment in the
admin cluster to disable node auto-resizing. This is necessary
because a new field introduced to the admin cluster Custom Resource in
1.15 can cause a nil pointer error in the auto-resize-controller.
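For example (an assumption: the auto-resize-controller Deployment runs in the kube-system namespace of the admin cluster):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment auto-resize-controller --replicas=0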
Constant CNS attachvolume tasks appear every minute for in-tree PVC/PV
after upgrading to version 1.15+
When a cluster contains in-tree vSphere persistent volumes (for example, PVCs created with the standard StorageClass), you will observe com.vmware.cns.tasks.attachvolume tasks triggered every minute from vCenter.
Workaround:
Edit the vSphere CSI feature configMap and set list-volumes to false:
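One possible way to make this change (assumptions: the feature-states ConfigMap is named internal-feature-states.csi.vsphere.vmware.com and lives in the kube-system namespace):
kubectl --kubeconfig CLUSTER_KUBECONFIG -n kube-system \
  edit configmap internal-feature-states.csi.vsphere.vmware.com
# Then set:
#   list-volumes: "false"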
When a cluster contains in-tree vSphere persistent volumes, the commands
gkectl diagnose and gkectl upgrade might raise
false warnings against their persistent volume claims (PVCs) when
validating the cluster storage settings. The warning message looks like
the following:
Service account key rotation fails when multiple keys are expired
If your cluster is not using a private registry, and your component
access service account key and Logging-monitoring (or Connect-register)
service account keys are expired, when you
rotate the
service account keys, gkectl update credentials
fails with an error similar to the following:
Workaround:
First, rotate the component access service account key. Although the
same error message is displayed, you should be able to rotate the other
keys after the component access service account key rotation.
1.15 User master machine encounters an unexpected recreation when the user cluster controller is upgraded to 1.16
During a user cluster upgrade, after the user cluster controller is upgraded to 1.16, if you have other 1.15 user clusters managed by the same admin cluster, their user master machine might be unexpectedly recreated.
There is a bug in the 1.16 user cluster controller which can trigger the 1.15 user master machine recreation.
The workaround that you do depends on how you encounter this issue.
Workaround when upgrading the user cluster using the Google Cloud console:
Option 1: Use a 1.16.6+ version of GKE on VMware with the fix.
Option 2: Do the following steps:
Manually add the rerun annotation by running the following command:
Monitor the upgrade progress by checking the status field of the OnPremUserCluster.
Workaround when upgrading the user cluster using your own admin workstation:
Option 1: Use a 1.16.6+ version of GKE on VMware with the fix.
Option 2: Do the following steps:
Add the build info file /etc/cloud/build.info with the following content. This causes the preflight checks to run locally on your admin workstation rather than on the server.
gke_on_prem_version:GKE_ON_PREM_VERSION
For example:
gke_on_prem_version:1.16.0-gke.669
Rerun the upgrade command.
After the upgrade completes, delete the build.info file.
Create
1.16.0-1.16.5, 1.28.0-1.28.100
Preflight check fails when the hostname isn't in the IP block file.
During cluster creation, if you don't specify a hostname for every IP
address in the IP block file, the preflight check fails with the
following error message:
There is a bug in the preflight check that treats an empty hostname as a duplicate.
Workaround:
Option 1: Use a version with the fix.
Option 2: Bypass this preflight check by adding --skip-validation-net-config flag.
Option 3: Specify a unique hostname for each IP address in IP block file.
Upgrades, Updates
1.16
Volume mount failure when upgrading or updating the admin cluster if using a non-HA admin cluster and a control plane v1 user cluster
For a non-HA admin cluster and a control plane v1 user cluster, when you upgrade or update the admin cluster, the admin cluster master machine recreation might happen at the same time as the user cluster master machine reboot, which can surface a race condition.
This causes the user cluster control plane Pods to be unable to communicate to the admin cluster control plane, which causes volume attach issues for kube-etcd and kube-apiserver on the user cluster control plane.
To verify the issue, run the following commands for the impacted pod:
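For example (a sketch; for a control plane v1 user cluster, the control plane Pods run in the admin cluster in a namespace named after the user cluster):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME describe pod POD_NAME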
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 101s kubelet Unable to attach or mount volumes: unmounted volumes=[kube-audit], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Warning FailedMount 86s (x2 over 3m28s) kubelet MountVolume.SetUp failed for volume "pvc-77cd0635-57c2-4392-b191-463a45c503cb" : rpc error: code = FailedPrecondition desc = volume ID: "bd313c62-d29b-4ecd-aeda-216648b0f7dc" does not appear staged to "/var/lib/kubelet/plugins/kubernetes.io/csi/csi.vsphere.vmware.com/92435c96eca83817e70ceb8ab994707257059734826fedf0c0228db6a1929024/globalmount"
After a restart, the kubelet can reconstruct the stage global mount properly.
Upgrades, Updates
1.16.0
Control plane node fails to be created
During an upgrade or update of an admin cluster, a race condition might
cause the vSphere cloud controller manager to unexpectedly delete a new
control plane node. This causes the clusterapi-controller to be stuck
waiting for the node to be created, and eventually the upgrade/update
times out. In this case, the output of the gkectl
upgrade/update command is similar to the following:
controlplane 'default/gke-admin-hfzdg' is not ready: condition "Ready": condition is not ready with reason "MachineInitializing", message "Wait for the control plane machine "gke-admin-hfzdg-6598459f9zb647c8-0" to be rebooted"...
To identify the symptom, run the following command to get the logs of the vSphere cloud controller manager in the admin cluster:
Duplicate hostname in the same data center causes cluster upgrade or creation failures
Upgrading a 1.15 cluster or creating a 1.16 cluster with static IPs fails if there are duplicate
hostnames in the same data center. This failure happens because the
vSphere cloud controller manager fails to add an external IP and provider
ID in the node object. This causes the cluster upgrade/create to timeout.
To identify the issue, get the vSphere cloud controller manager pod logs
for the cluster. The command that you use depends on the cluster type,
as follows:
Workaround:
Update the hostname of the affected machine in user-ip-block.yaml to a
unique name.
Rerun gkectl create cluster.
Operation
1.16.0, 1.16.1, 1.16.2, 1.16.3
$ and ` are not supported in vSphere username or password
The following operations fail when the vSphere username or password
contains $ or `:
Upgrading a 1.15 user cluster with Controlplane V2 enabled to 1.16
Upgrading a 1.15 high-availability (HA) admin cluster to 1.16
Creating a 1.16 user cluster with Controlplane V2 enabled
Creating a 1.16 HA admin cluster
Use a 1.16.4+ version of Google Distributed Cloud with the fix, or perform the workaround below. The workaround that you do depends on the operation that failed.
Workaround for upgrades:
Change the vCenter username or password on the vCenter side to remove
the $ and `.
PVC creation failure after node is recreated with the same name
After a node is deleted and then recreated with the same node name,
there is a slight chance that a subsequent PersistentVolumeClaim (PVC)
creation fails with an error like the following:
If you use Seesaw as the load balancer type for your cluster and you see that
a Seesaw VM is down or keeps failing to boot, you might see the following error
message in the vSphere console:
This error indicates that the disk space is low on the VM because the fluent-bit
running on the Seesaw VM is not configured with correct log rotation.
Workaround:
Locate the log files that consume most of the disk space by using du -sh -- /var/lib/docker/containers/* | sort -rh. Clean up the log file with the largest size and reboot the VM.
Note: If the VM is completely inaccessible, attach the disk to a working VM (e.g. admin workstation), remove the file from the attached disk, then reattach the disk back to the original Seesaw VM.
To prevent the issue from happening again, connect to the VM and modify the /etc/systemd/system/docker.fluent-bit.service file. Add --log-opt max-size=10m --log-opt max-file=5 to the Docker command, then run systemctl restart docker.fluent-bit.service.
Operation
1.13, 1.14.0-1.14.6, 1.15
Admin SSH public key error after admin cluster upgrade or update
When you try to upgrade (gkectl upgrade admin) or update
(gkectl update admin) a non-High-Availability admin cluster
with checkpoint enabled, the upgrade or update may fail with errors like the
following:
If you're unable to upgrade to a patch version of Google Distributed Cloud with the fix,
contact Google Support for assistance.
Upgrades
1.13.0-1.13.9, 1.14.0-1.14.6, 1.15.1-1.15.2
Upgrading an admin cluster enrolled in the GKE On-Prem API could fail
When an admin cluster is enrolled in the GKE On-Prem API, upgrading the
admin cluster to the affected versions could fail because the fleet membership
couldn't be updated. When this failure happens, you see the
following error when trying to upgrade the cluster:
and resume
upgrading the admin cluster. You might see the stale `failed to
register cluster` error temporarily. After a while, it should be updated
automatically.
Upgrades, Updates
1.13.0-1.13.9, 1.14.0-1.14.4, 1.15.0
Enrolled admin cluster's resource link annotation is not preserved
When an admin cluster is enrolled in the GKE On-Prem API, its resource
link annotation is applied to the OnPremAdminCluster custom
resource, which is not preserved during later admin cluster updates due to
the wrong annotation key being used. This can cause the admin cluster to be
enrolled in the GKE On-Prem API again by mistake.
OnPremAdminCluster status inconsistent between checkpoint and actual CR
Certain race conditions could cause the OnPremAdminCluster status to be inconsistent between the checkpoint and the actual CR. When the issue happens, you could encounter the following error when you update the admin cluster after upgrading it:
To work around this issue, you need to either edit the checkpoint or disable the checkpoint for the upgrade or update. Reach out to our support team to proceed with the workaround.
Operation
1.13.0-1.13.9, 1.14.0-1.14.5, 1.15.0-1.15.1
Reconciliation process changes admin certificates on admin clusters
Google Distributed Cloud changes the admin certificates on admin cluster control planes
with every reconciliation process, such as during a cluster upgrade. This behavior
increases the possibility of getting invalid certificates for your admin cluster,
especially for version 1.15 clusters.
If you're affected by this issue, you may encounter problems like the
following:
Invalid certificates may cause the following commands to time out and
return errors:
gkectl create admin
gkectl upgrade admin
gkectl update admin
These commands may return authorization errors like the following:
Upgrade to a version of Google Distributed Cloud with the fix:
1.13.10+, 1.14.6+, 1.15.2+.
If upgrading isn't feasible for you, contact Cloud Customer Care to resolve this issue.
Networking, Operation
1.10, 1.11, 1.12, 1.13, 1.14
Anthos Network Gateway components evicted or pending due to missing
priority class
Network gateway Pods in kube-system might show a status of
Pending or Evicted, as shown in the following
condensed example output:
These errors indicate eviction events or an inability to schedule Pods
due to node resources. As Anthos Network Gateway Pods have no
PriorityClass, they have the same default priority as other workloads.
When nodes are resource-constrained, the network gateway Pods might be
evicted. This behavior is particularly bad for the ang-node
DaemonSet, as those Pods must be scheduled on a specific node and can't
migrate.
Workaround:
Upgrade to 1.15 or later.
As a short-term fix, you can manually assign a
PriorityClass
to the Anthos Network Gateway components. The Google Distributed Cloud controller
overwrites these manual changes during a reconciliation process, such as
during a cluster upgrade.
Assign the system-cluster-critical PriorityClass to the
ang-controller-manager and autoscaler cluster
controller Deployments.
Assign the system-node-critical PriorityClass to the
ang-daemon node DaemonSet.
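A minimal sketch of the short-term fix (the kube-system namespace and component names are taken from the description above; the kubeconfig placeholder is an assumption):
# Assign system-cluster-critical to the controller Deployments.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system patch deployment ang-controller-manager \
    --type merge -p '{"spec":{"template":{"spec":{"priorityClassName":"system-cluster-critical"}}}}'
# Assign system-node-critical to the ang-daemon DaemonSet.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system patch daemonset ang-daemon \
    --type merge -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'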
Upgrades, Updates
1.12, 1.13, 1.14, 1.15.0-1.15.2
admin cluster upgrade fails after registering the cluster with gcloud
After you use gcloud to register an admin cluster with a non-empty
gkeConnect section, you might see the following error when trying to upgrade the cluster:
gkectl diagnose snapshot --log-since fails to limit the time window for
journalctl commands running on the cluster nodes
This does not affect the functionality of taking a snapshot of the
cluster, as the snapshot still includes all logs that are collected by
default by running journalctl on the cluster nodes. Therefore,
no debugging information is missed.
Installation, Upgrades, Updates
1.9+, 1.10+, 1.11+, 1.12+
gkectl prepare windows fails
gkectl prepare windows fails to install Docker on
Google Distributed Cloud versions earlier than 1.13 because
MicrosoftDockerProvider
is deprecated.
Workaround:
The general idea for working around this issue is to upgrade to Google Distributed Cloud 1.13,
use the 1.13 gkectl to create a Windows VM template, and then create
Windows node pools. There are two options for getting to Google Distributed Cloud 1.13 from your
current version, as shown below.
Note: There are options for working around this issue in your current version
without upgrading all the way to 1.13, but they require more manual
steps. Reach out to our support team if you would like to consider
this option.
Option 1: Blue/Green upgrade
You can create a new cluster using a Google Distributed Cloud 1.13+ version with Windows node pools,
migrate your workloads to the new cluster, and then tear down the current
cluster. We recommend using the latest Google Distributed Cloud minor version.
Note: This option requires extra resources to provision the new cluster, but
causes less downtime and disruption for existing workloads.
Option 2: Delete Windows node pools and add them back when
upgrading to Google Distributed Cloud 1.13
Note: For this option, the Windows workloads will not be able to run until
the cluster is upgraded to 1.13 and Windows node pools are added back.
Delete existing Windows node pools by removing the Windows node pool
configuration from the user-cluster.yaml file, then run the command:
Upgrade the Linux-only admin+user clusters to 1.12 following the
upgrade user guide for the corresponding target minor version.
(Make sure to perform this step before upgrading to 1.13.) Ensure that enableWindowsDataplaneV2: true is configured in the OnPremUserCluster CR; otherwise the cluster keeps using Docker for Windows node pools, which is not compatible with the newly created 1.13 Windows VM template that doesn't have Docker installed. If the field isn't configured or is set to false, update your cluster to set it to true in user-cluster.yaml, then run:
RootDistanceMaxSec configuration not taking effect for
Ubuntu nodes
The default value of 5 seconds for RootDistanceMaxSec is
used on the nodes, instead of the expected configuration of 20 seconds.
If you check the node startup log at `/var/log/startup.log` by SSHing into
the VM, you can find the following
error:
If you check the gkectl log, you might see that the multiple
changes include setting osImageType from an empty string to
ubuntu_containerd.
These update errors are due to improper backfilling of the
osImageType field in the admin cluster config since it was
introduced in version 1.9.
Workaround:
Upgrade to a version of Google Distributed Cloud with the fix. If upgrading
isn't feasible for you, contact Cloud Customer Care to resolve this issue.
Installation, Security
1.13, 1.14, 1.15, 1.16
SNI doesn't work on user clusters with Controlplane V2
The ability to provide an additional serving certificate for the
Kubernetes API server of a user cluster with
authentication.sni doesn't work when the Controlplane V2 is
enabled (enableControlplaneV2: true).
Workaround:
Until a Google Distributed Cloud patch is available with the fix, if you
need to use SNI, disable Controlplane V2 (enableControlplaneV2: false).
$ in the private registry username causes admin control plane machine startup failure
The admin control plane machine fails to start up when the private registry username contains $.
When checking the /var/log/startup.log on the admin control plane machine, you see the
following error:
Avoid another KSA signing key rotation until the cluster is
upgraded to the version with the fix.
Operation
1.13.1+, 1.14, 1.15, 1.16
F5 BIG-IP virtual servers aren't cleaned up when Terraform deletes user clusters
When you use Terraform to delete a user cluster with a F5 BIG-IP load
balancer, the F5 BIG-IP virtual servers aren't removed after the cluster
deletion.
kind cluster pulls container images from docker.io
If you create a version 1.13.8 or version 1.14.4 admin cluster, or
upgrade an admin cluster to version 1.13.8 or 1.14.4, the kind cluster pulls
the following container images from docker.io:
docker.io/kindest/kindnetd
docker.io/kindest/local-path-provisioner
docker.io/kindest/local-path-helper
If docker.io isn't accessible from your admin workstation,
the admin cluster creation or upgrade fails to bring up the kind cluster.
Running the following command on the admin workstation shows the
corresponding containers pending with ErrImagePull:
These container images should be preloaded in the kind cluster container
image. However, kind v0.18.0 has
an issue with the preloaded container images,
which causes them to be pulled from the internet by mistake.
Workaround:
Run the following commands on the admin workstation, while your admin cluster
is pending on creation or upgrade:
Unsuccessful failover on HA Controlplane V2 user cluster and admin cluster when the network filters out duplicate GARP requests
If your cluster VMs are connected with a switch that filters out duplicate GARP (gratuitous ARP) requests, the
keepalived leader election might encounter a race condition, which causes some nodes to have incorrect ARP table entries.
The affected nodes can ping the control plane VIP, but a TCP connection to the control plane VIP
will time out.
Workaround:
Run the following command on each control plane node of the affected cluster:
vsphere-csi-controller needs to be restarted after vCenter certificate rotation
vsphere-csi-controller should refresh its vCenter secret after vCenter certificate rotation. However, the current system does not properly restart the pods of vsphere-csi-controller, causing vsphere-csi-controller to crash after the rotation.
Workaround:
For clusters created at version 1.13 and later, follow the instructions below to restart vsphere-csi-controller:
Even when
cluster registration fails during admin cluster creation, the command gkectl create admin does not fail on the error and might succeed. In other words, the admin cluster creation could "succeed" without being registered to a fleet.
To identify the symptom, you can look for the following error message in the log of `gkectl create admin`:
Failed to register admin cluster
You can also check whether the cluster appears among registered clusters in the Google Cloud console.
Workaround:
For clusters created at version 1.12 and later, follow the instructions for re-attempting the admin cluster registration after cluster creation. For clusters created at earlier versions,
append a fake key-value pair like "foo: bar" to your connect-register SA key file.
Admin cluster re-registration might be skipped during admin cluster upgrade
During admin cluster upgrade, if upgrading user control plane nodes times out, the admin cluster will not be re-registered with the updated connect agent version.
For a high-availability admin cluster, gkectl prepare shows
this false error message:
vCenter.dataDisk must be present in the AdminCluster spec
Workaround:
You can safely ignore this error message.
VMware
1.15.0
Node pool creation fails because of redundant VM-Host affinity rules
During creation of a node pool that uses
VM-Host affinity,
a race condition might result in multiple
VM-Host affinity rules
being created with the same name. This can cause node pool creation to fail.
Workaround:
Remove the old redundant rules so that node pool creation can proceed.
These rules are named [USER_CLUSTER_NAME]-[HASH].
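A minimal sketch of listing and removing the duplicate rules with govc (this assumes govc is already configured through GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD; you can also remove the rules in the vSphere UI):
# List the affinity rules in the vSphere compute cluster and look for duplicates named USER_CLUSTER_NAME-HASH.
govc cluster.rule.ls -cluster VSPHERE_COMPUTE_CLUSTER
# Remove a stale duplicate rule by name.
govc cluster.rule.remove -cluster VSPHERE_COMPUTE_CLUSTER -name USER_CLUSTER_NAME-HASH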
Operation
1.15.0
gkectl repair admin-master may fail due to failed
to delete the admin master node object and reboot the admin master VM
The gkectl repair admin-master command may fail due to a
race condition with the following error:
This command is idempotent. You can safely rerun it until the command
succeeds.
Upgrades, Updates
1.15.0
Pods remain in Failed state after re-creation or update of a
control-plane node
After you re-create or update a control-plane node, certain Pods might
be left in the Failed state due to NodeAffinity predicate
failure. These failed Pods don't affect normal cluster operations or health.
Workaround:
You can safely ignore the failed Pods or manually delete them.
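A minimal sketch of deleting the failed Pods in bulk (the kubeconfig placeholder is an assumption):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG delete pods --field-selector=status.phase=Failed --all-namespaces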
Security, Configuration
1.15.0-1.15.1
OnPremUserCluster not ready because of private registry credentials
If you use
prepared credentials
and a private registry, but you haven't configured prepared credentials for
your private registry, the OnPremUserCluster might not become ready, and
you might see the following error message:
failed to check secret reference for private registry …
Workaround:
Prepare the private registry credentials for the user cluster according
to the instructions in
Configure prepared credentials.
Upgrades, Updates
1.15.0
gkectl upgrade admin fails with StorageClass standard sets the parameter diskformat which is invalid for CSI Migration
During gkectl upgrade admin, the storage preflight check for CSI Migration verifies
that the StorageClasses don't have parameters that are ignored after CSI Migration.
For example, if there's a StorageClass with the parameter diskformat then
gkectl upgrade admin flags the StorageClass and reports a failure in the preflight validation.
Admin clusters created in Google Distributed Cloud 1.10 and earlier have a StorageClass with diskformat: thin,
which fails this validation; however, this StorageClass still works
fine after CSI Migration. These failures should be interpreted as warnings instead.
After confirming that your cluster has a StorageClass with parameters ignored after CSI Migration,
run gkectl upgrade admin with the flag --skip-validation-cluster-health.
Storage
1.15, 1.16
Migrated in-tree vSphere volumes using the Windows file system can't be used with vSphere CSI driver
Under certain conditions, disks can be attached as read-only to Windows
nodes. This results in the corresponding volume being read-only inside a Pod.
This problem is more likely to occur when a new set of nodes replaces an old
set of nodes (for example, cluster upgrade or node pool update). Stateful
workloads that previously worked fine might be unable to write to their
volumes on the new set of nodes.
Workaround:
Get the UID of the Pod that is unable to write to its volume:
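A minimal sketch of retrieving the UID (the Pod name and namespace placeholders are assumptions):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n POD_NAMESPACE get pod POD_NAME -o jsonpath='{.metadata.uid}'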
The Pod should get scheduled to the same node. But in case the Pod gets
scheduled to a new node, you might need to repeat the preceding steps on
the new node.
Upgrades, Updates
1.12, 1.13.0-1.13.7, 1.14.0-1.14.4
vsphere-csi-secret is not updated after gkectl update credentials vsphere --admin-cluster
If you update the vSphere credentials for an admin cluster by following
updating cluster credentials,
you might find that vsphere-csi-secret in the kube-system namespace in the admin cluster still uses the old credential.
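To confirm, you can inspect the secret and compare it with your new credentials; a minimal sketch (the kubeconfig placeholder is an assumption):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system get secret vsphere-csi-secret -o yaml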
audit-proxy crashloop when enabling Cloud Audit Logs with gkectl update cluster
audit-proxy might crashloop because of empty --cluster-name.
This behavior is caused by a bug in the update logic, where the cluster name is not propagated to the
audit-proxy pod / container manifest.
Workaround:
For a control plane v2 user cluster with enableControlplaneV2: true, connect to the user control plane machine using SSH,
and update /etc/kubernetes/manifests/audit-proxy.yaml with --cluster_name=USER_CLUSTER_NAME.
For a control plane v1 user cluster, edit the audit-proxy container in
the kube-apiserver statefulset to add --cluster_name=USER_CLUSTER_NAME:
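A minimal sketch of opening the StatefulSet for editing, assuming the control plane v1 layout where the user control plane runs in the admin cluster in a namespace named after the user cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-apiserver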
An additional control plane redeployment right after gkectl upgrade cluster
Right after gkectl upgrade cluster, the control plane pods might be redeployed.
The cluster state in gkectl list clusters changes from RUNNING to RECONCILING.
Requests to the user cluster might time out.
This behavior occurs because the control plane certificate rotation happens automatically after
gkectl upgrade cluster.
This issue only happens to user clusters that do NOT use control plane v2.
Workaround:
Wait for the cluster state to change back to RUNNING again in gkectl list clusters, or
upgrade to versions with the fix: 1.13.6+, 1.14.2+ or 1.15+.
Upgrades, Updates
1.12.7
Bad release 1.12.7-gke.19 has been removed
Google Distributed Cloud 1.12.7-gke.19 is a bad release
and you should not use it. The artifacts have been removed
from the Cloud Storage bucket.
Workaround:
Use the 1.12.7-gke.20 release instead.
Upgrades, Updates
1.12.0+, 1.13.0-1.13.7, 1.14.0-1.14.3
gke-connect-agent
continues to use the older image after registry credential updated
If you update the registry credential using one of the following methods:
gkectl update credentials componentaccess if not using private registry
gkectl update credentials privateregistry if using private registry
you might find that gke-connect-agent continues to use the older
image, or that the gke-connect-agent pods fail to start due
to ImagePullBackOff.
This issue will be fixed in Google Distributed Cloud releases 1.13.8,
1.14.4, and subsequent releases.
When you validate the configuration before creating a cluster with a manual load balancer by running gkectl check-config, the command fails with the following error messages.
Clusters running etcd version 3.4.13 or earlier may experience watch
starvation and non-operational resource watches, which can lead to the
following problems:
Pod scheduling is disrupted
Nodes are unable to register
kubelet doesn't observe pod changes
These problems can make the cluster non-functional.
This issue is fixed in Google Distributed Cloud releases 1.12.7, 1.13.6,
1.14.3, and subsequent releases. These newer releases use etcd version
3.4.21. All prior versions of Google Distributed Cloud are affected by
this issue.
Workaround
If you can't upgrade immediately, you can mitigate the risk of
cluster failure by reducing the number of nodes in your cluster. Remove
nodes until the etcd_network_client_grpc_sent_bytes_total
metric is less than 300 MBps.
To view this metric in Metrics Explorer:
Go to the Metrics Explorer in the Google Cloud console:
Expand the Select a metric, enter Kubernetes Container
in the filter bar, and then use the submenus to select the metric:
In the Active resources menu, select Kubernetes Container.
In the Active metric categories menu, select Anthos.
In the Active metrics menu, select etcd_network_client_grpc_sent_bytes_total.
Click Apply.
Upgrades, Updates
1.10, 1.11, 1.12, 1.13, and 1.14
GKE Identity Service can cause control plane latencies
At cluster restarts or upgrades, GKE Identity Service can get
overwhelmed with traffic consisting of expired JWT tokens forwarded from
the kube-apiserver to GKE Identity Service over the
authentication webhook. Although GKE Identity Service doesn't
crashloop, it becomes unresponsive and ceases to serve further requests.
This problem ultimately leads to higher control plane latencies.
This issue is fixed in the following Google Distributed Cloud releases:
1.12.6+
1.13.6+
1.14.2+
To determine if you're affected by this issue, perform the following steps:
Check whether the GKE Identity Service endpoint can be reached externally:
Replace CLUSTER_ENDPOINT
with the control plane VIP and control plane load balancer port for your
cluster (for example, 172.16.20.50:443).
If you're affected by this issue, the command returns a 400
status code. If the request times out, restart the ais Pod and
rerun the curl command to see if that resolves the problem. If
you get a status code of 000, the problem has been resolved and
you are done. If you still get a 400 status code, the
GKE Identity Service HTTP server isn't starting. In this case, continue.
Check the GKE Identity Service and kube-apiserver logs:
To decode the token and see the source pod name and namespace, copy
the token to the debugger at jwt.io.
Restart the pods identified from the tokens.
Operation
1.8, 1.9, 1.10
Memory usage increase of etcd maintenance pods
The etcd maintenance pods that use the etcddefrag:gke_master_etcddefrag_20210211.00_p0 image are affected. The `etcddefrag` container opens a new connection to the etcd server during each defrag cycle, and the old connections are not cleaned up.
Workaround:
Option 1: Upgrade to the latest patch version from 1.8 to 1.11, which contains the fix.
Option 2: If you are using a patch version earlier than 1.9.6 or 1.10.3, you need to scale down the etcd-maintenance pod for the admin and user clusters:
Health checks of user cluster control plane pods are missed
Both the cluster health controller and the gkectl diagnose cluster command perform a set of health checks, including pod health checks across namespaces. However, they skip the user control plane pods by mistake. If you use the control plane v2 mode, this doesn't affect your cluster.
Workaround:
This doesn't affect any workload or cluster management. If you want to check the health of the control plane pods, you can run the following commands:
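A minimal sketch of such a check, assuming the control plane v1 layout where the user control plane pods run in the admin cluster in a namespace named after the user cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get pods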
1.6 and 1.7 admin cluster upgrades may be affected by the k8s.gcr.io -> registry.k8s.io redirect
Kubernetes redirected traffic from k8s.gcr.io to registry.k8s.io on March 20, 2023. In Google Distributed Cloud 1.6.x and 1.7.x, the admin cluster upgrades use the container image k8s.gcr.io/pause:3.2. If you use a proxy for your admin workstation, the proxy doesn't allow registry.k8s.io, and the container image k8s.gcr.io/pause:3.2 is not cached locally, the admin cluster upgrades fail when pulling the container image.
Workaround:
Add registry.k8s.io to the allowlist of the proxy for your admin workstation.
This is because the Seesaw group file already exists, and the preflight check
tries to validate a non-existent Seesaw load balancer.
Workaround:
Remove the existing seesaw group file for this cluster. The file name
is seesaw-for-gke-admin.yaml for the admin cluster, and
seesaw-for-{CLUSTER_NAME}.yaml for a user cluster.
Networking
1.14
Application timeouts caused by conntrack table insertion failures
Google Distributed Cloud version 1.14 is susceptible to netfilter
connection tracking (conntrack) table insertion failures when using
Ubuntu or COS operating system images. Insertion failures lead to random
application timeouts and can occur even when the conntrack table has room
for new entries. The failures are caused by changes in
kernel 5.15 and higher that restrict table insertions based on chain
length.
To see if you are affected by this issue, you can check the in-kernel
connection tracking system statistics on each node with the following
command:
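For example, assuming the conntrack command-line tool is available on the node:
sudo conntrack -S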
If a chaintoolong value in the response is a non-zero
number, you're affected by this issue.
Workaround
The short-term mitigation is to increase the size of both the netfilter
hash table (nf_conntrack_buckets) and the netfilter
connection tracking table (nf_conntrack_max). Use the
following commands on each cluster node to increase the size of the
tables:
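A minimal sketch of the change, using the standard netfilter sysctl keys for these tables (on older kernels the bucket count may instead need to be written to /sys/module/nf_conntrack/parameters/hashsize):
sudo sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
sudo sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE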
Replace TABLE_SIZE with the new size in bytes. The
default table size value is 262144. We suggest that you set a
value equal to 65,536 times the number of cores on the node. For example,
if your node has eight cores, set the table size to 524288.
Networking
1.13.0-1.13.2
calico-typha or anetd-operator crash loop on Windows nodes with Controlplane V2
With
Controlplane V2 enabled, calico-typha or anetd-operator might be scheduled onto Windows nodes and get into a crash loop.
The reason is that the two deployments tolerate all taints, including the Windows node taint.
Workaround:
Either upgrade to 1.13.3+, or run the following commands to edit the `calico-typha` or `anetd-operator` deployment:
# If dataplane v2 is not used.
kubectl edit deployment -n kube-system calico-typha --kubeconfig USER_CLUSTER_KUBECONFIG
# If dataplane v2 is used.
kubectl edit deployment -n kube-system anetd-operator --kubeconfig USER_CLUSTER_KUBECONFIG
Remove the following spec.template.spec.tolerations:
User cluster private registry credential file cannot be loaded
You might not be able to create a user cluster if you specify the
privateRegistry section with credential fileRef.
Preflight might fail with the following message:
[FAILURE] Docker registry access: Failed to login.
Workaround:
If you did not intend to specify the field, or you want to use the same
private registry credential as the admin cluster, you can remove or
comment out the privateRegistry section in your user cluster
config file.
If you want to use a specific private registry credential for your
user cluster, you may temporarily specify the privateRegistry
section this way:
(NOTE: This is only a temporary fix, and these fields are already
deprecated. Consider using the credential file when upgrading to 1.14.3+.)
Operations
1.10+
Cloud Service Mesh and other service meshes not compatible with Dataplane v2
Dataplane V2 takes over load balancing and creates a kernel socket instead of a packet-based DNAT. This means that Cloud Service Mesh
cannot do packet inspection, because the Pod is bypassed and never uses iptables.
In kube-proxy free mode, this manifests as loss of connectivity or incorrect traffic routing for services with Cloud Service Mesh, because the sidecar cannot do packet inspection.
This issue is present on all versions of Google Distributed Cloud 1.10; however, some newer 1.10 versions (1.10.2+) have a workaround.
Workaround:
Either upgrade to 1.11 for full compatibility or if running 1.10.2 or later, run:
kube-controller-manager might detach persistent volumes
forcefully after 6 minutes
kube-controller-manager might time out when detaching
PVs/PVCs after 6 minutes, and then forcefully detach them. Detailed logs
from kube-controller-manager show events similar to the
following:
$ cat kubectl_logs_kube-controller-manager-xxxx | grep "DetachVolume started" | grep expired
kubectl_logs_kube-controller-manager-gke-admin-master-4mgvr_--container_kube-controller-manager_--kubeconfig_kubeconfig_--request-timeout_30s_--namespace_kube-system_--timestamps:2023-01-05T16:29:25.883577880Z W0105 16:29:25.883446 1 reconciler.go:224] attacherDetacher.DetachVolume started for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f"
This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
To verify the issue, log into the node and run the following commands:
# See all the mounting points with disks
lsblk -f
# See some ext4 errors
sudo dmesg -T
In the kubelet log, errors like the following are displayed:
Error: GetDeviceMountRefs check failed for volume "pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^126f913b-4029-4055-91f7-beee75d5d34a") on node "sandbox-corp-ant-antho-0223092-03-u-tm04-ml5m8-7d66645cf-t5q8f" :
the device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8bb4780b-ba8e-45f4-a95b-19397a66ce16/globalmount
Workaround:
Connect to the affected node using SSH and reboot the node.
Upgrades, Updates
1.12+, 1.13+, 1.14+
Cluster upgrade is stuck if 3rd party CSI driver is used
You might not be able to upgrade a cluster if you use a third-party CSI
driver. The gkectl diagnose cluster command might return the
following error:
"virtual disk "kubernetes.io/csi/csi.netapp.io^pvc-27a1625f-29e3-4e4f-9cd1-a45237cc472c" IS NOT attached to machine "cluster-pool-855f694cc-cjk5c" but IS listed in the Node.Status"
Workaround:
Perform the upgrade using the --skip-validation-all
option.
Operation
1.10+, 1.11+, 1.12+, 1.13+, 1.14+
gkectl repair admin-master creates the admin master VM
without upgrading its VM hardware version
The admin master node created via gkectl repair admin-master
may use a lower VM hardware version than expected. When the issue happens,
you see the following error in the gkectl diagnose cluster
report:
CSIPrerequisites [VM Hardware]: The current VM hardware versions are lower than vmx-15 which is unexpected. Please contact Anthos support to resolve this issue.
Workaround:
Shut down the admin master node, follow
https://kb.vmware.com/s/article/1003746
to upgrade the node to the expected version described in the error
message, and then start the node.
Operating system
1.10+, 1.11+, 1.12+, 1.13+, 1.14+, 1.15+, 1.16+
VM releases DHCP lease on shutdown/reboot unexpectedly, which may
result in IP changes
In systemd v244, systemd-networkd has a
default behavior change
on the KeepConfiguration configuration. Before this change,
VMs did not send a DHCP lease release message to the DHCP server on
shutdown or reboot. After this change, VMs send such a message and
return the IPs to the DHCP server. As a result, the released IP may be
reallocated to a different VM and/or a different IP may be assigned to the
VM, resulting in IP conflict (at Kubernetes level, not vSphere level)
and/or IP change on the VMs, which can break the clusters in various ways.
For example, you may see the following symptoms.
vCenter UI shows that no VMs use the same IP, but kubectl get
nodes -o wide returns nodes with duplicate IPs.
NAME STATUS AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready 28h v1.22.8-gke.204 10.180.85.130 10.180.85.130 Ubuntu 20.04.4 LTS 5.4.0-1049-gkeop containerd://1.5.13
node2 NotReady 71d v1.22.8-gke.204 10.180.85.130 10.180.85.130 Ubuntu 20.04.4 LTS 5.4.0-1049-gkeop containerd://1.5.13
New nodes fail to start due to calico-node error
2023-01-19T22:07:08.817410035Z 2023-01-19 22:07:08.817 [WARNING][9] startup/startup.go 1135: Calico node 'node1' is already using the IPv4 address 10.180.85.130.
2023-01-19T22:07:08.817514332Z 2023-01-19 22:07:08.817 [INFO][9] startup/startup.go 354: Clearing out-of-date IPv4 address from this node IP="10.180.85.130/24"
2023-01-19T22:07:08.825614667Z 2023-01-19 22:07:08.825 [WARNING][9] startup/startup.go 1347: Terminating
2023-01-19T22:07:08.828218856Z Calico node failed to start
Workaround:
Deploy the following DaemonSet on the cluster to revert the
systemd-networkd default behavior change. The VMs that run
this DaemonSet will not release the IPs to the DHCP server on
shutdown/reboot. The IPs will be freed automatically by the DHCP server
when the leases expire.
Component access service account key wiped out after admin cluster
upgraded from 1.11.x
This issue only affects admin clusters that are upgraded
from 1.11.x; it doesn't affect admin clusters that are newly created on
1.12 or later.
After upgrading a 1.11.x cluster to 1.12.x, the
component-access-sa-key field in the
admin-cluster-creds secret is wiped out to empty.
You can check this by running the following command:
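A minimal sketch of such a check (the namespace and kubeconfig placeholder are assumptions):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system get secret admin-cluster-creds -o jsonpath='{.data.component-access-sa-key}'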
If the output is empty, the key has been wiped out.
After the component access service account key has been deleted,
installing new user clusters or upgrading existing user clusters
fails. The following are some error messages you might encounter:
Slow validation preflight failure with error message: "Failed
to create the test VMs: failed to get service account key: service
account is not configured."
Prepare by gkectl prepare failed with error message:
"Failed to prepare OS images: dialing: unexpected end of JSON
input"
If you are upgrading a 1.13 user cluster using the Google Cloud
Console or the gcloud CLI, when you run
gkectl update admin --enable-preview-user-cluster-central-upgrade
to deploy the upgrade platform controller, the command fails
with the message: "failed to download bundle to disk: dialing:
unexpected end of JSON input" (You can see this message
in the status field in
the output of kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get onprembundle -oyaml).
Workaround:
Add the component access service account key back into the secret
manually by running the following command:
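A minimal sketch, assuming your component access service account key is available as a local JSON file on a Linux admin workstation (the file name is a placeholder):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system patch secret admin-cluster-creds --type merge \
    -p "{\"data\":{\"component-access-sa-key\":\"$(base64 -w0 COMPONENT_ACCESS_SA_KEY_FILE.json)\"}}"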
Cluster autoscaler does not work when Controlplane V2 is enabled
For user clusters created with Controlplane V2
enabled, node pools with autoscaling enabled always use their autoscaling.minReplicas in the user-cluster.yaml. The log of the cluster-autoscaler pod
shows an error similar to the following:
Include individual IPs in the IP block file until upgrading to a version with the fix: 1.12.5, 1.13.4, 1.14.1+.
Upgrades, Updates
1.14.0-1.14.1
OS image type update in the admin-cluster.yaml doesn't wait for user control plane machines to be re-created
When updating the control plane OS image type in admin-cluster.yaml, if the corresponding user cluster was created via Controlplane V2, the user control plane machines might not have finished their re-creation when the gkectl command finishes.
Workaround:
After the update is finished, keep waiting for the user control plane machines to also finish their re-creation by monitoring their node OS image types using kubectl --kubeconfig USER_KUBECONFIG get nodes -owide. For example, if updating from Ubuntu to COS, wait for all the control plane machines to completely change from Ubuntu to COS even after the update command completes.
Operation
1.10, 1.11, 1.12, 1.13, 1.14.0
Pod create or delete errors due to Calico CNI service account auth token
issue
An issue with Calico in Google Distributed Cloud 1.14.0
causes Pod creation and deletion to fail with the following error message in
the output of kubectl describe pods:
error getting ClusterInformation: connection is unauthorized: Unauthorized
This issue is only observed 24 hours after the cluster is
created or upgraded to 1.14 using Calico.
Admin clusters always use Calico. For user clusters, there is
a config field `enableDataplaneV2` in user-cluster.yaml; if that field is
set to `false`, or not specified, the user cluster uses Calico.
The nodes' install-cni container creates a kubeconfig with a
token that is valid for 24 hours. This token needs to be periodically
renewed by the calico-node Pod. The calico-node
Pod is unable to renew the token as it doesn't have access to the directory
that contains the kubeconfig file on the node.
Workaround:
This issue was fixed in Google Distributed Cloud version 1.14.1. Upgrade to
this or a later version.
If you can't upgrade right away, apply the following patch on the
calico-node DaemonSet in your admin and user cluster:
ADMIN_CLUSTER_KUBECONFIG: the path
of the admin cluster kubeconfig file.
USER_CLUSTER_CONFIG_FILE: the path
of your user cluster configuration file.
Installation
1.12.0-1.12.4, 1.13.0-1.13.3, 1.14.0
IP block validation fails when using CIDR
Cluster creation fails despite the user having the proper configuration. The user sees creation failing because the cluster doesn't have enough IPs.
Workaround:
Split the CIDR into several smaller CIDR blocks; for example, 10.0.0.0/30 becomes 10.0.0.0/31 and 10.0.0.2/31. As long as there are N+1 CIDRs, where N is the number of nodes in the cluster, this should suffice.
Operation, Upgrades, Updates
1.11.0 - 1.11.1, 1.10.0 - 1.10.4, 1.9.0 - 1.9.6
Admin cluster backup does not include the always-on secrets encryption keys and configuration
When the always-on secrets encryption feature is enabled along with cluster backup, the admin cluster backup fails to include the encryption keys and configuration required by always-on secrets encryption feature. As a result, repairing the admin master with this backup using gkectl repair admin-master --restore-from-backup causes the following error:
Use the gkectl binary of the latest available patch version for the corresponding minor version to perform the admin cluster backup after critical cluster operations. For example, if the cluster is running a 1.10.2 version, use the 1.10.5 gkectl binary to perform a manual admin cluster backup as described in Backup and Restore an admin cluster with gkectl.
Operation, Upgrades, Updates
1.10+
Recreating the admin master VM with a new boot disk (for example, gkectl repair admin-master) fails if the always-on secrets encryption feature was enabled using the `gkectl update` command.
If the always-on secrets encryption feature is not enabled at cluster creation, but enabled later using the gkectl update operation, then gkectl repair admin-master fails to repair the admin cluster control plane node. We recommend enabling the always-on secrets encryption feature at cluster creation. There is no current mitigation.
Upgrades, Updates
1.10
Upgrading the first user cluster from 1.9 to 1.10 recreates nodes in other user clusters
Upgrading the first user cluster from 1.9 to 1.10 could recreate nodes in other user clusters under the same admin cluster. The recreation is performed in a rolling fashion.
The disk_label was removed from MachineTemplate.spec.template.spec.providerSpec.machineVariables, which triggered an update on all MachineDeployments unexpectedly.
To understand the root cause, you need to ssh to the node that has the symptom and run commands like sudo journalctl --utc -u docker or sudo journalctl -x
Self-deployed GMP components not preserved after upgrading to version 1.12
If you are using a Google Distributed Cloud version below 1.12, and have manually set up Google-managed Prometheus (GMP) components in the gmp-system
namespace for your cluster, the components are not preserved when you
upgrade to version 1.12.x.
From version 1.12, GMP components in the gmp-system namespace and the CRDs are managed by the stackdriver
object, with the enableGMPForApplications flag set to false by
default. If you manually deployed GMP components in the namespace prior to upgrading to 1.12, the resources are deleted by stackdriver.
Workaround:
Back up all existing PodMonitoring custom resources (CRs).
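A minimal sketch of such a backup, assuming the PodMonitoring CRs use the monitoring.googleapis.com API group:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get podmonitorings.monitoring.googleapis.com --all-namespaces -o yaml > podmonitorings-backup.yaml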
Missing ClusterAPI objects in cluster snapshot system scenario
In the system scenario, the cluster snapshot doesn't include any resources under the default namespace.
However, some Kubernetes resources like Cluster API objects that are under this namespace contain useful debugging information. The cluster snapshot should include them.
Workaround:
You can manually run the following commands to collect the debugging information.
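A minimal sketch of collecting the Cluster API objects from the default namespace (the kubeconfig placeholder and the exact set of resource kinds are assumptions):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n default get clusters.cluster.k8s.io,machines.cluster.k8s.io,machinedeployments.cluster.k8s.io,machinesets.cluster.k8s.io -o yaml > clusterapi-default-namespace.yaml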
kubevols is the default directory for the vSphere in-tree driver. When no PVC/PV objects have been created, you might hit a bug where node drain gets stuck finding kubevols, because the current implementation assumes that the kubevols directory always exists.
Workaround:
Create the directory kubevols in the datastore where the node is created. This is defined in the vCenter.datastore field in the user-cluster.yaml or admin-cluster.yaml files.
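A minimal sketch of creating the directory with govc, assuming govc is configured through GOVC_URL and related environment variables:
govc datastore.mkdir -ds DATASTORE_NAME kubevols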
Configuration
1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, 1.14
Cluster Autoscaler clusterrolebinding and clusterrole are deleted after deleting a user cluster.
On user cluster deletion, the corresponding clusterrole and clusterrolebinding for cluster-autoscaler are also deleted. This affects all other user clusters on the same admin cluster with cluster autoscaler enabled. This is because the same clusterrole and clusterrolebinding are used for all cluster autoscaler pods within the same admin cluster.
Apply the following clusterrole and clusterrolebinding to the admin cluster if they are missing. Add the service account subjects to the clusterrolebinding for each user cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
- apiGroups: ["cluster.k8s.io"]
  resources: ["clusters"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["cluster.k8s.io"]
  resources: ["machinesets", "machinedeployments", "machinedeployments/scale", "machines"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["onprem.cluster.gke.io"]
  resources: ["onpremuserclusters"]
  verbs: ["get", "list", "watch"]
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  resourceNames: ["cluster-autoscaler"]
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups:
  - ""
  resources:
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - ""
  resources:
  - pods/eviction
  verbs: ["create"]
# read-only access to cluster state
- apiGroups: [""]
  resources: ["services", "replicationcontrollers", "persistentvolumes", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["daemonsets", "replicasets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes"]
  verbs: ["get", "list", "watch"]
# misc access
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status"]
  verbs: ["get", "update", "patch", "delete"]
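A minimal ClusterRoleBinding sketch to accompany the ClusterRole above; the service account name and per-user-cluster namespace are assumptions, so adjust the subjects to the actual cluster-autoscaler service accounts in your admin cluster (one subject per user cluster):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler   # assumed service account name
  namespace: USER_CLUSTER_NAME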
A known issue could cause gkectl check-config to fail if it is run without first running gkectl prepare. This is confusing because we suggest running the command before running gkectl prepare.
The symptom is that the gkectl check-config command fails with the
following error message:
For the update to take effect, the machines need to be recreated after the failed update.
For admin cluster update, user master and admin addon nodes need to be recreated
For user cluster update, user worker nodes need to be recreated
To recreate user worker nodes:
Option 1 In the 1.11 version of the documentation, follow
update a node pool and change the CPU or memory to trigger a rolling recreation of the nodes.
Option 2 Use kubectl delete to recreate the machines one at a time (see the sketch after these options).
To recreate user master and admin add-on nodes:
Option 1 In the 1.11 version of the documentation, follow
resize control plane and change the CPU or memory to trigger a rolling recreation of the nodes.
Option 2 Use kubectl delete to recreate the machines one at a time (see the sketch after these options).
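A minimal sketch of the kubectl delete option (CLUSTER_KUBECONFIG and MACHINE_NAME are placeholders; use the kubeconfig of the cluster that owns the machine objects):
# List the machine objects.
kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
# Delete one machine at a time and wait for its replacement to become Ready before deleting the next.
kubectl --kubeconfig CLUSTER_KUBECONFIG delete machine MACHINE_NAME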
Nodes fail to register if configured hostname contains a period
Node registration fails during cluster creation, upgrade, update and
node auto repair, when ipMode.type is static and
the configured hostname in the
IP block file contains one
or more periods. In this case, Certificate Signing Requests (CSR) for a
node are not automatically approved.
To see pending CSRs for a node, run the following command:
kubectl get csr -A -o wide
Check the following logs for error messages:
View the logs in the admin cluster for the
clusterapi-controller-manager container in the
clusterapi-controllers Pod:
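A minimal sketch of fetching those logs (it assumes the clusterapi-controllers Deployment runs in the user cluster's namespace in the admin cluster):
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME logs deployment/clusterapi-controllers -c clusterapi-controller-manager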
ADMIN_CLUSTER_KUBECONFIG is the admin cluster's
kubeconfig file.
USER_CLUSTER_NAME is the name of the user cluster.
Here is an example of error messages you might see: "msg"="failed
to validate token id" "error"="failed to find machine for node
node-worker-vm-1" "validate"="csr-5jpx9"
View the kubelet logs on the problematic node:
journalctl -u kubelet
Here is an example of error messages you might see: "Error getting
node" err="node \"node-worker-vm-1\" not found"
If you specify a domain name in the hostname field of an IP block file,
any characters following the first period will be ignored. For example, if
you specify the hostname as bob-vm-1.bank.plc, the VM
hostname and node name will be set to bob-vm-1.
When node ID verification is enabled, the CSR approver compares the
node name with the hostname in the Machine spec, and fails to reconcile
the name. The approver rejects the CSR, and the node fails to
bootstrap.
Workaround:
User cluster
Disable node ID verification by completing the following steps:
Add the following fields in your user cluster configuration file:
VMware has a 64k vApp property size limit. In the identified versions,
the data passed via vApp property is close to the limit. When the private
registry certificate contains a certificate bundle, it may cause the final
data to exceed the 64k limit.
Workaround:
Only include the required certificates in the private registry
certificate file configured in privateRegistry.caCertPath in
the admin cluster config file.
Or upgrade to a version with the fix when available.
Networking
1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0
NetworkGatewayNodes marked unhealthy from concurrent
status update conflict
In networkgatewaygroups.status.nodes, some nodes switch
between NotHealthy and Up.
Logs for the ang-daemon Pod running on that node reveal
repeated errors:
2022-09-16T21:50:59.696Z ERROR ANGd Failed to report status {"angNode": "kube-system/my-node", "error": "updating Node CR status: sending Node CR update: Operation cannot be fulfilled on networkgatewaynodes.networking.gke.io \"my-node\": the object has been modified; please apply your changes to the latest version and try again"}
The NotHealthy status prevents the controller from
assigning additional floating IPs to the node. This can result in higher
burden on other nodes or a lack of redundancy for high availability.
Dataplane activity is otherwise not affected.
Contention on the networkgatewaygroup object causes some
status updates to fail due to a fault in retry handling. If too many
status updates fail, ang-controller-manager sees the node as
past its heartbeat time limit and marks the node NotHealthy.
The fault in retry handling has been fixed in later versions.
Workaround:
Upgrade to a fixed version, when available.
Upgrades, Updates
1.12.0-1.12.2, 1.13.0
Race condition blocks machine object deletion during an update or
upgrade
A known issue could cause the cluster upgrade or update to be
stuck waiting for the old machine object to be deleted. This is because
the finalizer cannot be removed from the machine object. This affects any
rolling update operation for node pools.
The symptom is that the gkectl command times out with the
following error message:
E0821 18:28:02.546121 61942 console.go:87] Exit with error:
E0821 18:28:02.546184 61942 console.go:87] error: timed out waiting for the condition, message: Node pool "pool-1" is not ready: ready condition is not true: CreateOrUpdateNodePool: 1/3 replicas are updated
Check the status of OnPremUserCluster 'cluster-1-gke-onprem-mgmt/cluster-1' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
In clusterapi-controller Pod logs, the errors are like
below:
$ kubectl logs clusterapi-controllers-[POD_NAME_SUFFIX] -n cluster-1
-c vsphere-controller-manager --kubeconfig [ADMIN_KUBECONFIG]
| grep "Error removing finalizer from machine object"
[...]
E0821 23:19:45.114993 1 machine_controller.go:269] Error removing finalizer from machine object cluster-1-pool-7cbc496597-t5d5p; Operation cannot be fulfilled on machines.cluster.k8s.io "cluster-1-pool-7cbc496597-t5d5p": the object has been modified; please apply your changes to the latest version and try again
The error repeats for the same machine for several minutes. This also happens
for successful runs without this issue; most of the time it goes
through quickly, but in some rare cases it can be stuck at this race
condition for several hours.
The issue is that the underlying VM is already deleted in vCenter, but
the corresponding machine object cannot be removed; it is stuck at the
finalizer removal due to very frequent updates from other controllers.
This can cause the gkectl command to time out, but the
controller keeps reconciling the cluster, so the upgrade or update process
eventually completes.
Workaround:
We have prepared several different mitigation options for this issue;
which one is suitable depends on your environment and requirements.
Option 1: Wait for the upgrade to eventually complete by
itself.
Based on the analysis and reproduction in your environment, the upgrade
can eventually finish by itself without any manual intervention. The
caveat of this option is that it's uncertain how long it will take for
the finalizer removal to go through for each machine object. It can go
through immediately if lucky enough, or it could last for several hours
if the machineset controller reconcile is too fast and the machine
controller never gets a chance to remove the finalizer in between the
reconciliations.
The good thing is that this option doesn't need any action from your
side, and the workloads won't be disrupted. It just needs a longer time
for the upgrade to finish.
Option 2: Apply auto repair annotation to all the old machine
objects.
The machineset controller filters out the machines that have the
auto repair annotation and a non-zero deletion timestamp, and won't
keep issuing delete calls on those machines. This helps avoid the
race condition.
The downside is that the pods on the machines are deleted directly
instead of evicted, which means the PDB configuration isn't respected;
this might cause downtime for your workloads.
The command for getting all machine names:
kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
The command for applying auto repair annotation for each machine:
If you encounter this issue and the upgrade or update still can't
complete after a long time,
contact
our support team for mitigations.
Installation, Upgrades, Updates
1.10.2, 1.11, 1.12, 1.13
gkectl prepare OS image validation preflight failure
gkectl prepare command failed with:
- Validation Category: OS Images
- [FAILURE] Admin cluster OS images exist: os images [os_image_name] don't exist, please run `gkectl prepare` to upload os images.
The preflight checks of gkectl prepare included an
incorrect validation.
Workaround:
Run the same command with an additional flag
--skip-validation-os-images.
Installation
1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13
vCenter URL with https:// or http:// prefix
may cause cluster startup failure
Admin cluster creation failed with:
Exit with error:
Failed to create root cluster: unable to apply admin base bundle to external cluster: error: timed out waiting for the condition, message:
Failed to apply external bundle components: failed to apply bundle objects from admin-vsphere-credentials-secret 1.x.y-gke.z to cluster external: Secret "vsphere-dynamic-credentials" is invalid:
[data[https://xxx.xxx.xxx.username]: Invalid value: "https://xxx.xxx.xxx.username": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+'), data[https://xxx.xxx.xxx.password]:
Invalid value: "https://xxx.xxx.xxx.password": a valid config key must consist of alphanumeric characters, '-', '_' or '.'
(e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')]
The URL is used as part of a Secret key, which doesn't
support "/" or ":".
Workaround:
Remove https:// or http:// prefix from the
vCenter.Address field in the admin cluster or user cluster
config yaml.
Installation, Upgrades, Updates
1.10, 1.11, 1.12, 1.13
gkectl prepare panic on util.CheckFileExists
gkectl prepare can panic with the following
stacktrace:
gkectl repair admin-master and resumable admin upgrade do
not work together
After a failed admin cluster upgrade attempt, don't run gkectl
repair admin-master. Doing so may cause subsequent admin upgrade
attempts to fail with issues such as admin master power on failure or the
VM being inaccessible.
Workaround:
If you've already encountered this failure scenario,
contact support.
Upgrades, Updates
1.10, 1.11
Resumed admin cluster upgrade can lead to missing admin control plane
VM template
If the admin control plane machine isn't recreated after a resumed
admin cluster upgrade attempt, the admin control plane VM template is
deleted. The admin control plane VM template is the template of the admin
master that is used to recover the control plane machine with
gkectl
repair admin-master.
Workaround:
The admin control plane VM template will be regenerated during the next
admin cluster upgrade.
Operating system
1.12, 1.13
cgroup v2 could affect workloads
In version 1.12.0, cgroup v2 (unified) is enabled by default for
Container Optimized OS (COS) nodes. This could potentially cause
instability for your workloads in a COS cluster.
Workaround:
We switched back to cgroup v1 (hybrid) in version 1.12.1. If you are
using COS nodes, we recommend that you upgrade to version 1.12.1 as soon
as it is released.
Identity
1.10, 1.11, 1.12, 1.13
ClientConfig custom resource
gkectl update reverts any manual changes that you have
made to the ClientConfig custom resource.
Workaround:
We strongly recommend that you back up the ClientConfig resource after
every manual change.
Edit the cert-manager-cainjector Deployment to disable
leader election, because we only have one replica running. It isn't
required for a single replica:
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit -n kube-system deployment cert-manager-cainjector
The relevant YAML snippet for the cert-manager-cainjector
deployment should look like the following example:
After each upgrade, the changes are reverted. Perform the same
steps again to mitigate the issue until this is fixed in a future
release.
VMware
1.10, 1.11, 1.12, 1.13
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter, for versions lower than 7.0U2, is restarted after an
upgrade or otherwise, the network name in the VM information from vCenter is
incorrect, which results in the machine being in an Unavailable
state. This eventually leads to the nodes being auto-repaired to create
new ones.
The issue is fixed in vCenter versions 7.0U2 and above.
For lower versions, right-click the host, and then select
Connection > Disconnect. Next, reconnect, which forces an update
of the VM's portgroup.
Operating system
1.10, 1.11, 1.12, 1.13
SSH connection closed by remote host
For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS
images are hardened with
CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is
configured", /etc/ssh/sshd_config has the following
settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5
minutes of idle time. However, the ClientAliveCountMax 0
value causes unexpected behavior. When you use an SSH session on the
admin workstation or a cluster node, the SSH connection might be
disconnected even if your SSH client is not idle, such as when running a
time-consuming command, and your command could get terminated with the
following message:
Connection to [IP] closed by remote host.
Connection to [IP] closed.
Workaround:
You can either:
Use nohup to prevent your command being terminated on
SSH disconnection,
In 1.13 releases, monitoring-operator installs
cert-manager in the cert-manager namespace. If for certain
reasons you need to install your own cert-manager, follow these
instructions to avoid conflicts:
You only need to apply this workaround once for each cluster, and the
changes are preserved across cluster upgrades.
Note: One common symptom of installing your own cert-manager
is that the cert-manager version or image (for example
v1.7.2) may revert back to its older version. This is caused by
monitoring-operator trying to reconcile the
cert-manager, and reverting the version in the process.
Workaround:
Avoid conflicts during upgrade
Uninstall your version of cert-manager. If you defined
your own resources, you may want to back them up.
You can skip this step if you are using
upstream default cert-manager installation, or you are sure your
cert-manager is installed in the cert-manager namespace.
Otherwise, copy the metrics-ca cert-manager.io/v1
Certificate and the metrics-pki.cluster.local Issuer
resources from cert-manager to the cluster resource
namespace of your installed cert-manager.
In general, you shouldn't need to re-install cert-manager in admin
clusters because admin clusters only run Google Distributed Cloud control
plane workloads. In the rare cases that you also need to install your own
cert-manager in admin clusters, follow these instructions
to avoid conflicts. Note that if you are an Apigee customer and you
only need cert-manager for Apigee, you don't need to run the admin
cluster commands.
You can skip this step if you are using
upstream default cert-manager installation, or you are sure your
cert-manager is installed in the cert-manager namespace.
Otherwise, copy the metrics-ca cert-manager.io/v1
Certificate and the metrics-pki.cluster.local Issuer
resources from cert-manager to the cluster resource
namespace of your installed cert-manager.
False positives in docker, containerd, and runc vulnerability scanning
The Docker, containerd, and runc in the Ubuntu OS images shipped with
Google Distributed Cloud are pinned to special versions using
Ubuntu PPA. This ensures
that any container runtime changes will be qualified by
Google Distributed Cloud before each release.
However, the special versions are unknown to the
Ubuntu CVE
Tracker, which is used as the vulnerability feed by various CVE
scanning tools. Therefore, you will see false positives in Docker,
containerd, and runc vulnerability scanning results.
For example, you might see the following false positives from your CVE
scanning results. These CVEs are already fixed in the latest patch
versions of Google Distributed Cloud.
Network connection between admin and user cluster might be unavailable
for a short time during non-HA cluster upgrade
If you are upgrading non-HA clusters from 1.9 to 1.10, you might notice
that the kubectl exec, kubectl log and webhook
against user clusters might be unavailable for a short time. This downtime
can be up to one minute. This happens because the incoming request
(kubectl exec, kubectl log and webhook) is handled by kube-apiserver for
the user cluster. User kube-apiserver is a
Statefulset. In a non-HA cluster, there is only one replica for the
Statefulset. So during upgrade, there is a chance that the old
kube-apiserver is unavailable while the new kube-apiserver is not yet
ready.
Workaround:
This downtime only happens during the upgrade process. If you want a
shorter downtime during upgrades, we recommend that you switch to
HA
clusters.
Installation, Upgrades, Updates
1.10, 1.11, 1.12, 1.13
Konnectivity readiness check failed in HA cluster diagnose after
cluster creation or upgrade
If you are creating or upgrading an HA cluster and notice that the konnectivity
readiness check failed in cluster diagnose, in most cases it won't
affect the functionality of Google Distributed Cloud (kubectl exec, kubectl
log, and webhook). This happens because one or two of the
konnectivity replicas might sometimes be unready for a period of time due to
unstable networking or other issues.
Workaround:
Konnectivity recovers by itself. Wait 30 minutes to 1 hour
and rerun cluster diagnose.
Operating system
1.7, 1.8, 1.9, 1.10, 1.11
/etc/cron.daily/aide CPU and memory spike issue
Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS
images are hardened with
CIS L1 Server
Benchmark.
As a result, the cron script /etc/cron.daily/aide is
installed so that an aide check is scheduled to ensure
that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is
regularly checked" is followed.
The cron job runs daily at 6:25 AM UTC. Depending on the number of
files on the filesystem, you may experience CPU and memory usage spikes
around that time that are caused by this aide process.
Workaround:
If the spikes are affecting your workload, you can disable the daily
cron job:
sudo chmod -x /etc/cron.daily/aide
Networking
1.10, 1.11, 1.12, 1.13
Load balancers and NSX-T stateful distributed firewall rules interact
unpredictably
When deploying Google Distributed Cloud version 1.9 or later with the
Seesaw bundled load balancer in an environment that
uses NSX-T stateful distributed firewall rules,
stackdriver-operator might fail to create the
gke-metrics-agent-conf ConfigMap and cause
gke-connect-agent Pods to go into a crash loop.
The underlying issue is that the stateful NSX-T distributed firewall
rules terminate the connection from a client to the user cluster API
server through the Seesaw load balancer because Seesaw uses asymmetric
connection flows. The integration issues with NSX-T distributed firewall
rules affect all Google Distributed Cloud releases that use Seesaw. You
might see similar connection problems on your own applications when they
create large Kubernetes objects whose sizes are bigger than 32K.
Workaround:
In the 1.13 version of the documentation, follow
these instructions to disable NSX-T distributed firewall rules, or to
use stateless distributed firewall rules for Seesaw VMs.
If your clusters use a manual load balancer, follow
these instructions to configure your load balancer to reset client
connections when it detects a backend node failure. Without this
configuration, clients of the Kubernetes API server might stop responding
for several minutes when a server instance goes down.
Logging and monitoring
1.10, 1.11, 1.12, 1.13, 1.14, 1.15
Unexpected monitoring billing
For Google Distributed Cloud versions 1.10 to 1.15, some customers have
found unexpectedly high billing for Metrics volume on the
Billing page. This issue affects you only when all of the
following circumstances apply:
Application logging and monitoring is enabled (enableStackdriverForApplications=true)
Application Pods have the prometheus.io/scrap=true
annotation. (Installing Cloud Service Mesh can also add this annotation.)
To confirm whether you are affected by this issue,
list your
user-defined metrics. If you see billing for unwanted metrics with the
external.googleapis.com/prometheus name prefix, and you also see enableStackdriverForApplications set to true in the response of kubectl -n kube-system get stackdriver stackdriver -o yaml, then
this issue applies to you.
Workaround:
If you are affected by this issue, we recommend that you upgrade your
clusters to version 1.12 or later, stop using the enableStackdriverForApplications flag, and switch to the new application monitoring solution, Managed Service for Prometheus, which no longer relies on the prometheus.io/scrap=true annotation. With the new solution, you can also control logs and metrics collection separately for your applications, with the enableCloudLoggingForApplications and enableGMPForApplications flags, respectively.
To stop using the enableStackdriverForApplications flag, open the `stackdriver` object for editing:
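For example, a minimal sketch (USER_CLUSTER_KUBECONFIG is a placeholder for your user cluster kubeconfig path):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system edit stackdriver stackdriver
In the editor, set enableStackdriverForApplications to false (or remove the field), then save your changes.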
The Google Distributed Cloud installer can fail if custom roles are bound
at the wrong permissions level.
When the role binding is incorrect, creating a vSphere datadisk with
govc hangs and the disk is created with a size equal to 0. To
fix the issue, you should bind the custom role at the vSphere vCenter
level (root).
Workaround:
If you want to bind the custom role at the DC level (or lower than
root), you also need to bind the read-only role to the user at the root
vCenter level.
gke-metrics-agent has frequent CrashLoopBackOff errors
For Google Distributed Cloud version 1.10 and above, `gke-metrics-agent`
DaemonSet has frequent CrashLoopBackOff errors when
`enableStackdriverForApplications` is set to `true` in the `stackdriver`
object.
Workaround:
To mitigate this issue, disable application metrics collection by
running the following commands. These commands will not disable
application logs collection.
To prevent the following changes from reverting, scale down
stackdriver-operator:
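For example, a sketch that assumes stackdriver-operator runs as a Deployment in the kube-system namespace (USER_CLUSTER_KUBECONFIG is a placeholder):
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment stackdriver-operator --replicas=0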
If deprecated metrics are used in your OOTB dashboards, you will see
some empty charts. To find deprecated metrics in the Monitoring
dashboards, run the following commands:
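One possible way to search your dashboards for deprecated metric names, sketched here with gcloud and grep; the metric names shown are examples of kube-state-metrics v1.9 names removed in v2, not an exhaustive list:
gcloud monitoring dashboards list --format=json \
  | grep -E 'kube_node_status_allocatable_cpu_cores|kube_node_status_allocatable_memory_bytes'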
Delete "GKE on-prem node status" in the Google Cloud Monitoring
dashboard. Reinstall "GKE on-prem node status" following
these instructions.
Delete "GKE on-prem node utilization" in the Google Cloud Monitoring
dashboard. Reinstall "GKE on-prem node utilization" following
these instructions.
Delete "GKE on-prem vSphere vm health" in the Google Cloud
Monitoring dashboard. Reinstall "GKE on-prem vSphere vm health"
following
these instructions.
This deprecation is due to the upgrade of
kube-state-metrics agent from v1.9 to v2.4, which is required for
Kubernetes 1.22. You can replace all deprecated
kube-state-metrics metrics, which have the prefix
kube_, in your custom dashboards or alerting policies.
Logging and monitoring
1.10, 1.11, 1.12, 1.13
Unknown metric data in Cloud Monitoring
For Google Distributed Cloud version 1.10 and above, the data for
clusters in Cloud Monitoring may contain irrelevant summary metrics
entries such as the following:
To fix this issue, perform the following steps as a workaround. For
versions 1.9.5+, 1.10.2+, and 1.11.0: increase the CPU for gke-metrics-agent
by following steps 1 - 4.
To increase the CPU request for gke-metrics-agent from
10m to 50m and the CPU limit from 100m
to 200m, add the following resourceAttrOverride
section to the stackdriver manifest:
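For example, a sketch of that section; the gke-metrics-agent/gke-metrics-agent key follows the DaemonSet/container format used for resourceAttrOverride elsewhere on this page, and only the CPU values mentioned above are shown:
spec:
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 200m
      requests:
        cpu: 50m
To verify that the change took effect, you can check the rendered DaemonSet, for example:
kubectl -n kube-system get daemonset gke-metrics-agent -o yaml | grep "cpu: 50m"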
The command finds cpu: 50m if your edits have taken effect.
Logging and monitoring
1.11.0-1.11.2, 1.12.0
Missing scheduler and controller-manager metrics in admin cluster
If your admin cluster is affected by this issue, scheduler and
controller-manager metrics are missing. For example, these two metrics are
missing
# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
Upgrade to v1.11.3+, v1.12.1+, or v1.13+.
1.11.0-1.11.2, 1.12.0
Missing scheduler and controller-manager metrics in user cluster
If your user cluster is affected by this issue, scheduler and
controller-manager metrics are missing. For example, these two metrics are
missing:
# scheduler metric example
scheduler_pending_pods
# controller-manager metric example
replicaset_controller_rate_limiter_use
Workaround:
This issue is fixed in Google Distributed Cloud version 1.13.0 and later.
Upgrade your cluster to a version with the fix.
Installation, Upgrades, Updates
1.10, 1.11, 1.12, 1.13
Failure to register admin cluster during creation
If you create an admin cluster for version 1.9.x or 1.10.0, and if the
admin cluster fails to register with the provided gkeConnect
spec during its creation, you will get the following error.
Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
You will still be able to use this admin cluster, but you will get the
following error if you later attempt to upgrade the admin cluster to
version 1.10.y.
failed to migrate to first admin trust chain: failed to parse current version "": invalid version: ""
If this error occurs, follow these steps to fix the cluster
registration issue. After you do this fix, you can then upgrade your admin
cluster.
Run gkectl update admin to register the admin cluster
with the correct service account key.
Create a dedicated service account for patching the
OnPremAdminCluster custom resource.
export KUBECONFIG=ADMIN_CLUSTER_KUBECONFIG

# Create Service Account modify-admin
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: modify-admin
  namespace: kube-system
EOF

# Create ClusterRole
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: modify-admin-role
rules:
- apiGroups:
  - "onprem.cluster.gke.io"
  resources:
  - "onpremadminclusters/status"
  verbs:
  - "patch"
EOF

# Create ClusterRoleBinding for binding the permissions to the modify-admin SA
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: modify-admin-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: modify-admin-role
subjects:
- kind: ServiceAccount
  name: modify-admin
  namespace: kube-system
EOF
Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin
cluster kubeconfig file.
Run these commands to update the OnPremAdminCluster
custom resource.
export KUBECONFIG=ADMIN_CLUSTER_KUBECONFIG
SERVICE_ACCOUNT=modify-admin

SECRET=$(kubectl get serviceaccount ${SERVICE_ACCOUNT} \
  -n kube-system -o json \
  | jq -Mr '.secrets[].name | select(contains("token"))')

TOKEN=$(kubectl get secret ${SECRET} -n kube-system -o json \
  | jq -Mr '.data.token' | base64 -d)

kubectl get secret ${SECRET} -n kube-system -o json \
  | jq -Mr '.data["ca.crt"]' \
  | base64 -d > /tmp/ca.crt

APISERVER=https://$(kubectl -n default get endpoints kubernetes \
  --no-headers | awk '{ print $2 }')

# Find out the admin cluster name and gkeOnPremVersion from the OnPremAdminCluster CR
ADMIN_CLUSTER_NAME=$(kubectl get onpremadmincluster -n kube-system \
  --no-headers | awk '{ print $1 }')

GKE_ON_PREM_VERSION=$(kubectl get onpremadmincluster \
  -n kube-system $ADMIN_CLUSTER_NAME \
  -o=jsonpath='{.spec.gkeOnPremVersion}')

# Create the Status field and set the gkeOnPremVersion in the OnPremAdminCluster CR
curl -H "Accept: application/json" \
  --header "Authorization: Bearer $TOKEN" -XPATCH \
  -H "Content-Type: application/merge-patch+json" \
  --cacert /tmp/ca.crt \
  --data '{"status": {"gkeOnPremVersion": "'$GKE_ON_PREM_VERSION'"}}' \
  $APISERVER/apis/onprem.cluster.gke.io/v1alpha1/namespaces/kube-system/onpremadminclusters/$ADMIN_CLUSTER_NAME/status
Attempt to upgrade the admin cluster again with the
--disable-upgrade-from-checkpoint flag.
If you have experienced this issue with an existing cluster, you can do
one of the following:
Disable GKE Identity Service. Disabling
GKE Identity Service won't remove the deployed
GKE Identity Service binary or remove the
GKE Identity Service ClientConfig. To disable
GKE Identity Service, run this command:
Update the cluster to version 1.9.3 or later, or version 1.10.1 or
later, to upgrade the Connect Agent version.
Networking
1.10, 1.11, 1.12, 1.13
Cisco ACI doesn't work with Direct Server Return (DSR)
Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI
because of data-plane IP learning.
Workaround:
A possible workaround is to disable IP learning by adding the Seesaw IP
address as a L4-L7 Virtual IP in the Cisco Application Policy
Infrastructure Controller (APIC).
You can configure the L4-L7 Virtual IP option by going to Tenant >
Application Profiles > Application EPGs or uSeg EPGs. Failure
to disable IP learning results in IP endpoint flapping between
different locations in the Cisco ACI fabric.
VMware
1.10, 1.11, 1.12, 1.13
vSphere 7.0 Update 3 issues
VMware has recently identified critical issues with the following
vSphere 7.0 Update 3 releases:
vSphere ESXi 7.0 Update 3 (build 18644231)
vSphere ESXi 7.0 Update 3a (build 18825058)
vSphere ESXi 7.0 Update 3b (build 18905247)
vSphere vCenter 7.0 Update 3b (build 18901211)
Workaround:
VMware has since removed these releases. You should upgrade the
ESXi and
vCenter
Servers to a newer version.
Operating system
1.10, 1.11, 1.12, 1.13
Failure to mount emptyDir volume as exec into Pod running
on COS nodes
For Pods running on nodes that use Container-Optimized OS (COS) images,
you cannot mount an emptyDir volume as exec. It mounts as
noexec and you will get the following error: exec user
process caused: permission denied. For example, you will see this
error message if you deploy a test Pod that runs a binary from an emptyDir volume.
Workaround:
Apply the following DaemonSet, which remounts /var/lib/kubelet with the
exec option on each node:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fix-cos-noexec
namespace: kube-system
spec:
selector:
matchLabels:
app: fix-cos-noexec
template:
metadata:
labels:
app: fix-cos-noexec
spec:
hostIPC: true
hostPID: true
containers:
- name: fix-cos-noexec
image: ubuntu
command: ["chroot", "/host", "bash", "-c"]
args:
- |
set -ex
while true; do
if ! $(nsenter -a -t 1 findmnt -l | grep -qe "^/var/lib/kubelet\s"); then
echo "remounting /var/lib/kubelet with exec"
nsenter -a -t 1 mount --bind /var/lib/kubelet /var/lib/kubelet
nsenter -a -t 1 mount -o remount,exec /var/lib/kubelet
fi
sleep 3600
done
volumeMounts:
- name: host
mountPath: /host
securityContext:
privileged: true
volumes:
- name: host
hostPath:
path: /
Upgrades, Updates
1.10, 1.11, 1.12, 1.13
Cluster node pool replica update does not work after autoscaling has
been disabled on the node pool
Node pool replicas do not update once autoscaling has been enabled and
then disabled on a node pool.
Workaround:
Remove the
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size
and cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size
annotations from the machine deployment of the corresponding node pool.
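A sketch of that removal with kubectl annotate (a trailing - removes an annotation). ADMIN_CLUSTER_KUBECONFIG, USER_CLUSTER_NAME, and MACHINE_DEPLOYMENT_NAME are placeholders, and the assumption that the machine deployment lives in the user cluster's namespace in the admin cluster may not match your setup:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME \
  annotate machinedeployment MACHINE_DEPLOYMENT_NAME \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size- \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size-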
Logging and monitoring
1.11, 1.12, 1.13
Windows monitoring dashboards show data from Linux clusters
From version 1.11, on the out-of-the-box monitoring dashboards, the
Windows Pod status dashboard and Windows node status dashboard also show
data from Linux clusters. This is because the Windows node and Pod metrics
are also exposed on Linux clusters.
Logging and monitoring
1.10, 1.11, 1.12
stackdriver-log-forwarder in constant CrashLoopBackOff
For Google Distributed Cloud version 1.10, 1.11, and 1.12, stackdriver-log-forwarder
DaemonSet might have CrashLoopBackOff errors when there are
broken buffered logs on the disk.
Workaround:
To mitigate this issue, you need to clean up the buffered logs on
the affected nodes.
To prevent the unexpected behavior, scale down
stackdriver-log-forwarder:
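Because stackdriver-log-forwarder is a DaemonSet rather than a Deployment, one way to scale it down is to patch it with a node selector that matches no nodes. This is a sketch, not the documented command; the non-existing label is a placeholder, and USER_CLUSTER_KUBECONFIG is your user cluster kubeconfig path:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system patch daemonset stackdriver-log-forwarder \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'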
To make sure the clean-up DaemonSet has cleaned up all the chunks,
you can run the following commands. The output of the two commands
should be equal to the number of nodes in your cluster:
It's likely that the logs input rate exceeds the limit of the logging agent,
which causes stackdriver-log-forwarder to not send logs.
This issue occurs in all Google Distributed Cloud versions.
Workaround:
To mitigate this issue, you need to increase the resource limit on
the logging agent.
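For example, a sketch of the resourceAttrOverride change in the stackdriver object; only the CPU limit that the verification below checks for is shown:
spec:
  resourceAttrOverride:
    stackdriver-log-forwarder/stackdriver-log-forwarder:
      limits:
        cpu: 1200m
You can then check the rendered DaemonSet, for example:
kubectl -n kube-system get daemonset stackdriver-log-forwarder -o yaml | grep "cpu: 1200m"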
The command finds cpu: 1200m if your edits have taken effect.
Security
1.13
Kubelet service will be temporarily unavailable after NodeReady
There is a short period where the node is ready but the kubelet server
certificate is not yet ready. kubectl exec and
kubectl logs are unavailable during these tens of seconds.
This is because it takes time for the new server certificate approver to
see the updated valid IP addresses of the node.
This issue affects the kubelet server certificate only; it does not affect
Pod scheduling.
Upgrades, Updates
1.12
Partial admin cluster upgrade does not block later user cluster
upgrade
User cluster upgrade failed with:
.LBKind in body is required (Check the status of OnPremUserCluster 'cl-stg-gdl-gke-onprem-mgmt/cl-stg-gdl' and the logs of pod 'kube-system/onprem-user-cluster-controller' for more detailed debugging information.
The admin cluster is not fully upgraded, and the status version is
still 1.10. The user cluster upgrade to 1.12 isn't blocked by any preflight
check, and fails with a version skew issue.
Workaround:
Complete the admin cluster upgrade to 1.11 first, and then upgrade
the user cluster to 1.12.
Storage
1.10.0-1.10.5, 1.11.0-1.11.2, 1.12.0
Datastore incorrectly reports insufficient free space
gkectl diagnose cluster command failed with:
Checking VSphere Datastore FreeSpace...FAILURE
Reason: vCenter datastore: [DATASTORE_NAME] insufficient FreeSpace, requires at least [NUMBER] GB
The validation of datastore free space should not be used for existing
cluster node pools, and was added in gkectl diagnose cluster
by mistake.
Workaround:
You can ignore the error message or skip the validation using
--skip-validation-infra.
Operation, Networking
1.11, 1.12.0-1.12.1
Failure to add new user cluster when admin cluster is using MetalLB
load balancer
You may not be able to add a new user cluster if your admin cluster is
set up with a MetalLB load balancer configuration.
The user cluster deletion process may get stuck, which
invalidates the MetalLB ConfigMap. It won't be possible
to add a new user cluster in this state.
Failure when using Container-Optimized OS (COS) for user cluster
If osImageType is set to cos for the admin
cluster, and gkectl check-config is executed after admin
cluster creation and before user cluster creation, it fails with:
Failed to create the test VMs: VM failed to get IP addresses on the network.
The test VM created for the user cluster check-config by
default uses the same osImageType as the admin cluster, and
currently the test VM is not compatible with COS.
Workaround:
To avoid the slow preflight check that creates the test VM, use
gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config
USER_CLUSTER_CONFIG --fast.
Logging and monitoring
1.12.0-1.12.1
Grafana in the admin cluster unable to reach user clusters
This issue affects customers using Grafana in the admin cluster to
monitor user clusters in Google Distributed Cloud versions 1.12.0 and
1.12.1. It comes from a mismatch of pushprox-client certificates in user
clusters and the allowlist in the pushprox-server in the admin cluster.
The symptom is pushprox-client in user clusters printing error logs like
the following:
Locate the principals line for the
external-pushprox-server-auth-proxy listener and correct
the principal_name for all user clusters by removing the
kube-system substring from
pushprox-client.metrics-consumers.kube-system.cluster.
The new config should look like the following:
The preceding steps should resolve the issue. Once the cluster is
upgraded to 1.12.2 and later where the issue is fixed, scale up the
admin cluster kube-system monitoring-operator so that it can manage the
pipeline again.
gkectl repair admin-master does not provide the VM
template to be used for recovery
gkectl repair admin-master command failed with:
Failed to repair: failed to select the template: no VM templates is available for repairing the admin master (check if the admin cluster version >= 1.4.0 or contact support
gkectl repair admin-master is not able to fetch the VM
template to be used for repairing the admin control plane VM if the name
of the admin control plane VM ends with the characters t,
m, p, or l.
Workaround:
Rerun the command with --skip-validation.
Logging and monitoring
1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Cloud audit logging failure due to permission denied
Cloud Audit Logs needs a special permission setup that is
currently only automatically performed for user clusters through GKE Hub.
It is recommended to have at least one user cluster that uses the same
project ID and service account as the admin cluster for
Cloud Audit Logs, so that the admin cluster has the required
permission.
However, in cases where the admin cluster uses a different project ID or
a different service account than any user cluster, audit logs from the admin
cluster fail to be ingested into Google Cloud. The symptom is a
series of Permission Denied errors in the
audit-proxy Pod in the admin cluster.
If you received a 404 Not_found error, it means
there is no service account allowlisted for this project ID. You can
allowlist a service account by enabling the
cloudauditlogging Hub feature:
If you received a feature spec that contains
"lifecycleState": "ENABLED" with "code":
"OK" and a list of service accounts in
allowlistedServiceAccounts, it means there are existing
service accounts allowed for this project. You can either use a
service account from this list in your cluster, or add a new service
account to the allowlist:
If you received a feature spec that contains
"lifecycleState": "ENABLED" with "code":
"FAILED", it means the permission setup was not successful.
Try to address the issues in the description field of
the response, or back up the current allowlist, delete the
cloudauditlogging hub feature, and re-enable it following step 1 of
this section again. You can delete the cloudauditlogging
Hub feature by:
If your workstation does not have access to user cluster worker nodes,
you will get the following failures when running
gkectl diagnose:
Checking user cluster certificates...FAILURE
Reason: 3 user cluster certificates error(s).
Unhealthy Resources:
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
If your workstation does not have access to admin cluster worker nodes,
you will get the following failures when
running gkectl diagnose:
Checking admin cluster certificates...FAILURE
Reason: 3 admin cluster certificates error(s).
Unhealthy Resources:
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
Workaround:
It is safe to ignore these messages.
Operating system
1.8, 1.9, 1.10, 1.11, 1.12, 1.13
/var/log/audit/ filling up disk space on VMs
/var/log/audit/ is filled with audit logs. You can check
the disk usage by running sudo du -h -d 1 /var/log/audit.
Certain gkectl commands on the admin workstation, for
example gkectl diagnose snapshot, contribute to disk space
usage.
Since Google Distributed Cloud v1.8, the Ubuntu image is hardened with the CIS Level 2
Benchmark. One of the compliance rules, "4.1.2.2 Ensure audit logs are
not automatically deleted", enforces the auditd setting
max_log_file_action = keep_logs. This results in all the
audit logs being kept on the disk.
The workaround below changes this setting so that auditd automatically rotates
its logs once it has generated more than 250 files (each 8 MB in size).
Cluster nodes
For cluster nodes, upgrade to 1.11.5+, 1.12.4+, 1.13.2+ or 1.14+. If
you can't upgrade to those versions yet, apply the following DaemonSet to your cluster:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: change-auditd-log-action
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: change-auditd-log-action
  template:
    metadata:
      labels:
        app: change-auditd-log-action
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: update-audit-rule
        image: ubuntu
        command: ["chroot", "/host", "bash", "-c"]
        args:
        - |
          while true; do
            if $(grep -q "max_log_file_action = keep_logs" /etc/audit/auditd.conf); then
              echo "updating auditd max_log_file_action to rotate with a max of 250 files"
              sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
              sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
              echo "restarting auditd"
              systemctl restart auditd
            else
              echo "auditd setting is expected, skip update"
            fi
            sleep 600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
Note that making this auditd config change would violate CIS Level 2
rule "4.1.2.2 Ensure audit logs are not automatically deleted".
Networking
1.10, 1.11.0-1.11.3, 1.12.0-1.12.2, 1.13.0
NetworkGatewayGroup Floating IP conflicts with node
address
Users are unable to create or update NetworkGatewayGroup
objects because of the following validating webhook error:
[1] admission webhook "vnetworkgatewaygroup.kb.io" denied the request: NetworkGatewayGroup.networking.gke.io "default" is invalid: [Spec.FloatingIPs: Invalid value: "10.0.0.100": IP address conflicts with node address with name: "my-node-name"
In affected versions, the kubelet can erroneously bind to a floating IP
address assigned to the node and report it as a node address in
node.status.addresses. The validating webhook checks
NetworkGatewayGroup floating IP addresses against all
node.status.addresses in the cluster and sees this as a
conflict.
Workaround:
In the same cluster where creating or updating
NetworkGatewayGroup objects is failing, temporarily disable
the ANG validating webhook and submit your change:
Save the webhook config so it can be restored at the end:
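A sketch of saving and then editing the webhook configuration; VALIDATING_WEBHOOK_CONFIG_NAME is a placeholder for the ValidatingWebhookConfiguration in your cluster that contains the vnetworkgatewaygroup.kb.io entry:
kubectl -n kube-system get validatingwebhookconfiguration VALIDATING_WEBHOOK_CONFIG_NAME \
  -o yaml > webhook-config.yaml
kubectl -n kube-system edit validatingwebhookconfiguration VALIDATING_WEBHOOK_CONFIG_NAME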
Remove the vnetworkgatewaygroup.kb.io item from the
webhook config list, then save and close the editor to apply the change.
Create or edit your NetworkGatewayGroup object.
Reapply the original webhook config:
kubectl -n kube-system apply -f webhook-config.yaml
Installation, Upgrades, Updates
1.10.0-1.10.2
Creating or upgrading admin cluster timeout
During an admin cluster upgrade attempt, the admin control plane VM
might get stuck during creation. The admin control plane VM goes into an
infinite waiting loop during the boot up, and you will see the following
infinite loop error in the /var/log/cloud-init-output.log
file:
+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ head -n 1
+++ grep -v 192.168.231.1
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
++ echo
+ '[' -n '' ']'
+ sleep 1
+ echo 'waiting network configuration is applied'
waiting network configuration is applied
++ get-public-ip
+++ ip addr show dev ens192 scope global
+++ grep -Eo 'inet ([0-9]{1,3}\.){3}[0-9]{1,3}'
+++ awk '{print $2}'
+++ grep -v 192.168.231.1
+++ head -n 1
++ echo
+ '[' -n '' ']'
+ sleep 1
This is because when Google Distributed Cloud tries to get the node IP
address in the startup script, it uses grep -v
ADMIN_CONTROL_PLANE_VIP to skip the admin cluster control-plane VIP
which can be assigned to the NIC too. However, the command also skips over
any IP address that has a prefix of the control-plane VIP, which causes
the startup script to hang.
For example, suppose that the admin cluster control-plane VIP is
192.168.1.25. If the IP address of the admin cluster control-plane VM has
the same prefix, for example, 192.168.1.254, then the control-plane VM will
get stuck during creation. This issue can also be triggered if the
broadcast address has the same prefix as the control-plane VIP, for
example, 192.168.1.255.
Workaround:
If the reason for the admin cluster creation timeout is due to the
broadcast IP address, run the following command on the admin cluster
control-plane VM:
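A sketch of one way to do this, not the exact command from the product documentation: add the node IP address again as a /32, which creates an address line without a broadcast address. NODE_IP is a placeholder for the control-plane VM's IP address, and ens192 is the NIC name shown in the log above:
ip addr add NODE_IP/32 dev ens192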
This will create a line without a broadcast address, and unblock the
boot up process. After the startup script is unblocked, remove this
added line by running the following command:
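For example, assuming the same placeholder values as above:
ip addr del NODE_IP/32 dev ens192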
However, if the reason for the admin cluster creation timeout is due
to the IP address of the control-plane VM, you cannot unblock the
startup script. Switch to a different IP address, and recreate or
upgrade to version 1.10.3 or later.
Operating system, Upgrades, Updates
1.10.0-1.10.2
The state of the admin cluster using COS image will get lost upon
admin cluster upgrade or admin master repair
The data disk can't be mounted correctly on the admin cluster master node
when the COS image is used, and the state of an admin cluster that uses the
COS image is lost upon admin cluster upgrade or admin master repair. (An
admin cluster that uses the COS image is a preview feature.)
Workaround:
Re-create the admin cluster with osImageType set to ubuntu_containerd.
After you create the admin cluster with osImageType set to cos, grab
the admin cluster SSH key and SSH into the admin master node.
The df -h output contains /dev/sdb1 98G 209M 93G
1% /opt/data, and the lsblk output contains -sdb1
8:17 0 100G 0 part /opt/data.
Operating system
1.10
systemd-resolved failed DNS lookup on .local domains
In Google Distributed Cloud version 1.10.0, name resolutions on Ubuntu
are routed to local systemd-resolved listening on 127.0.0.53
by default. The reason is that on the Ubuntu 20.04 image used in version
1.10.0, /etc/resolv.conf is sym-linked to
/run/systemd/resolve/stub-resolv.conf, which points to the
127.0.0.53 localhost DNS stub.
As a result, the localhost DNS name resolution refuses to check the
upstream DNS servers (specified in
/run/systemd/resolve/resolv.conf) for names with a
.local suffix, unless the names are specified as search
domains.
This causes any lookups for .local names to fail. For
example, during node startup, kubelet fails on pulling images
from a private registry with a .local suffix. Specifying a
vCenter address with a .local suffix will not work on an
admin workstation.
Workaround:
You can avoid this issue for cluster nodes if you specify the
searchDomainsForDNS field in your admin cluster configuration
file and the user cluster configuration file to include the domains.
gkectl update doesn't currently support updating the
searchDomainsForDNS field.
Therefore, if you haven't set up this field before cluster creation,
you must SSH into the nodes and bypass the local systemd-resolved stub by
changing the symlink of /etc/resolv.conf from
/run/systemd/resolve/stub-resolv.conf (which contains the
127.0.0.53 local stub) to
/run/systemd/resolve/resolv.conf (which points to the actual
upstream DNS):
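For example:
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf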
As for the admin workstation, gkeadm doesn't support
specifying search domains, so you must work around this issue with this manual
step.
This solution does not persist across VM re-creations. You must
reapply this workaround whenever VMs are re-created.
Installation, Operating system
1.10
Docker bridge IP uses 172.17.0.1/16 instead of
169.254.123.1/24
Google Distributed Cloud specifies a dedicated subnet for the Docker
bridge IP address that uses --bip=169.254.123.1/24, so that
it won't reserve the default 172.17.0.1/16 subnet. However,
in version 1.10.0, there is a bug in the Ubuntu OS image that causes the
customized Docker config to be ignored.
As a result, Docker picks the default 172.17.0.1/16 as its
bridge IP address subnet. This might cause an IP address conflict if you
already have a workload running within that IP address range.
Workaround:
To work around this issue, you must rename the following systemd config
file for dockerd, and then restart the service:
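A sketch of that workaround, assuming the drop-in file that carries the --bip setting is /etc/systemd/system/docker.service.d/50-cloudimg-settings.cfg (check the actual file name on your node); the fix is renaming it so that it ends in .conf, reloading systemd, and restarting Docker:
sudo mv /etc/systemd/system/docker.service.d/50-cloudimg-settings.cfg \
  /etc/systemd/system/docker.service.d/50-cloudimg-settings.conf
sudo systemctl daemon-reload
sudo systemctl restart docker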
Verify that Docker picks the correct bridge IP address:
ip a | grep docker0
This solution does not persist across VM re-creations. You must reapply
this workaround whenever VMs are re-created.
Upgrades, Updates
1.11
Upgrade to 1.11 blocked by stackdriver readiness
In Google Distributed Cloud version 1.11.0, there are changes in the definition of custom resources related to logging and monitoring:
The group name of the stackdriver custom resource changed from addons.sigs.k8s.io to addons.gke.io;
The group name of the monitoring and metricsserver custom resources changed from addons.k8s.io to addons.gke.io;
The specs of the preceding resources are now validated against their schemas. In particular, the resourceAttrOverride and storageSizeOverride specs in the stackdriver custom resource need string values for the cpu, memory, and storage size requests and limits.
No action is required if you do not have additional logic that applies or edits the affected custom resources. The Google Distributed Cloud upgrade process migrates the affected resources and keeps their existing specs after the group name change.
However, if you run any logic that applies or edits the affected resources, special attention is needed. First, they need to be referenced with the new group name in your manifest file. For example:
apiVersion: addons.gke.io/v1alpha1  # instead of `addons.sigs.k8s.io/v1alpha1`
kind: Stackdriver
Secondly, make sure the resourceAttrOverride and storageSizeOverride spec values are of string type. For example:
spec:
  resourceAttrOverride:
    stackdriver-log-forwarder/stackdriver-log-forwarder:
      limits:
        cpu: 1000m # or "1"
        # cpu: 1 # integer value like this would not work
        memory: 3000Mi
Otherwise, the applied changes and edits will not take effect and may lead to an unexpected status in the logging and monitoring components. Potential symptoms may include:
Reconciliation error logs in onprem-user-cluster-controller, for example:
potential reconciliation error: Apply bundle components failed, requeue after 10s, error: failed to apply addon components: failed to apply bundle objects from stackdriver-operator-addon 1.11.2-gke.53 to cluster my-cluster: failed to create typed live object: .spec.resourceAttrOverride.stackdriver-log-forwarder/stackdriver-log-forwarder.limits.cpu: expected string, got &value.valueUnstructured{Value:1}
Failure in kubectl edit stackdriver stackdriver, for example:
Error from server (NotFound): stackdrivers.addons.gke.io "stackdriver" not found
If you encounter the above errors, it means an unsupported type under the stackdriver CR spec was already present before the upgrade. As a workaround, you can manually edit the stackdriver CR under the old group name (kubectl edit stackdrivers.addons.sigs.k8s.io stackdriver) and do the following:
Change the resource requests and limits to string type;
Remove any addons.gke.io/migrated-and-deprecated: true annotation if present.
Then resume or restart the upgrade process.
Operating system
1.7 and later
COS VMs show no IPs when VMs are moved through non-graceful shutdown of the host
Whenever there is a fault in an ESXi server and the vCenter HA function has been enabled for the server, all VMs on the faulty ESXi server trigger the vMotion mechanism and are moved to another healthy ESXi server. Migrated COS VMs lose their IP addresses.
Workaround:
Reboot the VM.
Networking
all versions prior to 1.14.7, 1.15.0-1.15.3, 1.16.0
GARP reply sent by Seesaw doesn't set target IP
The periodic GARP (Gratuitous ARP) sent by Seesaw every 20 seconds doesn't set
the target IP in the ARP header. Some networks (such as Cisco ACI) might not accept such packets. This can cause longer service downtime after a split brain (due to VRRP packet drops) is recovered.
Workaround:
Trigger a Seesaw failover by running sudo seesaw -c failover on either of the Seesaw VMs. This should
restore the traffic.
Operating system
1.16, 1.28.0-1.28.200
Kubelet is flooded with logs stating that "/etc/kubernetes/manifests" does not exist on the worker nodes
"staticPodPath" was mistakenly set for worker nodes
Workaround:
Manually create the folder "/etc/kubernetes/manifests"
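For example, on each affected worker node:
sudo mkdir -p /etc/kubernetes/manifests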
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-02-19 UTC."],[],[]]