Google Distributed Cloud for bare metal known issues
This page lists all known issues for Google Distributed Cloud (software only) for bare
metal (formerly known as Google Distributed Cloud Virtual, previously known as
Anthos clusters on bare metal). This page is for
Admins, architects, and Operators who manage the lifecycle of the
underlying tech infrastructure, and who respond to alerts and pages when service
level objectives (SLOs) aren't met or applications fail. To learn more about
common roles and example tasks that we reference in Google Cloud content, see
Common GKE Enterprise user roles and
tasks.
Category
Identified version(s)
Issue and workaround
Installation, Upgrades and updates
1.31
Errors creating custom resources
In version 1.31 of Google Distributed Cloud, you might get errors when
you try to create custom resources, such as clusters (all types) and
workloads. The issue is caused by a breaking change introduced in
Kubernetes 1.31 that prevents the caBundle field in a custom
resource definition from transitioning from a valid to an invalid state.
For more information about the change, see the
Kubernetes 1.31 changelog.
Prior to Kubernetes 1.31, the caBundle field was often set
to a makeshift value of \n, because in earlier Kubernetes
versions the API server didn't allow empty CA bundle content. Using
\n was a reasonable workaround to avoid confusion, as the
cert-manager typically updates the caBundle
later.
If the caBundle has been patched once from an invalid to a
valid state, there shouldn't be issues. However, if the custom resource
definition is reconciled back to \n (or another invalid
value), you might encounter the following error:
...Invalid value: []byte{0x5c, 0x6e}: unable to load root certificates: unable to parse bytes as PEM block]
Workaround
If you have a custom resource definition in which caBundle
is set to an invalid value, you can safely remove the caBundle
field entirely. This should resolve the issue.
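The following is a minimal sketch of one way to remove the field with a JSON patch, assuming the caBundle sits under the conversion webhook configuration of the custom resource definition (CRD_NAME is a placeholder for the affected definition):
# Inspect the current caBundle value.
kubectl get crd CRD_NAME -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
# Remove the caBundle field entirely.
kubectl patch crd CRD_NAME --type=json \
    -p='[{"op":"remove","path":"/spec/conversion/webhook/clientConfig/caBundle"}]'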
Installation, Upgrades and updates
1.28, 1.29, and 1.30
Cluster upgrades take too long
In a cluster upgrade, each cluster node is drained and upgraded. In
releases 1.28 and later, Google Distributed Cloud switched from
taint-based node draining
to
eviction-based draining.
Additionally, to address pod inter-dependencies, eviction-based draining
follows a multi-stage draining order.
At each stage of draining, pods have a 20-minute grace period to terminate,
whereas the previous taint-based draining had a single 20-minute timeout.
If each stage requires the full 20 minutes to evict all pods, the time to
drain a node can be significantly longer than the previous taint-based
draining. In turn, increased node draining time can significantly increase
the time it takes to complete a cluster upgrade or to put a cluster into
maintenance mode.
There is also an
upstream Kubernetes issue
that affects the timeout logic for eviction-based draining. This issue
might also increase node draining times.
Workaround:
As a workaround, you can
disable eviction-based node draining.
This reverts to taint-based draining. We don't recommend taint-based draining,
however, because it doesn't honor PodDisruptionBudgets (PDBs), which might
lead to service disruptions.
Installation, Upgrades and updates
1.16, 1.28, and 1.29
Stale failed preflight check might block cluster operations
Cluster reconciliation is a standard phase for most cluster operations,
including cluster creation and cluster upgrades. During cluster
reconciliation, the Google Distributed Cloud cluster controller triggers a
preflight check. If this preflight check fails, then further cluster
reconciliation is blocked. As a result, cluster operations that include
cluster reconciliation are also blocked.
This preflight check doesn't run periodically; it runs only as part of
cluster reconciliation. Therefore, even if you fix the issue that
caused the initial preflight failure and on-demand preflight checks run
successfully, cluster reconciliation is still blocked due to this stale
failed preflight check.
If you have a cluster installation or upgrade that's stuck, you can
check to see if you're affected by this issue with the following steps:
Check the anthos-cluster-operator Pod logs for entries
like the following:
"msg"="Preflight check not ready. Won't reconcile"
Check whether the preflight check triggered by the cluster
controller is in a failed state:
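For example, a minimal sketch, assuming the preflight check is exposed as a PreflightCheck custom resource in the cluster namespace (the resource name and namespace are placeholders):
# List preflight checks in the cluster namespace and look for a failed one.
kubectl get preflightchecks -n CLUSTER_NAMESPACE --kubeconfig ADMIN_KUBECONFIG
# Delete the stale failed preflight check so that reconciliation can continue.
kubectl delete preflightcheck PREFLIGHT_CHECK_NAME -n CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG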
Once the stale failed preflight check has been deleted, the cluster
controller is able to create a new preflight check.
Installation, Upgrades and updates
1.30.100, 1.30.200 and 1.30.300
User cluster creation or upgrade operations might not succeed
Creating user clusters at, or upgrading existing user clusters to,
versions 1.30.100, 1.30.200 or 1.30.300 might not succeed. This issue
applies only when kubectl or a GKE On-Prem API client (the
Google Cloud console, the gcloud CLI, or Terraform) is used
for creation and upgrade operations of the user cluster.
In this situation, the user cluster creation operation gets stuck in
the Provisioning state and a user cluster upgrade gets stuck
in the Reconciling state.
To check whether a cluster is affected, use the following steps:
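For example, a sketch that assumes the user cluster is represented by a Cluster resource in its namespace on the managing cluster:
kubectl get cluster CLUSTER_NAME -n USER_CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG
Replace the following: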
CLUSTER_NAME: the name of the user
cluster that is stuck.
USER_CLUSTER_NAMESPACE: the user
cluster namespace name.
ADMIN_KUBECONFIG: the path of the
kubeconfig file of the managing cluster.
If the CLUSTER STATE value is Provisioning
or Reconciling, you might be affected by this issue. The
following example response is an indicator that an upgrade is stuck:
NAME ABM VERSION DESIRED ABM VERSION CLUSTER STATE
some-cluster 1.30.0-gke.1930 1.30.100-gke.96 Reconciling
The mismatched versions are also an indication that the cluster upgrade
hasn't completed.
Find the full name of the anthos-cluster-operator Pod:
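For example, a sketch that filters the kube-system Pods on the managing cluster:
kubectl get pods -n kube-system --kubeconfig ADMIN_KUBECONFIG | \
    grep anthos-cluster-operator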
Stream the anthos-cluster-operator Pod logs for a
repeating message, indicating that the cluster is stuck provisioning or
reconciling:
kubectl logs POD_NAME -n kube-system -f --since=15s \
    --kubeconfig ADMIN_KUBECONFIG | \
    grep "Waiting for configMapForwarder to forward kube-system/metadata-image-digests to the cluster namespace, requeuing"
Replace POD_NAME with the full name
of the anthos-cluster-operator Pod from the preceding step.
As the command runs, watch for a continuous stream of matching log
lines, which is an indication that the cluster operation is stuck. The
following sample output is similar to what you see when a cluster is
stuck reconciling:
...
I1107 17:06:32.528471 1 reconciler.go:1475] "msg"="Waiting for configMapForwarder to forward kube-system/metadata-image-digests to the cluster namespace, requeuing" "Cluster"={"name":"user-t05db3f0761d4061-cluster","namespace":"cluster-user-t05db3f0761d4061-cluster"} "controller"="cluster" "controllerGroup"="baremetal.cluster.gke.io" "controllerKind"="Cluster" "name"="user-t05db3f0761d4061-cluster" "namespace"="cluster-user-t05db3f0761d4061-cluster" "reconcileID"="a09c70a6-059f-4e81-b6b2-aaf19fd5f926"
I1107 17:06:37.575174 1 reconciler.go:1475] "msg"="Waiting for configMapForwarder to forward kube-system/metadata-image-digests to the cluster namespace, requeuing" "Cluster"={"name":"user-t05db3f0761d4061-cluster","namespace":"cluster-user-t05db3f0761d4061-cluster"} "controller"="cluster" "controllerGroup"="baremetal.cluster.gke.io" "controllerKind"="Cluster" "name"="user-t05db3f0761d4061-cluster" "namespace"="cluster-user-t05db3f0761d4061-cluster" "reconcileID"="e1906c8a-cee0-43fd-ad78-88d106d4d30a""Name":"user-test-v2"} "err"="1 error occurred:\n\t* failed to construct the job: ConfigMap \"metadata-image-digests\" not found\n\n"
...
The response contains reasons and messages from the ConfigMapForwarder
resource. When the ConfigMapForwarder is stalled, you should see
output like the following:
Reason: Stalled
Message: cannot forward configmap kube-system/metadata-image-digests without "baremetal.cluster.gke.io/mark-source" annotation
Confirm that the metadata-image-digests ConfigMap isn't
present in the user cluster namespace:
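For example, using the same placeholders as earlier; a NotFound error confirms that the ConfigMap is missing:
kubectl get configmap metadata-image-digests -n USER_CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG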
Non-root users can't run bmctl restore to restore quorum
When running bmctl restore --control-plane-node as a non-root
user, a chown issue occurs while copying files from the control
plane node to the workstation machine.
Workaround:
Run the bmctl restore --control-plane-node command with
sudo for non-root users.
Upgrades
1.30.0-gke.1930
Upgrade-health-check job remains in active state due to missing pause:3.9 image
During an upgrade, the upgrade-health-check job may remain in an active
state due to the missing pause:3.9 image.
This issue does not affect the success of the upgrade.
Workaround:
Manually delete the upgrade-health-check job with the following
command:
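A sketch of the delete command, assuming the job runs in the cluster namespace on the managing cluster (the namespace, the kubeconfig path, and any version suffix on the job name are placeholders to confirm for your environment):
kubectl delete job upgrade-health-check -n CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG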
Downloads of artifacts with sizes that exceed the cgroup memory.max
limit might be extremely slow. This issue is caused by a bug in the Linux
kernel for Red Hat Enterprise Linux (RHEL) 9.2. Kernels with cgroup v2 enabled are
affected. The issue is fixed in kernel versions 5.14.0-284.40.1.el_9.2 and later.
Workaround:
For affected pods, increase the memory limit settings for its containers
(spec.containers[].resources.limits.memory) so that the limits
are greater than the size of downloaded artifacts.
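For illustration, a hypothetical container spec with a raised memory limit (the container name and image are placeholders):
containers:
- name: artifact-downloader        # hypothetical container name
  image: REGISTRY/artifact-downloader:latest
  resources:
    limits:
      memory: "4Gi"   # set this larger than the largest downloaded artifact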
Upgrades
1.28 to 1.29.200
Cluster upgrade fails due to conflict in networks.networking.gke.io custom resource definition
During a bare metal cluster upgrade, the upgrade might fail with an
error message indicating that there's conflict in the
networks.networking.gke.io custom resource definition.
Specifically, the error calls out that v1alpha1 isn't present
in spec.versions.
This issue occurs because the v1alpha1 version of the
custom resource definition wasn't migrated to v1 during the
upgrade process.
Workaround:
Patch the affected clusters with the following commands:
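The exact patch isn't reproduced here. As a rough sketch only, and assuming the fix is to drop the stale v1alpha1 entry so that only v1 remains in the stored versions, the patch might look like the following (requires a kubectl version that supports --subresource):
kubectl patch customresourcedefinition networks.networking.gke.io \
    --subresource=status --type=merge \
    -p '{"status":{"storedVersions":["v1"]}}'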
Machine preflight check failures for check_inotify_max_user_instances and check_inotify_max_user_watches settings
During cluster installation or upgrade, the machine preflight checks
related to fs.inotify kernel settings might fail. If you're
affected by this issue, the machine preflight check log contains an error
like the following:
Minimum kernel setting required for fs.inotify.max_user_instances is 8192. Current fs.inotify.max_user_instances value is 128. Please run "echo "fs.inotify.max_user_instances=8192" | sudo tee --append /etc/sysctl.conf" to set the correct value.
This issue occurs because the fs.inotify max_user_instances and
max_user_watches values are read incorrectly from the control
plane and bootstrap hosts, instead of the intended node machines.
Workaround:
To work around this issue, adjust the fs.inotify.max_user_instances
and fs.inotify.max_user_watches to the recommended values on
all control plane and the bootstrap machines:
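Following the pattern shown in the preflight error message, for example (the max_user_watches value is an assumption; use the value that your preflight check reports):
echo "fs.inotify.max_user_instances=8192" | sudo tee --append /etc/sysctl.conf
echo "fs.inotify.max_user_watches=524288" | sudo tee --append /etc/sysctl.conf
# Apply the new settings without a reboot.
sudo sysctl -p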
After the installation or upgrade operation completes, these values can be
reverted, if necessary.
Configuration, Installation, Upgrades and updates, Networking, Security
1.15, 1.16, 1.28, 1.29
Cluster installation and upgrade fails when ipam-controller-manager is required
Cluster installation and upgrade fails when the
ipam-controller-manager is required and your cluster is
running on Red Hat Enterprise Linux (RHEL) 8.9 or higher (depending on
upstream RHEL changes) with SELinux running in enforcing mode. This
applies specifically when the container-selinux version is
higher than 2.225.0.
Your cluster requires the ipam-controller-manager in any of
the following situations:
Your cluster is configured for IPv4/IPv6 dual-stack networking
Your cluster is configured with clusterNetwork.flatIPv4
set to true
Your cluster is configured with the
preview.baremetal.cluster.gke.io/multi-networking: enable
annotation
Cluster installation and upgrade don't succeed when the
ipam-controller-manager is installed.
Workaround
Set the default context for the /etc/kubernetes directory
on each control plane node to type etc_t:
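A sketch using standard SELinux tooling (verify the exact context rules for your distribution):
# Set the default SELinux context for /etc/kubernetes to etc_t.
sudo semanage fcontext -a -t etc_t "/etc/kubernetes(/.*)?"
# Apply the new default context to the existing files.
sudo restorecon -R -v /etc/kubernetes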
Cluster upgrade fails with Google Cloud reachability check error
When you use bmctl to upgrade a cluster, the upgrade might
fail with a GCP reachability check failed error even though the
target URL is reachable from the admin workstation. This issue is caused by
a bug in bmctl versions 1.28.0 to 1.28.500.
Workaround:
Before you run the bmctl upgrade command, set the
GOOGLE_APPLICATION_CREDENTIALS environment variable to point to
a valid service account key file:
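For example (the key file path is a placeholder):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json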
Setting Application Default Credentials (ADC) this way ensures that
bmctl has the necessary credentials to access the Google API
endpoint.
admission webhook "binaryauthorization.googleapis.com" denied the
request: failed to post request to endpoint: Post
"https://binaryauthorization.googleapis.com/internal/projects/PROJECT_NUMBER/policy/locations/LOCATION/clusters/CLUSTER_NAME:admissionReview":
oauth2/google: status code 400:
{"error":"invalid_target","error_description":"The
target service indicated by the \"audience\" parameters is invalid.
This might either be because the pool or provider is disabled or deleted
or because it doesn't exist."}
If you see the preceding message, your cluster has this issue.
Workaround:
To work around this issue, complete the following steps:
Cancel the cluster creation operation.
Remove the spec.binaryAuthorization block from the
cluster configuration file.
Create the cluster with Binary Authorization disabled.
If you have SELinux enabled and mount file systems to Kubernetes
related directories, you might experience issues such as cluster creation
failure, unreadable files, or permission issues.
To determine if you're affected by this issue, run the following command:
ls -Z /var/lib/containerd
If you see
system_u:object_r:unlabeled_t:s0 where you would expect to
see another label, such as
system_u:object_r:container_var_lib_t:s0, you're affected.
Workaround:
If you've recently mounted file systems to directories, make sure those
directories are up to date with your SELinux configuration.
You should also run the following commands on each machine before
running bmctl create cluster:
restorecon -R -v /var
restorecon -R -v /etc
This one-time fix persists after a reboot, but is required every
time a new node with the same mount points is added. To learn more, see
Mounting File Systems in the Red Hat documentation.
Reset/Deletion
1.29.0
Reset user cluster fails trying to delete namespace
When running bmctl reset cluster -c ${USER_CLUSTER},
after all related jobs have finished, the command fails to delete the
user cluster namespace. The user cluster namespace is stuck in the
Terminating state. Eventually, the cluster reset times out
and returns an error.
Workaround:
To remove the namespace and complete the user cluster reset, use the
following steps:
Delete the metrics-server Pod from the admin cluster:
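A sketch of the deletion, assuming the metrics-server Pod runs in the kube-system namespace of the admin cluster and carries the k8s-app=metrics-server label:
kubectl delete pod -n kube-system -l k8s-app=metrics-server \
    --kubeconfig ADMIN_KUBECONFIG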
Once the finalizer is removed, the cluster namespace is removed and the
cluster reset is complete.
Configuration, Installation, Security
1.16.0 to 1.16.7 and 1.28.0 to 1.28.400
Binary Authorization Deployment is missing a nodeSelector
If you've enabled
Binary Authorization
for Google Distributed Cloud and are using version 1.16.0 to 1.16.7 or
1.28.0 to 1.28.400, you might experience an issue with where the Pods for
the feature are scheduled. In these versions, the
Binary Authorization Deployment is missing a nodeSelector, so the
Pods for the feature can be scheduled on worker nodes instead of control
plane nodes. This behavior doesn't cause anything to fail, but isn't
intended.
Workaround:
For all affected clusters, complete the following steps:
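A sketch of one way to add the missing nodeSelector with kubectl patch; the Deployment name and namespace are placeholders that you should confirm for your cluster:
kubectl patch deployment BINAUTHZ_DEPLOYMENT_NAME -n BINAUTHZ_NAMESPACE \
    --kubeconfig CLUSTER_KUBECONFIG --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/control-plane":""}}}}}'
Because control plane nodes are typically tainted, the Deployment might also need a matching toleration.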
After the change is saved, the Pods are re-deployed only to the control
plane nodes. This fix needs to be applied after every upgrade.
Upgrades and updates
1.28.0, 1.28.100, 1.28.200, 1.28.300
Error when upgrading a cluster to 1.28.0-1.28.300
Upgrading clusters created before version 1.11.0 to versions 1.28.0-1.28.300
might cause the lifecycle controller deployer Pod to enter an error state
during upgrade. When this happens, the logs of the lifecycle controller
deployer Pod have an error message similar to the following:
"inventorymachines.baremetal.cluster.gke.io\" is invalid: status.storedVersions[0]: Invalid value: \"v1alpha1\": must appear in spec.versions
Workaround:
This issue was fixed in version 1.28.400. Upgrade to version 1.28.400 or
later to resolve the issue.
If you're not able to upgrade, run the following commands to resolve the
problem:
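The exact commands aren't reproduced here. As a rough sketch only, and assuming the fix is analogous to other storedVersions mismatches (replacing the stale v1alpha1 entry in the CRD status), the patch might look like the following:
kubectl patch customresourcedefinition inventorymachines.baremetal.cluster.gke.io \
    --kubeconfig ADMIN_KUBECONFIG \
    --subresource=status --type=merge \
    -p '{"status":{"storedVersions":["v1"]}}'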
Sometimes cluster or container logs are tagged with a different project ID
in resource.labels.project_id in the Logs Explorer.
This can happen when the cluster is configured to use observability
PROJECT_ONE, which is set in the
clusterOperations.projectID field in the cluster config.
However, the cloudOperationsServiceAccountKeyPath in the
config has a service account key from project
PROJECT_TWO.
In such cases, all logs are routed to PROJECT_ONE,
but resource.labels.project_id is labeled as
PROJECT_TWO.
Workaround:
Use one of the following options to resolve the issue:
Use a service account from the same destination project.
Change the project_id in the service account key JSON file to the current project.
Change the project_id directly in the log filter from the Logs Explorer.
Networking
1.29, 1.30
Performance degradation for clusters using bundled load balancing with BGP
For version 1.29.0 clusters using bundled load balancing with BGP,
load balancing performance can degrade as the total number of Services of
type LoadBalancer approaches 2,000. As performance degrades,
Services that are newly created either take a long time to connect or
can't be connected to by a client. Existing Services continue to work,
but don't handle failure modes, such as the loss of a load balancer node,
effectively. These Service problems happen when the
ang-controller-manager Deployment is terminated due to
running out of memory.
If your cluster is affected by this issue, Services in the cluster are
unreachable and unhealthy, and the ang-controller-manager
Deployment is in a CrashLoopBackOff. The response when
listing the ang-controller-manager Deployments is similar to
the following:
Container Registry endpoint gcr.io connectivity issues can block cluster operations
Multiple cluster operations for admin clusters create a bootstrap
cluster. Before creating a bootstrap cluster, bmctl
performs a Google Cloud reachability check from the admin workstation.
This check might fail due to connectivity issues with the
Container Registry endpoint, gcr.io, and you might see an
error message like the following:
To work around this issue, retry the operation with the flag
--ignore-validation-errors.
Networking
1.15, 1.16
GKE Dataplane V2 incompatible with some storage drivers
Bare metal clusters use GKE Dataplane V2, which
is incompatible with some storage providers. You might experience
problems with stuck NFS volumes or Pods. This is especially likely if
you have workloads using ReadWriteMany volumes backed by
storage drivers that are susceptible to this issue:
Robin.io
Portworx (sharedv4 service volumes)
csi-nfs
This list is not exhaustive.
Workaround
A fix for this issue is available for the following Ubuntu versions:
20.04 LTS: Use a 5.4.0 kernel image later than
linux-image-5.4.0-166-generic
22.04 LTS: Either use a 5.15.0 kernel image later than
linux-image-5.15.0-88-generic or use the 6.5 HWE kernel.
If you're not using one of these versions, contact
Google Support.
Logging and monitoring
1.15, 1.16, 1.28
kube-state-metrics OOM in large cluster
You might notice that kube-state-metrics or the
gke-metrics-agent Pod that exists on the same node as
kube-state-metrics is out of memory (OOM).
This can happen in clusters with more than 50 nodes or with many
Kubernetes objects.
Workaround
To resolve this issue, update the stackdriver custom
resource definition to use the ksmNodePodMetricsOnly
feature gate. This feature gate makes sure that only a small number of
critical metrics are exposed.
To use this workaround, complete the following steps:
Check the stackdriver custom resource definition for
available feature gates:
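For example, a sketch that assumes the stackdriver resource is named stackdriver and lives in the kube-system namespace (the featureGates field path is an assumption):
# Inspect the currently configured feature gates.
kubectl -n kube-system get stackdriver stackdriver -o yaml | grep -A5 featureGates
# Enable the ksmNodePodMetricsOnly feature gate.
kubectl -n kube-system patch stackdriver stackdriver --type merge \
    -p '{"spec":{"featureGates":{"ksmNodePodMetricsOnly":true}}}'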
Preflight check fails on RHEL 9.2 due to missing iptables
When installing a cluster on the Red Hat Enterprise Linux (RHEL) 9.2
operating system, you might experience a failure due to the missing
iptables package. The failure occurs during preflight
checks and triggers an error message similar to the following:
'check_package_availability_pass':"The following packages are not available: ['iptables']"
RHEL 9.2 is in Preview
for Google Distributed Cloud version 1.28.
Workaround
Bypass the preflight check error by setting
spec.bypassPreflightCheck to true on your
Cluster resource.
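You can set the field in the cluster configuration file before running bmctl, or patch a live Cluster resource directly; a sketch of the latter:
kubectl patch cluster CLUSTER_NAME -n CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG --type merge \
    -p '{"spec":{"bypassPreflightCheck":true}}'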
Operation
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Slow MetalLB failover at high scale
When MetalLB handles a high number of services (over 10,000),
failover can take over an hour. This happens because MetalLB uses a rate
limited queue that, when under high scale, can take a while to get to
the service that needs to fail over.
Workaround
Upgrade your cluster to version 1.28 or later. If you're unable to
upgrade, manually editing the service (for example, adding an
annotation) causes the service to fail over more quickly.
Operation
1.16.0-1.16.6, 1.28.0-1.28.200
Environment variables have to be set on the admin workstation if proxy is enabled
bmctl check cluster can fail due to proxy failures if you don't have the HTTPS_PROXY and NO_PROXY environment variables defined on the admin workstation. The bmctl command reports an error message about failing to call some Google services, like the following example:
Workaround:
Manually set the HTTPS_PROXY and NO_PROXY environment variables on the admin workstation.
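For example (the proxy address and exclusion list are placeholders):
export HTTPS_PROXY=http://PROXY_ADDRESS:PROXY_PORT
export NO_PROXY=127.0.0.1,localhost,YOUR_INTERNAL_DOMAINS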
Upgrades and updates
1.28.0-gke.435
Upgrades to version 1.28.0-gke.435 might fail if audit.log has incorrect ownership
In some cases, the /var/log/apiserver/audit.log file on
control plane nodes has both group and user ownership set to root.
This file ownership setting causes upgrade failures for the control plane
nodes when upgrading a cluster from version 1.16.x to version 1.28.0-gke.435.
This issue only applies to clusters that were created prior to version
1.11 and that had Cloud Audit Logs disabled. Cloud Audit Logs is enabled
by default for clusters at version 1.9 and higher.
Workaround
If you're unable to upgrade your cluster to version 1.28.100-gke.146,
use the following steps as a workaround to complete your cluster upgrade
to version 1.28.0-gke.435:
If Cloud Audit Logs is enabled, remove the /var/log/apiserver/audit.log file.
If Cloud Audit Logs is disabled, change /var/log/apiserver/audit.log ownership to the same as the parent directory, /var/log/apiserver.
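For example, run the appropriate command on each affected control plane node (the paths come from the issue description):
# If Cloud Audit Logs is enabled, remove the file.
sudo rm /var/log/apiserver/audit.log
# If Cloud Audit Logs is disabled, match the parent directory's ownership.
sudo chown --reference=/var/log/apiserver /var/log/apiserver/audit.log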
Networking, Upgrades and updates
1.28.0-gke.435
MetalLB doesn't assign IP addresses to VIP Services
Google Distributed Cloud uses MetalLB for
bundled load balancing. In Google Distributed Cloud
release 1.28.0-gke.435, the bundled MetalLB is upgraded to version 0.13,
which introduces CRD support for IPAddressPools. However,
because ConfigMaps allow any name for an IPAddressPool,
the pool names had to be converted to a Kubernetes-compliant name by
appending a hash to the end of the name of the IPAddressPool.
For example, an IPAddressPool with a name default
is converted to a name like default-qpvpd when you upgrade
your cluster to version 1.28.0-gke.435.
Since MetalLB requires a specific name of an IPPool for
selection, the name conversion prevents MetalLB from making a pool
selection and assigning IP addresses. Therefore, Services that use
metallb.universe.tf/address-pool as an annotation to select
the address pool for an IP address no longer receive an IP address from
the MetalLB controller.
This issue is fixed in Google Distributed Cloud version 1.28.100-gke.146.
Workaround
If you can't upgrade your cluster to version 1.28.100-gke.146, use the
following steps as a workaround:
Get the converted name of the IPAddressPool:
kubectl get IPAddressPools -n kube-system
Update the affected Service to set the metallb.universe.tf/address-pool
annotation to the converted name with the hash.
For example, if the IPAddressPool name was converted from
default to a name like default-qpvpd, change
the annotation metallb.universe.tf/address-pool: default
in the Service to metallb.universe.tf/address-pool: default-qpvpd.
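For example, a one-line sketch of the annotation update (the Service name and namespace are placeholders):
kubectl annotate service SERVICE_NAME -n NAMESPACE --overwrite \
    metallb.universe.tf/address-pool=default-qpvpd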
The hash used in the name conversion is deterministic, so the
workaround is persistent.
Upgrades and updates
1.14, 1.15, 1.16, 1.28, 1.29
Orphan pods after upgrading to version 1.14.x
When you upgrade clusters to version 1.14.x, some
resources from the previous version aren't deleted. Specifically, you
might see a set of orphaned pods like the following:
This issue is fixed in Google Distributed Cloud version 1.15.0 and higher.
Installation
1.14
Cluster creation stuck on the machine-init job
If you try to install Google Distributed Cloud version 1.14.x, you might
experience a failure due to the machine-init jobs, similar to
the following example output:
Cilium-operator missing Node list and
watch permissions
In Cilium 1.13, the cilium-operator ClusterRole
permissions are incorrect. The Node list and
watch permissions are missing. The
cilium-operator fails to start garbage collectors, which
results in the following issues:
Leakage of Cilium resources.
Stale identities aren't removed from BPF policy maps.
Policy maps might reach the 16K limit.
New entries can't be added.
Incorrect NetworkPolicy enforcement.
Identities might reach the 64K limit.
New Pods can't be created.
An operator that's missing the Node permissions reports the following
example log message:
2024-01-02T20:41:37.742276761Z level=error msg=k8sError error="github.com/cilium/cilium/operator/watchers/node.go:83: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User \"system:serviceaccount:kube-system:cilium-operator\" cannot list resource \"nodes\" in API group \"\" at the cluster scope" subsys=k8s
The Cilium agent reports an error message when it's unable to insert an
entry into a policy map, like the following example:
level=error msg="Failed to add PolicyMap key" bpfMapKey="{6572100 0 0 0}" containerID= datapathPolicyRevision=0 desiredPolicyRevision=7 endpointID=1313 error="Unable to update element for Cilium_policy_01313 map with file descriptor 190: the map is full, please consider resizing it. argument list too long" identity=128 ipv4= ipv6= k8sPodName=/ port=0 subsys=endpoint
Workaround:
Remove the Cilium identities, then add the missing ClusterRole
permissions to the operator:
Remove the existing CiliumIdentity objects:
kubectl delete ciliumid --all
Edit the cilium-operator ClusterRole object:
kubectl edit clusterrole cilium-operator
Add a section for nodes that includes the missing
permissions, as shown in the following example:
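A minimal sketch of the rule to add, using standard RBAC syntax, alongside the existing rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch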
This error can be safely ignored. If you encounter this error that
blocks the upgrade, re-run the upgrade command.
If you observe this error when you run the preflight using the
bmctl preflightcheck command, nothing is blocked by this
failure. You can run the preflight check again to get the accurate
preflight information.
Workaround:
Re-run the upgrade command or, if you encountered the error during
bmctl preflightcheck, re-run the preflightcheck
command.
Operation
1.14, 1.15.0-1.15.7, 1.16.0-1.16.3, 1.28.0
Periodic Network health check fails when a node is replaced or removed
This issue affects clusters that perform periodic network health checks
after a node has been replaced or removed. If your cluster undergoes periodic
health checks, the periodic network health check results in failure following
the replacement or removal of a node, because the network inventory ConfigMap
doesn't get updated once it's created.
Workaround:
The recommended workaround is to delete the inventory ConfigMap and the
periodic network health check. The cluster operator automatically
recreates them with the most up-to-date information.
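A hedged sketch; the ConfigMap and HealthCheck names below are hypothetical placeholders, so confirm the actual names in your cluster namespace before deleting anything:
# List the candidates to confirm the exact names.
kubectl get configmaps,healthchecks -n CLUSTER_NAMESPACE --kubeconfig ADMIN_KUBECONFIG
# Delete the stale network inventory ConfigMap and the periodic network health check.
kubectl delete configmap INVENTORY_CONFIGMAP_NAME -n CLUSTER_NAMESPACE --kubeconfig ADMIN_KUBECONFIG
kubectl delete healthcheck PERIODIC_NETWORK_HEALTH_CHECK_NAME -n CLUSTER_NAMESPACE --kubeconfig ADMIN_KUBECONFIG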
Network Gateway for GDC can't apply your configuration when the device
name contains a period
If you have a network device that includes a period character
(.) in the name, such as bond0.2,
Network Gateway for GDC treats the period as a separator in the
sysctl path when it runs sysctl to make changes. When
Network Gateway for GDC checks whether duplicate address detection
(DAD) is enabled, the check might fail, and so reconciliation doesn't complete.
The behavior is different between cluster versions:
1.14 and 1.15: This error only exists when you use IPv6
floating IP addresses. If you don't use IPv6 floating IP addresses, you
won't notice this issue when your device names contain a period.
1.16.0 - 1.16.2: This error always exists when your device
names contain a period.
Workaround:
Upgrade your cluster to version 1.16.3 or later.
As a workaround until you can upgrade your clusters, remove the period
(.) from the name of the device.
Upgrades and updates, Networking, Security
1.16.0
Upgrades to 1.16.0 fail when seccomp is disabled
If seccomp is disabled for your cluster
(spec.clusterSecurity.enableSeccomp set to false),
then upgrades to version 1.16.0 fail.
Google Distributed Cloud version 1.16 uses Kubernetes version 1.27.
In Kubernetes version 1.27.0 and higher, the feature for setting
seccomp profiles is GA and no longer uses a
feature gate.
This Kubernetes change causes upgrades to version 1.16.0 to fail when
seccomp is disabled in the cluster configuration. This issue
is fixed in version 1.16.1 and higher clusters. If you
have the cluster.spec.clusterSecurity.enableSeccomp field set
to false, you can upgrade to version 1.16.1 or higher.
Clusters with spec.clusterSecurity.enableSeccomp unset or
set to true are not affected.
containerd metadata might become corrupt after reboot when
/var/lib/containerd is mounted
If you have optionally mounted /var/lib/containerd, the
containerd metadata might become corrupt after a reboot. Corrupt metadata
might cause Pods to fail, including system-critical Pods.
To check if this issue affects you, see if an optional mount is defined
in /etc/fstab for /var/lib/containerd/ and has
nofail in the mount options.
Workaround:
Remove the nofail mount option in /etc/fstab,
or upgrade your cluster to version 1.15.6 or later.
Operation
1.13, 1.14, 1.15, 1.16, 1.28
Clean up stale Pods in the cluster
You might see Pods managed by a Deployment (ReplicaSet) in a
Failed state and with the status of
TaintToleration. These Pods don't use cluster resources, but
should be deleted.
You can use the following kubectl command to list the
Pods that you can clean up:
kubectl get pods -A | grep TaintToleration
The following example output shows a Pod with the
TaintToleration status:
For each Pod with the described symptoms, check the ReplicaSet that the
Pod belongs to. If the ReplicaSet is satisfied, you can delete the Pods:
Get the ReplicaSet that manages the Pod and find the
ownerRef.Kind value:
kubectl get pod POD_NAME -n NAMESPACE -o yaml
Get the ReplicaSet and verify that the status.replicas
is the same as spec.replicas:
kubectl get replicaset REPLICA_NAME -n NAMESPACE -o yaml
If the names match, delete the Pod:
kubectl delete pod POD_NAME -n NAMESPACE
Upgrades
1.16.0
etcd-events can stall when upgrading to version 1.16.0
When you upgrade an existing cluster to version 1.16.0, Pod failures
related to etcd-events can stall the operation. Specifically,
the upgrade-node job fails for the
TASK [etcd_events_install : Run etcdevents] step.
If you're affected by this issue, you see Pod failures like the
following:
The kube-apiserver Pod fails to start with the
following error:
connection error: desc="transport: Error while dialing dial tcp 127.0.0.1:2382: connect: connection refused"
The etcd-events pod fails to start with the following error:
These errors indicate eviction events or an inability to schedule Pods
due to node resources. As Network Gateway for GDC Pods have no
PriorityClass, they have the same default priority as other workloads.
When nodes are resource-constrained, the network gateway Pods might be
evicted. This behavior is particularly bad for the ang-node
DaemonSet, as those Pods must be scheduled on a specific node and can't
migrate.
Workaround:
Upgrade to 1.15 or later.
As a short-term fix, you can manually assign a
PriorityClass
to the Network Gateway for GDC components. The Google Distributed Cloud controller
overwrites these manual changes during a reconciliation process, such as
during a cluster upgrade.
Assign the system-cluster-critical PriorityClass to the
ang-controller-manager and autoscaler cluster
controller Deployments.
Assign the system-node-critical PriorityClass to the
ang-daemon node DaemonSet.
Installation, Upgrades and updates
1.15.0, 1.15.1, 1.15.2
Cluster creation and upgrades fail due to cluster name length
Creating version 1.15.0, 1.15.1, or 1.15.2 clusters or upgrading
clusters to version 1.15.0, 1.15.1, or 1.15.2 fails when the cluster name
is longer than 48 characters (version 1.15.0) or 45 characters (version
1.15.1 or 1.15.2). During cluster creation and upgrade operations,
Google Distributed Cloud creates a health check resource with a name that
incorporates the cluster name and version:
For version 1.15.0 clusters, the health check resource name is
CLUSTER_NAME-add-ons-CLUSTER_VER.
For version 1.15.1 or 1.15.2 clusters, the health check resource name is
CLUSTER_NAME-kubernetes-CLUSTER_VER.
For long cluster names, the health check resource name exceeds the
Kubernetes 63 character length restriction for label
names, which prevents the creation of the health check resource.
Without a successful health check, the cluster operation fails.
To see if you are affected by this issue, use kubectl describe
to check the failing resource:
To unblock the cluster upgrade or creation, you can bypass the
healthcheck. Use the following command to patch the healthcheck custom
resource with a passing status (status: {pass: true}):
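A sketch, assuming the health check is a namespaced custom resource with a status subresource and that your kubectl version supports --subresource:
kubectl patch healthchecks.baremetal.cluster.gke.io HEALTHCHECK_NAME \
    -n CLUSTER_NAMESPACE --kubeconfig ADMIN_KUBECONFIG \
    --subresource=status --type=merge -p '{"status":{"pass":true}}'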
Version 1.14.0 and 1.14.1 clusters with preview features can't upgrade to version 1.15.x
If version 1.14.0 and 1.14.1 clusters have a preview feature enabled,
they're blocked from successfully upgrading to version 1.15.x. This
applies to preview features like the ability to create a cluster without
kube-proxy, which is enabled with the following annotation in the cluster
configuration file:
This issue is fixed in version 1.14.2 and higher clusters.
Workaround:
If you're unable to upgrade your clusters to version 1.14.2 or higher
before upgrading to version 1.15.x, you can upgrade to version 1.15.x
directly by using a bootstrap cluster:
bmctl upgrade cluster --use-bootstrap=true
Operation
1.15
Version 1.15 clusters don't accept duplicate floating IP addresses
Network Gateway for GDC doesn't let you create new NetworkGatewayGroup
custom resources that contain IP addresses in spec.floatingIPs
that are already used in existing NetworkGatewayGroup custom
resources. This rule is enforced by a webhook in bare metal clusters
version 1.15.0 and higher. Pre-existing duplicate floating IP addresses
don't cause errors. The webhook only prevents the creation of new
NetworkGatewayGroups custom resources that contain duplicate
IP addresses.
The webhook error message identifies the conflicting IP address and the
existing custom resource that is already using it:
IP address exists in other gateway with name default
The initial documentation for advanced networking features, such as the
Egress NAT gateway, doesn't caution against duplicate IP addresses.
Initially, only the NetworkGatewayGroup resource named
default was recognized by the reconciler. Network Gateway for GDC
now recognizes all NetworkGatewayGroup custom
resources in the system namespace. Existing NetworkGatewayGroup
custom resources are honored, as is.
Workaround:
Errors happen for the creation of a new NetworkGatewayGroup
custom resource only.
To address the error:
Use the following command to list NetworkGatewayGroups
custom resources:
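For example, a sketch that assumes the resources live in the kube-system namespace:
# List the NetworkGatewayGroup custom resources.
kubectl get networkgatewaygroups -n kube-system --kubeconfig CLUSTER_KUBECONFIG
# Edit the resource that contains the duplicate floating IP address.
kubectl edit networkgatewaygroup GATEWAY_GROUP_NAME -n kube-system --kubeconfig CLUSTER_KUBECONFIG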
To apply your changes, close and save edited custom resources.
VM Runtime on GDC
1.13.7
VMs might not start on 1.13.7 clusters that use a private registry
When you enable VM Runtime on GDC on a new or upgraded version
1.13.7 cluster that uses a private registry, VMs that connect to the node
network or use a GPU might not start properly. This issue is due to some
system Pods in the vm-system namespace getting image pull
errors. For example, if your VM uses the node network, some Pods might
report image pull errors like the following:
macvtap-4x9zp   0/1   Init:ImagePullBackOff   0   70m
This issue is fixed in version 1.14.0 and higher clusters.
Workaround
If you're unable to upgrade your clusters immediately, you can pull
images manually. The following commands pull the macvtap CNI plugin image
for your VM and push it to your private registry:
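A hedged sketch of the pull, tag, and push flow; MACVTAP_IMAGE is a hypothetical placeholder for the macvtap CNI plugin image reference that matches your cluster version:
# Pull the macvtap CNI plugin image.
docker pull MACVTAP_IMAGE
# Re-tag the image for your private registry and push it.
docker tag MACVTAP_IMAGE REG_HOST/MACVTAP_IMAGE
docker push REG_HOST/MACVTAP_IMAGE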
Replace REG_HOST with the domain name of a host that you mirror locally.
Installation
1.11, 1.12
During cluster creation in the kind cluster, the gke-metrics-agent pod fails to start
During cluster creation in the kind cluster, the gke-metrics-agent pod
fails to start because of an image pull error like the following:
error="failed to pull and unpack image \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": failed to resolve reference \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": pulling from host gcr.io failed with status code [manifests 1.8.3-anthos.2]: 403 Forbidden"
Also, the bootstrap cluster's containerd log contains entries like the following:
Sep 13 23:54:20 bmctl-control-plane containerd[198]: time="2022-09-13T23:54:20.378172743Z" level=info msg="PullImage \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\" "
Sep 13 23:54:21 bmctl-control-plane containerd[198]: time="2022-09-13T23:54:21.057247258Z" level=error msg="PullImage \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\" failed" error="failed to pull and unpack image \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": failed to resolve reference \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": pulling from host gcr.io failed with status code [manifests 1.8.3-anthos.2]: 403 Forbidden"
The Pod also reports a "failing to pull" error for the following image:
gcr.io/gke-on-prem-staging/gke-metrics-agent
Workaround
Despite the errors, the cluster creation process isn't blocked. The
gke-metrics-agent pod in the kind cluster exists only to improve the
cluster creation success rate and for internal tracking and monitoring,
so you can ignore this error.
Operation, Networking
1.12, 1.13, 1.14, 1.15, 1.16, 1.28
Accessing an IPv6 Service endpoint crashes the LoadBalancer Node on
CentOS or RHEL
When you access a dual-stack Service (a Service that has both IPv4 and
IPv6 endpoints) and use the IPv6 endpoint, the LoadBalancer Node that
serves the Service might crash. This issue affects customers that use
dual-stack services with CentOS or RHEL and kernel version earlier than
kernel-4.18.0-372.46.1.el8_6.
If you believe that this issue affects you, check the kernel version on
the LoadBalancer Node using the uname -a command.
Workaround:
Update the LoadBalancer Node to kernel version
kernel-4.18.0-372.46.1.el8_6 or later. This kernel version is
available by default in CentOS and RHEL version 8.6 and later.
Networking
1.11, 1.12, 1.13, 1.14.0
Intermittent connectivity issues after Node reboot
After you restart a Node, you might see intermittent connectivity
issues for a NodePort or LoadBalancer Service. For example, you might have
intermittent TLS handshake or connection reset errors. This issue is
fixed for cluster versions 1.14.1 and higher.
To check if this issue affects you, look at the iptables
forward rules on Nodes where the backend Pod for the affected Service is
running:
sudo iptables -L FORWARD
If you see the KUBE-FORWARD rule before the
CILIUM_FORWARD rule in iptables, you might be
affected by this issue. The following example output shows a Node where
the problem exists:
Chain FORWARD (policy ACCEPT)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
CILIUM_FORWARD all -- anywhere anywhere /* cilium-feeder: CILIUM_FORWARD */
Workaround:
Restart the anetd Pod on the Node that's misconfigured. After you
restart the anetd Pod, the forwarding rule in iptables should
be configured correctly.
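For example, a sketch that locates the anetd Pod by name (NODE_NAME and ANETD_POD_NAME are placeholders):
# Find the anetd Pod that runs on the affected Node.
kubectl get pods -n kube-system -o wide | grep anetd | grep NODE_NAME
# Delete the Pod so that it's recreated with the correct forwarding rules.
kubectl delete pod ANETD_POD_NAME -n kube-system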
The following example output shows that the CILIUM_FORWARD
rule is now correctly configured before the KUBE-FORWARD
rule:
Chain FORWARD (policy ACCEPT)
target prot opt source destination
CILIUM_FORWARD all -- anywhere anywhere /* cilium-feeder: CILIUM_FORWARD */
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
Upgrades and updates
1.9, 1.10
The preview feature does not retain the original permission and owner
information
The backup and restore preview feature in 1.9.x clusters, when used with
bmctl 1.9.x, doesn't retain the original permission and owner information.
To verify whether you're affected by this issue, extract the backed-up file
using the following command:
tar -xzvf BACKUP_FILE
Workaround
Check whether the metadata.json file is present and whether its bmctlVersion is 1.9.x. If metadata.json isn't present, upgrade to a 1.10.x cluster and use bmctl 1.10.x to back up and restore.
Upgrades and updates
1.14.2
clientconfig-operator stuck in pending state with CreateContainerConfigError
If you've upgraded to or created a version 1.14.2 cluster with an OIDC/LDAP
configuration, you may see the clientconfig-operator Pod stuck
in a pending state. With this issue, there are two
clientconfig-operator Pods, with one in a running state and the
other in a pending state.
This issue applies to version 1.14.2 clusters only. Earlier cluster
versions such as 1.14.0 and 1.14.1 aren't affected. This issue is fixed in
version 1.14.3 and all subsequent releases, including 1.15.0 and later.
Workaround:
As a workaround, you can patch the clientconfig-operator
deployment to add additional security context and ensure that the deployment
is ready.
Use the following command to patch clientconfig-operator in the
target cluster:
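The exact patch isn't reproduced here. As a hypothetical sketch only, adding a default seccomp profile to the Pod security context might look like the following (the deployment namespace and the specific security context are assumptions):
kubectl patch deployment clientconfig-operator -n kube-system \
    --kubeconfig CLUSTER_KUBECONFIG --type merge \
    -p '{"spec":{"template":{"spec":{"securityContext":{"seccompProfile":{"type":"RuntimeDefault"}}}}}}'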
CLUSTER_KUBECONFIG: the path of the kubeconfig
file for the target cluster.
Operation
1.11, 1.12, 1.13, 1.14, 1.15
Certificate authority rotation fails for clusters without bundled load balancing
For clusters without bundled load balancing (spec.loadBalancer.mode set to manual), the bmctl update credentials certificate-authorities rotate
command can become unresponsive and fail with the following error: x509: certificate signed by unknown authority.
If you're affected by this issue, the bmctl command might
output the following message before becoming unresponsive:
Signing CA completed in 3/0 control-plane nodes
In this case, the command eventually fails. The rotate certificate-authority
log for a cluster with three control planes may include entries like the
following:
If you need additional assistance, contact
Google Support.
Installation, Networking
1.11, 1.12, 1.13, 1.14.0-1.14.1
ipam-controller-manager crashloops in dual-stack
clusters
When you deploy a dual-stack cluster (a cluster with both IPv4 and IPv6
addresses), the ipam-controller-manager Pod(s) might
crashloop. This behavior causes the Nodes to cycle between
Ready and NotReady states, and might cause the
cluster installation to fail. This problem can occur when the API server
is under high load.
To see if this issue affects you, check if the
ipam-controller-manager Pod(s) are failing with
CrashLoopBackOff errors:
1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, and 1.14
etcd watch starvation
Clusters running etcd version 3.4.13 or earlier may experience watch
starvation and non-operational resource watches, which can lead to the
following problems:
Pod scheduling is disrupted
Nodes are unable to register
kubelet doesn't observe pod changes
These problems can make the cluster non-functional.
This issue is fixed in Google Distributed Cloud version 1.12.9, 1.13.6,
1.14.3, and subsequent releases. These newer releases use etcd version
3.4.21. All prior versions of Google Distributed Cloud are affected by
this issue.
Workaround
If you can't upgrade immediately, you can mitigate the risk of
cluster failure by reducing the number of nodes in your cluster. Remove
nodes until the etcd_network_client_grpc_sent_bytes_total
metric is less than 300 MBps.
To view this metric in Metrics Explorer:
Go to the Metrics Explorer in the Google Cloud console:
Expand the Select a metric, enter Kubernetes Container
in the filter bar, and then use the submenus to select the metric:
In the Active resources menu, select Kubernetes Container.
In the Active metric categories menu, select Anthos.
In the Active metrics menu, select etcd_network_client_grpc_sent_bytes_total.
Click Apply.
Networking
1.11.6, 1.12.3
SR-IOV operator's vfio-pci mode "Failed" state
The SriovNetworkNodeState object's syncStatus
can report the "Failed" value for a configured node. To view the status of
a node and determine if the problem affects you, run the following
command:
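For example, a sketch that assumes the SriovNetworkNodeState objects are named after the nodes and live in the namespace where the SR-IOV operator runs (the namespace is a placeholder):
kubectl get sriovnetworknodestates NODE_NAME -n SRIOV_OPERATOR_NAMESPACE \
    -o jsonpath='{.status.syncStatus}'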
Replace NODE_NAME with the name of the node to
check.
Workaround:
If the SriovNetworkNodeState object status is "Failed",
upgrade your cluster to version 1.11.7 or later or version 1.12.4 or
later.
Upgrades and updates
1.10, 1.11, 1.12, 1.13, 1.14.0, 1.14.1
Some worker nodes aren't in a Ready state after upgrade
After the upgrade finishes, some worker nodes might have their Ready condition set to false. On the Node resource, you see an error next to the Ready condition similar to the following example:
container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
When you log into the stalled machine, the CNI configuration on the machine is empty:
sudo ls /etc/cni/net.d/
Workaround
Restart the node's anetd pod by deleting it.
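For example, a sketch that locates the Pod by name (NODE_NAME and ANETD_POD_NAME are placeholders):
# Find the anetd Pod on the stalled node, then delete it so that it restarts.
kubectl get pods -n kube-system -o wide | grep anetd | grep NODE_NAME
kubectl delete pod ANETD_POD_NAME -n kube-system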
Upgrades and updates, Security
1.10
Multiple certificate rotations from cert-manager result in inconsistency
After multiple manual or auto certificate rotations, the webhook pod,
such as anthos-cluster-operator isn't updated with the new
certificates issued by cert-manager. Any update to the cluster
custom resource fails and results in an error similar as follows:
Internal error occurred: failed calling
webhook "vcluster.kb.io": failed to call webhook: Post "https://webhook-service.kube-system.svc:443/validate-baremetal-cluster-gke-io-v1-cluster?timeout=10s": x509: certificate signed by unknown authority (possibly because of "x509:
invalid signature: parent certificate cannot sign this kind of certificate"
while trying to verify candidate authority certificate
"webhook-service.kube-system.svc")
This issue might occur in the following circumstances:
If you have done two manual cert-manager issued certificate rotations
on a cluster that is 180 days old or older and never restarted the anthos-cluster-operator.
If you have done a manual cert-manager issued
certificate rotation on a cluster that is 90 days old or older and never
restarted the anthos-cluster-operator.
Workaround
Restart the pod by terminating the anthos-cluster-operator.
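For example (the Pod name is a placeholder; anthos-cluster-operator runs in the kube-system namespace, as referenced elsewhere on this page):
# Find the anthos-cluster-operator Pod, then delete it so that it restarts
# with the current webhook certificates.
kubectl get pods -n kube-system | grep anthos-cluster-operator
kubectl delete pod ANTHOS_CLUSTER_OPERATOR_POD_NAME -n kube-system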
Upgrades and updates
1.14.0
Outdated lifecycle controller deployer pods created during user cluster upgrade
In version 1.14.0 admin clusters, one or more outdated lifecycle
controller deployer pods might be created during user cluster upgrades.
This issue applies for user clusters that were initially created at
versions lower than 1.12. The unintentionally created pods don't impede
upgrade operations, but they might be found in an unexpected state. We
recommend that you remove the outdated pods.
This issue is fixed in release 1.14.1.
Workaround:
To remove the outdated lifecycle controller deployer pods:
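A hedged sketch; the grep pattern is an assumption about how the deployer Pods are named, so confirm the names before deleting anything:
# List candidate Pods across namespaces on the admin cluster.
kubectl get pods -A --kubeconfig ADMIN_KUBECONFIG | grep lifecycle-controllers
# Delete each outdated Pod by name.
kubectl delete pod OUTDATED_POD_NAME -n POD_NAMESPACE --kubeconfig ADMIN_KUBECONFIG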
BGPSession state constantly changing due to large number
of incoming routes
Google Distributed Cloud advanced networking fails to manage BGP sessions
correctly when external peers advertise a high number of routes (about 100
or more). With a large number of incoming routes, the node-local BGP
controller takes too long to reconcile BGP sessions and fails to update the
status. The lack of status updates, or a health check, causes the session
to be deleted for being stale.
Undesirable BGP session behavior that you might notice, and that
indicates a problem, includes the following:
Continuous bgpsession deletion and recreation.
bgpsession.status.state never becomes
Established
Routes failing to advertise or being repeatedly advertised and
withdrawn.
BGP load balancing problems might be noticeable with connectivity
issues to LoadBalancer services.
BGP FlatIP issue might be noticeable with connectivity
issues to Pods.
To determine if your BGP issues are caused by the remote peers
advertising too many routes, use the following commands to review the
associated statuses and output:
Use kubectl get bgpsessions on the affected cluster.
The output shows bgpsessions with state "Not Established"
and the last report time continuously counts up to about 10-12 seconds
before it appears to reset to zero.
The output of kubectl get bgpsessions shows that the
affected sessions are being repeatedly recreated:
Log messages indicate that stale BGP sessions are being deleted:
kubectl logs ang-controller-manager-POD_NUMBER
Replace POD_NUMBER with the leader
pod in your cluster.
Workaround:
Reduce or eliminate the number of routes advertised from the remote
peer to the cluster with an export policy.
In cluster versions 1.14.2 and later, you can also disable the
feature that processes received routes by using an
AddOnConfiguration. Add the
--disable-received-routes argument to the ang-daemon
daemonset's bgpd container.
Networking
1.14, 1.15, 1.16, 1.28
Application timeouts caused by conntrack table
insertion failures
Clusters running on an Ubuntu OS that uses kernel 5.15 or
higher are susceptible to netfilter connection tracking (conntrack) table
insertion failures. Insertion failures can occur even when the conntrack
table has room for new entries. The failures are caused by changes in
kernel 5.15 and higher that restrict table insertions based on chain
length.
To see if you are affected by this issue, you can check the in-kernel
connection tracking system statistics with the following command:
If a chaintoolong value in the response is a non-zero
number, you're affected by this issue.
Workaround
The short-term mitigation is to increase the size of both the netfilter
hash table (nf_conntrack_buckets) and the netfilter
connection tracking table (nf_conntrack_max). Use the
following commands on each cluster node to increase the size of the
tables:
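A sketch of how the sizes might be raised, assuming the values are exposed through sysctl on your kernel (on some kernels the bucket count must instead be written to /sys/module/nf_conntrack/parameters/hashsize):
sysctl -w net.netfilter.nf_conntrack_buckets=TABLE_SIZE
sysctl -w net.netfilter.nf_conntrack_max=TABLE_SIZE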
Replace TABLE_SIZE with the new size in bytes. The
default table size value is 262144. We suggest that you set a
value equal to 65,536 times the number of cores on the node. For example,
if your node has eight cores, set the table size to 524288.
Can't restore cluster backups with bmctl for some versions
We recommend that you back up your clusters before you upgrade so that
you can restore the earlier version if the upgrade doesn't succeed.
A problem with the bmctl restore cluster command causes it to
fail to restore backups of clusters with the identified versions. This
issue is specific to upgrades, where you're restoring a backup of an earlier
version.
If your cluster is affected, the bmctl restore cluster
log contains the following error:
Error: failed to extract image paths from profile: anthos version VERSION not supported
Workaround:
Until this issue is fixed, we recommend that you use the instructions in
Back up and restore clusters
to back up your clusters manually and restore them manually, if necessary.
Networking
1.10, 1.11, 1.12, 1.13, 1.14.0-1.14.2
NetworkGatewayGroup crashes if there's no IP address on
the interface
NetworkGatewayGroup fails to create daemons for nodes that
don't have both IPv4 and IPv6 interfaces on them. This causes features like
BGP LB and EgressNAT to fail. If you check the logs of the failing
ang-node Pod in the kube-system namespace, errors
similar to the following example are displayed when an IPv6 address is
missing:
ANGd.Setup Failed to create ANG daemon {"nodeName": "bm-node-1", "error":
"creating NDP client failed: ndp: address \"linklocal\" not found on interface \"ens192\""}
In the previous example, there's no IPv6 address on the
ens192 interface. Similar ARP errors are displayed if the
node is missing an IPv4 address.
NetworkGatewayGroup tries to establish an ARP connection and
an NDP connection to the link local IP address. If the IP address doesn't
exist (IPv4 for ARP, IPv6 for NDP) then the connection fails and the daemon
doesn't continue.
This issue is fixed in release 1.14.3.
Workaround:
Connect to the node using SSH and add an IPv4 or IPv6 address to the
link that contains the node IP. In the previous example log entry, this
interface was ens192:
ip address add dev INTERFACE scope link ADDRESS
Replace the following:
INTERFACE: The interface for your
node, such as ens192.
ADDRESS: The IP address and subnet
mask to apply to the interface.
Reset/Deletion
1.10, 1.11, 1.12, 1.13.0-1.13.2
anthos-cluster-operator crash loop when removing a
control plane node
When you try to remove a control plane node by removing the IP address
from the Cluster.Spec, the anthos-cluster-operator
enters into a crash loop state that blocks any other operations.
Workaround:
This issue is fixed in versions 1.13.3, 1.14.0, and later. All other
versions are affected. Upgrade to one of the fixed versions.
IP_ADDRESS: The IP address of
the node in a crash loop state.
CLUSTER_NAMESPACE: The cluster
namespace.
Installation
1.13.1, 1.13.2 and 1.13.3
kubeadm join fails in large clusters due to token
mismatch
When you install clusters with a large number of nodes, you
might see a kubeadm join error message similar to the
following example:
TASK [kubeadm : kubeadm join --config /dev/stdin --ignore-preflight-errors=all] ***
fatal: [10.200.0.138]: FAILED! => {"changed": true, "cmd": "kubeadm join
--config /dev/stdin --ignore-preflight-errors=all", "delta": "0:05:00.140669", "end": "2022-11-01 21:53:15.195648", "msg": "non-zero return code", "rc": 1,
"start": "2022-11-01 21:48:15.054979", "stderr": "W1101 21:48:15.082440 99570 initconfiguration.go:119]
Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future.
Automatically prepending scheme \"unix\" to the \"criSocket\" with value \"/run/containerd/containerd.sock\". Please update your configuration!\nerror
execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID \"yjcik0\"\n
To see the stack trace of this error execute with --v=5 or higher", "stderr_lines":
["W1101 21:48:15.082440 99570 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future.
Automatically prepending scheme \"unix\" to the \"criSocket\" with value \"/run/containerd/containerd.sock\".
Please update your configuration!", "error execution phase preflight: couldn't validate the identity of the API Server:
could not find a JWS signature in the cluster-info ConfigMap for token ID \"yjcik0\"",
"To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[preflight]
Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}
Workaround:
This issue is resolved in Google Distributed Cloud version 1.13.4 and later.
If you need to use an affected version, first create a cluster with
fewer than 20 nodes, and then resize the cluster to add more nodes
after the installation is complete.
Logging and monitoring
1.10, 1.11, 1.12, 1.13.0
Low CPU limit for metrics-server in Edge clusters
In Google Distributed Cloud Edge clusters, low CPU limits for
metrics-server can cause frequent restarts of
metrics-server. Horizontal Pod Autoscaling (HPA) doesn't work
due to metrics-server being unhealthy.
If the metrics-server CPU limit is less than 40m,
your clusters can be affected. To check the metrics-server
CPU limits, review one of the following files:
Remove the --config-dir=/etc/config line and increase the
CPU limits, as shown in the following example:
[...]
- command:
- /pod_nanny
# - --config-dir=/etc/config # <--- Remove this line
- --container=metrics-server
- --cpu=50m # <--- Increase CPU, such as to 50m
- --extra-cpu=0.5m
- --memory=35Mi
- --extra-memory=4Mi
- --threshold=5
- --deployment=metrics-server
- --poll-period=30000
- --estimator=exponential
- --scale-down-delay=24h
- --minClusterSize=5
- --use-metrics=true
[...]
Save and close the metrics-server resource to apply the changes.
Networking
1.14, 1.15, 1.16
Direct NodePort connection to hostNetwork Pod doesn't work
Connection to a Pod enabled with hostNetwork using NodePort
Service fails when the backend Pod is on the same node as the targeted
NodePort. This issue affects LoadBalancer Services when used with
hostNetwork-ed Pods. With multiple backends, there can be sporadic
connection failures.
This issue is caused by a bug in the eBPF program.
Workaround:
When using a NodePort Service, don't target the node on which any of
the backend Pods run. When using the LoadBalancer Service, make sure the
hostNetwork-ed Pods don't run on LoadBalancer nodes.
Upgrades and updates
1.12.3, 1.13.0
1.13.0 admin clusters can't manage 1.12.3 user clusters
Admin clusters that run version 1.13.0 can't manage user clusters that
run version 1.12.3. Operations against a version 1.12.3 user cluster fail.
Workaround:
Upgrade your admin cluster to version 1.13.1, or upgrade the user
cluster to the same version as the admin cluster.
Upgrades and updates
1.12
Upgrading to 1.13.x is blocked for admin clusters with worker node pools
Version 1.13.0 and higher admin clusters can't contain worker node pools.
Upgrades to version 1.13.0 or higher for admin clusters with worker node
pools are blocked. If your admin cluster upgrade is stalled, you can confirm
whether worker node pools are the cause by checking for the following error in the
upgrade-cluster.log file inside the bmctl-workspace
folder:
Operation failed, retrying with backoff. Cause: error creating "baremetal.cluster.gke.io/v1, Kind=NodePool" cluster-test-cluster-2023-06-06-140654/np1: admission webhook "vnodepool.kb.io" denied the request: Adding worker nodepool to Admin cluster is disallowed.
Workaround:
Before upgrading, move all worker node pools to user clusters. For
instructions to add and remove node pools, see
Manage node pools in a cluster.
Upgrades and updates
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.28
Errors when updating resources using kubectl apply
If you update existing resources like the ClientConfig or
Stackdriver custom resources using kubectl apply,
the controller might return an error or revert your input and planned changes.
For example, you might try to edit the Stackdriver custom
resource as follows by first getting the resource, and then applying an updated version:
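As an illustrative sketch only (the resource name stackdriver and the kube-system namespace are assumptions, not taken from this page), the sequence might look like the following:
kubectl --kubeconfig KUBECONFIG get stackdriver stackdriver -n kube-system -o yaml > stackdriver.yaml
Edit stackdriver.yaml as needed, and then apply the updated version:
kubectl --kubeconfig KUBECONFIG apply -f stackdriver.yaml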
Corrupted backlog chunks cause stackdriver-log-forwarder
crashloop
The stackdriver-log-forwarder crashloops if it tries to
process a corrupted backlog chunk. The following example errors are shown in
the container logs:
[2022/09/16 02:05:01] [error] [storage] format check failed: tail.1/1-1659339894.252926599.flb
[2022/09/16 02:05:01] [error] [engine] could not segregate backlog chunks
When this crashloop occurs, you can't see logs in Cloud Logging.
Workaround:
To resolve these errors, complete the following steps:
Identify the corrupted backlog chunks. Review the following example
error messages:
[2022/09/16 02:05:01] [error] [storage] format check failed: tail.1/1-1659339894.252926599.flb
[2022/09/16 02:05:01] [error] [engine] could not segregate backlog chunks
In this example, the file tail.1/1-1659339894.252926599.flb that's stored in var/log/fluent-bit-buffers/tail.1/ is at
fault. Every *.flb file that fails the format check must be removed.
End the running pods for stackdriver-log-forwarder:
Make sure that the DaemonSet has cleaned up all the nodes. The
output of the following two commands should be equal to the number of
nodes in the cluster:
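The following commands are a sketch of this check; the app=stackdriver-log-forwarder label selector is an assumption, and KUBECONFIG is the path to your cluster kubeconfig file:
kubectl --kubeconfig KUBECONFIG get nodes --no-headers | wc -l
kubectl --kubeconfig KUBECONFIG -n kube-system get pods -l app=stackdriver-log-forwarder --no-headers | wc -l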
Restarting Dataplane V2 (anetd) on clusters can result in
existing VMs unable to attach to non-pod-network
On multi-nic clusters, restarting Dataplane V2 (anetd) can
result in virtual machines being unable to attach to networks. An error
similar to the following might be observed in the anetd
pod logs:
could not find an allocator to allocate the IP of the multi-nic endpoint
Workaround:
You can restart the VM as a quick fix. To avoid a recurrence of the
issue, upgrade your cluster to version 1.14.1 or later.
Operation
1.13, 1.14.0, 1.14.1
gke-metrics-agent has no memory limit on Edge profile clusters
Depending on the cluster's workload, the gke-metrics-agent
might use greater than 4608 MiB of memory. This issue only affects
Google Distributed Cloud for bare metal Edge profile clusters. Default profile clusters aren't
impacted.
Workaround:
Upgrade your cluster to version 1.14.2 or later.
Installation
1.12, 1.13
Cluster creation might fail due to race conditions
When you create clusters using kubectl, the preflight
check might never finish due to race conditions. As a result, cluster creation
might fail in certain cases.
The preflight check reconciler creates a SecretForwarder
to copy the default ssh-key secret to the target namespace.
Typically, the preflight check relies on the owner references and
reconciles after the SecretForwarder is complete. However, in
rare cases the owner references of the SecretForwarder can
lose the reference to the preflight check, causing the preflight check to
get stuck. As a result, cluster creation fails. To continue the
reconciliation for the controller-driven preflight check, delete the
cluster-operator pod or delete the preflight-check resource. When you
delete the preflight-check resource, another one is created and the
reconciliation continues. Alternatively, you can upgrade your existing clusters
(that were created with an earlier version) to a fixed version.
Networking
1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15
Reserved IP addresses aren't released when using whereabouts plugin with the multi-NIC feature
With the multi-NIC feature, if you're using the CNI whereabouts plugin
and you use the CNI DEL operation to delete a network interface for a Pod,
some reserved IP addresses might not be released properly. This happens
when the CNI DEL operation is interrupted.
You can verify the unused
IP address reservations of the Pods by running the following command:
kubectl get ippools -A --kubeconfig KUBECONFIG_PATH
Workaround:
Manually delete the IP addresses (ippools) that aren't used.
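For example, a hypothetical unused ippool named my-pod-ippool in the default namespace could be removed as follows:
kubectl delete ippool my-pod-ippool -n default --kubeconfig KUBECONFIG_PATH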
Installation
1.10, 1.11.0, 1.11.1, 1.11.2
Node Problem Detector fails in 1.10.4 user cluster
The Node Problem Detector might fail in version 1.10.x user clusters,
when version 1.11.0, 1.11.1, or 1.11.2 admin clusters
manage 1.10.x user clusters. When the Node Problem Detector fails, the log
gets updated with the following error message:
Upgrade the admin cluster to 1.11.3 to resolve the issue.
Operation
1.14
1.14 island mode IPv4 cluster nodes have a pod CIDR mask
size of 24
In release 1.14, the maxPodsPerNode setting isn't taken
into account for
island mode
clusters, so the nodes are assigned a pod CIDR mask size of 24
(256 IP addresses). This might cause the cluster to run out of pod IP
addresses earlier than expected. For example, if your cluster has a pod
CIDR mask size of 22, each node is assigned a pod CIDR mask of 24 and
the cluster can only support up to 4 nodes. Your cluster may
also experience network instability in a period of high pod churn when maxPodsPerNode is set to 129 or higher and there isn't enough
overhead in the pod CIDR for each node.
If your cluster is affected, the anetd pod reports the
following error when you add a new node to the cluster and there's no
podCIDR available:
error="required IPv4 PodCIDR not available"
Workaround
Use the following steps to resolve the issue:
Upgrade to 1.14.1 or a later version.
Remove the worker nodes and add them back.
Remove the control plane nodes and add them back, preferably one by
one to avoid cluster downtime.
Upgrades and updates
1.14.0, 1.14.1
Cluster upgrade rollback failure
An upgrade rollback might fail for version 1.14.0 or 1.14.1 clusters.
If you upgrade a cluster from 1.14.0 to 1.14.1 and then try to roll back to
1.14.0 by using the bmctl restore cluster command, an error
like the following example might be returned:
Replace HEALTHCHECK_RESOURCE_NAME with
the name of the healthcheck resources.
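The deletion step that this replacement refers to typically targets the stale healthcheck resources. The following is a hedged sketch only; the healthchecks.baremetal.cluster.gke.io resource type and the namespace are assumptions:
kubectl --kubeconfig ADMIN_KUBECONFIG delete healthchecks.baremetal.cluster.gke.io HEALTHCHECK_RESOURCE_NAME -n CLUSTER_NAMESPACE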
Rerun the bmctl restore cluster command.
Networking
1.12.0
Service external IP address does
not work in flat mode
In a cluster that has flatIPv4 set to true,
Services of type LoadBalancer are not accessible by their
external IP addresses.
This issue is fixed in version 1.12.1.
Workaround:
In the cilium-config ConfigMap, set
enable-415 to "true", and then restart
the anetd Pods.
Upgrades and updates
1.13.0, 1.14
In-place upgrades from 1.13.0 to 1.14.x never finish
When you try to do an in-place upgrade from 1.13.0 to
1.14.x using bmctl 1.14.0 and the
--use-bootstrap=false flag, the upgrade never finishes.
An error with the preflight-check operator causes the
cluster to never schedule the required checks, which means the preflight
check never finishes.
Workaround:
Upgrade to 1.13.1 first before you upgrade to 1.14.x. An in-place
upgrade from 1.13.0 to 1.13.1 should work. Or, upgrade from 1.13.0 to
1.14.x without the --use-bootstrap=false flag.
Upgrades and updates, Security
1.13 and 1.14
Clusters upgraded to 1.14.0 lose master taints
The control plane nodes require one of two specific taints to prevent
workload pods from being scheduled on them. When you upgrade version 1.13
clusters to version 1.14.0, the control plane nodes lose the
following required taints:
node-role.kubernetes.io/master:NoSchedule
node-role.kubernetes.io/master:PreferNoSchedule
This problem doesn't cause upgrade failures, but pods that aren't
supposed to run on the control plane nodes may start doing so. These
workload pods can overwhelm control plane nodes and lead to cluster
instability.
Determine if you're affected
Find control plane nodes, use the following command:
If neither of the required taints is listed, then you're affected.
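As an illustrative sketch (not necessarily the original command, and the node-role.kubernetes.io/control-plane label selector is an assumption), you can list control plane nodes together with their taint keys:
kubectl --kubeconfig KUBECONFIG get nodes -l node-role.kubernetes.io/control-plane -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'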
Workaround
Use the following steps for each control plane node of your affected
version 1.14.0 cluster to restore proper function. These steps are for the
node-role.kubernetes.io/master:NoSchedule taint and related
pods. If you intend for the control plane nodes to use the PreferNoSchedule
taint, then adjust the steps accordingly.
VM creation fails intermittently with upload errors
Creating a new Virtual Machine (VM) with the kubectl virt create vm
command fails infrequently during image upload. This issue applies to
both Linux and Windows VMs. The error looks something like the following
example:
Retry the kubectl virt create vm command to create your VM.
Upgrades and updates, Logging and monitoring
1.11
Managed collection components in 1.11 clusters aren't preserved in
upgrades to 1.12
Managed collection components are part of Managed Service for Prometheus.
If you manually deployed
managed collection
components in the gmp-system namespace of your
version 1.11 clusters, the associated resources aren't
preserved when you upgrade to version 1.12.
Starting with version 1.12.0 clusters, Managed Service
for Prometheus components in the gmp-system namespace and
related custom resource definitions are managed by
stackdriver-operator with the enableGMPForApplications
field. The enableGMPForApplications field defaults to
true, so if you manually deploy Managed Service for Prometheus
components in the namespace before upgrading to version 1.12, the
resources are deleted by stackdriver-operator.
Workaround
To preserve manually managed collection resources:
Back up all existing PodMonitoring custom resources.
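A minimal sketch of such a backup, assuming the PodMonitoring resources belong to the monitoring.googleapis.com API group (an assumption):
kubectl --kubeconfig KUBECONFIG get podmonitorings.monitoring.googleapis.com -A -o yaml > podmonitorings-backup.yaml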
This is most likely to occur with version 1.12 Docker clusters that
were upgraded from 1.11, as that upgrade doesn't require the annotation
to maintain the Docker container runtime. In this case, clusters don't have
the annotation when upgrading to 1.13. Note that starting with
version 1.13, containerd is the only permitted container runtime.
Workaround:
If you're affected by this problem, update the cluster resource with
the missing annotation. You can add the annotation either while the
upgrade is running or after canceling and before retrying the upgrade.
Installation
1.11
bmctl exits before cluster creation completes
Cluster creation may fail for Google Distributed Cloud version 1.11.0
(this issue is fixed in Google Distributed Cloud release 1.11.1). In some
cases, the bmctl create cluster command exits early and
writes errors like the following to the logs:
The failed operation produces artifacts, but the cluster isn't
operational. If this issue affects you, use the following steps to clean
up artifacts and create a cluster:
View workaround steps
To delete cluster artifacts and reset the node machine, run the
following command:
bmctl reset -c USER_CLUSTER_NAME
To start the cluster creation operation, run the following command:
The --keep-bootstrap-cluster flag is important if this
command fails.
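A sketch of that command (the exact flags in your environment may differ):
bmctl create cluster -c USER_CLUSTER_NAME --keep-bootstrap-cluster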
If the cluster creation command succeeds, you can skip the remaining
steps. Otherwise, continue.
Run the following command to get the version for the bootstrap cluster:
To delete the bootstrap cluster, run the following command:
bmctl reset bootstrap
Installation, VM Runtime on GDC
1.11, 1.12
Installation reports VM runtime reconciliation error
The cluster creation operation may report an error similar to the
following:
I0423 01:17:20.895640 3935589 logs.go:82] "msg"="Cluster reconciling:" "message"="Internal error occurred: failed calling webhook \"vvmruntime.kb.io\": failed to call webhook: Post \"https://vmruntime-webhook-service.kube-system.svc:443/validate-vm-cluster-gke-io-v1vmruntime?timeout=10s\": dial tcp 10.95.5.151:443: connect: connection refused" "name"="xxx" "reason"="ReconciliationError"
Workaround
This error is benign and you can safely ignore it.
Installation
1.10, 1.11, 1.12
Cluster creation fails when using multi-NIC, containerd,
and HTTPS proxy
Cluster creation fails when you have the following combination of
conditions:
Cluster is configured to use containerd as the
container runtime (nodeConfig.containerRuntime set to
containerd in the cluster configuration file, the default
for Google Distributed Cloud version 1.13 and higher).
Cluster is configured to provide multiple network interfaces,
multi-NIC, for pods (clusterNetwork.multipleNetworkInterfaces
set to true in the cluster configuration file).
Cluster is configured to use a proxy (spec.proxy.url is
specified in the cluster configuration file). Even though cluster
creation fails, this setting is propagated when you attempt to create a
cluster. You may see this proxy setting as an HTTPS_PROXY environment variable or in your containerd configuration
(/etc/systemd/system/containerd.service.d/09-proxy.conf).
Workaround
Append service CIDRs (clusterNetwork.services.cidrBlocks)
to the NO_PROXY environment variable on all node machines.
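For example, if your service CIDR is 10.96.0.0/20 (a hypothetical value), the proxy configuration on each node might include a line like the following; the exact file where NO_PROXY is set depends on your operating system and proxy setup:
NO_PROXY=10.96.0.0/20,localhost,127.0.0.1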
Installation
1.10, 1.11, 1.12
Failure on systems with restrictive umask
setting
Google Distributed Cloud release 1.10.0 introduced a rootless control
plane feature that runs all the control plane components as a non-root
user. Running all components as a non-root user may cause installation
or upgrade failures on systems with a more restrictive
umask setting of 0077.
Workaround
Reset the control plane nodes and change the umask setting
to 0022 on all the control plane machines. After the machines
have been updated, retry the installation.
Alternatively, you can change the directory and file permissions of
/etc/kubernetes on the control-plane machines for the
installation or upgrade to proceed.
Make /etc/kubernetes and all its subdirectories world
readable: chmod o+rx.
Make all the files owned by root user under the
directory (recursively) /etc/kubernetes world readable (chmod o+r). Exclude private key files (.key) from these
changes as they are already created with correct ownership and
permissions.
Make /usr/local/etc/haproxy/haproxy.cfg world
readable.
Make /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml
world readable.
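A minimal sketch of these permission changes, run as root on each control plane machine (the find expressions are one possible way to script the recursive changes while excluding .key files):
chmod o+rx /etc/kubernetes
find /etc/kubernetes -type d -exec chmod o+rx {} \;
find /etc/kubernetes -type f ! -name '*.key' -exec chmod o+r {} \;
chmod o+r /usr/local/etc/haproxy/haproxy.cfg
chmod o+r /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml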
Installation
1.10, 1.11, 1.12, 1.13
Control group v2 incompatibility
Control group v2 (cgroup v2) isn't supported in versions 1.13 and
earlier of Google Distributed Cloud. However, version 1.14 supports cgroup
v2 as a Preview
feature. The presence
of /sys/fs/cgroup/cgroup.controllers indicates that your
system uses cgroup v2.
Workaround
If your system uses cgroup v2, upgrade your cluster to version 1.14.
For installations triggered by admin or hybrid clusters (in other
words, clusters not created with bmctl, like user clusters),
the preflight check does not verify Google Cloud service account
credentials or their associated permissions.
When installing bare metal clusters on vSphere VMs, you must set the
tx-udp_tnl-segmentation and
tx-udp_tnl-csum-segmentation flags to off. These flags are
related to the hardware segmentation offload done by the vSphere driver
VMXNET3 and they don't work with the GENEVE tunnel of
bare metal clusters.
Workaround
Run the following command on each node to check the current values for
these flags:
ethtool -k NET_INTFC | grep segm
Replace NET_INTFC with the network
interface associated with the IP address of the node.
The response should have entries like the following:
Sometimes in RHEL 8.4, ethtool shows these flags are off
while they aren't. To explicitly set these flags to off, toggle the flags
on and then off with the following commands:
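A sketch of the toggle, replacing NET_INTFC with your node's network interface:
ethtool -K NET_INTFC tx-udp_tnl-segmentation on
ethtool -K NET_INTFC tx-udp_tnl-segmentation off
ethtool -K NET_INTFC tx-udp_tnl-csum-segmentation on
ethtool -K NET_INTFC tx-udp_tnl-csum-segmentation off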
This flag change does not persist across reboots. Configure the startup
scripts to explicitly set these flags when the system boots.
Upgrades and updates
1.10
bmctl can't create, update, or reset lower version user
clusters
The bmctl CLI can't create, update, or reset a user
cluster with a lower minor version, regardless of the admin cluster
version. For example, you can't use bmctl with a version of
1.N.X to reset a user cluster of version
1.N-1.Y, even if the admin cluster is also at version
1.N.X.
If you are affected by this issue, you should see the logs similar to
the following when you use bmctl:
Use kubectl to create, edit, or delete the user cluster
custom resource inside the admin cluster.
The ability to upgrade user clusters is unaffected.
Upgrades and updates
1.12
Cluster upgrades to version 1.12.1 may stall
Upgrading clusters to version 1.12.1 sometimes stalls due to the API
server becoming unavailable. This issue affects all cluster types and all
supported operating systems. When this issue occurs, the bmctl
upgrade cluster command can fail at multiple points, including during
the second phase of preflight checks.
Workaround
You can check your upgrade logs to determine if you are affected by
this issue. Upgrade logs are located in
/baremetal/bmctl-workspace/CLUSTER_NAME/log/upgrade-cluster-TIMESTAMP by default.
The upgrade-cluster.log may contain errors like the following:
HAProxy and Keepalived must be running on each control plane node before you
reattempt to upgrade your cluster to version 1.12.1. Use the
crictl command-line interface
on each node to check to see if the haproxy and
keepalived containers are running:
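For example (a sketch; output formatting can vary by crictl version):
crictl ps | grep -E 'haproxy|keepalived'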
If either HAProxy or Keepalived isn't running on a node, restart
kubelet on the node:
systemctl restart kubelet
Upgrades and updates, VM Runtime on GDC
1.11, 1.12
Upgrading clusters to version 1.12.0 or higher fails when
VM Runtime on GDC is enabled
In version 1.12.0 clusters, all resources related to
VM Runtime on GDC are migrated to the vm-system
namespace to better support the VM Runtime on GDC GA release. If
you have VM Runtime on GDC enabled in a version 1.11.x or lower
cluster, upgrading to version 1.12.0 or higher fails unless you first
disable VM Runtime on GDC. When you're affected by this issue, the
upgrade operation reports the following error:
Failed to upgrade cluster: cluster isn't upgradable with vmruntime enabled from
version 1.11.x to version 1.12.0: please disable VMruntime before upgrade to
1.12.0 and higher version
Upgrade stuck at error during manifests operations
In some situations, cluster upgrades fail to complete and the
bmctl CLI becomes unresponsive. This problem can be caused by
an incorrectly updated resource. To determine if you're affected by this
issue and to correct it, check the anthos-cluster-operator
logs and look for errors similar to the following entries:
controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred: ... {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
These entries are a symptom of an incorrectly updated resource, where
{RESOURCE_NAME} is the name of the problem resource.
Workaround
If you find these errors in your logs, complete the following steps:
Use kubectl edit to remove the
kubectl.kubernetes.io/last-applied-configuration annotation
from the resource contained in the log message.
Save and apply your changes to the resource.
Retry the cluster upgrade.
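As an alternative to interactive editing, the following sketch removes the annotation with kubectl annotate (the trailing dash removes the annotation); the resource kind, name, and namespace are placeholders:
kubectl --kubeconfig KUBECONFIG annotate RESOURCE_KIND RESOURCE_NAME -n NAMESPACE kubectl.kubernetes.io/last-applied-configuration-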
Upgrades and updates
1.10, 1.11, 1.12
Upgrades are blocked for clusters with features that use Network Gateway for GDC
Cluster upgrades from 1.10.x to 1.11.x fail for clusters that use
either egress NAT gateway or
bundled load-balancing with
BGP. These features both use Network Gateway for GDC. Cluster upgrades
get stuck at the Waiting for upgrade to complete...
command-line message and the anthos-cluster-operator logs errors
like the following:
apply run failed ... MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field
is immutable...
Workaround
To unblock the upgrade, run the following commands against the cluster
you are upgrading:
Nodes uncordoned if you don't use the maintenance mode procedure
If you run version 1.12.0 clusters
(anthosBareMetalVersion: 1.12.0) or lower and manually use
kubectl cordon on a node, Google Distributed Cloud for bare metal might uncordon the
node before you're ready, in an effort to reconcile the expected state.
Workaround
For version 1.12.0 and lower clusters, use
maintenance mode to
cordon and drain nodes safely.
In version 1.12.1 (anthosBareMetalVersion: 1.12.1) or
higher, Google Distributed Cloud for bare metal won't uncordon your nodes unexpectedly when
you use kubectl cordon.
Operation
1.11
Version 1.11 admin clusters using a registry mirror can't manage version
1.10 clusters
If your admin cluster is on version 1.11 and uses a registry mirror, it
can't manage user clusters that are on a lower minor version. This issue
affects reset, update, and upgrade operations on the user cluster.
To determine whether this issue affects you, check your logs for
cluster operations, such as create, upgrade, or reset. These logs are
located in the bmctl-workspace/CLUSTER_NAME/
folder by default. If you're affected by the issue, your logs contain the
following error message:
flag provided but not defined: -registry-mirror-host-to-endpoints
Operation
1.10, 1.11
kubeconfig Secret overwritten
The bmctl check cluster command, when run on user
clusters, overwrites the user cluster kubeconfig Secret with the admin
cluster kubeconfig. Overwriting the file causes standard cluster
operations, such as updating and upgrading, to fail for affected user
clusters. This problem applies to cluster versions 1.11.1
and earlier.
To determine if this issue affects a user cluster, run the following
command:
ADMIN_KUBECONFIG: the path to the
admin cluster kubeconfig file.
USER_CLUSTER_NAMESPACE: the
namespace for the cluster. By default, the cluster namespace name
is the cluster name prefixed with
cluster-. For example, if you name your cluster
test, the default namespace is cluster-test.
USER_CLUSTER_NAME: the name of the
user cluster to check.
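A hedged sketch of such a check, assuming the Secret is named USER_CLUSTER_NAME-kubeconfig and stores the kubeconfig under the value data key (both assumptions):
kubectl --kubeconfig ADMIN_KUBECONFIG get secret USER_CLUSTER_NAME-kubeconfig -n USER_CLUSTER_NAMESPACE -o jsonpath='{.data.value}' | base64 --decode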
If the cluster name in the output (see
contexts.context.cluster in the following sample output) is
the admin cluster name, then the specified user cluster is affected.
The following steps restore function to an affected user cluster
(USER_CLUSTER_NAME):
Locate the user cluster kubeconfig file. Google Distributed Cloud for bare metal
generates the kubeconfig file on the admin workstation when you create a
cluster. By default, the file is in the
bmctl-workspace/USER_CLUSTER_NAME
directory.
Verify that the kubeconfig file is the correct user cluster kubeconfig:
Replace PATH_TO_GENERATED_FILE with the
path to the user cluster kubeconfig file. The response returns details
about the nodes for the user cluster. Confirm the machine names are
correct for your cluster.
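For example (a sketch):
kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE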
Run the following command to delete the corrupted kubeconfig file in
the admin cluster:
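A hedged sketch of the deletion, under the same Secret naming assumption described earlier:
kubectl --kubeconfig ADMIN_KUBECONFIG delete secret USER_CLUSTER_NAME-kubeconfig -n USER_CLUSTER_NAMESPACE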
If you use containerd as the container runtime, running the snapshot
command as a non-root user requires /usr/local/bin to be in the user's PATH.
Otherwise, the command fails with a crictl: command not found
error.
When you aren't logged in as the root user, sudo is used
to run the snapshot commands. The sudo PATH can differ from the
root profile and may not contain /usr/local/bin.
Workaround
Update the secure_path in /etc/sudoers to
include /usr/local/bin. Alternatively, create a symbolic link
for crictl in another /bin directory.
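For example, a sketch of the symbolic link option:
ln -s /usr/local/bin/crictl /usr/bin/crictl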
Logging and monitoring
1.10
stackdriver-log-forwarder has [parser:cri] invalid
time format warning logs
If the
container runtime
interface (CRI) parser
uses an incorrect regular expression for parsing time, the logs for the
stackdriver-log-forwarder Pod contain errors and warnings
like the following:
Your edited resource should look similar to the following:
[PARSER]
    # https://rubular.com/r/Vn30bO78GlkvyB
    Name        cri
    Format      regex
    # The timestamp is described in https://www.rfc-editor.org/rfc/rfc3339#section-5.6
    Regex       ^(?<time>[0-9]{4}-[0-9]{2}-[0-9]{2}[Tt][0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?(?:[Zz]|[+-][0-9]{2}:[0-9]{2})) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    Time_Keep   off
For cluster versions 1.10 to 1.15, some customers have
found unexpectedly high billing for Metrics volume on the
Billing page. This issue affects you only when all of the
following circumstances apply:
Application monitoring is enabled (enableStackdriverForApplications=true)
Application Pods have the prometheus.io/scrap=true
annotation
To confirm whether you are affected by this issue,
list your
user-defined metrics. If you see billing for unwanted metrics, then this
issue applies to you.
Workaround
If you are affected by this issue, we recommend that you upgrade your
clusters to version 1.12 and switch to the new application monitoring solution,
managed-service-for-prometheus,
which addresses this issue:
Separate flags to control the collection of application logs versus
application metrics
Bundled Google Cloud Managed Service for Prometheus
If you can't upgrade to version 1.12, use the following steps:
Find the source Pods and Services that have the unwanted billing:
Remove the prometheus.io/scrap=true annotation from the
Pod or Service.
Logging and monitoring
1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.28
Edits to metrics-server-config aren't persisted
High pod density can, in extreme cases, create excessive logging and
monitoring overhead, which can cause Metrics Server to stop and restart. You
can edit the metrics-server-config ConfigMap to allocate
more resources to keep Metrics Server running. However, due to reconciliation,
edits made to metrics-server-config can get
reverted to the default value during a cluster update or upgrade operation.
Metrics Server isn't affected immediately, but the next time
it restarts, it picks up the reverted ConfigMap and is vulnerable to excessive
overhead, again.
Workaround
For 1.11.x, you can script the ConfigMap edit and perform it along
with updates or upgrades to the cluster. For 1.12 and onward, contact support.
Several Google Distributed Cloud software-only metrics have been deprecated and, starting with
Google Distributed Cloud release 1.11, data is no longer collected for these
deprecated metrics. If you use these metrics in any of your alerting
policies, there won't be any data to trigger the alerting condition.
The following table lists the individual metrics that have been
deprecated and the metric that replaces them.
In cluster versions lower than 1.11, the policy definition
file for the recommended Anthos on baremetal node cpu usage exceeds
80 percent (critical) alert uses the deprecated metrics. The
node-cpu-usage-high.json JSON definition file is updated for
releases 1.11.0 and later.
Workaround
Use the following steps to migrate to the replacement metrics:
In the Google Cloud console, select Monitoring or click the
following button: Go to Monitoring
In the navigation pane, select
Dashboards, and delete the Anthos cluster node status
dashboard.
Click the Sample library tab and reinstall the Anthos
cluster node status dashboard.
stackdriver-log-forwarder has CrashloopBackOff
errors
In some situations, the fluent-bit logging agent can get
stuck processing corrupt chunks. When the logging agent is unable to bypass
corrupt chunks, you may observe that
stackdriver-log-forwarder keeps crashing with a
CrashloopBackOff error. If you have this problem, your
logs have entries like the following:
[2022/03/09 02:18:44] [engine] caught signal (SIGSEGV) #0 0x5590aa24bdd5
in validate_insert_id() at plugins/out_stackdriver/stackdriver.c:1232
#1 0x5590aa24c502 in stackdriver_format() at plugins/out_stackdriver/stackdriver.c:1523
#2 0x5590aa24e509 in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2105
#3 0x5590aa19c0de in output_pre_cb_flush() at include/fluent-bit/flb_output.h:490
#4 0x5590aa6889a6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117 #5 0xffffffffffffffff in ???() at ???:0
Workaround:
Clean up the buffer chunks for the Stackdriver Log Forwarder.
Note: In the following commands, replace
KUBECONFIG with the path to the admin
cluster kubeconfig file.
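The cleanup generally involves temporarily stopping the forwarder DaemonSet, removing the buffered chunk files on each node, and then restoring the DaemonSet. The following is a hedged sketch only; the nodeSelector patch approach and the buffer path are assumptions:
kubectl --kubeconfig KUBECONFIG -n kube-system patch daemonset stackdriver-log-forwarder -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
On each node, remove the buffered chunk files:
rm -rf /var/log/fluent-bit-buffers/
Restore the DaemonSet by removing the temporary nodeSelector:
kubectl --kubeconfig KUBECONFIG -n kube-system patch daemonset stackdriver-log-forwarder --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'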
gke-metrics-agent is a DaemonSet that collects metrics on each
node and forwards them to Cloud Monitoring. It might produce logs such as
the following:
These error logs can be safely ignored as the metrics they refer to are
not supported and not critical for monitoring purposes.
Logging and monitoring
1.10, 1.11
Intermittent metrics export interruptions
Clusters might experience interruptions in normal,
continuous exporting of metrics, or missing metrics on some nodes. If this
issue affects your clusters, you may see gaps in data for the following
metrics (at a minimum):
The command finds cpu: 50m if your edits have taken
effect.
Networking
1.10
Multiple default gateways breaks connectivity to external endpoints
Having multiple default gateways in a node can lead to broken
connectivity from within a Pod to external endpoints, such as
google.com.
To determine if you're affected by this issue, run the following
command on the node:
ip route show
Multiple instances of default in the response indicate
that you're affected.
Networking
1.12
Networking custom resource edits on user clusters get overwritten
Version 1.12.x clusters don't prevent you from manually
editing networking
custom resources
in your user cluster. Google Distributed Cloud reconciles custom resources
in the user clusters with the custom resources in your admin cluster
during cluster upgrades. This reconciliation overwrites any edits made
directly to the networking custom resources in the user cluster. The
networking custom resources should be modified in the admin cluster only,
but version 1.12.x clusters don't enforce this requirement.
You edit these custom resources in your admin cluster and the
reconciliation step applies the changes to your user clusters.
Workaround
If you've modified any of the previously mentioned custom resources on
a user cluster, modify the corresponding custom resources on your admin
cluster to match before upgrading. This step ensures that your
configuration changes are preserved. Cluster versions
1.13.0 and higher prevent you from modifying the networking custom
resources on your user clusters directly.
Networking
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.28
Pod connectivity failures and reverse path filtering
Google Distributed Cloud configures reverse path filtering on nodes to
disable source validation (net.ipv4.conf.all.rp_filter=0).
If the rp_filter setting is changed to 1 or
2, pods will fail due to out-of-node communication
timeouts.
Reverse path filtering is set with rp_filter files in the
IPv4 configuration folder (net/ipv4/conf/all). This value may
also be overridden by sysctl, which stores reverse path
filtering settings in a network security configuration file, such as
/etc/sysctl.d/60-gce-network-security.conf.
Workaround
Pod connectivity can be restored by performing either of the following
workarounds:
Set the value for net.ipv4.conf.all.rp_filter back to
0 manually, and then run sudo sysctl -p to apply
the change.
Or
Restart the anetd Pod to set
net.ipv4.conf.all.rp_filter back to 0. To
restart the anetd Pod, use the following commands to locate
and delete the anetd Pod and a new anetd Pod
will start up in its place:
After performing either of the workarounds verify that the
net.ipv4.conf.all.rp_filter value is set to 0 by
running sysctl net.ipv4.conf.all.rp_filter on each node.
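A sketch of the first option and the verification; the k8s-app=cilium label used to locate the anetd Pods in the second option is an assumption:
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
kubectl --kubeconfig KUBECONFIG -n kube-system delete pods -l k8s-app=cilium
sysctl net.ipv4.conf.all.rp_filter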
Bootstrap (kind) cluster IP addresses and cluster node IP addresses
overlapping
192.168.122.0/24 and 10.96.0.0/27 are the
default pod and service CIDRs used by the bootstrap (kind) cluster.
Preflight checks will fail if they overlap with cluster node machine IP
addresses.
Workaround
To avoid the conflict, you can pass the
--bootstrap-cluster-pod-cidr and
--bootstrap-cluster-service-cidr flags to bmctl
to specify different values.
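For example, you might pass hypothetical non-overlapping values like the following:
bmctl create cluster -c CLUSTER_NAME --bootstrap-cluster-pod-cidr 192.168.123.0/24 --bootstrap-cluster-service-cidr 10.97.0.0/27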
Operating system
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.28
Cluster creation or upgrade fails on CentOS
In December 2020, the CentOS community and Red Hat announced the
sunset
of CentOS. On January 31, 2022, CentOS 8 reached its end of life
(EOL). As a result of the EOL, yum repositories stopped
working for CentOS, which causes cluster creation and cluster upgrade
operations to fail. This applies to all supported versions of CentOS and
affects all versions of clusters.
Workaround
View workaround steps
As a workaround, run the following commands to have your CentOS use an
archive feed:
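A commonly used sketch of such a change points yum at the CentOS vault archive; the exact repository file names and URLs can vary by CentOS release:
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*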
Container can't write to VOLUME defined in Dockerfile
with containerd and SELinux
If you use containerd as the container runtime and your operating
system has SELinux enabled, the VOLUME defined in the
application Dockerfile might not be writable. For example, containers
built with the following Dockerfile aren't able to write to the
/tmp folder.
FROM ubuntu:20.04
RUN chmod -R 777 /tmp
VOLUME /tmp
To verify if you're affected by this issue, run the following command
on the node that hosts the problematic container:
ausearch -m avc
If you're affected by this issue, you see a denied error
like the following:
To work around this issue, make either of the following changes:
Turn off SELinux.
Don't use the VOLUME feature inside Dockerfile.
Upgrades and updates
1.10, 1.11, 1.12
Node Problem Detector isn't enabled by default after cluster upgrades
When you upgrade clusters, Node Problem Detector isn't enabled
by default. This issue is applicable for upgrades in release 1.10 to
1.12.1 and has been fixed in release 1.12.2.
Workaround:
To enable the Node Problem Detector:
Verify if node-problem-detector systemd service is
running on the node.
Use the SSH command and connect to the node.
Check if node-problem-detector systemd service is
running on the node:
systemctl is-active node-problem-detector
If the command result displays inactive, then the node-problem-detector isn't running on the node.
To enable the Node Problem Detector, use the
kubectl edit command and edit the
node-problem-detector-config ConfigMap. For more
information, see
Node Problem
Detector.
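For example (a sketch; the kube-system namespace for the ConfigMap is an assumption):
kubectl --kubeconfig KUBECONFIG edit configmap node-problem-detector-config -n kube-system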
Operation
1.9, 1.10
Cluster backup fails when using non-root login
The bmctl backup cluster command fails if
nodeAccess.loginUser is set to a non-root username.
Workaround:
This issue applies to versions 1.9.x, 1.10.0, and 1.10.1
and is fixed in version 1.10.2 and later.
Networking
1.10, 1.11, 1.12
Load Balancer Services don't work with containers on the control plane
host network
There is a bug in anetd where packets are dropped for
LoadBalancer Services if the backend pods are both running on the control
plane node and are using the hostNetwork: true field in the
container's spec.
The bug isn't present in version 1.13 or later.
Workaround:
The following workarounds can help if you use a LoadBalancer Service
that is backed by hostNetwork Pods:
Run them on worker nodes (not control plane nodes).
Orphaned anthos-version-$version$ pod failing to pull image
Clusters upgrading from 1.12.x to 1.13.x might have a failing
anthos-version-$version$ pod with an ImagePullBackOff error.
This happens because of a race condition while anthos-cluster-operator is
upgraded, and it shouldn't affect any regular cluster capabilities.
The bug isn't present in versions after 1.13.
Workaround:
Delete the dynamic-version-installer Job by running the following command:
kubectl delete job anthos-version-$version$ -n kube-system
Upgrades and updates
1.13
1.12 clusters upgraded from 1.11 can't upgrade to 1.13.0
Version 1.12 clusters that were upgraded from version 1.11 can't be
upgraded to version 1.13.0. This upgrade issue doesn't apply to clusters
that were created at version 1.12.
To determine if you're affected, check the logs of the upgrade job that
contains the upgrade-first-no* string in the admin cluster.
If you see the following error message, you're affected.
There's an issue in stackdriver-operator that causes it to
consume higher CPU time than normal. Normal CPU usage is less than 50
milliCPU (50m) for stackdriver-operator in idle
state. The cause is a mismatch between the Certificate resources that
stackdriver-operator applies and what
cert-manager expects. This mismatch causes a race condition between
cert-manager and stackdriver-operator when
updating those resources.
This issue may result in reduced performance on clusters with limited
CPU availability.
Workaround:
Until you can upgrade to a version that fixed this bug, use
the following workaround:
To temporarily scale down stackdriver-operator to 0
replicas, apply an AddonConfiguration custom resource:
In the Google Distributed Cloud 1.16 minor release, the
enableStackdriverForApplications field in the
stackdriver custom resource spec is deprecated. This field is
replaced by two fields, enableCloudLoggingForApplications and
enableGMPForApplications, in the stackdriver custom resource.
We recommend that you use Google Cloud Managed Service for Prometheus for monitoring
your workloads. Use the enableGMPForApplications field to
enable this feature.
If you rely on metrics collection triggered by
prometheus.io/scrape annotations on your
workloads, you can use the annotationBasedApplicationMetrics
feature gate flag to keep the old behavior. However, there is an issue that
prevents the annotationBasedApplicationMetrics feature gate from working
properly, which prevents metrics from your applications from being collected
into Cloud Monitoring.
Workaround:
To resolve this issue, upgrade your cluster to version 1.16.2 or higher.
The annotation-based workload metrics collection enabled by the
annotationBasedApplicationMetrics feature gate collects
metrics for objects that have the prometheus.io/scrape
annotation. Many software systems with open source origin may use this
annotation. If you continue using this method of metrics
collection, be aware of this dependency so that you aren't surprised by
metrics charges in Cloud Monitoring.
Cloud audit logging failure due to permission denied
Cloud Audit Logs needs a special permission setup that is
automatically performed by cluster-operator through GKE Hub.
However, in cases where one admin cluster manages multiple clusters with
different project IDs, a bug in cluster-operator can cause the same service
account to be appended to the allowlist repeatedly, and the allowlisting
request then fails due to a size limitation. As a result, audit logs from some
or all of these clusters fail to be ingested into Google Cloud.
The symptom is a series of Permission Denied errors in the
audit-proxy Pod in the affected cluster.
Another symptom is an error status and a long list of duplicated
service accounts when you check the Cloud Audit Logs allowlist through GKE
Hub:
To resolve the issue, upgrade your cluster to at least version 1.28.1000,
1.29.500, or 1.30.200, where the issue is fixed. Alternatively, you can apply the
following workaround:
View workaround steps
Delete and recreate the cloud audit logging GKE Hub feature to force
trigger the allowlisting automation again.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2024-12-19 UTC."],[],[]]