Version 1.11. This version is no longer supported. For more information, see the version support policy. For information about how to upgrade to version 1.12, see Upgrading Anthos on bare metal in the 1.12 documentation.
This page lists all known issues for Anthos clusters on bare metal.
Category
Identified version(s)
Issue and workaround
Configuration, Installation, Upgrades and updates, Operation
1.14, 1.15, 1.16
Configuration issues when interchangeably using bmctl
update and kubectl apply
If the bmctl update and kubectl apply -f
commands are used interchangeably, contents populated by the header of the
cluster spec might be unintentionally removed.
For example, you might use bmctl update to update the
cluster configuration file to configure a registry mirror. The cluster
then has spec.nodeConfig.registryMirrors populated after
reconciliation.
If you then attempt to make changes to the cluster using the cluster
configuration with the kubectl apply -f command, the
spec.nodeConfig.registryMirrors populated by the reconciler
on the cluster is removed. This behavior happens because the configuration
doesn't exist in the cluster configuration file, but it does in the
cluster itself.
Workaround:
Don't use bmctl update and kubectl apply -f
interchangeably when editing the cluster configuration file.
In the previous registry mirror example, edit the cluster YAML
configuration directly instead of editing
spec.anthosBareMetalVersion in the cluster configuration
file.
Operation
1.13, 1.14, 1.15, 1.16
Clean up stale Pods in the cluster
You might see Pods managed by a Deployment (ReplicaSet) in a
Failed state and with the status of
TaintToleration. These Pods don't use cluster resources, but
should be deleted.
You can use the following kubectl command to list the
Pods that you can clean up:
kubectl get pods -A | grep TaintToleration
The following example output shows a Pod with the
TaintToleration status:
For each Pod with the described symptoms, check the ReplicaSet that the
Pod belongs to. If the ReplicaSet is satisfied, you can delete the Pods:
Get the ReplicaSet that manages the Pod and find the
ownerRef.Kind value:
kubectl get pod POD_NAME -n NAMESPACE -o yaml
Get the ReplicaSet and verify that the status.replicas
is the same as spec.replicas:
kubectl get replicaset REPLICA_NAME -n NAMESPACE -o yaml
If the replica counts match, delete the Pod:
kubectl delete pod POD_NAME -n NAMESPACE
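The listing step above can be narrowed to just the namespace and name of each candidate Pod. The following is a minimal sketch, assuming the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name list_taint_toleration_pods is hypothetical and isn't part of any Anthos tooling:

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# "NAMESPACE POD_NAME" for every Pod whose STATUS column is TaintToleration.
# Assumes STATUS is the fourth whitespace-separated column.
list_taint_toleration_pods() {
  awk '$4 == "TaintToleration" { print $1, $2 }'
}

# Example usage (requires access to the cluster):
#   kubectl get pods -A | list_taint_toleration_pods
```

The sketch only lists candidates. Check the owning ReplicaSet as described above before you delete any Pod.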
Upgrades
1.16.0
etcd-events can stall when upgrading to version 1.16.0
When you upgrade an existing cluster to version 1.16.0, Pod failures
related to etcd-events can stall the operation. Specifically,
the upgrade-node job fails for the
TASK [etcd_events_install : Run etcdevents] step.
If you're affected by this issue, you see Pod failures like the
following:
The kube-apiserver Pod fails to start with the
following error:
These errors indicate eviction events or an inability to schedule Pods
due to node resources. As Anthos Network Gateway Pods have no
PriorityClass, they have the same default priority as other workloads.
When nodes are resource-constrained, the network gateway Pods might be
evicted. This behavior is particularly bad for the ang-node
DaemonSet, as those Pods must be scheduled on a specific node and can't
migrate.
Workaround:
Upgrade to 1.15 or later.
As a short-term fix, you can manually assign a
PriorityClass
to the Anthos Network Gateway components. The Anthos clusters on bare metal controller
overwrites these manual changes during a reconciliation process, such as
during a cluster upgrade.
Assign the system-cluster-critical PriorityClass to the
ang-controller-manager and autoscaler cluster
controller Deployments.
Assign the system-node-critical PriorityClass to the
ang-daemon node DaemonSet.
Installation, Upgrades and updates
1.15.0, 1.15.1, 1.15.2
Cluster creation and upgrades fail due to cluster name length
Creating version 1.15.0, 1.15.1, or 1.15.2 clusters or upgrading
clusters to version 1.15.0, 1.15.1, or 1.15.2 fails when the cluster name
is longer than 48 characters (version 1.15.0) or 45 characters (version
1.15.1 or 1.15.2). During cluster creation and upgrade operations,
Anthos clusters on bare metal creates a health check resource with a name that
incorporates the cluster name and version:
For version 1.15.0 clusters, the health check resource name is
CLUSTER_NAME-add-ons-CLUSTER_VER.
For version 1.15.1 or 1.15.2 clusters, the health check resource name is
CLUSTER_NAME-kubernetes-CLUSTER_VER.
For long cluster names, the health check resource name exceeds the
Kubernetes 63 character length restriction for label
names, which prevents the creation of the health check resource.
Without a successful health check, the cluster operation fails.
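The limits of 48 and 45 characters follow from the naming patterns: the suffix -add-ons-1.15.0 adds 15 characters and the suffix -kubernetes-1.15.1 adds 18, and the total can't exceed 63. The following is a minimal sketch that checks a cluster name before you create or upgrade a cluster. The function names are hypothetical, and the sketch covers only the three affected versions:

```shell
# Hypothetical helper: prints the health check resource name for a cluster
# name ($1) and version ($2), using the naming patterns described above.
healthcheck_name() {
  case "$2" in
    1.15.0)        printf '%s-add-ons-%s\n' "$1" "$2" ;;
    1.15.1|1.15.2) printf '%s-kubernetes-%s\n' "$1" "$2" ;;
    *)             echo "version not covered by this sketch: $2" >&2; return 2 ;;
  esac
}

# Succeeds when the generated name fits within the 63-character limit.
healthcheck_name_fits() {
  hc_name="$(healthcheck_name "$1" "$2")" || return 2
  [ "${#hc_name}" -le 63 ]
}

# Example usage:
#   healthcheck_name_fits CLUSTER_NAME 1.15.1 || echo "cluster name is too long"
```

For example, the 46-character cluster name db-uat-mfd7-fic-hybrid-cloud-uk-wdc-cluster-02 produces a 64-character resource name for version 1.15.1, so the check fails.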
To see if you are affected by this issue, use kubectl describe
to check the failing resource:
If this issue is affecting you, the response contains a warning for a
ReconcileError like the following:
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReconcileError 77s (x15 over 2m39s) healthcheck-controller Reconcile error, retrying: 1 error occurred:
* failed to create job for health check
db-uat-mfd7-fic-hybrid-cloud-uk-wdc-cluster-02-kubernetes-1.15.1: Job.batch
"bm-system-db-uat-mfd7-fic-hybrid-cloud-u24d5f180362cffa4a743" is invalid: [metadata.labels: Invalid
value: "db-uat-mfd7-fic-hybrid-cloud-uk-wdc-cluster-02-kubernetes-1.15.1": must be no more than 63
characters, spec.template.labels: Invalid value:
"db-uat-mfd7-fic-hybrid-cloud-uk-wdc-cluster-02-kubernetes-1.15.1": must be no more than 63 characters]
Workaround
To unblock the cluster upgrade or creation, you can bypass the
healthcheck. Use the following command to patch the healthcheck custom
resource with passing status: (status: {pass: true})
Version 1.14.0 and 1.14.1 clusters with preview features can't upgrade to version 1.15.x
If version 1.14.0 and 1.14.1 clusters have a preview feature enabled,
they're blocked from successfully upgrading to version 1.15.x. This
applies to preview features like the ability to create a cluster without
kube-proxy, which is enabled with the following annotation in the cluster
configuration file:
If you're affected by this issue, you get an error like the following during the
cluster upgrade:
[2023-06-20 23:37:47+0000] error judging if the cluster is managing itself:
error to parse the target cluster: error parsing cluster config: 1 error
occurred:
Cluster.baremetal.cluster.gke.io "$cluster-name" is invalid:
Annotations[preview.baremetal.cluster.gke.io/$preview-feature-name]:
Forbidden: preview.baremetal.cluster.gke.io/$preview-feature-name feature
isn't supported in 1.15.1 Anthos Bare Metal version
This issue is fixed in Anthos clusters on bare metal version 1.14.2 and higher.
Workaround:
If you're unable to upgrade your clusters to version 1.14.2 or higher
before upgrading to version 1.15.x, you can upgrade to version 1.15.x
directly by using a bootstrap cluster:
bmctl upgrade cluster --use-bootstrap=true
Operation
1.15
Version 1.15 clusters don't accept duplicate floating IP addresses
Anthos Network Gateway doesn't let you create new NetworkGatewayGroup
custom resources that contain IP addresses in spec.floatingIPs
that are already used in existing NetworkGatewayGroup custom
resources. This rule is enforced by a webhook in Anthos clusters on bare metal
version 1.15.0 and higher. Pre-existing duplicate floating IP addresses
don't cause errors. The webhook only prevents the creation of new
NetworkGatewayGroups custom resources that contain duplicate
IP addresses.
The webhook error message identifies the conflicting IP address and the
existing custom resource that is already using it:
IP address exists in other gateway with name default
The initial documentation for advanced networking features, such as the
Egress NAT gateway, doesn't caution against duplicate IP addresses.
Initially, only the NetworkGatewayGroup resource named
default was recognized by the reconciler. Anthos Network
Gateway now recognizes all NetworkGatewayGroup custom
resources in the system namespace. Existing NetworkGatewayGroup
custom resources are honored, as is.
Workaround:
Errors happen for the creation of a new NetworkGatewayGroup
custom resource only.
To address the error:
Use the following command to list NetworkGatewayGroups
custom resources:
kubectl get NetworkGatewayGroups --kubeconfig ADMIN_KUBECONFIG \
-n kube-system -o yaml
Open existing NetworkGatewayGroup custom resources and
remove any conflicting floating IP addresses (spec.floatingIPs):
To apply your changes, close and save edited custom resources.
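To find conflicting addresses before you create a new NetworkGatewayGroup, you can list every floating IP address in use and look for repeats. The following is a minimal sketch, assuming that spec.floatingIPs is a list of plain IP address strings. The function name find_duplicate_ips is hypothetical:

```shell
# Hypothetical helper: reads one IP address per line on stdin and prints each
# address that appears more than once.
find_duplicate_ips() {
  sort | uniq -d
}

# Example usage (requires access to the cluster). The jsonpath query prints
# all floating IP addresses separated by spaces, and tr puts one on each line:
#   kubectl get NetworkGatewayGroups --kubeconfig ADMIN_KUBECONFIG \
#     -n kube-system -o jsonpath='{.items[*].spec.floatingIPs[*]}' \
#     | tr ' ' '\n' | find_duplicate_ips
```

Empty output means that no address is listed in more than one place.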
Anthos VM Runtime
1.13.7
VMs might not start on 1.13.7 clusters that use a private registry
When you enable Anthos VM Runtime on a new or upgraded version
1.13.7 cluster that uses a private registry, VMs that connect to the node
network or use a GPU might not start properly. This issue is due to some
system Pods in the vm-system namespace getting image pull
errors. For example, if your VM uses the node network, some Pods might
report image pull errors like the following:
macvtap-4x9zp 0/1 Init:ImagePullBackOff 0 70m
This issue is fixed in Anthos clusters on bare metal version 1.14.0 and higher.
Workaround
If you're unable to upgrade your clusters immediately, you can pull
images manually. The following commands pull the macvtap CNI plugin image
for your VM and push it to your private registry:
Replace REG_HOST with the domain name of a host that you mirror locally.
Installation
1.11, 1.12
During cluster creation in the kind cluster, the gke-metrics-agent pod fails to start
During cluster creation in the kind cluster, the gke-metrics-agent pod
fails to start because of an image pulling error as follows:
error="failed to pull and unpack image \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": failed to resolve reference \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": pulling from host gcr.io failed with status code [manifests 1.8.3-anthos.2]: 403 Forbidden"
Also, in the bootstrap cluster's containerd log, you will see the following entry:
Sep 13 23:54:20 bmctl-control-plane containerd[198]: time="2022-09-13T23:54:20.378172743Z" level=info msg="PullImage \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\" "
Sep 13 23:54:21 bmctl-control-plane containerd[198]: time="2022-09-13T23:54:21.057247258Z" level=error msg="PullImage \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\" failed" error="failed to pull and unpack image \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": failed to resolve reference \"gcr.io/gke-on-prem-staging/gke-metrics-agent:1.8.3-anthos.2\": pulling from host gcr.io failed with status code [manifests 1.8.3-anthos.2]: 403 Forbidden"
You will see the following "failing to pull" error in the pod:
gcr.io/gke-on-prem-staging/gke-metrics-agent
Workaround
Despite the errors, the cluster creation process isn't blocked, because the
gke-metrics-agent pod in the kind cluster is used only to help improve the
cluster creation success rate and for internal tracking and monitoring.
You can ignore this error.
Operation, Networking
1.12, 1.13, 1.14, 1.15, 1.16
Accessing an IPv6 Service endpoint crashes the LoadBalancer Node on
CentOS or RHEL
When you access a dual-stack Service (a Service that has both IPv4 and
IPv6 endpoints) and use the IPv6 endpoint, the LoadBalancer Node that
serves the Service might crash. This issue affects customers that use
dual-stack services with CentOS or RHEL and kernel version earlier than
kernel-4.18.0-372.46.1.el8_6.
If you believe that this issue affects you, check the kernel version on
the LoadBalancer Node using the uname -a command.
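The version comparison can be scripted. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available on the Node. The function name kernel_is_older_than is hypothetical:

```shell
# Hypothetical helper: succeeds when kernel release $1 sorts strictly before
# the minimum release $2 in version order (relies on GNU `sort -V`).
kernel_is_older_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage on the LoadBalancer Node. The architecture suffix that
# `uname -r` appends (for example, .x86_64) is stripped before comparing:
#   release="$(uname -r)"
#   release="${release%.$(uname -m)}"
#   kernel_is_older_than "$release" 4.18.0-372.46.1.el8_6 && echo "kernel update needed"
```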
Workaround:
Update the LoadBalancer Node to kernel version
kernel-4.18.0-372.46.1.el8_6 or later. This kernel version is
available by default in CentOS and RHEL version 8.6 and later.
Networking
1.11, 1.12, 1.13, 1.14.0
Intermittent connectivity issues after Node reboot
After you restart a Node, you might see intermittent connectivity
issues for a NodePort or LoadBalancer Service. For example, you might have
intermittent TLS handshake or connection reset errors. This issue is
fixed for Anthos clusters on bare metal versions 1.14.1 and higher.
To check if this issue affects you, look at the iptables
forward rules on Nodes where the backend Pod for the affected Service is
running:
sudo iptables -L FORWARD
If you see the KUBE-FORWARD rule before the
CILIUM_FORWARD rule in iptables, you might be
affected by this issue. The following example output shows a Node where
the problem exists:
Chain FORWARD (policy ACCEPT)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
CILIUM_FORWARD all -- anywhere anywhere /* cilium-feeder: CILIUM_FORWARD */
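If you need to check many Nodes, the rule order can be tested with a short script. The following is a minimal sketch that reads the iptables listing from standard input. The function name forward_rules_misordered is hypothetical:

```shell
# Hypothetical helper: reads `iptables -L FORWARD` output on stdin.
# Succeeds (exit status 0) when the first KUBE-FORWARD rule is listed before
# the first CILIUM_FORWARD rule, which is the ordering described above as a
# sign of the problem. Fails (exit status 1) otherwise.
forward_rules_misordered() {
  awk '
    $1 == "KUBE-FORWARD"   && !kube   { kube = NR }
    $1 == "CILIUM_FORWARD" && !cilium { cilium = NR }
    END { exit !(kube && cilium && kube < cilium) }
  '
}

# Example usage on a Node:
#   sudo iptables -L FORWARD | forward_rules_misordered && echo "possibly affected"
```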
Workaround:
Restart the anetd Pod on the Node that's misconfigured. After you
restart the anetd Pod, the forwarding rule in iptables should
be configured correctly.
The following example output shows that the CILIUM_FORWARD
rule is now correctly configured before the KUBE-FORWARD
rule:
Chain FORWARD (policy ACCEPT)
target prot opt source destination
CILIUM_FORWARD all -- anywhere anywhere /* cilium-feeder: CILIUM_FORWARD */
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
Upgrades and updates
1.9, 1.10
The preview feature does not retain the original permission and owner
information
The preview feature of a 1.9.x cluster using bmctl 1.9.x doesn't retain the
original permission and owner information. To check whether you're affected by this
issue, extract the backed-up file using the following command:
tar -xzvf BACKUP_FILE
Workaround
Check whether the metadata.json file is present and whether its bmctlVersion is 1.9.x. If the metadata.json file isn't present, upgrade the cluster to 1.10.x and use bmctl 1.10.x to back up and restore.
Upgrades and creates
1.14.2
clientconfig-operator stuck in pending state with CreateContainerConfigError
If you've upgraded to or created a version 1.14.2 cluster with an OIDC/LDAP
configuration, you may see the clientconfig-operator Pod stuck
in a pending state. With this issue, there are two
clientconfig-operator Pods, with one in a running state and the
other in a pending state.
This issue applies to Anthos clusters on bare metal version 1.14.2 only. Earlier
versions such as 1.14.0 and 1.14.1 aren't affected. This issue is fixed in
version 1.14.3 and all subsequent releases, including 1.15.0 and later.
Workaround:
As a workaround, you can patch the clientconfig-operator
deployment to add additional security context and ensure that the deployment
is ready.
Use the following command to patch clientconfig-operator in the
target cluster:
CLUSTER_KUBECONFIG: the path of the kubeconfig
file for the target cluster.
Operation
1.11, 1.12, 1.13, 1.14, 1.15
Certificate authority rotation fails for clusters without bundled load balancing
For clusters without bundled load balancing (spec.loadBalancer.mode set to manual), the bmctl update credentials certificate-authorities rotate
command can become unresponsive and fail with the following error: x509: certificate signed by unknown authority.
If you're affected by this issue, the bmctl command might
output the following message before becoming unresponsive:
Signing CA completed in 3/0 control-plane nodes
In this case, the command eventually fails. The rotate certificate-authority
log for a cluster with three control planes may include entries like the
following:
[2023-06-14 22:33:17+0000] waiting for all nodes to trust CA bundle OK
[2023-06-14 22:41:27+0000] waiting for first round of pod restart to complete OK
Signing CA completed in 0/0 control-plane nodes
Signing CA completed in 1/0 control-plane nodes
Signing CA completed in 2/0 control-plane nodes
Signing CA completed in 3/0 control-plane nodes
...
Unable to connect to the server: x509: certificate signed by unknown
authority (possibly because of "crypto/rsa: verification error" while
trying to verify candidate authority certificate "kubernetes")
Workaround
If you need additional assistance, contact
Google Support.
Installation, Networking
1.11, 1.12, 1.13, 1.14.0-1.14.1
ipam-controller-manager crashloops in dual-stack
clusters
When you deploy a dual-stack cluster (a cluster with both IPv4 and IPv6
addresses), the ipam-controller-manager Pod(s) might
crashloop. This behavior causes the Nodes to cycle between
Ready and NotReady states, and might cause the
cluster installation to fail. This problem can occur when the API server
is under high load.
To see if this issue affects you, check if the
ipam-controller-manager Pod(s) are failing with
CrashLoopBackOff errors:
kubectl -n kube-system get pods | grep ipam-controller-manager
The following example output shows Pods in a
CrashLoopBackOff state:
1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, 1.13, and 1.14
etcd watch starvation
Clusters running etcd version 3.4.13 or earlier may experience watch
starvation and non-operational resource watches, which can lead to the
following problems:
Pod scheduling is disrupted
Nodes are unable to register
kubelet doesn't observe pod changes
These problems can make the cluster non-functional.
This issue is fixed in Anthos clusters on bare metal releases 1.12.9, 1.13.6,
1.14.3, and subsequent releases. These newer releases use etcd version
3.4.21. All prior versions of Anthos clusters on bare metal are affected by
this issue.
Workaround
If you can't upgrade immediately, you can mitigate the risk of
cluster failure by reducing the number of nodes in your cluster. Remove
nodes until the etcd_network_client_grpc_sent_bytes_total
metric is less than 300 MBps.
To view this metric in Metrics Explorer:
Go to the Metrics Explorer in the Google Cloud console:
Expand the Select a metric menu, enter Kubernetes Container
in the filter bar, and then use the submenus to select the metric:
In the Active resources menu, select Kubernetes Container.
In the Active metric categories menu, select Anthos.
In the Active metrics menu, select etcd_network_client_grpc_sent_bytes_total.
Click Apply.
Networking
1.11.6, 1.12.3
SR-IOV operator's vfio-pci mode "Failed" state
The SriovNetworkNodeState object's syncStatus
can report the "Failed" value for a configured node. To view the status of
a node and determine if the problem affects you, run the following
command:
kubectl -n gke-operators get \
sriovnetworknodestates.sriovnetwork.k8s.cni.cncf.io NODE_NAME \
-o jsonpath='{.status.syncStatus}'
Replace NODE_NAME with the name of the node to
check.
Workaround:
If the SriovNetworkNodeState object status is "Failed",
update to Anthos clusters on bare metal version 1.11.7 or later or version 1.12.4 or
later.
Upgrades and updates
1.10, 1.11, 1.12, 1.13, 1.14.0, 1.14.1
Some worker nodes aren't in a Ready state after upgrade
After the upgrade finishes, some worker nodes might have their Ready condition set to false. On the Node resource, you see an error next to the Ready condition similar to the following example:
container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
When you log into the stalled machine, the CNI configuration on the machine is empty:
sudo ls /etc/cni/net.d/
Workaround
Restart the node's anetd pod by deleting it.
Upgrades and updates, Security
1.10
Multiple certificate rotations from cert-manager result in inconsistency
After multiple manual or automatic certificate rotations, the webhook pod,
such as anthos-cluster-operator, isn't updated with the new
certificates issued by cert-manager. Any update to the cluster
custom resource fails and results in an error similar to the following:
Internal error occurred: failed calling
webhook "vcluster.kb.io": failed to call webhook: Post "https://webhook-service.kube-system.svc:443/validate-baremetal-cluster-gke-io-v1-cluster?timeout=10s": x509: certificate signed by unknown authority (possibly because of "x509:
invalid signature: parent certificate cannot sign this kind of certificate"
while trying to verify candidate authority certificate
"webhook-service.kube-system.svc")
This issue might occur in the following circumstances:
If you have done two manual cert-manager issued certificate rotations
on a cluster that is 180 days old or older and never restarted the anthos-cluster-operator.
If you have done a manual cert-manager issued
certificate rotation on a cluster that is 90 days old or older and never
restarted the anthos-cluster-operator.
Workaround
Restart the pod by terminating the anthos-cluster-operator.
Upgrades and updates
1.14.0
Outdated lifecycle controller deployer pods created during user cluster upgrade
In version 1.14.0 admin clusters, one or more outdated lifecycle
controller deployer pods might be created during user cluster upgrades.
This issue applies for user clusters that were initially created at
versions lower than 1.12. The unintentionally created pods don't impede
upgrade operations, but they might be found in an abnormal state. We
recommend that you remove the outdated pods.
This issue is fixed in release 1.14.1.
Workaround:
To remove the outdated lifecycle controller deployer pods:
List preflight check resources:
kubectl get preflightchecks --kubeconfig ADMIN_KUBECONFIG -A
The output looks like this:
NAMESPACE NAME PASS AGE
cluster-ci-87a021b9dcbb31c ci-87a021b9dcbb31c true 20d
cluster-ci-87a021b9dcbb31c ci-87a021b9dcbb31cd6jv6 false 20d
where ci-87a021b9dcbb31c is the cluster name.
Delete resources whose value in the PASS column is either true or false.
For example, to delete the resources in the preceding sample output,
use the following commands:
BGPSession state constantly changing due to large number
of incoming routes
Anthos clusters on bare metal advanced networking fails to manage BGP sessions
correctly when external peers advertise a high number of routes (about 100
or more). With a large number of incoming routes, the node-local BGP
controller takes too long to reconcile BGP sessions and fails to update the
status. The lack of status updates, or a health check, causes the session
to be deleted for being stale.
Symptoms on BGP sessions that might indicate this problem include the
following:
Continuous bgpsession deletion and recreation.
bgpsession.status.state never becomes
Established
Routes failing to advertise or being repeatedly advertised and
withdrawn.
BGP load balancing problems might be noticeable with connectivity
issues to LoadBalancer services.
BGP FlatIP issues might be noticeable with connectivity
issues to Pods.
To determine if your BGP issues are caused by the remote peers
advertising too many routes, use the following commands to review the
associated statuses and output:
Use kubectl get bgpsessions on the affected cluster.
The output shows bgpsessions with state "Not Established"
and the last report time continuously counts up to about 10-12 seconds
before it appears to reset to zero.
The output of kubectl get bgpsessions shows that the
affected sessions are being repeatedly recreated:
kubectl get bgpsessions \
-o jsonpath="{.items[*]['metadata.name', 'metadata.creationTimestamp']}"
Log messages indicate that stale BGP sessions are being deleted:
kubectl logs ang-controller-manager-POD_NUMBER
Replace POD_NUMBER with the leader
pod in your cluster.
Workaround:
Reduce or eliminate the number of routes advertised from the remote
peer to the cluster with an export policy.
In Anthos clusters on bare metal version 1.14.2 and later, you can also disable the
feature that processes received routes by using an
AddOnConfiguration. Add the
--disable-received-routes argument to the ang-daemon
daemonset's bgpd container.
Networking
1.14, 1.15, 1.16
Application timeouts caused by conntrack table
insertion failures
Clusters running on an Ubuntu OS that uses kernel 5.15 or
higher are susceptible to netfilter connection tracking (conntrack) table
insertion failures. Insertion failures can occur even when the conntrack
table has room for new entries. The failures are caused by changes in
kernel 5.15 and higher that restrict table insertions based on chain
length.
To see if you are affected by this issue, you can check the in-kernel
connection tracking system statistics with the following command:
If a chaintoolong value in the response is a non-zero
number, you're affected by this issue.
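The check for non-zero chaintoolong values can be scripted. The following is a minimal sketch that totals the counters from statistics text read on standard input. It assumes that the statistics are printed as key=value fields, such as chaintoolong=0, with one line per CPU. The function name sum_chaintoolong is hypothetical, and the conntrack -S command in the usage comment is an assumption about which command produces the statistics:

```shell
# Hypothetical helper: reads connection tracking statistics on stdin and
# prints the sum of all chaintoolong counters. Prints 0 when none are found.
sum_chaintoolong() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^chaintoolong=/) { split($i, kv, "="); total += kv[2] }
  }
  END { print total + 0 }'
}

# Example usage on a cluster node (assumes the conntrack tool is installed):
#   sudo conntrack -S | sum_chaintoolong
```

A result greater than zero corresponds to the non-zero chaintoolong condition described above.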
Workaround
The short-term mitigation is to increase the size of both the netfilter
hash table (nf_conntrack_buckets) and the netfilter
connection tracking table (nf_conntrack_max). Use the
following commands on each cluster node to increase the size of the
tables:
Replace TABLE_SIZE with new size in bytes. The
default table size value is 262144. We suggest that you set a
value equal to 65,536 times the number of cores on the node. For example,
if your node has eight cores, set the table size to 524288.
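The suggested value is simple arithmetic: 65,536 multiplied by the core count. The following is a minimal sketch of that calculation. The function name conntrack_table_size is hypothetical, and the sysctl key names in the usage comment are assumptions based on the parameter names mentioned above:

```shell
# Hypothetical helper: prints the suggested table size for a node with the
# given number of cores ($1), using the 65536-per-core guideline above.
conntrack_table_size() {
  echo $(( 65536 * $1 ))
}

# Example usage on a cluster node (nproc reports the number of cores):
#   TABLE_SIZE="$(conntrack_table_size "$(nproc)")"
#   # Assumed sysctl keys for the two parameters named above:
#   # sudo sysctl -w net.netfilter.nf_conntrack_buckets="$TABLE_SIZE"
#   # sudo sysctl -w net.netfilter.nf_conntrack_max="$TABLE_SIZE"
```

For an eight-core node, conntrack_table_size 8 prints 524288, which matches the example above.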
Can't restore cluster backups with bmctl for some versions
We recommend that you back up your clusters before you upgrade so that
you can restore the earlier version if the upgrade doesn't succeed.
A problem with the bmctl restore cluster command causes it to
fail to restore backups of clusters with the identified versions. This
issue is specific to upgrades, where you're restoring a backup of an earlier
version.
If your cluster is affected, the bmctl restore cluster
log contains the following error:
Error: failed to extract image paths from profile: anthos version VERSION not supported
Workaround:
Until this issue is fixed, we recommend that you use the instructions in
Back up and restore clusters
to back up your clusters manually and restore them manually, if necessary.
Networking
1.10, 1.11, 1.12, 1.13, 1.14.0-1.14.2
NetworkGatewayGroup crashes if there's no IP address on
the interface
NetworkGatewayGroup fails to create daemons for nodes that
don't have both IPv4 and IPv6 interfaces on them. This causes features like
BGP LB and EgressNAT to fail. If you check the logs of the failing
ang-node Pod in the kube-system namespace, errors
similar to the following example are displayed when an IPv6 address is
missing:
ANGd.Setup Failed to create ANG daemon {"nodeName": "bm-node-1", "error":
"creating NDP client failed: ndp: address \"linklocal\" not found on interface \"ens192\""}
In the previous example, there's no IPv6 address on the
ens192 interface. Similar ARP errors are displayed if the
node is missing an IPv4 address.
NetworkGatewayGroup tries to establish an ARP connection and
an NDP connection to the link local IP address. If the IP address doesn't
exist (IPv4 for ARP, IPv6 for NDP) then the connection fails and the daemon
doesn't continue.
This issue is fixed in release 1.14.3.
Workaround:
Connect to the node using SSH and add an IPv4 or IPv6 address to the
link that contains the node IP. In the previous example log entry, this
interface was ens192:
ip address add dev INTERFACE scope link ADDRESS
Replace the following:
INTERFACE: The interface for your
node, such as ens192.
ADDRESS: The IP address and subnet
mask to apply to the interface.
Reset/Deletion
1.10, 1.11, 1.12, 1.13.0-1.13.2
anthos-cluster-operator crash loop when removing a
control plane node
When you try to remove a control plane node by removing the IP address
from the Cluster.Spec, the anthos-cluster-operator
enters into a crash loop state that blocks any other operations.
Workaround:
This issue is fixed in versions 1.13.3, 1.14.0, and later. All other versions are
affected. Upgrade to one of the fixed versions.
IP_ADDRESS: The IP address of
the node in a crash loop state.
CLUSTER_NAMESPACE: The cluster
namespace.
Installation
1.13.1, 1.13.2 and 1.13.3
kubeadm join fails in large clusters due to token
mismatch
When you install Anthos clusters on bare metal with a large number of nodes, you
might see a kubeadm join error message similar to the
following example:
TASK [kubeadm : kubeadm join --config /dev/stdin --ignore-preflight-errors=all] ***
fatal: [10.200.0.138]: FAILED! => {"changed": true, "cmd": "kubeadm join
--config /dev/stdin --ignore-preflight-errors=all", "delta": "0:05:00.140669", "end": "2022-11-01 21:53:15.195648", "msg": "non-zero return code", "rc": 1,
"start": "2022-11-01 21:48:15.054979", "stderr": "W1101 21:48:15.082440 99570 initconfiguration.go:119]
Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future.
Automatically prepending scheme \"unix\" to the \"criSocket\" with value \"/run/containerd/containerd.sock\". Please update your configuration!\nerror
execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID \"yjcik0\"\n
To see the stack trace of this error execute with --v=5 or higher", "stderr_lines":
["W1101 21:48:15.082440 99570 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future.
Automatically prepending scheme \"unix\" to the \"criSocket\" with value \"/run/containerd/containerd.sock\".
Please update your configuration!", "error execution phase preflight: couldn't validate the identity of the API Server:
could not find a JWS signature in the cluster-info ConfigMap for token ID \"yjcik0\"",
"To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[preflight]
Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}
Workaround:
This issue is resolved in Anthos clusters on bare metal version 1.13.4 and later.
If you need to use an affected version, first create a cluster with
fewer than 20 nodes, and then resize the cluster to add additional nodes
after the install is complete.
Logging and monitoring
1.10, 1.11, 1.12, 1.13.0
Low CPU limit for metrics-server in Edge clusters
In Anthos clusters on bare metal Edge clusters, low CPU limits for
metrics-server can cause frequent restarts of
metrics-server. Horizontal Pod Autoscaling (HPA) doesn't work
due to metrics-server being unhealthy.
If metrics-server CPU limit is less than 40m,
your clusters can be affected. To check the metrics-server
CPU limits, review one of the following files:
Remove the --config-dir=/etc/config line and increase the
CPU limits, as shown in the following example:
[...]
- command:
- /pod_nanny
# - --config-dir=/etc/config # <--- Remove this line
- --container=metrics-server
- --cpu=50m # <--- Increase CPU, such as to 50m
- --extra-cpu=0.5m
- --memory=35Mi
- --extra-memory=4Mi
- --threshold=5
- --deployment=metrics-server
- --poll-period=30000
- --estimator=exponential
- --scale-down-delay=24h
- --minClusterSize=5
- --use-metrics=true
[...]
Save and close the metrics-server to apply the changes.
Networking
1.14, 1.15, 1.16
Direct NodePort connection to hostNetwork Pod doesn't work
Connection to a Pod enabled with hostNetwork via NodePort
Service fails when the backend Pod is on the same node as the targeted
NodePort. This issue affects LoadBalancer Services when used with
hostNetwork-ed Pods. With multiple backends, there can be a sporadic
connection failure.
This issue is caused by a bug in the eBPF program.
Workaround:
When using a NodePort Service, don't target a node on which any of
the backend Pods run. When using a LoadBalancer Service, make sure the
hostNetwork-ed Pods don't run on LoadBalancer nodes.
Upgrades and updates
1.12.3, 1.13.0
1.13.0 admin clusters can't manage 1.12.3 user cluster(s)
Admin clusters that run version 1.13.0 can't manage user clusters that
run version 1.12.3. Operations against a version 1.12.3 user cluster fail.
Workaround:
Upgrade your admin cluster to version 1.13.1, or upgrade the user
cluster to the same version as the admin cluster.
Upgrades and updates
1.12
Upgrading to 1.13.x is blocked for admin clusters with worker node pools
Version 1.13.0 and higher admin clusters can't contain worker node pools.
Upgrades to version 1.13.0 or higher for admin clusters with worker node
pools are blocked. If your admin cluster upgrade is stalled, you can confirm
whether worker node pools are the cause by checking for the following error in the
upgrade-cluster.log file inside the bmctl-workspace folder:
Operation failed, retrying with backoff. Cause: error creating "baremetal.cluster.gke.io/v1, Kind=NodePool" cluster-test-cluster-2023-06-06-140654/np1: admission webhook "vnodepool.kb.io" denied the request: Adding worker nodepool to Admin cluster is disallowed.
Workaround:
Before upgrading, move all worker node pools to user clusters. For
instructions to add and remove node pools, see
Manage node pools in a cluster.
Upgrades and updates
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Errors when updating resources using kubectl apply
If you update existing resources like the ClientConfig or
Stackdriver custom resources using kubectl apply,
the controller might return an error or revert your input and desired changes.
For example, you might try to edit the Stackdriver custom
resource as follows by first getting the resource, and then applying an updated version:
Corrupted backlog chunks cause stackdriver-log-forwarder
crashloop
The stackdriver-log-forwarder crashloops if it tries to
process a corrupted backlog chunk. The following example errors are shown in
the container logs:
[2022/09/16 02:05:01] [error] [storage] format check failed: tail.1/1-1659339894.252926599.flb
[2022/09/16 02:05:01] [error] [engine] could not segregate backlog chunks
When this crashloop occurs, you can't see logs in Cloud Logging.
Workaround:
To resolve these errors, complete the following steps:
Identify the corrupted backlog chunks. Review the following example
error messages:
[2022/09/16 02:05:01] [error] [storage] format check failed: tail.1/1-1659339894.252926599.flb
[2022/09/16 02:05:01] [error] [engine] could not segregate backlog chunks
In this example, the file tail.1/1-1659339894.252926599.flb that's stored in var/log/fluent-bit-buffers/tail.1/ is at
fault. Every *.flb file with a format check failed must be removed.
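Identifying the affected chunks can be scripted by filtering the container logs for format check failed entries. The following is a minimal sketch that assumes the log lines have the same shape as the example errors above; corrupted_chunks is a hypothetical helper name, and POD_NAME, NAMESPACE, and KUBECONFIG in the usage comment are placeholders.

```shell
# Hypothetical helper: read stackdriver-log-forwarder logs on stdin and
# print the path of every chunk that failed the format check, once each.
corrupted_chunks() {
  sed -n 's/.*\[storage\] format check failed: \(.*\.flb\)[[:space:]]*$/\1/p' | sort -u
}

# Example:
#   kubectl logs POD_NAME -n NAMESPACE --kubeconfig KUBECONFIG | corrupted_chunks
```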
End the running pods for stackdriver-log-forwarder:
Make sure that the DaemonSet has cleaned up all the nodes. The
output of the following two commands should be equal to the number of
nodes in the cluster:
Restarting Dataplane V2 (anetd) on clusters can result in
existing VMs unable to attach to non-pod-network
On multi-nic clusters, restarting Dataplane V2 (anetd) can
result in virtual machines being unable to attach to networks. An error
similar to the following might be observed in the anetd
pod logs:
could not find an allocator to allocate the IP of the multi-nic endpoint
Workaround:
You can restart the VM as a quick fix. To avoid a recurrence of the
issue, upgrade to Anthos clusters on bare metal 1.14.1 or later.
Operation
1.13, 1.14.0, 1.14.1
gke-metrics-agent has no memory limit on Edge profile clusters
Depending on the cluster's workload, the gke-metrics-agent
might use greater than 4608 MiB of memory. This issue only affects Anthos clusters on bare metal Edge profile clusters. Default profile clusters aren't
impacted.
Workaround:
Upgrade your cluster to version 1.14.2 or later.
Installation
1.12, 1.13
Cluster creation might fail due to race conditions
When you create clusters using kubectl, race conditions can
prevent the preflight check from ever finishing. As a result, cluster creation
may fail in certain cases.
The preflight check reconciler creates a SecretForwarder
to copy the default ssh-key secret to the target namespace.
Typically, the preflight check relies on the owner references and
reconciles once the SecretForwarder is complete. However, in
rare cases the owner references of the SecretForwarder can
lose the reference to the preflight check, causing the preflight check to
get stuck. As a result, cluster creation fails. In order to continue the
reconciliation for the controller-driven preflight check, delete the
cluster-operator pod or delete the preflight-check resource. When you
delete the preflight-check resource, it creates another one and continues
the reconciliation. Alternately, you can upgrade your existing clusters
(that were created with an earlier version) to a fixed version.
Networking
1.9, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15
Reserved IP addresses aren't released when using whereabouts plugin with the multi-NIC feature
In the multi-NIC feature, if you're using the CNI whereabouts plugin
and you use the CNI DEL operation to delete a network interface for a Pod,
some reserved IP addresses might not be released properly. This happens
when the CNI DEL operation is interrupted.
You can verify the unused
IP address reservations of the Pods by running the following command:
kubectl get ippools -A --kubeconfig KUBECONFIG_PATH
Workaround:
Manually delete the IP addresses (ippools) that aren't used.
Installation
1.10, 1.11.0, 1.11.1, 1.11.2
Node Problem Detector fails in 1.10.4 user cluster
The Node Problem Detector may fail in version 1.10.x user clusters
when version 1.11.0, 1.11.1, or 1.11.2 admin clusters
manage them. When the Node Problem Detector fails, the log
gets updated with the following error message:
Error - NPD not supported for anthos baremetal version 1.10.4:
anthos version 1.10.4 not supported.
Workaround
Upgrade the admin cluster to 1.11.3 to resolve the issue.
Operation
1.14
1.14 island mode IPv4 cluster nodes have a pod CIDR mask
size of 24
In release 1.14, the maxPodsPerNode setting isn't taken
into account for
island mode
clusters, so the nodes are assigned a pod CIDR mask size of 24
(256 IP addresses). This might cause the cluster to run out of pod IP
addresses earlier than expected. For instance, if your cluster has a pod
CIDR mask size of 22, each node is assigned a pod CIDR mask of 24 and
the cluster can support only up to 4 nodes. Your cluster may
also experience network instability in a period of high pod churn when maxPodsPerNode is set to 129 or higher and there isn't enough
overhead in the pod CIDR for each node.
If your cluster is affected, the anetd pod reports the
following error when you add a new node to the cluster and there's no
podCIDR available:
error="required IPv4 PodCIDR not available"
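The node limit in the example above follows from the mask sizes: a cluster pod CIDR with mask size 22 contains 2^(24-22) = 4 blocks of mask size 24, and each /24 block holds 2^(32-24) = 256 addresses. The following sketch reproduces that arithmetic so you can estimate the limit for your own pod CIDR; max_nodes and ips_per_node are hypothetical helper names.

```shell
# Hypothetical helpers for the pod CIDR arithmetic described above.
# Number of per-node blocks that fit in the cluster pod CIDR.
# Usage: max_nodes CLUSTER_MASK [NODE_MASK]   (NODE_MASK defaults to 24)
max_nodes() {
  cluster_mask=$1
  node_mask=${2:-24}
  if [ "$node_mask" -lt "$cluster_mask" ]; then
    echo 0          # a node block can't be larger than the cluster CIDR
    return
  fi
  echo $(( 1 << (node_mask - cluster_mask) ))
}

# Number of IPv4 addresses in a block with the given mask size.
ips_per_node() {
  echo $(( 1 << (32 - $1) ))
}

# Example: max_nodes 22 prints 4, and ips_per_node 24 prints 256.
```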
Workaround
Use the following steps to resolve the issue:
Upgrade to 1.14.1 or a later version.
Remove the worker nodes and add them back.
Remove the control plane nodes and add them back, preferably one by
one to avoid cluster downtime.
Upgrades and updates
1.14.0, 1.14.1
Cluster upgrade rollback failure
An upgrade rollback might fail for Anthos clusters on bare metal 1.14.0 to 1.14.1.
If you upgrade a cluster from 1.14.0 to 1.14.1 and then try to roll back to
1.14.0 by using the bmctl restore cluster command, an error
like the following example might be returned:
I0119 22:11:49.705596 107905 client.go:48] Operation failed, retrying with backoff.
Cause: error updating "baremetal.cluster.gke.io/v1, Kind=HealthCheck" cluster-user-ci-f3a04dc1b0d2ac8/user-ci-f3a04dc1b0d2ac8-network: admission webhook "vhealthcheck.kb.io"
denied the request: HealthCheck.baremetal.cluster.gke.io "user-ci-f3a04dc1b0d2ac8-network" is invalid:
Spec: Invalid value: v1.HealthCheckSpec{ClusterName:(*string)(0xc0003096e0), AnthosBareMetalVersion:(*string)(0xc000309690),
Type:(*v1.CheckType)(0xc000309710), NodePoolNames:[]string(nil), NodeAddresses:[]string(nil), ConfigYAML:(*string)(nil),
CheckImageVersion:(*string)(nil), IntervalInSeconds:(*int64)(0xc0015c29f8)}: Field is immutable
Workaround:
Delete all healthchecks.baremetal.cluster.gke.io resources
under the cluster namespace and then rerun the bmctl restore
cluster command:
List all healthchecks.baremetal.cluster.gke.io
resources:
kubectl get healthchecks.baremetal.cluster.gke.io \
--namespace=CLUSTER_NAMESPACE \
--kubeconfig=ADMIN_KUBECONFIG
Replace the following:
CLUSTER_NAMESPACE: the
namespace for the cluster.
ADMIN_KUBECONFIG: the path to the
admin cluster kubeconfig file.
Delete all healthchecks.baremetal.cluster.gke.io
resources listed in the previous step:
Replace HEALTHCHECK_RESOURCE_NAME with
the name of each healthcheck resource.
Rerun the bmctl restore cluster command.
Networking
1.12.0
Service external IP address does
not work in flat mode
In a cluster that has flatIPv4 set to true,
Services of type LoadBalancer are not accessible by their
external IP addresses.
This issue is fixed in version 1.12.1.
Workaround:
In the cilium-config ConfigMap, set
enable-415 to "true", and then restart
the anetd Pods.
Upgrades and updates
1.13.0, 1.14
In-place upgrades from 1.13.0 to 1.14.x never finish
When you try to do an in-place upgrade from 1.13.0 to
1.14.x using bmctl 1.14.0 and the
--use-bootstrap=false flag, the upgrade never finishes.
An error with the preflight-check operator causes the
cluster to never schedule the required checks, which means the preflight
check never finishes.
Workaround:
Upgrade to 1.13.1 first before you upgrade to 1.14.x. An in-place
upgrade from 1.13.0 to 1.13.1 should work. Or, upgrade from 1.13.0 to
1.14.x without the --use-bootstrap=false flag.
Upgrades and updates, Security
1.13 and 1.14
Clusters upgraded to 1.14.0 lose master taints
The control plane nodes require one of two specific taints to prevent
workload pods from being scheduled on them. When you upgrade version 1.13
Anthos clusters to version 1.14.0, the control plane nodes lose the following required taints:
node-role.kubernetes.io/master:NoSchedule
node-role.kubernetes.io/master:PreferNoSchedule
This problem doesn't cause upgrade failures, but pods that aren't
supposed to run on the control plane nodes may start doing so. These
workload pods can overwhelm control plane nodes and lead to cluster
instability.
Determine if you're affected
To find control plane nodes, use the following command:
kubectl get node -l 'node-role.kubernetes.io/control-plane' \
-o name --kubeconfig KUBECONFIG_PATH
To check the list of taints on a node, use the following command:
If neither of the required taints is listed, then you're affected.
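The taint check can also be scripted. The following is a minimal sketch, assuming the taints are supplied one per line in key:effect form; has_master_taint is a hypothetical helper name, and the kubectl command in the usage comment is one possible way to produce that input with a JSONPath template, not necessarily the command this page originally showed.

```shell
# Hypothetical helper: read "key:effect" taint lines on stdin and succeed
# (exit status 0) if either required control plane taint is present.
has_master_taint() {
  grep -qE '^node-role\.kubernetes\.io/master:(NoSchedule|PreferNoSchedule)$'
}

# Example (NODE_NAME and KUBECONFIG_PATH are placeholders):
#   kubectl get node NODE_NAME --kubeconfig KUBECONFIG_PATH \
#     -o jsonpath='{range .spec.taints[*]}{.key}:{.effect}{"\n"}{end}' \
#     | has_master_taint || echo "NODE_NAME is missing the required taint"
```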
Workaround
Use the following steps for each control plane node of your affected
version 1.14.0 cluster to restore proper function. These steps are for the
node-role.kubernetes.io/master:NoSchedule taint and related
pods. If you intend for the control plane nodes to use the PreferNoSchedule
taint, then adjust the steps accordingly.
Find pods without the node-role.kubernetes.io/master:NoSchedule toleration:
kubectl get pods -A --field-selector spec.nodeName="NODE_NAME" \
-o=custom-columns='Name:metadata.name,Toleration:spec.tolerations[*].key' \
--kubeconfig KUBECONFIG_PATH | \
grep -v "node-role.kubernetes.io/master" | uniq
Delete the pods that don't have the node-role.kubernetes.io/master:NoSchedule
toleration:
kubectl delete pod POD_NAME --kubeconfig KUBECONFIG_PATH
Operation, Anthos VM Runtime
1.11, 1.12, 1.13, 1.14, 1.15, 1.16
VM creation fails intermittently with upload errors
Creating a new Virtual Machine (VM) with the kubectl virt create vm
command fails infrequently during image upload. This issue applies for
both Linux and Windows VMs. The error looks something like the following
example:
PVC default/heritage-linux-vm-boot-dv not found DataVolume default/heritage-linux-vm-boot-dv created
Waiting for PVC heritage-linux-vm-boot-dv upload pod to be ready... Pod now ready
Uploading data to https://10.200.0.51
2.38 MiB / 570.75 MiB [>----------------------------------------------------------------------------------] 0.42% 0s
fail to upload image: unexpected return value 500, ...
Workaround
Retry the kubectl virt create vm command to create your VM.
Upgrades and updates, Logging and monitoring
1.11
Managed collection components in 1.11 clusters aren't preserved in
upgrades to 1.12
Managed collection components are part of Managed Service for Prometheus.
If you manually deployed managed collection components in the gmp-system namespace of your
version 1.11 Anthos clusters, the associated resources aren't
preserved when you upgrade to version 1.12.
Starting with Anthos clusters on bare metal version 1.12.0, Managed Service
for Prometheus components in the gmp-system namespace and
related custom resource definitions are managed by
stackdriver-operator with the enableGMPForApplications
field. The enableGMPForApplications field defaults to
true, so if you manually deploy Managed Service for Prometheus
components in the namespace before upgrading to version 1.12, the
resources are deleted by stackdriver-operator.
Workaround
To preserve manually managed collection resources:
Back up all existing PodMonitoring custom resources.
If you're affected by this issue, bmctl writes the
following error in the upgrade-cluster.log file inside the
bmctl-workspace folder:
Operation failed, retrying with backoff. Cause: error creating "baremetal.cluster.gke.io/v1, Kind=Cluster": admission webhook
"vcluster.kb.io" denied the request: Spec.NodeConfig.ContainerRuntime: Forbidden: Starting with Anthos Bare Metal version 1.13 Docker container
runtime will not be supported. Before 1.13 please set the containerRuntime to containerd in your cluster resources.
Although highly discouraged, you can create a cluster with Docker node pools until 1.13 by passing the flag "--allow-docker-container-runtime" to bmctl
create cluster or add the annotation "baremetal.cluster.gke.io/allow-docker- container-runtime: true" to the cluster configuration file.
This is most likely to occur with version 1.12 Docker clusters that
were upgraded from 1.11, as that upgrade doesn't require the annotation
to maintain the Docker container runtime. In this case, clusters don't have
the annotation when upgrading to 1.13. Note that starting with
version 1.13, containerd is the only permitted container runtime.
Workaround:
If you're affected by this problem, update the cluster resource with
the missing annotation. You can add the annotation either while the
upgrade is running or after canceling and before retrying the upgrade.
Installation
1.11
bmctl exits before cluster creation completes
Cluster creation may fail for Anthos clusters on bare metal version 1.11.0
(this issue is fixed in Anthos clusters on bare metal release 1.11.1). In some
cases, the bmctl create cluster command exits early and
writes errors like the following to the logs:
Error creating cluster: error waiting for applied resources: provider cluster-api watching namespace USER_CLUSTER_NAME not found in the target cluster
Workaround
The failed operation produces artifacts, but the cluster isn't
operational. If this issue affects you, use the following steps to clean
up artifacts and create a cluster:
To delete cluster artifacts and reset the node machine, run the
following command:
bmctl reset -c USER_CLUSTER_NAME
To start the cluster creation operation, run the following command:
The --keep-bootstrap-cluster flag is important if this
command fails.
If the cluster creation command succeeds, you can skip the remaining
steps. Otherwise, continue.
Run the following command to get the version for the bootstrap cluster:
This error is benign and you can safely ignore it.
Installation
1.10, 1.11, 1.12
Cluster creation fails when using multi-NIC, containerd,
and HTTPS proxy
Cluster creation fails when you have the following combination of
conditions:
Cluster is configured to use containerd as the
container runtime (nodeConfig.containerRuntime set to
containerd in the cluster configuration file, the default
for Anthos clusters on bare metal version 1.11).
Cluster is configured to provide multiple network interfaces,
multi-NIC, for pods (clusterNetwork.multipleNetworkInterfaces
set to true in the cluster configuration file).
Cluster is configured to use a proxy (spec.proxy.url is
specified in the cluster configuration file). Even though cluster
creation fails, this setting is propagated when you attempt to create a
cluster. You may see this proxy setting as an HTTPS_PROXY environment variable or in your containerd configuration
(/etc/systemd/system/containerd.service.d/09-proxy.conf).
Workaround
Append service CIDRs (clusterNetwork.services.cidrBlocks)
to the NO_PROXY environment variable on all node machines.
Installation
1.10, 1.11, 1.12
Failure on systems with restrictive umask
setting
Anthos clusters on bare metal release 1.10.0 introduced a rootless control
plane feature that runs all the control plane components as a non-root
user. Running all components as a non-root user may cause installation
or upgrade failures on systems with a more restrictive
umask setting of 0077.
Workaround
Reset the control plane nodes and change the umask setting
to 0022 on all the control plane machines. After the machines
have been updated, retry the installation.
Alternatively, you can change the directory and file permissions of
/etc/kubernetes on the control-plane machines for the
installation or upgrade to proceed.
Make /etc/kubernetes and all its subdirectories world
readable: chmod o+rx.
Make all the files owned by root user under the
directory (recursively) /etc/kubernetes world readable (chmod o+r). Exclude private key files (.key) from these
changes as they are already created with correct ownership and
permissions.
Make /usr/local/etc/haproxy/haproxy.cfg world
readable.
Make /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml
world readable.
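The permission changes listed above can be sketched as a small shell helper. This is an illustration under stated assumptions, not an official procedure: make_world_readable is a hypothetical name, the optional second argument exists only so the helper can be tried on a scratch directory, and private key files are recognized solely by the .key suffix.

```shell
# Hypothetical helper: relax permissions under a directory tree.
# Usage: make_world_readable DIR [OWNER]   (OWNER defaults to root)
make_world_readable() {
  dir=$1
  owner=${2:-root}
  # Directories: world readable and traversable (chmod o+rx).
  find "$dir" -type d -exec chmod o+rx {} +
  # Files owned by OWNER: world readable (chmod o+r), skipping private keys.
  find "$dir" -type f -user "$owner" ! -name '*.key' -exec chmod o+r {} +
}

# Example (run as root on each control plane machine):
#   make_world_readable /etc/kubernetes
#   chmod o+r /usr/local/etc/haproxy/haproxy.cfg
#   chmod o+r /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml
```

Trying the helper on a copy of the directory first is a cheap way to confirm that the .key files keep their original mode.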
Installation
1.10, 1.11, 1.12, 1.13
Control group v2 incompatibility
Control group v2 (cgroup v2) is not supported in versions 1.13 and
earlier of Anthos clusters on bare metal. However, version 1.14 supports cgroup
v2 as a Preview
feature. The presence
of /sys/fs/cgroup/cgroup.controllers indicates that your
system uses cgroup v2.
Workaround
If your system uses cgroup v2, upgrade to version 1.14 of
Anthos clusters on bare metal.
Installation
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Preflight checks and service account credentials
For installations triggered by admin or hybrid clusters (in other
words, clusters not created with bmctl, like user clusters),
the preflight check does not verify Google Cloud service account
credentials or their associated permissions.
Installation
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Application default credentials and bmctl
bmctl uses
Application Default Credentials
(ADC) to validate the cluster operation's location value in the
cluster spec when it is not set to global.
Workaround
For ADC to work, you need to either point the
GOOGLE_APPLICATION_CREDENTIALS environment variable to a
service account credential file, or run
gcloud auth application-default login.
Installation
1.10, 1.11, 1.12, 1.13, 1.14, 1.15
Docker service
On cluster node machines, if the Docker executable is present in the
PATH environment variable, but the Docker service is not
active, preflight check will fail and report that the Docker service
is not active.
Workaround
Remove Docker, or enable the Docker service.
Installation
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Installing on vSphere
When installing Anthos clusters on bare metal on vSphere VMs, you must set the
tx-udp_tnl-segmentation and
tx-udp_tnl-csum-segmentation flags to off. These flags are
related to the hardware segmentation offload done by the vSphere driver
VMXNET3 and they don't work with the GENEVE tunnel of
Anthos clusters on bare metal.
Workaround
Run the following command on each node to check the current values for
these flags:
ethtool -k NET_INTFC | grep segm
Replace NET_INTFC with the network
interface associated with the IP address of the node.
The response should have entries like the following:
...
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
...
Sometimes in RHEL 8.4, ethtool shows these flags are off
while they aren't. To explicitly set these flags to off, toggle the flags
on and then off with the following commands:
ethtool -K ens192 tx-udp_tnl-segmentation on
ethtool -K ens192 tx-udp_tnl-csum-segmentation on
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
This flag change does not persist across reboots. Configure the startup
scripts to explicitly set these flags when the system boots.
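The flag check from the first step can be wrapped in a small helper, for example to use in a startup script or a node audit. The following is a minimal sketch, assuming the ethtool -k output has the same shape as the sample response above; tnl_offload_on is a hypothetical helper name. Because ethtool can report these flags incorrectly on RHEL 8.4, as noted above, treat a negative result on that platform with caution.

```shell
# Hypothetical helper: read `ethtool -k` output on stdin and succeed
# (exit status 0) if either UDP tunnel segmentation flag is reported as on.
tnl_offload_on() {
  grep -qE '^[[:space:]]*tx-udp_tnl-(csum-)?segmentation:[[:space:]]*on'
}

# Example (NET_INTFC is a placeholder for the node's network interface):
#   ethtool -k NET_INTFC | tnl_offload_on && echo "tunnel offload is still on"
```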
Upgrades and updates
1.10
bmctl can't create, update, or reset lower version user
clusters
The bmctl CLI can't create, update, or reset a user
cluster with a lower minor version, regardless of the admin cluster
version. For example, you can't use bmctl with a version of
1.N.X to reset a user cluster of version
1.N-1.Y, even if the admin cluster is also at version
1.N.X.
If you are affected by this issue, you should see logs similar to
the following when you use bmctl:
[2022-06-02 05:36:03-0500] error judging if the cluster is managing itself: error to parse the target cluster: error parsing cluster config: 1 error occurred:
* cluster version 1.8.1 is not supported in bmctl version 1.9.5, only cluster version 1.9.5 is supported
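Before running bmctl against a user cluster, you can compare the two version strings to see whether this limitation applies. The following is a minimal sketch that assumes both versions share the same major version and use the MAJOR.MINOR.PATCH form; minor_of and is_lower_minor are hypothetical helper names.

```shell
# Hypothetical helpers for comparing MAJOR.MINOR.PATCH version strings.
# Print the minor component of a version such as 1.9.5.
minor_of() {
  echo "$1" | cut -d. -f2
}

# Succeed (exit status 0) when the cluster's minor version is lower than
# the minor version of bmctl.
# Usage: is_lower_minor CLUSTER_VERSION BMCTL_VERSION
is_lower_minor() {
  [ "$(minor_of "$1")" -lt "$(minor_of "$2")" ]
}

# Example:
#   is_lower_minor 1.8.1 1.9.5 && echo "use kubectl for this user cluster"
```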
Workaround:
Use kubectl to create, edit, or delete the user cluster
custom resource inside the admin cluster.
The ability to upgrade user clusters is unaffected.
Upgrades and updates
1.12
Cluster upgrades to version 1.12.1 may stall
Upgrading clusters to version 1.12.1 sometimes stalls due to the API
server becoming unavailable. This issue affects all cluster types and all
supported operating systems. When this issue occurs, the bmctl
upgrade cluster command can fail at multiple points, including during
the second phase of preflight checks.
Workaround
You can check your upgrade logs to determine if you are affected by
this issue. Upgrade logs are located in
/baremetal/bmctl-workspace/CLUSTER_NAME/log/upgrade-cluster-TIMESTAMP by default.
The upgrade-cluster.log may contain errors like the following:
Failed to upgrade cluster: preflight checks failed: preflight check failed
The machine log may contain errors like the following (repeated failures
indicate that you are affected by this issue):
HAProxy and Keepalived must be running on each control plane node before you
reattempt to upgrade your cluster to version 1.12.1. Use the
crictl command-line interface
on each node to check to see if the haproxy and
keepalived containers are running:
If either HAProxy or Keepalived isn't running on a node, restart
kubelet on the node:
systemctl restart kubelet
Upgrades and updates, Anthos VM Runtime
1.11, 1.12
Upgrading clusters to version 1.12.0 or higher fails when
Anthos VM Runtime is enabled
In Anthos clusters on bare metal release 1.12.0, all resources related to
Anthos VM Runtime are migrated to the vm-system
namespace to better support the Anthos VM Runtime GA release. If
you have Anthos VM Runtime enabled in a version 1.11.x or lower
cluster, upgrading to version 1.12.0 or higher fails unless you first
disable Anthos VM Runtime. When you're affected by this issue, the
upgrade operation reports the following error:
Failed to upgrade cluster: cluster is not upgradable with vmruntime enabled from
version 1.11.x to version 1.12.0: please disable VMruntime before upgrade to
1.12.0 and higher version
Upgrade stuck at error during manifests operations
In some situations, cluster upgrades fail to complete and the
bmctl CLI becomes unresponsive. This problem can be caused by
an incorrectly updated resource. To determine if you're affected by this
issue and to correct it, check the anthos-cluster-operator
logs and look for errors similar to the following entries:
controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred: ... {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
These entries are a symptom of an incorrectly updated resource, where
{RESOURCE_NAME} is the name of the problem resource.
Workaround
If you find these errors in your logs, complete the following steps:
Use kubectl edit to remove the
kubectl.kubernetes.io/last-applied-configuration annotation
from the resource contained in the log message.
Save and apply your changes to the resource.
Retry the cluster upgrade.
Upgrades and updates
1.10, 1.11, 1.12
Upgrades are blocked for clusters with features that use Anthos
Network Gateway
Cluster upgrades from 1.10.x to 1.11.x fail for clusters that use
either egress NAT gateway or
bundled load-balancing with
BGP. These features both use Anthos Network Gateway. Cluster upgrades
get stuck at the Waiting for upgrade to complete...
command-line message and the anthos-cluster-operator logs errors
like the following:
apply run failed ... MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field
is immutable...
Workaround
To unblock the upgrade, run the following commands against the cluster
you are upgrading:
containerd 1.5.13 requires libseccomp 2.5 or
higher
Anthos clusters on bare metal release 1.12.1 ships with
containerd version 1.5.13, and this version of
containerd requires libseccomp version 2.5 or
higher.
If your system doesn't have libseccomp version 2.5 or
higher installed, update it in advance of upgrading existing clusters to
version 1.12.1. Otherwise, you may see errors in cplb-update
Pods for load balancer nodes such as the following:
runc did not terminate successfully: runc: symbol lookup error: runc: undefined
symbol: seccomp_notify_respond
Workaround
To install the latest version of libseccomp in Ubuntu, run
the following command:
sudo apt-get install libseccomp-dev
To install the latest version of libseccomp in CentOS or
RHEL, run the following command:
sudo dnf -y install libseccomp-devel
Operation
1.10, 1.11, 1.12
Nodes uncordoned if you don't use the maintenance mode procedure
If you run Anthos clusters on bare metal version 1.12.0
(anthosBareMetalVersion: 1.12.0) or lower and manually use
kubectl cordon on a node, Anthos clusters on bare metal might uncordon the
node before you're ready in an effort to reconcile the expected state.
Workaround
For Anthos clusters on bare metal version 1.12.0 and lower, use
maintenance mode to
cordon and drain nodes safely.
In version 1.12.1 (anthosBareMetalVersion: 1.12.1) or
higher, Anthos clusters on bare metal won't uncordon your nodes unexpectedly when
you use kubectl cordon.
Operation
1.11
Version 1.11 admin clusters using a registry mirror can't manage version
1.10 clusters
If your admin cluster is on version 1.11 and uses a registry mirror, it
can't manage user clusters that are on a lower minor version. This issue
affects reset, update, and upgrade operations on the user cluster.
To determine whether this issue affects you, check your logs for
cluster operations, such as create, upgrade, or reset. These logs are
located in the bmctl-workspace/CLUSTER_NAME/
folder by default. If you're affected by the issue, your logs contain the
following error message:
flag provided but not defined: -registry-mirror-host-to-endpoints
Operation
1.10, 1.11
kubeconfig Secret overwritten
The bmctl check cluster command, when run on user
clusters, overwrites the user cluster kubeconfig Secret with the admin
cluster kubeconfig. Overwriting the file causes standard cluster
operations, such as updating and upgrading, to fail for affected user
clusters. This problem applies to Anthos clusters on bare metal versions 1.11.1
and earlier.
To determine if this issue affects a user cluster, run the following
command:
ADMIN_KUBECONFIG: the path to the
admin cluster kubeconfig file.
USER_CLUSTER_NAMESPACE: the
namespace for the cluster. By default, the cluster namespaces for Anthos clusters on bare metal are the name of the cluster prefaced with
cluster-. For example, if you name your cluster
test, the default namespace is cluster-test.
USER_CLUSTER_NAME: the name of the
user cluster to check.
If the cluster name in the output (see
contexts.context.cluster in the following sample output) is
the admin cluster name, then the specified user cluster is affected.
The following steps restore function to an affected user cluster
(USER_CLUSTER_NAME):
Locate the user cluster kubeconfig file. Anthos clusters on bare metal
generates the kubeconfig file on the admin workstation when you create a
cluster. By default, the file is in the
bmctl-workspace/USER_CLUSTER_NAME
directory.
Verify that the kubeconfig file is the correct user cluster kubeconfig:
kubectl get nodes \
--kubeconfig PATH_TO_GENERATED_FILE
Replace PATH_TO_GENERATED_FILE with the
path to the user cluster kubeconfig file. The response returns details
about the nodes for the user cluster. Confirm the machine names are
correct for your cluster.
Run the following command to delete the corrupted kubeconfig file in
the admin cluster:
If you use containerd as the container runtime, running snapshot as
non-root user requires /usr/local/bin to be in the user's PATH.
Otherwise it will fail with a crictl: command not found
error.
When you aren't logged in as the root user, sudo is used
to run the snapshot commands. The sudo PATH can differ from the
root profile and may not contain /usr/local/bin.
Workaround
Update the secure_path in /etc/sudoers to
include /usr/local/bin. Alternatively, create a symbolic link
for crictl in another /bin directory.
Logging and monitoring
1.10
stackdriver-log-forwarder has [parser:cri] invalid
time format warning logs
If the
container runtime
interface (CRI) parser
uses an incorrect regular expression for parsing time, the logs for the
stackdriver-log-forwarder Pod contain errors and warnings
like the following:
[2022/03/04 17:47:54] [error] [parser] time string length is too long
[2022/03/04 20:16:43] [ warn] [parser:cri] invalid time format %Y-%m-%dT%H:%M:%S.%L%z for '2022-03-04T20:16:43.680484387Z'
Workaround:
Upgrade Anthos clusters on bare metal to version 1.11.0 or later.
If you're unable to upgrade your clusters immediately, use the
following steps to update CRI parsing regex:
To prevent your following changes from reverting, scale down the
stackdriver-operator:
Your edited resource should look similar to the following:
[PARSER]
# https://rubular.com/r/Vn30bO78GlkvyB
Name cri
Format regex
# The timestamp is described in https://www.rfc-editor.org/rfc/rfc3339#section-5.6
Regex ^(?<time>[0-9]{4}-[0-9]{2}-[0-9]{2}[Tt ][0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?(?:[Zz]|[+-][0-9]{2}:[0-9]{2})) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
Time_Keep off
For Anthos clusters on bare metal versions 1.10 to latest, some customers have
found unexpectedly high billing for Metrics volume on the
Billing page. This issue affects you only when all of the
following circumstances apply:
Application monitoring is enabled (enableStackdriverForApplications=true)
Application Pods have the prometheus.io/scrape=true
annotation
To confirm whether you are affected by this issue,
list your
user-defined metrics. If you see billing for unwanted metrics, then this
issue applies to you.
Workaround
If you are affected by this issue, we recommend that you upgrade your
clusters to version 1.12 and switch to the new application monitoring solution, managed-service-for-prometheus, which addresses this issue:
Separate flags to control the collection of application logs versus
application metrics
Bundled Google Cloud Managed Service for Prometheus
If you can't upgrade to version 1.12, use the following steps:
Find the source Pods and Services that have the unwanted billing:
kubectl --kubeconfig KUBECONFIG \
get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
kubectl --kubeconfig KUBECONFIG get \
services -A -o yaml | grep 'prometheus.io/scrape: "true"'
Remove the prometheus.io/scrape=true annotation from the
Pod or Service.
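The grep commands above print only the matching annotation lines, not the objects that carry them. The following sketch shows one possible way to list the owning Pods and then drop the annotation. The helper name list_scraped is hypothetical, and KUBECONFIG, NAMESPACE, and POD_NAME are placeholders.

```shell
# Hypothetical helper: read a three-column table (namespace, name, annotation
# value) on stdin, skip the header row, and print "namespace/name" for every
# row whose annotation value is "true".
list_scraped() {
  awk 'NR > 1 && $3 == "true" { print $1 "/" $2 }'
}

# Usage sketch (not run here; requires cluster access):
#   kubectl --kubeconfig KUBECONFIG get pods -A \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,SCRAPE:.metadata.annotations.prometheus\.io/scrape' \
#     | list_scraped
#
#   # A trailing "-" after the annotation key removes the annotation:
#   kubectl --kubeconfig KUBECONFIG annotate pod POD_NAME -n NAMESPACE prometheus.io/scrape-
```

For Pods created by a controller such as a Deployment, the annotation probably also needs to be removed from the Pod template, otherwise replacement Pods get it again.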
Logging and monitoring
1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Edits to metrics-server-config aren't persisted
High pod density can, in extreme cases, create excessive logging and
monitoring overhead, which can cause Metrics Server to stop and restart. You
can edit the metrics-server-config ConfigMap to allocate
more resources to keep Metrics Server running. However, due to reconciliation,
edits made to metrics-server-config can get
reverted to the default value during a cluster update or upgrade operation.
Metrics Server isn't affected immediately, but the next time
it restarts, it picks up the reverted ConfigMap and is vulnerable to excessive
overhead, again.
Workaround
For 1.11.x, you can script the ConfigMap edit and perform it along
with updates or upgrades to the cluster. For 1.12 and onward, please reach
out to support.
Several Anthos metrics have been deprecated and, starting with
Anthos clusters on bare metal release 1.11, data is no longer collected for these
deprecated metrics. If you use these metrics in any of your alerting
policies, there won't be any data to trigger the alerting condition.
The following table lists the individual metrics that have been
deprecated and the metric that replaces them.
In Anthos clusters on bare metal releases before 1.11, the policy definition
file for the recommended Anthos on baremetal node cpu usage exceeds
80 percent (critical) alert uses the deprecated metrics. The
node-cpu-usage-high.json JSON definition file is updated for
releases 1.11.0 and later.
Workaround
Use the following steps to migrate to the replacement metrics:
In the Google Cloud console, select Monitoring or click the
following button: Go to Monitoring
In the navigation pane, select
Dashboards, and delete the Anthos cluster node status
dashboard.
Click the Sample library tab and reinstall the Anthos
cluster node status dashboard.
stackdriver-log-forwarder has CrashLoopBackOff
errors
In some situations, the fluent-bit logging agent can get
stuck processing corrupt chunks. When the logging agent is unable to bypass
corrupt chunks, you may observe that
stackdriver-log-forwarder keeps crashing with a
CrashLoopBackOff error. If you are having this problem, your
logs have entries like the following:
[2022/03/09 02:18:44] [engine] caught signal (SIGSEGV)
#0 0x5590aa24bdd5 in validate_insert_id() at plugins/out_stackdriver/stackdriver.c:1232
#1 0x5590aa24c502 in stackdriver_format() at plugins/out_stackdriver/stackdriver.c:1523
#2 0x5590aa24e509 in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2105
#3 0x5590aa19c0de in output_pre_cb_flush() at include/fluent-bit/flb_output.h:490
#4 0x5590aa6889a6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#5 0xffffffffffffffff in ???() at ???:0
Workaround:
Clean up the buffer chunks for the Stackdriver Log Forwarder.
Note: In the following commands, replace
KUBECONFIG with the path to the admin
cluster kubeconfig file.
While these summary type metrics are in the metrics list, they are not
supported by gke-metrics-agent at this time.
Logging and monitoring
1.10, 1.11
Intermittent metrics export interruptions
Anthos clusters on bare metal may experience interruptions in normal,
continuous exporting of metrics, or missing metrics on some nodes. If this
issue affects your clusters, you may see gaps in data for the following
metrics (at a minimum):
The command finds cpu: 50m if your edits have taken
effect.
Networking
1.10
Multiple default gateways break connectivity to external endpoints
Having multiple default gateways in a node can lead to broken
connectivity from within a Pod to external endpoints, such as
google.com.
To determine if you're affected by this issue, run the following
command on the node:
ip route show
Multiple instances of default in the response indicate
that you're affected.
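To make the check above less error-prone, the default routes can be counted instead of read by eye. This is a minimal sketch; the helper name count_default_routes is an assumption made for illustration.

```shell
# Hypothetical helper: count the lines in `ip route show` output (read from
# stdin) that start with "default ". Prints 0 when there are none.
count_default_routes() {
  grep -c '^default ' || true
}

# Usage on a node (not run here):
#   ip route show | count_default_routes
```

A result greater than 1 corresponds to the "multiple instances of default" condition described above.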
Networking
1.12
Networking custom resource edits on user clusters get overwritten
Anthos clusters on bare metal version 1.12.x doesn't prevent you from manually
editing networking
custom resources
in your user cluster. Anthos clusters on bare metal reconciles custom resources
in the user clusters with the custom resources in your admin cluster
during cluster upgrades. This reconciliation overwrites any edits made
directly to the networking custom resources in the user cluster. The
networking custom resources should be modified in the admin cluster only,
but Anthos clusters on bare metal version 1.12.x doesn't enforce this
requirement.
You edit these custom resources in your admin cluster and the
reconciliation step applies the changes to your user clusters.
Workaround
If you've modified any of the previously mentioned custom resources on
a user cluster, modify the corresponding custom resources on your admin
cluster to match before upgrading. This step ensures that your
configuration changes are preserved. Anthos clusters on bare metal versions
1.13.0 and higher prevent you from modifying the networking custom
resources on your user clusters directly.
Networking
1.11, 1.12, 1.13, 1.14, 1.15, 1.16
NAT failure with too many parallel connections
For a given node in your cluster, the node IP address provides network
address translation (NAT) for packets routed to an address outside of the
cluster. Similarly, when inbound packets enter a load-balancing node
configured to use bundled load balancing (spec.loadBalancer.mode:
bundled), source network address translation (SNAT) routes the
packets to the node IP address before they are forwarded on to a backend
Pod.
The port range for NAT used by Anthos clusters on bare metal is
32768–65535. This range limits the number
of parallel connections to 32,767 per protocol on that node. Each
connection needs an entry in the conntrack table. If you have too many
short-lived connections, the conntrack table runs out of ports for NAT. A
garbage collector cleans up the stale entries, but the cleanup isn't
immediate.
When the number of connections on your node approaches 32,767, you will
start seeing packet drops for connections that need NAT.
You can identify this problem by running the following command on the
anetd Pod on the problematic node:
Client source IP with bundled Layer 2 load balancing
Setting the
external traffic policy
to Local can cause routing errors, such as No route to
host for bundled Layer 2 load balancing. The external traffic
policy is set to Cluster (externalTrafficPolicy:
Cluster), by default. With this setting, Kubernetes handles
cluster-wide traffic. Services of type LoadBalancer or
NodePort can use externalTrafficPolicy: Local to
preserve the client source IP address. With this setting, however,
Kubernetes only handles node-local traffic.
Workaround
If you want to preserve the client source IP address, additional
configuration may be required to ensure service IPs are reachable. For configuration details, see
Preserving
client source IP address in Configure bundled load balancing.
Networking
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Modifying firewalld will erase Cilium iptables policy
chains
When running Anthos clusters on bare metal with firewalld enabled
on either CentOS or Red Hat Enterprise Linux (RHEL), changes to
firewalld can remove the Cilium iptables chains
on the host network. The iptables chains are added by the
anetd Pod when it is started. The loss of the Cilium
iptables chains causes the Pods on the Node to lose network
connectivity outside of the Node.
Changes to firewalld that will remove the
iptables chains include, but aren't limited to:
Restarting firewalld, using systemctl
Reloading firewalld with the command line client
(firewall-cmd --reload)
Workaround
Restart anetd on the Node. Locate and delete the
anetd Pod with the following commands to restart
anetd:
When using the
egress NAT gateway feature
preview, it is possible to set traffic selection rules that specify an
egressSourceIP address that is already in use for another
EgressNATPolicy object. This may cause egress traffic routing
conflicts.
Workaround
Coordinate with your development team to determine which floating IP
addresses are available for use before specifying the
egressSourceIP address in your EgressNATPolicy
custom resource.
Networking
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Pod connectivity failures and reverse path filtering
Anthos clusters on bare metal configures reverse path filtering on nodes to
disable source validation (net.ipv4.conf.all.rp_filter=0).
If the rp_filter setting is changed to 1 or
2, pods will fail due to out-of-node communication
timeouts.
Reverse path filtering is set with rp_filter files in the
IPv4 configuration folder (net/ipv4/conf/all). This value may
also be overridden by sysctl, which stores reverse path
filtering settings in a network security configuration file, such as
/etc/sysctl.d/60-gce-network-security.conf.
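Because a sysctl configuration file can reapply a non-zero value, it can help to search those files as well as read the live setting. The sketch below assumes the usual /etc/sysctl.conf and /etc/sysctl.d locations; the helper name find_rp_filter_overrides is hypothetical.

```shell
# Hypothetical helper: print lines from the given files or directories that set
# net.ipv4.conf.all.rp_filter to 1 or 2. Prints nothing if there are none.
find_rp_filter_overrides() {
  grep -rsE '^[[:space:]]*net\.ipv4\.conf\.all\.rp_filter[[:space:]]*=[[:space:]]*[12]' "$@"
}

# Usage on a node (not run here):
#   sysctl -n net.ipv4.conf.all.rp_filter     # live value; 0 is expected
#   find_rp_filter_overrides /etc/sysctl.conf /etc/sysctl.d
```

Any line that the helper prints names a file that could set the value back to 1 or 2 the next time sysctl settings are loaded.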
Workaround
To restore Pod connectivity, either set
net.ipv4.conf.all.rp_filter back to 0 manually,
or restart the anetd Pod, which sets
net.ipv4.conf.all.rp_filter back to 0. To
restart the anetd Pod, delete it so that a new anetd Pod
starts up in its place. Use the following commands to locate
and delete the anetd Pod:
Bootstrap (kind) cluster IP addresses and cluster node IP addresses
overlapping
192.168.122.0/24 and 10.96.0.0/27 are the
default pod and service CIDRs used by the bootstrap (kind) cluster.
Preflight checks will fail if they overlap with cluster node machine IP
addresses.
Workaround
To avoid the conflict, you can pass the
--bootstrap-cluster-pod-cidr and
--bootstrap-cluster-service-cidr flags to bmctl
to specify different values.
Operating system
1.11
Incompatibility with Ubuntu 18.04.6 on GA kernel
Anthos clusters on bare metal versions 1.11.0 and 1.11.1 aren't compatible
with Ubuntu 18.04.6 on the GA kernel (from 4.15.0-144-generic
to 4.15.0-176-generic). The incompatibility causes the
networking agent to fail to configure the cluster network with a
"BPF program is too large" error in the anetd logs. You may
see Pods stuck in ContainerCreating status with a
networkPlugin cni failed to set up pod error in the Pod's
event log. This issue doesn't apply to the Ubuntu Hardware Enablement
(HWE) kernels.
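A node's kernel release can be compared against the affected range with a small check. This is a sketch that assumes the range above is inclusive at both ends and that the release string has the form 4.15.0-N-generic; the function name is hypothetical.

```shell
# Hypothetical helper: succeed if the kernel release string passed as $1 falls
# in the range 4.15.0-144-generic to 4.15.0-176-generic (assumed inclusive).
is_affected_kernel() {
  case "$1" in
    4.15.0-*-generic) ;;
    *) return 1 ;;
  esac
  abi=${1#4.15.0-}        # strip the version prefix
  abi=${abi%-generic}     # strip the flavor suffix, leaving the ABI number
  case "$abi" in
    ''|*[!0-9]*) return 1 ;;   # reject anything that is not a plain number
  esac
  [ "$abi" -ge 144 ] && [ "$abi" -le 176 ]
}

# Usage on a node (not run here):
#   is_affected_kernel "$(uname -r)" && echo "kernel is in the affected range"
```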
Workaround
We recommend that you
get the
HWE kernel and upgrade it to the latest supported HWE version for
Ubuntu 18.04.
Operating system
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Cluster creation or upgrade fails on CentOS
In December 2020, the CentOS community and Red Hat announced the
sunset
of CentOS. On January 31, 2022, CentOS 8 reached its end of life
(EOL). As a result of the EOL, yum repositories stopped
working for CentOS, which causes cluster creation and cluster upgrade
operations to fail. This applies to all supported versions of CentOS and
affects all versions of Anthos clusters on bare metal.
Workaround
As a workaround, run the following commands to have your CentOS use an
archive feed:
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
/etc/yum.repos.d/CentOS-Linux-*
On RHEL and CentOS, there is a cluster-level limitation of 100,000
endpoints, counted per Kubernetes Service: if two Services reference the same set of
Pods, this counts as two separate sets of endpoints. The underlying
nftables implementation on RHEL and CentOS causes this
limitation; it is not an intrinsic limitation of Anthos clusters on bare metal.
Security
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
Container can't write to VOLUME defined in Dockerfile
with containerd and SELinux
If you use containerd as the container runtime and your operating
system has SELinux enabled, the VOLUME defined in the
application Dockerfile might not be writable. For example, containers
built with the following Dockerfile aren't able to write to the
/tmp folder.
FROM ubuntu:20.04
RUN chmod -R 777 /tmp
VOLUME /tmp
To verify if you're affected by this issue, run the following command
on the node that hosts the problematic container:
ausearch -m avc
If you're affected by this issue, you see a denied error
like the following:
To work around this issue, make either of the following changes:
Turn off SELinux.
Don't use the VOLUME feature inside Dockerfile.
Security
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
SELinux errors during pod creation
Pod creation sometimes fails when SELinux prevents the container
runtime from setting labels on tmpfs mounts. This failure is
rare, but can happen when SELinux is in Enforcing mode and in
some kernels.
To verify that SELinux is the cause of pod creation failures, use the
following command to check for errors in the kubelet logs:
journalctl -u kubelet
If SELinux is causing pod creation to fail, the command response
contains an error similar to the following:
error setting label on mount source '/var/lib/kubelet/pods/6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x': failed to set file label on /var/lib/kubelet/pods/6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x: permission denied
To verify that this issue is related to SELinux enforcement, run the
following command:
ausearch -m avc
This command searches the audit logs for access vector cache (AVC)
permission errors. The avc: denied in the following sample
response confirms that the pod creation failures are related to SELinux
enforcement.
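Both checks above come down to searching command output for a known string, so they can be wrapped in small filters. This is a sketch only; the helper names are hypothetical, and the patterns are taken from the error text quoted above.

```shell
# Hypothetical helpers; each reads log text on stdin and succeeds on a match.

# Matches the kubelet error shown above.
has_label_error() {
  grep -q 'error setting label on mount source'
}

# Matches an AVC denial; allows any amount of whitespace after "avc:".
has_avc_denial() {
  grep -Eq 'avc:[[:space:]]+denied'
}

# Usage on a node (not run here):
#   journalctl -u kubelet | has_label_error && echo "kubelet label errors found"
#   ausearch -m avc | has_avc_denial && echo "AVC denials found"
```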
The root cause of this pod creation problem with SELinux is a kernel
bug found in the following Linux images:
Red Hat Enterprise Linux (RHEL) releases prior to 8.3
CentOS releases prior to 8.3
Workaround
Rebooting the machine helps recover from the issue.
To prevent pod creation errors from occurring, use RHEL 8.3 or later or
CentOS 8.3 or later, because those versions have fixed the kernel bug.
Reset/Deletion
1.10, 1.11, 1.12
Namespace deletion
Deleting a namespace will prevent new resources from being created in
that namespace, including jobs to reset machines.
Workaround
When deleting a user cluster, you must delete the cluster object first
before deleting its namespace. Otherwise, the jobs to reset machines
can't get created, and the deletion process will skip the machine
clean-up step.
Reset/Deletion
1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16
containerd service
The bmctl reset command doesn't delete any
containerd configuration files or binaries. The
containerd systemd service is left up and running. The
command deletes the containers that run Pods scheduled to the node.
Upgrades and updates
1.10, 1.11, 1.12
Node Problem Detector is not enabled by default after cluster upgrades
When you upgrade Anthos clusters on bare metal, Node Problem Detector is not enabled
by default. This issue applies to upgrades in releases 1.10 through
1.12.1 and is fixed in release 1.12.2.
Workaround:
To enable the Node Problem Detector:
Verify if node-problem-detector systemd service is
running on the node.
Use the SSH command and connect to the node.
Check if node-problem-detector systemd service is
running on the node:
systemctl is-active node-problem-detector
If the command result displays inactive, then the node-problem-detector is not running on the node.
To enable the Node Problem Detector, use the
kubectl edit command and edit the
node-problem-detector-config ConfigMap. For more
information, see
Node Problem
Detector.
Operation
1.9, 1.10
Cluster backup fails when using non-root login
The bmctl backup cluster command fails if
nodeAccess.loginUser is set to a non-root username.
Workaround:
This issue applies to Anthos clusters on bare metal 1.9.x, 1.10.0, and 1.10.1
and is fixed in version 1.10.2 and later.
Networking
1.10, 1.11, 1.12
Load Balancer Services don't work with containers on the control plane
host network
There is a bug in anetd where packets are dropped for
LoadBalancer Services if the backend pods are both running on the control
plane node and are using the hostNetwork: true field in the
container's spec.
The bug is not present in version 1.13 or later.
Workaround:
The following workarounds can help if you use a LoadBalancer Service
that is backed by hostNetwork Pods:
Run them on worker nodes (not control plane nodes).
Orphaned anthos-version-$version$ pod failing to pull image
Clusters upgrading from 1.12.x to 1.13.x might show a failing
anthos-version-$version$ Pod with an ImagePullBackOff error.
This failure is caused by a race condition that occurs when anthos-cluster-operator is
upgraded, and it should not affect any regular cluster functionality.
The bug is not present in releases after version 1.13.
Workaround:
Delete the dynamic-version-installer Job by running:
kubectl delete job anthos-version-$version$ -n kube-system
Upgrades and updates
1.13
1.12 clusters upgraded from 1.11 can't upgrade to 1.13.0
Version 1.12 clusters that were upgraded from version 1.11 can't be
upgraded to version 1.13.0. This upgrade issue doesn't apply to clusters
that were created at version 1.12.
To determine if you're affected, check the logs of the upgrade job that
contains the upgrade-first-no* string in the admin cluster.
If you see the following error message, you're affected.
TASK [kubeadm_upgrade_apply : Run kubeadm upgrade apply] *******
...
[upgrade/config] FATAL: featureGates: Invalid value: map[string]bool{\"IPv6DualStack\":false}: IPv6DualStack is not a valid feature name.
...
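The log check described above can be scripted by searching the upgrade job output for the error text. This is a sketch under assumptions: the helper name is hypothetical, and ADMIN_KUBECONFIG, JOB_NAME, and NAMESPACE are placeholders for your own values.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the
# IPv6DualStack feature gate error shown above.
has_feature_gate_error() {
  grep -q 'IPv6DualStack is not a valid feature name'
}

# Usage sketch (not run here; requires access to the admin cluster):
#   kubectl --kubeconfig ADMIN_KUBECONFIG logs job/JOB_NAME -n NAMESPACE \
#     | has_feature_gate_error && echo "affected"
```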
Workaround:
To work around this issue:
Run the following commands on your admin workstation: