GKE on Bare Metal known issues

Configuration

Control plane and load balancer specifications

The control plane and load balancer node pool specifications are special. These specifications declare and control critical cluster resources. The canonical source for these resources is their respective sections in the cluster config file:

  • spec.controlPlane.nodePoolSpec
  • spec.loadBalancer.nodePoolSpec

Consequently, do not modify the top-level control plane and load balancer node pool resources directly. Modify the associated sections in the cluster config file instead.

Installation

Cluster creation fails for HA cluster configurations with global proxy

Cluster creation fails when you have the following combination of conditions:

  • Cluster version specified in the cluster configuration file (spec.anthosBareMetalVersion) has one of the following values: 1.9.0, 1.9.1, or 1.9.2.

  • Cluster is configured as a high-availability (HA) cluster, meaning there are three or more control plane nodes specified in the cluster configuration file (spec.controlPlane.nodePoolSpec.nodes).

  • One or more control plane nodes is configured with a global proxy. You might see this proxy setting as an HTTPS_PROXY environment variable, or you might find https_proxy or HTTPS_PROXY entries in the /etc/environment file.

To check if you have a global proxy set on the control plane machine:

  1. Use SSH to connect to the control plane machine.

  2. Run the following command to see if you get a proxy server address:

    echo $HTTPS_PROXY $https_proxy
    

Workaround

As a workaround, you can either remove the global proxy from the control plane node or add the control plane VIP (spec.loadBalancer.vips.controlPlaneVIP) to the NO_PROXY list on the control plane node.

To add the control plane VIP to the NO_PROXY list:

  1. On the control plane node, open the /etc/environment file for editing.

  2. If the file doesn't have a NO_PROXY entry, add one like the following:

    NO_PROXY=CONTROL_PLANE_VIP
    

    Replace CONTROL_PLANE_VIP with the value of spec.loadBalancer.vips.controlPlaneVIP from the cluster configuration file.

  3. If there's already a NO_PROXY entry, add the control plane VIP address to the comma-separated list.
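
For example, an edited /etc/environment file with an existing NO_PROXY entry might look like the following, where the proxy address and 10.0.0.8 are placeholders, not recommended values:

HTTPS_PROXY=http://proxy.example.com:3128
NO_PROXY=localhost,127.0.0.1,10.0.0.8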

Cluster creation fails when using multi-NIC, containerd, and HTTPS proxy

Cluster creation fails when you have the following combination of conditions:

  • Cluster is configured to use containerd as the container runtime (nodeConfig.containerRuntime set to containerd in the cluster configuration file, the default for GKE on Bare Metal version 1.9).

  • Cluster is configured to provide multiple network interfaces, multi-NIC, for pods (clusterNetwork.multipleNetworkInterfaces set to true in the cluster configuration file).

  • Cluster is configured to use a proxy (spec.proxy.url is specified in the cluster configuration file). Even though cluster creation fails, this setting is propagated when you attempt to create a cluster. You may see this proxy setting as an HTTPS_PROXY environment variable or in your containerd configuration (/etc/systemd/system/containerd.service.d/09-proxy.conf).

As a workaround for this issue, append service CIDRs (clusterNetwork.services.cidrBlocks) to the NO_PROXY environment variable on all node machines.
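
For example, if your service CIDR block is 10.96.0.0/20 (a placeholder value), the NO_PROXY entry on each node machine might look like the following:

NO_PROXY=localhost,127.0.0.1,10.96.0.0/20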

Unspecified containerRuntime doesn't default to containerd

In GKE on Bare Metal release 1.9.0, the containerRuntime default was updated from docker to containerd in the generated cluster configuration file. If the containerRuntime field isn't set or is removed from the cluster configuration file, containerRuntime is set to docker when you create clusters. The containerRuntime value should default to containerd, unless it is explicitly set to docker. This issue applies to releases 1.9.0 and 1.9.1 only.

To determine which container runtime your cluster is using, follow the steps in Retrieve cluster information. Check the value of containerRuntime in the cluster.spec.nodeConfig section.
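
For example, the following command prints the container runtime directly; it assumes the ADMIN_KUBECONFIG, CLUSTER_NAME, and CLUSTER_NAMESPACE placeholders used elsewhere on this page:

kubectl --kubeconfig ADMIN_KUBECONFIG get cluster CLUSTER_NAME -n CLUSTER_NAMESPACE \
    --output=jsonpath="{.spec.nodeConfig.containerRuntime}"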

The only way to change the container runtime is by upgrading your clusters. For more information, see Change your container runtime.

This issue is fixed in GKE on Bare Metal release 1.9.2.

Control group v2 incompatibility

Control group v2 (cgroup v2) is not officially supported in GKE on Bare Metal 1.9. The presence of /sys/fs/cgroup/cgroup.controllers indicates that your system uses cgroup v2.

The preflight checks verify that cgroup v2 is not in use on the cluster machine.
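
To check a machine yourself, you can test for the indicator file mentioned above:

if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "cgroup v2 is in use on this machine"
fi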

Benign error messages during installation

When examining cluster creation logs, you may notice transient failures about registering clusters or calling webhooks. These errors can be safely ignored, because the installation will retry these operations until they succeed.

Preflight checks and service account credentials

For installations triggered by admin or hybrid clusters (in other words, clusters not created with bmctl, like user clusters), the preflight check does not verify Google Cloud Platform service account credentials or their associated permissions.

Preflight checks and permission denied

During installation, you may see errors such as /bin/sh: /tmp/disks_check.sh: Permission denied. These errors occur because /tmp is mounted with the noexec option. For bmctl to work, remove the noexec option from the /tmp mount point.
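
A minimal way to remove the noexec option without rebooting is to remount /tmp; make the matching change in /etc/fstab if you need it to persist across reboots:

mount -o remount,exec /tmp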

Application default credentials and bmctl

bmctl uses Application Default Credentials (ADC) to validate the cluster operation's location value in the cluster spec when it is not set to global.

For ADC to work, you need to either point the GOOGLE_APPLICATION_CREDENTIALS environment variable to a service account credential file, or run gcloud auth application-default login.
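
For example, either of the following satisfies the ADC requirement (the key file path is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# or, interactively:
gcloud auth application-default login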

Docker service

On cluster node machines, if the Docker executable is present in the PATH environment variable but the Docker service is not active, the preflight check fails and reports that the Docker service is not active. To fix this error, either remove Docker or enable the Docker service.
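
For example, on systemd-based distributions you can enable and start the Docker service with a single command:

systemctl enable --now docker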

Installing on vSphere

When installing GKE on Bare Metal on vSphere VMs, you must set the tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation flags to off. These flags are related to the hardware segmentation offload done by the vSphere driver VMXNET3 and they don't work with the GENEVE tunnel of GKE on Bare Metal.

Run the following command on each node to check the current values for these flags:

ethtool -k NET_INTFC | grep segm
...
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
...

Replace NET_INTFC with the network interface associated with the IP address of the node.

Sometimes in RHEL 8.4, ethtool shows these flags as off when they aren't. To explicitly set these flags to off, toggle the flags on and then off with the following commands:

ethtool -K ens192 tx-udp_tnl-segmentation on
ethtool -K ens192 tx-udp_tnl-csum-segmentation on

ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off

This flag change does not persist across reboots. Configure the startup scripts to explicitly set these flags when the system boots.
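
One way to persist the change is a small systemd oneshot unit. The following is a sketch only; the unit name is hypothetical, and the interface name (ens192) and ethtool path are assumptions to adapt to your environment:

# /etc/systemd/system/disable-udp-tnl-offload.service (hypothetical unit name)
[Unit]
Description=Disable UDP tunnel segmentation offload for GENEVE tunnels
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Toggle off the offload flags that conflict with the GENEVE tunnel
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-segmentation off
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-csum-segmentation off

[Install]
WantedBy=multi-user.target

After creating the file, enable the unit with systemctl enable disable-udp-tnl-offload.service so it runs at every boot.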

Upgrades and updates

bmctl update cluster fails if the .manifests directory is missing

If the .manifests directory is removed prior to running bmctl update cluster, the command fails with an error similar to the following:

Error updating cluster resources.: failed to get CRD file .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: open .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: no such file or directory

You can fix this issue by running bmctl check cluster first, which will recreate the .manifests directory.

This issue applies to GKE on Bare Metal 1.10 and earlier and is fixed in version 1.11 and later.

bmctl can't create, update, or reset lower version user clusters

The bmctl CLI can't create, update, or reset a user cluster with a lower minor version, regardless of the admin cluster version. For example, you can't use bmctl with a version of 1.N.X to reset a user cluster of version 1.N-1.Y, even if the admin cluster is also at version 1.N.X.

If you are affected by this issue, you should see logs similar to the following when you use bmctl:

[2022-06-02 05:36:03-0500] error judging if the cluster is managing itself:
error to parse the target cluster: error parsing cluster config: 1 error
occurred:
    * cluster version 1.8.1 is not supported in bmctl version 1.9.5, only
cluster version 1.9.5 is supported

To work around the issue, use kubectl to create, edit, or delete the user cluster custom resource inside the admin cluster.
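
For example, a hedged sketch of editing the user cluster resource directly, assuming the cluster-USER_CLUSTER_NAME namespace convention shown elsewhere on this page:

kubectl --kubeconfig ADMIN_KUBECONFIG edit cluster USER_CLUSTER_NAME \
    -n cluster-USER_CLUSTER_NAME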

The ability to upgrade user clusters is unaffected.

Upgrade stuck at error during manifests operations

In some situations, cluster upgrades fail to complete and the bmctl CLI becomes unresponsive. This problem can be caused by an incorrectly updated resource. To determine if you're affected by this issue and to correct it, use the following steps:

  1. Check the anthos-cluster-operator logs and look for errors similar to the following entries:

    controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred:
    ...
    {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
    

    These entries are a symptom of an incorrectly updated resource, where {RESOURCE_NAME} is the name of the problem resource.

  2. If you find these errors in your logs, use kubectl edit to remove the kubectl.kubernetes.io/last-applied-configuration annotation from the resource contained in the log message (a one-command alternative is shown after these steps).

  3. Save and apply your changes to the resource.

  4. Retry the cluster upgrade.
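
As a one-command alternative to the kubectl edit in step 2, you can remove the annotation with kubectl's trailing-dash syntax; RESOURCE_KIND and RESOURCE_NAME are placeholders for the kind and name of the problem resource:

kubectl -n CLUSTER_NAMESPACE annotate RESOURCE_KIND RESOURCE_NAME \
    kubectl.kubernetes.io/last-applied-configuration-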

Upgrades fail for version 1.8 clusters in maintenance mode

Attempting to upgrade a version 1.8.x cluster to version 1.9.x fails if any node machines have previously been put into maintenance mode. This is due to an annotation that remains on these nodes.

To determine if you are affected by this issue, use the following steps:

  1. Get the version of the cluster you want to upgrade by running the following command:

    kubectl --kubeconfig ADMIN_KUBECONFIG get cluster CLUSTER_NAME  \
        -n CLUSTER_NAMESPACE --output=jsonpath="{.spec.anthosBareMetalVersion}"
    

    If the returned version value is for the 1.8 minor release, such as 1.8.3, then continue. Otherwise, this issue does not apply to you.

  2. Check whether the cluster has any nodes that have previously been put into maintenance mode by running the following command:

    kubectl --kubeconfig ADMIN_KUBECONFIG get BareMetalMachines -n CLUSTER_NAMESPACE  \
        --output=jsonpath="{.items[*].metadata.annotations}"
    

    If the returned annotations contain baremetal.cluster.gke.io/maintenance-mode-duration, then you are affected by this known issue.

To unblock the cluster upgrade, run the following command for each affected node machine to remove the baremetal.cluster.gke.io/maintenance-mode-duration annotation:

kubectl --kubeconfig ADMIN_KUBECONFIG  annotate BareMetalMachine -n CLUSTER_NAMESPACE \
    NODE_MACHINE_NAME baremetal.cluster.gke.io/maintenance-mode-duration-

bmctl update doesn't remove maintenance blocks

The bmctl update command can't remove or modify the maintenanceBlocks section from the cluster resource configuration. For more information, including instructions for removing nodes from maintenance mode, see Put nodes into maintenance mode.

User cluster patch upgrade limitation

User clusters that are managed by an admin cluster must be at the same GKE on Bare Metal version as the admin cluster or lower, and within one minor release. For example, a version 1.9.0 (anthosBareMetalVersion: 1.9.0) admin cluster managing version 1.8.4 user clusters is acceptable.

An upgrade limitation prevents you from upgrading your user clusters to a new security patch when the patch is released after the release version the admin cluster is using. For example, if your admin cluster is at version 1.7.2, which was released on June 2, 2021, you can't upgrade your user clusters to version 1.6.4, because it was released on August 13, 2021.

Node draining can't start when Node is out of reach

The draining process for Nodes won't start if the Node is out of reach from GKE on Bare Metal. For example, if a Node goes offline during a cluster upgrade process, the upgrade may stop responding. This is a rare occurrence. To minimize the likelihood of encountering this problem, ensure your Nodes are operating properly before initiating an upgrade.

Operation

Cluster backup fails when using non-root login

The bmctl backup cluster command fails if nodeAccess.loginUser is set to a non-root username.

This issue applies to GKE on Bare Metal 1.9.x, 1.10.0, and 1.10.1 and is fixed in version 1.10.2 and later.

Nodes uncordoned if you don't use the maintenance mode procedure

If you manually use kubectl cordon on a node, GKE on Bare Metal might uncordon the node before you're ready in an effort to reconcile the expected state. For GKE on Bare Metal version 1.12.0 and lower, use maintenance mode to cordon and drain nodes safely. In version 1.12.1 (anthosBareMetalVersion: 1.12.1) or higher, GKE on Bare Metal won't uncordon your nodes unexpectedly when you use kubectl cordon.

kubeconfig secret overwritten

The bmctl check cluster command, when run on user clusters, overwrites the user cluster kubeconfig secret with the admin cluster kubeconfig. Overwriting the file causes standard cluster operations, such as updating and upgrading, to fail for affected user clusters. This problem applies to GKE on Bare Metal versions 1.11.1 and earlier.

To determine if a user cluster is affected by this issue, run the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG get secret -n cluster-USER_CLUSTER_NAME \
    USER_CLUSTER_NAME-kubeconfig -o json | jq -r '.data.value' | base64 -d

Replace the following:

  • ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.
  • USER_CLUSTER_NAME: the name of the user cluster to check.

If the cluster name in the output (see contexts.context.cluster in the following sample output) is the admin cluster name, then the specified user cluster is affected.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:LS0tLS1CRU...UtLS0tLQo=
    server: https://10.200.0.6:443
  name: ci-aed78cdeca81874
contexts:
- context:
    cluster: ci-aed78cdeca81874
    user: ci-aed78cdeca81874-admin
  name: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
current-context: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
kind: Config
preferences: {}
users:
- name: ci-aed78cdeca81874-admin
  user:
    client-certificate-data: LS0tLS1CRU...UtLS0tLQo=
    client-key-data: LS0tLS1CRU...0tLS0tCg==

The following steps restore function to an affected user cluster (USER_CLUSTER_NAME):

  1. Locate the user cluster kubeconfig file.

    GKE on Bare Metal generates the kubeconfig file on the admin workstation when you create a cluster. By default, the file is in the bmctl-workspace/USER_CLUSTER_NAME directory.

  2. Verify that the kubeconfig is the correct user cluster kubeconfig:

    kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE
    

    Replace PATH_TO_GENERATED_FILE with the path to the user cluster kubeconfig file. The response returns details about the nodes for the user cluster. Confirm the machine names are correct for your cluster.

  3. Run the following command to delete the corrupted kubeconfig file in the admin cluster:

    kubectl delete secret -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig
    
  4. Run the following command to save the correct kubeconfig secret back to the admin cluster:

    kubectl create secret generic -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig \
        --from-file=value=PATH_TO_GENERATED_FILE
    

Taking a snapshot as a non-root login user

If you use containerd as the container runtime, taking a snapshot as a non-root user requires /usr/local/bin to be in the user's PATH. Otherwise, the snapshot fails with a crictl: command not found error.

When you aren't logged in as the root user, sudo is used to run the snapshot commands. The sudo PATH can differ from the root profile and may not contain /usr/local/bin.

You can fix this error by updating the secure_path in /etc/sudoers to include /usr/local/bin. Alternatively, create a symbolic link for crictl in another /bin directory.
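
For example, the secure_path line in /etc/sudoers (edit it with visudo) might look like the following after the change:

Defaults    secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

Alternatively, create the symbolic link, assuming crictl is installed at /usr/local/bin/crictl:

ln -s /usr/local/bin/crictl /usr/bin/crictl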

Anthos VM Runtime

Restarting a pod causes the VMs on the pod to change IP addresses or lose their IP address altogether. If the IP address of a VM changes, this does not affect the reachability of VM applications exposed as a Kubernetes service. If the IP address is lost, you must run dhclient from the VM to acquire an IP address for the VM.

Logging and monitoring

Missing logs after network outage

In some cases, when your cluster recovers from a network outage, you may see that new logs aren't appearing in Cloud Logging. You may also see multiple messages like the following in your logs for stackdriver-log-forwarder:

re-schedule retry=0x7fef2acbd8d0 239 in the next 51 seconds

To reactivate log forwarding, restart the stackdriver-log-forwarder Pod. If the log forwarder is restarted within 4.5 hours of the outage, the buffered logs are forwarded to Cloud Logging. Logs older than 4.5 hours are dropped.
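
A minimal restart sketch, assuming the log forwarder runs as a DaemonSet named stackdriver-log-forwarder in the kube-system namespace:

kubectl -n kube-system rollout restart daemonset stackdriver-log-forwarder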

Intermittent metrics export interruptions

GKE on Bare Metal release 1.9.x and 1.10.x may experience interruptions in normal, continuous exporting of metrics, or missing metrics on some nodes. If this issue affects your clusters, you may see gaps in data for the following metrics (at a minimum):

  • kubernetes.io/anthos/container_memory_working_set_bytes
  • kubernetes.io/anthos/container_cpu_usage_seconds_total
  • kubernetes.io/anthos/container_network_receive_bytes_total

To fix this issue, upgrade your clusters to version 1.10.3 or later.

If you can't upgrade, perform the following steps as a workaround:

  1. Open your stackdriver resource for editing:

    kubectl -n kube-system edit stackdriver stackdriver
    
  2. To increase the CPU request for gke-metrics-agent from 10m to 50m, add the following resourceAttrOverride section to the stackdriver manifest:

    spec:
      resourceAttrOverride:
        gke-metrics-agent/gke-metrics-agent:
          limits:
            cpu: 100m
            memory: 4608Mi
          requests:
            cpu: 50m
            memory: 200Mi
    

    Your edited resource should look similar to the following:

    spec:
      anthosDistribution: baremetal
      clusterLocation: us-west1-a
      clusterName: my-cluster
      enableStackdriverForApplications: true
      gcpServiceAccountSecretName: ...
      optimizedMetrics: true
      portable: true
      projectID: my-project-191923
      proxyConfigSecretName: ...
      resourceAttrOverride:
        gke-metrics-agent/gke-metrics-agent:
          limits:
            cpu: 100m
            memory: 4608Mi
          requests:
            cpu: 50m
            memory: 200Mi
    
  3. Save your changes and close the text editor.

  4. To verify your changes have taken effect, run the following command:

    kubectl -n kube-system get daemonset gke-metrics-agent -o yaml | grep "cpu: 50m"
    

    The command finds cpu: 50m if your edits have taken effect.

  5. To prevent the changes in the following steps from reverting, scale down stackdriver-operator:

    kubectl -n kube-system scale deploy stackdriver-operator --replicas=0
    
  6. Open gke-metrics-agent-conf for editing:

    kubectl -n kube-system edit configmap gke-metrics-agent-conf
    
  7. Edit the configuration to change all instances of probe_interval: 0.1s to probe_interval: 13s:

    processors:
      disk_buffer/metrics:
        backend_endpoint: https://monitoring.googleapis.com:443
        buffer_dir: /metrics-data/nsq-metrics-metrics
        probe_interval: 13s
        retention_size_mib: 6144
      disk_buffer/self:
        backend_endpoint: https://monitoring.googleapis.com:443
        buffer_dir: /metrics-data/nsq-metrics-self
        probe_interval: 13s
        retention_size_mib: 200
      disk_buffer/uptime:
        backend_endpoint: https://monitoring.googleapis.com:443
        buffer_dir: /metrics-data/nsq-metrics-uptime
        probe_interval: 13s
        retention_size_mib: 200
    
  8. Save your changes and close the text editor.

  9. Restart the gke-metrics-agent daemon set:

    kubectl -n kube-system rollout restart daemonset gke-metrics-agent
    

Security

Container can't write to VOLUME defined in Dockerfile with containerd and SELinux

If you use containerd as the container runtime and your operating system has SELinux enabled, the VOLUME defined in the application Dockerfile might not be writable. For example, containers built with the following Dockerfile aren't able to write to the /tmp folder.

FROM ubuntu:20.04
RUN chmod -R 777 /tmp
VOLUME /tmp

To verify if you're affected by this issue, run the following command on the node that hosts the problematic container:

ausearch -m avc

If you're affected by this issue, you see a denied error like the following:

time->Mon Apr  4 21:01:32 2022 type=PROCTITLE msg=audit(1649106092.768:10979):
proctitle="bash" type=SYSCALL msg=audit(1649106092.768:10979): arch=c000003e
syscall=257 success=no exit=-13 a0=ffffff9c a1=55eeba72b320 a2=241 a3=1b6
items=0 ppid=75712 pid=76042 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0
egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="bash" exe="/usr/bin/bash"
subj=system_u:system_r:container_t:s0:c701,c935 key=(null) type=AVC
msg=audit(1649106092.768:10979): avc:  denied {
write } for  pid=76042 comm="bash"
name="ad9bc6cf14bfca03d7bb8de23c725a86cb9f50945664cb338dfe6ac19ed0036c"
dev="sda2" ino=369501097 scontext=system_u:system_r:container_t:s0:c701,c935
tcontext=system_u:object_r:container_ro_file_t:s0 tclass=dir permissive=0 

To work around this issue, make either of the following changes:

  • Turn off SELinux.
  • Don't use the VOLUME feature inside Dockerfile.

Cluster CA Rotation (Preview Feature)

The cluster CA/certificate will be rotated during upgrade. On-demand rotation support is a Preview feature.

GKE on Bare Metal rotates kubelet serving certificates automatically. Each kubelet node agent can send out a Certificate Signing Request (CSR) when a certificate nears expiration. A controller in your admin clusters validates and approves the CSR.

After you perform a user cluster certificate authority (CA) rotation on a cluster, all user authentication flows fail. These failures occur because the ClientConfig custom resource used in authentication flows isn't updated with the new CA data during CA rotation. If you have performed a cluster CA rotation on your cluster, check whether the certificateAuthorityData field in the default ClientConfig of the kube-public namespace still contains the older cluster CA.

To resolve the issue manually, update the certificateAuthorityData field with the current cluster CA.
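
For example, the following command prints the current value for comparison against the new cluster CA; it assumes the field sits at spec.certificateAuthorityData in the ClientConfig resource:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get clientconfig default -n kube-public \
    --output=jsonpath="{.spec.certificateAuthorityData}"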

SELinux errors during pod creation

Pod creation sometimes fails when SELinux prevents the container runtime from setting labels on tmpfs mounts. This failure is rare, but can happen when SELinux is in Enforcing mode and in some kernels.

To verify that SELinux is the cause of pod creation failures, use the following command to check for errors in the kubelet logs:

journalctl -u kubelet

If SELinux is causing pod creation to fail, the command response contains an error similar to the following:

error setting label on mount source '/var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x':
failed to set file label on /var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x:
permission denied

To verify that this issue is related to SELinux enforcement, run the following command:

ausearch -m avc

This command searches the audit logs for access vector cache (AVC) permission errors. The avc: denied in the following sample response confirms that the pod creation failures are related to SELinux enforcement.

type=AVC msg=audit(1627410995.808:9534): avc:  denied  { associate } for
pid=20660 comm="dockerd" name="/" dev="tmpfs" ino=186492
scontext=system_u:object_r:container_file_t:s0:c61,c201
tcontext=system_u:object_r:locale_t:s0 tclass=filesystem permissive=0

The root cause of this pod creation problem with SELinux is a kernel bug found in the following Linux images:

  • Red Hat Enterprise Linux (RHEL) releases prior to 8.3
  • CentOS releases prior to 8.3

Rebooting the machine helps recover from the issue.

To prevent pod creation errors from occurring, use RHEL 8.3 or later or CentOS 8.3 or later, because those versions have fixed the kernel bug.

Networking

Multiple default gateways breaks connectivity to external endpoints

Having multiple default gateways in a node can lead to broken connectivity from within a Pod to external endpoints, such as google.com.

To determine if you're affected by this issue, run the following command on the node:

ip route show

Multiple instances of default in the response indicate that you're affected.
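
For illustration, output like the following would indicate the problem; the addresses, interface names, and metrics are placeholders:

default via 192.0.2.1 dev ens192 proto static metric 100
default via 198.51.100.1 dev ens224 proto static metric 200
...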

To work around this issue, ensure the default gateway interface that is used for your Kubernetes Node IP is the first on the list.

Client source IP with bundled Layer 2 load balancing

Setting the external traffic policy to Local can cause routing errors, such as No route to host, for bundled Layer 2 load balancing. The external traffic policy is set to Cluster (externalTrafficPolicy: Cluster), by default. With this setting, Kubernetes handles cluster-wide traffic. Services of type LoadBalancer or NodePort can use externalTrafficPolicy: Local to preserve the client source IP address. With this setting, however, Kubernetes only handles node-local traffic.

If you want to preserve the client source IP address, additional configuration may be required to ensure service IPs are reachable. For configuration details, see Preserving client source IP address in Configure bundled load balancing.
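
For reference, the setting discussed above is applied on the Service itself. The following manifest is a sketch with placeholder names and ports:

apiVersion: v1
kind: Service
metadata:
  name: my-app                   # placeholder name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserves client source IP; the default is Cluster
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080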

Modifying firewalld will erase Cilium iptable policy chains

When running GKE on Bare Metal with firewalld enabled on either CentOS or Red Hat Enterprise Linux (RHEL), changes to firewalld can remove the Cilium iptables chains on the host network. The iptables chains are added by the anetd Pod when it is started. The loss of the Cilium iptables chains causes Pods on the Node to lose network connectivity outside of the Node.

Changes to firewalld that will remove the iptables chains include, but aren't limited to:

  • Restarting firewalld, using systemctl
  • Reloading firewalld with the command line client (firewall-cmd --reload)

You can fix this connectivity issue by restarting anetd on the Node. Locate and delete the anetd Pod with the following commands to restart anetd:

kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ

Replace ANETD_XYZ with the name of the anetd Pod.

Duplicate egressSourceIP addresses

When using the egress NAT gateway feature preview, it is possible to set traffic selection rules that specify an egressSourceIP address that is already in use for another EgressNATPolicy object. This may cause egress traffic routing conflicts. Coordinate with your development team to determine which floating IP addresses are available for use before specifying the egressSourceIP address in your EgressNATPolicy custom resource.

Pod connectivity failures due to I/O timeout and reverse path filtering

GKE on Bare Metal configures reverse path filtering on nodes to disable source validation (net.ipv4.conf.all.rp_filter=0). If the rp_filter setting is changed to 1 or 2, pods fail due to out-of-node communication timeouts.

Connectivity failures when communicating with Kubernetes Service IP addresses are a symptom of this problem. Here are a couple of examples of the types of errors you might see:

  • If all pods for a given node fail to communicate with the Service IP addresses, the istiod Pod might report an error like the following:

     {"severity":"Error","timestamp":"2021-11-12T17:19:28.907001378Z",
        "message":"watch error in cluster Kubernetes: failed to list *v1.Node:
        Get \"https://172.26.0.1:443/api/v1/nodes?resourceVersion=534239\":
        dial tcp 172.26.0.1:443: i/o timeout"}
    
  • For the localpv daemon set that runs on every node, the log might show a timeout like the following:

     I1112 17:24:33.191654       1 main.go:128] Could not get node information
    (remaining retries: 2): Get
    https://172.26.0.1:443/api/v1/nodes/NODE_NAME:
    dial tcp 172.26.0.1:443: i/o timeout
    

Reverse path filtering is set with rp_filter files in the IPv4 configuration directory (net/ipv4/conf/all). Reverse path filtering settings can also be stored in a network security configuration file, such as /etc/sysctl.d/60-gce-network-security.conf. The sysctl command can override the reverse path filtering setting.

To restore Pod connectivity, either set net.ipv4.conf.all.rp_filter back to 0 manually, or restart the anetd Pod to set net.ipv4.conf.all.rp_filter back to 0. To restart the anetd Pod, use the following commands to locate and delete the anetd Pod. A new anetd Pod starts up in its place:

kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ

Replace ANETD_XYZ with the name of the anetd Pod.

To set net.ipv4.conf.all.rp_filter manually, run the following command:

sysctl -w net.ipv4.conf.all.rp_filter=0
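
To confirm the setting took effect, print the current value; the expected output is net.ipv4.conf.all.rp_filter = 0:

sysctl net.ipv4.conf.all.rp_filter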

Bootstrap (kind) cluster IP addresses and cluster node IP addresses overlapping

The bootstrap (kind) cluster uses 192.168.122.0/24 and 10.96.0.0/27 as its default pod and service CIDRs. Preflight checks fail if these CIDRs overlap with cluster node machine IP addresses. To avoid the conflict, pass the --bootstrap-cluster-pod-cidr and --bootstrap-cluster-service-cidr flags to bmctl to specify different values.
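
For example (the CIDR values here are illustrative, chosen only to avoid the machine network):

bmctl create cluster -c CLUSTER_NAME \
    --bootstrap-cluster-pod-cidr 192.168.200.0/24 \
    --bootstrap-cluster-service-cidr 10.200.0.0/27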

Overlapping IP addresses across different clusters

There is no validation for overlapping IP addresses across different clusters during update. The validation only applies at cluster/node pool creation time.

Operating system

Cluster creation or upgrade fails on CentOS

In December 2020, the CentOS community and Red Hat announced the sunset of CentOS. On January 31, 2022, CentOS 8 reached its end of life (EOL). As a result of the EOL, yum repositories stopped working for CentOS, which causes cluster creation and cluster upgrade operations to fail. This applies to all supported versions of CentOS and affects all versions of GKE on Bare Metal.

As a workaround, run the following commands to have your CentOS use an archive feed:

sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
    /etc/yum.repos.d/CentOS-Linux-*

As a long-term solution, consider migrating to another supported operating system.

Operating system endpoint limitations

On RHEL and CentOS, there is a cluster-level limitation of 100,000 endpoints. This number is the sum of all pods that are referenced by a Kubernetes service. If two services reference the same set of pods, this counts as two separate sets of endpoints. The underlying nftables implementation on RHEL and CentOS causes this limitation; it is not an intrinsic limitation of GKE on Bare Metal.

Reset/Deletion

Namespace deletion

Deleting a namespace will prevent new resources from being created in that namespace, including jobs to reset machines. When deleting a user cluster, you must delete the cluster object first before deleting its namespace. Otherwise, the jobs to reset machines cannot get created, and the deletion process will skip the machine clean-up step.

containerd service

The bmctl reset command doesn't delete any containerd configuration files or binaries. The containerd systemd service is left up and running. The command does, however, delete the containers that run pods scheduled to the node.