GKE on Bare Metal known issues

Installation

Control group v2 incompatibility

Control group v2 (cgroup v2) is incompatible with GKE on Bare Metal 1.6: Kubernetes 1.18 doesn't support cgroup v2, and Docker only offers experimental support as of 20.10. systemd switched to cgroup v2 by default in version 247.2-2. The presence of /sys/fs/cgroup/cgroup.controllers indicates that your system uses cgroup v2.
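
For example, the following check reports which cgroup version a machine uses, based on the presence of that file:

if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup v2 in use"
else
    echo "cgroup v1 in use"
fi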

Starting with GKE on Bare Metal 1.6.2, the preflight checks verify that cgroup v2 is not in use on the cluster machine.

Benign error messages during installation

During highly available (HA) cluster installation, you may see errors about etcdserver leader change. These error messages are benign and can be ignored.

When you use bmctl for cluster installation, you may see a Log streamer failed to get BareMetalMachine log message at the very end of the create-cluster.log. This error message is benign and can be ignored.

When examining cluster creation logs, you may notice transient failures about registering clusters or calling webhooks. These errors can be safely ignored, because the installation will retry these operations until they succeed.

Preflight checks and service account credentials

For installations triggered by admin or hybrid clusters (in other words, clusters not created with bmctl, like user clusters), the preflight check does not verify Google Cloud Platform service account credentials or their associated permissions.

Preflight checks and permission denied

During installation, you may see errors such as /bin/sh: /tmp/disks_check.sh: Permission denied. These errors occur because /tmp is mounted with the noexec option. For bmctl to work, you need to remove the noexec option from the /tmp mount point.
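
For example, you can confirm the mount options and remount /tmp with exec (the remount lasts until the next boot; update /etc/fstab to make the change permanent):

mount | grep ' /tmp '
sudo mount -o remount,exec /tmp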

Creating a Cloud Monitoring workspace before viewing dashboards

You need to create a Cloud Monitoring workspace through the Google Cloud console before you can view any GKE on Bare Metal monitoring dashboards.

Application default credentials and bmctl

bmctl uses Application Default Credentials (ADC) to validate the cluster operation's location value in the cluster spec when it is not set to global.

For ADC to work, you need to either point the GOOGLE_APPLICATION_CREDENTIALS environment variable to a service account credential file, or run gcloud auth application-default login.
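
For example, either of the following approaches satisfies the ADC requirement (the key file path is illustrative):

# Option 1: point ADC at a service account credential file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Option 2: use your own user credentials
gcloud auth application-default login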

Ubuntu 20.04 LTS and bmctl

On GKE on Bare Metal versions prior to 1.8.2, some Ubuntu 20.04 LTS distributions with a more recent Linux kernel (including GCP Ubuntu 20.04 LTS images on the 5.8 kernel) have made /proc/sys/net/netfilter/nf_conntrack_max read-only in non-init network namespaces. This prevents bmctl from setting the max connection tracking table size, which prevents the bootstrap cluster from starting. A symptom of the incorrect table size is that the kube-proxy Pod in the bootstrap cluster will crashloop as shown in the following sample error log:

kubectl logs -l k8s-app=kube-proxy -n kube-system --kubeconfig ./bmctl-workspace/.kindkubeconfig
I0624 19:05:08.009565       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 393216
F0624 19:05:08.009646       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

The workaround is to manually set net/netfilter/nf_conntrack_max to the needed value on the host:

sudo sysctl net.netfilter.nf_conntrack_max=393216

The needed value depends on the number of cores for the node. Use the kubectl logs command shown above to confirm the desired value from the kube-proxy logs.

This issue is fixed in GKE on Bare Metal release 1.8.2 and later.

Ubuntu 20.04.3+ LTS and HWE

Ubuntu 20.04.3 enabled kernel 5.11 in its Hardware Enablement (HWE) package. GKE on Bare Metal release 1.7.x doesn't support this kernel. If you want to use kernel 5.11, download and upgrade to GKE on Bare Metal release 1.8.0 or later.
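
To check which kernel a node is running, use:

uname -r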

Docker service

On cluster node machines, if the Docker executable is present in the PATH environment variable but the Docker service is not active, the preflight check fails and reports that the Docker service is not active. To fix this error, either remove Docker or enable the Docker service.
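
For example, the following commands check the service state and then either enable the service or remove Docker (the removal command assumes the Ubuntu docker.io package; adjust for your distribution):

systemctl is-active docker
# Either enable and start the service ...
sudo systemctl enable --now docker
# ... or remove Docker entirely (Ubuntu example)
sudo apt remove docker.io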

Containerd requires /usr/local/bin in PATH

Clusters with the containerd runtime require /usr/local/bin to be in the SSH user's PATH for the kubeadm init command to find the crictl binary. If crictl can't be found, cluster creation fails.

When you aren't logged in as the root user, sudo is used to run the kubeadm init command. The sudo PATH can differ from the root profile and may not contain /usr/local/bin.

Fix this error by updating the secure_path in /etc/sudoers to include /usr/local/bin. Alternatively, create a symbolic link for crictl in another /bin directory.
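
For example, a secure_path entry in /etc/sudoers (edit the file with visudo) that includes /usr/local/bin looks like the commented line below; the symbolic link is an alternative:

# In /etc/sudoers, via visudo:
#   Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Or link crictl into a directory that's already on the sudo PATH:
sudo ln -s /usr/local/bin/crictl /usr/bin/crictl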

Starting with 1.8.2, GKE on Bare Metal adds /usr/local/bin to the PATH when running commands. However, running a snapshot as a nonroot user still produces a crictl: command not found error, which can be fixed with the workaround above.

Flapping node readiness

Clusters may occasionally exhibit flapping node readiness, where the node status changes rapidly between Ready and NotReady. An unhealthy Pod Lifecycle Event Generator (PLEG) causes this behavior. The PLEG is a module in the kubelet.

To confirm an unhealthy PLEG is causing this behavior, use the following journalctl command to check for PLEG log entries:

journalctl -f | grep -i pleg

Log entries like the following indicate the PLEG is unhealthy:

...
skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m0.793469
...

A known runc race condition is the probable cause of the unhealthy PLEG. Stuck runc processes are a symptom of the race condition. Use the following command to check the runc init process status:

ps aux | grep 'runc init'

To fix this issue:

  1. Run the following commands on each node to install the latest containerd.io and extract the latest runc command-line tool:

    Ubuntu

    sudo apt update
    sudo apt install containerd.io
    # Back up current runc
    cp /usr/local/sbin/runc ~/
    sudo cp /usr/bin/runc /usr/local/sbin/runc
    
    # runc version should be > 1.0.0-rc93
    /usr/local/sbin/runc --version
    

    CentOS/RHEL

    sudo dnf install containerd.io
    # Back up current runc
    cp /usr/local/sbin/runc ~/
    sudo cp /usr/bin/runc /usr/local/sbin/runc
    
    # runc version should be > 1.0.0-rc93
    /usr/local/sbin/runc --version
    
  2. Reboot the node if there are stuck runc init processes.

    Alternatively, you can clean up any stuck processes manually.
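
    For example, you can find and terminate stuck processes like this (PID represents a stuck process ID from the ps output):

    ps aux | grep 'runc init'
    sudo kill -9 PID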

Upgrades and updates

bmctl update cluster fails if the .manifests directory is missing

If the .manifests directory is removed prior to running bmctl update cluster, the command fails with an error similar to the following:

Error updating cluster resources.: failed to get CRD file .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: open .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: no such file or directory

You can fix this issue by running bmctl check cluster first, which will recreate the .manifests directory.
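
For example, replace CLUSTER_NAME with the name of your cluster:

bmctl check cluster -c CLUSTER_NAME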

This issue applies to GKE on Bare Metal 1.10 and earlier and is fixed in version 1.11 and later.

Upgrade stuck at error during manifests operations

In some situations, cluster upgrades fail to complete and the bmctl CLI becomes unresponsive. This problem can be caused by an incorrectly updated resource. To determine if you're affected by this issue and to correct it, use the following steps:

  1. Check the anthos-cluster-operator logs and look for errors similar to the following entries:

    controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred:
    ...
    {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
    

    These entries are a symptom of an incorrectly updated resource, where {RESOURCE_NAME} is the name of the problem resource.

  2. If you find these errors in your logs, use kubectl edit to remove the kubectl.kubernetes.io/last-applied-configuration annotation from the resource contained in the log message, as shown in the example after these steps.

  3. Save and apply your changes to the resource.

  4. Retry the cluster upgrade.
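
For steps 1 and 2, assuming the anthos-cluster-operator runs as a Deployment in the kube-system namespace of the admin cluster (adjust names for your environment), commands like the following locate the errors and remove the stale annotation. The kubectl annotate form with a trailing dash is an alternative to kubectl edit:

# Search the operator logs for the error
kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system logs \
    deploy/anthos-cluster-operator | grep "error during manifests operations"

# Remove the annotation; the trailing dash deletes it
kubectl --kubeconfig ADMIN_KUBECONFIG annotate RESOURCE_KIND RESOURCE_NAME \
    kubectl.kubernetes.io/last-applied-configuration-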

bmctl update doesn't remove maintenance blocks

The bmctl update command can't remove or modify the maintenanceBlocks section from the cluster resource configuration. For more information, including instructions for removing nodes from maintenance mode, see Put nodes into maintenance mode.

Upgrading clusters to 1.7.6 from 1.7.5 is blocked

You can't upgrade Google Distributed Cloud Virtual for Bare Metal version 1.7.5 clusters to version 1.7.6. This restriction doesn't affect any other versions of GKE on Bare Metal; for example, you can upgrade your clusters from version 1.7.4 to version 1.7.6. If you have version 1.7.5 clusters, to get the security vulnerability fixes addressed in release 1.7.6, you must upgrade to a later release when it becomes available.

Upgrading from 1.6.0

Upgrading is not available in the 1.6.0 release.

Upgrading from 1.7.0 to 1.7.x

When upgrading from 1.7.0 to 1.7.x, your cluster may get stuck on the control plane Node upgrade. You may see MACHINE-IP-machine-upgrade jobs run and fail periodically. This issue affects 1.7.0 clusters that have:

  • Docker pre-installed on control plane Nodes.
  • containerd selected as the runtime.

This issue is caused by GKE on Bare Metal misconfiguring the cri-socket to Docker instead of containerd. To resolve this issue, you must set the image pull credentials for Docker:

  1. Log in to Docker:

    docker login gcr.io
    

    This creates a $HOME/.docker/config.json file.

  2. List the IP addresses of all control plane Nodes, separated by a space:

    IPs=(NODE_IP1 NODE_IP2 ...)
    
  3. Copy the Docker configuration to the Nodes:

    for ip in "${IPs[@]}"; do
      scp $HOME/.docker/config.json USER_NAME@${ip}:docker-config.json
    done

    Replace USER_NAME with the user name configured in the admin cluster configuration file.

  4. Set the image pull credentials for Docker:

    for ip in "${IPs[@]}"; do
      ssh USER_NAME@${ip} "sudo mkdir -p /root/.docker && sudo cp docker-config.json /root/.docker/config.json"
    done
    

User cluster patch upgrade limitation

User clusters that are managed by an admin cluster must run the same GKE on Bare Metal version or lower, and be within one minor release of the admin cluster. For example, a version 1.8.1 (anthosBareMetalVersion: 1.8.1) admin cluster managing version 1.7.2 user clusters is acceptable, but version 1.6.3 user clusters aren't within one minor release.

An upgrade limitation prevents you from upgrading your user clusters to a new security patch when the patch is released after the release version the admin cluster is using. For example, if your admin cluster is at version 1.8.2, which was released on July 29, 2021, you can't upgrade your user clusters to version 1.7.3, because it was released on August 16, 2021.

Control group driver is misconfigured to cgroupfs

If you run into issues concerning the control group (cgroup) driver, they may be caused by GKE on Bare Metal incorrectly configuring the driver to cgroupfs instead of systemd.

To fix this issue:

  1. Log in to your machines and open /etc/containerd/config.toml.

  2. Under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options], add SystemdCgroup = true:

    ...
       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
           runtime_type = "io.containerd.runc.v2"
           runtime_engine = ""
           runtime_root = ""
           privileged_without_host_devices = false
           base_runtime_spec = ""
           [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
             SystemdCgroup = true
     [plugins."io.containerd.grpc.v1.cri".cni]
       bin_dir = "/opt/cni/bin"
       conf_dir = "/etc/cni/net.d"
       max_conf_num = 1
       conf_template = ""
    ...
    
  3. Save your changes and close the file.

  4. Open /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.

  5. At the end of the file, add --cgroup-driver=systemd --runtime-cgroups=/system.slice/containerd.service:

    [Service]
    Environment="HOME=/root"
    Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
    Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
    # This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
    EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
    # This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
    # the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
    EnvironmentFile=-/etc/default/kubelet
    ExecStart=
    ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --cgroup-driver=systemd --runtime-cgroups=/system.slice/containerd.service
    
  6. Save your changes and reboot the server.

  7. Verify that systemd is the control group driver by running:

    systemd-cgls
    

    Verify that there is a kubepods.slice section and that all pods are under this section.

Operation

kubeconfig secret overwritten

The bmctl check cluster command, when run on user clusters, overwrites the user cluster kubeconfig secret with the admin cluster kubeconfig. Overwriting the file causes standard cluster operations, such as updating and upgrading, to fail for affected user clusters. This problem applies to GKE on Bare Metal versions 1.11.1 and earlier.

To determine if a user cluster is affected by this issue, run the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG get secret -n cluster-USER_CLUSTER_NAME \
    USER_CLUSTER_NAME-kubeconfig -o json | jq -r '.data.value' | base64 -d

Replace the following:

  • ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.
  • USER_CLUSTER_NAME: the name of the user cluster to check.

If the cluster name in the output (see contexts.context.cluster in the following sample output) is the admin cluster name, then the specified user cluster is affected.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRU...UtLS0tLQo=
    server: https://10.200.0.6:443
  name: ci-aed78cdeca81874
contexts:
- context:
    cluster: ci-aed78cdeca81874
    user: ci-aed78cdeca81874-admin
  name: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
current-context: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
kind: Config
preferences: {}
users:
- name: ci-aed78cdeca81874-admin
  user:
    client-certificate-data: LS0tLS1CRU...UtLS0tLQo=
    client-key-data: LS0tLS1CRU...0tLS0tCg==

The following steps restore functionality to an affected user cluster (USER_CLUSTER_NAME):

  1. Locate the user cluster kubeconfig file.

    GKE on Bare Metal generates the kubeconfig file on the admin workstation when you create a cluster. By default, the file is in the bmctl-workspace/USER_CLUSTER_NAME directory.

  2. Verify that the kubeconfig file is the correct user cluster kubeconfig:

    kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE
    

    Replace PATH_TO_GENERATED_FILE with the path to the user cluster kubeconfig file. The response returns details about the nodes for the user cluster. Confirm the machine names are correct for your cluster.

  3. Run the following command to delete the corrupted kubeconfig file in the admin cluster:

    kubectl delete secret -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig
    
  4. Run the following command to save the correct kubeconfig secret back to the admin cluster:

    kubectl create secret generic -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig \
        --from-file=value=PATH_TO_GENERATED_FILE
    

Reset/Deletion

User cluster support

You can't reset user clusters with the bmctl reset command.

Mount points and fstab

The bmctl reset command doesn't unmount the mount points under /mnt/localpv-share/ or clean up the corresponding entries in /etc/fstab.

Namespace deletion

Deleting a namespace prevents new resources from being created in that namespace, including jobs to reset machines. When you delete a user cluster, you must delete the cluster object first, before deleting its namespace. Otherwise, the jobs to reset machines can't be created, and the deletion process skips the machine clean-up step.
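
For example, the following deletion order uses the cluster-USER_CLUSTER_NAME namespace convention and the fully qualified bare metal Cluster resource name (adjust both for your cluster):

kubectl delete clusters.baremetal.cluster.gke.io USER_CLUSTER_NAME \
    -n cluster-USER_CLUSTER_NAME
kubectl delete namespace cluster-USER_CLUSTER_NAME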

containerd service

The bmctl reset command doesn't delete any containerd configuration files or binaries, and the containerd systemd service is left up and running. The command does delete the containers that run pods scheduled to the node.

Security

The cluster CA and certificates are rotated during upgrade. On-demand rotation isn't currently supported.

GKE on Bare Metal rotates kubelet serving certificates automatically. Each kubelet node agent can send out a Certificate Signing Request (CSR) when a certificate nears expiration. A controller in your admin clusters validates and approves the CSR.

Logging and Monitoring

Node logs aren't exported to Cloud Logging

Node logs from nodes with a dot (".") in their name are not exported to Cloud Logging. As a workaround, use the following instructions to add a filter to the stackdriver-log-forwarder-config resource to enable the Stackdriver Operator to recognize and export these logs.

  1. Scale down the size of the Stackdriver Operator, stackdriver-operator:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system scale \
        deploy stackdriver-operator --replicas=0
    
  2. Edit the Log Forwarder configmap, stackdriver-log-forwarder-config:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system edit configmap \
        stackdriver-log-forwarder-config
    
  3. Add the following filter to the end of the input-systemd.conf section of the configmap:

       [FILTER]
           Name    lua
           Match_Regex   container-runtime|kubelet|node-problem-detector|node-journal
           script  replace_dot.lua
           call    replace
    
     replace_dot.lua: |
       function replace(tag, timestamp, record)
           new_record = record
    
           local local_resource_id_key = "logging.googleapis.com/local_resource_id"
    
           -- Locate the local_resource_id
           local local_resource_id = record[local_resource_id_key]
    
           local first = 1
           local new_local_resource_id = ""
           for s in string.gmatch(local_resource_id, "[^.]+") do
               new_local_resource_id = new_local_resource_id .. s
               if first == 1 then
                   new_local_resource_id = new_local_resource_id .. "."
                   first = 0
               else
                   new_local_resource_id = new_local_resource_id .. "_"
               end
           end
    
           -- Remove the trailing underscore
           new_local_resource_id = new_local_resource_id:sub(1, -2)
           new_record[local_resource_id_key] = new_local_resource_id
           return 1, timestamp, new_record
       end
    
  4. Delete all Log Forwarder pods:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system patch daemonset \
        stackdriver-log-forwarder -p \
        '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
    

    Verify that the stackdriver-log-forwarder pods are deleted before going to the next step.

  5. Deploy a daemonset to clean up all corrupted, unprocessed data in buffers in fluent-bit:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system apply -f - << EOF
        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: fluent-bit-cleanup
          namespace: kube-system
        spec:
          selector:
            matchLabels:
              app: fluent-bit-cleanup
          template:
            metadata:
              labels:
                app: fluent-bit-cleanup
            spec:
              containers:
              - name: fluent-bit-cleanup
                image: debian:10-slim
                command: ["bash", "-c"]
                args:
                - |
                  rm -rf /var/log/fluent-bit-buffers/
                  echo "Fluent Bit local buffer is cleaned up."
                  sleep 3600
                volumeMounts:
                - name: varlog
                  mountPath: /var/log
                securityContext:
                  privileged: true
              tolerations:
              - key: "CriticalAddonsOnly"
                operator: "Exists"
              - key: node-role.kubernetes.io/master
                effect: NoSchedule
              - key: node-role.gke.io/observability
                effect: NoSchedule
              volumes:
              - name: varlog
                hostPath:
                  path: /var/log
    EOF
    
  6. Verify that the daemonset has cleaned up all the nodes with the following commands:

    kubectl --kubeconfig ADMIN_KUBECONFIG logs -n kube-system \
        -l app=fluent-bit-cleanup | grep "cleaned up" | wc -l
    
    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system get pods \
        -l app=fluent-bit-cleanup --no-headers | wc -l
    

    The output of the two commands should equal the number of nodes in the cluster.

  7. Delete the cleanup daemonset:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system delete ds fluent-bit-cleanup
    
  8. Restart the stackdriver-log-forwarder pods:

    kubectl --kubeconfig ADMIN_KUBECONFIG -n kube-system patch \
        daemonset stackdriver-log-forwarder --type json \
        -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
    

Networking

Client source IP with bundled Layer 2 load balancing

Setting the external traffic policy to Local can cause routing errors, such as No route to host, for bundled Layer 2 load balancing. The external traffic policy is set to Cluster (externalTrafficPolicy: Cluster) by default. With this setting, Kubernetes handles cluster-wide traffic. Services of type LoadBalancer or NodePort can use externalTrafficPolicy: Local to preserve the client source IP address. With this setting, however, Kubernetes only handles node-local traffic.

If you want to preserve the client source IP address, additional configuration may be required to ensure service IPs are reachable. For configuration details, see Preserving client source IP address in Configure bundled load balancing.
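
For example, a hypothetical Service that preserves the client source IP address sets the policy in its spec (the name, selector, and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080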

Modifying firewalld will erase Cilium iptables policy chains

When running GKE on Bare Metal with firewalld enabled on either CentOS or Red Hat Enterprise Linux (RHEL), changes to firewalld can remove the Cilium iptables chains on the host network. The iptables chains are added by the anetd Pod when it is started. The loss of the Cilium iptables chains causes Pods on the Node to lose network connectivity outside of the Node.

Changes to firewalld that will remove the iptables chains include, but aren't limited to:

  • Restarting firewalld, using systemctl
  • Reloading the firewalld with the command line client (firewall-cmd --reload)

You can fix this connectivity issue by restarting anetd on the Node. Locate and delete the anetd Pod with the following commands to restart anetd:

kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ

Replace ANETD_XYZ with the name of the anetd Pod.

Pod connectivity failures due to I/O timeout and reverse path filtering

GKE on Bare Metal configures reverse path filtering on nodes to disable source validation (net.ipv4.conf.all.rp_filter=0). If the rp_filter setting is changed to 1 or 2, pods fail due to out-of-node communication timeouts.

Observed connectivity failures communicating to Kubernetes Service IP addresses are a symptom of this problem. Here are a couple of examples of the types of errors you might see:

  • If all pods for a given node fail to communicate to the Service IP addresses, the istiod Pod might report an error like the following:

     {"severity":"Error","timestamp":"2021-11-12T17:19:28.907001378Z",
        "message":"watch error in cluster Kubernetes: failed to list *v1.Node:
        Get \"https://172.26.0.1:443/api/v1/nodes?resourceVersion=5  34239\":
        dial tcp 172.26.0.1:443: i/o timeout"}
    
  • For the localpv daemon set that runs on every node, the log might show a timeout like the following:

     I1112 17:24:33.191654       1 main.go:128] Could not get node information
    (remaining retries: 2): Get
    https://172.26.0.1:443/api/v1/nodes/NODE_NAME:
    dial tcp 172.26.0.1:443: i/o timeout
    

Reverse path filtering is set with rp_filter files in the IPv4 configuration folder (net/ipv4/conf/all). The sysctl command stores reverse path filtering settings in a network security configuration file, such as /etc/sysctl.d/60-gce-network-security.conf. The sysctl command can override the reverse path filtering setting.
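
To check the current value, run the following command:

sysctl net.ipv4.conf.all.rp_filter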

To restore Pod connectivity, either set net.ipv4.conf.all.rp_filter back to 0 manually, or restart the anetd Pod to set net.ipv4.conf.all.rp_filter back to 0. To restart the anetd Pod, use the following commands to locate and delete the anetd Pod. A new anetd Pod starts up in its place:

kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ

Replace ANETD_XYZ with the name of the anetd Pod.

To set net.ipv4.conf.all.rp_filter manually, run the following command:

sudo sysctl -w net.ipv4.conf.all.rp_filter=0

Bootstrap (kind) cluster IP addresses and cluster node IP addresses overlapping

192.168.122.0/24 and 10.96.0.0/27 are the default pod and service CIDRs used by the bootstrap (kind) cluster. Preflight checks will fail if they overlap with cluster node machine IP addresses. To avoid the conflict, you can pass the --bootstrap-cluster-pod-cidr and --bootstrap-cluster-service-cidr flags to bmctl to specify different values.
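
For example (the replacement CIDR ranges are illustrative; pick ranges that don't overlap with your node machine IP addresses):

bmctl create cluster -c CLUSTER_NAME \
    --bootstrap-cluster-pod-cidr=10.244.0.0/16 \
    --bootstrap-cluster-service-cidr=10.97.0.0/24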

Overlapping IP addresses across different clusters

There is no preflight check to validate overlapping IP addresses across different clusters.

hostport feature in GKE on Bare Metal

The hostport feature in ContainerPort is not currently supported.

Operating system

Cluster creation or upgrade fails on CentOS

In December 2020, the CentOS community and Red Hat announced the sunset of CentOS. On January 31, 2022, CentOS 8 reached its end of life (EOL). As a result of the EOL, yum repositories stopped working for CentOS, which causes cluster creation and cluster upgrade operations to fail. This applies to all supported versions of CentOS and affects all versions of GKE on Bare Metal.

As a workaround, run the following commands to have your CentOS use an archive feed:

sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
    /etc/yum.repos.d/CentOS-Linux-*

As a long-term solution, consider migrating to another supported operating system, such as Ubuntu or RHEL.

Operating system endpoint limitations

On RHEL and CentOS, there is a cluster-level limitation of 100,000 endpoints. This number is the sum of all pods that are referenced by a Kubernetes Service. If two Services reference the same set of pods, this counts as two separate sets of endpoints. The underlying nftables implementation on RHEL and CentOS causes this limitation; it is not an intrinsic limitation of GKE on Bare Metal.

Configuration

Control plane and load balancer specifications

The control plane and load balancer node pool specifications are special. These specifications declare and control critical cluster resources. The canonical source for these resources is their respective sections in the cluster config file:

  • spec.controlPlane.nodePoolSpec
  • spec.loadBalancer.nodePoolSpec

Consequently, do not modify the top-level control plane and load balancer node pool resources directly. Modify the associated sections in the cluster config file instead.
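
For example, to change one of these resources, edit the corresponding section in the cluster config file and apply the change with bmctl update (the flag forms follow the bmctl conventions used elsewhere on this page):

bmctl update cluster -c CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG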

Mutable fields in the cluster and node pool specification

Currently, only the following cluster and node pool specification fields in the cluster config file can be updated after the cluster is created (they are mutable fields):

  • For the Cluster object (kind: Cluster), the following fields are mutable:

    • spec.anthosBareMetalVersion
    • spec.bypassPreflightCheck
    • spec.controlPlane.nodePoolSpec.nodes
    • spec.loadBalancer.nodePoolSpec.nodes
    • spec.maintenanceBlocks
    • spec.nodeAccess.loginUser
  • For the NodePool object (kind: NodePool), the following fields are mutable:

    • spec.nodes

Snapshots

Taking a snapshot as a non-root login user

For GKE on Bare Metal versions 1.8.1 and earlier, if you aren't logged in as root, you can't take a cluster snapshot with the bmctl command. Starting with release 1.8.2, GKE on Bare Metal respects nodeAccess.loginUser in the cluster spec. If the admin cluster is unreachable, you can specify the login user with the --login-user flag.
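
For example, a snapshot invocation that specifies the login user might look like the following (shown as an illustration; check the bmctl reference for your version):

bmctl check cluster --snapshot --cluster CLUSTER_NAME --login-user LOGIN_USER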

Note that if you use containerd as the container runtime, snapshot still fails to run crictl commands. The PATH settings used with sudo cause this problem; see Containerd requires /usr/local/bin in PATH for a workaround.

GKE Connect

Crash looping gke-connect-agent Pod

Heavy usage of GKE Connect gateway can sometimes result in gke-connect-agent Pod out-of-memory problems. Symptoms of these out-of-memory issues include:

  • The gke-connect-agent Pod shows a high number of restarts or ends up in crash looping state.
  • The connect gateway stops functioning.

To address this out-of-memory problem, edit the deployment with the gke-connect-agent prefix under the gke-connect namespace and raise the memory limit to 256 MiB (256Mi) or higher. For example, the following strategic merge patch assumes the container inside the deployment is also named gke-connect-agent; adjust the name if your deployment differs:

kubectl patch deploy $(kubectl get deploy -l app=gke-connect-agent -n gke-connect -o jsonpath='{.items[0].metadata.name}') \
    -n gke-connect --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"gke-connect-agent","resources":{"limits":{"memory":"256Mi"}}}]}}}}'

This problem is fixed in GKE on Bare Metal release 1.8.2 and later.