Installation
Cluster creation fails when using multi-NIC, containerd, and proxy
Cluster creation fails when you have the following combination of conditions:
- Cluster is configured to use containerd as the container runtime (nodeConfig.containerRuntime set to containerd in the cluster configuration file, the default for Google Distributed Cloud version 1.8).
- Cluster is configured to provide multiple network interfaces, multi-NIC, for pods (spec.clusterNetwork.multipleNetworkInterfaces set to true in the cluster configuration file).
- Cluster is configured to use a proxy (spec.proxy.url is specified in the cluster configuration file). Even though cluster creation fails, this setting is propagated when you attempt to create a cluster. You may see this proxy setting as an HTTPS_PROXY environment variable or in your containerd configuration (/etc/systemd/system/containerd.service.d/09-proxy.conf).
As a workaround for this issue, append the service CIDRs (clusterNetwork.services.cidrBlocks) to the NO_PROXY environment variable on all node machines.
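For example, if your proxy variables are defined in /etc/environment (locations vary by distribution and proxy setup, so treat this as a sketch), you might append the service CIDR like this:
# Append the service CIDR (placeholder value shown) to the existing NO_PROXY entry.
sudo sed -i 's|^NO_PROXY=.*|&,10.96.0.0/20|' /etc/environment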
Control group v2 incompatibility
Control group v2 (cgroup v2) is incompatible with Google Distributed Cloud 1.6. Kubernetes 1.18 does not support cgroup v2. Also Docker only offers experimental support as of 20.10. systemd switched to cgroup v2 by default in version 247.2-2.
The presence of /sys/fs/cgroup/cgroup.controllers indicates that your system uses cgroup v2.
The preflight checks verify that cgroup v2 is not in use on the cluster machine.
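For example, you can run a quick check on a node:
# If this file exists, the node is using cgroup v2.
test -f /sys/fs/cgroup/cgroup.controllers && echo "cgroup v2 in use" || echo "cgroup v1"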
Benign error messages during installation
When examining cluster creation logs, you may notice transient failures about registering clusters or calling webhooks. These errors can be safely ignored, because the installation will retry these operations until they succeed.
Preflight checks and service account credentials
For installations triggered by admin or hybrid clusters (in other words, clusters not created with bmctl, like user clusters), the preflight check does not verify Google Cloud Platform service account credentials or their associated permissions.
Preflight checks and permission denied
During installation you may see errors about /bin/sh: /tmp/disks_check.sh: Permission denied. These error messages result from /tmp having been mounted with the noexec option. For bmctl to work you need to remove the noexec option from the /tmp mount point.
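For example, you can remount /tmp without noexec for the current boot (persist the change in /etc/fstab if your environment requires it):
sudo mount -o remount,exec /tmp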
Application default credentials and bmctl
bmctl uses Application Default Credentials (ADC) to validate the cluster operation's location value in the cluster spec when it's not set to global.
For ADC to work, you need to either point the GOOGLE_APPLICATION_CREDENTIALS environment variable to a service account credential file, or run gcloud auth application-default login.
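For example, either of the following satisfies ADC (the key path is a placeholder):
# Option 1: point ADC at a service account key file.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Option 2: use your own user credentials.
gcloud auth application-default login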
Ubuntu 20.04 LTS and bmctl
On Google Distributed Cloud versions prior to 1.8.2, some Ubuntu 20.04 LTS distributions with a more recent Linux kernel (including GCP Ubuntu 20.04 LTS images on the 5.8 kernel) have made /proc/sys/net/netfilter/nf_conntrack_max read-only in non-init network namespaces. This prevents bmctl from setting the max connection tracking table size, which prevents the bootstrap cluster from starting. A symptom of the incorrect table size is that the kube-proxy Pod in the bootstrap cluster will crashloop as shown in the following sample error log:
kubectl logs -l k8s-app=kube-proxy -n kube-system --kubeconfig ./bmctl-workspace/.kindkubeconfig
I0624 19:05:08.009565 1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 393216
F0624 19:05:08.009646 1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied
The workaround is to manually set net/netfilter/nf_conntrack_max to the needed value on the host: sudo sysctl net.netfilter.nf_conntrack_max=393216. Note that the needed value depends on the number of cores for the node. Use the kubectl logs command shown above to confirm the desired value from the kube-proxy logs.
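If you also want the setting to survive reboots, one option (a sketch; size the value to your node's core count) is a sysctl drop-in file:
# Persist the conntrack table size across reboots.
echo 'net.netfilter.nf_conntrack_max=393216' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system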
This issue is fixed in Google Distributed Cloud release 1.8.2 and later.
Docker service
On cluster node machines, if the Docker executable is present in the PATH environment variable, but the Docker service is not active, preflight check will fail and report that the Docker service is not active. To fix this error, either remove Docker, or enable the Docker service.
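For example, on systemd-based node machines you can enable the service as follows:
# Start the Docker service now and enable it at boot.
sudo systemctl enable --now docker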
Registry Mirror and Cloud Audit Logging
On Google Distributed Cloud versions prior to 1.8.2, the bmctl registry mirror package is missing the gcr.io/anthos-baremetal-release/auditproxy:gke_master_auditproxy_20201115_RC00 image. To enable the Cloud Audit Logging feature when using a registry mirror, you will need to manually download the missing image and push it to your registry server with the following commands:
docker pull gcr.io/anthos-baremetal-release/auditproxy:gke_master_auditproxy_20201115_RC00
docker tag gcr.io/anthos-baremetal-release/auditproxy:gke_master_auditproxy_20201115_RC00 REGISTRY_SERVER/anthos-baremetal-release/auditproxy:gke_master_auditproxy_20201115_RC00
docker push REGISTRY_SERVER/anthos-baremetal-release/auditproxy:gke_master_auditproxy_20201115_RC00
Containerd requires /usr/local/bin in PATH
Clusters with the containerd runtime require /usr/local/bin to be in the SSH user's PATH for the kubeadm init command to find the crictl binary. If crictl can't be found, cluster creation fails.
When you aren't logged in as the root user, sudo is used to run the kubeadm init command. The sudo PATH can differ from the root profile and may not contain /usr/local/bin.
Fix this error by updating the secure_path in /etc/sudoers to include /usr/local/bin. Alternatively, create a symbolic link for crictl in another /bin directory.
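For example (a sketch; /usr/bin is one common directory that is already on the sudo secure_path):
# Option 1: add /usr/local/bin to secure_path by editing /etc/sudoers with visudo.
sudo visudo
# Option 2: symlink crictl into a directory that sudo already searches.
sudo ln -s /usr/local/bin/crictl /usr/bin/crictl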
Starting with 1.8.2, Google Distributed Cloud adds /usr/local/bin to the PATH when running commands. However, running snapshot as a nonroot user still fails with crictl: command not found (which can be fixed by the workaround above).
Installing on vSphere
When installing Google Distributed Cloud on vSphere VMs, you must set the tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation flags to off. These flags are related to the hardware segmentation offload done by the vSphere driver VMXNET3 and they don't work with the GENEVE tunnel of Google Distributed Cloud.
Run the following command on each node to check the current values for these flags.
ethtool -k NET_INTFC |grep segm
...
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
...
Replace NET_INTFC with the network interface associated with the IP address of the node.
Sometimes in RHEL 8.4, ethtool shows that these flags are off even though they aren't. To explicitly set these flags to off, toggle the flags on and then off with the following commands.
ethtool -K ens192 tx-udp_tnl-segmentation on
ethtool -K ens192 tx-udp_tnl-csum-segmentation on
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
This flag change does not persist across reboots. Configure the startup scripts to explicitly set these flags when the system boots.
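One way to do this is a systemd oneshot unit that runs at boot; the unit name, interface (ens192), and ethtool path are placeholders in this sketch:
sudo tee /etc/systemd/system/disable-udp-tnl-offload.service << 'EOF'
[Unit]
Description=Disable UDP tunnel segmentation offloads for Google Distributed Cloud GENEVE
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-segmentation off
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-csum-segmentation off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable disable-udp-tnl-offload.service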
Flapping node readiness
Clusters may occasionally exhibit flapping node readiness (node status changing rapidly between Ready and NotReady) behavior. An unhealthy Pod Lifecycle Event Generator (PLEG) causes this behavior. The PLEG is a module in kubelet.
To confirm an unhealthy PLEG is causing this behavior, use the following journalctl command to check for PLEG log entries:
journalctl -f | grep -i pleg
Log entries like the following indicate the PLEG is unhealthy:
...
skipping pod synchronization - PLEG is not healthy: pleg was last seen active
3m0.793469
...
A known runc race condition is the probable cause of the unhealthy PLEG. Stuck runc processes are a symptom of the race condition. Use the following command to check the runc init process status:
ps aux | grep 'runc init'
To fix this issue:
1. Determine the cluster runtime.

Before you can update the runc version, you must determine which container runtime your cluster uses. The containerRuntime field in the cluster configuration file identifies which container runtime your cluster uses. When containerRuntime is set to containerd, your cluster uses the containerd runtime. If the field is set to docker or isn't set, your cluster uses the Docker runtime.

To get the value, either open the cluster configuration file with your favorite editor or, if you have access to the admin cluster kubeconfig, run the following command:

kubectl describe cluster CLUSTER_NAME -n CLUSTER_NAMESPACE | grep -i runtime

Replace the following:

- CLUSTER_NAME: the name of the cluster.
- CLUSTER_NAMESPACE: the namespace for the cluster. By default, the cluster namespaces for Google Distributed Cloud are the name of the cluster prefaced with cluster-.
2. To install either containerd.io or docker-ce and extract the latest runc command-line tool, run the commands that correspond to your operating system and container runtime on each node.

Ubuntu containerd

sudo apt update
sudo apt install containerd.io
# Back up current runc
cp /usr/local/sbin/runc ~/
sudo cp /usr/bin/runc /usr/local/sbin/runc
# runc version should be > 1.0.0-rc93
/usr/local/sbin/runc --version

Ubuntu docker

sudo apt update
sudo apt install docker-ce
# Back up current runc
cp /usr/local/sbin/runc ~/
sudo cp /usr/bin/runc /usr/local/sbin/runc
# runc version should be > 1.0.0-rc93
/usr/local/sbin/runc --version

CentOS/RHEL containerd

sudo dnf install containerd.io
# Back up current runc
cp /usr/local/sbin/runc ~/
sudo cp /usr/bin/runc /usr/local/sbin/runc
# runc version should be > 1.0.0-rc93
/usr/local/sbin/runc --version

CentOS/RHEL docker

sudo dnf install docker-ce
# Back up current runc
cp /usr/local/sbin/runc ~/
sudo cp /usr/bin/runc /usr/local/sbin/runc
# runc version should be > 1.0.0-rc93
/usr/local/sbin/runc --version
3. Reboot the node if there are stuck runc init processes. Alternatively, you can clean up any stuck processes manually.
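A sketch of the manual cleanup:
# List stuck runc init processes, then terminate them.
ps aux | grep '[r]unc init'
sudo pkill -9 -f 'runc init'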
Upgrades and updates
bmctl update cluster fails if the .manifests directory is missing
If the .manifests directory is removed prior to running bmctl update cluster, the command fails with an error similar to the following:
Error updating cluster resources.: failed to get CRD file .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: open .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: no such file or directory
You can fix this issue by running bmctl check cluster first, which will recreate the .manifests directory.
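For example (assuming the default bmctl workspace; the cluster name is a placeholder):
bmctl check cluster -c CLUSTER_NAME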
This issue applies to Google Distributed Cloud 1.10 and earlier and is fixed in version 1.11 and later.
Upgrade stuck at error during manifests operations
In some situations, cluster upgrades fail to complete and the bmctl CLI becomes unresponsive. This problem can be caused by an incorrectly updated resource. To determine if you're affected by this issue and to correct it, use the following steps:
1. Check the anthos-cluster-operator logs and look for errors similar to the following entries:

controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred: ... {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

These entries are a symptom of an incorrectly updated resource, where {RESOURCE_NAME} is the name of the problem resource.

2. If you find these errors in your logs, use kubectl edit to remove the kubectl.kubernetes.io/last-applied-configuration annotation from the resource contained in the log message.

3. Save and apply your changes to the resource.

4. Retry the cluster upgrade.
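As an alternative to kubectl edit in step 2, you can remove the annotation non-interactively; a sketch (resource kind, name, and namespace are placeholders):
kubectl annotate RESOURCE_KIND RESOURCE_NAME -n NAMESPACE \
kubectl.kubernetes.io/last-applied-configuration-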
Upgrades fail for version 1.8 clusters in maintenance mode
Attempting to upgrade a version 1.8.x cluster to version 1.9.x fails if any node machines have previously been put into maintenance mode. This is due to an annotation that remains on these nodes.
To determine if you are affected by this issue, use the following steps:
1. Get the version of the cluster you want to upgrade by running the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG get cluster CLUSTER_NAME \
    -n CLUSTER_NAMESPACE --output=jsonpath="{.spec.anthosBareMetalVersion}"

If the returned version value is for the 1.8 minor release, such as 1.8.3, then continue. Otherwise, this issue does not apply to you.

2. Check whether the cluster has any nodes that have previously been put into maintenance mode by running the following command:

kubectl --kubeconfig ADMIN_KUBECONFIG get BareMetalMachines -n CLUSTER_NAMESPACE \
    --output=jsonpath="{.items[*].metadata.annotations}"

If the returned annotations contain baremetal.cluster.gke.io/maintenance-mode-duration, then you are affected by this known issue.
To unblock the cluster upgrade, run the following command for each affected node machine to remove the baremetal.cluster.gke.io/maintenance-mode-duration annotation:
kubectl --kubeconfig ADMIN_KUBECONFIG annotate BareMetalMachine -n CLUSTER_NAMESPACE \
NODE_MACHINE_NAME baremetal.cluster.gke.io/maintenance-mode-duration-
bmctl update doesn't remove maintenance blocks
The bmctl update command can't remove or modify the maintenanceBlocks section from the cluster resource configuration. For more information, including instructions for removing nodes from maintenance mode, see Put nodes into maintenance mode.
Cluster upgrades from 1.7.x fail for certain load balancer configurations
Upgrading clusters from version 1.7.x to 1.8.y may fail for the following load balancer configurations:
- Manually-configured external load balancer (loadBalancer.mode set to manual)
- Bundled load balancing (loadBalancer.mode set to bundled), using a separate node pool (loadBalancer.nodePoolSpec.nodes are specified)
A symptom of this failure is that the cal-update job on a control plane node fails at task Check apiserver health with refused connection error messages. Here's a sample log for the cal-update job:
TASK [cal-update : Check apiserver health] *************************************
Thursday 20 January 2022 13:50:46 +0000 (0:00:00.033) 0:00:09.316 ******
FAILED - RETRYING: Check apiserver health (30 retries left).
FAILED - RETRYING: Check apiserver health (29 retries left).
FAILED - RETRYING: Check apiserver health (28 retries left).
FAILED - RETRYING: Check apiserver health (27 retries left).
FAILED - RETRYING: Check apiserver health (26 retries left).
...
FAILED - RETRYING: Check apiserver health (3 retries left).
FAILED - RETRYING: Check apiserver health (2 retries left).
FAILED - RETRYING: Check apiserver health (1 retries left).
[WARNING]: Consider using the get_url or uri module rather than running 'curl'.
If you need to use command because get_url or uri is insufficient you can add
'warn: false' to this command task or set 'command_warnings=False' in
ansible.cfg to get rid of this message.
fatal: [10.50.116.79]: FAILED! => {"attempts": 30, "changed": true, "cmd": "curl
-k https://127.0.0.1/healthz", "delta": "0:00:00.008949", "end": "2022-01-20
19:22:30.381875", "msg": "non-zero return code", "rc": 7, "start": "2022-01-20
19:22:30.372926", "stderr": " % Total % Received % Xferd Average Speed
Time Time Time Current\n Dload Upload
Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --
:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to 127.0.0.1 port 443:
Connection refused", "stderr_lines": [" % Total % Received % Xferd Average
Speed Time Time Time Current", " Dload
Upload Total Spent Left Speed", "", " 0 0 0 0 0 0
0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to
127.0.0.1 port 443: Connection refused"], "stdout": "", "stdout_lines": []}
As a manual workaround, create a port forward from port 443 (where the check happens) to port 6444 (where apiserver listens) before you upgrade. To create the port forward, run the following command on each control plane node:
sudo iptables -t nat -I OUTPUT -p tcp -o lo --dport 443 -j REDIRECT --to-ports 6444
The cal-update job runs when there's a reconciliation and checks on port 443, so you should keep this port forward for your 1.8.x upgraded clusters.
After you upgrade to version 1.9.0 or later, remove the port forward by running the following command on each control plane node:
sudo iptables -t nat -D OUTPUT -p tcp -o lo --dport 443 -j REDIRECT --to-ports 6444
Upgrades to 1.8.0 and 1.8.1 admin, hybrid, and standalone clusters don't complete
Upgrading admin, hybrid, or standalone clusters from version 1.7.x to version 1.8.0 or 1.8.1 fails to complete sometimes. This upgrade failure applies to clusters that you have updated after cluster creation.
An indication of this upgrade problem is the console output Waiting for upgrade to complete ... with no mention of which node is being upgraded. This symptom also indicates that your admin cluster has been successfully upgraded to Kubernetes version v1.20.8-gke.1500, the Kubernetes version for Google Distributed Cloud releases 1.8.0 and 1.8.1.
This upgrade issue is fixed for Google Distributed Cloud release 1.8.2.
To confirm whether this issue impacts your cluster upgrade to 1.8.0 or 1.8.1:
1. Create the following shell script:

if [ $(kubectl get cluster CLUSTER_NAME -n CLUSTER_NAMESPACE \
      --kubeconfig bmctl-workspace/.kindkubeconfig \
      -o=jsonpath='{.metadata.generation}') -le \
     $(kubectl get cluster CLUSTER_NAME -n CLUSTER_NAMESPACE \
      --kubeconfig bmctl-workspace/.kindkubeconfig \
      -o=jsonpath='{.status.systemServiceConditions[?(@.type=="Reconciling")].observedGeneration}') ]; then
  echo "Bug Detected"
else
  echo "OK"
fi
Replace the following:

- CLUSTER_NAME: the name of the cluster being checked.
- CLUSTER_NAMESPACE: the namespace for the cluster.

2. Run the script while the upgrade is in process, but after the preflight checks have completed.

3. When the observedGeneration value is not less than the generation value, Bug Detected is written to the console output. This output indicates that your cluster upgrade is affected.

4. To unblock the upgrade, run the following command:
kubectl get --raw=/apis/baremetal.cluster.gke.io/v1/namespaces/CLUSTER_NAMESPACE/clusters/CLUSTER_NAME/status \
    --kubeconfig bmctl-workspace/.kindkubeconfig | \
    sed -e 's/\("systemServiceConditions":\[{[^{]*"type":"DashboardReady"}\),{[^{}]*}/\1/g' | \
    kubectl replace --raw=/apis/baremetal.cluster.gke.io/v1/namespaces/CLUSTER_NAMESPACE/clusters/CLUSTER_NAME/status \
    --kubeconfig bmctl-workspace/.kindkubeconfig -f-
Replace the following:

- CLUSTER_NAME: the name of the cluster being checked.
- CLUSTER_NAMESPACE: the namespace for the cluster.
Upgrades to 1.8.3 or 1.8.4
Upgrading Google Distributed Cloud to version 1.8.3 or 1.8.4 sometimes fails with a nil Context error. If your cluster upgrade fails with a nil Context error, perform the following steps to complete the upgrade:
1. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key file:

export GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH

Replace KEY_PATH with the path of the JSON file that contains your service account key.

2. Run the bmctl upgrade cluster command again.
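For reference, a sketch of the upgrade command (the cluster name and kubeconfig path are placeholders):
bmctl upgrade cluster -c CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG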
User cluster patch upgrade limitation
User clusters that are managed by an admin cluster must be at the same Google Distributed Cloud version or lower and within one minor release. For example, a version 1.7.1 (anthosBareMetalVersion: 1.7.1) admin cluster managing version 1.6.2 user clusters is acceptable.
An upgrade limitation prevents you from upgrading your user clusters to a new security patch when the patch is released after the release version the admin cluster is using. For example, if your admin cluster is at version 1.7.2, which was released on June 2, 2021, you can't upgrade your user clusters to version 1.6.4, because it was released on August 13, 2021.
Ubuntu 18.04 and 18.04.1 incompatibility
To upgrade to 1.8.1 or 1.8.2, cluster node machines and the workstation that runs bmctl need to have Linux kernel version 4.17.0 or newer. Otherwise the anetd networking controller will not work. The symptom is that pods with the anet prefix in the kube-system namespace will continue to crash with the following error message:
BPF NodePort services needs kernel 4.17.0 or newer.
This issue affects Ubuntu 18.04 and 18.04.1, since they are on kernel version 4.15.
This issue has been fixed in Google Distributed Cloud 1.8.3.
Upgrading 1.7.x clusters that use containerd
Cluster upgrades to 1.8.x are blocked for 1.7.x clusters that are configured to use the preview containerd capability. The containerd preview uses the incorrect control group (cgroup) driver cgroupfs, instead of the recommended systemd driver. There are reported cases of cluster instability when clusters that use the cgroupfs driver are put under resource pressure. The GA containerd capability in release 1.8.0 uses the correct systemd driver.
If you have existing 1.7.x clusters that use the preview containerd container runtime feature, we recommend that you create new 1.8.0 clusters configured for containerd and migrate any existing apps and workloads. This ensures the highest cluster stability when using the containerd container runtime.
SELinux upgrade failures
Upgrading 1.7.1 clusters configured with the containerd container runtime and running SELinux on RHEL or CentOS will fail. We recommend that you create new 1.8.0 clusters configured to use containerd and migrate your workloads.
Node draining can't start when Node is out of reach
The draining process for Nodes won't start if the Node is out of reach from Google Distributed Cloud. For example, if a Node goes offline during a cluster upgrade process, it may cause the upgrade to stop responding. This is a rare occurrence. To minimize the likelihood of encountering this problem, ensure your Nodes are operating properly before initiating an upgrade.
Operation
kubeconfig secret overwritten
The bmctl check cluster command, when run on user clusters, overwrites the user cluster kubeconfig secret with the admin cluster kubeconfig. Overwriting the file causes standard cluster operations, such as updating and upgrading, to fail for affected user clusters. This problem applies to Google Distributed Cloud versions 1.11.1 and earlier.
To determine if a user cluster is affected by this issue, run the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG get secret -n cluster-USER_CLUSTER_NAME \
    USER_CLUSTER_NAME-kubeconfig -o json | jq -r '.data.value' | base64 -d
Replace the following:

- ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.
- USER_CLUSTER_NAME: the name of the user cluster to check.
If the cluster name in the output (see contexts.context.cluster in the following sample output) is the admin cluster name, then the specified user cluster is affected.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRU...UtLS0tLQo=
    server: https://10.200.0.6:443
  name: ci-aed78cdeca81874
contexts:
- context:
    cluster: ci-aed78cdeca81874
    user: ci-aed78cdeca81874-admin
  name: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
current-context: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
kind: Config
preferences: {}
users:
- name: ci-aed78cdeca81874-admin
  user:
    client-certificate-data: LS0tLS1CRU...UtLS0tLQo=
    client-key-data: LS0tLS1CRU...0tLS0tCg==
The following steps restore function to an affected user cluster (USER_CLUSTER_NAME):
1. Locate the user cluster kubeconfig file. Google Distributed Cloud generates the kubeconfig file on the admin workstation when you create a cluster. By default, the file is in the bmctl-workspace/USER_CLUSTER_NAME directory.

2. Verify that the kubeconfig is the correct user cluster kubeconfig:

kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE

Replace PATH_TO_GENERATED_FILE with the path to the user cluster kubeconfig file. The response returns details about the nodes for the user cluster. Confirm the machine names are correct for your cluster.

3. Run the following command to delete the corrupted kubeconfig file in the admin cluster:

kubectl delete secret -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig

4. Run the following command to save the correct kubeconfig secret back to the admin cluster:

kubectl create secret generic -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig \
    --from-file=value=PATH_TO_GENERATED_FILE
Reset/Deletion
Namespace deletion
Deleting a namespace will prevent new resources from being created in that namespace, including jobs to reset machines. When deleting a user cluster, you must delete the cluster object first before deleting its namespace. Otherwise, the jobs to reset machines cannot get created, and the deletion process will skip the machine clean-up step.
containerd service
The bmctl reset command doesn't delete any containerd configuration files or binaries. The containerd systemd service is left up and running. The command deletes the containers running pods scheduled to the node.
Security
The cluster CA/certificate will be rotated during upgrade. On-demand rotation support is a Preview feature.
Google Distributed Cloud rotates kubelet serving certificates automatically. Each kubelet node agent can send out a Certificate Signing Request (CSR) when a certificate nears expiration. A controller in your admin clusters validates and approves the CSR.
Cluster CA Rotation (Preview Feature)
After you perform a user cluster certificate authority (CA) rotation on a cluster, all user authentication flows fail. These failures occur because the ClientConfig custom resource used in authentication flows isn't being updated with the new CA data during CA rotation. If you have performed a cluster CA rotation on your cluster, check to see if the certificateAuthorityData field in the default ClientConfig of the kube-public namespace contains the older cluster CA.
To resolve the issue manually, update the certificateAuthorityData field with the current cluster CA.
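For example, a sketch of checking and fixing the resource (the exact location of the field within the ClientConfig may differ in your version):
# Inspect the CA data currently stored in the default ClientConfig.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get clientconfig default -n kube-public -o yaml
# Edit the resource and replace the certificateAuthorityData value with the new base64-encoded cluster CA.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit clientconfig default -n kube-public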
Networking
Client source IP with bundled Layer 2 load balancing
Setting the external traffic policy to Local can cause routing errors, such as No route to host, for bundled Layer 2 load balancing. The external traffic policy is set to Cluster (externalTrafficPolicy: Cluster) by default. With this setting, Kubernetes handles cluster-wide traffic. Services of type LoadBalancer or NodePort can use externalTrafficPolicy: Local to preserve the client source IP address. With this setting, however, Kubernetes only handles node-local traffic.
If you want to preserve the client source IP address, additional configuration may be required to ensure service IPs are reachable. For configuration details, see Preserving client source IP address in Configure bundled load balancing.
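For reference, a minimal sketch of a Service that preserves the client source IP (the name, selector, and ports are placeholders):
kubectl apply -f - << 'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
EOF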
Modifying firewalld will erase Cilium iptable policy chains
When running Google Distributed Cloud with firewalld enabled on either CentOS or Red Hat Enterprise Linux (RHEL), changes to firewalld can remove the Cilium iptables chains on the host network. The iptables chains are added by the anetd Pod when it is started. The loss of the Cilium iptables chains causes the Pod on the Node to lose network connectivity outside of the Node.
Changes to firewalld that will remove the iptables chains include, but aren't limited to:
- Restarting firewalld, using systemctl
- Reloading firewalld with the command line client (firewall-cmd --reload)
You can fix this connectivity issue by restarting anetd on the Node. Locate and delete the anetd Pod with the following commands to restart anetd:
kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ
Replace ANETD_XYZ with the name of the anetd Pod.
Duplicate egressSourceIP addresses
When using the egress NAT gateway feature preview, it is possible to set traffic selection rules that specify an egressSourceIP address that is already in use for another EgressNATPolicy object. This may cause egress traffic routing conflicts. Coordinate with your development team to determine which floating IP addresses are available for use before specifying the egressSourceIP address in your EgressNATPolicy custom resource.
Pod connectivity failures due to I/O timeout and reverse path filtering
Google Distributed Cloud configures reverse path filtering on nodes to disable source validation (net.ipv4.conf.all.rp_filter=0). If the rp_filter setting is changed to 1 or 2, pods fail due to out-of-node communication timeouts.
Observed connectivity failures communicating to Kubernetes Service IP addresses are a symptom of this problem. Here are a couple of examples of the types of errors you might see:
- If all pods for a given node fail to communicate to the Service IP addresses, the istiod Pod might report an error like the following:

{"severity":"Error","timestamp":"2021-11-12T17:19:28.907001378Z", "message":"watch error in cluster Kubernetes: failed to list *v1.Node: Get \"https://172.26.0.1:443/api/v1/nodes?resourceVersion=534239\": dial tcp 172.26.0.1:443: i/o timeout"}

- For the localpv daemon set that runs on every node, the log might show a timeout like the following:

I1112 17:24:33.191654 1 main.go:128] Could not get node information (remaining retries: 2): Get https://172.26.0.1:443/api/v1/nodes/NODE_NAME: dial tcp 172.26.0.1:443: i/o timeout
Reverse path filtering is set with rp_filter files in the IPv4 configuration folder (net/ipv4/conf/all). The sysctl command stores reverse path filtering settings in a network security configuration file, such as /etc/sysctl.d/60-gce-network-security.conf. The sysctl command can override the reverse path filtering setting.
To restore Pod connectivity, either set net.ipv4.conf.all.rp_filter back to 0 manually, or restart the anetd Pod to set net.ipv4.conf.all.rp_filter back to 0. To restart the anetd Pod, use the following commands to locate and delete the anetd Pod. A new anetd Pod starts up in its place:
kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ
Replace ANETD_XYZ with the name of the anetd Pod.
To set net.ipv4.conf.all.rp_filter manually, run the following command:
sysctl -w net.ipv4.conf.all.rp_filter=0
Bootstrap (kind) cluster IP addresses and cluster node IP addresses overlapping
192.168.122.0/24 and 10.96.0.0/27 are the default pod and service CIDRs used by the bootstrap (kind) cluster. Preflight checks will fail if they overlap with cluster node machine IP addresses. To avoid the conflict, you can pass the --bootstrap-cluster-pod-cidr and --bootstrap-cluster-service-cidr flags to bmctl to specify different values.
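For example (the CIDR values are placeholders; choose ranges that don't overlap your node addresses):
bmctl create cluster -c CLUSTER_NAME \
    --bootstrap-cluster-pod-cidr=192.168.200.0/24 \
    --bootstrap-cluster-service-cidr=10.200.0.0/27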
Overlapping IP addresses across different clusters
There is no validation for overlapping IP addresses across different clusters during update. The validation only applies at cluster/node pool creation time.
Operating system
Cluster creation or upgrade fails on CentOS
In December 2020, the CentOS community and Red Hat announced the sunset of CentOS. On January 31, 2022, CentOS 8 reached its end of life (EOL). As a result of the EOL, yum repositories stopped working for CentOS, which causes cluster creation and cluster upgrade operations to fail. This applies to all supported versions of CentOS and affects all versions of Google Distributed Cloud.
As a workaround, run the following commands to have your CentOS use an archive feed:
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
    /etc/yum.repos.d/CentOS-Linux-*
As a long-term solution, consider migrating to another supported operating system.
Operating system endpoint limitations
On RHEL and CentOS, there is a cluster-level limitation of 100,000 endpoints. This number is the sum of all pods that are referenced by a Kubernetes service. If 2 services reference the same set of pods, this counts as 2 separate sets of endpoints. The underlying nftables implementation on RHEL and CentOS causes this limitation; it is not an intrinsic limitation of Google Distributed Cloud.
Configuration
Control plane and load balancer specifications
The control plane and load balancer node pool specifications are special. These specifications declare and control critical cluster resources. The canonical source for these resources is their respective sections in the cluster config file:
- spec.controlPlane.nodePoolSpec
- spec.loadBalancer.nodePoolSpec
Consequently, do not modify the top-level control plane and load balancer node pool resources directly. Modify the associated sections in the cluster config file instead.
Anthos VM Runtime
- Restarting a pod causes the VMs on the pod to change IP addresses or lose their IP address altogether. If the IP address of a VM changes, this does not affect the reachability of VM applications exposed as a Kubernetes service. If the IP address is lost, you must run dhclient from the VM to acquire an IP address for the VM.
SELinux
SELinux errors during pod creation
Pod creation sometimes fails when SELinux prevents the container runtime from setting labels on tmpfs mounts. This failure is rare, but can happen when SELinux is in Enforcing mode and in some kernels.
To verify that SELinux is the cause of pod creation failures, use the following command to check for errors in the kubelet logs:
journalctl -u kubelet
If SELinux is causing pod creation to fail, the command response contains an error similar to the following:
error setting label on mount source '/var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x':
failed to set file label on /var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x:
permission denied
To verify that this issue is related to SELinux enforcement, run the following command:
ausearch -m avc
This command searches the audit logs for access vector cache (AVC) permission errors. The avc: denied in the following sample response confirms that the pod creation failures are related to SELinux enforcement.
type=AVC msg=audit(1627410995.808:9534): avc: denied { associate } for
pid=20660 comm="dockerd" name="/" dev="tmpfs" ino=186492
scontext=system_u:object_r:container_file_t:s0:c61,c201
tcontext=system_u:object_r:locale_t:s0 tclass=filesystem permissive=0
The root cause of this pod creation problem with SELinux is a kernel bug found in the following Linux images:
- Red Hat Enterprise Linux (RHEL) releases prior to 8.3
- CentOS releases prior to 8.3
Rebooting the machine helps recover from the issue.
To prevent pod creation errors from occurring, use RHEL 8.3 or later or CentOS 8.3 or later, because those versions have fixed the kernel bug.
Snapshots
Taking a snapshot as a non-root login user
For Google Distributed Cloud versions 1.8.1 and earlier, if you aren't logged in as root, you can't take a cluster snapshot with the bmctl command.
Starting with release 1.8.2, Google Distributed Cloud will respect nodeAccess.loginUser in the cluster spec. If the admin cluster is unreachable, you can specify the login user with the --login-user flag.
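For example, a sketch of taking a snapshot as a non-root login user when the admin cluster is unreachable (the cluster name and login user are placeholders; verify the flags against your bmctl version):
bmctl check cluster --snapshot -c CLUSTER_NAME --login-user LOGIN_USER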
Note that if you use containerd as the container runtime, snapshot still fails to run crictl commands. See Containerd requires /usr/local/bin in PATH for a workaround. The PATH settings used for sudo cause this problem.
GKE Connect
Crash looping gke-connect-agent Pod
Heavy usage of GKE Connect gateway can sometimes result in gke-connect-agent Pod out-of-memory problems. Symptoms of these out-of-memory issues include:
- The gke-connect-agent Pod shows a high number of restarts or ends up in crash looping state.
- The connect gateway stops functioning.
To address this out-of-memory problem, edit the deployment with prefix gke-connect-agent under the gke-connect namespace and raise the memory limit to 256 MiB or higher.
kubectl patch deploy $(kubectl get deploy -l app=gke-connect-agent -n gke-connect -o jsonpath='{.items[0].metadata.name}') -n gke-connect --patch '{"spec":{"containers":[{"resources":{"limits":{"memory":"256Mi"}}}]}}'
This problem is fixed in Google Distributed Cloud release 1.8.2 and later.
Logging and monitoring
stackdriver-log-forwarder Pod stuck restarting
For Google Distributed Cloud versions before 1.9, if a node is forcibly shut down, the stackdriver-log-forwarder Pod may get stuck in a restarting state. In this case, you may see a log entry like the following:
[error] [input:storage_backlog:storage_backlog.7] chunk validation failed, data might
be corrupted. Found 0 valid records, failed content starts right after byte 0.
When the stackdriver-log-forwarder Pod is stuck, most logs are blocked from going to Cloud Logging and any unflushed data is lost. To resolve this issue, reset the logging pipeline.
To reset the logging pipeline:
1. Scale down the stackdriver-operator:

kubectl --kubeconfig=KUBECONFIG -n kube-system scale deploy \
    stackdriver-operator --replicas=0

2. Delete the stackdriver-log-forwarder DaemonSet:

kubectl --kubeconfig KUBECONFIG -n kube-system delete daemonset \
    stackdriver-log-forwarder

3. Verify the stackdriver-log-forwarder Pods are deleted before going to the next step.

4. Deploy the following DaemonSet to clean up any corrupted data in fluent-bit buffers:

kubectl --kubeconfig KUBECONFIG -n kube-system apply -f - << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-cleanup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluent-bit-cleanup
  template:
    metadata:
      labels:
        app: fluent-bit-cleanup
    spec:
      containers:
      - name: fluent-bit-cleanup
        image: debian:10-slim
        command: ["bash", "-c"]
        args:
        - |
          rm -rf /var/log/fluent-bit-buffers/
          echo "Fluent Bit local buffer is cleaned up."
          sleep 3600
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        securityContext:
          privileged: true
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: node-role.gke.io/observability
        effect: NoSchedule
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
EOF

5. Make sure the DaemonSet has cleaned up all the nodes. The output of the following commands should equal the number of nodes in your cluster.

kubectl --kubeconfig KUBECONFIG logs -n kube-system -l app=fluent-bit-cleanup | \
    grep "cleaned up" | wc -l
kubectl --kubeconfig KUBECONFIG -n kube-system get pods -l app=fluent-bit-cleanup \
    --no-headers | wc -l

6. Delete the cleanup DaemonSet:

kubectl --kubeconfig KUBECONFIG -n kube-system delete ds fluent-bit-cleanup

7. Scale up the operator and wait for it to redeploy the logging pipeline:

kubectl --kubeconfig=KUBECONFIG -n kube-system scale deploy \
    stackdriver-operator --replicas=1
This problem is fixed in Google Distributed Cloud release 1.9.0 and later.