Installation
Control group v2 incompatibility
Control group v2 (cgroup v2) is incompatible with Google Distributed Cloud 1.6. Kubernetes 1.18 does not support cgroup v2, and Docker offers only experimental support as of version 20.10. systemd switched to cgroup v2 by default in version 247.2-2. The presence of /sys/fs/cgroup/cgroup.controllers indicates that your system uses cgroup v2.
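To confirm which cgroup version a node machine uses, you can run a quick check like the following from a shell on the machine (a minimal sketch based on the file path named above):
[ -f /sys/fs/cgroup/cgroup.controllers ] && echo 'cgroup v2 in use' || echo 'cgroup v1 in use'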
Starting with Google Distributed Cloud 1.6.2, the preflight checks verify that cgroup v2 is not in use on the cluster machine.
Benign error messages during installation
During highly available (HA) cluster installation, you may see errors about etcdserver leader change. These error messages are benign and can be ignored.
When you use bmctl for cluster installation, you may see a Log streamer failed to get BareMetalMachine log message at the very end of create-cluster.log. This error message is benign and can be ignored.
When examining cluster creation logs, you may notice transient failures about registering clusters or calling webhooks. These errors can be safely ignored, because the installation will retry these operations until they succeed.
Preflight checks and service account credentials
For installations triggered by admin or hybrid clusters (in other words, clusters not created with bmctl, such as user clusters), the preflight check does not verify Google Cloud Platform service account credentials or their associated permissions.
Application default credentials and bmctl
bmctl uses Application Default Credentials (ADC) to validate the cluster operation's location value in the cluster spec when it is not set to global.
For ADC to work, you need to either point the GOOGLE_APPLICATION_CREDENTIALS environment variable to a service account credential file, or run gcloud auth application-default login.
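For example, to point ADC at a service account key before running bmctl, you can export the environment variable (the key file path here is a placeholder):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json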
Docker service
On cluster node machines, if the Docker executable is present in the PATH environment variable but the Docker service is not active, the preflight check will fail and report that the Docker service is not active. To fix this error, either remove Docker or enable the Docker service.
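To enable and start the Docker service instead of removing it, a command like the following typically works on systemd-based node machines:
sudo systemctl enable --now docker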
Upgrades and updates
Upgrading is not available in the Google Distributed Cloud 1.6.x releases.
bmctl update cluster fails if the .manifests directory is missing
If the .manifests directory is removed prior to running bmctl update cluster, the command fails with an error similar to the following:
Error updating cluster resources.: failed to get CRD file .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: open .manifests/1.9.0/cluster-operator/base/crd/bases/baremetal.cluster.gke.io_clusters.yaml: no such file or directory
You can fix this issue by running bmctl check cluster first, which will recreate the .manifests directory.
This issue applies to Google Distributed Cloud 1.10 and earlier and is fixed in version 1.11 and later.
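For example, a check run similar to the following recreates the directory (the cluster name is a placeholder):
bmctl check cluster -c CLUSTER_NAME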
bmctl update doesn't remove maintenance blocks
The bmctl update command can't remove or modify the maintenanceBlocks section from the cluster resource configuration. For more information, including instructions for removing nodes from maintenance mode, see Put nodes into maintenance mode.
Operation
kubeconfig secret overwritten
The bmctl check cluster command, when run on user clusters, overwrites the user cluster kubeconfig secret with the admin cluster kubeconfig. Overwriting the file causes standard cluster operations, such as updating and upgrading, to fail for affected user clusters. This problem applies to Google Distributed Cloud versions 1.11.1 and earlier.
To determine if a user cluster is affected by this issue, run the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG get secret -n cluster-USER_CLUSTER_NAME \
    USER_CLUSTER_NAME-kubeconfig -o json | jq -r '.data.value' | base64 -d
Replace the following:
ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.
USER_CLUSTER_NAME: the name of the user cluster to check.
If the cluster name in the output (see contexts.context.cluster in the following sample output) is the admin cluster name, then the specified user cluster is affected.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRU...UtLS0tLQo=
    server: https://10.200.0.6:443
  name: ci-aed78cdeca81874
contexts:
- context:
    cluster: ci-aed78cdeca81874
    user: ci-aed78cdeca81874-admin
  name: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
current-context: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
kind: Config
preferences: {}
users:
- name: ci-aed78cdeca81874-admin
  user:
    client-certificate-data: LS0tLS1CRU...UtLS0tLQo=
    client-key-data: LS0tLS1CRU...0tLS0tCg==
The following steps restore function to an affected user cluster (USER_CLUSTER_NAME):
1. Locate the user cluster kubeconfig file. Google Distributed Cloud generates the kubeconfig file on the admin workstation when you create a cluster. By default, the file is in the bmctl-workspace/USER_CLUSTER_NAME directory.
2. Verify that the kubeconfig file is the correct user cluster kubeconfig:
kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE
Replace PATH_TO_GENERATED_FILE with the path to the user cluster kubeconfig file. The response returns details about the nodes of the user cluster. Confirm that the machine names are correct for your cluster.
3. Run the following command to delete the corrupted kubeconfig secret from the admin cluster:
kubectl delete secret -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig
4. Run the following command to save the correct kubeconfig secret back to the admin cluster:
kubectl create secret generic -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig \
    --from-file=value=PATH_TO_GENERATED_FILE
Reset/Deletion
User cluster credentials
The bmctl reset command relies on the top-level credentials section in the cluster configuration file. For user clusters, you will need to manually update the file to add the credentials section.
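As an illustration, the top-level credentials section of a bmctl-generated cluster configuration file typically includes fields like the following (the key file paths are placeholders; confirm the exact field names against the configuration file that bmctl generated for your admin or hybrid cluster):
gcrKeyPath: /path/to/gcr-service-account-key.json
sshPrivateKeyPath: /path/to/ssh-private-key
gkeConnectAgentServiceAccountKeyPath: /path/to/connect-agent-key.json
gkeConnectRegisterServiceAccountKeyPath: /path/to/connect-register-key.json
cloudOperationsServiceAccountKeyPath: /path/to/cloud-operations-key.json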
Mount points and fstab
Reset does not unmount the mount points under /mnt/anthos-system and /mnt/localpv-share/. It also does not clean up the corresponding entries in /etc/fstab.
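If you need to clean these up manually after a reset, commands along the following lines unmount the leftover mounts and remove the stale fstab entries on each node (a sketch only; review what is actually mounted and which fstab lines match before deleting anything):
mount | grep -E '/mnt/(anthos-system|localpv-share)'
sudo umount /mnt/anthos-system/* /mnt/localpv-share/*
sudo sed -i -e '\|/mnt/anthos-system|d' -e '\|/mnt/localpv-share|d' /etc/fstab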
Namespace deletion
Deleting a namespace will prevent new resources from being created in that namespace, including jobs to reset machines. When deleting a user cluster, you must delete the cluster object first before deleting its namespace. Otherwise, the jobs to reset machines cannot get created, and the deletion process will skip the machine clean-up step.
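For example, a deletion sequence along these lines removes the cluster object before its namespace (the fully qualified resource name is used to avoid ambiguity with other Cluster CRDs; verify the object and namespace names with kubectl get first):
kubectl --kubeconfig ADMIN_KUBECONFIG delete clusters.baremetal.cluster.gke.io USER_CLUSTER_NAME -n cluster-USER_CLUSTER_NAME
kubectl --kubeconfig ADMIN_KUBECONFIG delete namespace cluster-USER_CLUSTER_NAME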
Security
The cluster CA/certificate will be rotated during upgrade. On-demand rotation support is not currently available.
Google Distributed Cloud rotates kubelet serving certificates automatically. Each kubelet node agent can send out a Certificate Signing Request (CSR) when a certificate nears expiration. A controller in your admin clusters validates and approves the CSR.
Networking
Client source IP with bundled Layer 2 load balancing
Setting the external traffic policy to Local can cause routing errors, such as No route to host, for bundled Layer 2 load balancing. The external traffic policy is set to Cluster (externalTrafficPolicy: Cluster) by default. With this setting, Kubernetes handles cluster-wide traffic. Services of type LoadBalancer or NodePort can use externalTrafficPolicy: Local to preserve the client source IP address. With this setting, however, Kubernetes only handles node-local traffic.
If you want to preserve the client source IP address, additional configuration may be required to ensure service IPs are reachable. For configuration details, see Preserving client source IP address in Configure bundled load balancing.
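For reference, a Service that opts into preserving the client source IP sets the field as in the following sketch (the Service name, selector, and ports are illustrative only):
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080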
Pod connectivity failures and reverse path filtering
Google Distributed Cloud configures reverse path filtering on nodes to disable source validation (net.ipv4.conf.all.rp_filter=0). If the rp_filter setting is changed to 1 or 2, pods will fail due to out-of-node communication timeouts.
Reverse path filtering is set with rp_filter files in the IPv4 configuration folder (net/ipv4/conf/all). This value may also be overridden by sysctl, which stores reverse path filtering settings in a network security configuration file, such as /etc/sysctl.d/60-gce-network-security.conf.
To restore Pod connectivity, either set net.ipv4.conf.all.rp_filter back to 0 manually, or restart the anetd Pod to set net.ipv4.conf.all.rp_filter back to 0. To restart the anetd Pod, use the following commands to locate and delete the anetd Pod; a new anetd Pod starts up in its place:
kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ
Replace ANETD_XYZ with the name of the anetd Pod.
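Alternatively, to reset the value manually without restarting anetd, a command like the following works on each affected node:
sudo sysctl -w net.ipv4.conf.all.rp_filter=0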
Bootstrap (kind) cluster IP addresses and cluster node IP addresses overlapping
192.168.122.0/24 and 10.96.0.0/27 are the default pod and service CIDRs used by the bootstrap (kind) cluster. Preflight checks will fail if they overlap with cluster node machine IP addresses. To avoid the conflict, you can pass the --bootstrap-cluster-pod-cidr and --bootstrap-cluster-service-cidr flags to bmctl to specify different values.
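For example, a creation run that moves the bootstrap cluster onto non-conflicting ranges might look like the following (the CIDR values shown are arbitrary examples; choose ranges that don't overlap your node, pod, or service networks):
bmctl create cluster -c CLUSTER_NAME \
    --bootstrap-cluster-pod-cidr 192.168.123.0/24 \
    --bootstrap-cluster-service-cidr 10.97.0.0/27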
Overlapping IP addresses across different clusters
There is no preflight check to validate overlapping IP addresses across different clusters.
hostport feature in Google Distributed Cloud
The hostport feature in ContainerPort is not currently supported.
Operating system
Cluster creation or upgrade fails on CentOS
In December 2020, the CentOS community and Red Hat announced the sunset of CentOS. On January 31, 2022, CentOS 8 reached its end of life (EOL). As a result of the EOL, yum repositories stopped working for CentOS, which causes cluster creation and cluster upgrade operations to fail. This applies to all supported versions of CentOS and affects all versions of Google Distributed Cloud.
As a workaround, run the following commands to have your CentOS use an archive feed:
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
    /etc/yum.repos.d/CentOS-Linux-*
As a long-term solution, consider migrating to another supported operating system, such as Ubuntu or RHEL.
Operating system endpoint limitations
On RHEL and CentOS, there is a cluster-level limitation of 100,000 endpoints. This number is the sum of all pods referenced by a Kubernetes service. If two services reference the same set of pods, this counts as two separate sets of endpoints. The underlying nftables implementation on RHEL and CentOS causes this limitation; it is not an intrinsic limitation of Google Distributed Cloud.
Configuration
Control plane and load balancer specifications
The control plane and load balancer node pool specifications are special. These specifications declare and control critical cluster resources. The canonical source for these resources is their respective sections in the cluster config file:
spec.controlPlane.nodePoolSpec
spec.loadBalancer.nodePoolSpec
Consequently, do not modify the top-level control plane and load balancer node pool resources directly. Modify the associated sections in the cluster config file instead.
Mutable fields in the cluster and node pool specification
Currently, only the following cluster and node pool specification fields in the cluster config file can be updated after the cluster is created (they are mutable fields):
For the Cluster object (kind: Cluster), the following fields are mutable:
spec.anthosBareMetalVersion
spec.bypassPreflightCheck
spec.controlPlane.nodePoolSpec.nodes
spec.loadBalancer.nodePoolSpec.nodes
spec.maintenanceBlocks
spec.nodeAccess.loginUser
For the NodePool object (kind: NodePool), the following fields are mutable:
spec.nodes
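To change one of these fields, the typical flow is to edit the corresponding section of the cluster configuration file and then apply the change with bmctl update cluster; a sketch of that flow, assuming the default bmctl-workspace layout, looks like this:
# After editing bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME.yaml:
bmctl update cluster -c CLUSTER_NAME --kubeconfig ADMIN_KUBECONFIG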
Node shows NotReady status
Under certain load conditions, Google Distributed Cloud 1.6.x nodes may display a NotReady status due to the Pod Lifecycle Event Generator (PLEG) being unhealthy. The node status will contain the following entry:
PLEG is not healthy: pleg was last seen active XXXmXXXs ago; threshold is 3m0s
How do I know if I'm affected?
A likely cause of this issue is the runc binary version. To confirm if you have the problematic version installed, connect to one of the cluster machines using SSH and run:
/usr/bin/runc -v
If the output is 1.0.0-rc93, then you have the problematic version installed.
Possible workarounds
To resolve this issue, we recommend upgrading to Google Distributed Cloud 1.7.0 or a later version.
If upgrading is not an option, you can revert the containerd.io package to an earlier version on the problematic node machines. To do this, connect to the node machine using SSH and run:
Ubuntu
apt install containerd.io=1.4.3-1
CentOS/RHEL
dnf install containerd.io-1.3.9-3.1.el8