Configuration
Control plane and load balancer specifications
The control plane and load balancer node pool specifications are special. These specifications declare and control critical cluster resources. The canonical source for these resources is their respective sections in the cluster config file:
spec.controlPlane.nodePoolSpec
spec.loadBalancer.nodePoolSpec
Consequently, do not modify the top-level control plane and load balancer node pool resources directly. Modify the associated sections in the cluster config file instead.
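For example, to change control plane nodes, edit the nodes list under spec.controlPlane.nodePoolSpec in the cluster config file. The following fragment is a minimal sketch; the IP address is illustrative:

apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: CLUSTER_NAME
spec:
  controlPlane:
    nodePoolSpec:
      nodes:
      # Add or remove control plane nodes here, not on the NodePool resource.
      - address: 10.200.0.4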
Installation
bmctl exits before cluster creation completes
Cluster creation may fail for Anthos clusters on bare metal version 1.11.0 (this issue is fixed in Anthos clusters on bare metal release 1.11.1). In some cases, the bmctl create cluster command exits early and writes errors like the following to the logs:
Error creating cluster: error waiting for applied resources: provider cluster-api
watching namespace USER_CLUSTER_NAME not found in the target cluster
The failed operation produces artifacts, but the cluster isn't operational. If this issue affects you, use the following steps to clean up artifacts and create a cluster:
To delete cluster artifacts and reset the node machine, run the following command:
bmctl reset -c USER_CLUSTER_NAME
To start the cluster creation operation, run the following command:
bmctl create cluster -c USER_CLUSTER_NAME --keep-bootstrap-cluster
The --keep-bootstrap-cluster flag is important if this command fails. If the cluster creation command succeeds, you can skip the remaining steps. Otherwise, continue.
Run the following command to get the version for the bootstrap cluster:
kubectl get cluster USER_CLUSTER_NAME -n USER_CLUSTER_NAMESPACE \
    --kubeconfig bmctl-workspace/.kindkubeconfig \
    -o=jsonpath='{.status.anthosBareMetalVersion}'
The output should be 1.11.0. If the output isn't 1.11.0, wait a minute or two and retry.
To manually move resources from the bootstrap cluster to the target cluster, run the following command:
bmctl move --from-kubeconfig bmctl-workspace/.kindkubeconfig \
    --to-kubeconfig bmctl-workspace/USER_CLUSTER_NAME/USER_CLUSTER_NAME-kubeconfig \
    -n USER_CLUSTER_NAMESPACE
To delete the bootstrap cluster, run the following command:
bmctl reset bootstrap
Installation reports VM runtime reconciliation error
The cluster creation operation may report an error similar to the following:
I0423 01:17:20.895640 3935589 logs.go:82] "msg"="Cluster reconciling:"
"message"="Internal error occurred: failed calling webhook \"vvmruntime.kb.io\":
failed to call webhook: Post \"https://vmruntime-webhook-service.kube-
system.svc:443/validate-vm-cluster-gke-io-v1-vmruntime?timeout=10s\": dial tcp
10.95.5.151:443: connect: connection refused" "name"="xxx"
"reason"="ReconciliationError"
This error is benign and you can safely ignore it.
Cluster creation fails when using multi-NIC, containerd, and HTTPS proxy
Cluster creation fails when you have the following combination of conditions:
- The cluster is configured to use containerd as the container runtime (nodeConfig.containerRuntime set to containerd in the cluster configuration file, the default for Anthos clusters on bare metal version 1.12).
- The cluster is configured to provide multiple network interfaces, multi-NIC, for Pods (clusterNetwork.multipleNetworkInterfaces set to true in the cluster configuration file).
- The cluster is configured to use a proxy (spec.proxy.url is specified in the cluster configuration file). Even though cluster creation fails, this setting is propagated when you attempt to create a cluster. You may see this proxy setting as an HTTPS_PROXY environment variable or in your containerd configuration (/etc/systemd/system/containerd.service.d/09-proxy.conf).
As a workaround for this issue, append the service CIDRs (clusterNetwork.services.cidrBlocks) to the NO_PROXY environment variable on all node machines, as shown in the following example.
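For example, if your service CIDR is 10.96.0.0/20 (an illustrative value; check clusterNetwork.services.cidrBlocks in your cluster configuration file for the actual range), the NO_PROXY entry in /etc/environment on each node might look like the following:

NO_PROXY=localhost,127.0.0.1,10.96.0.0/20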
Failure on systems with restrictive umask setting
Anthos clusters on bare metal release 1.10.0 introduced a rootless control plane feature that runs all the control plane components as a non-root user. Running all components as a non-root user may cause installation or upgrade failures on systems with a more restrictive umask setting of 0077.
The workaround for cluster creation failures is to reset the control plane nodes and change the umask setting to 0022 on all the control plane machines. After the machines have been updated, retry the installation.
Alternatively, you can change the directory and file permissions of /etc/kubernetes on the control plane machines for the installation or upgrade to proceed. Example commands follow the list.

- Make /etc/kubernetes and all its subdirectories world readable: chmod o+rx.
- Make all the files owned by the root user under the /etc/kubernetes directory (recursively) world readable (chmod o+r). Exclude private key files (.key) from these changes as they are already created with correct ownership and permissions.
- Make /usr/local/etc/haproxy/haproxy.cfg world readable.
- Make /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml world readable.
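The following commands are a minimal sketch of these permission changes; review them against your own file layout before running them on a control plane machine:

chmod o+rx /etc/kubernetes
find /etc/kubernetes -type d -exec chmod o+rx {} +
find /etc/kubernetes -type f -user root ! -name '*.key' -exec chmod o+r {} +
chmod o+r /usr/local/etc/haproxy/haproxy.cfg
chmod o+r /usr/local/etc/bgpadvertiser/bgpadvertiser-cfg.yaml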
Control group v2 incompatibility
Control group v2 (cgroup v2) is not supported in Anthos clusters on bare metal. The presence of /sys/fs/cgroup/cgroup.controllers indicates that your system uses cgroup v2.
The preflight checks verify that cgroup v2 is not in use on the cluster machine.
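The following shell check is a sketch of the same test you can run yourself on a machine:

if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "cgroup v2 in use (not supported by Anthos clusters on bare metal)"
else
  echo "cgroup v1 in use"
fi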
Benign error messages during installation
When examining cluster creation logs, you may notice transient failures about registering clusters or calling webhooks. These errors can be safely ignored, because the installation will retry these operations until they succeed.
Preflight checks and service account credentials
For installations triggered by admin or hybrid clusters (in other words, clusters not created with bmctl, like user clusters), the preflight check does not verify Google Cloud Platform service account credentials or their associated permissions.
Application default credentials and bmctl
bmctl uses Application Default Credentials (ADC) to validate the cluster operation's location value in the cluster spec when it is not set to global.
For ADC to work, you need to either point the GOOGLE_APPLICATION_CREDENTIALS environment variable to a service account credential file, or run gcloud auth application-default login.
Docker service
On cluster node machines, if the Docker executable is present in the PATH environment variable but the Docker service is not active, the preflight check will fail and report that the Docker service is not active. To fix this error, either remove Docker or enable the Docker service.
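For example, on systemd-based node machines you can check the service state and enable it as follows:

systemctl is-active docker
sudo systemctl enable --now docker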
Installing on vSphere
When installing Anthos clusters on bare metal on vSphere VMs, you must set the tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation flags to off. These flags are related to the hardware segmentation offload done by the vSphere driver VMXNET3, and they don't work with the GENEVE tunnel of Anthos clusters on bare metal.
Run the following command on each node to check the current values for these flags:
ethtool -k NET_INTFC | grep segm
...
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
...
Replace NET_INTFC with the network interface associated with the IP address of the node.
Sometimes in RHEL 8.4, ethtool shows these flags are off while they aren't. To explicitly set these flags to off, toggle the flags on and then off with the following commands:
ethtool -K ens192 tx-udp_tnl-segmentation on
ethtool -K ens192 tx-udp_tnl-csum-segmentation on
ethtool -K ens192 tx-udp_tnl-segmentation off
ethtool -K ens192 tx-udp_tnl-csum-segmentation off
This flag change does not persist across reboots. Configure the startup scripts to explicitly set these flags when the system boots.
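The following systemd unit is a minimal sketch of such a startup script; the ens192 interface name, the unit name, and the ethtool path are illustrative and may need adjusting for your distribution:

# /etc/systemd/system/disable-tnl-offload.service
[Unit]
Description=Disable UDP tunnel segmentation offload for the Anthos GENEVE tunnel
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# On RHEL 8.4 you may need to toggle the flags on first, as described above.
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-segmentation off
ExecStart=/usr/sbin/ethtool -K ens192 tx-udp_tnl-csum-segmentation off

[Install]
WantedBy=multi-user.target

Enable the unit with systemctl enable disable-tnl-offload.service.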
Upgrades and updates
Upgrading clusters to version 1.11.2 or higher fails when Anthos VM Runtime is enabled
In Anthos clusters on bare metal release 1.11.2, all resources related to
Anthos VM Runtime are migrated to the vm-system
namespace. If you
have Anthos VM Runtime enabled in a version 1.11.1 or lower cluster,
upgrading to version 1.11.2 or higher fails unless you first disable
Anthos VM Runtime. When you're affected by this issue, the upgrade
operation reports the following error:
Failed to upgrade cluster: cluster is not upgradable with vmruntime enabled from
version 1.11.0 to version 1.11.2: please disable VMruntime before upgrade to
1.11.2 and higher version
To disable Anthos VM Runtime:
Edit the VMRuntime custom resource:

kubectl edit vmruntime

Set enabled to false in the spec:

apiVersion: vm.cluster.gke.io/v1
kind: VMRuntime
metadata:
  name: vmruntime
spec:
  enabled: false
...
Save the custom resource in your editor.
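As a non-interactive alternative to editing the resource, the following one-liner is a sketch of the same change; verify the result with kubectl get vmruntime afterwards:

kubectl patch vmruntime vmruntime --type merge -p '{"spec":{"enabled":false}}'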
Once the cluster upgrade is complete, re-enable Anthos VM Runtime.
For more information, see Working with VM-based workloads.
Upgrade stuck at error during manifests operations
In some situations, cluster upgrades fail to complete and the bmctl CLI becomes unresponsive. This problem can be caused by an incorrectly updated resource. To determine if you're affected by this issue and to correct it, use the following steps:

Check the anthos-cluster-operator logs and look for errors similar to the following entries:

controllers/Cluster "msg"="error during manifests operations" "error"="1 error occurred: ... {RESOURCE_NAME} is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
These entries are a symptom of an incorrectly updated resource, where {RESOURCE_NAME} is the name of the problem resource.

If you find these errors in your logs, use kubectl edit to remove the kubectl.kubernetes.io/last-applied-configuration annotation from the resource contained in the log message (see the sketch after these steps).

Save and apply your changes to the resource.
Retry the cluster upgrade.
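The following command is a sketch of removing the annotation without an interactive editor; RESOURCE_KIND, RESOURCE_NAME, and RESOURCE_NAMESPACE are placeholders for the resource identified in the log message:

kubectl annotate RESOURCE_KIND RESOURCE_NAME -n RESOURCE_NAMESPACE \
    kubectl.kubernetes.io/last-applied-configuration-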
Upgrades are blocked for clusters with features that use Anthos Network Gateway
Cluster upgrades from 1.10.x to 1.11.x fail for clusters that use either egress NAT gateway or bundled load balancing with BGP. These features both use Anthos Network Gateway. Cluster upgrades get stuck at the Waiting for upgrade to complete... command-line message, and the anthos-cluster-operator logs errors like the following:
apply run failed ...
MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable...
To unblock the upgrade, run the following commands against the cluster you are upgrading:
kubectl -n kube-system delete deployment ang-controller-manager-autoscaler
kubectl -n kube-system delete deployment ang-controller-manager
kubectl -n kube-system delete ds ang-node
bmctl update doesn't remove maintenance blocks
The bmctl update
command can't remove or modify the maintenanceBlocks
section from the cluster resource configuration. For more information, including
instructions for removing nodes from maintenance mode, see
Put nodes into maintenance mode.
Node draining can't start when Node is out of reach
The draining process for Nodes won't start if the Node is out of reach from Anthos clusters on bare metal. For example, if a Node goes offline during a cluster upgrade process, it may cause the upgrade to stop responding. This is a rare occurrence. To minimize the likelihood of encountering this problem, ensure your Nodes are operating properly before initiating an upgrade.
Operation
Version 1.11 admin clusters using a registry mirror can't manage version 1.10 clusters
If your admin cluster is on version 1.11 and uses a registry mirror, it can't manage user clusters that are on a lower minor version. This issue affects reset, update, and upgrade operations on the user cluster.
To determine whether this issue affects you, check your logs for cluster
operations, such as create, upgrade, or reset. These logs are located in the
bmctl-workspace/CLUSTER_NAME/
folder by default. If you're
affected by the issue, your logs contain the following error message:
flag provided but not defined: -registry-mirror-host-to-endpoints
kubeconfig Secret overwritten
The bmctl check cluster
command, when run on user clusters, overwrites the
user cluster kubeconfig Secret with the admin cluster kubeconfig. Overwriting
the file causes standard cluster operations, such as updating and upgrading, to
fail for affected user clusters. This problem applies to Anthos clusters on bare metal
versions 1.11.1 and earlier.
To determine if this issue affects a user cluster, run the following command:
kubectl --kubeconfig ADMIN_KUBECONFIG get secret -n USER_CLUSTER_NAMESPACE \
    USER_CLUSTER_NAME-kubeconfig -o json | jq -r '.data.value' | base64 -d
Replace the following:

- ADMIN_KUBECONFIG: the path to the admin cluster kubeconfig file.
- USER_CLUSTER_NAMESPACE: the namespace for the cluster. By default, the cluster namespaces for Anthos clusters on bare metal are the name of the cluster prefaced with cluster-. For example, if you name your cluster test, the default namespace is cluster-test.
- USER_CLUSTER_NAME: the name of the user cluster to check.
If the cluster name in the output (see contexts.context.cluster in the following sample output) is the admin cluster name, then the specified user cluster is affected.
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRU...UtLS0tLQo=
server: https://10.200.0.6:443
name: ci-aed78cdeca81874
contexts:
- context:
cluster: ci-aed78cdeca81874
user: ci-aed78cdeca81874-admin
name: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
current-context: ci-aed78cdeca81874-admin@ci-aed78cdeca81874
kind: Config
preferences: {}
users:
- name: ci-aed78cdeca81874-admin
user:
client-certificate-data: LS0tLS1CRU...UtLS0tLQo=
client-key-data: LS0tLS1CRU...0tLS0tCg==
The following steps restore function to an affected user cluster (USER_CLUSTER_NAME):
Locate the user cluster kubeconfig file. Anthos clusters on bare metal generates the kubeconfig file on the admin workstation when you create a cluster. By default, the file is in the bmctl-workspace/USER_CLUSTER_NAME directory.

Verify the kubeconfig is the correct user cluster kubeconfig:

kubectl get nodes --kubeconfig PATH_TO_GENERATED_FILE

Replace PATH_TO_GENERATED_FILE with the path to the user cluster kubeconfig file. The response returns details about the nodes for the user cluster. Confirm the machine names are correct for your cluster.

Run the following command to delete the corrupted kubeconfig file in the admin cluster:

kubectl delete secret -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig

Run the following command to save the correct kubeconfig secret back to the admin cluster:

kubectl create secret generic -n USER_CLUSTER_NAMESPACE USER_CLUSTER_NAME-kubeconfig \
    --from-file=value=PATH_TO_GENERATED_FILE
Taking a snapshot as a non-root login user
If you use containerd as the container runtime, running a snapshot as a non-root user requires /usr/local/bin to be in the user's PATH. Otherwise, the snapshot fails with a crictl: command not found error.
When you aren't logged in as the root user, sudo
is used to run the snapshot
commands. The sudo
PATH can differ from the root profile and may not contain
/usr/local/bin
.
You can fix this error by updating the secure_path in /etc/sudoers to include /usr/local/bin. Alternatively, create a symbolic link for crictl in another /bin directory. Both fixes are sketched below.
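The following lines are a sketch of the two options; adjust the paths to match your system:

# Option 1: add /usr/local/bin to secure_path (edit /etc/sudoers with visudo)
Defaults secure_path="/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin"

# Option 2: symlink crictl into a directory already on the sudo PATH
sudo ln -s /usr/local/bin/crictl /usr/bin/crictl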
Anthos VM Runtime
- Restarting a pod causes the VMs on the pod to change IP addresses or lose their IP address altogether. If the IP address of a VM changes, this does not affect the reachability of VM applications exposed as a Kubernetes service. If the IP address is lost, you must run dhclient from the VM to acquire an IP address for the VM (see the example that follows).
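For example, from a shell inside the VM (the ens3 interface name is illustrative; substitute the VM's actual interface):

sudo dhclient ens3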
Logging and monitoring
Unexpected monitoring billing
For Anthos clusters on bare metal versions 1.10 and 1.11, some customers have found unexpectedly high billing for Metrics volume on the Billing page. This issue affects you only when both of the following circumstances apply:

- Application logging and monitoring is enabled (enableStackdriverForApplications=true)
- Application Pods have the prometheus.io/scrape=true annotation
To confirm whether you are affected by this issue, list your user-defined metrics. If you see billing for unwanted metrics, then this issue applies to you.
To ensure you don't get billed extra for Metrics volume when you use application logging and monitoring, use the following steps:

Find the source Pods and Services that have the unwanted billed metrics:

kubectl --kubeconfig KUBECONFIG get pods -A -o yaml | grep 'prometheus.io/scrape: "true"'
kubectl --kubeconfig KUBECONFIG get services -A -o yaml | grep 'prometheus.io/scrape: "true"'

Remove the prometheus.io/scrape=true annotation from the Pod or Service (see the sketch after these steps).
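The following command is a sketch of removing the annotation from a Pod; POD_NAME and NAMESPACE are placeholders. If the annotation comes from a workload template (for example, a Deployment), remove it from the template instead so the change persists:

kubectl --kubeconfig KUBECONFIG -n NAMESPACE annotate pod POD_NAME prometheus.io/scrape-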
Edits to metrics-server-config aren't persisted
High pod density can, in extreme cases, create excessive logging and monitoring overhead, which can cause Metrics Server to stop and restart. You can edit the metrics-server-config ConfigMap to allocate more resources to keep Metrics Server running. However, due to reconciliation, edits made to metrics-server-config can get reverted to the default value during a cluster update or upgrade operation. Metrics Server isn't affected immediately, but the next time it restarts, it picks up the reverted ConfigMap and is vulnerable to excessive overhead again.
As a workaround, you can script the ConfigMap edit and perform it along with updates or upgrades to the cluster, as in the following sketch.
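A minimal sketch of such a script, assuming you keep your desired configuration in a local file named metrics-server-config.yaml (the filename, and the metrics-server Deployment name, are assumptions to adapt to your cluster):

# reapply-metrics-server-config.sh: run after each cluster update or upgrade
kubectl --kubeconfig KUBECONFIG -n kube-system apply -f metrics-server-config.yaml
kubectl --kubeconfig KUBECONFIG -n kube-system rollout restart deployment metrics-server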
Deprecated metrics affect the Cloud Monitoring dashboard
Several Anthos metrics have been deprecated and, starting with Anthos clusters on bare metal release 1.11, data is no longer collected for these deprecated metrics. If you use these metrics in any of your alerting policies, there won't be any data to trigger the alerting condition.
The following table lists the individual metrics that have been deprecated and the metric that replaces them.
Deprecated metrics | Replacement metric
---|---
kube_daemonset_updated_number_scheduled | kube_daemonset_status_updated_number_scheduled
kube_node_status_allocatable_cpu_cores, kube_node_status_allocatable_memory_bytes, kube_node_status_allocatable_pods | kube_node_status_allocatable
kube_node_status_capacity_cpu_cores, kube_node_status_capacity_memory_bytes, kube_node_status_capacity_pods | kube_node_status_capacity
In Anthos clusters on bare metal releases before 1.11, the policy definition file for the recommended Anthos on baremetal node cpu usage exceeds 80 percent (critical) alert uses the deprecated metrics. The node-cpu-usage-high.json JSON definition file is updated for releases 1.11.0 and later.
Use the following steps to migrate to the replacement metrics:
In the Google Cloud console, select Monitoring.
In the navigation pane, select Dashboards, and delete the Anthos cluster node status dashboard.
Click the Sample library tab and reinstall the Anthos cluster node status dashboard.
Follow the instructions in Creating alerting policies to create a policy using the updated
node-cpu-usage-high.json
policy definition file.
Unknown metric data in Cloud Monitoring
The data in Cloud Monitoring for version 1.10.x clusters may contain irrelevant summary metrics entries such as the following:
Unknown metric: kubernetes.io/anthos/go_gc_duration_seconds_summary_percentile
Other metrics types that may have irrelevant summary metrics include:
apiserver_admission_step_admission_duration_seconds_summary
go_gc_duration_seconds
scheduler_scheduling_duration_seconds
Ignore these invalid summary metrics.
For more information about supported metrics for Anthos clusters on bare metal, see View Anthos clusters on bare metal metrics.
Intermittent metrics export interruptions
Anthos clusters on bare metal release 1.11.0 may experience interruptions in normal, continuous exporting of metrics, or missing metrics on some nodes. If this issue affects your clusters, you may see gaps in data for the following metrics (at a minimum):
kubernetes.io/anthos/container_memory_working_set_bytes
kubernetes.io/anthos/container_cpu_usage_seconds_total
kubernetes.io/anthos/container_network_receive_bytes_total
To fix this issue, upgrade your clusters to version 1.11.1 or later.
If you can't upgrade, perform the following steps as a workaround:
Open your stackdriver resource for editing:

kubectl -n kube-system edit stackdriver stackdriver

To increase the CPU request for gke-metrics-agent from 10m to 50m, add the following resourceAttrOverride section to the stackdriver manifest:

spec:
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
Your edited resource should look similar to the following:

spec:
  anthosDistribution: baremetal
  clusterLocation: us-west1-a
  clusterName: my-cluster
  enableStackdriverForApplications: true
  gcpServiceAccountSecretName: ...
  optimizedMetrics: true
  portable: true
  projectID: my-project-191923
  proxyConfigSecretName: ...
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
Save your changes and close the text editor.
To verify your changes have taken effect, run the following command:
kubectl -n kube-system get daemonset gke-metrics-agent -o yaml | grep "cpu: 50m"
The command finds cpu: 50m if your edits have taken effect.
Networking
NAT failure with too many parallel connections
For a given node in your cluster, the node IP address provides network address
translation (NAT) for packets routed to an address outside of the cluster.
Similarly, when inbound packets enter a load-balancing node configured to use
bundled load balancing (spec.loadBalancer.mode: bundled
), source network
address translation (SNAT) routes the packets to the node IP address before they
are forwarded on to a backend Pod.
The port range for NAT used by Anthos clusters on bare metal is 32768–65535. This range limits the number of parallel connections to 32,767 per protocol on that node. Each connection needs an entry in the conntrack table. If you have too many short-lived connections, the conntrack table runs out of ports for NAT. A garbage collector cleans up the stale entries, but the cleanup isn't immediate.
When the number of connections on your node approaches 32,767, you will start seeing packet drops for connections that need NAT. The workaround for this problem is to redistribute your traffic to other nodes.
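To gauge how close a node is to the limit, you can compare overall conntrack usage against the configured maximum. This is a rough proxy for NAT pressure, and the conntrack command assumes the conntrack-tools package is installed on the node:

sudo conntrack -C
cat /proc/sys/net/netfilter/nf_conntrack_max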
You can identify this problem by running the following command on the anetd Pod on the problematic node:

kubectl -n kube-system exec anetd-XXX -- hubble observe --from-ip $IP --to-ip $IP -f
You should see errors of the following form:
No mapping for NAT masquerade DROPPED
Client source IP with bundled Layer 2 load balancing
Setting the external traffic policy to Local can cause routing errors, such as No route to host, for bundled Layer 2 load balancing. The external traffic policy is set to Cluster (externalTrafficPolicy: Cluster) by default. With this setting, Kubernetes handles cluster-wide traffic. Services of type LoadBalancer or NodePort can use externalTrafficPolicy: Local to preserve the client source IP address. With this setting, however, Kubernetes only handles node-local traffic.
If you want to preserve the client source IP address, additional configuration may be required to ensure service IPs are reachable. For configuration details, see Preserving client source IP address in Configure bundled load balancing.
Modifying firewalld will erase Cilium iptables policy chains
When running Anthos clusters on bare metal with firewalld enabled on either CentOS or Red Hat Enterprise Linux (RHEL), changes to firewalld can remove the Cilium iptables chains on the host network. The iptables chains are added by the anetd Pod when it is started. The loss of the Cilium iptables chains causes the Pod on the Node to lose network connectivity outside of the Node.
Changes to firewalld that will remove the iptables chains include, but aren't limited to:

- Restarting firewalld, using systemctl
- Reloading firewalld with the command line client (firewall-cmd --reload)
You can fix this connectivity issue by restarting anetd on the Node. Locate and delete the anetd Pod with the following commands to restart anetd:
kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ
Replace ANETD_XYZ with the name of the anetd Pod.
Duplicate egressSourceIP addresses
When using the egress NAT gateway feature preview, it is possible to set traffic selection rules that specify an egressSourceIP address that is already in use for another EgressNATPolicy object. This may cause egress traffic routing conflicts. Coordinate with your development team to determine which floating IP addresses are available for use before specifying the egressSourceIP address in your EgressNATPolicy custom resource.
Pod connectivity failures and reverse path filtering
Anthos clusters on bare metal configures reverse path filtering on nodes to disable source validation (net.ipv4.conf.all.rp_filter=0). If the rp_filter setting is changed to 1 or 2, pods will fail due to out-of-node communication timeouts.
Reverse path filtering is set with rp_filter files in the IPv4 configuration folder (net/ipv4/conf/all). This value may also be overridden by sysctl, which stores reverse path filtering settings in a network security configuration file, such as /etc/sysctl.d/60-gce-network-security.conf.
To restore Pod connectivity, either set net.ipv4.conf.all.rp_filter back to 0 manually (an example follows the commands below), or restart the anetd Pod to set net.ipv4.conf.all.rp_filter back to 0. To restart the anetd Pod, use the following commands to locate and delete the anetd Pod; a new anetd Pod will start up in its place:
kubectl get pods -n kube-system
kubectl delete pods -n kube-system ANETD_XYZ
Replace ANETD_XYZ with the name of the anetd Pod.
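To set the value manually instead, run the following command on the node. The change doesn't persist across reboots unless you also add it to a sysctl configuration file:

sudo sysctl -w net.ipv4.conf.all.rp_filter=0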
Bootstrap (kind) cluster IP addresses and cluster node IP addresses overlapping
192.168.122.0/24 and 10.96.0.0/27 are the default pod and service CIDRs used by the bootstrap (kind) cluster. Preflight checks will fail if they overlap with cluster node machine IP addresses. To avoid the conflict, you can pass the --bootstrap-cluster-pod-cidr and --bootstrap-cluster-service-cidr flags to bmctl to specify different values, as in the example that follows.
Operating system
Incompatibility with Ubuntu 18.04.6 on GA kernel
Anthos clusters on bare metal versions 1.11.0 and 1.11.1 aren't compatible with Ubuntu 18.04.6 on the GA kernel (from 4.15.0-144-generic to 4.15.0-176-generic). The incompatibility causes the networking agent to fail to configure the cluster network with a "BPF program is too large" error in the anetd logs. You may see pods stuck in ContainerCreating status with a networkPlugin cni failed to set up pod error in the Pods' event log. This issue doesn't apply to the Ubuntu Hardware Enablement (HWE) kernels.
We recommend that you get the HWE kernel and upgrade it to the latest supported HWE version for Ubuntu 18.04.
Cluster creation or upgrade fails on CentOS
In December 2020, the CentOS community and Red Hat announced the sunset of
CentOS.
On January 31, 2022, CentOS 8 reached its end of life (EOL). As a result of the
EOL, yum
repositories stopped working for CentOS, which causes cluster
creation and cluster upgrade operations to fail. This applies to all supported
versions of CentOS and affects all versions of Anthos clusters on bare metal.
As a workaround, run the following commands to have your CentOS use an archive feed:
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' \
/etc/yum.repos.d/CentOS-Linux-*
As a long-term solution, consider migrating to another supported operating system.
Operating system endpoint limitations
On RHEL and CentOS, there is a cluster level limitation of 100,000 endpoints.
Kubernetes service. If 2 services reference the same set of pods, this counts
as 2 separate sets of endpoints. The underlying nftable
implementation on
RHEL and CentOS causes this limitation; it is not an intrinsic limitation of
Anthos clusters on bare metal.
Security
Container can't write to VOLUME defined in Dockerfile with containerd and SELinux
If you use containerd as the container runtime and your operating system has
SELinux enabled, the VOLUME
defined in the application Dockerfile might not be
writable. For example, containers built with the following Dockerfile aren't
able to write to the /tmp
folder.
FROM ubuntu:20.04
RUN chmod -R 777 /tmp
VOLUME /tmp
To verify if you're affected by this issue, run the following command on the node that hosts the problematic container:
ausearch -m avc
If you're affected by this issue, you see a denied
error like the
following:
time->Mon Apr 4 21:01:32 2022 type=PROCTITLE msg=audit(1649106092.768:10979):
proctitle="bash" type=SYSCALL msg=audit(1649106092.768:10979): arch=c000003e
syscall=257 success=no exit=-13 a0=ffffff9c a1=55eeba72b320 a2=241 a3=1b6
items=0 ppid=75712 pid=76042 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0
egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="bash" exe="/usr/bin/bash"
subj=system_u:system_r:container_t:s0:c701,c935 key=(null) type=AVC
msg=audit(1649106092.768:10979): avc: denied {
write } for pid=76042 comm="bash"
name="ad9bc6cf14bfca03d7bb8de23c725a86cb9f50945664cb338dfe6ac19ed0036c"
dev="sda2" ino=369501097 scontext=system_u:system_r:container_t:s0:c701,c935
tcontext=system_u:object_r:container_ro_file_t:s0 tclass=dir permissive=0
To work around this issue, make either of the following changes:
- Turn off SELinux.
- Don't use the VOLUME feature inside Dockerfile.
SELinux errors during pod creation
Pod creation sometimes fails when SELinux prevents the container runtime
from setting labels on tmpfs
mounts. This failure is rare, but can happen when
SELinux is in Enforcing
mode and in some kernels.
To verify that SELinux is the cause of pod creation failures, use the following
command to check for errors in the kubelet
logs:
journalctl -u kubelet
If SELinux is causing pod creation to fail, the command response contains an error similar to the following:
error setting label on mount source '/var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x':
failed to set file label on /var/lib/kubelet/pods/
6d9466f7-d818-4658-b27c-3474bfd48c79/volumes/kubernetes.io~secret/localpv-token-bpw5x:
permission denied
To verify that this issue is related to SELinux enforcement, run the following command:
ausearch -m avc
This command searches the audit logs for access vector cache (AVC) permission
errors. The avc: denied
in the following sample response confirms that the pod
creation failures are related to SELinux enforcement.
type=AVC msg=audit(1627410995.808:9534): avc: denied { associate } for
pid=20660 comm="dockerd" name="/" dev="tmpfs" ino=186492
scontext=system_u:object_r:container_file_t:s0:c61,c201
tcontext=system_u:object_r:locale_t:s0 tclass=filesystem permissive=0
The root cause of this pod creation problem with SELinux is a kernel bug found in the following Linux images:
- Red Hat Enterprise Linux (RHEL) releases prior to 8.3
- CentOS releases prior to 8.3
Rebooting the machine helps recover from the issue.
To prevent pod creation errors from occurring, use RHEL 8.3 or later or CentOS 8.3 or later, because those versions have fixed the kernel bug.
Reset/Deletion
Namespace deletion
Deleting a namespace will prevent new resources from being created in that namespace, including jobs to reset machines. When deleting a user cluster, you must delete the cluster object first before deleting its namespace, as shown below. Otherwise, the jobs to reset machines cannot get created, and the deletion process will skip the machine clean-up step.
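A sketch of the required order, run against the admin cluster and assuming the default cluster-NAME namespace convention (the fully qualified resource name avoids ambiguity with other cluster types):

kubectl --kubeconfig ADMIN_KUBECONFIG delete clusters.baremetal.cluster.gke.io USER_CLUSTER_NAME \
    -n cluster-USER_CLUSTER_NAME
kubectl --kubeconfig ADMIN_KUBECONFIG delete namespace cluster-USER_CLUSTER_NAME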
containerd service
The bmctl reset
command doesn't delete any containerd
configuration files or
binaries. The containerd systemd
service is left up and running.
The command deletes the containers running pods scheduled to the node.