This page provides troubleshooting information if you have problems when you install or upgrade Google Distributed Cloud clusters.
Bootstrap cluster issues
When Google Distributed Cloud creates self-managed (admin, hybrid, or standalone) clusters, it deploys a Kubernetes in Docker (kind) cluster to temporarily host the Kubernetes controllers needed to create clusters. This transient cluster is called a bootstrap cluster. User clusters are created and upgraded by their managing admin or hybrid cluster without the use of a bootstrap cluster.
If a kind cluster already exists in your deployment when you attempt to install,
Distributed Cloud deletes the existing kind cluster. Deletion only
happens after the installation or upgrade is successful. To preserve the
existing kind cluster even after success, use the --keep-bootstrap-cluster
flag of bmctl.
Distributed Cloud creates a configuration file for the bootstrap cluster
under WORKSPACE_DIR/.kindkubeconfig. You can connect to the
bootstrap cluster only during cluster creation and upgrade.
The bootstrap cluster needs to access a Docker repository to pull images. The
registry defaults to Container Registry unless you are using a private registry.
During cluster creation, bmctl creates the following files:
- bmctl-workspace/config.json: Contains Google Cloud service account credentials for registry access. The credentials are obtained from the gcrKeyPath field in the cluster configuration file.
- bmctl-workspace/config.toml: Contains the containerd configuration in the kind cluster.
Debug the bootstrap cluster
To debug the bootstrap cluster you can take the following steps:
- Connect to the bootstrap cluster during cluster creation and upgrade.
- Get the logs of the bootstrap cluster.
You can find the logs on the machine you use to run bmctl in the following
folders:
bmctl-workspace/CLUSTER_NAME/log/create-cluster-TIMESTAMP/bootstrap-cluster/
bmctl-workspace/CLUSTER_NAME/log/upgrade-cluster-TIMESTAMP/bootstrap-cluster/
Replace CLUSTER_NAME and TIMESTAMP with the name of your cluster and the
corresponding system time.
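When a workspace holds logs from several create or upgrade attempts, a small helper can locate the most recent bootstrap log folder. The following is a minimal sketch and not part of bmctl: the function name latest_bootstrap_log_dir is hypothetical, and it assumes that TIMESTAMP values sort chronologically when compared as text.

```shell
# Hypothetical helper (not part of bmctl): print the newest bootstrap-cluster
# log folder for a cluster.
# Usage: latest_bootstrap_log_dir WORKSPACE_DIR CLUSTER_NAME create|upgrade
# Assumption: TIMESTAMP values sort chronologically when compared as text.
latest_bootstrap_log_dir() (
  workspace=$1
  cluster=$2
  operation=$3
  latest=
  # The shell expands the glob in sorted order, so the last match is the newest.
  for dir in "$workspace/$cluster/log/$operation-cluster-"*/bootstrap-cluster; do
    if [ -d "$dir" ]; then
      latest=$dir
    fi
  done
  # Print the folder, or return a non-zero status if nothing matched.
  [ -n "$latest" ] && printf '%s\n' "$latest"
)
```

For example, `ls "$(latest_bootstrap_log_dir bmctl-workspace CLUSTER_NAME upgrade)"` would list the files from the latest upgrade attempt.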
To get the logs from the bootstrap cluster directly, you can run the following command during cluster creation and upgrade:
docker exec -it bmctl-control-plane bash
The command opens a terminal inside the bmctl control plane container that runs in the bootstrap cluster.
To inspect the kubelet and containerd logs, use the following commands and
look for errors or warnings in the output:
journalctl -u kubelet
journalctl -u containerd
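Because these logs can be long, it may help to narrow the output to lines that look like problems. The following sketch defines a hypothetical filter, filter_log_problems, that keeps lines mentioning errors, warnings, or failures. The keyword list is an assumption and can hide relevant context, so treat it as a first pass rather than a replacement for reading the full log.

```shell
# Hypothetical filter: keep only log lines that mention errors, warnings,
# or failures (case-insensitive). Reads log text on stdin.
filter_log_problems() {
  grep -iE 'error|warn|fail'
}

# Usage, from the terminal inside the bootstrap cluster container:
#   journalctl -u kubelet --no-pager | filter_log_problems
#   journalctl -u containerd --no-pager | filter_log_problems
```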
Cluster upgrade issues
When you upgrade Google Distributed Cloud clusters, you can monitor the progress and check the status of your clusters and nodes.
- If you have problems during an upgrade, try to determine at which stage the failure occurs. To learn more about what happens to a cluster during the upgrade process, see Lifecycle and stages of cluster upgrades.
- To learn more about the impact of a problem during cluster upgrades, see Understand the impact of failures in Distributed Cloud.
The following guidance can help determine if the upgrade is continuing as normal or there's a problem.
Monitor the upgrade progress
Use the kubectl describe cluster command to view the status of a cluster
during the upgrade process:
kubectl describe cluster CLUSTER_NAME \
--namespace CLUSTER_NAMESPACE \
--kubeconfig ADMIN_KUBECONFIG
Replace the following values:
- CLUSTER_NAME: the name of your cluster.
- CLUSTER_NAMESPACE: the namespace of your cluster.
- ADMIN_KUBECONFIG: the admin kubeconfig file.
  - By default, admin, hybrid, and standalone clusters use an in-place upgrade. If you use the --use-bootstrap=true flag with the bmctl upgrade command, the upgrade operation uses a bootstrap cluster. To monitor upgrade progress when a bootstrap cluster is used, specify the path to the bootstrap cluster kubeconfig file, .kindkubeconfig. This file is located in the workspace directory.
Look at the Status section of the output, which shows an aggregation of the
cluster upgrade status. If the cluster reports an error, use the following
sections to troubleshoot where the issue is.
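To focus on the Status section without scrolling through the rest of the output, you can pipe the describe output through a small filter. The following sketch defines a hypothetical function, status_section. It assumes the usual kubectl describe layout, in which top-level keys such as Status: start in the first column and nested fields are indented.

```shell
# Hypothetical filter: print only the top-level "Status:" section of
# `kubectl describe` output. Reads the describe output on stdin.
status_section() {
  awk '
    /^Status:/                   { in_status = 1; print; next }
    in_status && /^[^[:space:]]/ { exit }  # the next top-level key ends the section
    in_status                    { print }
  '
}

# Usage:
#   kubectl describe cluster CLUSTER_NAME \
#     --namespace CLUSTER_NAMESPACE \
#     --kubeconfig ADMIN_KUBECONFIG | status_section
```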
Check if the nodes are ready
Use the kubectl get nodes command to view the status of nodes in a cluster
during the upgrade process:
kubectl get nodes --kubeconfig KUBECONFIG
To check if a node has successfully completed the upgrade process, look at the
VERSION and AGE columns in the command response. The VERSION is the
Kubernetes version for the cluster. To see the Kubernetes version for a given
Distributed Cloud version, see the table in Version Support Policy.
If the node shows NotReady, try to connect to the node and check the kubelet
status:
systemctl status kubelet
You can also review the kubelet logs:
journalctl -u kubelet
Review the output of the kubelet status and logs for messages that indicate why the node has a problem.
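On clusters with many nodes, the readiness and version checks described in this section can be scripted. The following sketch defines two hypothetical filters that read kubectl get nodes output. Both assume the default column order (NAME, STATUS, ROLES, AGE, VERSION), and EXPECTED_VERSION is a placeholder for the Kubernetes version you expect after the upgrade.

```shell
# Both filters read `kubectl get nodes` output on stdin and assume the default
# column order: NAME STATUS ROLES AGE VERSION.

# Print the names of nodes whose STATUS starts with NotReady.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1 }'
}

# Print "NAME VERSION" for nodes that don't report the expected version.
nodes_not_at_version() {
  awk -v want="$1" 'NR > 1 && $5 != want { print $1, $5 }'
}

# Usage:
#   kubectl get nodes --kubeconfig KUBECONFIG | not_ready_nodes
#   kubectl get nodes --kubeconfig KUBECONFIG | nodes_not_at_version EXPECTED_VERSION
```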
Check which node is upgrading
To check which node in the cluster is in the process of being upgraded, use the
kubectl get baremetalmachines command:
kubectl get baremetalmachines --namespace CLUSTER_NAMESPACE \
--kubeconfig ADMIN_KUBECONFIG
Replace the following values:
- CLUSTER_NAMESPACE: the namespace of your cluster.
- ADMIN_KUBECONFIG: the admin kubeconfig file.
  - If a bootstrap cluster is used for an admin, hybrid, or standalone upgrade, specify the bootstrap cluster kubeconfig file (bmctl-workspace/.kindkubeconfig).
The following example output shows that the node being upgraded has an ABM
VERSION different from the DESIRED ABM VERSION:
NAME         CLUSTER    READY   INSTANCEID               MACHINE      ABM VERSION   DESIRED ABM VERSION
10.200.0.2   cluster1   true    baremetal://10.200.0.2   10.200.0.2   1.13.0        1.14.0
10.200.0.3   cluster1   true    baremetal://10.200.0.3   10.200.0.3   1.13.0        1.13.0
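The version comparison in the example output can be automated with a short filter. The following sketch defines a hypothetical function, machines_pending_upgrade. It assumes the seven data columns shown in the example, so that fields 6 and 7 of each data row hold the current and desired versions.

```shell
# Hypothetical filter: print machines whose ABM VERSION differs from the
# DESIRED ABM VERSION. Reads `kubectl get baremetalmachines` output on stdin.
# Assumption: data rows have seven columns, with the versions in fields 6 and 7.
machines_pending_upgrade() {
  awk 'NR > 1 && $6 != $7 { print $1, $6, "->", $7 }'
}

# Usage:
#   kubectl get baremetalmachines --namespace CLUSTER_NAMESPACE \
#     --kubeconfig ADMIN_KUBECONFIG | machines_pending_upgrade
```

Applied to the example output, the filter prints `10.200.0.2 1.13.0 -> 1.14.0`.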
Check what nodes are being drained
During the upgrade process, nodes are drained of Pods, and scheduling is
disabled until the node is successfully upgraded. To see which nodes are being
drained of Pods, use the kubectl get nodes command:
kubectl get nodes --kubeconfig USER_CLUSTER_KUBECONFIG | grep "SchedulingDisabled"
Replace USER_CLUSTER_KUBECONFIG with the path to the user cluster kubeconfig file.
The STATUS column is filtered using grep to only show nodes that report
SchedulingDisabled. This status indicates that the nodes are being drained.
You can also check the node status from the admin cluster:
kubectl get baremetalmachines -n CLUSTER_NAMESPACE \
--kubeconfig ADMIN_KUBECONFIG
Replace the following values:
- CLUSTER_NAMESPACE: the namespace of your cluster.
- ADMIN_KUBECONFIG: the admin kubeconfig file.
  - If a bootstrap cluster is used for an admin, hybrid, or standalone upgrade, specify the bootstrap cluster kubeconfig file (bmctl-workspace/.kindkubeconfig).
The node getting drained shows the status under the MAINTENANCE column.
Check why a node has been in the status of draining for a long time
Use one of the methods in the previous section to identify the node that is
being drained. Then use the kubectl get pods command and filter on this node
name to view additional details:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME
Replace NODE_NAME with the name of the node being drained. The output returns
a list of Pods that are stuck or slow to drain. The upgrade proceeds, even with
stuck Pods, when the draining process on a node takes more than 20 minutes.
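When the node still hosts many Pods, summarizing the list can make the slow ones easier to spot. The following sketch defines two hypothetical filters that read the kubectl get pods output from the previous command. They assume that STATUS is the fourth column, which holds when --all-namespaces is used. Pods that are blocked from eviction can still report Running, so the summary is a starting point rather than a complete diagnosis.

```shell
# Both filters read `kubectl get pods --all-namespaces -o wide` output on stdin
# and assume that STATUS is the fourth column.

# Count Pods per STATUS value, sorted by status name.
pod_status_summary() {
  awk 'NR > 1 { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Print "NAMESPACE/NAME STATUS" for Pods that are neither Running nor Completed.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage:
#   kubectl get pods --all-namespaces -o wide \
#     --field-selector spec.nodeName=NODE_NAME | pod_status_summary
```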
Check why Pods are unhealthy
Upgrades can fail if a Pod contains upgrade-first-node or upgrade-node
control plane IP addresses. This behavior is usually because the static Pods
aren't healthy.
Check the static Pods with the crictl ps -a command and look for any crashing Kubernetes or etcd Pods. If you have any failed Pods, review the logs for the Pods to see why they're crashing.
Some possibilities for crashloop behavior include the following:
- Permissions or owner of files mounted to static Pods aren't correct.
- Connectivity to the virtual IP address doesn't work.
- Issues with etcd.
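To pick out crashing containers from the crictl ps -a output, you can filter for containers in the Exited state. The following sketch defines a hypothetical function, exited_containers. It assumes that the STATE column is followed by the NAME and ATTEMPT columns, as in typical crictl output. A high attempt count suggests that the container is restarting repeatedly.

```shell
# Hypothetical filter: list exited containers from `crictl ps -a` output.
# Reads the output on stdin and prints: CONTAINER_ID NAME attempt=N
# The CREATED column contains spaces (for example "2 hours ago"), so the
# filter locates the Exited token instead of relying on a fixed column number.
# Assumption: STATE is followed by the NAME and ATTEMPT columns.
exited_containers() {
  awk 'NR > 1 {
    for (i = 1; i < NF; i++) {
      if ($i == "Exited") { print $1, $(i + 1), "attempt=" $(i + 2); break }
    }
  }'
}

# Usage:
#   crictl ps -a | exited_containers
```

You can then pass a container ID from the output to `crictl logs CONTAINER_ID` to review why the container exited.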
If the crictl ps command doesn't work or returns nothing, check the kubelet and containerd status. Use the systemctl status SERVICE and journalctl -u SERVICE commands to look at the logs.