Troubleshoot cluster install or upgrade issues

This page provides troubleshooting information if you have problems when you install or upgrade GKE on Bare Metal.

Bootstrap cluster issues

When GKE on Bare Metal creates or upgrades clusters, it deploys a Kubernetes in Docker (kind) cluster to temporarily host the Kubernetes controllers needed to create or upgrade clusters. This transient cluster is called a bootstrap cluster.

If a kind cluster already exists in your deployment when you attempt to install, GKE on Bare Metal deletes the existing kind cluster. Deletion only happens after the installation or upgrade is successful. To preserve the existing kind cluster even after success, use the --keep-bootstrap-cluster flag of bmctl.

GKE on Bare Metal creates a configuration file for the bootstrap cluster under WORKSPACE_DIR/.kindkubeconfig. You can connect to the bootstrap cluster only during cluster creation and upgrade.
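
As a minimal sketch of how you might locate that file before running kubectl against the bootstrap cluster, the following helper checks whether the kubeconfig exists. The function name bootstrap_kubeconfig and the default workspace directory bmctl-workspace are assumptions for illustration, not part of bmctl.

```shell
# Hypothetical helper: print the bootstrap kubeconfig path if the file exists.
# The default workspace directory (bmctl-workspace) is an assumption; pass your
# own WORKSPACE_DIR as the first argument if it differs.
bootstrap_kubeconfig() {
  workspace_dir="${1:-bmctl-workspace}"
  kubeconfig="${workspace_dir}/.kindkubeconfig"
  if [ -f "$kubeconfig" ]; then
    printf '%s\n' "$kubeconfig"
  else
    echo "No bootstrap kubeconfig at ${kubeconfig}; is a create or upgrade running?" >&2
    return 1
  fi
}

# Example usage (requires kubectl and a running bootstrap cluster):
#   kubectl get pods --all-namespaces --kubeconfig "$(bootstrap_kubeconfig)"
```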

The bootstrap cluster needs to access a Docker repository to pull images. The registry defaults to Container Registry unless you are using a private registry. During cluster creation, bmctl creates the following files:

  • bmctl-workspace/config.json: Contains Google Cloud service account credentials for the registry access. The credentials are obtained from the gcrKeyPath field in the cluster configuration file.

  • bmctl-workspace/config.toml: Contains the containerd configuration in the kind cluster.

Debug the bootstrap cluster

To debug the bootstrap cluster, you can take the following steps:

  • Connect to the bootstrap cluster during cluster creation and upgrade.
  • Get the logs of the bootstrap cluster.

You can find the logs on the machine you use to run bmctl in the following folders:

  • bmctl-workspace/CLUSTER_NAME/log/create-cluster-TIMESTAMP/bootstrap-cluster/
  • bmctl-workspace/CLUSTER_NAME/log/upgrade-cluster-TIMESTAMP/bootstrap-cluster/

Replace CLUSTER_NAME and TIMESTAMP with the name of your cluster and the corresponding system's time.
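
If several create or upgrade attempts have left multiple timestamped folders, a small helper can pick the most recently modified one. This is a sketch only: the function name latest_bootstrap_logs and the default workspace directory bmctl-workspace are assumptions, and the helper selects by modification time rather than by parsing TIMESTAMP.

```shell
# Hypothetical helper: print the most recently modified bootstrap-cluster log
# folder for a cluster.
# Arguments: CLUSTER_NAME, operation (create or upgrade), workspace directory.
latest_bootstrap_logs() {
  cluster_name="$1"
  operation="${2:-create}"
  workspace_dir="${3:-bmctl-workspace}"
  # ls -t sorts by modification time, newest first; -d lists the folders
  # themselves instead of their contents.
  ls -td "${workspace_dir}/${cluster_name}/log/${operation}-cluster-"*/bootstrap-cluster/ \
    2>/dev/null | head -n 1
}

# Example usage:
#   ls "$(latest_bootstrap_logs CLUSTER_NAME upgrade)"
```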

To get the logs from the bootstrap cluster directly, you can run the following command during cluster creation and upgrade:

docker exec -it bmctl-control-plane bash

The command opens a terminal inside the bmctl control plane container that runs in the bootstrap cluster.

To inspect the kubelet and containerd logs, use the following commands and look for errors or warnings in the output:

journalctl -u kubelet
journalctl -u containerd
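
Because these logs can be long, one option is to pipe them through a filter that keeps only lines that mention errors, warnings, or failures. The following is a rough sketch; the function name filter_log_problems and the choice of keywords are assumptions, and a keyword match can miss problems or flag harmless lines.

```shell
# Hypothetical helper: read log text on stdin and print only the lines that
# mention an error, warning, or failure (case-insensitive).
filter_log_problems() {
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # being treated as a failure by the caller.
  grep -iE 'error|warn|fail' || true
}

# Example usage inside the container:
#   journalctl -u kubelet --no-pager | filter_log_problems
#   journalctl -u containerd --no-pager | filter_log_problems
```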

Cluster upgrade issues

When you upgrade GKE on Bare Metal, you can monitor the progress and check the status of your clusters and nodes. The following guidance can help you determine whether the upgrade is continuing as normal or there's a problem.

Monitor the upgrade progress

Use the kubectl describe cluster command to view the status of a cluster during the upgrade process:

kubectl describe cluster CLUSTER_NAME \
    --namespace CLUSTER_NAMESPACE \
    --kubeconfig ADMIN_KUBECONFIG

Replace the following values:

  • CLUSTER_NAME: name of your cluster.
  • CLUSTER_NAMESPACE: the namespace of your cluster.
  • ADMIN_KUBECONFIG: the admin kubeconfig file.
    • By default, a bootstrap cluster is used for admin, hybrid, and standalone cluster upgrades. To monitor upgrade progress when a bootstrap cluster is used, specify the path to the bootstrap cluster kubeconfig file, .kindkubeconfig. This file is located in the workspace directory.

Look at the Status section of the output, which shows an aggregation of the cluster upgrade status. If the cluster reports an error, use the following sections to troubleshoot where the issue is.
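
To show only the Status section, you could filter the describe output. The sketch below assumes that the section starts with a line beginning with "Status:" at the left margin and ends at the next line that is not indented, which is how kubectl describe typically lays out a resource. The function name status_section is hypothetical.

```shell
# Hypothetical helper: read "kubectl describe" output on stdin and print only
# the top-level Status section.
status_section() {
  awk '
    /^Status:/            { show = 1; print; next }  # section starts here
    show && /^[^[:space:]]/ { show = 0 }             # next top-level key ends it
    show                                             # print lines inside the section
  '
}

# Example usage:
#   kubectl describe cluster CLUSTER_NAME \
#       --namespace CLUSTER_NAMESPACE \
#       --kubeconfig ADMIN_KUBECONFIG | status_section
```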

Check if the nodes are ready

Use the kubectl get nodes command to view the status of nodes in a cluster during the upgrade process:

kubectl get nodes --kubeconfig KUBECONFIG

To check if a node has successfully completed the upgrade process, look at the VERSION and AGE columns in the command response. The VERSION is the Kubernetes version for the cluster. To see the Kubernetes version for a given GKE on Bare Metal version, see the table in Version Support Policy.

If the node shows NotReady, try to connect to the node and check the kubelet status:

systemctl status kubelet

You can also review the kubelet logs:

journalctl -u kubelet

Review the output of the kubelet status and logs for messages that indicate why the node has a problem.

Check which node is currently upgrading

To check which node in the cluster is currently in the process of being upgraded, use the kubectl get baremetalmachines command:

kubectl get baremetalmachines --namespace CLUSTER_NAMESPACE \
  --kubeconfig ADMIN_KUBECONFIG

Replace the following values:

  • CLUSTER_NAMESPACE: the namespace of your cluster.
  • ADMIN_KUBECONFIG: the admin kubeconfig file.
    • If a bootstrap cluster is used for an admin, hybrid, or standalone upgrade, specify the bootstrap cluster kubeconfig file (bmctl-workspace/.kindkubeconfig).

The following example output shows that the node currently being upgraded has an ABM VERSION different from the DESIRED ABM VERSION:

NAME         CLUSTER    READY   INSTANCEID               MACHINE      ABM VERSION   DESIRED ABM VERSION
10.200.0.2   cluster1   true    baremetal://10.200.0.2   10.200.0.2   1.13.0        1.14.0
10.200.0.3   cluster1   true    baremetal://10.200.0.3   10.200.0.3   1.13.0        1.13.0
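
To pick out such nodes without reading the table by eye, you could compare the two version columns with awk. This sketch assumes the column layout matches the preceding example exactly, so that the sixth and seventh fields of each data row are ABM VERSION and DESIRED ABM VERSION; if your output has other columns, adjust the field numbers. The function name machines_upgrading is hypothetical.

```shell
# Hypothetical helper: read "kubectl get baremetalmachines" output on stdin and
# print the NAME of each machine whose ABM VERSION differs from its
# DESIRED ABM VERSION.
machines_upgrading() {
  # NR > 1 skips the header row. Appending "" forces a string comparison so
  # that version strings are never compared as numbers.
  awk 'NR > 1 && ($6 "") != ($7 "") { print $1 }'
}

# Example usage:
#   kubectl get baremetalmachines --namespace CLUSTER_NAMESPACE \
#     --kubeconfig ADMIN_KUBECONFIG | machines_upgrading
```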

Check what nodes are currently being drained

During the upgrade process, nodes are drained of Pods, and scheduling is disabled until the node is successfully upgraded. To see which nodes are currently being drained of Pods, use the kubectl get nodes command:

kubectl get nodes --kubeconfig USER_CLUSTER_KUBECONFIG | grep "SchedulingDisabled"

Replace USER_CLUSTER_KUBECONFIG with the path to the user cluster kubeconfig file.

The STATUS column is filtered using grep to only show nodes that report SchedulingDisabled. This status indicates that the nodes are being drained.
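
Because grep matches text anywhere in a line, a slightly stricter variant checks only the STATUS column and prints just the node names. This is a sketch that assumes the default kubectl get nodes layout, where NAME is the first column and STATUS is the second. The function name draining_nodes is hypothetical.

```shell
# Hypothetical helper: read "kubectl get nodes" output on stdin and print the
# names of nodes whose STATUS column includes SchedulingDisabled.
draining_nodes() {
  # NR > 1 skips the header row; $2 is the STATUS column, which can hold
  # comma-separated values such as Ready,SchedulingDisabled.
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example usage:
#   kubectl get nodes --kubeconfig USER_CLUSTER_KUBECONFIG | draining_nodes
```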

You can also check the node status from the admin cluster:

kubectl get baremetalmachines -n CLUSTER_NAMESPACE \
  --kubeconfig ADMIN_KUBECONFIG

Replace the following values:

  • CLUSTER_NAMESPACE: the namespace of your cluster.
  • ADMIN_KUBECONFIG: the admin kubeconfig file.
    • If a bootstrap cluster is used for an admin, hybrid, or standalone upgrade, specify the bootstrap cluster kubeconfig file (bmctl-workspace/.kindkubeconfig).

The node getting drained shows the status under the MAINTENANCE column.

Check why a node has been draining for a long time

Use one of the methods in the previous section, such as the kubectl get nodes command, to identify the node being drained. Then use the kubectl get pods command and filter on this node name to view additional details:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME

Replace NODE_NAME with the name of the node being drained. The output returns a list of Pods that are currently stuck or slow to drain. The upgrade proceeds, even with stuck Pods, when the draining process on a node takes more than 20 minutes.