You can diagnose or check clusters to debug issues and capture a snapshot of the cluster state. Additionally, if an installation partially succeeded but the cluster returns errors or doesn't perform properly, you can try to reset the cluster.
Diagnosing clusters with bmctl check cluster
You can capture the state of your created clusters with the
bmctl check cluster command. The flags for the command let you choose the
diagnostic scope of the command so you can get focused information.
The diagnostic information can help you discover issues and debug your deployments more effectively. The command captures all relevant cluster and node configuration files for your defined scope, and then packages the information into a single tar archive.
bmctl check cluster --snapshot --cluster CLUSTER_NAME --admin-kubeconfig ADMIN_KUBECONFIG_PATH
Replace the following:
CLUSTER_NAME: the name of the target cluster.
ADMIN_KUBECONFIG_PATH: the path to the admin cluster kubeconfig file.
This command outputs a tar archive that includes relevant debug information from all system components and machines in the cluster you specified.
You can change the scope of the diagnostic information collected with the following command flags:
The --snapshot-scenario all flag increases the scope of the diagnostic snapshot to include all the Pods in the specified cluster:
bmctl check cluster --snapshot --snapshot-scenario all --cluster CLUSTER_NAME --admin-kubeconfig ADMIN_KUBECONFIG_PATH
The --snapshot-dry-run flag works in conjunction with the --snapshot-config string flag. Use the --snapshot-dry-run flag to output a configuration file that you can modify to define a custom diagnostic scope. Your scope can include specific Pods, namespaces, or node commands.
After you modify the output file created with the --snapshot-dry-run flag, you can use it as input to diagnose your specific scope with the --snapshot-config string flag, described below. If you omit this flag, a default configuration is applied.
bmctl check cluster --snapshot --snapshot-dry-run --cluster CLUSTER_NAME --admin-kubeconfig ADMIN_KUBECONFIG_PATH
The --snapshot-config flag tells the bmctl command to use the scope options specified in a snapshot configuration file. Generally, you create the snapshot configuration file with the --snapshot-dry-run flag.
bmctl check cluster --snapshot --snapshot-config SNAPSHOT_CONFIG_FILE --cluster CLUSTER_NAME --admin-kubeconfig ADMIN_KUBECONFIG_PATH
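The scope flags above can be gathered into one small wrapper. The following is a minimal sketch, not part of bmctl: the function name snapshot_scoped, the scope keywords (default, all, dry-run, config), and the BMCTL override variable are all hypothetical names introduced for illustration. Only the flags already shown in this section are used.

```shell
#!/bin/sh
# snapshot_scoped SCOPE CLUSTER_NAME ADMIN_KUBECONFIG_PATH [SNAPSHOT_CONFIG_FILE]
# Hypothetical wrapper: maps a scope keyword to the flags described above.
# Setting BMCTL=echo prints the arguments instead of running bmctl.
snapshot_scoped() {
  scope="$1"
  cluster_name="$2"
  admin_kubeconfig="$3"
  config_file="${4:-}"

  # Start from the arguments common to every variant.
  set -- check cluster --snapshot

  case "$scope" in
    default) ;;                                      # default configuration
    all)     set -- "$@" --snapshot-scenario all ;;  # include all Pods
    dry-run) set -- "$@" --snapshot-dry-run ;;       # only write a config file
    config)
      if [ -z "$config_file" ]; then
        echo "scope 'config' needs a snapshot configuration file" >&2
        return 2
      fi
      set -- "$@" --snapshot-config "$config_file" ;;
    *)
      echo "unknown scope: $scope" >&2
      return 2 ;;
  esac

  "${BMCTL:-bmctl}" "$@" --cluster "$cluster_name" \
    --admin-kubeconfig "$admin_kubeconfig"
}
```

For example, running snapshot_scoped dry-run CLUSTER_NAME ADMIN_KUBECONFIG_PATH, editing the generated file, and then running snapshot_scoped config CLUSTER_NAME ADMIN_KUBECONFIG_PATH SNAPSHOT_CONFIG_FILE follows the dry-run workflow described above.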
Diagnosing clusters when the admin cluster is unreachable
When the admin cluster is down or unreachable, use a snapshot configuration file to take a cluster snapshot. A snapshot configuration file is in YAML format. The configuration file includes the following fields for specifying how information is captured for your cluster:
numOfParallelThreads: the snapshot routine typically runs numerous commands. Multiple parallel threads help the routine execute faster. We recommend that you set numOfParallelThreads to 10, as shown in the following example. If your snapshots take too long, increase this value.
excludeWords: the snapshot contains a large quantity of data for your cluster nodes. Use excludeWords to reduce security risks when you share your snapshot. For example, exclude password so that corresponding password strings can't be identified.
nodeCommands: this section specifies the following information:
nodes: a list of IP addresses for the cluster nodes from which you want to collect information. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.
commands: a list of commands (and arguments) to run on each node. The output of each command is included in the snapshot.
nodeFiles: this section specifies the following information:
nodes: a list of IP addresses for the cluster nodes from which you want to collect files. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.
files: a list of files to retrieve from each node. When the specified files are found on a node, they are included in the snapshot.
nodeSSHKey: path to your SSH key file for your nodes. This field is required for creating a snapshot when the admin cluster is not reachable.
Use the following command to create a snapshot using a snapshot configuration file:
bmctl check cluster --snapshot --snapshot-config SNAPSHOT_CONFIG
Replace SNAPSHOT_CONFIG with the path to your snapshot configuration file.
The following sample snapshot configuration file shows the standard commands and files used for creating a snapshot. You can add more commands and files when additional diagnostic information is needed.
numOfParallelThreads: 10
excludeWords:
- password
nodeCommands:
- nodes:
  - 10.200.0.3
  - 10.200.0.4
  commands:
  - uptime
  - df --all --inodes
  - ip addr
  - ip neigh
  - iptables-save --counters
  - mount
  - ip route list table all
  - top -bn1 || true
  - docker info || true
  - docker ps -a || true
  - crictl ps -a || true
  - docker ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs sudo docker logs || true
  - docker ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 | xargs sudo docker logs || true
  - crictl ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs sudo crictl logs || true
  - crictl ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 | xargs sudo crictl logs || true
  - ps -edF
  - ps -eo pid,tid,ppid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm,args,cgroup
  - conntrack --count
  - dmesg
  - systemctl status -l docker || true
  - journalctl --utc -u docker
  - journalctl --utc -u docker-monitor.service
  - systemctl status -l kubelet
  - journalctl --utc -u kubelet
  - journalctl --utc -u kubelet-monitor.service
  - journalctl --utc --boot --dmesg
  - journalctl --utc -u node-problem-detector
  - systemctl status -l containerd || true
  - journalctl --utc -u containerd
  - systemctl status -l docker.haproxy || true
  - journalctl --utc -u docker.haproxy
  - systemctl status -l docker.keepalived || true
  - journalctl --utc -u docker.keepalived
  - systemctl status -l container.haproxy || true
  - journalctl --utc -u container.haproxy
  - systemctl status -l container.keepalived || true
  - journalctl --utc -u container.keepalived
nodeFiles:
- nodes:
  - 10.200.0.3
  - 10.200.0.4
  files:
  - /proc/sys/fs/file-nr
  - /proc/sys/net/netfilter/nf_conntrack_max
  - /proc/sys/net/ipv4/conf/all/rp_filter
  - /lib/systemd/system/kubelet.service
  - /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  - /lib/systemd/system/docker.service || true
  - /etc/systemd/system/containerd.service || true
  - /etc/docker/daemon.json || true
  - /etc/containerd/config.toml || true
  - /etc/systemd/system/container.keepalived.service || true
  - /etc/systemd/system/container.haproxy.service || true
  - /etc/systemd/system/docker.keepalived.service || true
  - /etc/systemd/system/docker.haproxy.service || true
nodeSSHKey: ~/.ssh/id_rsa # path to the SSH key file for your nodes
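When the admin cluster is unreachable and you need a configuration file quickly, a small script can generate a reduced version of the sample above for a given set of nodes. The following is a sketch under stated assumptions: write_snapshot_config is a hypothetical helper name, and the short command and file lists are an arbitrary subset of the sample, chosen only to keep the example brief.

```shell
#!/bin/sh
# write_snapshot_config OUT_FILE SSH_KEY_PATH NODE_IP [NODE_IP...]
# Hypothetical helper: writes a reduced snapshot configuration file that
# uses the fields described above. Extend the commands and files lists
# with entries from the full sample as needed.
write_snapshot_config() {
  out_file="$1"
  ssh_key="$2"
  shift 2

  # At least one node IP is required when the admin cluster is unreachable.
  if [ "$#" -eq 0 ]; then
    echo "specify at least one node IP address" >&2
    return 2
  fi

  # Render the node list once; it is reused for nodeCommands and nodeFiles.
  node_lines=""
  for ip in "$@"; do
    node_lines="${node_lines}  - ${ip}
"
  done

  {
    printf 'numOfParallelThreads: 10\n'
    printf 'excludeWords:\n- password\n'
    printf 'nodeCommands:\n- nodes:\n%s' "$node_lines"
    printf '  commands:\n  - uptime\n  - ip addr\n  - journalctl --utc -u kubelet\n'
    printf 'nodeFiles:\n- nodes:\n%s' "$node_lines"
    printf '  files:\n  - /proc/sys/fs/file-nr\n'
    printf 'nodeSSHKey: %s\n' "$ssh_key"
  } > "$out_file"
}
```

Calling write_snapshot_config snapshot-config.yaml ~/.ssh/id_rsa 10.200.0.3 10.200.0.4 writes a file that can then be passed to the --snapshot-config flag. The file name snapshot-config.yaml is only an example.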
Create a snapshot for stuck install/upgrade of admin cluster
When installing admin, hybrid, or standalone clusters, bmctl might get stuck at one of the following outputs:
- Waiting for cluster kubeconfig to become ready
- Waiting for cluster to become ready
- Waiting for node pools to become ready
When upgrading admin, hybrid, or standalone clusters, bmctl might get stuck at the following output:
- Waiting for upgrade to complete
In either case, you can run the following command to take a snapshot using the bootstrap cluster:
bmctl check cluster --snapshot --kubeconfig WORKSPACE_DIR/.kindkubeconfig --cluster CLUSTER_NAME
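Because this command depends on the bootstrap cluster's kubeconfig file being present in the workspace, it can help to check for the file first. The following is a minimal sketch: snapshot_bootstrap and the BMCTL override variable are hypothetical names, and the check itself is an added convenience, not something bmctl requires.

```shell
#!/bin/sh
# snapshot_bootstrap WORKSPACE_DIR CLUSTER_NAME
# Hypothetical helper: verifies that the bootstrap kubeconfig exists, then
# runs the snapshot command shown above. Setting BMCTL=echo prints the
# arguments instead of running bmctl.
snapshot_bootstrap() {
  workspace_dir="$1"
  cluster_name="$2"
  kind_kubeconfig="$workspace_dir/.kindkubeconfig"

  if [ ! -f "$kind_kubeconfig" ]; then
    echo "bootstrap kubeconfig not found: $kind_kubeconfig" >&2
    return 1
  fi

  "${BMCTL:-bmctl}" check cluster --snapshot \
    --kubeconfig "$kind_kubeconfig" \
    --cluster "$cluster_name"
}
```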
Resetting clusters with bmctl reset cluster
When a cluster fails to install correctly, you can try to return the nodes to a clean state by resetting it. Then you can re-install the cluster after making configuration changes.
Resetting self-managed clusters
To reset a cluster that manages itself, such as an admin cluster, issue the following command:
bmctl reset --cluster CLUSTER_NAME
Replace CLUSTER_NAME with the name of the cluster you're resetting.
Resetting user clusters
To reset a user cluster, issue the following command:
bmctl reset --cluster CLUSTER_NAME --admin-kubeconfig ADMIN_KUBECONFIG_PATH
Replace CLUSTER_NAME with the name of the user cluster you're resetting and replace ADMIN_KUBECONFIG_PATH with the path to the associated admin cluster's kubeconfig file. bmctl supports the use of --kubeconfig as an alias for the --admin-kubeconfig flag.
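The two reset forms differ only in whether an admin kubeconfig path is passed. The following sketch combines them in one function. The name reset_cluster and the BMCTL override variable are hypothetical; treating an omitted second argument as "self-managed cluster" is a convention of this example, not bmctl behavior.

```shell
#!/bin/sh
# reset_cluster CLUSTER_NAME [ADMIN_KUBECONFIG_PATH]
# Hypothetical helper: with one argument it issues the self-managed reset,
# with two arguments it issues the user cluster reset.
# Setting BMCTL=echo prints the arguments instead of running bmctl.
reset_cluster() {
  cluster_name="${1:-}"
  admin_kubeconfig="${2:-}"

  if [ -z "$cluster_name" ]; then
    echo "usage: reset_cluster CLUSTER_NAME [ADMIN_KUBECONFIG_PATH]" >&2
    return 2
  fi

  if [ -n "$admin_kubeconfig" ]; then
    # User cluster: the admin cluster's kubeconfig is required.
    "${BMCTL:-bmctl}" reset --cluster "$cluster_name" \
      --admin-kubeconfig "$admin_kubeconfig"
  else
    # Self-managed cluster, such as an admin cluster.
    "${BMCTL:-bmctl}" reset --cluster "$cluster_name"
  fi
}
```

Because a reset deletes data on the nodes, rehearsing with BMCTL=echo reset_cluster CLUSTER_NAME first shows the exact arguments without changing anything.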
Reset cluster details
Regardless of cluster type, the reset command applies to the entire cluster. There is no option to target a subset of nodes in a cluster.
Output from the bmctl reset command looks similar to this sample:
bmctl reset --cluster cluster1
Creating bootstrap cluster... OK
Deleting GKE Hub member admin in project my-gcp-project...
Successfully deleted GKE Hub member admin in project my-gcp-project
Loading images... OK
Starting reset jobs...
Resetting: 1 Completed: 0 Failed: 0
...
Resetting: 0 Completed: 1 Failed: 0
Flushing logs... OK
During the reset operation,
bmctl first attempts to delete the GKE Hub
membership registration, and then cleans up the affected nodes. During the
reset, storage mounts and data from the
are also deleted.
For all nodes, bmctl runs
kubeadm reset, removes the tunnel interfaces
used for cluster networking, and deletes the following directories:
For load balancer Nodes,
bmctl also performs the following actions:
- Deletes the configuration files for
The reset tool expects the cluster configuration file to be at the following location under the current working directory:
bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME.yaml
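Since the reset tool looks for the configuration file relative to the current working directory, a quick check before resetting can catch a wrong directory early. The following is a sketch, assuming that both path components are the cluster name. The function names cluster_config_path and check_reset_config are hypothetical.

```shell
#!/bin/sh
# cluster_config_path CLUSTER_NAME
# Prints the expected configuration file path, relative to the current
# working directory (assumes both path components are the cluster name).
cluster_config_path() {
  printf 'bmctl-workspace/%s/%s.yaml\n' "$1" "$1"
}

# check_reset_config CLUSTER_NAME
# Hypothetical pre-flight check: reports whether the configuration file is
# where the reset tool expects it.
check_reset_config() {
  config="$(cluster_config_path "$1")"
  if [ -f "$config" ]; then
    echo "found: $config"
  else
    echo "missing: $config (run from the directory that contains bmctl-workspace)" >&2
    return 1
  fi
}
```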