This page explains how to debug node issues on Anthos clusters on VMware (GKE on-prem) using a suite of preinstalled debugging tools.
Each Anthos clusters on VMware (GKE on-prem) cluster you create is composed of several
nodes. Each node includes a distribution of
toolbox, a shell
script that unpacks and runs a debugging container.
debug-toolbox is a container image that includes several useful debugging tools.
If you encounter issues with a specific node, you can attempt debugging by
connecting to the affected node, running the
toolbox script to unpack and run the
debug-toolbox container, and then running the tools included in the container.
Tools included in debug-toolbox
The debug-toolbox container runs a Debian base image that includes the
preinstalled debugging tools.
Because these tools are included in the container, they don't require an internet
connection. If you want to install additional debugging tools, you can use
apt-get, which does require an internet connection.
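As a minimal sketch of installing an extra tool from inside the debug-toolbox container, you might wrap apt-get as shown below. The function name install_tool is hypothetical and is not part of the toolbox distribution; the sketch assumes that the container has internet access and that you have permission to install packages.

```shell
# Hypothetical helper (not part of debug-toolbox): install one extra
# package inside the container. Requires an internet connection.
install_tool() {
  pkg="$1"
  [ -n "$pkg" ] || { echo "usage: install_tool PACKAGE_NAME" >&2; return 1; }
  # Refresh the package index, then install without prompting.
  apt-get update && apt-get install -y "$pkg"
}

# Example (commented out so that sourcing this file has no side effects):
# install_tool PACKAGE_NAME
```

Packages installed this way are an addition to the image's preinstalled set, so you might need to repeat the installation in a fresh container.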
Running the toolbox script starts the debug-toolbox container.
While inside the container, run one of the included tools.
When you're finished, exit the container and close the SSH connection to the node.
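The session described above could be opened in one step with a small wrapper like the following. This is a sketch under stated assumptions rather than the documented procedure: the function name open_toolbox, the NODE_USER variable, and its default value are all hypothetical, and the sketch assumes that the toolbox script is on the remote user's PATH. Depending on the node image, the script might need elevated privileges.

```shell
# Assumption: NODE_USER and its default are illustrative. Use the SSH user
# and credentials that apply to your cluster nodes.
NODE_USER="${NODE_USER:-ubuntu}"

# Hypothetical helper: connect to a node and start the debug-toolbox container.
open_toolbox() {
  node_ip="$1"
  if [ -z "$node_ip" ]; then
    echo "usage: open_toolbox NODE_IP" >&2
    return 1
  fi
  # -t allocates a terminal so that the container shell is interactive.
  # Assumes that the toolbox script is on the remote user's PATH.
  ssh -t "${NODE_USER}@${node_ip}" toolbox
}

# Example (commented out so that sourcing this file has no side effects):
# open_toolbox NODE_IP
```

Exiting the container shell also ends the SSH session, because ssh runs only the single remote command.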
Node Problem Detector
Beginning with Anthos clusters on VMware version 1.4, Node Problem Detector,
which is enabled for all the nodes in a cluster, helps you quickly detect
some common node problems. Node Problem Detector continuously checks for possible
problems and reports them as events and conditions on the node. If a node
misbehaves, you can check whether Node Problem Detector detected the problem by
running kubectl describe on the node and looking for the corresponding events
and conditions.
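The check described above might look like the following sketch. The function name check_node is hypothetical, and the sketch assumes that kubectl is already configured to talk to the cluster that contains the node.

```shell
# Hypothetical helper: show the events and conditions reported for a node.
check_node() {
  node="$1"
  [ -n "$node" ] || { echo "usage: check_node NODE_NAME" >&2; return 1; }
  # Full description, including the Conditions and Events sections.
  kubectl describe node "$node"
  # Compact view: one line per condition with its type, status, and reason.
  kubectl get node "$node" -o \
    jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Example (commented out so that sourcing this file has no side effects):
# check_node NODE_NAME
```

The second command is optional. It only condenses the conditions that kubectl describe already prints.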
The Node Problem Detector monitors generate several conditions on the node. If the
reported condition indicates an unhealthy kubelet or container runtime, such as
ContainerRuntimeUnhealthy, restarting the corresponding
systemd service (kubelet or Docker) might help
make the node healthy again.
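A manual restart on the affected node could be sketched as follows. The function name restart_node_service is hypothetical. The sketch assumes that you are connected to the node, that you have sudo rights, and that the systemd units are named kubelet and docker.

```shell
# Hypothetical helper, run on the affected node: restart the systemd
# service that matches the unhealthy component, then report its state.
restart_node_service() {
  case "$1" in
    kubelet|docker) ;;
    *) echo "usage: restart_node_service kubelet|docker" >&2; return 1 ;;
  esac
  # is-active prints the unit state and fails if the unit is not active.
  sudo systemctl restart "$1" && sudo systemctl is-active "$1"
}

# Example (commented out so that sourcing this file has no side effects):
# restart_node_service kubelet
```

After the restart, it can take some time before the node's conditions reflect the change.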
Beginning with Anthos clusters on VMware version 1.5, automatic repair of the
kubelet and docker systemd services is enabled in Node Problem Detector. If
Node Problem Detector detects an unhealthy kubelet or a
ContainerRuntimeUnhealthy condition on the node, it tries to restart the
kubelet or docker service automatically, provided that the time since the last
restart is above a certain threshold.
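The restart rule can be illustrated with a small sketch. This is not Node Problem Detector's actual code: the function name should_restart is hypothetical, and the threshold is a parameter because this page does not state its value.

```shell
# Illustrative only: decide whether an automatic restart is allowed.
# Arguments are the current time, the time of the last restart, and the
# threshold, all in seconds.
should_restart() {
  now="$1"
  last_restart="$2"
  threshold="$3"
  # Succeeds only if more than "threshold" seconds have passed.
  [ $((now - last_restart)) -gt "$threshold" ]
}

# Example (commented out so that sourcing this file has no side effects):
# should_restart "$(date +%s)" "$LAST_RESTART_EPOCH" "$THRESHOLD_SECONDS" && echo "restart allowed"
```

A rule of this kind keeps a service that fails repeatedly from being restarted in a tight loop.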