Version 1.6. This version is no longer supported. For more information see the version support policy.

Debugging node issues

This page explains how to debug node issues on Google Distributed Cloud using a suite of preinstalled debugging tools.

Overview

Each Google Distributed Cloud cluster you create is composed of several nodes. Each Google Distributed Cloud node includes a distribution of CoreOS' toolbox, a shell script that unpacks and runs a debugging container, debug-toolbox. debug-toolbox is a container image that includes several useful debugging tools.

If you encounter issues with a specific node, you can debug it by connecting to the affected node, running the toolbox script to unpack and start the debug-toolbox container, and then running the tools included in the container.

Tools included in `debug-toolbox` container

The debug-toolbox container runs a Debian base image that includes the following packages:

bash
curl
dnsutils
hping3
iperf3
lsof
netcat
mtr
procps
strace
tcpdump
traceroute
util-linux

Because these tools are included in the container, they don't require an internet connection. To install additional debugging tools, you can use apt-get, which does require an internet connection.

Using `toolbox`

SSH into the cluster node.
Run the toolbox command:
```
sudo toolbox
```
This command starts a debug-toolbox container.
While inside the container, run one of the tools, for example tcpdump.
When you're finished, exit the container and close the SSH connection to the node.
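
As a rough illustration of what running the tools might look like once the container is started, the following sketch wraps three of the bundled tools in small helper functions. The function names, the interface name eth0, the packet count, and the hostname example.com are placeholders chosen for this example and do not come from this page.

```shell
# Intended to be run inside the debug-toolbox container.
# All defaults below (eth0, 20, example.com) are illustrative placeholders.

# Capture a small number of packets on an interface without resolving names.
# $1 = interface (default eth0), $2 = packet count (default 20)
capture_sample() {
  tcpdump -i "${1:-eth0}" -n -c "${2:-20}"
}

# Resolve a hostname with dig, which is part of the dnsutils package.
# $1 = hostname (default example.com)
resolve_name() {
  dig +short "${1:-example.com}"
}

# List TCP sockets in the LISTEN state, with numeric hosts and ports.
list_listeners() {
  lsof -nP -iTCP -sTCP:LISTEN
}
```

For instance, `capture_sample eth0 50` would capture 50 packets on eth0, assuming that interface exists on the node.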

Node Problem Detector

Beginning with Google Distributed Cloud version 1.4, Node Problem Detector is enabled on all nodes in a cluster and helps you quickly detect some common node problems. Node Problem Detector continuously checks for possible problems and reports them as events and conditions on the node. If a node misbehaves, you can check whether Node Problem Detector detected the problem by running kubectl describe on the node and looking for the corresponding events and conditions.
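
The check described above could look roughly like the following sketch. It assumes that kubectl is already configured to reach the cluster (for example through KUBECONFIG); the helper function names are made up for this example, and the node name is whatever you pass as the first argument.

```shell
# Assumes kubectl already points at the cluster that contains the node.
# $1 in both helpers is the name of the affected node.

# Show the full node description, including its Conditions and Events sections.
describe_node() {
  kubectl describe node "$1"
}

# Print one "type=status" line for each condition reported on the node.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

The second helper is only a convenience for scanning conditions quickly; the events still have to be read from the kubectl describe output.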

The monitors in Node Problem Detector generate several conditions on the node. If the reported condition is KubeletUnhealthy or ContainerRuntimeUnhealthy, restarting the corresponding systemd service (kubelet or Docker) might make the node healthy again.
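
A manual restart along these lines might be sketched as follows. The helper names are hypothetical, and the systemd unit names kubelet and docker are assumed to match the services named above; the commands are meant to be run on the affected node over SSH.

```shell
# Map a Node Problem Detector condition to the systemd unit to restart.
# The unit names (kubelet, docker) are assumptions based on the text above.
service_for_condition() {
  case "$1" in
    KubeletUnhealthy)          echo kubelet ;;
    ContainerRuntimeUnhealthy) echo docker ;;
    *)                         return 1 ;;
  esac
}

# Restart the service that corresponds to the given condition.
# Fails with a message if the condition has no known service.
restart_for_condition() {
  unit=$(service_for_condition "$1") || {
    echo "no known service for condition: $1" >&2
    return 1
  }
  sudo systemctl restart "$unit"
}
```

For example, `restart_for_condition KubeletUnhealthy` would run `sudo systemctl restart kubelet`.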

Beginning with Google Distributed Cloud version 1.5, automatic repair of the kubelet and Docker systemd services is enabled in Node Problem Detector. If Node Problem Detector detects a KubeletUnhealthy or ContainerRuntimeUnhealthy condition on the node, it tries to restart the kubelet or Docker service automatically, provided that the time since the last restart exceeds a certain threshold.
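
The threshold rule can be pictured with a small sketch like the one below. This is not Node Problem Detector's actual implementation: the function name and the 120-second value are hypothetical, since this page does not state the real threshold.

```shell
# Hypothetical threshold in seconds; the real value is not given on this page.
MIN_SECONDS_BETWEEN_RESTARTS=120

# Succeeds (exit status 0) only if more than the threshold has elapsed
# since the last restart.
# $1 = time of the last restart, $2 = current time, both in epoch seconds.
restart_allowed() {
  [ $(( $2 - $1 )) -gt "$MIN_SECONDS_BETWEEN_RESTARTS" ]
}
```

A guard of this kind keeps a repair loop from restarting a service repeatedly when the restart itself does not clear the unhealthy condition.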
