This page explains how to debug node issues on Google Distributed Cloud using a suite of preinstalled debugging tools.
Overview
Each Google Distributed Cloud cluster you create is composed of several
nodes. Each Google Distributed Cloud node includes a distribution of
CoreOS' toolbox, a shell script that unpacks and runs a debugging container,
debug-toolbox. debug-toolbox is a container image that includes several
useful debugging tools.
If you encounter issues with a specific node, you can attempt debugging by
connecting to the affected node, running the toolbox script to unpack and run
the debug-toolbox container, and then running the tools included in the
container.
Tools included in the debug-toolbox container
The debug-toolbox container runs a Debian base image that includes the
following packages:
- bash
- curl
- dnsutils
- hping3
- iperf3
- lsof
- netcat
- mtr
- procps
- strace
- tcpdump
- traceroute
- util-linux
Because these tools are included in the container, they don't require an
internet connection. If you want to install additional debugging tools, you can
use apt-get, which does require an internet connection.
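As a minimal sketch of installing an extra tool, the following helper wraps the two apt-get steps. It assumes you are already inside the debug-toolbox container and that the node can reach a Debian package mirror. The function name install_debug_tool and the package nmap in the usage note are illustrative and not part of the product.

```shell
# Sketch: install one additional Debian package inside the debug-toolbox
# container. Assumes the node has internet access to a package mirror.
# install_debug_tool is a hypothetical helper name.
install_debug_tool() {
  if [ -z "$1" ]; then
    echo "usage: install_debug_tool PACKAGE" >&2
    return 1
  fi
  # Refresh the package index first so apt-get can resolve the package.
  apt-get update && apt-get install -y "$1"
}
```

For example, running install_debug_tool nmap would refresh the package index and then install the nmap package.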
Using toolbox
1. Connect to the affected node using SSH.
2. Run the toolbox command: sudo toolbox
   This command starts a debug-toolbox container.
3. While inside the container, run one of the tools, for example, tcpdump.
4. When you're finished, exit the container and close the SSH connection to
   the node.
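To illustrate running a tool inside the container, here is a small sketch around tcpdump. The helper name capture_packets and its default port and packet count are assumptions chosen for the example. Only the tcpdump flags themselves (-i, -nn, -c) are standard.

```shell
# Sketch: capture a bounded number of packets from inside debug-toolbox.
# capture_packets is a hypothetical helper; the defaults are illustrative.
capture_packets() {
  port="${1:-443}"   # port to filter on
  count="${2:-20}"   # stop after this many packets
  # -i any: listen on all interfaces
  # -nn:    skip host and port name resolution
  # -c:     exit after capturing the given number of packets
  tcpdump -i any -nn -c "$count" port "$port"
}
```

For example, capture_packets 53 5 would capture five packets of DNS traffic and then exit, which keeps the capture from running unattended on a busy node.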
Node Problem Detector
Beginning with Google Distributed Cloud version 1.4, Node Problem Detector is
enabled on all nodes in a cluster and helps you quickly detect some common
node problems. Node Problem Detector continuously checks for possible problems
and reports them as events and conditions on the node. If a node misbehaves,
you can check whether Node Problem Detector detected the problem by running
kubectl describe on the node and looking for the corresponding events and
conditions.
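The check described above might look like the following sketch. It assumes kubectl is already configured with credentials for the cluster. The helper name show_node_health is hypothetical, and the node name is whatever kubectl get nodes reports for the affected node.

```shell
# Sketch: inspect what Node Problem Detector reported for one node.
# show_node_health is a hypothetical helper; pass the node name as $1.
show_node_health() {
  node="$1"
  # Full description, including the Conditions and Events sections.
  kubectl describe node "$node"
  # Compact view: one "Type=Status" line per node condition.
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

The second command is optional. It only condenses the condition list so that entries such as KubeletUnhealthy are easier to spot.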
The Node Problem Detector monitors generate several conditions on the node. If
the reported condition is KubeletUnhealthy or ContainerRuntimeUnhealthy,
restarting the corresponding systemd service (kubelet or Docker) might make
the node healthy again.
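A manual restart could be sketched as follows. The two helper names are hypothetical, and the unit names kubelet and docker are taken from the description above. Run this on the affected node with sufficient privileges.

```shell
# Sketch: map a Node Problem Detector condition to the systemd unit to restart.
# service_for_condition and restart_for_condition are hypothetical helpers.
service_for_condition() {
  case "$1" in
    KubeletUnhealthy)          echo kubelet ;;
    ContainerRuntimeUnhealthy) echo docker ;;
    *)                         return 1 ;;   # no restart mapped here
  esac
}

# Run on the affected node; restarts the unit and then prints its status.
restart_for_condition() {
  unit="$(service_for_condition "$1")" || {
    echo "no restart mapped for condition: $1" >&2
    return 1
  }
  sudo systemctl restart "$unit" && sudo systemctl status "$unit" --no-pager
}
```

For example, restart_for_condition KubeletUnhealthy restarts the kubelet unit and prints its status, so you can confirm that the service came back up.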
Beginning with Google Distributed Cloud version 1.5, automatic repair of the
kubelet and docker systemd services is enabled in Node Problem Detector. If
Node Problem Detector detects a KubeletUnhealthy or ContainerRuntimeUnhealthy
condition on the node, it tries to restart the kubelet or docker service
automatically, provided that the time since the last restart is above a
certain threshold.
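The threshold rule can be pictured with the following sketch. It only illustrates the idea of rate-limiting restarts and is not Node Problem Detector's actual implementation. The helper name should_restart and the 600-second value in the usage note are assumptions, because the real threshold is not stated here.

```shell
# Sketch of the rate-limit idea only; not Node Problem Detector's actual code.
# Usage: should_restart LAST_RESTART_EPOCH NOW_EPOCH THRESHOLD_SECONDS
# Succeeds (exit status 0) only when more than THRESHOLD_SECONDS have elapsed
# since the last restart.
should_restart() {
  elapsed=$(( $2 - $1 ))
  [ "$elapsed" -gt "$3" ]
}
```

For example, should_restart "$last_restart" "$(date +%s)" 600 would succeed only if the previous restart happened more than ten minutes ago. A rule of this kind prevents a service that keeps failing from being restarted in a tight loop.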