Debugging node issues

This page describes how to debug node issues in Anthos clusters on VMware (GKE On-Prem) using a set of pre-installed debugging tools.

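Overview

Every Anthos clusters on VMware (GKE On-Prem) cluster that you create consists of multiple nodes. Each node includes a distribution of CoreOS's toolbox, a shell script that unpacks and runs the debugging container debug-toolbox. debug-toolbox is a container image that contains several useful debugging tools.

If you run into a problem with a specific node, you can debug it by connecting to the affected node, running the toolbox script to unpack and start the debug-toolbox container, and then running the tools included in that container.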

Tools included in the `debug-toolbox` container

The debug-toolbox container runs a Debian base image that includes the following packages:

bash
curl
dnsutils
hping3
iperf3
lsof
netcat
mtr
procps
strace
tcpdump
traceroute
util-linux

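Because these tools are included in the container, using them does not require an internet connection. If you want to install additional debugging tools, use apt-get, which does require an internet connection.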
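As a quick sanity check inside the container, the following sketch reports which commands are not available on the PATH. The helper name `missing_tools` is hypothetical. Note that package names and command names can differ: for example, the dnsutils package provides dig, and procps provides ps.

```shell
# Hypothetical helper: print each given command name that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check a few commands that the listed packages provide.
missing_tools bash curl dig lsof ps strace tcpdump traceroute
```

If a tool you need is reported as missing, you could install it inside the container with `apt-get update` followed by `apt-get install -y PACKAGE_NAME`, provided the node can reach the internet.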

Using `toolbox`

Connect to the cluster node over SSH.
Run the toolbox command:
```
sudo toolbox
```
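This command starts the debug-toolbox container.
Inside the container, run one of the tools, for example tcpdump.
When you are finished, exit the container and close the SSH connection to the node.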
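For instance, a short, bounded packet capture inside the container might look like the following sketch. The function name `capture_sample` and the `DRY_RUN` variable are hypothetical conveniences, not part of toolbox. The tcpdump flags are standard: `-n` skips name resolution, `-i` selects the interface, and `-c` stops after a given number of packets.

```shell
# Hypothetical wrapper: capture a bounded number of packets on one interface.
# Usage: capture_sample [INTERFACE] [PACKET_COUNT]
capture_sample() {
  iface=${1:-any}   # "any" captures on all interfaces (Linux)
  count=${2:-20}    # stop after this many packets
  set -- tcpdump -n -i "$iface" -c "$count"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # only show the command that would run
  else
    "$@"                 # run the capture
  fi
}
```

Calling `capture_sample` with no arguments captures 20 packets on all interfaces. Setting `DRY_RUN=1` first prints the tcpdump command without running it, which is handy for checking the arguments.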

Node Problem Detector

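Starting with Anthos clusters on VMware version 1.4, Node Problem Detector is enabled on all nodes in a cluster to help detect some common node problems quickly. Node Problem Detector continuously checks for possible problems and reports them as events and conditions on the node. If a node behaves abnormally, you can run kubectl describe on the node and look for the corresponding events and conditions to see whether Node Problem Detector has detected the problem.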
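Besides reading the output of `kubectl describe node NODE_NAME` by eye, you can list a node's conditions in a compact form and filter them. The sketch below is one way to do this. The function names `list_conditions` and `flag_unhealthy` are hypothetical, and the filter looks only for the two conditions named on this page.

```shell
# Hypothetical helper: print "TYPE<tab>STATUS" for every condition on a node.
list_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Hypothetical filter: read "TYPE<tab>STATUS" lines on stdin and print the
# types that Node Problem Detector reports as currently true.
flag_unhealthy() {
  awk -F '\t' '($1 == "KubeletUnhealthy" || $1 == "ContainerRuntimeUnhealthy") && $2 == "True" { print $1 }'
}
```

A typical invocation would be `list_conditions NODE_NAME | flag_unhealthy`, where NODE_NAME is a placeholder for one of your nodes. Empty output means neither condition is currently set.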

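The Node Problem Detector monitors generate several conditions on a node. If the reported condition is KubeletUnhealthy or ContainerRuntimeUnhealthy, restarting the corresponding systemd service (kubelet or Docker) might help return the node to a healthy state.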
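The mapping from condition to service can be sketched as follows. The function name `service_for_condition` is hypothetical, and the systemd unit names `kubelet` and `docker` are assumptions; confirm the actual unit names on your node (for example with `systemctl list-units`) before restarting anything.

```shell
# Hypothetical helper: print the systemd unit to restart for a given condition.
# The unit names "kubelet" and "docker" are assumed, not confirmed by this page.
service_for_condition() {
  case "$1" in
    KubeletUnhealthy)          printf 'kubelet\n' ;;
    ContainerRuntimeUnhealthy) printf 'docker\n' ;;
    *)                         return 1 ;;   # no known service for this condition
  esac
}
```

On the affected node, a manual restart would then look like `sudo systemctl restart "$(service_for_condition KubeletUnhealthy)"`.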

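Starting with Anthos clusters on VMware (GKE On-Prem) version 1.5, Node Problem Detector enables automatic repair of the kubelet and Docker systemd services. If Node Problem Detector detects the KubeletUnhealthy or ContainerRuntimeUnhealthy condition on a node, and the time since the last restart exceeds a certain threshold, it attempts to restart the kubelet or Docker service automatically.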
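The decision rule described above can be illustrated with a small sketch. This is not Node Problem Detector's actual implementation: the function name `should_restart` is hypothetical, and the threshold value is not stated on this page, so the caller must supply it.

```shell
# Illustrative only: succeed (exit 0) when a restart would be attempted.
# Arguments:
#   $1  condition status, "True" when the unhealthy condition is present
#   $2  seconds elapsed since the service was last restarted
#   $3  threshold in seconds (hypothetical; the real value is not documented here)
should_restart() {
  [ "$1" = "True" ] && [ "$2" -gt "$3" ]
}
```

For example, `should_restart True 600 300` succeeds, while `should_restart True 100 300` fails because the last restart was too recent.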
