Node Problem Detector

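Node Problem Detector is an open source library that monitors node health and detects common node problems, such as hardware, kernel, or container runtime issues. In Anthos clusters on Bare Metal, it runs as a systemd service on each node.

Starting with Anthos clusters on Bare Metal release 1.10.0, Node Problem Detector is enabled by default.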

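What problems does it detect?

Node Problem Detector can detect the following types of problems:

It runs on the node and reports problems to the Kubernetes API server as either a NodeCondition or an Event. (A NodeCondition is a problem that makes a node unable to run pods, whereas an Event is a temporary problem that has a limited effect on pods but is still considered serious enough to report.)

Some of the NodeConditions that Node Problem Detector discovers include:

Some examples of the kinds of Events that Node Problem Detector reports include: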

Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 00x0.

To find the NodeConditions and Events for a node, run the following kubectl describe command:

kubectl --kubeconfig=KUBECONFIG_PATH describe node NODE_NAME

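In the command, replace the following entries with information specific to your environment:

- KUBECONFIG_PATH: the path to the target cluster kubeconfig file. (The path to the kubeconfig file is typically bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.)
- NODE_NAME: the name of the node for which you want health information.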
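
The placeholder substitution described above can be sketched in shell. This is a minimal illustration, not part of bmctl or kubectl: the function names kubeconfig_path and show_node_conditions are hypothetical, and the second helper assumes that kubectl is installed and can reach the cluster. It uses kubectl get with a jsonpath template as a compact alternative to reading the full kubectl describe output.

```shell
# Hypothetical helper: print the kubeconfig path for a cluster.
# $1 = cluster name
# $2 = optional workspace directory (defaults to bmctl-workspace)
kubeconfig_path() {
  printf '%s/%s/%s-kubeconfig\n' "${2:-bmctl-workspace}" "$1" "$1"
}

# Hypothetical helper: list each condition of a node as TYPE, STATUS, REASON.
# $1 = kubeconfig path
# $2 = node name
show_node_conditions() {
  kubectl --kubeconfig="$1" get node "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}
```

For example, show_node_conditions "$(kubeconfig_path CLUSTER_NAME)" NODE_NAME would print one line per condition for that node. Events are not covered by this sketch; the kubectl describe command remains the way to see both together.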

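Use the following steps to enable Node Problem Detector on a given cluster:

1. Edit the cluster's ConfigMap, which is named node-problem-detector-config:
```
   kubectl --kubeconfig=KUBECONFIG_PATH edit configmap \
       node-problem-detector-config --namespace=CLUSTER_NAMESPACE
```
This command automatically launches a text editor (such as vim or nano) in which you can edit the node-problem-detector-config file. In the command, replace the following entries with information specific to your cluster environment:
- KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. (The path to the kubeconfig file is typically bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.)
- CLUSTER_NAMESPACE: the namespace of the cluster in which you want to enable Node Problem Detector.

2. The node-problem-detector-config ConfigMap initially has no data field. Add the data field to the ConfigMap with the following key-value pair:
```
data:
  enabled: "true"
```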
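
The same change can also be made without opening an editor. The following shell sketch is one possible approach, not a documented procedure for Anthos clusters on Bare Metal: it uses kubectl patch with a merge patch, which adds the data field if it is absent. The function names npd_patch_json and set_npd_enabled are hypothetical, and the sketch assumes that kubectl is installed and that the ConfigMap name and key are exactly as described above.

```shell
# Hypothetical helper: build a merge patch that sets data.enabled.
# $1 = "true" or "false"; any other value is rejected.
npd_patch_json() {
  case "$1" in
    true|false) printf '{"data":{"enabled":"%s"}}\n' "$1" ;;
    *) echo "usage: npd_patch_json true|false" >&2; return 1 ;;
  esac
}

# Hypothetical helper: apply the patch to the ConfigMap.
# $1 = kubeconfig path
# $2 = cluster namespace
# $3 = "true" or "false"
set_npd_enabled() {
  patch=$(npd_patch_json "$3") || return 1
  kubectl --kubeconfig="$1" patch configmap node-problem-detector-config \
    --namespace="$2" --type=merge --patch "$patch"
}
```

For example, set_npd_enabled KUBECONFIG_PATH CLUSTER_NAMESPACE true corresponds to the edit shown above. The value is passed as a quoted string because ConfigMap data values must be strings.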

To disable Node Problem Detector in a cluster namespace, follow steps 1 and 2 above, but in step 2, change the value of the enabled key to "false".

Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector on a given node, use SSH to access the node, and then run the following systemctl commands.

To stop Node Problem Detector on the node, run the following command:

systemctl stop node-problem-detector

To restart Node Problem Detector, run the following command:

systemctl restart node-problem-detector

To check whether Node Problem Detector is running on a particular node, run the following command:

systemctl is-active node-problem-detector
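
When several nodes need to be checked, the is-active command can be wrapped in a small loop. The following shell sketch is an illustration only: the function name check_npd is hypothetical, and it assumes that each argument is a destination that the local ssh client can reach without an interactive prompt.

```shell
# Hypothetical helper: report the Node Problem Detector service state per node.
# Arguments: one or more SSH destinations (for example, user@host).
check_npd() {
  for node in "$@"; do
    # systemctl is-active exits non-zero when the unit is not active,
    # so the exit status is ignored and only the printed state is kept.
    state=$(ssh "$node" systemctl is-active node-problem-detector 2>/dev/null) || true
    # An empty result (for example, when the SSH connection fails) is shown as "unknown".
    printf '%s: %s\n' "$node" "${state:-unknown}"
  done
}
```

For example, check_npd NODE_1 NODE_2 prints one line per node with the state that systemctl reported, such as active or inactive.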

Anthos clusters on Bare Metal does not support the following customizations of Node Problem Detector: