Version 1.14. This version is no longer supported. For information about how to upgrade to version 1.15, see Upgrading Anthos on bare metal in the 1.15 documentation. For more information about supported and unsupported versions, see the Version history page in the latest documentation.

Node Problem Detector

Node Problem Detector is an open source daemon that monitors the health of nodes and detects common node problems, such as hardware, kernel, or container runtime issues. In Google Distributed Cloud, it runs as a systemd service on each node.

Starting with Google Distributed Cloud release 1.10.0, Node Problem Detector is enabled by default.

What problems does it detect?

Node Problem Detector can detect the following kinds of issues:

Container runtime problems, such as unresponsive runtime daemons
Hardware problems, such as CPU, memory, or disk failures
Kernel problems, such as kernel deadlock conditions or corrupted file systems

It runs on each node and reports problems to the Kubernetes API server as either a NodeCondition or an Event. A NodeCondition is a problem that makes a node unable to run pods, whereas an Event is a temporary problem that has a limited effect on pods but is nonetheless considered important enough to report.

Some of the NodeConditions discovered by Node Problem Detector are:

KernelDeadlock
ReadonlyFilesystem
FrequentKubeletRestart
FrequentDockerRestart
FrequentContainerdRestart
FrequentUnregisterNetDevice
KubeletUnhealthy
ContainerRuntimeUnhealthy
CorruptDockerOverlay2

Some examples of the kinds of Events reported by Node Problem Detector are:

Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 00x0.

How to view detected problems

Run the following kubectl describe command against a node to look for NodeConditions and Events:

kubectl --kubeconfig=KUBECONFIG_PATH describe node NODE_NAME

In the command, replace the following entries with information specific to your environment:

KUBECONFIG_PATH: the path to the target cluster kubeconfig file. The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
NODE_NAME: the name of the node about which you want health information.
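
The describe output mixes conditions and events with other node details. As a minimal sketch, the two helper functions below print them separately. The function names are hypothetical, and the sketch assumes that the standard kubectl jsonpath output and Event field selectors are available in your kubectl version.

```shell
# Sketch: show a node's conditions and node-scoped Events separately.
# node_conditions and node_events are hypothetical helper names.

# Print one line per NodeCondition: type, status, and message.
node_conditions() {
  local kubeconfig="$1" node="$2"
  kubectl --kubeconfig="$kubeconfig" get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
}

# List Events whose involved object is the given node.
node_events() {
  local kubeconfig="$1" node="$2"
  kubectl --kubeconfig="$kubeconfig" get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$node"
}

# Usage:
#   node_conditions KUBECONFIG_PATH NODE_NAME
#   node_events KUBECONFIG_PATH NODE_NAME
```

A condition such as KernelDeadlock with a status of True indicates that the problem is currently detected on the node.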

How to enable/disable Node Problem Detector

To enable Node Problem Detector on a given cluster, complete the following steps:

1. Edit the cluster's ConfigMap, which is named node-problem-detector-config:
```
   kubectl --kubeconfig=KUBECONFIG_PATH edit configmap \
       node-problem-detector-config --namespace=CLUSTER_NAMESPACE
```
This command automatically opens a text editor (such as vim or nano) in which you can edit the node-problem-detector-config ConfigMap. In the command, replace the following entries with information specific to your cluster environment:
- KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
- CLUSTER_NAMESPACE: the namespace of the cluster in which you want to enable Node Problem Detector.
2. Initially, the node-problem-detector-config ConfigMap doesn't have a data field. Add the data field to the ConfigMap with the following key-value pair:
```
data:
  enabled: "true"
```

To disable Node Problem Detector in a cluster namespace, perform the preceding steps, but set the value of the enabled key to "false".
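
If you prefer not to open an editor, the same key can be set with kubectl patch. The following is a minimal sketch, not a documented procedure: set_npd_enabled is a hypothetical helper name, and the sketch assumes that a merge patch on the ConfigMap has the same effect as editing it by hand.

```shell
# Sketch: set the "enabled" key of node-problem-detector-config without an editor.
# Assumption: a merge patch is treated the same as a manual edit of the ConfigMap.
set_npd_enabled() {
  local kubeconfig="$1" namespace="$2" enabled="$3"

  # Accept only the two values described in this document.
  case "$enabled" in
    true|false) ;;
    *) echo "enabled must be true or false" >&2; return 1 ;;
  esac

  # A merge patch creates the data field if the ConfigMap doesn't have one yet.
  kubectl --kubeconfig="$kubeconfig" patch configmap node-problem-detector-config \
    --namespace="$namespace" --type=merge \
    --patch "{\"data\":{\"enabled\":\"$enabled\"}}"
}

# Usage:
#   set_npd_enabled KUBECONFIG_PATH CLUSTER_NAMESPACE true
#   set_npd_enabled KUBECONFIG_PATH CLUSTER_NAMESPACE false
```

The value is passed as a quoted string because ConfigMap data values must be strings.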

How to stop/start Node Problem Detector

Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector for a given node, use SSH to access the node, and run the following systemctl commands.

To stop Node Problem Detector, run the following command:

systemctl stop node-problem-detector

To restart Node Problem Detector, run the following command:

systemctl restart node-problem-detector

To check if Node Problem Detector is running on a particular node, run the following command:

systemctl is-active node-problem-detector
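
To check several nodes in one pass, the is-active command can be wrapped in a loop over SSH. This is a rough sketch under stated assumptions: check_npd_on_nodes is a hypothetical helper, the SSH user and node addresses are placeholders, and key-based SSH access to each node is assumed to be set up already.

```shell
# Sketch: report the node-problem-detector service state for each listed node.
# The SSH user and node addresses are hypothetical placeholders.
check_npd_on_nodes() {
  local ssh_user="$1"; shift
  local node state
  for node in "$@"; do
    # systemctl is-active prints the unit state and exits non-zero when the
    # unit isn't active, so the exit status is deliberately ignored here.
    state="$(ssh -n "$ssh_user@$node" systemctl is-active node-problem-detector 2>/dev/null)" || true
    # An empty result means that no state was returned, for example because
    # the SSH connection failed.
    printf '%s\t%s\n' "$node" "${state:-unknown}"
  done
}

# Usage:
#   check_npd_on_nodes SSH_USER NODE_ADDRESS_1 NODE_ADDRESS_2
```

Each output line contains the node address and the state that systemctl reported, or unknown when no state came back.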

Unsupported features

Google Distributed Cloud doesn't support the following customizations of Node Problem Detector:

Exporting Node Problem Detector reports to other monitoring systems, such as Stackdriver or Prometheus.
Customizing which NodeConditions or Events to look for.
Running user-defined monitoring scripts.