Node Problem Detector is an open source library that monitors the health of nodes and detects common node problems, such as hardware, kernel or container runtime issues. In Google Distributed Cloud, it runs as a systemd service on each node.
Starting with Google Distributed Cloud release 1.10.0, Node Problem Detector is enabled by default.
What problems does it detect?
Node Problem Detector can detect the following kinds of issues:
- Container runtime problems, such as unresponsive runtime daemons
- Hardware problems, such as CPU, memory, or disk failures
- Kernel problems, such as kernel deadlock conditions or corrupted file systems
It runs on a node and reports problems to the Kubernetes API server as either a NodeCondition or an Event. (A NodeCondition is a problem that makes a node unable to run pods, whereas an Event is a temporary problem that has a limited effect on pods but is nonetheless considered important enough to report.)
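For example, a NodeCondition as it appears in a node's status might look like the following (an illustrative sketch; the reason and message values vary with the detected problem):

```yaml
conditions:
- type: KernelDeadlock
  status: "True"
  reason: DockerHung
  message: 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.'
```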
Some of the NodeConditions discovered by Node Problem Detector are:
- KernelDeadlock
- ReadonlyFilesystem
- FrequentKubeletRestart
- FrequentDockerRestart
- FrequentContainerdRestart
- FrequentUnregisterNetDevice
- KubeletUnhealthy
- ContainerRuntimeUnhealthy
- CorruptDockerOverlay2
Some examples of the kinds of Events reported by Node Problem Detector are:
Warning TaskHung node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: task docker:7 blocked for more than 300 seconds.
Warning KernelOops node/vm-worker-1-user-a12fabb4a99cb92-ddfce8832fd90f6f.lab.anthos kernel: BUG: unable to handle kernel NULL pointer dereference at 0x0.
How to view detected problems
Run the following kubectl describe command on a node to look for NodeConditions and Events:
kubectl --kubeconfig=KUBECONFIG_PATH describe node NODE_NAME
In the command, replace the following entries with information specific to your environment:
- KUBECONFIG_PATH: the path to the target cluster kubeconfig file. The path is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
- NODE_NAME: the name of the node about which you want health information.
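To scan only the conditions without the rest of the describe output, you can use a jsonpath query instead (a sketch; KUBECONFIG_PATH and NODE_NAME are the same placeholders as above):

```shell
# Print each condition type and its status for the node, one per line.
kubectl --kubeconfig=KUBECONFIG_PATH get node NODE_NAME \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```

A healthy node typically shows False for the problem conditions; a True status points at a detected problem.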
How to enable/disable Node Problem Detector
To enable Node Problem Detector on a given cluster, follow these steps:
1. Use the kubectl command to edit the node-problem-detector-config ConfigMap:

kubectl --kubeconfig=KUBECONFIG_PATH edit configmap node-problem-detector-config \
    --namespace=CLUSTER_NAMESPACE
This command automatically starts up a text editor (such as vim or nano) in which you can edit the node-problem-detector-config file. In the command, replace the following entries with information specific to your cluster environment:
- KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. The path is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
- CLUSTER_NAMESPACE: the namespace of the cluster in which you want to enable Node Problem Detector.
2. Initially, the node-problem-detector-config ConfigMap doesn't have a data field. Add the data field to the ConfigMap with the following key-value pair:

data:
  enabled: "true"
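After the edit, the ConfigMap should look roughly like this (a sketch; the metadata values shown are examples):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
  namespace: CLUSTER_NAMESPACE
data:
  enabled: "true"
```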
To disable Node Problem Detector in a cluster namespace, perform the preceding steps 1 and 2, but in step 2, change the value of the enabled key to "false".
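As an alternative to the interactive edit, the same key can be toggled non-interactively with kubectl patch (a sketch using the same placeholders as above):

```shell
# Enable Node Problem Detector; change "true" to "false" to disable it.
kubectl --kubeconfig=KUBECONFIG_PATH patch configmap node-problem-detector-config \
  --namespace=CLUSTER_NAMESPACE --type merge --patch '{"data":{"enabled":"true"}}'
```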
How to stop/start Node Problem Detector
Node Problem Detector runs as a systemd service on each node. To manage Node Problem Detector for a given node, use SSH to access the node and run the following systemctl commands.
To stop Node Problem Detector, run the following command:
systemctl stop node-problem-detector
To restart Node Problem Detector, run the following command:
systemctl restart node-problem-detector
To check if Node Problem Detector is running on a particular node, run the following command:
systemctl is-active node-problem-detector
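To check several nodes at once, the same systemctl check can be wrapped in a loop over SSH (a sketch; the node names here are hypothetical and stand in for your actual node hostnames):

```shell
# Report Node Problem Detector service status for each listed node.
for node in vm-worker-1 vm-worker-2; do
  echo -n "$node: "
  ssh "$node" systemctl is-active node-problem-detector
done
```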