Monitoring system health with Node Problem Detector

Starting with Milestone 77, Container-Optimized OS includes the Node Problem Detector agent, which you can use to monitor the system health of Container-Optimized OS instances. Node Problem Detector monitors instance health and reports health-related metrics, including capacity and error metrics, to Cloud Monitoring, where you can visualize them with Google Cloud's operations suite dashboards. Metrics collected with the default configuration are free of charge. Google uses aggregated metrics to understand node problems and improve the reliability of Container-Optimized OS.

The agent is pre-configured with the set of metrics to export. Customizing reported metrics for the built-in agent is not supported at this time. Node Problem Detector is open-source software. You can review its source code and configurations in their respective source repositories.

Enabling health monitoring

The feature is disabled by default at boot time. You can enable this feature using cloud-init or a startup script.

The cloud-init example explains the basics of configuring a Container-Optimized OS instance. To enable health monitoring with cloud-init, use the following cloud-config:

#cloud-config

runcmd:
- systemctl start node-problem-detector
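
The text above mentions a startup script as an alternative to cloud-init but does not show one. The following is a minimal sketch of that route, not an official procedure. The file name startup.sh, the INSTANCE_NAME placeholder, and the cos-stable / cos-cloud image settings are assumptions for illustration. The gcloud command is only printed, not run.

```shell
# Hypothetical file name; any path works.
STARTUP_SCRIPT="startup.sh"

# Write a startup script that starts the agent on each boot.
cat > "${STARTUP_SCRIPT}" <<'EOF'
#!/bin/bash
# Start the Node Problem Detector agent.
systemctl start node-problem-detector
EOF

# Syntax-check the generated script without executing it.
bash -n "${STARTUP_SCRIPT}"

# Print (rather than run) a gcloud command that would attach the script.
# INSTANCE_NAME is a placeholder; the image family and project are assumed.
echo "gcloud compute instances create INSTANCE_NAME \
--image-family cos-stable --image-project cos-cloud \
--metadata-from-file startup-script=${STARTUP_SCRIPT}"
```

To use the cloud-config shown earlier instead, the same command would pass the file under the user-data metadata key rather than startup-script.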

Viewing the collected metrics

Node Problem Detector reports its metrics against the Compute Engine instance monitored resource. The metrics are documented in the Monitoring metrics list, under the prefix compute.googleapis.com/guest/. You can view the collected metrics using Monitoring Metrics Explorer:

  1. In the Google Cloud Console, go to Monitoring.

  2. In the Monitoring navigation pane, click Metrics explorer.

  3. For the resource type, select Compute Engine VM instance.

  4. Select a metric, for example "Problem Count".

  5. You should see charts and statistics on the right side. To view the result for a specific Container-Optimized OS instance, set the filter to "instance_id=[INSTANCE_ID]", replacing [INSTANCE_ID] with the ID for the desired instance.
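
The same selection can also be expressed as a Cloud Monitoring filter string, which may help if you want to query outside the console. The sketch below only builds and prints such a filter. The project ID and instance ID are hypothetical, and the full metric type for "Problem Count" is an assumption based on the compute.googleapis.com/guest/ prefix mentioned above, so check it against the Monitoring metrics list before relying on it.

```shell
# Hypothetical values; replace with your own project and instance ID.
PROJECT_ID="my-project"
INSTANCE_ID="1234567890123456789"

# Assumed metric type for "Problem Count" under the guest/ prefix.
METRIC_TYPE="compute.googleapis.com/guest/system/problem_count"

# Monitoring filter: metric type plus the instance_id resource label.
FILTER="metric.type=\"${METRIC_TYPE}\" AND resource.labels.instance_id=\"${INSTANCE_ID}\""
echo "${FILTER}"

# Query sketch against the Monitoring API v3 timeSeries.list method
# (not run here; the time interval is an arbitrary example):
# curl -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
#   --data-urlencode "filter=${FILTER}" \
#   --data-urlencode "interval.startTime=2024-01-01T00:00:00Z" \
#   --data-urlencode "interval.endTime=2024-01-01T01:00:00Z"
```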

Disabling health monitoring

Health monitoring is disabled by default at boot time. If you have enabled it and want to disable it again, remove the systemctl start node-problem-detector step from your startup script or cloud-config, and then reboot the Container-Optimized OS instance.
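
As a rough sketch of the removal step, the snippet below strips the start command from a local copy of a cloud-config file while keeping other boot steps. The file name cloud-config.yaml, the extra echo step, and the INSTANCE_NAME placeholder are hypothetical. The metadata update is shown only as a comment.

```shell
# Hypothetical local copy of the cloud-config used for the instance.
CONFIG="cloud-config.yaml"
cat > "${CONFIG}" <<'EOF'
#cloud-config

runcmd:
- systemctl start node-problem-detector
- echo "other boot step"
EOF

# Drop only the line that starts the agent; keep every other step.
grep -v 'systemctl start node-problem-detector' "${CONFIG}" > "${CONFIG}.new"
mv "${CONFIG}.new" "${CONFIG}"
cat "${CONFIG}"

# Then re-upload the file and reboot (not run here):
# gcloud compute instances add-metadata INSTANCE_NAME \
#   --metadata-from-file user-data=cloud-config.yaml
```

If the start command is the only entry under runcmd, it is probably cleaner to remove the runcmd key as well rather than leave it empty.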