Starting with Milestone 77, Container-Optimized OS includes the Node Problem Detector agent. You can use this feature to monitor the system health of COS instances. Node Problem Detector monitors the instance health and reports health-related metrics to Cloud Monitoring, including capacity and error metrics that you can then visualize with Google Cloud's operations suite dashboards. Collected metrics from the default configuration are free. Google will use aggregated metrics to understand node problems and improve the reliability of Container-Optimized OS.
The agent is pre-configured with the set of metrics to export. Customizing reported metrics for the built-in agent is not supported at this time. Node Problem Detector is open-source software. You can review its source code and configurations in their respective source repositories.
Enabling health monitoring
The Node Problem Detector agent is disabled by default at boot time. You can enable this feature by using a startup script or user-defined guest policies.
Using a startup script
You can enable Node Problem Detector by using a startup script.
The documentation on creating and configuring instances explains the basics of configuring a Container-Optimized OS instance. You can
use cloud-init to enable health monitoring with the following cloud-config:
#cloud-config

runcmd:
- systemctl start node-problem-detector
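One way to supply this cloud-config is through the instance's user-data metadata at creation time. The sketch below is illustrative only: the file path /tmp/npd-cloud-config.yaml and the function name create_cos_with_npd are hypothetical, and it assumes your gcloud version accepts --metadata-from-file with a user-data key.

```shell
#!/bin/bash
# Hypothetical path for the cloud-config file.
CONFIG_FILE="/tmp/npd-cloud-config.yaml"

# Write the cloud-config to a local file.
cat > "$CONFIG_FILE" <<'EOF'
#cloud-config

runcmd:
- systemctl start node-problem-detector
EOF

# Hypothetical helper: create a COS instance that reads the file as user-data.
# $1 is the instance name.
create_cos_with_npd() {
  gcloud compute instances create "$1" \
    --image-family cos-stable \
    --image-project cos-cloud \
    --metadata-from-file user-data="$CONFIG_FILE"
}

# Example call (uncomment to run against your project):
# create_cos_with_npd instance-name
```

Keeping the cloud-config in a file rather than inline avoids quoting problems with the multi-line YAML.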
In Container-Optimized OS Milestone 88 and later, the Node Problem Detector can
also be enabled by setting the value of google-monitoring-enabled to true in
the custom metadata section.
To enable monitoring when creating an instance:
gcloud compute instances create instance-name \
    --image-family cos-stable \
    --image-project cos-cloud \
    --metadata google-monitoring-enabled=true
To enable monitoring in an existing instance:
gcloud compute instances add-metadata instance-name \
    --metadata google-monitoring-enabled=true
Starting with Milestone 97, monitoring can also be enabled in project metadata:
gcloud compute project-info add-metadata \
    --metadata google-monitoring-enabled=true
After execution, the node-problem-detector service will be enabled.
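To confirm the result on the instance itself, you can ask systemd for the unit's state. This is a minimal sketch, assuming it runs in a shell on the Container-Optimized OS instance; the function name npd_status is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report whether the node-problem-detector unit is running.
# "systemctl is-active --quiet" exits with status 0 only when the unit is active.
npd_status() {
  if systemctl is-active --quiet node-problem-detector; then
    echo "node-problem-detector is active"
  else
    echo "node-problem-detector is not active"
    return 1
  fi
}

# Example call on the instance:
# npd_status
```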
Using user-defined guest policies
Container-Optimized OS includes the OS Config agent, which uses OS system utilities to maintain the state for the VM that is specified in the guest policy. For details about guest policies, see Enable OS Config agent and Create a guest policy. The following guest policy enables the Node Problem Detector agent on all the instances.
recipes:
- name: recipe-enable-npd
  desiredState: INSTALLED
  installSteps:
  - scriptRun:
      interpreter: SHELL
      script: |-
        #!/bin/bash
        systemctl start node-problem-detector
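A policy like this is typically saved to a YAML file and then created with gcloud. The sketch below is an assumption rather than part of the original instructions: it assumes the beta os-config command group is available in your gcloud installation, and the function name, policy ID, and file name are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: create a guest policy from a local YAML file.
# $1 is the policy ID, $2 is the path to the policy file.
# Assumption: "gcloud beta compute os-config guest-policies create"
# is available in your gcloud installation.
create_npd_guest_policy() {
  gcloud beta compute os-config guest-policies create "$1" --file="$2"
}

# Example call:
# create_npd_guest_policy enable-npd-policy npd-policy.yaml
```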
Viewing the collected metrics
Node Problem Detector reports a list of metrics against a
Compute Engine instance monitored resource.
The metrics are documented on the
Monitoring metrics list, prefixed with
compute.googleapis.com/guest/. You can view the collected metrics in
Monitoring Metrics Explorer:
In the Google Cloud console, go to Monitoring.
In the Monitoring navigation pane, click Metrics explorer.
For the resource type, select Compute Engine VM instance.
Select a metric, for example "Problem Count".
You should see charts and statistics on the right side. To view the result for a specific Container-Optimized OS instance, set the filter to
"instance_id=[INSTANCE_ID]", replacing [INSTANCE_ID] with the ID for the desired instance.
Disabling health monitoring
To disable the service that has already been enabled through your
guest policy or through your startup script, remove the
systemctl start node-problem-detector step, and then reboot the
Container-Optimized OS instance. If enabled by metadata, make sure the
google-monitoring-enabled key is set to false.
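For the metadata case, the change mirrors the enabling commands shown earlier. This is a sketch under the assumption that false is the value that turns monitoring off; the function name disable_npd_metadata is hypothetical, and whether a restart is also needed should be verified on the instance.

```shell
#!/bin/bash
# Hypothetical helper: set google-monitoring-enabled to false on one instance.
# Assumption: "false" is the value that disables monitoring.
# $1 is the instance name.
disable_npd_metadata() {
  gcloud compute instances add-metadata "$1" \
    --metadata google-monitoring-enabled=false
}

# Example call:
# disable_npd_metadata instance-name
```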