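# Monitoring system health with Node Problem Detector

Starting with Milestone 77, Container-Optimized OS includes the [Node Problem Detector](https://github.com/kubernetes/node-problem-detector) agent. You can use this feature to monitor the system health of Container-Optimized OS (COS) instances. Node Problem Detector monitors instance health and reports health-related metrics, including capacity and error metrics, to Cloud Monitoring, where you can visualize them with [Google Cloud Observability dashboards](/monitoring/charts). Metrics collected from the default configuration are free. Google uses aggregated metrics to understand node problems and improve the reliability of Container-Optimized OS.
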
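The agent is preconfigured with the set of metrics to export. Customizing the reported metrics for the built-in agent is not currently supported. Node Problem Detector is open-source software. You can review its [source code](https://github.com/kubernetes/node-problem-detector) and [configurations](https://cos.googlesource.com/cos/overlays/board-overlays/+/refs/heads/master/project-lakitu/app-admin/node-problem-detector/) in their respective source repositories.
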
Enabling health monitoring
--------------------------

The Node Problem Detector agent is disabled by default at boot time. You can enable this feature by using:

- [`cloud-init`](#using_cloud-init)
- [startup script](#script)
- [metadata](#metadata)
- [user-defined guest policies](#using_user-defined_guest_policies)

| **Note:** Google Kubernetes Engine Container-Optimized OS nodes and Google Distributed Cloud cluster nodes from version 1.4.0+ have Node Problem Detector enabled by default.

### Using a startup script

You can enable Node Problem Detector by using a
[startup script](/container-optimized-os/docs/how-to/create-configure-instance#running_startup_scripts).

### Using cloud-init

The [`cloud-init` example](/container-optimized-os/docs/how-to/create-configure-instance#using_cloud-init)
explains the basics of configuring a Container-Optimized OS instance. You can
use `cloud-init` to enable health monitoring with the following `cloud-config`
example:

    #cloud-config

    runcmd:
    - systemctl start node-problem-detector

### Using metadata

In Container-Optimized OS Milestone 88 and later, you can also enable
Node Problem Detector by setting the value of `google-monitoring-enabled` to
`true` in the custom metadata section.

To enable monitoring when creating an instance:

```
gcloud compute instances create VM_NAME \
    --image=IMAGE \
    --image-project=cos-cloud \
    --metadata=google-monitoring-enabled=true
```

Replace the following:

- `VM_NAME`: the name of the new VM
- `IMAGE`: a specific version of a public Container-Optimized OS image. For example, `--image=cos-113-18244-85-29`.

To enable monitoring in an existing instance:

```
gcloud compute instances add-metadata VM_NAME \
    --metadata=google-monitoring-enabled=true
```

Replace `VM_NAME` with the name of the VM.

Starting in [milestone 97](/container-optimized-os/docs/release-notes/m97),
monitoring can be enabled in project metadata:

    gcloud compute project-info add-metadata \
        --metadata google-monitoring-enabled=true

| **Note:** Metadata flags defined at the instance level take precedence over metadata flags defined at the project level.

After the command runs, the `node-problem-detector` service is enabled.

### Using user-defined guest policies

Container-Optimized OS includes the [OS Config agent](/container-optimized-os/docs/how-to/osconfig),
which uses OS system utilities to maintain the VM state that is specified in
the guest policy. For details about guest policies, see
[Enable OS Config agent](/compute/docs/manage-os#overview) and
[Create a guest policy](/compute/docs/os-config-management/create-guest-policy).
The following guest policy enables the Node Problem Detector agent on all
instances.

    recipes:
    - name: recipe-enable-npd
      desiredState: INSTALLED
      installSteps:
      - scriptRun:
          interpreter: SHELL
          script: |-
            #!/bin/bash
            systemctl start node-problem-detector

Viewing the collected metrics
-----------------------------

Node Problem Detector reports a list of metrics against a
[Compute Engine instance](/monitoring/api/resources#tag_gce_instance) monitored resource.
The metrics are documented in the
[Monitoring metrics list](/monitoring/api/metrics_gcp_c#gcp-compute), prefixed
with `compute.googleapis.com/guest/`. You can view the collected metrics
by using
[Monitoring Metrics Explorer](/monitoring/charts/metrics-explorer):

| **Note:** To view the metrics, the minimum version of node-problem-detector is [v0.7.0](https://github.com/kubernetes/node-problem-detector/releases).

| **Note:** Node Problem Detector monitors the VM and *not* the Docker containers hosted on the COS VM. You will see metrics for the entire VM and nothing specific to the containers.

1. In the Google Cloud console, go to **Monitoring** or use the
   following link:

   [Go to Monitoring](https://console.cloud.google.com/monitoring)

2. In the Monitoring navigation pane, click **Metrics explorer**.

3. For the resource type, select **Compute Engine VM instance**.

4. Select a metric, for example "Problem Count".

5. Charts and statistics appear on the right side. To view the result
   for a specific Container-Optimized OS instance, set the filter to
   `"instance_id=[INSTANCE_ID]"`, replacing `[INSTANCE_ID]`
   with the ID of the desired instance.

Disabling health monitoring
---------------------------

To disable a service that was enabled through your [`cloud-config`](/container-optimized-os/docs/how-to/create-configure-instance#using_cloud-init)
or your [startup script](/container-optimized-os/docs/how-to/create-configure-instance#running_startup_scripts),
remove the `systemctl start node-problem-detector` step, and then reboot the
Container-Optimized OS instance. If the service was enabled by [metadata](#metadata), make sure the
`google-monitoring-enabled` key is set to `false`.

Last updated (UTC): 2025-09-04.
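
The "Disabling health monitoring" section says to set `google-monitoring-enabled` to `false` but does not show a command. The following is a minimal sketch, assuming the same `gcloud compute instances add-metadata` command used for enabling is also the way to set the key to `false`. The helper name `set_npd_monitoring` and the `DRY_RUN` variable are hypothetical and exist only for this illustration. By default the helper only prints the command so that you can review it before running it.

```shell
# Hypothetical helper (not part of gcloud or Container-Optimized OS).
# Prints, or runs, the gcloud command that sets google-monitoring-enabled
# on a single VM. Arguments: VM name, then "true" or "false".
set_npd_monitoring() {
  vm_name="$1"
  enabled="$2"

  # Reject a missing VM name or any value other than true/false.
  if [ -z "$vm_name" ]; then
    echo "usage: set_npd_monitoring VM_NAME true|false" >&2
    return 2
  fi
  case "$enabled" in
    true|false) ;;
    *) echo "usage: set_npd_monitoring VM_NAME true|false" >&2; return 2 ;;
  esac

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default (assumed convention): print the command instead of running it.
    echo "gcloud compute instances add-metadata ${vm_name} --metadata=google-monitoring-enabled=${enabled}"
  else
    gcloud compute instances add-metadata "${vm_name}" \
        "--metadata=google-monitoring-enabled=${enabled}"
  fi
}

# Dry run: prints the command that would disable monitoring on VM_NAME.
set_npd_monitoring VM_NAME false
```

To apply the change instead of printing it, set `DRY_RUN=0` before calling the helper. Because instance-level metadata takes precedence over project-level metadata, setting the key on the instance should be enough even if the project enables monitoring.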