NVIDIA Data Center GPU Manager (DCGM)

The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM. The Ops Agent can be configured to collect one of two different sets of metrics by selecting the version of the dcgm receiver:

Version 2 of the dcgm receiver provides a curated set of metrics for monitoring the performance and state of the GPUs attached to a given VM instance.
Version 1 of the dcgm receiver provides a set of profiling metrics meant to be used in combination with the default GPU metrics. For information about the purpose and interpretation of these metrics, see Profiling Metrics in the DCGM feature overview.

For more information about the NVIDIA Data Center GPU Manager, see the DCGM documentation. This integration is compatible with DCGM version 3.1 through 3.3.9.

These metrics are available for Linux systems only. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.

Prerequisites

To collect NVIDIA DCGM metrics, you must do the following:

Install the NVIDIA Datacenter driver.
Install DCGM.
Install the Ops Agent.
- Version 1 metrics: Ops Agent version 2.38.0 or higher. Only Ops Agent version 2.38.0 or versions 2.41.0 or higher are compatible with GPU monitoring. Do not install Ops Agent versions 2.39.0 and 2.40.0 on VMs with attached GPUs. For more information, see Agent crashes and report mentions NVIDIA.
- Version 2 metrics: Ops Agent version 2.51.0 or higher.
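
As a rough sketch, the version rules above can be encoded in a small shell helper. The function names `version_ge` and `ops_agent_supports_dcgm` are hypothetical, the comparison assumes GNU `sort -V` is available, and any version between 2.38.0 and 2.41.0 is treated as unsupported because only 2.38.0 is named explicitly.

```shell
# Hypothetical helper: true when version $1 >= version $2 (assumes GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: $1 is the Ops Agent version, $2 is the dcgm receiver version.
# Returns 0 when the combination satisfies the prerequisites listed above.
ops_agent_supports_dcgm() {
  agent_version="$1"
  receiver_version="$2"
  case "$receiver_version" in
    1)
      # Exactly 2.38.0, or 2.41.0 and higher; 2.39.0 and 2.40.0 are excluded.
      [ "$agent_version" = "2.38.0" ] || version_ge "$agent_version" "2.41.0"
      ;;
    2)
      version_ge "$agent_version" "2.51.0"
      ;;
    *)
      # Unknown receiver version.
      return 2
      ;;
  esac
}
```

For example, `ops_agent_supports_dcgm 2.40.0 1` fails, while `ops_agent_supports_dcgm 2.51.0 2` succeeds. Pass in the agent version string however you obtain it on your system.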

Install DCGM and verify installation

You must install DCGM version 3.1 through 3.3.9 and ensure that it runs as a privileged service. To install DCGM, see Installation in the DCGM documentation.
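
If you script your installation, a range check like the following might help catch an unsupported DCGM release early. This is a minimal sketch: `dcgm_version_supported` is a hypothetical name, the version string is passed in as an argument rather than detected, and the comparison assumes GNU `sort -V`.

```shell
# Hypothetical helper: returns 0 when $1 (for example "3.3.5") lies in the
# supported range 3.1 through 3.3.9, inclusive.
dcgm_version_supported() {
  candidate="$1"
  min_version="3.1"
  max_version="3.3.9"
  # The candidate is in range when the minimum sorts first and the maximum sorts last.
  lowest=$(printf '%s\n%s\n' "$min_version" "$candidate" | sort -V | head -n 1)
  highest=$(printf '%s\n%s\n' "$candidate" "$max_version" | sort -V | tail -n 1)
  [ "$lowest" = "$min_version" ] && [ "$highest" = "$max_version" ]
}
```

With this sketch, `dcgm_version_supported 3.2.5` succeeds and `dcgm_version_supported 4.0.0` fails.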

To verify that DCGM is running correctly, do the following:

Check the status of the DCGM service by running the following command:

sudo service nvidia-dcgm status

If the service is running, the nvidia-dcgm service is listed as active (running). The output resembles the following:

● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
Main PID: 24388 (nv-hostengine)
Tasks: 7 (limit: 14745)
CGroup: /system.slice/nvidia-dcgm.service
       └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Verify that the GPU devices are found by running the following command:

dcgmi discovery --list

If devices are found, the output resembles the following:

1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:00:04.0                                         |
|        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
+--------+----------------------------------------------------------------------+
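
To use the discovery step in a script, you could extract the device count from the first line of the output. The sketch below assumes the output keeps the layout shown above, with a leading line of the form "<count> ... found."; the function name `gpu_count` is hypothetical.

```shell
# Hypothetical helper: reads `dcgmi discovery --list` output on stdin and
# prints the leading device count. Prints nothing if no such line is present.
gpu_count() {
  awk '$1 ~ /^[0-9]+$/ && $3 == "found." { print $1; exit }'
}
```

A possible use is `dcgmi discovery --list | gpu_count`, followed by a check that the printed value is not empty and is greater than zero.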

Configure the Ops Agent for DCGM

Following the guide for Configuring the Ops Agent, add the required elements to collect telemetry from your DCGM service, and restart the agent.

Example configuration

The following commands create the configuration to collect and ingest the receiver version 2 metrics for NVIDIA DCGM and restart the Ops Agent:

# Configures Ops Agent to collect telemetry from the app and restart Ops Agent.
set -e

# Create a backup of the existing file so that existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

sudo service google-cloud-ops-agent restart
sleep 20

If you want to collect only DCGM profiling metrics, then replace the value of the receiver_version field with 1. You can also remove the receiver_version entry entirely; the default version is 1. You can't use both versions at the same time.

After running these commands, you can check that the agent restarted. Run the following command and verify that the sub-agent components "Metrics Agent" and "Logging Agent" are listed as "active (running)":

sudo systemctl status google-cloud-ops-agent"*"
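
If you want to automate this check, one option is to scan the status output for a given sub-agent. The following is a sketch only: it assumes the output uses the same layout as the nvidia-dcgm status example earlier on this page (a unit line containing ".service - " and the description, followed by an "Active:" line), and `subagent_running` is a hypothetical name.

```shell
# Hypothetical helper: reads `systemctl status` output on stdin and returns 0
# when the unit whose description contains $1 is reported as active (running).
subagent_running() {
  awk -v name="$1" '
    # A unit header line starts a new block; remember whether it matches.
    /\.service - / { in_unit = (index($0, name) > 0) }
    in_unit && /Active: active \(running\)/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

For example, `sudo systemctl status google-cloud-ops-agent"*" | subagent_running "Metrics Agent"` returns a zero exit status only when that sub-agent is running.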

If you get an error message like "Unable to connect to DCGM daemon at localhost:5555 on libdcgm.so not Found; Is the DCGM daemon running?", then you have probably installed version 4.0 of the DCGM service. The DCGM shared library was renamed to libdcgm.so.4, which the Ops Agent DCGM receiver doesn't recognize. You must use DCGM version 3.1 through 3.3.9.

If you are using a custom service account instead of the default Compute Engine service account, or if you have a very old Compute Engine VM, then you might need to authorize the Ops Agent.

Configure metrics collection

To ingest metrics from NVIDIA DCGM, you must create a receiver for the metrics that NVIDIA DCGM produces and then create a pipeline for the new receiver.

This receiver does not support the use of multiple instances in the configuration, for example, to monitor multiple endpoints. All such instances write to the same time series, and Cloud Monitoring has no way to distinguish among them.

To configure a receiver for your dcgm metrics, specify the following fields:

Field	Default	Description
`collection_interval`	`60s`	A time duration, such as `30s` or `5m`.
`endpoint`	`localhost:5555`	Address of the DCGM service, formatted as `host:port`.
`receiver_version`	`1`	Either 1 or 2. Version 2 has many more metrics available.
`type`		This value must be `dcgm`.
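
The fields above can be combined into a single receiver definition. As an illustration, the following sketch prints a configuration that sets every field explicitly; the function name `render_dcgm_config` and its argument order are hypothetical, and the defaults mirror the table.

```shell
# Hypothetical helper: prints an Ops Agent metrics configuration for the dcgm receiver.
# Arguments (all optional): receiver version, collection interval, endpoint.
render_dcgm_config() {
  receiver_version="${1:-1}"
  collection_interval="${2:-60s}"
  endpoint="${3:-localhost:5555}"
  cat << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: ${receiver_version}
      collection_interval: ${collection_interval}
      endpoint: ${endpoint}
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
}
```

For example, `render_dcgm_config 2 30s` prints a version 2 receiver that collects every 30 seconds from the default endpoint. Review the output before you write it to /etc/google-cloud-ops-agent/config.yaml, because doing so replaces the existing file.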

What is monitored

The following tables list the metrics that the Ops Agent collects from the NVIDIA DCGM instance. Not all metrics are available for all GPU models. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.

Version 1 metrics

The following metrics are collected by using version 1 of the dcgm receiver.

Metric type	Kind, Type	Monitored resources	Labels
`workload.googleapis.com/dcgm.gpu.profiling.dram_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^‡ `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† Not available on GPU models P100 and P4.

^‡ For L4, the pipe value fp64 is not supported.

Version 2 metrics

The following metrics are collected by using version 2 of the dcgm receiver.

Metric type	Kind, Type	Monitored resources	Labels
`workload.googleapis.com/gpu.dcgm.clock.frequency`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.clock.throttle_duration.time`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid` `violation` ^†
`workload.googleapis.com/gpu.dcgm.codec.decoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.codec.encoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.ecc_errors`	`CUMULATIVE`, `INT64`	gce_instance	`error_type` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.energy_consumption`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bandwidth_utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bytes_used`	`GAUGE`, `INT64`	gce_instance	`gpu_number` `model` `state` `uuid`
`workload.googleapis.com/gpu.dcgm.nvlink.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pcie.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pipe.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^§ `uuid`
`workload.googleapis.com/gpu.dcgm.sm.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.temperature`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† For P100 and P4, only violation values power, thermal, and sync_boost are supported.

^‡ Not available on GPU models P100 and P4.

^§ For L4, the pipe value fp64 is not supported.

GPU metrics

In addition, the built-in configuration for the Ops Agent also collects agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You do not need any additional configuration in the Ops Agent to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see About the gpu metrics. The dcgm receiver version 1 metrics are designed to complement these default metrics, while dcgm receiver version 2 metrics are intended to be standalone.

Verify the configuration

This section describes how to verify that you correctly configured the NVIDIA DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.

To verify that NVIDIA DCGM metrics are being sent to Cloud Monitoring, do the following:

In the Google Cloud console, go to the Metrics explorer page:
Go to Metrics explorer

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
Verify that MQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.

For v1 metrics, enter the following query in the editor, and then click Run query:

fetch gce_instance
| metric 'workload.googleapis.com/dcgm.gpu.profiling.sm_utilization'
| every 1m

For v2 metrics, enter the following query in the editor, and then click Run query:

fetch gce_instance
| metric 'workload.googleapis.com/gpu.dcgm.sm.utilization'
| every 1m

View dashboard

To view your NVIDIA DCGM metrics, you must have a chart or dashboard configured. The NVIDIA DCGM integration includes one or more dashboards for you. Any dashboards are automatically installed after you configure the integration and the Ops Agent has begun collecting metric data.

You can also view static previews of dashboards without installing the integration.

To view an installed dashboard, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Select the Dashboard List tab, and then choose the Integrations category.
Click the name of the dashboard you want to view.

If you have configured an integration but the dashboard has not been installed, then check that the Ops Agent is running. When there is no metric data for a chart in the dashboard, installation of the dashboard fails. After the Ops Agent begins collecting metrics, the dashboard is installed for you.

To view a static preview of the dashboard, do the following:

In the Google Cloud console, go to the Integrations page:
Go to Integrations

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Click the Compute Engine deployment-platform filter.
Locate the entry for NVIDIA DCGM and click View Details.
Select the Dashboards tab to see a static preview. If the dashboard is installed, then you can navigate to it by clicking View dashboard.

For more information about dashboards in Cloud Monitoring, see Dashboards and charts.

For more information about using the Integrations page, see Manage integrations.

DCGM limitations, and pausing profiling

Concurrent usage of DCGM can conflict with usage of some other NVIDIA developer tools, such as Nsight Systems or Nsight Compute. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see Profiling Sampling Rate in the DCGM feature overview.

When you need to use tools like Nsight Systems without significant disruption, you can temporarily pause or resume the metrics collection by using the following commands:

dcgmi profile --pause
dcgmi profile --resume

When profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.
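
Because metrics stop flowing while profiling is paused, it can help to make sure that collection is always resumed, even when the profiling tool exits with an error. The following wrapper is a minimal sketch of that idea; the function name `run_with_profiling_paused` is hypothetical, and it uses only the two dcgmi commands shown above.

```shell
# Hypothetical helper: pauses DCGM profiling, runs the given command, and then
# resumes profiling regardless of whether the command succeeded.
run_with_profiling_paused() {
  dcgmi profile --pause || return 1
  status=0
  # Capture the exit status instead of aborting, so that resume always runs.
  "$@" || status=$?
  dcgmi profile --resume || return 1
  return "$status"
}
```

You would call it as `run_with_profiling_paused <your profiling command and its arguments>`. The wrapper returns the exit status of the wrapped command, or 1 if pausing or resuming fails.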

What's next

For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the Install the Ops Agent to troubleshoot third-party applications video.