NVIDIA Data Center GPU Manager (DCGM)

The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM, including Streaming Multiprocessor (SM) block utilization, SM occupancy, SM pipe utilization, PCIe traffic rate, and NVLink traffic rate. For information about the purpose and interpretation of these metrics, see Profiling Metrics in the DCGM feature overview.

For more information about the NVIDIA Data Center GPU Manager, see the DCGM documentation.

The Ops Agent collects DCGM metrics by using NVIDIA's client library go-dcgm. To collect these metrics, you must install the GPU-enabled preview version of the Ops Agent. These metrics are available for Linux systems only.

This integration is compatible with DCGM versions 2.4.6 to 3.1.3.

Prerequisites

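To collect DCGM metrics, you must install DCGM, install the GPU-enabled Ops Agent, and configure the Ops Agent for DCGM, as described in the following sections.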

Install DCGM and verify installation

You must install a DCGM version from 2.4.6 to 3.1.3 and ensure that it runs as a privileged service. To install DCGM, see Installation in the DCGM documentation.
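
As a minimal sketch of the version requirement above, the following helper checks whether a version string falls inside the supported range. It assumes GNU sort (for the -V option) and treats both ends of the range as inclusive. It takes the version as an argument, because how you obtain the installed version (for example, from your package manager) depends on your system. The name version_in_range is illustrative and is not part of DCGM or the Ops Agent.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if version $1 lies in the range [$2, $3], inclusive.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_in_range() {
  local v="$1" lo="$2" hi="$3"
  # v >= lo when lo sorts first of the pair (lo, v);
  # v <= hi when v sorts first of the pair (v, hi).
  [ "$(printf '%s\n%s\n' "$lo" "$v" | sort -V | head -n1)" = "$lo" ] &&
    [ "$(printf '%s\n%s\n' "$v" "$hi" | sort -V | head -n1)" = "$v" ]
}

# Example: replace 3.1.3 with the DCGM version installed on your VM.
if version_in_range "3.1.3" "2.4.6" "3.1.3"; then
  echo "DCGM version is supported by this integration"
fi
```

Comparing with sort -V rather than plain string comparison matters for versions such as 2.10.0, which sorts after 2.4.6 numerically but before it lexically.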

To verify that DCGM is running correctly, do the following:

  1. Check the status of the DCGM service by running the following command:

    sudo service nvidia-dcgm status
    

    If the service is running, the nvidia-dcgm service is listed as active (running). The output resembles the following:

    ● nvidia-dcgm.service - NVIDIA DCGM service
    Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
    Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
    Main PID: 24388 (nv-hostengine)
    Tasks: 7 (limit: 14745)
    CGroup: /system.slice/nvidia-dcgm.service
           └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    
  2. Verify that the GPU devices are found by running the following command:

    dcgmi discovery --list
    

    If devices are found, the output resembles the following:

    1 GPU found.
    +--------+----------------------------------------------------------------------+
    | GPU ID | Device Information                                                   |
    +--------+----------------------------------------------------------------------+
    | 0      | Name: NVIDIA A100-SXM4-40GB                                          |
    |        | PCI Bus ID: 00000000:00:04.0                                         |
    |        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
    +--------+----------------------------------------------------------------------+
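
If you want to script the device check above, the following sketch extracts the GPU count from the discovery output. It assumes the summary line has the form shown in the sample ("1 GPU found."), and that the plural form reads "GPUs found."; that plural form is an assumption. The name gpu_count_from_discovery is illustrative.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `dcgmi discovery --list` output on stdin
# and print the number of GPUs reported on the summary line.
# Prints 0 if no summary line of the assumed form is found.
gpu_count_from_discovery() {
  awk '
    /^[[:space:]]*[0-9]+ GPUs? found/ { print $1; found = 1; exit }
    END { if (!found) print 0 }
  '
}

# Usage on a VM with DCGM installed:
#   count="$(dcgmi discovery --list | gpu_count_from_discovery)"
#   [ "$count" -gt 0 ] || echo "No GPUs discovered by DCGM" >&2
```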
    

Install the GPU-enabled Ops Agent

To collect these metrics, you must install the GPU-enabled version of the Ops Agent, version 2.25.1+pre.gpu.1. To install the agent and verify that it is running, use the following procedure:

  1. Open a terminal connection to your VM instance using SSH or a similar tool and ensure you have sudo access.

  2. Change to a directory you have write access to, for example, your home directory.

  3. Install the Ops Agent by running the following commands:

    curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
    sudo REPO_SUFFIX=20230214-1.1.1 bash add-google-cloud-ops-agent-repo.sh --also-install
    

Configure the Ops Agent for DCGM

Following the guide for Configuring the Ops Agent, add the required elements to collect telemetry from your DCGM service, and restart the agent.

Example configuration

The following commands create the configuration to collect and ingest telemetry for DCGM and restart the Ops Agent:

# Configure the Ops Agent to collect DCGM telemetry, then restart the Ops Agent.
set -e

# Create a backup of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

sudo systemctl restart google-cloud-ops-agent

After running these commands, you can check that the agent restarted. Run the following command and verify that the sub-agent components "Metrics Agent" and "Logging Agent" are listed as "active (running)":

sudo systemctl status google-cloud-ops-agent"*"
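
To automate this check, one option is to count the units that report "active (running)" in the status output. The sketch below assumes the status lines follow the "Active: active (running)" form shown earlier for the nvidia-dcgm service. The name count_running_units is illustrative.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `systemctl status` output on stdin and print
# how many units report "Active: active (running)".
count_running_units() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that exit code.
  grep -c 'Active: active (running)' || true
}

# Usage on the VM: expect at least 2 (the Metrics Agent and the Logging Agent).
#   sudo systemctl status --no-pager "google-cloud-ops-agent*" | count_running_units
```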

If you are using a very old Compute Engine VM or a custom service account instead of the default Compute Engine service account, you might need to authorize the Ops Agent.

Configure metrics collection

To ingest metrics from DCGM, you must create receivers for the metrics that DCGM produces and then create a pipeline for the new receivers.

To configure a receiver for your dcgm metrics, specify the following fields:

Field                Default         Description
collection_interval  60s             A time duration, such as 30s or 5m.
endpoint             localhost:5555  Address of the DCGM service, formatted as host:port.
type                                 This value must be dcgm.
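
As a sketch of how these fields fit into the configuration, the following helper prints a config with the dcgm receiver's collection_interval and endpoint set explicitly. The function name render_dcgm_config and the 30s value in the usage line are illustrative; the field names and defaults come from the table above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print an Ops Agent config with a customized dcgm receiver.
# $1: collection interval (defaults to 60s)
# $2: DCGM endpoint as host:port (defaults to localhost:5555)
render_dcgm_config() {
  local interval="${1:-60s}" endpoint="${2:-localhost:5555}"
  cat << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      collection_interval: ${interval}
      endpoint: ${endpoint}
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
}

# Usage: collect every 30 seconds from the default endpoint, then restart the agent.
#   render_dcgm_config 30s | sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null
#   sudo systemctl restart google-cloud-ops-agent
```

Like the example configuration, this overwrites config.yaml, so back up the existing file first if it contains other settings.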

What is monitored

The following table provides the list of metrics that the Ops Agent collects from the DCGM service.

Metric type                                         Kind, Type     Monitored resources  Labels
workload.googleapis.com/dcgm.gpu.dram_utilization   GAUGE, DOUBLE  gce_instance         gpu_number, model, uuid
(metric type missing)                               GAUGE, INT64   gce_instance         direction, gpu_number, model, uuid
workload.googleapis.com/dcgm.gpu.pcie_traffic_rate  GAUGE, INT64   gce_instance         direction, gpu_number, model, uuid
workload.googleapis.com/dcgm.gpu.pipe_utilization   GAUGE, DOUBLE  gce_instance         gpu_number, model, pipe, uuid
workload.googleapis.com/dcgm.gpu.sm_occupancy       GAUGE, DOUBLE  gce_instance         gpu_number, model, uuid
workload.googleapis.com/dcgm.gpu.sm_utilization     GAUGE, DOUBLE  gce_instance         gpu_number, model, uuid

In addition, the built-in configuration for the GPU-enabled Ops Agent also collects agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You do not need any additional configuration in the Ops Agent to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see The nvml receiver.

Verify the configuration

This section describes how to verify that you correctly configured the DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.

To verify that the metrics are ingested, go to Metrics Explorer and run the following query in the MQL tab:

fetch gce_instance
| metric 'workload.googleapis.com/dcgm.gpu.sm_utilization'
| every 1m

View dashboard

To view your NVIDIA GPU metrics, you must have a chart or dashboard configured. Cloud Monitoring includes the Sample Library, which provides a set of sample dashboards for integrations, including NVIDIA GPU metrics. For information, see Review metrics in Cloud Monitoring.

You can install the dashboards only after the Ops Agent has begun collecting the metrics. If there is no metric data for a chart in the dashboard, installation of the dashboard fails.

DCGM limitations, and pausing profiling

Concurrent usage of DCGM can conflict with usage of some other NVIDIA developer tools, such as Nsight Systems or Nsight Compute. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see Profiling Sampling Rate in the DCGM feature overview.

When you need to use tools like Nsight Systems without significant disruption, you can temporarily pause metrics collection and resume it afterward by using the following commands:

dcgmi profile --pause
dcgmi profile --resume

When profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.
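
To make sure collection is resumed after a profiling session, you can wrap the session in a small function. The following is a minimal sketch that uses only the two dcgmi commands shown above. The name run_with_profiling_paused and the nsys command in the usage line are illustrative.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: pause DCGM profiling, run the given command,
# then resume profiling whether or not the command succeeded.
run_with_profiling_paused() {
  dcgmi profile --pause || return 1
  local status=0
  # Capture the exit status without aborting, so the resume step always runs.
  "$@" || status=$?
  dcgmi profile --resume
  return "$status"
}

# Usage (the wrapped command is an example; substitute your own tool invocation):
#   run_with_profiling_paused nsys profile ./my_app
```

The wrapper returns the exit status of the wrapped command, so it can be used in scripts that check whether the profiling run succeeded.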

What's next

For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the Install the Ops Agent to troubleshoot third-party applications video.