Collect and view DCGM metrics

Autopilot Standard

You can monitor GPU utilization, performance, and health by configuring GKE to send NVIDIA Data Center GPU Manager (DCGM) metrics to Cloud Monitoring.

When you enable DCGM metrics, GKE installs the DCGM-Exporter tool, installs Google-managed GPU drivers, and deploys a ClusterPodMonitoring resource to send metrics to Google Cloud Managed Service for Prometheus.

You can also configure self-managed DCGM if you want to customize the set of DCGM metrics or if you have a cluster that does not meet the requirements for managed DCGM metrics.

What is DCGM

NVIDIA Data Center GPU Manager (DCGM) is a set of tools from NVIDIA that lets you manage and monitor NVIDIA GPUs. DCGM provides a comprehensive view of GPU utilization, performance, and health.

GPU utilization metrics indicate how busy the monitored GPU is and whether it is effectively used for processing tasks. This includes metrics for core processing, memory, I/O, and power utilization.
GPU performance metrics indicate how effectively and efficiently a GPU can perform a computational task. This includes metrics for clock speed and temperature.
GPU I/O metrics, such as NVLink and PCIe, measure data transfer bandwidth.

Before you begin

Before you start, make sure you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. By setting default locations, you can avoid errors in gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location.

Requirements for NVIDIA Data Center GPU Manager (DCGM) metrics

To collect NVIDIA Data Center GPU Manager (DCGM) metrics, your GKE cluster must meet the following requirements:

GKE version 1.30.1-gke.1204000 or later
System metrics collection must be enabled
Google Cloud Managed Service for Prometheus managed collection must be enabled
The node pools must be running GKE managed GPU drivers. This means that you must create your node pools using default or latest for --gpu-driver-version.
Profiling metrics are only collected for NVIDIA H100 80GB GPUs.
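
As a rough illustration of the version requirement, the following sketch compares a GKE version string against the minimum version listed above. The function names are hypothetical. The comparison assumes version strings of the form MAJOR.MINOR.PATCH-gke.BUILD and a sort command that supports the -V option (for example, GNU coreutils).

```shell
# Minimum GKE version for managed DCGM metrics, taken from the requirements list.
MIN_DCGM_GKE_VERSION="1.30.1-gke.1204000"

# Hypothetical helper: turn "1.30.1-gke.1204000" into "1.30.1.1204000"
# so that every component is numeric and dot-separated.
normalize_gke_version() {
  printf '%s\n' "$1" | sed 's/-gke\././'
}

# Hypothetical helper: succeeds (exit status 0) if the given version is
# greater than or equal to the minimum version.
gke_version_supports_dcgm() {
  local have min lowest
  have="$(normalize_gke_version "$1")"
  min="$(normalize_gke_version "$MIN_DCGM_GKE_VERSION")"
  # sort -V orders version strings; if the minimum sorts first (or ties),
  # the given version is new enough.
  lowest="$(printf '%s\n%s\n' "$min" "$have" | sort -V | head -n 1)"
  [ "$lowest" = "$min" ]
}
```

For example, gke_version_supports_dcgm "1.31.0-gke.1000000" succeeds, and gke_version_supports_dcgm "1.29.5-gke.1000000" fails. The sketch checks only the version string and none of the other requirements.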

Configure collection of DCGM metrics

You can enable GKE to collect DCGM metrics for an existing cluster using the Google Cloud console, the gcloud CLI, or Terraform.

Console

Create a GPU node pool.

You must use either Default or Latest for GPU Driver Installation.
Go to the Google Kubernetes Engine page in the Google Cloud console.
Click the name of your cluster.
Next to Cloud Monitoring, click the edit icon.
Select SYSTEM and DCGM.
Click Save.

gcloud

Create a GPU node pool.

You must use either default or latest for --gpu-driver-version.

Update your cluster:

gcloud container clusters update CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM

Replace the following:

CLUSTER_NAME: the name of the existing cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
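
After the update completes, you might want to confirm that the DCGM package is listed in the cluster's monitoring configuration. The following is a minimal sketch, not an official procedure. The function name dcgm_monitoring_enabled is hypothetical. It assumes that the cluster description exposes the enabled components under monitoringConfig.componentConfig.enableComponents and that the DCGM package appears there as DCGM.

```shell
# Hypothetical helper: succeeds if DCGM appears among the enabled monitoring
# components of the cluster. Returns 2 if the gcloud call itself fails.
dcgm_monitoring_enabled() {
  local cluster="$1" location="$2" components
  components="$(gcloud container clusters describe "$cluster" \
      --location="$location" \
      --format="value(monitoringConfig.componentConfig.enableComponents)")" || return 2
  # The value() format joins list items with a separator; split on the
  # likely separators and look for an exact DCGM entry.
  printf '%s\n' "$components" | tr ';, ' '\n\n\n' | grep -qx 'DCGM'
}
```

You could call it as dcgm_monitoring_enabled CLUSTER_NAME COMPUTE_LOCATION and check the exit status, using the same placeholder values as in the update command.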

Terraform

To configure the collection of DCGM metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster. For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.

Use DCGM metrics

You can view DCGM metrics by using the dashboards in the Google Cloud console or directly in the cluster overview and cluster details pages. For more information, see View observability metrics.

You can also view metrics by using the Grafana DCGM metrics dashboard. For more information, see Query using Grafana. If you encounter any errors, see API compatibility.
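
Besides dashboards, DCGM metrics can in principle be queried with PromQL over HTTP. The following sketch assumes the Prometheus-compatible query endpoint of Google Cloud Managed Service for Prometheus and a gcloud CLI that is already authenticated. The function name query_dcgm_metric and the project ID argument are hypothetical, and the endpoint path should be checked against the Managed Service for Prometheus documentation before you rely on it.

```shell
# Hypothetical helper: run an instant PromQL query against the
# Prometheus-compatible HTTP API for the given project.
query_dcgm_metric() {
  local project_id="$1" promql="$2" token
  # Assumption: the active gcloud account can read Cloud Monitoring data.
  token="$(gcloud auth print-access-token)" || return 1
  curl --silent --show-error --fail \
    --header "Authorization: Bearer ${token}" \
    --data-urlencode "query=${promql}" \
    "https://monitoring.googleapis.com/v1/projects/${project_id}/location/global/prometheus/api/v1/query"
}
```

For example, query_dcgm_metric PROJECT_ID 'avg by (modelName) (DCGM_FI_DEV_GPU_UTIL)' would request the average GPU utilization per GPU model and print the JSON response.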

Pricing

DCGM metrics use Google Cloud Managed Service for Prometheus to load metrics into Cloud Monitoring. Cloud Monitoring charges for the ingestion of these metrics based on the number of samples ingested. However, these metrics are free of charge for registered clusters that belong to a project that has GKE Enterprise edition enabled.

For more information, see Cloud Monitoring pricing.

Quota

DCGM metrics consume the Time series ingestion requests per minute quota of the Cloud Monitoring API. Before you enable DCGM metrics, check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching the quota limit, you can request a quota increase before you enable the metrics.

DCGM metrics

The Cloud Monitoring metric names in this table must be prefixed with prometheus.googleapis.com/. That prefix has been omitted from the entries in the table.
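
To make the prefix rule concrete, the following sketch builds a full Cloud Monitoring metric name from a table entry. The function name cloud_monitoring_metric_type is hypothetical. Only the prefix and the example entry come from this page.

```shell
# Prefix that the table omits from every Cloud Monitoring metric name.
CLOUD_MONITORING_PREFIX="prometheus.googleapis.com/"

# Hypothetical helper: prepend the prefix to a table entry such as
# "DCGM_FI_DEV_GPU_UTIL/gauge". Entries that already carry the prefix
# are returned unchanged.
cloud_monitoring_metric_type() {
  case "$1" in
    "${CLOUD_MONITORING_PREFIX}"*) printf '%s\n' "$1" ;;
    *) printf '%s%s\n' "$CLOUD_MONITORING_PREFIX" "$1" ;;
  esac
}
```

For example, cloud_monitoring_metric_type "DCGM_FI_DEV_GPU_UTIL/gauge" prints prometheus.googleapis.com/DCGM_FI_DEV_GPU_UTIL/gauge.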

Along with labels on the prometheus_target monitored resource, all collected DCGM metrics on GKE have the following labels attached to them:

GPU labels:

UUID: the GPU device UUID.
device: the GPU device name.
gpu: the integer index of the GPU device on the node. For example, if 8 GPUs are attached, this value ranges from 0 to 7.
modelName: the name of the GPU device model, such as NVIDIA L4.

Kubernetes labels:

container: the name of the Kubernetes container using the GPU device.
namespace: the Kubernetes namespace of the Pod and container using the GPU device.
pod: the Kubernetes Pod using the GPU device.
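
The labels above can be used to narrow a PromQL query to a single workload. The following sketch only assembles query strings and does not run them. The function names, and the namespace and Pod values in the example, are hypothetical. Label values are assumed to contain no double quotes or backslashes, because the sketch does not escape them.

```shell
# Hypothetical helper: build a PromQL selector for one metric, filtered by
# the namespace and pod labels described above.
dcgm_selector() {
  local metric="$1" namespace="$2" pod="$3"
  printf '%s{namespace="%s",pod="%s"}\n' "$metric" "$namespace" "$pod"
}

# Hypothetical helper: wrap the selector so that the result has one series
# per GPU index and GPU model.
dcgm_per_gpu_query() {
  printf 'max by (gpu, modelName) (%s)\n' "$(dcgm_selector "$1" "$2" "$3")"
}
```

For example, dcgm_per_gpu_query DCGM_FI_DEV_GPU_UTIL ml trainer-0 prints max by (gpu, modelName) (DCGM_FI_DEV_GPU_UTIL{namespace="ml",pod="trainer-0"}).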

| PromQL metric name | Cloud Monitoring metric name | Kind, Type, Unit | Monitored resources | Required GKE version | Description |
| --- | --- | --- | --- | --- | --- |
| `DCGM_FI_DEV_FB_FREE` | `DCGM_FI_DEV_FB_FREE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Free Frame Buffer in MB. |
| `DCGM_FI_DEV_FB_TOTAL` | `DCGM_FI_DEV_FB_TOTAL/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Total Frame Buffer of the GPU in MB. |
| `DCGM_FI_DEV_FB_USED` | `DCGM_FI_DEV_FB_USED/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Used Frame Buffer in MB. |
| `DCGM_FI_DEV_GPU_TEMP` | `DCGM_FI_DEV_GPU_TEMP/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Current temperature readings for the device (in °C). |
| `DCGM_FI_DEV_GPU_UTIL` | `DCGM_FI_DEV_GPU_UTIL/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | GPU utilization (in %). |
| `DCGM_FI_DEV_MEM_COPY_UTIL` | `DCGM_FI_DEV_MEM_COPY_UTIL/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Memory utilization (in %). |
| `DCGM_FI_DEV_MEMORY_TEMP` | `DCGM_FI_DEV_MEMORY_TEMP/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Memory temperature for the device (in °C). |
| `DCGM_FI_DEV_POWER_USAGE` | `DCGM_FI_DEV_POWER_USAGE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Power usage for the device (in Watts). |
| `DCGM_FI_DEV_SM_CLOCK` | `DCGM_FI_DEV_SM_CLOCK/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | SM clock frequency (in MHz). |
| `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION/counter` | `CUMULATIVE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | Total energy consumption for the GPU in mJ since the driver was last reloaded. |
| `DCGM_FI_PROF_DRAM_ACTIVE` | `DCGM_FI_PROF_DRAM_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles the device memory interface is active sending or receiving data. |
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | `DCGM_FI_PROF_GR_ENGINE_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of time the graphics engine is active. |
| `DCGM_FI_PROF_NVLINK_RX_BYTES` | `DCGM_FI_PROF_NVLINK_RX_BYTES/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The rate of active NvLink rx (read) data in bytes including both header and payload. |
| `DCGM_FI_PROF_NVLINK_TX_BYTES` | `DCGM_FI_PROF_NVLINK_TX_BYTES/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The rate of active NvLink tx (transmit) data in bytes including both header and payload. |
| `DCGM_FI_PROF_PCIE_RX_BYTES` | `DCGM_FI_PROF_PCIE_RX_BYTES/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The rate of active PCIe rx (read) data in bytes including both header and payload. |
| `DCGM_FI_PROF_PCIE_TX_BYTES` | `DCGM_FI_PROF_PCIE_TX_BYTES/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The rate of active PCIe tx (transmit) data in bytes including both header and payload. |
| `DCGM_FI_PROF_PIPE_FP16_ACTIVE` | `DCGM_FI_PROF_PIPE_FP16_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles that the fp16 pipe is active. |
| `DCGM_FI_PROF_PIPE_FP32_ACTIVE` | `DCGM_FI_PROF_PIPE_FP32_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles that the fp32 pipe is active. |
| `DCGM_FI_PROF_PIPE_FP64_ACTIVE` | `DCGM_FI_PROF_PIPE_FP64_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles that the fp64 pipe is active. |
| `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles that any tensor pipe is active. |
| `DCGM_FI_PROF_SM_ACTIVE` | `DCGM_FI_PROF_SM_ACTIVE/gauge` | `GAUGE`, `DOUBLE`, `1` | prometheus_target | 1.30.1-gke.1204000 | The ratio of cycles an SM has at least 1 warp assigned. |
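
Several of the metrics in the table become more useful when combined. The following sketch assembles two illustrative PromQL expressions as strings: the fraction of frame buffer memory in use, and the average power derived from the cumulative energy counter. The function names and the example arguments are hypothetical. The ratio assumes that the used and total series for a GPU carry identical label sets so that PromQL can match them one to one.

```shell
# Hypothetical helper: fraction of frame buffer in use (0 to 1) for GPUs
# used in the given namespace. Both metrics are reported in MB, so the
# units cancel.
fb_used_ratio_query() {
  local namespace="$1"
  printf 'DCGM_FI_DEV_FB_USED{namespace="%s"} / DCGM_FI_DEV_FB_TOTAL{namespace="%s"}\n' \
    "$namespace" "$namespace"
}

# Hypothetical helper: average power in watts over a time window.
# The counter is in mJ, so its per-second rate is in mW; dividing by
# 1000 converts mW to W.
avg_power_watts_query() {
  local window="$1"
  printf 'rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[%s]) / 1000\n' "$window"
}
```

For example, avg_power_watts_query 5m prints rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) / 1000, which could be compared with the instantaneous readings in DCGM_FI_DEV_POWER_USAGE.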

What's next

Learn how to View observability metrics.