If your cluster has nodes that use NVIDIA GPUs, you can monitor the GPU utilization, performance, and health by configuring the cluster to send NVIDIA Data Center GPU Manager (DCGM) metrics to Cloud Monitoring. This solution uses Google Cloud Managed Service for Prometheus to collect metrics from NVIDIA DCGM.
This page is for IT administrators and Operators who manage the lifecycle of the underlying tech infrastructure. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin
To use Google Cloud Managed Service for Prometheus to collect metrics from DCGM, your Google Distributed Cloud deployment must meet the following requirements:
The NVIDIA DCGM-Exporter tool must already be installed on your cluster. DCGM-Exporter is installed when you install the NVIDIA GPU Operator. For NVIDIA GPU Operator installation instructions, see Install and verify the NVIDIA GPU Operator.
Google Cloud Managed Service for Prometheus must be enabled. For instructions, see Enable Google Cloud Managed Service for Prometheus.
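To confirm that the DCGM-Exporter Pods from the GPU Operator are running before you continue, you can list them by label. This is a minimal sketch: the gpu-operator namespace is an assumption based on a default NVIDIA GPU Operator installation and might differ in your deployment.

# List the DCGM-Exporter Pods deployed by the NVIDIA GPU Operator.
# The namespace depends on how the GPU Operator was installed.
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter --kubeconfig KUBECONFIG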
Configure a PodMonitoring resource
Configure a PodMonitoring resource for Google Cloud Managed Service for Prometheus to collect the exported metrics. If you are having trouble installing an application or exporter due to restrictive security or organizational policies, then we recommend you consult open-source documentation for support.
To ingest the metric data emitted by the DCGM Exporter Pod (nvidia-dcgm-exporter), Google Cloud Managed Service for Prometheus uses target scraping. Target scraping and metrics ingestion are configured using Kubernetes custom resources. The managed service uses PodMonitoring custom resources.

A PodMonitoring custom resource scrapes targets only in the namespace in which it's deployed. To scrape targets in multiple namespaces, deploy the same PodMonitoring custom resource in each namespace.
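For example, after you create the PodMonitoring manifest in the steps that follow, you can apply it to several namespaces with a small shell loop. This is a sketch only: the namespace names and the manifest filename (dcgm-podmonitoring.yaml) are hypothetical placeholders for your own values.

# Apply the same PodMonitoring manifest to every namespace that has scrape targets.
for ns in NAMESPACE_1 NAMESPACE_2; do
  kubectl apply -n "${ns}" -f dcgm-podmonitoring.yaml --kubeconfig KUBECONFIG
done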
Create a manifest file with the following configuration:
The selector section in the manifest specifies that the DCGM Exporter Pod, nvidia-dcgm-exporter, is selected for monitoring. This Pod is deployed when you install the NVIDIA GPU Operator.

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: dcgm-gmp
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
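Optionally, before you deploy the resource, you can check that the endpoint port name in the manifest (metrics) matches a container port exposed by the exporter Pod. The following check is a sketch; the namespace and the exact port name depend on how the GPU Operator deployed DCGM-Exporter in your cluster.

# Print the container port names of the DCGM-Exporter Pods; one of them should be "metrics".
kubectl get pods -n NAMESPACE -l app=nvidia-dcgm-exporter \
    -o jsonpath='{.items[*].spec.containers[*].ports[*].name}' --kubeconfig KUBECONFIG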
Deploy the PodMonitoring custom resource:
kubectl apply -n NAMESPACE -f FILENAME --kubeconfig KUBECONFIG
Replace the following:
NAMESPACE: the namespace into which you're deploying the PodMonitoring custom resource.
FILENAME: the path of the manifest file for the PodMonitoring custom resource.
KUBECONFIG: the path of the kubeconfig file for the cluster.
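As a concrete illustration with hypothetical values, if you saved the manifest as dcgm-podmonitoring.yaml and want to deploy it into a namespace named gmp-test, the command looks like the following:

kubectl apply -n gmp-test -f dcgm-podmonitoring.yaml --kubeconfig KUBECONFIG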
To verify that the PodMonitoring custom resource is installed in the intended namespace, run the following command:
kubectl get podmonitoring -n NAMESPACE --kubeconfig KUBECONFIG
The output should look similar to the following:
NAME       AGE
dcgm-gmp   3m37s
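If metrics don't show up in Cloud Monitoring later, it can also help to confirm that the exporter is serving DCGM metrics at all. The following sketch assumes the DCGM-Exporter default port of 9400; the Pod name (POD_NAME) and namespace are placeholders for your own values.

# Forward the exporter's metrics port to your workstation.
kubectl port-forward -n NAMESPACE pod/POD_NAME 9400:9400 --kubeconfig KUBECONFIG
# In another terminal, confirm that GPU utilization samples are exposed.
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL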
Verify the configuration
You can use Metrics Explorer to verify that you correctly configured the DCGM exporter. It might take one or two minutes for Cloud Monitoring to ingest your metrics.
To verify the metrics are ingested, do the following:
In the Google Cloud console, go to the Metrics explorer page.
If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Use Prometheus Query Language (PromQL) to specify the data to display on the chart:
In the toolbar of the query-builder pane, click < > PromQL.
Enter your query into the query editor. For example, to chart the GPU utilization reported by DCGM for your cluster's GPU nodes, use the following query:
DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="NAMESPACE"}
Replace the following:
CLUSTER_NAME: the name of the cluster with nodes that are using GPUs.
NAMESPACE: the namespace into which you deployed the PodMonitoring custom resource.
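You can also aggregate the metric in PromQL. For example, the following sketch charts the average utilization across all GPUs that match the same placeholder values; any additional grouping labels depend on your DCGM-Exporter configuration.

avg(DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="NAMESPACE"})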
For more information about using PromQL, see PromQL in Cloud Monitoring.