
Monitor your NVIDIA GPUs on Compute Engine with Ops Agent

September 22, 2023

Lujie Duan

Software Engineer

Suffian Khan

Software Engineer, AI+Accelerators, Google Cloud

Organizations using AI and ML for applications such as product recommendations, scientific computing, and gaming often turn to NVIDIA GPUs on Google Cloud for the necessary compute performance. To understand how their workloads behave and to optimize the ML development process, they need to monitor GPU performance metrics. To help, we’re excited to announce that Ops Agent now collects metrics from NVIDIA GPUs on Compute Engine VMs.

Cloud Ops Agent is the Google-recommended telemetry solution for Compute Engine that offers a curated experience for monitoring VM instances. With essential metrics from the NVIDIA Management Library (NVML) and advanced profiling metrics from the NVIDIA Data Center GPU Manager (DCGM), you can now get improved visibility into your NVIDIA GPUs and accelerated workloads.

With Ops Agent, you can:

Visualize the health of your GPU fleet with GPU metrics and out-of-the-box dashboards
Optimize costs by identifying underutilized GPUs and consolidating workloads
Plan scaling by looking at trends to decide when to expand GPU capacity or upgrade existing GPUs
Identify which GPU processes (for example, ML models) account for the most GPU utilization and memory
Use DCGM profiling metrics to identify bottlenecks and performance issues within the GPU
Alert on metrics from your GPUs

Get essential GPU metrics right out of the box

If you use NVIDIA GPUs, you’re probably familiar with the nvidia-smi command, which provides an overview of all GPU devices and the processes running on them. Leveraging the same underlying API in NVML, Ops Agent can collect those essential metrics without extra configuration. This includes metrics for:

GPU utilization
GPU memory usage
Process maximum GPU memory usage
Process lifetime GPU utilization

The process metrics track what workloads are running on the GPUs.
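To see the kind of per-GPU values involved, you can query nvidia-smi yourself. The following is a minimal sketch, not how Ops Agent is implemented: it assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output. The helper names `parse_gpu_csv` and `read_gpus` are hypothetical.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory values are in MiB.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse the output of
    `nvidia-smi --query-gpu=<fields> --format=csv,noheader,nounits`.

    Assumes every field is numeric; a value such as "[N/A]" would raise
    ValueError and needs extra handling in real use.
    """
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in record)
        gpus.append({
            "index": int(index),
            "utilization_percent": float(util),
            "memory_used_mib": float(used),
            "memory_total_mib": float(total),
        })
    return gpus


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a VM with the NVIDIA driver installed, calling `read_gpus()` returns a point-in-time snapshot. Ops Agent collects comparable values continuously and sends them to Cloud Monitoring, so you don't need to maintain a script like this.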

https://storage.googleapis.com/gweb-cloudblog-publish/images/image5_ljKSfoG.max-2000x2000.png

Viewing the GPU memory usage metric in Metrics Explorer in the Google Cloud console

Collect advanced GPU metrics with DCGM

NVIDIA’s DCGM is a suite of tools for managing and monitoring NVIDIA GPUs at scale. It offers an API for advanced, profiling-level metrics on different hardware components, including streaming multiprocessors and interconnects such as NVLink. The Ops Agent DCGM integration collects a curated list of these advanced metrics.

See the documentation for instructions on how to configure Ops Agent to use the DCGM integration.
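As a rough sketch of what that configuration can look like, the snippet below renders a metrics section with a DCGM receiver and a pipeline that uses it. The receiver type `dcgm` and the `collection_interval` and `endpoint` fields reflect our reading of the Ops Agent configuration format, and the default values shown are assumptions; check them against the current documentation before use. The function name `render_dcgm_config` is hypothetical.

```python
def render_dcgm_config(collection_interval="60s", endpoint="localhost:5555"):
    """Return an Ops Agent config fragment (YAML text) that enables a DCGM receiver.

    Field names and defaults are assumptions based on the Ops Agent
    configuration format; verify them against the documentation.
    """
    lines = [
        "metrics:",
        "  receivers:",
        "    dcgm:",
        "      type: dcgm",
        f"      collection_interval: {collection_interval}",
        f"      endpoint: {endpoint}",
        "  service:",
        "    pipelines:",
        "      dcgm:",
        "        receivers:",
        "          - dcgm",
    ]
    return "\n".join(lines) + "\n"


# Print a fragment that samples DCGM metrics every 30 seconds.
print(render_dcgm_config(collection_interval="30s"))
```

You would merge the printed fragment into the Ops Agent configuration file on the VM (typically /etc/google-cloud-ops-agent/config.yaml on Linux) and restart the agent. The DCGM service itself must already be installed and running on the VM.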

https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_w9tsyDB.max-2000x2000.png

Viewing the DCGM PCIe traffic rate in Metrics Explorer in the Google Cloud console

Visualize the health of your GPUs

Together with the other offerings in Google Cloud’s operations suite, Ops Agent makes it easy to query and visualize the GPU metrics it collects. Use our Metrics Explorer query builder or PromQL to construct queries, create custom charts, and add them to dashboards. Our NVIDIA GPU Monitoring dashboard provides a single pane of glass across your GPU fleet, using GPU metrics collected from both GKE GPU nodes and Compute Engine GPU VMs. See the documentation for how to import this dashboard into your project. For the Cloud Monitoring DCGM integration, the DCGM dashboard is automatically installed in your project once DCGM metrics collection begins, delivering a focused view of the GPU profiling metrics.
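If you go the PromQL route, Cloud Monitoring metric types have to be written as PromQL-compatible names. The sketch below applies the mapping as we understand it: the first `/` becomes `:` and every other special character becomes `_`. The metric type `agent.googleapis.com/gpu/utilization` is assumed here for illustration, so confirm it against the Ops Agent metrics list. The helper names are hypothetical.

```python
import re


def to_promql_name(metric_type):
    """Convert a Cloud Monitoring metric type to a PromQL metric name.

    Assumed rule: replace the first "/" with ":", then replace every
    remaining character that is not a letter, digit, or "_" with "_".
    """
    domain, sep, path = metric_type.partition("/")
    if not sep:
        raise ValueError(f"expected '<domain>/<path>', got {metric_type!r}")

    def clean(part):
        return re.sub(r"[^A-Za-z0-9_]", "_", part)

    return f"{clean(domain)}:{clean(path)}"


def windowed_average(metric_type, window="5m"):
    """Build a PromQL expression that averages a gauge over a time window."""
    return f"avg_over_time({to_promql_name(metric_type)}[{window}])"


# Metric type assumed for illustration.
print(windowed_average("agent.googleapis.com/gpu/utilization"))
```

You can paste the resulting expression into the PromQL tab of Metrics Explorer and refine it from there, for example by adding label matchers for a specific VM.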

https://storage.googleapis.com/gweb-cloudblog-publish/images/image6_L0d242N.max-1300x1300.png

Using the NVIDIA GPU Monitoring Overview dashboard to monitor your GPU fleet

https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_MoG52ZB.max-1600x1600.png

Viewing advanced GPU metrics with the DCGM integration dashboard

One unified agent for VM monitoring, logging, and tracing

Ops Agent is a feature-rich, unified telemetry agent with an intuitive configuration interface that lets you do more than just gain visibility into your GPUs:

Automatically collect host metrics such as CPU, memory, and process metrics
Automatically collect system logs such as syslog from Linux VMs and Windows Event Log from Windows VMs
Collect Prometheus metrics and OpenTelemetry Protocol (OTLP) metrics and traces from your workloads
Use the logging files receiver to ingest log files from your machine learning workloads to Cloud Logging
Use metrics processors to filter out unneeded NVML and DCGM metrics so that you keep only the ones you need, and change the collection interval of those metrics in the same configuration file

And with just one agent to manage, you can focus more on taking advantage of your GPU VMs.

Get started today

Interested in trying out Ops Agent? We recently added a one-click option in the Google Cloud console to install Ops Agent when you create a new VM. This lets you try out Ops Agent with its default configuration before deciding how to manage your VMs and Ops Agents at scale.