NVIDIA Data Center GPU Manager (DCGM)

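The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM. You can configure the Ops Agent to collect one of two different sets of metrics by selecting the version of the dcgm receiver:

Version 2 of the dcgm receiver provides a curated set of metrics for monitoring the performance and state of the GPUs attached to a given VM instance.
Version 1 of the dcgm receiver provides a set of profiling metrics meant to be used in combination with the default GPU metrics. For information about the purpose and interpretation of these metrics, see Profiling Metrics in the DCGM feature overview.

For more information about NVIDIA Data Center GPU Manager, see the DCGM documentation. This integration is compatible with DCGM versions 3.1 through 3.3.9.

These metrics are available for Linux systems only. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.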

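Prerequisites

To collect NVIDIA DCGM metrics, you must do the following:

Install the NVIDIA Datacenter drivers.
Install DCGM.
Install the Ops Agent.
- Version 1 metrics: Ops Agent version 2.38.0 or higher. Only Ops Agent version 2.38.0 and versions 2.41.0 or higher are compatible with GPU monitoring. Do not install Ops Agent versions 2.39.0 and 2.40.0 on VMs with attached GPUs. For more information, see Agent crashes and report mentions NVIDIA.
- Version 2 metrics: Ops Agent version 2.51.0 or higher.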
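
The version rules above can be sketched as a small shell check. This is a minimal sketch, not part of the Ops Agent: the function names `version_ge` and `ops_agent_supports_dcgm` are hypothetical, the version string is assumed to be supplied by the caller, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# ops_agent_supports_dcgm AGENT_VERSION [RECEIVER_VERSION]
# Hypothetical helper that encodes the prerequisites listed above:
#   receiver version 1: Ops Agent 2.38.0, or 2.41.0 and higher
#   receiver version 2: Ops Agent 2.51.0 and higher
ops_agent_supports_dcgm() {
  if [ "${2:-1}" = "2" ]; then
    version_ge "$1" "2.51.0"
  else
    # 2.39.0 and 2.40.0 fall between the two accepted cases and are rejected.
    [ "$1" = "2.38.0" ] || version_ge "$1" "2.41.0"
  fi
}

# Example: ops_agent_supports_dcgm 2.40.0 1 || echo "incompatible with GPU monitoring"
```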

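Install DCGM and verify installation

You must install DCGM version 3.1 through 3.3.9 and ensure that it runs as a privileged service. To install DCGM, see Installation in the DCGM documentation.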
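
If you script the installation, the supported range can be checked before the agent is configured. The following is a rough sketch under stated assumptions: `dcgm_version_supported` is a hypothetical name, the caller supplies the installed DCGM version as a plain string, and `sort -V` is available.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# dcgm_version_supported VERSION
# Succeeds only when 3.1 <= VERSION <= 3.3.9, the range this integration supports.
dcgm_version_supported() {
  version_ge "$1" "3.1" && version_ge "3.3.9" "$1"
}

# Example: dcgm_version_supported 4.0.0 || echo "install DCGM 3.1 through 3.3.9"
```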

To verify that DCGM is running correctly, do the following:

Check the status of the DCGM service by running the following command:

sudo service nvidia-dcgm status

If the service is running, the nvidia-dcgm service is listed as active (running). The output resembles the following:

● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
Main PID: 24388 (nv-hostengine)
Tasks: 7 (limit: 14745)
CGroup: /system.slice/nvidia-dcgm.service
       └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
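
To automate this check, the status output can be tested for the `active (running)` line. This is a minimal sketch; `dcgm_service_running` is a hypothetical helper that reads the status text on standard input and matches the line format shown in the sample output above.

```shell
# dcgm_service_running: reads `service nvidia-dcgm status` output on stdin and
# succeeds only when the service is reported as active (running).
dcgm_service_running() {
  grep -q 'Active: active (running)'
}

# Example:
#   sudo service nvidia-dcgm status | dcgm_service_running || echo "DCGM is not running"
```

On hosts that use systemd, `systemctl is-active --quiet nvidia-dcgm` is an alternative that avoids parsing text.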

Verify that the GPU devices are found by running the following command:

dcgmi discovery --list

If devices are found, the output resembles the following:

1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:00:04.0                                         |
|        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
+--------+----------------------------------------------------------------------+
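
In a script, the discovery table can be reduced to a GPU count. The sketch below assumes the table layout shown in the sample output above (one row per GPU starting with a numeric GPU ID); `count_dcgm_gpus` is a hypothetical helper name.

```shell
# count_dcgm_gpus: reads `dcgmi discovery --list` output on stdin and prints
# the number of table rows whose first column is a numeric GPU ID.
count_dcgm_gpus() {
  grep -cE '^\| *[0-9]+ *\|'
}

# Example:
#   [ "$(dcgmi discovery --list | count_dcgm_gpus)" -gt 0 ] || echo "no GPUs found"
```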

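Configure the Ops Agent for DCGM

Following the guide for Configuring the Ops Agent, add the required elements to collect telemetry from your DCGM service, and restart the agent.

Example configuration

The following commands create the configuration to collect and ingest the receiver version 2 metrics for NVIDIA DCGM: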

# Configures Ops Agent to collect telemetry from the app. You must restart the agent for the configuration to take effect.
set -e

# Check if the file exists
if [ ! -f /etc/google-cloud-ops-agent/config.yaml ]; then
  # Create the file if it doesn't exist.
  sudo mkdir -p /etc/google-cloud-ops-agent
  sudo touch /etc/google-cloud-ops-agent/config.yaml
fi

# Create a backup of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

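If you want to collect only DCGM profiling metrics, replace the value of the receiver_version field with 1. You can also remove the receiver_version entry entirely; the default version is 1. You can't use both versions at the same time.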
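
To see which metric set a given configuration selects, including the default that applies when the entry is missing, a rough check like the following can help. It is a sketch only: `dcgm_receiver_version` is a hypothetical helper, and it matches lines with `sed` rather than parsing YAML, which assumes the file contains a single `receiver_version` entry.

```shell
# dcgm_receiver_version: reads an Ops Agent config on stdin and prints the
# receiver_version it sets, or 1 (the documented default) when the key is absent.
dcgm_receiver_version() {
  version=$(sed -n 's/^[[:space:]]*receiver_version:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1)
  echo "${version:-1}"
}

# Example:
#   dcgm_receiver_version < /etc/google-cloud-ops-agent/config.yaml
```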

For these changes to take effect, you must restart the Ops Agent:

To restart the agent, run the following command on your instance:
```
sudo systemctl restart google-cloud-ops-agent
```
To confirm that the agent restarted, run the following command and verify that the components "Metrics Agent" and "Logging Agent" started:
```
sudo systemctl status "google-cloud-ops-agent*"
```

If you receive an error message like "Unable to connect to DCGM daemon at localhost:5555 on libdcgm.so not Found; Is the DCGM daemon running?", then you might have installed version 4.0 of the DCGM service. The DCGM shared library was renamed to libdcgm.so.4, which the Ops Agent DCGM receiver does not recognize. You must use DCGM version 3.1 through 3.3.9.
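
One way to investigate this error is to list which versions of the DCGM shared library the dynamic linker knows about. The sketch below is an assumption-laden illustration: `dcgm_library_majors` is a hypothetical helper, it reads `ldconfig -p` output on standard input, and it assumes the library file name ends in the DCGM major version, as the renamed library described above suggests.

```shell
# dcgm_library_majors: reads `ldconfig -p` output on stdin and prints each
# distinct major version suffix found on a libdcgm.so.N entry, one per line.
dcgm_library_majors() {
  grep -oE 'libdcgm\.so\.[0-9]+' | sed 's/.*\.//' | sort -u
}

# Example:
#   ldconfig -p | dcgm_library_majors
# A result of 4 would point to a DCGM 4.x installation.
```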

If you use a custom service account instead of the default Compute Engine service account, or if you have a very old Compute Engine VM, then you might need to authorize the Ops Agent.

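Configure metrics collection

To ingest metrics from NVIDIA DCGM, you must create a receiver for the metrics that NVIDIA DCGM produces and then create a pipeline for the new receiver.

This receiver does not support the use of multiple instances in the configuration, for example, to monitor multiple endpoints. All such instances write to the same time series, and Cloud Monitoring has no way to distinguish among them.

To configure a receiver for your dcgm metrics, specify the following fields: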

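Field	Default	Description
`collection_interval`	`60s`	A time duration, such as `30s` or `5m`.
`endpoint`	`localhost:5555`	Address of the DCGM service, formatted as `host:port`.
`receiver_version`	`1`	Either 1 or 2. Version 2 has many more metrics available.
`type`		This value must be `dcgm`.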
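
To show how these fields fit together, the following sketch prints a configuration with every field set explicitly. `render_dcgm_config` is a hypothetical helper; its positional parameters and fallback values mirror the defaults in the table above.

```shell
# render_dcgm_config [RECEIVER_VERSION] [COLLECTION_INTERVAL] [ENDPOINT]
# Prints an Ops Agent metrics configuration for a single dcgm receiver.
# Omitted arguments fall back to the documented defaults.
render_dcgm_config() {
  cat << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: ${1:-1}
      collection_interval: ${2:-60s}
      endpoint: ${3:-localhost:5555}
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
}

# Example: version 2 metrics every 30 seconds from the default endpoint.
#   render_dcgm_config 2 30s
```

The output can be written to /etc/google-cloud-ops-agent/config.yaml in the same way as the example configuration. Back up the existing file first, because writing the output replaces its contents.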

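What is monitored

The following tables list the metrics that the Ops Agent collects from the NVIDIA DCGM instance. Not all metrics are available for all GPU models. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.

Version 1 metrics

The following metrics are collected by using version 1 of the dcgm receiver.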

Metric type	Kind, Type	Monitored resources	Labels
`workload.googleapis.com/dcgm.gpu.profiling.dram_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^‡ `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

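^† Not available on GPU models P100 and P4.

^‡ For L4, the pipe value fp64 is not supported.

Version 2 metrics

The following metrics are collected by using version 2 of the dcgm receiver.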

Metric type	Kind, Type	Monitored resources	Labels
`workload.googleapis.com/gpu.dcgm.clock.frequency`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.clock.throttle_duration.time`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid` `violation` ^†
`workload.googleapis.com/gpu.dcgm.codec.decoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.codec.encoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.ecc_errors`	`CUMULATIVE`, `INT64`	gce_instance	`error_type` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.energy_consumption`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bandwidth_utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bytes_used`	`GAUGE`, `INT64`	gce_instance	`gpu_number` `model` `state` `uuid`
`workload.googleapis.com/gpu.dcgm.nvlink.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pcie.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pipe.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^§ `uuid`
`workload.googleapis.com/gpu.dcgm.sm.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.temperature`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† For P100 and P4, only the violation values power, thermal, and sync_boost are supported.

^‡ Not available on GPU models P100 and P4.

^§ For L4, the pipe value fp64 is not supported.

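GPU metrics

In addition, the built-in configuration for the Ops Agent also collects agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You don't need any additional configuration in the Ops Agent to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see About the gpu metrics. The dcgm receiver version 1 metrics are designed to complement these default metrics, while the dcgm receiver version 2 metrics are intended to be used on their own.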

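Verify the configuration

This section describes how to verify that you correctly configured the NVIDIA DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.

To verify that NVIDIA DCGM metrics are being sent to Cloud Monitoring, do the following:

In the Google Cloud console, go to the Metrics explorer page:
Go to Metrics explorer

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
Verify that MQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.

For v1 metrics, enter the following query in the editor, and then click Run query: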

fetch gce_instance
| metric 'workload.googleapis.com/dcgm.gpu.profiling.sm_utilization'
| every 1m

For v2 metrics, enter the following query in the editor, and then click Run query:

fetch gce_instance
| metric 'workload.googleapis.com/gpu.dcgm.sm.utilization'
| every 1m

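View dashboard

To view your NVIDIA DCGM metrics, you must have a chart or dashboard configured. The NVIDIA DCGM integration includes one or more dashboards for you. Any dashboards are automatically installed after you configure the integration and the Ops Agent has begun collecting metric data.

You can also view static previews of dashboards without installing the integration.

To view an installed dashboard, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Select the Dashboard List tab, and then choose the Integrations category.
Click the name of the dashboard you want to view.

If you have configured an integration but the dashboard has not been installed, then check that the Ops Agent is running. When there is no metric data for a chart in the dashboard, installation of the dashboard fails. After the Ops Agent begins collecting metrics, the dashboard is installed for you.

To view a static preview of the dashboard, do the following:

In the Google Cloud console, go to the Integrations page:
Go to Integrations

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Click the Compute Engine deployment-platform filter.
Locate the entry for NVIDIA DCGM and click View Details.
Select the Dashboards tab to see a static preview. If the dashboard is installed, then you can navigate to it by clicking View dashboard.

For more information about dashboards in Cloud Monitoring, see Dashboards and charts.

For more information about using the Integrations page, see Manage integrations.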

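DCGM limitations, and pausing profiling

Concurrent usage of DCGM can conflict with usage of some other NVIDIA developer tools, such as Nsight Systems or Nsight Compute. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see Profiling Sampling Rate in the DCGM feature overview.

When you need to use tools like Nsight Systems without significant disruption from these conflicts, you can temporarily pause or resume the metrics collection by using the following commands: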

dcgmi profile --pause
dcgmi profile --resume
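
If you pause profiling from a script, it is easy to forget the matching resume when the profiled command fails. The wrapper below is a minimal sketch of one way to pair the two commands. The function name `with_dcgm_profiling_paused` and the `DCGMI` override variable are hypothetical, introduced here only so the wrapper can be dry-run without DCGM installed. The sketch does not handle interruption by signals.

```shell
# with_dcgm_profiling_paused COMMAND [ARGS...]
# Pauses DCGM profiling, runs COMMAND, then resumes profiling whether or not
# COMMAND succeeded. Returns COMMAND's exit status.
# DCGMI (hypothetical) overrides the dcgmi binary, for example DCGMI=echo.
with_dcgm_profiling_paused() {
  "${DCGMI:-dcgmi}" profile --pause || return 1
  status=0
  "$@" || status=$?
  "${DCGMI:-dcgmi}" profile --resume
  return "$status"
}

# Example (profiler invocation left to the caller):
#   with_dcgm_profiling_paused my_profiling_command --some-flag
```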

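When profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.

What's next

For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the Install the Ops Agent to troubleshoot third-party applications video.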