NVIDIA Data Center GPU Manager (DCGM)

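The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM. You can configure the Ops Agent to collect one of two different sets of metrics by selecting the version of the dcgm receiver:

Version 2 of the dcgm receiver provides a curated set of metrics for monitoring the performance and state of the GPUs attached to a given VM instance.
Version 1 of the dcgm receiver provides a set of profiling metrics meant to be used in combination with the default GPU metrics. For information about the purpose and interpretation of these metrics, see Profiling Metrics in the DCGM feature overview.

For more information about the NVIDIA Data Center GPU Manager, see the DCGM documentation. This integration is compatible with DCGM version 3.1 and later.

These metrics are available for Linux systems only. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.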

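Prerequisites

To collect NVIDIA DCGM metrics, you must do the following:

Install DCGM.
Install the Ops Agent.
- Version 1 metrics: Ops Agent version 2.38.0 or later. Only Ops Agent version 2.38.0 and versions 2.41.0 or later are compatible with GPU monitoring. Do not install Ops Agent versions 2.39.0 and 2.40.0 on VMs with attached GPUs. For more information, see Agent crashes and report mentions NVIDIA.
- Version 2 metrics: Ops Agent version 2.51.0 or later.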
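
The version rules above can be checked mechanically before you roll out a configuration. The following is a minimal sketch, not part of the Ops Agent: the function names `version_ge` and `dcgm_agent_ok` are hypothetical, the comparison relies on GNU `sort -V`, and it assumes that for version 1 metrics only the exact release 2.38.0 is acceptable below 2.41.0.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that encode the Ops Agent version requirements
# listed in the prerequisites. They are illustrative, not an official tool.

# version_ge A B: succeeds when version A is greater than or equal to B.
# Assumes GNU sort with version-sort support (-V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# dcgm_agent_ok AGENT_VERSION RECEIVER_VERSION
# Succeeds when the given Ops Agent version can collect DCGM metrics for
# the given receiver version (1 or 2). Returns 2 for an unknown receiver.
dcgm_agent_ok() {
  local agent="$1" receiver="$2"
  case "$receiver" in
    1)
      # 2.38.0 is compatible; 2.39.0 and 2.40.0 are not; 2.41.0+ is.
      [ "$agent" = "2.38.0" ] || version_ge "$agent" "2.41.0"
      ;;
    2)
      version_ge "$agent" "2.51.0"
      ;;
    *)
      return 2
      ;;
  esac
}
```

For example, `dcgm_agent_ok 2.40.0 1` fails, while `dcgm_agent_ok 2.51.0 2` succeeds. Pass a plain `MAJOR.MINOR.PATCH` string; package-manager version strings with distribution suffixes would need to be trimmed first.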

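Install DCGM and verify installation

You must install DCGM version 3.1 or later and make sure that it runs as a privileged service. To install DCGM, see the DCGM installation instructions.

To verify that DCGM is running correctly, do the following:

Check the status of the DCGM service by running the following command: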

sudo service nvidia-dcgm status

If the service is running, then the nvidia-dcgm service is listed as active (running). The output resembles the following:

● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
Main PID: 24388 (nv-hostengine)
Tasks: 7 (limit: 14745)
CGroup: /system.slice/nvidia-dcgm.service
       └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Verify that the GPU devices are found by running the following command:

dcgmi discovery --list

If devices are found, then the output resembles the following:

1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:00:04.0                                         |
|        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
+--------+----------------------------------------------------------------------+
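
If you script this check, you might want a GPU count rather than a table to read. The following sketch assumes the output keeps the table layout shown above, with one row per GPU that starts with a numeric GPU ID; the function name `count_dcgm_gpus` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `dcgmi discovery --list` output on stdin and
# prints the number of table rows whose first column is a numeric GPU ID.
# Assumes the table layout shown in the sample output above.
count_dcgm_gpus() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # function usable under "set -e" while still printing "0".
  grep -cE '^\|[[:space:]]*[0-9]+[[:space:]]*\|' || true
}
```

A possible use is `dcgmi discovery --list | count_dcgm_gpus`, comparing the result with the number of GPUs you attached to the VM.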

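Configure the Ops Agent for DCGM

Following the guide for configuring the Ops Agent, add the elements required to collect telemetry from your DCGM service, and then restart the agent.

Example configuration

The following commands create the configuration to collect the receiver version 2 metrics for NVIDIA DCGM and restart the Ops Agent: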

# Configures Ops Agent to collect telemetry from the app and restart Ops Agent.
set -e

# Create a back up of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

sudo service google-cloud-ops-agent restart
sleep 20

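If you want to collect only DCGM profiling metrics, then replace the value of the receiver_version field with 1. You can also remove the receiver_version entry entirely; the default version is 1. You can't use both versions at the same time.

After running these commands, you can check that the agent restarted. Run the following command and verify that the sub-agent components "Metrics Agent" and "Logging Agent" are listed as "active (running)":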

sudo systemctl status google-cloud-ops-agent"*"

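If you use a custom service account instead of the default Compute Engine service account, or if you have a very old Compute Engine VM, then you might need to authorize the Ops Agent.

Configure metrics collection

To ingest metrics from NVIDIA DCGM, you must create a receiver for the metrics that NVIDIA DCGM produces and then create a pipeline for the new receiver.

This receiver does not support the use of multiple instances in the configuration, for example, to monitor multiple endpoints. All such instances write to the same time series, and Cloud Monitoring has no way to distinguish among them.

To configure a receiver for your dcgm metrics, specify the following fields: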

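Field	Default	Description
`collection_interval`	`60s`	A time duration, such as `30s` or `5m`.
`endpoint`	`localhost:5555`	Address of the DCGM service, formatted as `host:port`.
`receiver_version`	`1`	Either 1 or 2. Version 2 provides more metrics.
`type`		This value must be `dcgm`.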
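
As an illustration of these fields, the following sketch writes a candidate configuration that sets every field explicitly. The `30s` interval is an arbitrary example value, and the file is written to a scratch location (an assumption made here so that you can review it before copying it to /etc/google-cloud-ops-agent/config.yaml).

```shell
#!/usr/bin/env bash
# Sketch: write a candidate dcgm receiver configuration to a scratch file.
# The scratch path and the 30s interval are illustrative assumptions;
# endpoint is shown with its documented default value.
candidate="$(mktemp)"
cat > "$candidate" << 'EOF'
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
      collection_interval: 30s
      endpoint: localhost:5555
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
echo "Candidate config written to $candidate"
```

Only one dcgm receiver is defined, which matches the restriction that multiple instances are not supported.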

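What is monitored

The following tables list the metrics that the Ops Agent collects from the NVIDIA DCGM instance. Not all metrics are available for all GPU models. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.

Version 1 metrics

The following metrics are collected by using version 1 of the dcgm receiver.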

Metric type	Kind, Type	Monitored resource	Labels
`workload.googleapis.com/dcgm.gpu.profiling.dram_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^‡ `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† Not available on GPU models P100 and P4.

^‡ For L4, the pipe value fp64 is not supported.

Version 2 metrics

The following metrics are collected by using version 2 of the dcgm receiver.

Metric type	Kind, Type	Monitored resource	Labels
`workload.googleapis.com/gpu.dcgm.clock.frequency`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.clock.throttle_duration.time`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid` `violation` ^†
`workload.googleapis.com/gpu.dcgm.codec.decoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.codec.encoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.ecc_errors`	`CUMULATIVE`, `INT64`	gce_instance	`error_type` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.energy_consumption`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bandwidth_utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bytes_used`	`GAUGE`, `INT64`	gce_instance	`gpu_number` `model` `state` `uuid`
`workload.googleapis.com/gpu.dcgm.nvlink.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pcie.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pipe.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^§ `uuid`
`workload.googleapis.com/gpu.dcgm.sm.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.temperature`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† For P100 and P4, only the violation values power, thermal, and sync_boost are supported.

^‡ Not available on GPU models P100 and P4.

^§ For L4, the pipe value fp64 is not supported.

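GPU metrics

In addition, the built-in configuration for the Ops Agent also collects agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You don't need any additional configuration in the Ops Agent to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see About the gpu metrics. The dcgm receiver version 1 metrics are designed to complement these default metrics, while the dcgm receiver version 2 metrics are intended to be used on their own.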

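Verify the configuration

This section describes how to verify that you correctly configured the NVIDIA DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.

To verify that NVIDIA DCGM metrics are being sent to Cloud Monitoring, do the following:

In the Google Cloud console, go to the Metrics Explorer page:
Go to Metrics Explorer

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
Verify that MQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.

For v1 metrics, enter the following query in the editor, and then click Run query: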

fetch gce_instance
| metric 'workload.googleapis.com/dcgm.gpu.profiling.sm_utilization'
| every 1m

For v2 metrics, enter the following query in the editor, and then click Run query:

fetch gce_instance
| metric 'workload.googleapis.com/gpu.dcgm.sm.utilization'
| every 1m

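View dashboard

To view your NVIDIA DCGM metrics, you must have a chart or dashboard configured. The NVIDIA DCGM integration includes one or more dashboards for you. Any dashboards are automatically installed after you configure the integration and the Ops Agent has begun collecting metric data.

You can also view static previews of dashboards without installing the integration.

To view an installed dashboard, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Select the Dashboard List tab, and then choose the Integrations category.
Click the name of the dashboard that you want to view.

If you have configured an integration but the dashboard has not been installed, then check that the Ops Agent is running. When there is no metric data for a chart in the dashboard, installation of the dashboard fails. After the Ops Agent begins collecting metrics, the dashboard is installed for you.

To view a static preview of the dashboard, do the following:

In the Google Cloud console, go to the Integrations page:
Go to Integrations

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
Click the Compute Engine deployment-platform filter.
Locate the entry for NVIDIA DCGM and click View Details.
Select the Dashboards tab to see a static preview. If the dashboard is installed, then you can navigate to it by clicking View dashboard.

For more information about dashboards in Cloud Monitoring, see Dashboards and charts.

For more information about using the Integrations page, see Manage integrations.

DCGM limitations, and pausing profiling

Using DCGM at the same time as some other NVIDIA developer tools, such as Nsight Systems or Nsight Compute, can cause conflicts. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see Profiling Sampling Rate in the DCGM feature overview.

When you need to use tools like Nsight Systems without significant disruption from such conflicts, you can pause or resume metrics collection as needed by using the following commands: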

dcgmi profile --pause
dcgmi profile --resume

When profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.
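
Because metrics stop flowing while profiling is paused, it helps to make sure that the resume command always runs after the conflicting tool finishes. The following is a minimal sketch of a wrapper that does this; the function name `run_with_dcgm_paused` is hypothetical, and the sketch does not handle the case where the shell itself is killed by a signal before the resume step.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: pause DCGM profiling, run a command, then resume
# profiling whether the command succeeded or failed.
# Usage: run_with_dcgm_paused COMMAND [ARGS...]
run_with_dcgm_paused() {
  # Do not run the command if profiling could not be paused.
  dcgmi profile --pause || return 1

  local status=0
  "$@" || status=$?

  # Always attempt to resume so that metric collection restarts.
  dcgmi profile --resume

  # Report the wrapped command's exit status to the caller.
  return "$status"
}
```

You would pass the profiling tool's full command line as the arguments, for example the Nsight Systems invocation that you normally run. The wrapper returns the exit status of that command.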

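What's next

For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the Install the Ops Agent to troubleshoot third-party applications video.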