NVIDIA Data Center GPU Manager (DCGM)

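The NVIDIA Data Center GPU Manager integration collects key advanced GPU metrics from DCGM. You can configure the Ops Agent to collect one of two different sets of metrics by selecting the version of the dcgm receiver:

Version 2 of the dcgm receiver provides a curated set of metrics for monitoring the performance and state of the GPUs attached to a given VM instance.
Version 1 of the dcgm receiver provides a set of profiling metrics that are meant to be used together with the default GPU metrics. For the purpose and interpretation of these metrics, see "Profiling Metrics" in the DCGM feature overview.

For more information about NVIDIA Data Center GPU Manager, see the DCGM documentation. This integration is compatible with DCGM versions 3.1 through 3.3.9.

These metrics are available on Linux systems only. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.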

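Prerequisites

To collect NVIDIA DCGM metrics, complete the following steps:

Install the NVIDIA Datacenter drivers.
Install DCGM.
Install the Ops Agent.
- Version 1 metrics: Ops Agent version 2.38.0 or later. Only Ops Agent version 2.38.0, or version 2.41.0 or later, is compatible with GPU monitoring. Do not install Ops Agent versions 2.39.0 and 2.40.0 on VMs with attached GPUs. For more information, see "Agent crashes and report mentions NVIDIA".
- Version 2 metrics: Ops Agent version 2.51.0 or later.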
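
The version rules above can be expressed as a small shell check. This is a minimal sketch, not part of the Ops Agent: the function names `version_ge` and `ops_agent_supports_dcgm` are hypothetical, the caller supplies the installed Ops Agent version as a string, and the comparison assumes that GNU `sort -V` is available.

```shell
# Hypothetical helper: succeeds if version $1 >= version $2.
# Assumes GNU coreutils `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: succeeds if Ops Agent version $1 can collect
# metrics from dcgm receiver version $2 (1 or 2) on a VM with GPUs.
ops_agent_supports_dcgm() {
  case "$2" in
    1)
      # 2.38.0, or 2.41.0 and later; 2.39.0 and 2.40.0 are excluded.
      [ "$1" = "2.38.0" ] || version_ge "$1" "2.41.0"
      ;;
    2)
      version_ge "$1" "2.51.0"
      ;;
    *)
      echo "receiver version must be 1 or 2" >&2
      return 2
      ;;
  esac
}
```

For example, `ops_agent_supports_dcgm 2.40.0 1` fails, while `ops_agent_supports_dcgm 2.51.0 2` succeeds.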

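Install DCGM and verify the installation

You must install DCGM version 3.1 through 3.3.9 and make sure that it runs as a privileged service. To install DCGM, see "Installation" in the DCGM documentation.

To verify that DCGM is running correctly, do the following:

Check the status of the DCGM service by running the following command: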

sudo service nvidia-dcgm status

If the service is running, the nvidia-dcgm service is listed as active (running). The output resembles the following:

● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; disabled; vendor preset: enabled)
Active: active (running) since Sat 2023-01-07 15:24:29 UTC; 3s ago
Main PID: 24388 (nv-hostengine)
Tasks: 7 (limit: 14745)
CGroup: /system.slice/nvidia-dcgm.service
       └─24388 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Verify that the GPU devices are found by running the following command:

dcgmi discovery --list

If devices are found, the output resembles the following:

1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:00:04.0                                         |
|        | Device UUID: GPU-a2d9f5c7-87d3-7d57-3277-e091ad1ba957                |
+--------+----------------------------------------------------------------------+
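
When you script this check, the device count can be read from the first line of the output. The following is a minimal sketch that assumes the output keeps the `1 GPU found.` shape shown above; the plural form `GPUs found.` for several devices is an assumption, and `dcgm_gpu_count` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `dcgmi discovery --list` output on stdin
# and prints the number from the "N GPU(s) found." line.
# Prints 0 if no such line is present.
dcgm_gpu_count() {
  awk '/^[0-9]+ GPUs? found\./ { print $1; found = 1; exit }
       END { if (!found) print 0 }'
}

# Usage (requires DCGM to be installed):
#   dcgmi discovery --list | dcgm_gpu_count
```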

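Configure the Ops Agent for DCGM

Following the guide to configure the Ops Agent, add the required elements to collect telemetry from the DCGM service, and then restart the agent.

Example configuration

The following commands create the configuration to collect and ingest the receiver version 2 metrics for NVIDIA DCGM: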

# Configures Ops Agent to collect telemetry from the app. You must restart the agent for the configuration to take effect.
set -e

# Check if the file exists
if [ ! -f /etc/google-cloud-ops-agent/config.yaml ]; then
  # Create the file if it doesn't exist.
  sudo mkdir -p /etc/google-cloud-ops-agent
  sudo touch /etc/google-cloud-ops-agent/config.yaml
fi

# Create a backup of the existing file so existing configurations are not lost.
sudo cp /etc/google-cloud-ops-agent/config.yaml /etc/google-cloud-ops-agent/config.yaml.bak

# Configure the Ops Agent.
sudo tee /etc/google-cloud-ops-agent/config.yaml > /dev/null << EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: 2
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF

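To collect only the DCGM profiling metrics, replace the value of the receiver_version field with 1. You can also remove the receiver_version entry entirely; the default version is 1. You can't use both versions at the same time.

For these changes to take effect, you must restart the Ops Agent.

To restart the agent, run the following command on your instance: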
```
sudo systemctl restart google-cloud-ops-agent
```
To confirm that the agent restarted, run the following command and verify that the "Metrics Agent" and "Logging Agent" components started:
```
sudo systemctl status "google-cloud-ops-agent*"
```

If you receive an error message such as "Unable to connect to DCGM daemon at localhost:5555 on libdcgm.so not Found; Is the DCGM daemon running?", you might have installed version 4.0 of the DCGM service. In that version the DCGM shared library was renamed to libdcgm.so.4, which the Ops Agent dcgm receiver doesn't recognize. You must use DCGM version 3.1 through 3.3.9.
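
To see which major version of the DCGM shared library is installed, you can inspect the dynamic linker cache. The following is a minimal sketch that assumes the library is registered with `ldconfig` and follows the `libdcgm.so.N` naming mentioned above; `dcgm_lib_majors` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the
# distinct major versions of libdcgm.so found there, one per line.
# Prints nothing if the library is not listed.
dcgm_lib_majors() {
  grep -o 'libdcgm\.so\.[0-9][0-9]*' | sed 's/.*\.so\.//' | sort -u
}

# Usage:
#   ldconfig -p | dcgm_lib_majors
# A result of 4 suggests DCGM 4.x, which the dcgm receiver does not support.
```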

If you use a custom service account instead of the default Compute Engine service account, or if you have a very old Compute Engine VM, you might need to authorize the Ops Agent.

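Configure metrics collection

To ingest metrics from NVIDIA DCGM, you must create a receiver for the metrics that NVIDIA DCGM produces and then create a pipeline for the new receiver.

This receiver does not support the use of multiple instances in the configuration, for example, to monitor multiple endpoints. All such instances write to the same time series, and Cloud Monitoring has no way to distinguish among them.

To configure a receiver for your dcgm metrics, specify the following fields:

Field	Default	Description
`collection_interval`	`60s`	A time duration, such as `30s` or `5m`.
`endpoint`	`localhost:5555`	Address of the DCGM service, formatted as `host:port`.
`receiver_version`	`1`	Either 1 or 2. Version 2 makes more metrics available.
`type`		This value must be `dcgm`.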
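
As an illustration of these fields, the following sketch prints a configuration that sets all four of them. The function name `render_dcgm_config` and its argument order are hypothetical, and the defaults mirror the table above. The function only writes to standard output, so you decide whether and how to merge the result into /etc/google-cloud-ops-agent/config.yaml.

```shell
# Hypothetical helper: prints an Ops Agent metrics configuration for the
# dcgm receiver.
#   $1 = receiver_version (default 1)
#   $2 = collection_interval (default 60s)
#   $3 = endpoint (default localhost:5555)
render_dcgm_config() {
  rv=${1:-1}
  interval=${2:-60s}
  endpoint=${3:-localhost:5555}

  # receiver_version accepts only 1 or 2.
  case "$rv" in
    1|2) ;;
    *) echo "receiver_version must be 1 or 2" >&2; return 1 ;;
  esac

  cat <<EOF
metrics:
  receivers:
    dcgm:
      type: dcgm
      receiver_version: $rv
      collection_interval: $interval
      endpoint: $endpoint
  service:
    pipelines:
      dcgm:
        receivers:
          - dcgm
EOF
}
```

For example, `render_dcgm_config 2 30s` prints a version 2 configuration that collects every 30 seconds from the default endpoint.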

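What is monitored

The following tables list the metrics that the Ops Agent collects from the NVIDIA DCGM instance. Not all metrics are available for all GPU models. Profiling metrics are not collected from NVIDIA GPU models P100 and P4.

Version 1 metrics

The following metrics are collected when you use version 1 of the dcgm receiver.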

Metric type	Kind, type	Monitored resource	Labels
`workload.googleapis.com/dcgm.gpu.profiling.dram_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.nvlink_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pcie_traffic_rate` ^†	`GAUGE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.pipe_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^‡ `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_occupancy` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/dcgm.gpu.profiling.sm_utilization` ^†	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† Not available on GPU models P100 and P4.

^‡ On L4, the pipe value fp64 is not supported.

Version 2 metrics

The following metrics are collected when you use version 2 of the dcgm receiver.

Metric type	Kind, type	Monitored resource	Labels
`workload.googleapis.com/gpu.dcgm.clock.frequency`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.clock.throttle_duration.time`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid` `violation` ^†
`workload.googleapis.com/gpu.dcgm.codec.decoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.codec.encoder.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.ecc_errors`	`CUMULATIVE`, `INT64`	gce_instance	`error_type` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.energy_consumption`	`CUMULATIVE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bandwidth_utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.memory.bytes_used`	`GAUGE`, `INT64`	gce_instance	`gpu_number` `model` `state` `uuid`
`workload.googleapis.com/gpu.dcgm.nvlink.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pcie.io` ^‡	`CUMULATIVE`, `INT64`	gce_instance	`direction` `gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.pipe.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `pipe` ^§ `uuid`
`workload.googleapis.com/gpu.dcgm.sm.utilization` ^‡	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.temperature`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`
`workload.googleapis.com/gpu.dcgm.utilization`	`GAUGE`, `DOUBLE`	gce_instance	`gpu_number` `model` `uuid`

^† On P100 and P4, only the violation values power, thermal, and sync_boost are supported.

^‡ Not available on GPU models P100 and P4.

^§ On L4, the pipe value fp64 is not supported.

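GPU metrics

In addition, the built-in configuration of the Ops Agent collects the agent.googleapis.com/gpu metrics, which are reported by the NVIDIA Management Library (NVML). You don't need any additional Ops Agent configuration to collect these metrics, but you must create your VM with attached GPUs and install the GPU driver. For more information, see "About the gpu metrics". The dcgm receiver version 1 metrics are designed to complement these default metrics, while the dcgm receiver version 2 metrics are intended to stand alone.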

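Verify the configuration

This section describes how to verify that you correctly configured the NVIDIA DCGM receiver. It might take one or two minutes for the Ops Agent to begin collecting telemetry.

To verify that NVIDIA DCGM metrics are being sent to Cloud Monitoring, do the following:

In the Google Cloud console, go to the Metrics Explorer page:
Go to Metrics Explorer

If you use the search bar to find this page, select the result whose subheading is Monitoring.
In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.

For version 1 metrics, enter the following query in the editor, and then click Run query: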
```
{"workload.googleapis.com/dcgm.gpu.profiling.sm_utilization", monitored_resource="gce_instance"}
```
For version 2 metrics, enter the following query in the editor, and then click Run query:
```
{"workload.googleapis.com/gpu.dcgm.sm.utilization", monitored_resource="gce_instance"}
```
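
If you build these queries in scripts, a small helper can generate the selector text. The following is a minimal sketch: `dcgm_promql_selector` is a hypothetical function name, and the optional `gpu_number` matcher assumes that the metric labels listed in the tables above can be used as ordinary PromQL label matchers.

```shell
# Hypothetical helper: prints a PromQL selector for a DCGM metric.
#   $1 = full metric type
#   $2 = optional gpu_number label value to filter on
dcgm_promql_selector() {
  if [ -n "${2:-}" ]; then
    printf '{"%s", monitored_resource="gce_instance", gpu_number="%s"}\n' "$1" "$2"
  else
    printf '{"%s", monitored_resource="gce_instance"}\n' "$1"
  fi
}

# Usage:
#   dcgm_promql_selector workload.googleapis.com/gpu.dcgm.sm.utilization 0
```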

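View dashboard

To view your NVIDIA DCGM metrics, you must have a chart or dashboard configured. The NVIDIA DCGM integration includes one or more dashboards. After you configure the integration and the Ops Agent begins collecting metric data, all dashboards are installed automatically.

You can also view static previews of the dashboards without installing the integration.

To view an installed dashboard, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, select the result whose subheading is Monitoring.
Select the Dashboard List tab, and then choose the Integrations category.
Click the name of the dashboard that you want to view.

If you have configured the integration but the dashboard has not been installed, check that the Ops Agent is running. When a chart in the dashboard has no metric data, installation of the dashboard fails. After the Ops Agent begins collecting metrics, the dashboard is installed for you.

To view a static preview of the dashboard, do the following:

In the Google Cloud console, go to the Integrations page:
Go to Integrations

If you use the search bar to find this page, select the result whose subheading is Monitoring.
Click the Compute Engine deployment-platform filter.
Locate the entry for NVIDIA DCGM and click View Details.
Select the Dashboards tab to see a static preview. If the dashboard is installed, you can go to it by clicking View dashboard.

For more information about dashboards in Cloud Monitoring, see "Dashboards and charts".

For more information about using the Integrations page, see "Manage integrations".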

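DCGM limitations, and pausing profiling

Concurrent use of DCGM can conflict with other NVIDIA developer tools, such as Nsight Systems or Nsight Compute. This limitation applies to NVIDIA A100 and earlier GPUs. For more information, see "Profiling Sampling Rate" in the DCGM feature overview.

When you need to use a tool such as Nsight Systems without significant disruption, you can pause or resume metrics collection by using the following commands: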

dcgmi profile --pause
dcgmi profile --resume
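
To avoid leaving profiling paused by accident, the two commands can be wrapped around the tool invocation. The following is a minimal sketch: `run_with_dcgm_paused` is a hypothetical function name, and the wrapper does not handle interruption by signals.

```shell
# Hypothetical wrapper: pause DCGM profiling, run the given command,
# then resume profiling even if the command fails.
# Returns the exit status of the wrapped command.
run_with_dcgm_paused() {
  dcgmi profile --pause || return 1
  rc=0
  "$@" || rc=$?
  dcgmi profile --resume
  return "$rc"
}

# Usage (./my_app is a placeholder for your own workload):
#   run_with_dcgm_paused nsys profile ./my_app
```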

While profiling is paused, none of the DCGM metrics that the Ops Agent collects are emitted from the VM.

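What's next

For a walkthrough on how to use Ansible to install the Ops Agent, configure a third-party application, and install a sample dashboard, see the "Install the Ops Agent to troubleshoot third-party applications" video.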