If your cluster has nodes that use NVIDIA GPUs, you can monitor GPU utilization, performance, and health by configuring the cluster to send NVIDIA Data Center GPU Manager (DCGM) metrics to Cloud Monitoring. This solution uses Google Cloud Managed Service for Prometheus to collect metrics from NVIDIA DCGM.

This page is for IT administrators and Operators who manage the lifecycle of the underlying tech infrastructure. To learn more about the common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin
To use Google Cloud Managed Service for Prometheus to collect metrics from DCGM, your Google Distributed Cloud deployment must meet the following requirements:

- The NVIDIA DCGM-Exporter tool must already be installed on your cluster. DCGM-Exporter is installed when you install the NVIDIA GPU Operator. For installation instructions, see Install and verify the NVIDIA GPU Operator. (A quick check is shown after this list.)

- Google Cloud Managed Service for Prometheus must be enabled. For instructions, see Enable Google Cloud Managed Service for Prometheus.
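Before you configure scraping, you can confirm that the DCGM exporter Pod is present. This is a minimal sketch, assuming the exporter Pod carries the app=nvidia-dcgm-exporter label that the PodMonitoring resource in this guide selects on; because the namespace depends on how the NVIDIA GPU Operator was installed, the command searches all namespaces:

    kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter --kubeconfig KUBECONFIG

If no Pod is listed, install the NVIDIA GPU Operator before you continue.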
Configure a PodMonitoring resource

Configure a PodMonitoring resource for Google Cloud Managed Service for Prometheus to collect the exported metrics. If you have trouble installing an application or exporter because of restrictive security or organizational policies, we recommend that you consult the open-source documentation for support.
To ingest the metric data emitted by the DCGM Exporter Pod (nvidia-dcgm-exporter), Google Cloud Managed Service for Prometheus uses target scraping. Target scraping and metrics ingestion are configured using Kubernetes custom resources. The managed service uses the PodMonitoring custom resource.

A PodMonitoring custom resource scrapes targets only in the namespace in which it's deployed. To scrape targets in multiple namespaces, deploy the same PodMonitoring custom resource in each namespace, as sketched in the example below.
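For example, assuming the manifest from the following procedure is saved as dcgm-gmp.yaml (a hypothetical filename) and that you want to scrape exporter Pods in two namespaces, you would apply the same manifest once per namespace:

    kubectl apply -n gpu-team-a -f dcgm-gmp.yaml --kubeconfig KUBECONFIG
    kubectl apply -n gpu-team-b -f dcgm-gmp.yaml --kubeconfig KUBECONFIG

Here gpu-team-a and gpu-team-b are placeholder namespace names; substitute the namespaces where your exporter Pods run.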
1. Create a manifest file with the following configuration:
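       # PodMonitoring resource that scrapes the DCGM exporter Pod
       # on its "metrics" port every 30 seconds.
       apiVersion: monitoring.googleapis.com/v1
       kind: PodMonitoring
       metadata:
         name: dcgm-gmp
       spec:
         selector:
           matchLabels:
             app: nvidia-dcgm-exporter
         endpoints:
         - port: metrics
           interval: 30s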
   The selector section in the manifest specifies that the DCGM Exporter Pod, nvidia-dcgm-exporter, is selected for monitoring. This Pod is deployed when you install the NVIDIA GPU Operator.
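2. Deploy the PodMonitoring custom resource:

       kubectl apply -n NAMESPACE -f FILENAME --kubeconfig KUBECONFIG

   Replace the following:

   - NAMESPACE: the namespace into which you're deploying the PodMonitoring custom resource.
   - FILENAME: the path of the manifest file for the PodMonitoring custom resource.
   - KUBECONFIG: the path of the kubeconfig file for the cluster.

3. To verify that the PodMonitoring custom resource is installed in the intended namespace, run the following command:

       kubectl get podmonitoring -n NAMESPACE --kubeconfig KUBECONFIG

   The output is similar to the following:

       NAME       AGE
       dcgm-gmp   3m37s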
Verify the configuration

You can use Metrics Explorer to verify that you correctly configured the DCGM exporter. It might take one or two minutes for Cloud Monitoring to ingest your metrics.
To verify that the metrics are ingested, do the following:
1. In the Google Cloud console, go to the Metrics explorer page.

   If you use the search bar to find this page, select the result whose subheading is Monitoring.
2. Use Prometheus Query Language (PromQL) to specify the data to display on the chart:

   1. In the toolbar of the query-builder pane, click < > PromQL.

   2. Enter your query into the query editor. For example, to chart GPU utilization as reported by the DCGM exporter, use the following query:
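          DCGM_FI_DEV_GPU_UTIL{cluster="CLUSTER_NAME", namespace="NAMESPACE"}

      Replace the following:

      - CLUSTER_NAME: the name of the cluster with nodes that are using GPUs.
      - NAMESPACE: the namespace into which you deployed the PodMonitoring custom resource.

For more information about using PromQL, see PromQL in Cloud Monitoring.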
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-07-31 UTC."],[],[],null,["If your [cluster has nodes that use NVIDIA\nGPUs](/kubernetes-engine/distributed-cloud/bare-metal/docs/how-to/gpu-manual-use), you can monitor the GPU utilization,\nperformance, and health by configuring the cluster to send [NVIDIA Data Center\nGPU Manager (DCGM)](https://developer.nvidia.com/dcgm) metrics to\nCloud Monitoring. This solution uses Google Cloud Managed Service for Prometheus to collect\nmetrics from NVIDIA DCGM.\n\nThis page is for IT administrators and Operators who manage the\nlifecycle of the underlying tech infrastructure. To learn more about common\nroles and example tasks that we reference in Google Cloud content, see [Common\nGKE user roles and\ntasks](/kubernetes-engine/enterprise/docs/concepts/roles-tasks).\n\nBefore you begin\n\nTo use Google Cloud Managed Service for Prometheus to collect metrics from DCGM, your\nGoogle Distributed Cloud deployment must meet the following requirements:\n\n- NVIDIA [DCGM-Exporter tool](https://github.com/NVIDIA/dcgm-exporter) must be\n already installed on your cluster. DCGM-Exporter is installed when you\n install NVIDIA GPU Operator. For NVIDIA GPU Operator installation instructions, see\n [Install and verify the\n NVIDIA GPU Operator](/kubernetes-engine/distributed-cloud/bare-metal/docs/how-to/gpu-manual-use#install_verify).\n\n- Google Cloud Managed Service for Prometheus must be enabled. For instructions, see [Enable\n Google Cloud Managed Service for Prometheus](/kubernetes-engine/distributed-cloud/bare-metal/docs/how-to/application-logging-monitoring#enable_managed_prometheus).\n\nConfigure a PodMonitoring resource\n\nConfigure a PodMonitoring resource for Google Cloud Managed Service for Prometheus to\ncollect the exported metrics. If you are having trouble installing an\napplication or exporter due to restrictive security or organizational policies,\nthen we recommend you consult open-source documentation for support.\n\nTo ingest the metric data emitted by the DCGM Exporter Pod\n(`nvidia-dcgm-exporter`), Google Cloud Managed Service for Prometheus\nuses target scraping. Target scraping and metrics ingestion are configured using\nKubernetes [custom\nresources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).\nThe managed service uses\n[PodMonitoring](https://github.com/GoogleCloudPlatform/prometheus-engine/blob/v0.13.0/doc/api.md#podmonitoring)\ncustom resources.\n\nA PodMonitoring custom resource scrapes targets in the namespace in which it's\ndeployed only. To scrape targets in multiple namespaces, deploy the same\nPodMonitoring custom resource in each namespace.\n\n1. Create a manifest file with the following configuration:\n\n The `selector` section in the manifest specifies that the DCGM Exporter Pod,\n `nvidia-dcgm-exporter`, is selected for monitoring. This Pod is deployed\n when you install the NVIDIA GPU Operator. 
\n\n apiVersion: monitoring.googleapis.com/v1\n kind: PodMonitoring\n metadata:\n name: dcgm-gmp\n spec:\n selector:\n matchLabels:app: nvidia-dcgm-exporter\n endpoints:\n - port: metrics\n interval: 30s\n\n2. Deploy the PodMonitoring custom resource:\n\n kubectl apply -n \u003cvar translate=\"no\"\u003eNAMESPACE\u003c/var\u003e -f \u003cvar translate=\"no\"\u003eFILENAME\u003c/var\u003e --kubeconfig \u003cvar translate=\"no\"\u003eKUBECONFIG\u003c/var\u003e\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eNAMESPACE\u003c/var\u003e: the namespace into which you're\n deploying the PodMonitoring custom resource.\n\n - \u003cvar translate=\"no\"\u003eFILENAME\u003c/var\u003e: the path of the manifest file for the\n PodMonitoring custom resource.\n\n - \u003cvar translate=\"no\"\u003eKUBECONFIG\u003c/var\u003e: the path of the kubeconfig file for the\n cluster.\n\n3. Verify that the PodMonitoring custom resource is installed in the\n intended namespace, run the following command:\n\n kubectl get podmonitoring -n \u003cvar translate=\"no\"\u003eNAMESPACE\u003c/var\u003e --kubeconfig \u003cvar translate=\"no\"\u003eKUBECONFIG\u003c/var\u003e\n\n The output should look similar to the following: \n\n NAME AGE\n dcgm-gmp 3m37s\n\nVerify the configuration\n\nYou can use Metrics Explorer to verify that you correctly configured the\nDCGM exporter. It might take one or two minutes for Cloud Monitoring to\ningest your metrics.\n\nTo verify the metrics are ingested, do the following:\n\n1. In the Google Cloud console, go to the\n *leaderboard* **Metrics explorer** page:\n\n [Go to **Metrics explorer**](https://console.cloud.google.com/monitoring/metrics-explorer)\n\n \u003cbr /\u003e\n\n If you use the search bar to find this page, then select the result whose subheading is\n **Monitoring**.\n2. Use Prometheus Query Language (PromQL) to specify the data to display on the\n chart:\n\n 1. In the toolbar of the query-builder pane, click **\\\u003c \\\u003e PromQL**.\n\n 2. Enter your query into the query editor. For example, to chart the\n average number of seconds CPUs spent in each mode over the past hour,\n use the following query:\n\n DCGM_FI_DEV_GPU_UTIL{cluster=\"\u003cvar translate=\"no\"\u003e\u003cspan class=\"devsite-syntax-s\"\u003eCLUSTER_NAME\u003c/span\u003e\u003c/var\u003e\", namespace=\"\u003cvar translate=\"no\"\u003e\u003cspan class=\"devsite-syntax-s\"\u003eNAMESPACE\u003c/span\u003e\u003c/var\u003e\"}\n\n Replace the following:\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: the name of the cluster with nodes\n that are using GPUs.\n\n - \u003cvar translate=\"no\"\u003eNAMESPACE\u003c/var\u003e: the namespace into which you deployed\n the PodMonitoring custom resource.\n\n For more information about using PromQL, see [PromQL in\n Cloud Monitoring](/monitoring/promql)."]]