Google에서 권장하는 Compute Engine용 원격 분석 수집 솔루션인 운영 에이전트를 사용하여 가상 머신(VM) 인스턴스에서 GPU 사용률 및 GPU 메모리와 같은 측정항목을 추적할 수 있습니다.
운영 에이전트를 사용하면 다음과 같이 GPU VM을 관리할 수 있습니다.
사전 구성된 대시보드로 NVIDIA GPU Fleet 상태를 시각화합니다.
사용률이 낮은 GPU를 파악하고 워크로드를 통합하여 비용을 최적화합니다.
GPU 용량을 확장하거나 기존 GPU를 업그레이드할 시기를 결정하기 위해 추세를 살펴보고 확장을 계획합니다.
NVIDIA Data Center GPU Manager(DCGM) 프로파일링 측정항목을 사용하여 GPU 내의 병목 현상 및 성능 문제를 식별합니다.
이 문서에서는 운영 에이전트를 사용하여 Linux VM에서 GPU를 모니터링하는 절차를 설명합니다. 또는 Linux VM에서 GPU 사용량을 모니터링하기 위해 설정할 수도 있는 보고 스크립트를 GitHub에서 사용할 수 있습니다. compute-gpu-monitoring 모니터링 스크립트를 참조하세요.
이 스크립트는 활발하게 관리되지 않습니다.
운영 에이전트 버전 2.38.0 이상에서는 에이전트가 설치된 Linux VM의 GPU 사용률 및 GPU 메모리 사용률을 자동으로 추적할 수 있습니다. NVIDIA 관리 라이브러리(NVML)에서 가져온 이러한 측정항목은 GPU 및 GPU를 사용하는 모든 프로세스별로 추적됩니다.
운영 에이전트에서 모니터링하는 측정항목을 보려면 에이전트 측정항목: gpu를 참조하세요.
운영 에이전트와 NVIDIA 데이터 센터 GPU 관리자(DCGM)의 통합을 설정할 수도 있습니다. 이러한 통합을 통해 운영 에이전트가 GPU의 하드웨어 카운터를 사용하여 측정항목을 추적할 수 있습니다. DCGM은 GPU 기기 수준 측정항목에 대한 액세스를 제공합니다. 여기에는 스트리밍 멀티 프로세서(SM) 블록 사용률, SM 점유율, SM 파이프 사용률, PCIe 트래픽 속도, NVLink 트래픽 비율이 포함됩니다. 운영 에이전트에서 모니터링하는 측정항목을 보려면 타사 애플리케이션 측정항목: NVIDIA 데이터 센터 GPU 관리자(DCGM)를 참조하세요.
또한 운영 에이전트는 NVIDIA Data Center GPU Manager(DCGM)에 대한 통합을 제공하여 스트리밍 멀티 프로세서(SM) 블록 사용률, SM 점유율, SM 파이프 사용률, PCIe 트래픽 속도, NVLink 트래픽 속도 등의 주요 고급 GPU 측정항목을 수집합니다.
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[[["\u003cp\u003eThe Ops Agent, version 2.38.0 or later, is Google's recommended solution for tracking GPU utilization and memory on Linux virtual machines (VMs) and can manage your GPU VMs.\u003c/p\u003e\n"],["\u003cp\u003eUsing the Ops Agent, you can visualize GPU fleet health, optimize costs, plan scaling, identify bottlenecks with NVIDIA Data Center GPU Manager (DCGM) profiling metrics, and set alerts.\u003c/p\u003e\n"],["\u003cp\u003eThe Ops Agent collects metrics from the NVIDIA Management Library (NVML) and, with optional DCGM integration, can track advanced GPU metrics such as Streaming Multiprocessor utilization and PCIe traffic rate.\u003c/p\u003e\n"],["\u003cp\u003eTo use the Ops Agent, users must ensure their VMs have attached GPUs, installed GPU drivers, and support the Ops Agent with their Linux operating system, in addition to installing the agent.\u003c/p\u003e\n"],["\u003cp\u003eYou can review NVML metrics within the Compute Engine's Observability tab and review DCGM metrics in the Monitoring section, with provided dashboards.\u003c/p\u003e\n"]]],[],null,["# Monitoring GPU performance on Linux VMs\n\nLinux\n\n*** ** * ** ***\n\n| **Tip:** If you want to monitor A4 or A3 Ultra machine types that are deployed using the features provided by Cluster Director, see [Monitor VMs and clusters](/ai-hypercomputer/docs/monitor) in the AI Hypercomputer documentation instead.\n\nYou can track metrics such as GPU utilization and GPU memory from your\nvirtual machine (VM) instances by using the\n[Ops Agent](/stackdriver/docs/solutions/agents/ops-agent), which is\nGoogle's recommended telemetry collection solution for Compute Engine.\nBy using the Ops Agent, you can manage your GPU VMs as follows:\n\n- Visualize the health of your NVIDIA GPU fleet with our pre-configured dashboards.\n- Optimize costs by identifying underutilized GPUs and consolidating workloads.\n- Plan scaling by looking at trends to decide when to expand GPU capacity or upgrade existing GPUs.\n- Use NVIDIA Data Center GPU Manager (DCGM) profiling metrics to identify bottlenecks and performance issues within your GPUs.\n- Set up [managed instance groups (MIGs)](/compute/docs/instance-groups#managed_instance_groups) to autoscale resources.\n- Get alerts on metrics from your NVIDIA GPUs.\n\nThis document covers the procedures for monitoring GPUs on Linux VMs by using\nthe Ops Agent. Alternatively, a reporting script is available on GitHub that can\nalso be setup for monitoring GPU usage on Linux VMs, see\n[`compute-gpu-monitoring` monitoring script](https://github.com/GoogleCloudPlatform/compute-gpu-monitoring/tree/main/linux).\nThis script is not actively maintained.\n\nFor monitoring GPUs on Windows VMs, see\n[Monitoring GPU performance (Windows)](/compute/docs/gpus/monitor-gpus-windows).\n\nOverview\n--------\n\nThe Ops Agent, version 2.38.0 or later, can automatically track GPU\nutilization and GPU memory usage rates on your Linux VMs that have the agent\ninstalled. These metrics, obtained from the NVIDIA Management Library (NVML),\nare tracked per GPU and per process for any process that uses GPUs.\nTo view the metrics that are monitored by the Ops Agent,\nsee [Agent metrics: gpu](/monitoring/api/metrics_opsagent#agent-gpu).\n\nYou can also set up the NVIDIA Data Center GPU Manager (DCGM) integration with\nthe Ops Agent. This integration allows the Ops Agent to track metrics\nusing the hardware counters on the GPU. DCGM provides access to the\nGPU device-level metrics. These include Streaming Multiprocessor (SM)\nblock utilization, SM occupancy, SM pipe utilization, PCIe traffic rate,\nand NVLink traffic rate. To view the metrics monitored by the Ops Agent, see\n[Third-party application metrics: NVIDIA Data Center GPU Manager (DCGM)](/monitoring/api/metrics_opsagent#opsagent-dcgm).\n\nTo review GPU metrics by using the Ops Agent, complete the following steps:\n\n1. On each VM, check that you have met [the requirements](#requirements).\n2. On each VM, [install the Ops Agent](#install-ops-agent).\n3. Optional: On each VM, set up the [NVIDIA Data Center GPU Manager (DCGM) integration](#dcgm).\n4. Review [metrics in Cloud Monitoring](#review-metrics-dashboard).\n\nLimitations\n-----------\n\n- The Ops Agent doesn't track GPU utilization on VMs that use Container-Optimized OS.\n\nRequirements\n------------\n\nOn each of your VMs, check that you meet the following requirements:\n\n- Each VM must have [GPUs attached](/compute/docs/gpus/create-vm-with-gpus).\n- Each VM must have a [GPU driver installed](/compute/docs/gpus/install-drivers-gpu#verify-driver-install).\n- The Linux operating system and version for each of your VM must support the Ops Agent. See the list of [Linux operating systems](/stackdriver/docs/solutions/agents/ops-agent#linux_operating_systems) that support the Ops Agent.\n- Ensure you have `sudo` access to each VM.\n\nInstall the Ops Agent\n---------------------\n\nTo install the Ops Agent, complete the following steps:\n\n1. If you were previously using the\n [`compute-gpu-monitoring` monitoring script](https://github.com/GoogleCloudPlatform/compute-gpu-monitoring/tree/main/linux)\n to track GPU utilization, disable the service before installing the Ops Agent.\n To disable the monitoring script, run the following command:\n\n ```\n sudo systemctl --no-reload --now disable google_gpu_monitoring_agent\n ```\n2. Install the latest version of the Ops Agent. For detailed instructions, see\n [Installing the Ops Agent](/stackdriver/docs/solutions/agents/ops-agent/install-index).\n\n3. After you have installed the Ops agent, if you need to install or upgrade your\n GPU drivers by using the\n [installation scripts provided by Compute Engine](/compute/docs/gpus/install-drivers-gpu#installation_scripts),\n review the *limitations* section.\n\nReview NVML metrics in Compute Engine\n-------------------------------------\n\nYou can review the NVML metrics that the Ops Agent collects from the\n**Observability** tabs for Compute Engine Linux VM instances.\n\nTo view the metrics for a single VM do the following:\n\n1. In the Google Cloud console, go to the **VM instances** page.\n\n [Go to VM instances](https://console.cloud.google.com/compute/instances)\n2. Select a VM to open the **Details** page.\n\n3. Click the **Observability** tab to display information about the VM.\n\n4. Select the **GPU** quick filter.\n\nTo view the metrics for multiple VMs, do the following:\n\n1. In the Google Cloud console, go to the **VM instances** page.\n\n [Go to VM instances](https://console.cloud.google.com/compute/instances)\n2. Click the **Observability** tab.\n\n3. Select the **GPU** quick filter.\n\nOptional: Set up NVIDIA Data Center GPU Manager (DCGM) integration\n------------------------------------------------------------------\n\nThe Ops Agent also provides integration for NVIDIA Data Center GPU Manager\n(DCGM) to collect key advanced GPU metrics such as Streaming Multiprocessor (SM)\nblock utilization, SM occupancy, SM pipe utilization, PCIe traffic rate,\nand NVLink traffic rate.\n\nThese advanced GPU metrics are not collected from NVIDIA P100 and P4 models.\n\nFor detailed instructions on how to setup and use this integration on each VM,\nsee [NVIDIA Data Center GPU Manager (DCGM)](/stackdriver/docs/solutions/agents/ops-agent/third-party-nvidia).\n\nReview DCGM metrics in Cloud Monitoring\n---------------------------------------\n\n1. In the Google Cloud console, go to the **Monitoring \\\u003e Dashboards** page.\n\n [Go to Monitoring](https://console.cloud.google.com/monitoring/dashboards)\n2. Select the **Sample Library** tab.\n\n3. In the filter_list **Filter** field,\n type **NVIDIA** . The\n **NVIDIA GPU Monitoring Overview (GCE and GKE)**\n dashboard displays.\n\n If you have set up the NVIDIA Data Center GPU Manager (DCGM) integration, the\n **NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only)**\n dashboard also displays.\n\n4. For the required dashboard, click **Preview** . The **Sample dashboard preview**\n page displays.\n\n5. From the **Sample dashboard preview** page, click **Import sample dashboard**.\n\n - The **NVIDIA GPU Monitoring Overview (GCE and GKE)**\n dashboard displays the GPU metrics such as GPU utilization, NIC traffic rate,\n and GPU memory usage.\n\n Your GPU utilization display is similar to the following output:\n\n - The\n **NVIDIA GPU Monitoring Advanced DCGM Metrics (GCE Only)**\n dashboard displays key advanced metrics such as SM utilization, SM occupancy,\n SM pipe utilization, PCIe traffic rate, and NVLink traffic rate.\n\n Your Advanced DCGM Metric display is similar to the following output:\n\nWhat's next?\n------------\n\n- To handle GPU host maintenance, see [Handling GPU host maintenance events](/compute/docs/gpus/gpu-host-maintenance).\n- To improve network performance, see [Use higher network bandwidth](/compute/docs/gpus/optimize-gpus)."]]