Monitoring GPU performance

To improve resource utilization, you can track the GPU usage rates of your instances. Once you know these rates, you can, for example, set up managed instance groups that autoscale resources based on demand.

To review GPU metrics using Stackdriver Monitoring, complete the following steps:

  1. On each VM instance, set up the GPU metrics reporting script. This script performs the following tasks:

    • Installs the GPU metrics reporting agent. This agent runs on the instance at regular intervals, collects GPU data, and sends that data to Stackdriver Monitoring.
    • Creates a custom/gpu_utilization metrics field in Stackdriver Monitoring. This field stores GPU-specific data that you can analyze in Stackdriver Monitoring.
  2. View the metrics in Stackdriver Monitoring.
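At its core, the reporting agent samples per-GPU utilization and forwards it to Stackdriver Monitoring. The sketch below is illustrative only, not the actual `report_gpu_metrics.py` code: it parses the kind of per-GPU CSV output produced by `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`.

```python
# Illustrative sketch only -- not the actual report_gpu_metrics.py code.
# Parses per-GPU utilization from the output of:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
def parse_gpu_utilization(nvidia_smi_output):
    """Return one integer utilization percentage per GPU line."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines()
            if line.strip()]

# Two GPUs reporting 37% and 81% utilization:
print(parse_gpu_utilization("37\n81\n"))  # → [37, 81]
```

An agent like this would run the `nvidia-smi` query on a timer and send each parsed sample to Stackdriver Monitoring as a custom metric data point.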

Setting up the GPU metrics reporting script

  1. On each of your VM instances, check that you meet the requirements for the metrics agent.

  2. On each of your VM instances, install the GPU metrics agent. To install the metrics agent, complete the following steps:

    1. Download the GPU metrics reporting scripts.

      git clone https://github.com/GoogleCloudPlatform/tensorflow-inference-tensorrt5-t4-gpu.git
    2. Switch to the metrics_reporting folder.

      cd tensorflow-inference-tensorrt5-t4-gpu/metrics_reporting
    3. Install the dependencies for the metrics agent.

      pip install -r ./requirements.txt
    4. Copy the metrics reporting script to your root directory.

      sudo cp report_gpu_metrics.py /root/
    5. Enable the GPU metrics agent.

      sudo tee /lib/systemd/system/gpu_utilization_agent.service > /dev/null <<-EOH
      [Unit]
      Description=GPU Utilization Metric Agent
      [Service]
      PIDFile=/run/gpu_agent.pid
      ExecStart=/bin/bash --login -c '/usr/bin/python /root/report_gpu_metrics.py'
      User=root
      Group=root
      WorkingDirectory=/
      Restart=always
      [Install]
      WantedBy=multi-user.target
      EOH
      
    6. Reload the system daemon.

      sudo systemctl daemon-reload
    7. Enable the GPU monitoring service.

      sudo systemctl --no-reload --now enable /lib/systemd/system/gpu_utilization_agent.service
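Once the agent is running, it periodically writes samples to the custom/gpu_utilization metric. The following Python sketch shows roughly what each reporting cycle assembles for a Compute Engine instance; the function name, labels, and payload shape are illustrative assumptions, not the actual contents of report_gpu_metrics.py.

```python
# Illustrative sketch -- the payload structure and names are assumptions,
# not the actual report_gpu_metrics.py implementation.
import time

METRIC_TYPE = "custom.googleapis.com/gpu_utilization"

def build_point(utilization_pct, instance_id, zone, now=None):
    """Assemble one time-series data point for a GPU utilization sample."""
    now = now if now is not None else time.time()
    return {
        "metric": {"type": METRIC_TYPE},
        "resource": {
            "type": "gce_instance",
            "labels": {"instance_id": instance_id, "zone": zone},
        },
        "points": [
            {"interval": {"endTime": now},
             "value": {"int64Value": utilization_pct}},
        ],
    }

point = build_point(42, "1234567890", "us-central1-a", now=1000.0)
print(point["metric"]["type"])  # → custom.googleapis.com/gpu_utilization
```

A real agent would send each such point to the Stackdriver Monitoring API, which is what makes the metric searchable in Metrics Explorer, as shown in the next section.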

Reviewing metrics in Stackdriver Monitoring

  1. In the Google Cloud Console, go to the Monitoring page.

  2. If Metrics Explorer is shown in the navigation pane, click Metrics Explorer. Otherwise, select Resources and then select Metrics Explorer.

  3. Search for gpu_utilization.

    Screenshot of Stackdriver Monitoring initiation.

  4. Your GPU utilization should resemble the following output:

    Screenshot of Stackdriver Monitoring running.

  5. (Optional) Set up autoscaling using managed instance groups. To get started, you can review the Setting up a multiple-zone cluster section of the TensorFlow inference workload tutorial.
