Monitoring GPU performance


To help you make better use of your resources, you can track the GPU usage rates of your instances. When you know the usage rates, you can perform tasks such as setting up managed instance groups that autoscale resources based on demand.

To review GPU metrics using Cloud Monitoring, complete the following steps:

  1. On each VM instance, set up the GPU metrics reporting script. This script performs the following tasks:

    • Installs the GPU metrics reporting agent. This agent runs at intervals on the instance to collect GPU data, and sends this data to Cloud Monitoring.
    • Creates a custom/gpu_utilization metrics field in Cloud Monitoring. This field stores GPU-specific data that you can analyze in Cloud Monitoring.
  2. Review the metrics in Cloud Monitoring.
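Conceptually, each reporting interval boils down to querying the GPU for its utilization and forwarding that value under the custom/gpu_utilization metric. The shell sketch below walks through one such cycle; because the machine running it may have no GPU, a sample line stands in for the nvidia-smi output, and the actual reporting logic lives in report_gpu_metrics.py rather than here.

```shell
# One reporting cycle, sketched. On a GPU instance the agent effectively runs:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
# which prints a line such as "85 %". A sample line stands in here so the
# sketch runs without a GPU.
SAMPLE_LINE="85 %"

# Strip everything but the digits to get the utilization percentage.
UTILIZATION=$(printf '%s' "$SAMPLE_LINE" | tr -dc '0-9')

# The real script sends this value to Cloud Monitoring; here we just print
# the metric name and the value it would report.
echo "custom/gpu_utilization ${UTILIZATION}"
```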

Setting up the GPU metrics reporting script

  1. On each of your VM instances, check that a GPU is attached and that the GPU drivers are installed. The metrics agent relies on the drivers to read GPU utilization.

  2. On each of your VM instances, install the GPU metrics agent. To install the metrics agent, complete the following steps:

    1. Download the GPU metrics reporting scripts.

      git clone https://github.com/GoogleCloudPlatform/tensorflow-inference-tensorrt5-t4-gpu.git
    2. Switch to the metrics_reporting folder.

      cd tensorflow-inference-tensorrt5-t4-gpu/metrics_reporting
    3. Set up the installation environment for the metrics agent.

      • If you are using Python 2, run the following command:

        pip install -r ./requirements.txt
      • If you are using Python 3, run the following command:

        pip3 install -r ./requirements.txt
    4. Copy the metrics reporting script to the /root directory.

      sudo cp report_gpu_metrics.py /root/
    5. Temporarily make the /lib/systemd/system/ directory writable so that the next step can create the service file:

      sudo chmod 777 /lib/systemd/system/
    6. Create the service file for the GPU metrics agent.

      • If you are using Python 2, run the following command:

        cat <<-EOH > /lib/systemd/system/gpu_utilization_agent.service
        [Unit]
        Description=GPU Utilization Metric Agent
        [Service]
        PIDFile=/run/gpu_agent.pid
        ExecStart=/bin/bash --login -c '/usr/bin/python /root/report_gpu_metrics.py'
        User=root
        Group=root
        WorkingDirectory=/
        Restart=always
        [Install]
        WantedBy=multi-user.target
        EOH
        
      • If you are using Python 3, run the following command:

        cat <<-EOH > /lib/systemd/system/gpu_utilization_agent.service
        [Unit]
        Description=GPU Utilization Metric Agent
        [Service]
        PIDFile=/run/gpu_agent.pid
        ExecStart=/bin/bash --login -c '/opt/conda/bin/python /root/report_gpu_metrics.py'
        User=root
        Group=root
        WorkingDirectory=/
        Restart=always
        [Install]
        WantedBy=multi-user.target
        EOH
        
    7. Reset the permissions on the /lib/systemd/system/ directory:

      sudo chmod 755 /lib/systemd/system/
    8. Reload the system daemon.

      sudo systemctl daemon-reload
    9. Enable and start the GPU monitoring service.

      sudo systemctl --no-reload --now enable /lib/systemd/system/gpu_utilization_agent.service
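The Python 2 and Python 3 unit files in step 6 differ only in the interpreter path on the ExecStart line (/usr/bin/python versus /opt/conda/bin/python). If you are unsure which path your instance provides, the following sketch prints a candidate ExecStart line; note that command -v resolves whatever interpreter is first on your PATH, which is not necessarily the conda interpreter assumed by the Python 3 variant.

```shell
# Resolve an interpreter path for the unit file's ExecStart line.
# Prefer python3 and fall back to python; adjust the result by hand if your
# instance should use /opt/conda/bin/python instead.
PY_BIN=$(command -v python3 || command -v python)
echo "ExecStart=/bin/bash --login -c '${PY_BIN} /root/report_gpu_metrics.py'"
```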

Reviewing metrics in Cloud Monitoring

  1. In the Google Cloud Console, select Monitoring.

    The first time you access any Monitoring functionality for a Google Cloud project, the project is associated with a Workspace. If you've never used Monitoring, then a Workspace is automatically created. Otherwise, a dialog is displayed and you are asked to select between creating a Workspace and adding your project to an existing Workspace.

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. Ensure that the Metric tab is selected.

  4. Search for gpu_utilization.


  5. Review the resulting chart, which plots the GPU utilization reported by your instances.

  6. (Optional) Set up autoscaling using managed instance groups. To get started, you can review the Setting up a multiple-zone cluster section of the TensorFlow inference workload tutorial.
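As a preview of that autoscaling setup, a managed instance group can scale on the custom metric that the agent creates. The sketch below only builds and prints a set-autoscaling command so you can review it first; the group name, zone, replica limit, and 85% utilization target are illustrative placeholders, not values prescribed by this guide.

```shell
# Build (but do not run) a command that autoscales a managed instance group
# on the custom GPU metric. GROUP, ZONE, and the targets are placeholders.
GROUP=my-gpu-instance-group
ZONE=us-central1-a
CMD="gcloud compute instance-groups managed set-autoscaling ${GROUP}"
CMD="${CMD} --zone ${ZONE} --max-num-replicas 4"
CMD="${CMD} --custom-metric-utilization metric=custom.googleapis.com/gpu_utilization,utilization-target=85,utilization-target-type=GAUGE"

# Print the assembled command for review before running it yourself.
echo "${CMD}"
```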

What's next?