Monitoring GPU performance on Linux VMs


To make better use of your resources, you can track the GPU usage rates of your virtual machine (VM) instances.

When you know the GPU usage rates, you can perform tasks such as setting up managed instance groups that can be used to autoscale resources.

For Windows VMs, see Monitoring GPU performance (Windows).

To review GPU metrics using Cloud Monitoring, complete the following steps:

  1. On each VM, set up the GPU metrics reporting script. This script installs the GPU metrics reporting agent. This agent runs at intervals on the VM to collect GPU data, and sends this data to Cloud Monitoring.

  2. Review the metrics in Cloud Monitoring.

Set up the GPU metrics reporting script

Requirements

On each of your VMs, check that you meet the following requirements:

  • Each VM must have GPUs attached.
  • Each VM must have a GPU driver installed.
  • Each VM must have Python 3.6 or newer installed.
  • Each VM must have the virtualenv and pip utilities installed.
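
The requirements above can be checked with a short script. The following is a minimal sketch, not part of the monitoring agent: `version_at_least` and `check_tool` are hypothetical helper names, and the script assumes NVIDIA GPUs (whose driver installs the `nvidia-smi` utility), a `pip3` command name, and a `sort` that supports `-V`.

```shell
#!/usr/bin/env bash
# Sketch: report whether this VM meets the listed requirements.
# Helper names are illustrative; they are not part of the monitoring agent.

# Succeeds if version $2 is greater than or equal to version $1.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Prints OK or MISSING for a command that should be on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

# Assumption: NVIDIA GPUs, whose driver ships the nvidia-smi utility.
check_tool nvidia-smi
check_tool python3
# Assumption: pip is installed as pip3; some distributions name it pip.
check_tool pip3
check_tool virtualenv

if command -v python3 >/dev/null 2>&1; then
  py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if version_at_least 3.6 "$py_version"; then
    echo "OK: Python $py_version"
  else
    echo "TOO OLD: Python $py_version (need 3.6 or newer)"
  fi
fi
```

The script only reports; it does not install anything. A MISSING line for `nvidia-smi` usually means that the GPU driver is not installed or no GPU is attached.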

Download the agent

Download the monitoring script into the /opt/google directory. You have two main options:

  • Download using the git utility
  • Download as a package using wget

Using git

# We need to use sudo to be able to write to /opt
sudo mkdir -p /opt/google
cd /opt/google
sudo git clone https://github.com/GoogleCloudPlatform/compute-gpu-monitoring.git 

As ZIP package

# We need to use sudo to be able to write to /opt
sudo mkdir -p /opt/google
cd /opt/google
sudo wget https://github.com/GoogleCloudPlatform/compute-gpu-monitoring/archive/refs/heads/main.zip
sudo unzip main.zip
# GitHub archives of the main branch unpack into a directory with a -main suffix
sudo mv compute-gpu-monitoring-main compute-gpu-monitoring
sudo chmod -R 755 compute-gpu-monitoring
sudo rm main.zip
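
Whichever download option you use, the later steps expect specific files under /opt/google/compute-gpu-monitoring. The following sketch checks for the files that this page refers to. `check_layout` is a hypothetical helper, not part of the repository, and the file list is taken only from the paths used in the steps on this page.

```shell
#!/usr/bin/env bash
# Sketch: confirm a download contains the files that later steps reference.
# check_layout is an illustrative helper, not part of the monitoring agent.

check_layout() {
  local root="$1" missing=0 path
  for path in \
    linux/requirements.txt \
    linux/systemd/google_gpu_monitoring_agent_venv.service \
    linux/systemd/google_gpu_monitoring_agent.service
  do
    if [ ! -f "$root/$path" ]; then
      echo "missing: $root/$path"
      missing=1
    fi
  done
  return "$missing"
}

if check_layout /opt/google/compute-gpu-monitoring; then
  echo "Download looks complete."
else
  echo "Download is incomplete; repeat the download step."
fi
```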

Set up the virtual environment

To use the monitoring script, you need to install its required modules. We recommend that you create a virtual environment for these modules, separate from the default Python installation. To create this virtual environment, use either pipenv or virtualenv.

Using virtualenv

If you are using virtualenv and pip, create the virtual environment and install the required modules by running the following commands:

cd /opt/google/compute-gpu-monitoring/linux
sudo virtualenv -p python3 venv
sudo venv/bin/pip install -Ur requirements.txt

Using pipenv

If you are using pipenv, run the following commands:

# Pipenv creates a virtual environment for you and installs the necessary modules.
cd /opt/google/compute-gpu-monitoring/linux
sudo pipenv sync

Start the agent on system boot

On systems that use systemd to manage their services, use the following steps to add the GPU monitoring agent to the list of automatically started services.

Using virtualenv

The google_gpu_monitoring_agent_venv.service file contains a prepared systemd service definition for installations that use virtualenv.

sudo cp /opt/google/compute-gpu-monitoring/linux/systemd/google_gpu_monitoring_agent_venv.service /lib/systemd/system
sudo systemctl daemon-reload
sudo systemctl --no-reload --now enable /lib/systemd/system/google_gpu_monitoring_agent_venv.service

Using pipenv

The google_gpu_monitoring_agent.service file contains a prepared systemd service definition for installations that use pipenv.

sudo cp /opt/google/compute-gpu-monitoring/linux/systemd/google_gpu_monitoring_agent.service /lib/systemd/system
sudo systemctl daemon-reload
sudo systemctl --no-reload --now enable /lib/systemd/system/google_gpu_monitoring_agent.service
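
After enabling the service, you may want to confirm that systemd considers it enabled and running. The following is a minimal sketch: `agent_unit` and `report_agent` are hypothetical helpers, and the unit names are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: check that the agent service is enabled and running.
# agent_unit and report_agent are illustrative helpers.

# Maps the setup method to the unit file name used in the steps above.
agent_unit() {
  case "$1" in
    virtualenv) echo "google_gpu_monitoring_agent_venv.service" ;;
    pipenv)     echo "google_gpu_monitoring_agent.service" ;;
    *)          echo "unknown setup method: $1" >&2; return 1 ;;
  esac
}

# Prints the enabled and active states that systemd reports for the unit.
report_agent() {
  local unit
  unit="$(agent_unit "$1")" || return 1
  if ! command -v systemctl >/dev/null 2>&1; then
    echo "systemctl not found; this host does not appear to use systemd"
    return 0
  fi
  echo "enabled: $(systemctl is-enabled "$unit" 2>/dev/null || true)"
  echo "active:  $(systemctl is-active "$unit" 2>/dev/null || true)"
}

# Change the argument to pipenv if you used pipenv.
report_agent virtualenv
```

If the service is not active, `sudo journalctl -u` followed by the unit name shows the most recent log output from the agent.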

Review metrics in Cloud Monitoring

  1. In the Google Cloud Console, go to the Metrics Explorer page.

  2. In the Resource type drop-down, select VM instance.

  3. In the Metric drop-down, type custom/instance/gpu/utilization.

    Your GPU utilization chart should resemble the following:

    [Figure: Cloud Monitoring chart showing GPU utilization]
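
To cross-check the chart, you can read the utilization that the driver reports locally on the VM. The following is a sketch under two assumptions: the VM uses NVIDIA GPUs with the `nvidia-smi` utility, and the agent reports a comparable utilization figure. `format_gpu_line` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: print per-GPU utilization as reported locally by the driver.
# format_gpu_line is an illustrative helper.

# Turns one CSV line "index, utilization" into a readable message.
format_gpu_line() {
  local index util
  index="${1%%,*}"   # text before the first comma
  util="${1#*,}"     # text after the first comma
  util="${util# }"   # drop the single space that follows the comma, if any
  echo "GPU $index: ${util}% utilization"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits |
    while IFS= read -r line; do
      format_gpu_line "$line"
    done
else
  echo "nvidia-smi not found; cannot read local GPU utilization"
fi
```

The local reading is a single instantaneous sample, so expect it to track the chart only roughly.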

What's next?