To make better use of your resources, you can track the GPU usage rates of your instances. When you know the GPU usage rates, you can perform tasks such as setting up managed instance groups that autoscale resources based on demand.
To review GPU metrics using Cloud Monitoring, complete the following steps:
On each VM instance, set up the GPU metrics reporting script. This script performs the following tasks:
- Installs the GPU metrics reporting agent. This agent runs at intervals on the instance to collect GPU data and sends that data to Cloud Monitoring. A simplified sketch of this reporting loop follows this list.
- Creates a custom/gpu_utilization metrics field in Cloud Monitoring. This field stores GPU-specific data that you can analyze in Cloud Monitoring.
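For context, the following is a rough sketch of the kind of reporting loop such an agent runs. This is not the actual report_gpu_metrics.py script from the repository; it assumes that nvidia-smi, curl, and gcloud are available on the VM, that the VM's service account can write Monitoring data, and that the metric is written as custom.googleapis.com/gpu_utilization.

MD="http://metadata.google.internal/computeMetadata/v1"
PROJECT_ID=$(curl -s -H "Metadata-Flavor: Google" "${MD}/project/project-id")
INSTANCE_ID=$(curl -s -H "Metadata-Flavor: Google" "${MD}/instance/id")
ZONE=$(curl -s -H "Metadata-Flavor: Google" "${MD}/instance/zone" | awk -F/ '{print $NF}')

while true; do
  # Average utilization across all attached GPUs, as an integer percentage.
  UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits \
    | awk '{ sum += $1; n += 1 } END { printf "%d", sum / n }')
  NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

  # Write one data point to the custom GPU utilization metric.
  curl -s -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @- \
    "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" <<EOF
{
  "timeSeries": [{
    "metric": { "type": "custom.googleapis.com/gpu_utilization" },
    "resource": {
      "type": "gce_instance",
      "labels": {
        "project_id": "${PROJECT_ID}",
        "instance_id": "${INSTANCE_ID}",
        "zone": "${ZONE}"
      }
    },
    "points": [{
      "interval": { "endTime": "${NOW}" },
      "value": { "int64Value": "${UTIL}" }
    }]
  }]
}
EOF

  sleep 5  # Reporting interval; chosen arbitrarily for this sketch.
done

The agent that you install in the following section implements this idea as a Python script (report_gpu_metrics.py) that runs as a systemd service.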
Setting up the GPU metrics reporting script
On each of your VM instances, check that you meet the following requirements (the commands after this list show one way to verify them):
- Each VM instance must have GPUs attached.
- Each VM instance must have a GPU driver installed.
- Each VM instance must have the pip utility installed.
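To verify these requirements from a shell on the VM, the following commands are one way to check them; nvidia-smi is installed together with the GPU driver:

# Lists the attached GPUs; this fails if no GPU is attached or the driver is missing.
nvidia-smi --list-gpus
# Confirms that pip is available (use pip3 --version if you plan to use Python 3).
pip --version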
On each of your VM instances, install the GPU metrics agent. To install the metrics agent, complete the following steps:
Download the GPU metrics reporting scripts.
git clone https://github.com/GoogleCloudPlatform/tensorflow-inference-tensorrt5-t4-gpu.git
Switch to the metrics_reporting folder.
cd tensorflow-inference-tensorrt5-t4-gpu/metrics_reporting
Set up the installation environment for the metrics agent.
If you are using Python 2, run the following command:
pip install -r ./requirements.txt
If you are using Python 3, run the following command:
pip3 install -r ./requirements.txt
Copy the metrics reporting script to your root directory.
sudo cp report_gpu_metrics.py /root/
Temporarily allow access to the /lib/systemd/system/ directory:
sudo chmod 777 /lib/systemd/system/
Create the systemd service that runs the GPU metrics agent.
If you are using Python 2, run the following command:
cat <<-EOH > /lib/systemd/system/gpu_utilization_agent.service
[Unit]
Description=GPU Utilization Metric Agent
[Service]
PIDFile=/run/gpu_agent.pid
ExecStart=/bin/bash --login -c '/usr/bin/python /root/report_gpu_metrics.py'
User=root
Group=root
WorkingDirectory=/
Restart=always
[Install]
WantedBy=multi-user.target
EOH
If you are using Python 3, run the following command:
cat <<-EOH > /lib/systemd/system/gpu_utilization_agent.service
[Unit]
Description=GPU Utilization Metric Agent
[Service]
PIDFile=/run/gpu_agent.pid
ExecStart=/bin/bash --login -c '/opt/conda/bin/python /root/report_gpu_metrics.py'
User=root
Group=root
WorkingDirectory=/
Restart=always
[Install]
WantedBy=multi-user.target
EOH
Reset the permissions on the /lib/systemd/system directory. Run the following command:
sudo chmod 755 /lib/systemd/system/
Reload the system daemon.
sudo systemctl daemon-reload
Enable the GPU monitoring service.
sudo systemctl --no-reload --now enable /lib/systemd/system/gpu_utilization_agent.service
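To confirm that the service started and keeps running, you can check its status and recent logs:

sudo systemctl status gpu_utilization_agent.service
sudo journalctl -u gpu_utilization_agent.service --no-pager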
Reviewing metrics in Cloud Monitoring
In the Google Cloud Console, go to Monitoring.
The first time you access any Monitoring functionality for a Google Cloud project, the project is associated with a Workspace. If you've never used Monitoring, a Workspace is created automatically. Otherwise, a dialog is displayed and you are asked to choose between creating a new Workspace and adding your project to an existing Workspace.
In the Monitoring navigation pane, click Metrics Explorer.
Ensure that the Metric tab is selected.
Search for gpu_utilization. The GPU utilization rates reported by your instances appear in the chart.
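If you prefer to spot-check the metric from a shell instead of the console, you can query the Monitoring API directly. The following example assumes that gcloud is installed and authorized and that the agent writes the metric as custom.googleapis.com/gpu_utilization:

PROJECT_ID=$(gcloud config get-value project)
START=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="custom.googleapis.com/gpu_utilization"' \
  --data-urlencode "interval.startTime=${START}" \
  --data-urlencode "interval.endTime=${END}" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries"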
(Optional) Set up autoscaling using managed instance groups. To get started, you can review the Setting up a multiple-zone cluster section of the TensorFlow inference workload tutorial.
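As a preview of that setup, autoscaling a managed instance group on the custom GPU metric looks roughly like the following gcloud command. The group name, zone, and target values here are placeholders; use the values from your own deployment:

gcloud compute instance-groups managed set-autoscaling example-gpu-group \
    --zone us-central1-a \
    --custom-metric-utilization metric=custom.googleapis.com/gpu_utilization,utilization-target-type=GAUGE,utilization-target=85 \
    --min-num-replicas 1 \
    --max-num-replicas 4 \
    --cool-down-period 360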
What's next?
- Learn more about GPUs on Compute Engine.
- To handle GPU host maintenance, see Handling GPU host maintenance events.
- To optimize GPU performance, see Optimizing GPU performance.