To make better use of your resources, you can track the GPU usage rates of your instances. When you know the GPU usage rates, you can perform tasks such as setting up managed instance groups that autoscale resources based on need.
To review GPU metrics using Stackdriver Monitoring, complete the following steps:
On each VM instance, set up the GPU metrics reporting script. This script performs the following tasks:
- Installs the GPU metrics reporting agent. This agent runs at intervals on the instance to collect GPU data, and sends this data to Stackdriver Monitoring.
- Creates a custom/gpu_utilization metrics field in Stackdriver Monitoring. This field stores GPU-specific data that you can analyze in Stackdriver Monitoring.
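Conceptually, the agent samples GPU utilization with nvidia-smi and forwards each reading to Stackdriver Monitoring. The following is a minimal, hypothetical sketch of that sampling step, not the actual report_gpu_metrics.py; it assumes nvidia-smi is installed on the instance:

```python
import subprocess

def parse_utilization(csv_value):
    """Parse one utilization percentage from nvidia-smi CSV output."""
    return int(csv_value.strip())

def read_gpu_utilization():
    """Return the utilization percentage of each attached GPU."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return [parse_utilization(line)
            for line in output.splitlines() if line.strip()]

# The real agent runs a loop like this at a fixed interval, sending each
# sample to the custom/gpu_utilization metric instead of printing it.
```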
Setting up the GPU metrics reporting script
On each of your VM instances, check that you meet the following requirements:
- Each VM instance has GPUs attached.
- Each VM instance has a GPU driver installed.
On each of your VM instances, install the GPU metrics agent. To install the metrics agent, complete the following steps:
Download the GPU metrics reporting scripts.
git clone https://github.com/GoogleCloudPlatform/tensorflow-inference-tensorrt5-t4-gpu.git
Switch to the cloned repository directory.
cd tensorflow-inference-tensorrt5-t4-gpu
Set up the installation environment for the metrics agent.
pip install -r ./requirements.txt
Copy the metrics reporting script to your root directory.
sudo cp report_gpu_metrics.py /root/
Enable the GPU metrics agent.
cat <<-EOH > /lib/systemd/system/gpu_utilization_agent.service
[Unit]
Description=GPU Utilization Metric Agent
[Service]
PIDFile=/run/gpu_agent.pid
ExecStart=/bin/bash --login -c '/usr/bin/python /root/report_gpu_metrics.py'
User=root
Group=root
WorkingDirectory=/
Restart=always
[Install]
WantedBy=multi-user.target
EOH
Reload the systemd daemon.
systemctl daemon-reload
Enable the GPU monitoring service.
systemctl --no-reload --now enable /lib/systemd/system/gpu_utilization_agent.service
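Under the hood, the agent writes each sample as a point on the custom metric through the Monitoring API's projects.timeSeries.create method. As a hedged illustration (field names follow the public Monitoring v3 REST API; the project, instance, and zone values below are placeholders), the request body looks roughly like this:

```python
def build_time_series(project_id, instance_id, zone, utilization, end_time):
    """Build a timeSeries.create request body for the Monitoring v3 REST API.

    This is an illustrative sketch, not the agent's actual code.
    """
    return {
        "timeSeries": [{
            "metric": {"type": "custom.googleapis.com/gpu_utilization"},
            "resource": {
                "type": "gce_instance",
                "labels": {
                    "project_id": project_id,
                    "instance_id": instance_id,
                    "zone": zone,
                },
            },
            "points": [{
                "interval": {"endTime": end_time},
                # int64 values are serialized as strings in the REST API.
                "value": {"int64Value": str(utilization)},
            }],
        }]
    }
```

Reporting the value against the gce_instance monitored resource lets you break charts down per instance in Metrics Explorer.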
Reviewing metrics in Stackdriver Monitoring
In the Google Cloud Console, select Monitoring.
If Metrics Explorer is shown in the navigation pane, click Metrics Explorer. Otherwise, select Resources and then select Metrics Explorer.
Search for the gpu_utilization metric. The resulting chart displays your GPU utilization over time.
(Optional) Set up autoscaling using managed instance groups. To get started, you can review the Setting up a multiple-zone cluster section of the TensorFlow inference workload tutorial.