PyTorch XLA performance profiling
Overview
This guide walks you through how to use Cloud TPU performance tools and the metrics auto-analysis feature with PyTorch. These tools help you debug and optimize training workload performance.
For more information about Cloud TPU performance with PyTorch, see the following blog posts:
- Scaling deep learning PyTorch workloads
- PyTorch XLA performance debugging on TPU VMs - part 1
- PyTorch XLA performance debugging on TPU VMs - part 2
- PyTorch XLA performance debugging on TPU VMs - part 3
- Lazy tensor performance with PyTorch XLA
If you are new to PyTorch/XLA, see the PyTorch API guide and troubleshooting documents. For Cloud TPU, see the concepts document.
TPU VM + PyTorch/XLA profiling
Use this section to profile PyTorch/XLA using the TPU VM architecture.
Export environment variables
Create variables for your project ID and the zone to use for your TPU resources.
export PROJECT_ID=PROJECT_ID
export ZONE=ZONE
Create a Cloud TPU
Refer to Manage TPUs and, after setup, create a v3-8 TPU VM, which comes with torch, torch_xla, torchvision, and tensorboard preinstalled.
Create a TPU resource.
gcloud compute tpus tpu-vm create profiler-tutorial-tpu-vm \
  --project=${PROJECT_ID} \
  --zone=${ZONE} \
  --version=v2-alpha \
  --accelerator-type=v3-8
TensorBoard server startup
SSH into the VM, install the tensorboard-plugin-profile package, and start a TensorBoard server.

gcloud compute tpus tpu-vm ssh profiler-tutorial-tpu-vm \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --ssh-flag="-4 -L 9001:localhost:9001"
pip3 install tf-nightly==2.6.0.dev20210511 tb-nightly==2.6.0a20210512 tbp-nightly==2.5.0a20210511
tensorboard --logdir ./tensorboard --port 9001
When you view the TensorBoard output at http://localhost:9001 on your local machine, the TensorBoard dashboard is displayed. You can access the profile page by selecting the PROFILE option from the dropdown in the top-right corner, next to the UPLOAD button.
Profile the model
In a new terminal window on your development environment, export the same environment variables as above, since this is a new shell:

export PROJECT_ID=PROJECT_ID
export ZONE=ZONE

SSH into your TPU VM:

gcloud compute tpus tpu-vm ssh profiler-tutorial-tpu-vm \
  --project ${PROJECT_ID} \
  --zone ${ZONE}
Clone the PyTorch/XLA repository and run the e2e test:

git clone -b r1.8 https://github.com/pytorch/xla
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
python3 xla/test/test_profiler.py  # takes <1 min
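test_profiler.py exercises the profiler end to end. To capture traces from your own training script instead, the script must start a profiler server that TensorBoard can connect to. A minimal sketch, assuming the torch_xla.debug.profiler API from the r1.8 release (the port 9012 is an illustrative choice, not required by the API):

import torch_xla.debug.profiler as xp

# Start the profiler server and keep a reference to the returned object;
# the server shuts down when it is garbage collected.
server = xp.start_server(9012)

# ... run the training loop; while it runs, TensorBoard can capture a
# profile by connecting to localhost:9012.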
Before starting the training, edit the following line in xla/test/test_profile_mp_mnist.py:

Change:

accuracy = train_mnist(flags, dynamic_graph=True, fetch_often=True)

To:

accuracy = train_mnist(flags, dynamic_graph=False, fetch_often=False)
When set to True, these two arguments to train_mnist artificially cause dynamic graphs and frequent tensor fetches, which are explored later in the Auto-metrics Analysis section. For now, since you are just profiling the TPU, the following example runs with nominal performance (the sketch after the training command below illustrates both anti-patterns).

Start a training run:
XLA_HLO_DEBUG=1 python3 xla/test/test_profile_mp_mnist.py --num_epochs 1000 --fake_data
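For reference, the following is an illustrative sketch (not taken from the tutorial script) of the two anti-patterns that dynamic_graph=True and fetch_often=True simulate; the loop and tensor shapes are hypothetical:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

for step in range(10):
    # Dynamic graph: the input shape changes every step, so XLA must
    # recompile the computation instead of reusing a cached graph.
    x = torch.randn(8 + step, 128, device=device)
    loss = (x * x).sum()

    # Frequent fetch: .item() forces a device-to-host transfer and blocks
    # until the lazily recorded graph is compiled and executed.
    print(loss.item())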
TPU + Client Profiling
Once the training is running, view the TensorBoard output at http://localhost:9001 and capture a profile from the PROFILE page. Once the capture completes, the page reloads with the profiling tools.
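If you prefer to trigger the capture programmatically rather than through the UI, here is a minimal sketch, assuming the xp.trace helper from torch_xla.debug.profiler; the service address and log directory are illustrative, and the address must match the port your training script's profiler server listens on:

import torch_xla.debug.profiler as xp

# Capture about 10 seconds of trace from the profiler server at
# localhost:9012 and write it into TensorBoard's logdir.
xp.trace('localhost:9012', './tensorboard', duration_ms=10000)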
Currently in the TPU VM setup, only the trace viewer tool is available, so under the Tools dropdown select trace_viewer and inspect the traces. Note that in the TPU VM setup you see both the "client" side and the TPU device side traces in one full view.
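The client-side rows in trace_viewer are easier to read when the training step is annotated with named scopes. A minimal sketch, assuming the xp.Trace context manager from torch_xla.debug.profiler (the scope names and the train_step function are hypothetical):

import torch_xla.debug.profiler as xp

def train_step(model, data, target, loss_fn):
    # Each xp.Trace scope shows up as a named span on the client-side
    # rows in trace_viewer.
    with xp.Trace('forward'):
        output = model(data)
        loss = loss_fn(output, target)
    with xp.Trace('backward'):
        loss.backward()
    return loss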
Cleanup
Exit from your VM, then delete the TPU VM by running the following commands:
(vm)$ exit
Delete the TPU VM you created:
$ gcloud compute tpus tpu-vm delete profiler-tutorial-tpu-vm \
  --project ${PROJECT_ID} \
  --zone=${ZONE}
Verify the resources have been deleted by running the following command. The deletion might take several minutes. A response like the one below indicates your instances have been successfully deleted.
$ gcloud compute tpus tpu-vm list --project ${PROJECT_ID} --zone=${ZONE}
Listed 0 items.