Troubleshooting PyTorch - TPU
This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.
Troubleshooting slow training performance
If your model trains slowly, generate and review a metrics report.
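As a minimal sketch of generating the report from inside a training script, the following assumes the `torch_xla.debug.metrics` module is available. The helper names `get_metrics_report` and `find_aten_counters` are hypothetical and used only for illustration. The `aten::` filter rests on the assumption that counters with that prefix usually mark operations that ran outside XLA, which is one common cause of slow training.

```python
def get_metrics_report():
    """Return the PyTorch/XLA metrics report as text, or None without torch_xla."""
    try:
        import torch_xla.debug.metrics as met
    except ImportError:
        # torch_xla is not installed in this environment.
        return None
    return met.metrics_report()


def find_aten_counters(counter_names):
    """Return the counter names that carry the 'aten::' prefix, sorted.

    Assumption: such counters usually point at operations that were not
    executed through XLA and are worth a closer look in the report.
    """
    return sorted(name for name in counter_names if name.startswith("aten::"))


# Typical use at the end of a training run (requires torch_xla):
#   import torch_xla.debug.metrics as met
#   print(get_metrics_report())
#   print(find_aten_counters(met.counter_names()))
```

Printing the report after a few training steps, rather than after the first one, gives the counters time to reflect steady-state behavior.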
To automatically analyze the metrics report and print a summary, run your workload with the PT_XLA_DEBUG=1 environment variable set.
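One way to set the variable is to launch the workload from a small Python wrapper, which has the same effect as prefixing the command with `PT_XLA_DEBUG=1` in a shell. The sketch below uses only the standard library. The function name `run_with_xla_debug` and the script name `train.py` are hypothetical placeholders.

```python
import os
import subprocess
import sys


def run_with_xla_debug(command):
    """Run `command` in a child process with PT_XLA_DEBUG=1 in its environment."""
    env = dict(os.environ)       # copy, so the parent environment is untouched
    env["PT_XLA_DEBUG"] = "1"
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


# Example (train.py is a placeholder for your own training script):
#   result = run_with_xla_debug([sys.executable, "train.py"])
#   print(result.stdout)
#   print(result.stderr)
```

Both output streams are captured because the analysis summary might not be written to standard output.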
For more information about issues that might cause your model to train slowly, see Known performance caveats.
Performance profiling
To profile your workload in depth and find bottlenecks, use the following resources:
- PyTorch/XLA performance profiling
- PyTorch/XLA profiling Colab
- Sample MNIST training script with profiling
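The resources above cover profiling in detail. As a rough sketch of what instrumenting a training loop can look like, the following assumes the `torch_xla.debug.profiler` module with its `start_server` function and `Trace` context manager. The `train` and `trace_scope` helpers and the port number in the usage comment are illustrative assumptions. When `torch_xla` is not installed, the annotations do nothing.

```python
import contextlib

try:
    import torch_xla.debug.profiler as xp
except ImportError:
    xp = None  # torch_xla is not installed; tracing is skipped


def trace_scope(name):
    """Return a named profiler scope, or a no-op context without torch_xla."""
    if xp is None:
        return contextlib.nullcontext()
    return xp.Trace(name)


def train(num_steps, step_fn, profiler_port=None):
    """Run `step_fn` for `num_steps` steps, wrapping each step in a trace scope."""
    # Keep a reference to the server so it stays alive during training.
    server = None
    if xp is not None and profiler_port is not None:
        server = xp.start_server(profiler_port)
    results = []
    for step in range(num_steps):
        with trace_scope("train_step"):
            results.append(step_fn(step))
    return results


# Example: train(100, my_step_fn, profiler_port=9012)
# my_step_fn is a placeholder for a function that runs one training step.
```

Named scopes make it easier to match regions in the captured trace to parts of your training loop.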
More debugging tools
You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
If the PyTorch process stops responding, file a GitHub issue and include stack traces.
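One possible way to capture Python stack traces from an unresponsive process is the standard library `faulthandler` module. This is a general Python technique, not something specific to PyTorch/XLA, and it does not cover native stack frames. The helper name `enable_hang_diagnostics` is hypothetical.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(log_file=None, timeout_s=None):
    """Arrange for Python stack traces of all threads to be written to `log_file`.

    `log_file` must be backed by a real file descriptor; it defaults to stderr.
    """
    if log_file is None:
        log_file = sys.stderr
    # Dump tracebacks on fatal errors such as segmentation faults.
    faulthandler.enable(file=log_file, all_threads=True)
    # On platforms with SIGUSR1, `kill -USR1 <pid>` dumps the stacks on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Optionally dump the stacks every `timeout_s` seconds while the process runs.
    if timeout_s is not None:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


# Call once near the start of the training script:
#   enable_hang_diagnostics(timeout_s=600)
```

The resulting tracebacks can be copied into the GitHub issue along with the details of how you started the workload.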
The debug_run.py utility, provided in scripts/debug_run.py, creates a tar.gz archive containing the information required to debug PyTorch/XLA executions.
Managing XLA tensors
XLA Tensor Quirks describes what you should and should not do when working with XLA tensors and shared weights.