Troubleshooting PyTorch - TPU
This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.
Troubleshooting slow training performance
If your model trains slowly, generate and review a metrics report.
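As a minimal sketch of generating the report, PyTorch/XLA exposes it through the torch_xla.debug.metrics module. The helper name get_metrics_report is hypothetical, and the fallback to None when torch_xla is not installed is an assumption added so the snippet can run anywhere:

```python
def get_metrics_report():
    """Return the PyTorch/XLA metrics report as a string, or None.

    None is returned when torch_xla cannot be imported (an assumption
    made for this sketch so it also runs on machines without PyTorch/XLA).
    """
    try:
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    # metrics_report() returns a printable text summary of counters and metrics.
    return met.metrics_report()


report = get_metrics_report()
if report is None:
    print("torch_xla is not installed; no metrics report available.")
else:
    print(report)
```

Call the helper after a few training steps so that the counters and metrics have had a chance to accumulate.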
To have the metrics report analyzed automatically and get a summary, run your workload with PT_XLA_DEBUG=1.
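One way to set the variable for a single run is to launch the workload as a child process with a modified environment. The sketch below assumes that approach; run_with_xla_debug is a hypothetical helper name, and the command you pass in (for example a training script) is your own:

```python
import os
import subprocess
import sys


def run_with_xla_debug(argv):
    """Run a command with PT_XLA_DEBUG=1 added to a copy of the environment."""
    env = dict(os.environ)       # copy, so the parent process stays unchanged
    env["PT_XLA_DEBUG"] = "1"
    return subprocess.run(argv, env=env, check=False)


# Demonstration: the child process only echoes the variable it received.
# In practice, replace the command with your own training entry point.
demo = run_with_xla_debug(
    [sys.executable, "-c", "import os; print(os.environ['PT_XLA_DEBUG'])"]
)
print("exit code:", demo.returncode)
```

Copying the environment keeps the setting scoped to that one run, so other processes started from the same shell are not affected.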
For more information about issues that might cause your model to train slowly, see Known performance caveats.
Performance profiling
To profile your workload in depth and discover bottlenecks, review the PyTorch/XLA profiling resources.
More debugging tools
You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
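As a sketch of setting such variables from Python rather than from the shell, the snippet below applies a small set of debug settings before any PyTorch/XLA import. The helper apply_debug_env is hypothetical, and the choice of XLA_IR_DEBUG and XLA_HLO_DEBUG is only an illustration; check the PyTorch/XLA documentation for the variables and values that fit your case:

```python
import os

# Illustrative selection only; these names are assumed from the
# PyTorch/XLA debugging documentation and should be verified there.
DEBUG_ENV = {
    "XLA_IR_DEBUG": "1",
    "XLA_HLO_DEBUG": "1",
}


def apply_debug_env(settings, environ=os.environ):
    """Set each variable unless it is already defined; return the effective values."""
    for name, value in settings.items():
        # setdefault keeps any value the user already exported in the shell.
        environ.setdefault(name, value)
    return {name: environ[name] for name in settings}


# Apply the settings before importing torch_xla in the same process,
# on the assumption that they are read during initialization.
effective = apply_debug_env(DEBUG_ENV)
print(effective)
```

Using setdefault means a value exported in the shell takes precedence over the defaults in the script.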
If you encounter an unexpected bug and need help, file a GitHub issue.
Managing XLA tensors
The XLA Tensor Quirks section describes what you should and shouldn't do when working with XLA tensors and shared weights.