This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.
Troubleshooting slow training performance
If your model trains slowly, generate and review a metrics report.
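As a minimal sketch (assuming torch_xla is installed, as it is on a TPU VM), the report can be printed after a few training steps:

```python
def print_metrics_report():
    # Import inside the function so the sketch can be defined anywhere;
    # torch_xla is assumed to be installed (e.g. on a TPU VM).
    import torch_xla.debug.metrics as met
    # The report includes counters such as CompileTime (recompilations)
    # and aten::* entries (ops that fell back to the CPU).
    print(met.metrics_report())
```

Frequent recompilations and CPU-fallback ops are the most common causes of slow XLA training, so check those counters first.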
To automatically analyze the metrics report and get a summary of likely problems, run your workload with the environment variable PT_XLA_DEBUG=1 set.
For more information about issues that might cause your model to train slowly, see Known performance caveats.
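The debug analysis can be enabled from the shell (`PT_XLA_DEBUG=1 python train.py`) or from the launcher script itself; a minimal sketch of the latter, where the training code is a placeholder:

```python
import os

# Enable PyTorch/XLA's built-in debug analysis. The variable must be set
# before torch_xla is imported, so do this at the very top of the launcher.
os.environ["PT_XLA_DEBUG"] = "1"

# import torch_xla  # the rest of the training script follows as usual
```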
To profile your workload in depth to discover bottlenecks, you can use the following resources:
- PyTorch/XLA performance profiling
- PyTorch/XLA profiling Colab
- Sample MNIST training script with profiling
More debugging tools
You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
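For example (variable names are from the PyTorch/XLA documentation; the file paths are placeholders), graph dumps and metrics can be redirected to files by setting the variables before torch_xla is imported:

```python
import os

# These must be set before torch_xla is imported to take effect.
os.environ["XLA_IR_DEBUG"] = "1"    # record Python frame info in the IR
os.environ["XLA_HLO_DEBUG"] = "1"   # propagate frame info into HLO metadata
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"  # placeholder path
os.environ["XLA_METRICS_FILE"] = "/tmp/xla_metrics.txt"      # placeholder path
```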
If the PyTorch process stops responding, file a GitHub issue and include stack traces.
A debug_run.py utility is provided in scripts/debug_run.py; it creates an archive containing the information required to debug PyTorch/XLA executions.
Managing XLA tensors
XLA Tensor Quirks describes what you should and should not do when working with XLA tensors and shared weights.