Troubleshooting PyTorch - TPU

This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.

Troubleshooting slow training performance

If your model trains slowly, generate and review a metrics report.
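The following is a minimal sketch of generating the report, assuming the `torch_xla.debug.metrics` module from PyTorch/XLA is installed. The helper name `get_metrics_report` is hypothetical, and the import is guarded so that the snippet also runs on a machine without `torch_xla`.

```python
def get_metrics_report():
    """Return the PyTorch/XLA metrics report as a string.

    Returns None when torch_xla cannot be imported (for example, when the
    snippet is run off a TPU VM).
    """
    try:
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    return met.metrics_report()


# Call this after at least a few training steps so the counters are populated.
report = get_metrics_report()
if report is None:
    print("torch_xla is not installed; run this on a Cloud TPU VM.")
else:
    print(report)
```

The report is most useful when printed after several training steps, because the counters and timers accumulate as the workload runs.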

To get an automatic analysis and summary of the metrics report, run your workload with the environment variable PT_XLA_DEBUG=1 set.
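One way to set the variable for a single run is to launch the workload as a child process with a modified environment. The sketch below assumes nothing about your training script: the helper name `run_with_xla_debug` is hypothetical, and a one-line stand-in command is used in place of a real workload so that the snippet runs anywhere.

```python
import os
import subprocess
import sys


def run_with_xla_debug(command):
    """Run `command` with PT_XLA_DEBUG=1 added to a copy of the environment."""
    env = dict(os.environ, PT_XLA_DEBUG="1")
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=False
    )


# Stand-in for a real workload such as [sys.executable, "train.py"]
# ("train.py" is a hypothetical script name). The stand-in only echoes the
# variable to show that the child process sees it.
stand_in = [sys.executable, "-c", "import os; print(os.environ['PT_XLA_DEBUG'])"]
result = run_with_xla_debug(stand_in)
print(result.stdout.strip())
```

Because the helper captures the child's output, print `result.stdout` and `result.stderr` to see the debug summary. From a shell, prefixing the command has the same effect, for example `PT_XLA_DEBUG=1 python train.py`.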

For more information about issues that might cause your model to train slowly, see Known performance caveats.

Performance profiling

To profile your workload in depth and discover bottlenecks, review the PyTorch/XLA performance profiling resources.

More debugging tools

You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
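As a hedged illustration, the sketch below sets a few debugging variables from Python before `torch_xla` is imported. The variable names `XLA_IR_DEBUG`, `XLA_HLO_DEBUG`, and `XLA_SAVE_TENSORS_FILE` are taken from the PyTorch/XLA troubleshooting documentation and should be verified against the version you run. The helper `apply_debug_env` and the output path are hypothetical.

```python
import os

# Debugging variables to apply. Verify the names against the PyTorch/XLA
# release you use; the file path is a hypothetical example.
DEBUG_ENV = {
    "XLA_IR_DEBUG": "1",
    "XLA_HLO_DEBUG": "1",
    "XLA_SAVE_TENSORS_FILE": "/tmp/xla_graphs.txt",
}


def apply_debug_env(settings, environ=os.environ):
    """Set each variable unless it is already set; return the ones applied."""
    applied = {}
    for name, value in settings.items():
        if name not in environ:
            environ[name] = value
            applied[name] = value
    return applied


applied = apply_debug_env(DEBUG_ENV)
for name, value in sorted(applied.items()):
    print(f"{name}={value}")

# Import torch_xla only after this point so that it sees the settings.
```

Values that are already present in the environment are left untouched, so a setting exported in the shell takes precedence over the defaults in the script.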

If you encounter an unexpected bug and need help, file a GitHub issue.

Managing XLA tensors

XLA Tensor Quirks describes what to do and what to avoid when working with XLA tensors and shared weights.