Troubleshooting Cloud TPU errors and performance issues

These troubleshooting documents describe error conditions and performance issues you might see while training with Cloud TPUs using TensorFlow, JAX, and PyTorch.

Monitoring with Stacktrace describes how to create log-based metrics, which you can use to build alerts and dashboards that help you debug errors and performance issues.

If you cannot tell whether the problem you are seeing is specific to a particular framework, start with Troubleshooting TensorFlow - TPU.