Troubleshooting Cloud TPU errors and performance issues
These troubleshooting documents describe error conditions and
performance issues you might see while training with Cloud
TPUs using TensorFlow, JAX, or PyTorch.
Monitoring with Stacktrace
describes how to create log-based metrics that you can use to build
alerts and dashboards for debugging errors and performance
issues.
If you cannot tell whether the
problem you are seeing is specific to a particular framework, start with
Troubleshooting TensorFlow - TPU.