Troubleshoot your Cloud TPU workflow
Once your training or inference workload is running on TPUs, the next step is to confirm that it is working as expected. Cloud TPU generates metrics and logs that help you find and debug any TPU VMs that are not behaving as expected. We refer to such VMs as outliers throughout this documentation.
The general troubleshooting workflow is:
- View Cloud TPU metrics to check for outlier TPU VMs
- View Cloud TPU logs for the outlier TPU VMs
- Profile your workload
You can view metrics in the Metrics Explorer and logs in the Logs Explorer in the Google Cloud console. You can also deploy monitoring and logging dashboards that bring all Cloud TPU-related metrics and logs together in one place.
Cloud TPU VM metrics
Cloud Monitoring automatically collects metrics from your TPUs and their host Compute Engine VMs. Metrics track numerical quantities over time, for example, CPU utilization, network usage, or TensorCore idle duration. For more information on Cloud TPU metrics, see Monitoring TPU VMs.
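For example, the following sketch uses the Cloud Monitoring Python client (google-cloud-monitoring) to pull the last hour of a TPU metric so you can compare workers and spot outliers. The metric type and the node_id resource label are assumptions based on the Monitoring TPU VMs reference, and my-project is a hypothetical project ID:

```python
# Minimal sketch: list recent TensorCore idle-duration samples per TPU node
# so outliers stand out. Assumes google-cloud-monitoring is installed and
# that the metric type matches the one listed in "Monitoring TPU VMs".
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical; replace with your project ID

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "tpu.googleapis.com/tpu/tensorcore/idle_duration"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# A node whose idle duration is much higher than its peers is a candidate outlier.
for series in results:
    node = series.resource.labels.get("node_id", "unknown")  # label name assumed
    if series.points:
        print(node, series.points[0].value)
```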
Cloud TPU logs
Cloud Logging automatically collects logs from your TPUs and their host Compute Engine VMs, tracking events generated by Cloud TPU. You can also instrument your code to generate logs. Cloud TPU generates two types of logs:
- TPU worker logs
- Audited resource logs
TPU worker logs contain information about a specific TPU worker in a specific zone, for example, the amount of memory available on the TPU worker (system_available_memory_GiB).
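As a minimal sketch, you can also read TPU worker logs programmatically with the Cloud Logging Python client (google-cloud-logging). The tpu_worker resource type in the filter is an assumption; confirm the exact value in the Logs Explorer for your project:

```python
# Minimal sketch: read the last hour of TPU worker log entries.
from datetime import datetime, timedelta, timezone

from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

cutoff = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
log_filter = f'resource.type="tpu_worker" AND timestamp>="{cutoff}"'  # resource type assumed

for entry in client.list_entries(filter_=log_filter, max_results=20):
    # Each entry carries the worker's zone and node ID in its resource labels.
    print(entry.timestamp, entry.resource.labels, entry.payload)
```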
Audited resource logs contain information about when a specific Cloud TPU API was called and who made the call, for example, CreateNode, UpdateNode, and DeleteNode.
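For example, a sketch of querying the activity audit log for recent DeleteNode calls looks like this. The log name and the v2 method name are assumptions; verify them against the audit entries in your project:

```python
# Minimal sketch: find recent DeleteNode calls in the Cloud Audit Logs.
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

audit_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity" '
    'AND protoPayload.methodName="google.cloud.tpu.v2.Tpu.DeleteNode"'  # method name assumed
)

for entry in client.list_entries(filter_=audit_filter, max_results=10):
    # The caller's identity is recorded under authenticationInfo.principalEmail
    # in the audit payload.
    print(entry.timestamp, entry.payload)
```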
You can also use the cloud-tpu-diagnostics PyPI package to write stack traces to logs. For more information, see Debugging TPU VMs.
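A minimal sketch of wiring the package into a training script follows; the module and class names are taken from the cloud-tpu-diagnostics documentation, so verify them against the version you install:

```python
# Minimal sketch: write stack traces from faults (or SIGTERM) to Cloud Logging
# while the training entry point runs inside the diagnostic context.
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,   # collect traces when the workload faults
    stack_trace_to_cloud=True,  # upload traces to Cloud Logging
)
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config,
)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config,
)

with diagnostic.diagnose(diagnostic_config):
    train()  # hypothetical entry point; replace with your training function
```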
For more information about logs, see Logging.
Monitoring and logging dashboards
Viewing and interpreting Cloud TPU-related metrics and logs is easier when they are collected on a single page in the Google Cloud console. The monitoring-debugging GitHub repository contains a set of scripts and configuration files that use Terraform to automatically deploy dashboards containing all Cloud TPU-related metrics and logs. To set up these dashboards in your Google Cloud project, see Monitoring and Logging Dashboards.
Profiling your workloads on TPU VMs
Profiling lets you optimize your model's training performance on TPU VMs. You use TensorBoard and the TPU TensorBoard plug-in to profile your model. For more information about how to profile your workload, see Profile your model on TPU VMs.
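For example, if your workload uses JAX, a minimal sketch for capturing a trace that the TensorBoard profile plug-in can display looks like this (the log directory is arbitrary, and viewing the result requires the tensorboard-plugin-profile package):

```python
# Minimal sketch: capture a JAX trace for TensorBoard's profile plug-in.
import jax
import jax.numpy as jnp

jax.profiler.start_trace("/tmp/tpu-profile")  # arbitrary log directory

# Run the operations you want to profile; block so the work lands in the trace.
x = jnp.ones((4096, 4096))
y = jnp.dot(x, x).block_until_ready()

jax.profiler.stop_trace()

# View the trace with: tensorboard --logdir=/tmp/tpu-profile
```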
For more information about using TensorBoard with one of the supported frameworks, see the following documents: