Troubleshoot your Cloud TPU workflow

Once your training or inference workload is running on TPUs, the next step is to verify that it behaves as expected. Cloud TPU generates metrics and logs that you can use to find and debug any TPU VMs that are misbehaving. We refer to such VMs as outliers throughout this documentation.

The general troubleshooting workflow is:

  1. View Cloud TPU metrics to check for outlier TPU VMs
  2. View Cloud TPU logs for the outlier TPU VMs
  3. Profile your workload

You can view metrics and logs in the Metrics Explorer and the Logs Explorer in the Google Cloud console. You can also deploy monitoring and logging dashboards that collect all Cloud TPU-related metrics and logs in one place.

Cloud TPU VM metrics

Cloud Monitoring automatically collects metrics from your TPUs and their host Compute Engine VMs. Metrics track numerical quantities over time, for example, CPU utilization, network usage, or TensorCore idle duration. For more information on Cloud TPU metrics, see Monitoring TPU VMs.
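
For example, you can query these metrics programmatically to compare workers and spot outliers. The following is a minimal sketch using the google-cloud-monitoring Python client; the metric type shown, tpu.googleapis.com/tpu/tensorcore/idle_duration, is one of the metrics described in Monitoring TPU VMs, while the project ID and the exact resource labels are placeholders you should adapt:

    # pip install google-cloud-monitoring
    import time
    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"  # placeholder: your Google Cloud project ID

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        start_time={"seconds": now - 3600},  # the last hour
        end_time={"seconds": now},
    )

    # List TensorCore idle-duration samples for every TPU worker in the
    # project. A worker that is idle far longer than its peers is a
    # likely outlier.
    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": 'metric.type = "tpu.googleapis.com/tpu/tensorcore/idle_duration"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        # The resource labels identify the worker; exact label names
        # depend on the monitored-resource type.
        labels = dict(series.resource.labels)
        latest = series.points[0].value if series.points else None
        print(labels, latest)

A worker whose idle duration is consistently much higher than its peers is a natural first candidate for the log inspection described in the next section.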

Cloud TPU logs

Cloud Logging automatically collects logs from your TPUs and their host Compute Engine VMs. Cloud Logging tracks events generated by Cloud TPU, and you can also instrument your code to generate logs. Cloud TPU generates two types of logs:

  • TPU worker logs
  • Audited resource logs

TPU worker logs contain information about a specific TPU worker in a specific zone, for example, the amount of memory available on the TPU worker (system_available_memory_GiB).

Audited resource logs record when a specific Cloud TPU API method was called and who made the call, for example, CreateNode, UpdateNode, and DeleteNode.
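
If you want to query these logs programmatically rather than in the Logs Explorer, a sketch along the following lines works with the google-cloud-logging Python client. The filter strings are assumptions about how TPU worker and audit entries are labeled, so confirm the exact resource types in the Logs Explorer first:

    # pip install google-cloud-logging
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project="my-project")  # placeholder project ID

    # Assumed filter for TPU worker logs; check the exact resource type
    # for your entries in the Logs Explorer.
    worker_filter = 'resource.type="tpu_worker" AND severity>=WARNING'

    for entry in client.list_entries(filter_=worker_filter, max_results=20):
        print(entry.timestamp, entry.severity, entry.payload)

    # Audit entries for control-plane calls (CreateNode, DeleteNode, ...)
    # can be found with a filter along these lines (also an assumption):
    # 'resource.type="audited_resource" AND protoPayload.methodName:"Node"'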

You can also use the cloud-tpu-diagnostics PyPI package to write stack traces to logs. For more information, see Debugging TPU VMs.
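
As a rough sketch, wiring the package into a training script looks something like the following. The module and class names reflect the package's documented configuration interface and may change between versions, and train() stands in for your own entry point:

    # pip install cloud-tpu-diagnostics
    from cloud_tpu_diagnostics import diagnostic
    from cloud_tpu_diagnostics.configuration import debug_configuration
    from cloud_tpu_diagnostics.configuration import diagnostic_configuration
    from cloud_tpu_diagnostics.configuration import stack_trace_configuration

    # Collect stack traces on faults and upload them to Cloud Logging.
    debug_config = debug_configuration.DebugConfig(
        stack_trace_config=stack_trace_configuration.StackTraceConfig(
            collect_stack_trace=True,
            stack_trace_to_cloud=True,
        )
    )
    diagnostic_config = diagnostic_configuration.DiagnosticConfig(
        debug_config=debug_config,
    )

    def train():
        ...  # hypothetical training entry point

    # Stack traces raised inside this context are written to logs.
    with diagnostic.diagnose(diagnostic_config):
        train()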

For more information about logs, see Logging.

Monitoring and logging dashboards

Viewing Cloud TPU-related metrics and logs on a single page in the Google Cloud console makes them easier to interpret. The monitoring-debugging GitHub repository contains a set of scripts and configuration files that use Terraform to automatically deploy a monitoring dashboard and a logging dashboard containing all Cloud TPU-related metrics and logs. To set up these dashboards in your Google Cloud project, see Monitoring and Logging Dashboards.

Profiling your workloads on TPU VMs

Profiling lets you optimize your model's training performance on TPU VMs. To profile your model, you use TensorBoard and the TPU TensorBoard plugin. For more information about how to profile your workload, see Profile your model on TPU VMs.
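
As one concrete illustration, the following sketch captures a profile with TensorFlow's profiler API and writes it where TensorBoard and the profile plugin (distributed on PyPI as tensorboard-plugin-profile) can read it. The log directory and train_step() are placeholders, and JAX and PyTorch/XLA offer their own capture mechanisms:

    # pip install tensorboard-plugin-profile
    import tensorflow as tf

    LOG_DIR = "gs://my-bucket/profiles"  # placeholder Cloud Storage path

    def train_step():
        ...  # hypothetical training step

    # Capture a trace for a bounded window of steps, then stop.
    tf.profiler.experimental.start(LOG_DIR)
    for _ in range(100):
        train_step()
    tf.profiler.experimental.stop()

    # View the captured profile:
    #   tensorboard --logdir=gs://my-bucket/profiles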

For more information about using TensorBoard with one of the supported frameworks, see the following documents: