Last updated (UTC): 2025-09-04.

# Troubleshoot your Cloud TPU workflow

Once you have your training or inference workload running on TPUs, the
next step is to ensure that it is working as expected. Cloud TPU
generates metrics and logs that help you find and debug any TPU VMs
that are not behaving as expected. We refer to such VMs as *outliers* throughout
this documentation.

The general troubleshooting workflow is:

1. View Cloud TPU metrics to check for outlier TPU VMs.
2. View Cloud TPU logs for the outlier TPU VMs.
3. Profile your workload.

You can view metrics and logs in the [Metrics Explorer](/monitoring/charts/metrics-explorer)
and the [Logs Explorer](/logging/docs/view/logs-explorer-interface) in the Google Cloud
console. You can also deploy monitoring and logging dashboards that collect all
Cloud TPU-related metrics and logs in one place.

Cloud TPU VM metrics
--------------------

Cloud Monitoring automatically collects metrics from your TPUs and their host
Compute Engine VMs. Metrics track numerical quantities over time, for example,
CPU utilization, network usage, or TensorCore idle duration. For more information
about Cloud TPU metrics, see [Monitoring TPU VMs](/tpu/docs/troubleshooting/tpu-vm-monitoring).

Cloud TPU logs
--------------

Cloud Logging automatically collects logs from your TPUs and their host
Compute Engine VMs. Cloud Logging tracks events generated by Cloud TPU.
You can also instrument your code to generate logs.
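For example, one common way to instrument your code is to write log lines to standard output as single-line JSON, which the Cloud Logging agent can ingest as structured entries with a severity. A minimal stdlib-only sketch (the `system_available_memory_GiB` field mirrors the TPU worker log field described below; the exact agent configuration on your VMs is an assumption to verify):

```python
import json
import sys

def log_structured(message: str, severity: str = "INFO", **fields) -> str:
    """Emit a one-line JSON log entry. When collected by the Cloud Logging
    agent, the "severity" and "message" keys map to the corresponding
    structured-log fields; extra keys become jsonPayload fields."""
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout)
    return line

# Example: record available TPU worker memory alongside the message.
log_structured("memory check", severity="WARNING",
               system_available_memory_GiB=4.2)
```

Because each entry is a single JSON line, you can filter on the extra fields in the Logs Explorer (for example, on `jsonPayload.system_available_memory_GiB`).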
Two types of logs are generated by Cloud TPU:

- TPU worker logs
- Audited resource logs

TPU worker logs contain information about a specific TPU worker in a specific
zone, for example, the amount of memory available on the TPU worker
(`system_available_memory_GiB`).

Audited resource logs contain information about when a specific Cloud TPU API
method was called and who made the call, for example, `CreateNode`, `UpdateNode`,
and `DeleteNode`.

You can also use the `cloud-tpu-diagnostics` PyPI package to write stack traces
to logs. For more information, see [Debugging TPU VMs](/tpu/docs/troubleshooting/debugging).

For more information about logs, see [Logging](/tpu/docs/troubleshooting/tpu-vm-monitoring#locate_logs).

Monitoring and logging dashboards
---------------------------------

Having a single page in the Google Cloud console can make viewing and interpreting
Cloud TPU-related metrics and logs easier. The [monitoring-debugging](https://github.com/google/cloud-tpu-monitoring-debugging)
GitHub repository contains a set of scripts and configuration files that use
[Terraform](https://developer.hashicorp.com/terraform) to automatically deploy
dashboards containing all Cloud TPU-related metrics and logs.
To set up these dashboards in your Google Cloud project, see
[Monitoring and Logging Dashboards](/tpu/docs/troubleshooting/dashboards).

Profiling your workloads on TPU VMs
-----------------------------------

Profiling lets you optimize your model's training performance on TPU VMs.
You use [TensorBoard](https://www.tensorflow.org/tensorboard) and the
[TPU TensorBoard plug-in](/tpu/docs/profile-tpu-vm#install-plugin)
to profile your model.
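Before capturing a full TensorBoard trace, it can help to confirm which training steps are slow enough to be worth profiling. A minimal, framework-agnostic step timer (the `train_step` callable here is a hypothetical stand-in for your real training step):

```python
import time

def time_steps(train_step, num_steps: int) -> list[float]:
    """Run `train_step` repeatedly and return per-step wall-clock
    durations in seconds. Unusually slow steps are good candidates
    for a detailed TensorBoard profile."""
    durations = []
    for step in range(num_steps):
        start = time.perf_counter()
        train_step(step)
        durations.append(time.perf_counter() - start)
    return durations

# Hypothetical stand-in for a real training step.
durations = time_steps(lambda step: sum(i * i for i in range(10_000)), num_steps=5)
slowest = max(range(len(durations)), key=durations.__getitem__)
print(f"slowest step: {slowest} ({durations[slowest]:.4f}s)")
```

Note that frameworks with asynchronous dispatch (such as JAX) return before device work finishes, so for accelerator code you would need to block on the step's outputs (for example with `jax.block_until_ready`) for the wall-clock numbers to be meaningful.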
For more information about how to profile your workload,
see [Profile your model on TPU VMs](/tpu/docs/profile-tpu-vm).

For more information about using TensorBoard with one of the supported
frameworks, see the following documents:

- [PyTorch performance guide](/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
- [JAX performance guide](https://jax.readthedocs.io/en/latest/profiling.html)