Troubleshoot your Cloud TPU workflow
====================================

Once you have your training or inference workload running on TPUs, the next
step is to ensure your workload is working as expected. Cloud TPU generates
metrics and logs that enable you to look for and debug any TPU VMs that are not
behaving as expected. We refer to such VMs as *outliers* throughout this
documentation.
The general troubleshooting workflow is:
1. View Cloud TPU metrics to check for outlier TPU VMs
2. View Cloud TPU logs for the outlier TPU VMs
3. Profile your workload
You can view metrics and logs in the [Metrics Explorer](/monitoring/charts/metrics-explorer)
and the [Logs Explorer](/logging/docs/view/logs-explorer-interface) in the
Google Cloud console. You can also use monitoring and logging dashboards to
collect all Cloud TPU related metrics and logs in individual dashboards.
Cloud TPU VM metrics
--------------------
Cloud Monitoring automatically collects metrics from your TPUs and their host
Compute Engine VMs. Metrics track numerical quantities over time, for example
CPU utilization, network usage, or TensorCore idle duration. For more
information on Cloud TPU metrics, see
[Monitoring TPU VMs](/tpu/docs/troubleshooting/tpu-vm-monitoring).
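If you want to check for outliers from a script instead of Metrics Explorer,
the same time series are available through the Cloud Monitoring API. The
following is a minimal sketch using the `google-cloud-monitoring` Python
client; the project ID is a placeholder, and the metric type and `node_id`
resource label used here are assumptions, so confirm the exact names for your
project in Metrics Explorer.

```python
# Minimal sketch: read TensorCore idle duration per TPU worker for the last
# hour so that unusually idle workers (potential outliers) stand out.
# Assumptions: google-cloud-monitoring is installed and authenticated; the
# metric type and the "node_id" resource label below may differ in your
# project, so confirm them in Metrics Explorer.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder: replace with your project ID
METRIC_TYPE = "tpu.googleapis.com/tpu/tensorcore/idle_duration"  # assumed name

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": f'metric.type = "{METRIC_TYPE}"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    worker = series.resource.labels.get("node_id", "unknown")
    point = series.points[0]  # points are returned newest first
    # Which value field is set depends on the metric's value type.
    value = point.value.double_value or point.value.int64_value
    print(f"{worker}: {value}")
```

A worker whose values differ sharply from its peers is a good first candidate
to inspect in the logs, as described in the next section.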
[[["わかりやすい","easyToUnderstand","thumb-up"],["問題の解決に役立った","solvedMyProblem","thumb-up"],["その他","otherUp","thumb-up"]],[["わかりにくい","hardToUnderstand","thumb-down"],["情報またはサンプルコードが不正確","incorrectInformationOrSampleCode","thumb-down"],["必要な情報 / サンプルがない","missingTheInformationSamplesINeed","thumb-down"],["翻訳に関する問題","translationIssue","thumb-down"],["その他","otherDown","thumb-down"]],["最終更新日 2025-09-04 UTC。"],[],[],null,["# Troubleshoot your Cloud TPU workflow\n====================================\n\nOnce you have your training or inference workload running on TPUs, the\nnext step is to ensure your workload is working as expected. Cloud TPU\ngenerates metrics and logs that enable you to look for and debug any TPU VMs\nthat are not behaving as expected. We refer to such VMs as *outliers* throughout\nthis documentation.\n\nThe general troubleshooting workflow is:\n\n1. View Cloud TPU metrics to check for outlier TPU VMs\n2. View Cloud TPU logs for the outlier TPU VMs\n3. Profile your workload\n\nYou can view metrics and logs in the [Metrics Explorer](/monitoring/charts/metrics-explorer)\nand the [Logs Explorer](/logging/docs/view/logs-explorer-interface) in the Google Cloud\nconsole. You can also use monitoring and logging dashboards to collect all Cloud TPU\nrelated metrics and logs in individual dashboards.\n\n\nCloud TPU VM metrics\n--------------------\n\nCloud Monitoring automatically collects metrics from your TPUs and their host\nCompute Engine VMs. Metrics track numerical quantities over time, for example,\nCPU utilization, network usage, or TensorCore idle duration. For more information\non Cloud TPU metrics, see [Monitoring TPU VMs](/tpu/docs/troubleshooting/tpu-vm-monitoring).\n\nCloud TPU logs\n--------------\n\nCloud Logging automatically collects logs from your TPUs and their host\nCompute Engine VMs. Cloud Logging tracks events generated by Cloud TPU.\nYou can also instrument your code to generate logs. Two types of logs are\ngenerated by Cloud TPU:\n\n- TPU Worker logs\n- Audited resource logs\n\nTPU Worker logs contain information about a specific TPU worker in a specific\nzone, for example the amount of memory available on the TPU worker\n(system_available_memory_GiB).\n\nAudited Resource logs contain information about when a specific Cloud TPU API\nwas called and who made the call. For example `CreateNode`, `UpdateNode`, and\n`DeleteNode`.\n\nYou can also use the `cloud-tpu-diagnostics` PyPi package to write stack traces\nto logs. For more information, see [Debugging TPU VMs](/tpu/docs/troubleshooting/debugging).\n\nFor more information about logs, see [Logging](/tpu/docs/troubleshooting/tpu-vm-monitoring#locate_logs).\n\nMonitoring and logging dashboards\n---------------------------------\n\nHaving a single page in the Google Cloud console can make viewing and interpreting\nCloud TPU-related metrics and logs easier. 
Monitoring and logging dashboards
---------------------------------

Having a single page in the Google Cloud console can make viewing and
interpreting Cloud TPU-related metrics and logs easier. The
[monitoring-debugging](https://github.com/google/cloud-tpu-monitoring-debugging)
GitHub repository contains a set of scripts and configuration files that use
[Terraform](https://developer.hashicorp.com/terraform) to automatically deploy
dashboards that bring together all Cloud TPU related metrics and logs.
To set up these dashboards in your Google Cloud project, see
[Monitoring and Logging Dashboards](/tpu/docs/troubleshooting/dashboards).

Profiling your workloads on TPU VMs
-----------------------------------

Profiling lets you optimize your model's training performance on TPU VMs.
You use [TensorBoard](https://www.tensorflow.org/tensorboard) and the
[TPU TensorBoard plug-in](/tpu/docs/profile-tpu-vm#install-plugin)
to profile your model. For more information about how to profile your workload,
see [Profile your model on TPU VMs](/tpu/docs/profile-tpu-vm); a minimal example
of capturing a trace is shown after the framework guides below.

For more information about using TensorBoard with one of the supported
frameworks, see the following documents:

- [PyTorch performance guide](/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
- [JAX performance guide](https://jax.readthedocs.io/en/latest/profiling.html)
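As a concrete starting point, here is a minimal sketch of capturing a trace
programmatically, assuming a JAX workload; the training step and log directory
are placeholders, and PyTorch/XLA or TensorFlow users should follow the
framework guides above instead.

```python
# Minimal sketch, assuming a JAX workload: capture a profiler trace that can be
# opened in TensorBoard with the TPU TensorBoard plug-in.
import jax
import jax.numpy as jnp

LOG_DIR = "/tmp/tensorboard"  # placeholder: point TensorBoard at this directory


@jax.jit
def train_step(x):
    # Placeholder for a real training step.
    return jnp.sin(x) @ jnp.cos(x).T


x = jnp.ones((1024, 1024))

# Everything executed inside the context manager is recorded in the trace.
with jax.profiler.trace(LOG_DIR):
    for _ in range(10):
        x = train_step(x)
    x.block_until_ready()  # make sure device work finishes inside the trace
```

Run `tensorboard --logdir /tmp/tensorboard` and open the profiling view added
by the TPU TensorBoard plug-in to inspect the captured trace.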