[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-04。"],[],[],null,["# Debugging Cloud TPU VMs\n=======================\n\nThis document describes how to use the [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/)\nPyPI package to generate stack traces for processes running in TPU VMs. This\npackage dumps the Python traces when a fault occurs, for example segmentation\nfaults, floating-point exceptions, or illegal operation exceptions.\nAdditionally, it also periodically collects stack traces to help you debug\nsituations when the program is unresponsive.\n\n\nTo use the [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/)\nPyPI package, you must install it by running `pip install cloud-tpu-diagnostics`\non all TPU VMs. You can do this with one `gcloud compute tpus tpu-vm ssh`\ncommand. For example: \n\n```bash\n gcloud compute tpus tpu-vm ssh you-tpu-name \\\n --zone=your-zone \\\n --project=your-project-name \\\n --worker=all \\\n --command=\"pip install cloud-tpu-diagnostics\"\n```\n\nYou must also add the following code to your scripts running on all TPU VMs. \n\n from cloud_tpu_diagnostics import diagnostic\n from cloud_tpu_diagnostics.configuration import debug_configuration\n from cloud_tpu_diagnostics.configuration import diagnostic_configuration\n from cloud_tpu_diagnostics.configuration import stack_trace_configuration\n\n stack_trace_config = stack_trace_configuration.StackTraceConfig(\n collect_stack_trace = True,\n stack_trace_to_cloud = True)\n debug_config = debug_configuration.DebugConfig(\n stack_trace_config = stack_trace_config)\n diagnostic_config = diagnostic_configuration.DiagnosticConfig(\n debug_config = debug_config)\n\nBy default, stack traces are collected every 10 minutes. You can change\nthe duration between two stack trace collection events to 5 minutes, for example: \n\n stack_trace_config = stack_trace_configuration.StackTraceConfig(\n collect_stack_trace = True,\n stack_trace_to_cloud = True,\n stack_trace_interval_seconds = 300)\n\nWrap your main method with `diagnose()` to periodically collect the stack traces: \n\n with diagnostic.diagnose(diagnostic_config):\n run_main()\n\nThis configuration starts collecting stack traces inside the `/tmp/debugging`\ndirectory on each TPU VM. There is an agent running on all TPU VMs that uploads\nthe traces from a temporary directory to Cloud Logging."]]