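# Debugging Cloud TPU VMs

This document describes how to use the [cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package to generate stack traces for processes running on TPU VMs. The package dumps Python tracebacks when a fault occurs, such as a segmentation fault, a floating-point exception, or an illegal operation exception.
It also collects stack traces periodically to help you debug situations where the program is unresponsive.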
To use the cloud-tpu-diagnostics PyPI package, install it by running `pip install cloud-tpu-diagnostics` on all TPU VMs. You can do this with a single `gcloud compute tpus tpu-vm ssh` command. For example:

```bash
gcloud compute tpus tpu-vm ssh your-tpu-name \
  --zone=your-zone \
  --project=your-project-name \
  --worker=all \
  --command="pip install cloud-tpu-diagnostics"
```

You must also add the following code to the scripts that run on all TPU VMs:

    from cloud_tpu_diagnostics import diagnostic
    from cloud_tpu_diagnostics.configuration import debug_configuration
    from cloud_tpu_diagnostics.configuration import diagnostic_configuration
    from cloud_tpu_diagnostics.configuration import stack_trace_configuration

    # Enable stack trace collection.
    stack_trace_config = stack_trace_configuration.StackTraceConfig(
        collect_stack_trace=True,
        stack_trace_to_cloud=True)
    # Nest the stack trace settings inside the debug and diagnostic configs.
    debug_config = debug_configuration.DebugConfig(
        stack_trace_config=stack_trace_config)
    diagnostic_config = diagnostic_configuration.DiagnosticConfig(
        debug_config=debug_config)

By default, stack traces are collected every 10 minutes. You can change the interval between two stack trace collection events. For example, to collect every 5 minutes:

    stack_trace_config = stack_trace_configuration.StackTraceConfig(
        collect_stack_trace=True,
        stack_trace_to_cloud=True,
        stack_trace_interval_seconds=300)  # 300 seconds = 5 minutes

Wrap your main method with `diagnose()` to periodically collect the stack traces:

    with diagnostic.diagnose(diagnostic_config):
        run_main()

With this configuration, stack traces are collected in the `/tmp/debugging` directory on each TPU VM. An agent running on every TPU VM uploads the traces from the temporary directory to Cloud Logging.

Last updated: 2025-09-04 (UTC)
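This document does not describe how the package works internally. As a rough illustration of the two behaviors it mentions (dumping Python tracebacks on faults and collecting them periodically), the following minimal sketch uses only Python's standard `faulthandler` module. It is not the cloud-tpu-diagnostics implementation: the helper names `start_trace_collection`, `stop_trace_collection`, and `run_demo`, the trace file name, and the 0.2-second interval are assumptions chosen for the demo, and nothing here uploads to Cloud Logging.

```python
import faulthandler
import os
import tempfile
import time


def start_trace_collection(trace_dir, interval_seconds):
    """Hypothetical helper: enable fault-time and periodic stack dumps."""
    os.makedirs(trace_dir, exist_ok=True)
    trace_path = os.path.join(trace_dir, f"stack_trace_{os.getpid()}.txt")
    trace_file = open(trace_path, "a")
    # Dump the tracebacks of all threads on fatal signals
    # (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
    faulthandler.enable(file=trace_file, all_threads=True)
    # Additionally dump the tracebacks every interval_seconds until cancelled.
    faulthandler.dump_traceback_later(
        interval_seconds, repeat=True, file=trace_file)
    return trace_path, trace_file


def stop_trace_collection(trace_file):
    """Hypothetical helper: stop both kinds of dumps and close the file."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
    trace_file.close()


def run_demo():
    """Simulate an unresponsive workload and return the collected traces."""
    trace_dir = tempfile.mkdtemp(prefix="debugging_")
    trace_path, trace_file = start_trace_collection(
        trace_dir, interval_seconds=0.2)
    try:
        time.sleep(0.7)  # stand-in for a hung training step
    finally:
        stop_trace_collection(trace_file)
    with open(trace_path) as f:
        return f.read()


if __name__ == "__main__":
    print(run_demo())
```

Each periodic dump names the frame the main thread is blocked in, which is the information you would look for when a program stops responding. A real job would use an interval of minutes rather than fractions of a second.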