Debugging Cloud TPU VMs
This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. This package dumps the Python traces when a fault occurs, for example segmentation faults, floating-point exceptions, or illegal operation exceptions. Additionally, it also periodically collects stack traces to help you debug situations when the program is unresponsive.
To use the cloud-tpu-diagnostics
PyPI package, you must install it by running pip install cloud-tpu-diagnostics
on all TPU VMs. You can do this with one gcloud compute tpus tpu-vm ssh
command. For example:
gcloud compute tpus tpu-vm ssh you-tpu-name \ --zone=your-zone \ --project=your-project-name \ --worker=all \ --command="pip install cloud-tpu-diagnostics"
You must also add the following code to your scripts running on all TPU VMs.
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
stack_trace_config = stack_trace_configuration.StackTraceConfig(
collect_stack_trace = True,
stack_trace_to_cloud = True)
debug_config = debug_configuration.DebugConfig(
stack_trace_config = stack_trace_config)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
debug_config = debug_config)
By default, stack traces are collected every 10 minutes. You can change the duration between two stack trace collection events to 5 minutes, for example:
stack_trace_config = stack_trace_configuration.StackTraceConfig(
collect_stack_trace = True,
stack_trace_to_cloud = True,
stack_trace_interval_seconds = 300)
Wrap your main method with diagnose()
to periodically collect the stack traces:
with diagnostic.diagnose(diagnostic_config):
run_main()
This configuration starts collecting stack traces inside the /tmp/debugging
directory on each TPU VM. There is an agent running on all TPU VMs that uploads
the traces from a temporary directory to Cloud Logging.