Debugging Cloud TPU VMs

This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. The package dumps Python stack traces when a fault occurs, such as a segmentation fault, a floating-point exception, or an illegal operation exception. It also collects stack traces periodically to help you debug a program that has become unresponsive.
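
The two behaviors described above, dumping traces on a fault and dumping them periodically, can be illustrated with Python's standard faulthandler module. This is only a minimal sketch of the general idea, not the implementation of cloud-tpu-diagnostics, whose internals this document does not describe. The helper names enable_trace_dumps and disable_trace_dumps are hypothetical and exist only for this example.

```python
import faulthandler
import tempfile
import time


def enable_trace_dumps(trace_file, interval_seconds=600):
    """Write Python stack traces to trace_file on faults and at a fixed interval.

    Hypothetical helper for illustration; not part of cloud-tpu-diagnostics.
    """
    # On fatal signals (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), write the
    # traceback of every thread to trace_file before the process dies.
    faulthandler.enable(file=trace_file, all_threads=True)
    # Also dump the tracebacks of all threads every interval_seconds until
    # cancelled, which helps when the program hangs instead of crashing.
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=trace_file)


def disable_trace_dumps():
    """Stop the periodic dumps and uninstall the fault handlers."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()


if __name__ == "__main__":
    with tempfile.TemporaryFile("w+") as trace_file:
        enable_trace_dumps(trace_file, interval_seconds=1)
        time.sleep(2.5)  # stand-in for real work; long enough for a periodic dump
        disable_trace_dumps()
        trace_file.seek(0)
        print(trace_file.read())
```

The sketch uses a 1-second interval only so that the demonstration finishes quickly. The package described here defaults to a much longer interval and also handles uploading the traces, which this sketch does not attempt.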

To use the cloud-tpu-diagnostics PyPI package, install it by running pip install cloud-tpu-diagnostics on every TPU VM. You can do this with a single gcloud compute tpus tpu-vm ssh command that targets all workers. For example:

  gcloud compute tpus tpu-vm ssh your-tpu-name \
  --zone=your-zone \
  --project=your-project-name \
  --worker=all \
  --command="pip install cloud-tpu-diagnostics"

You must also add the following code to the script that runs on each TPU VM:

from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration

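# Collect stack traces and send them to the cloud rather than keeping them local.
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True)
# Nest the stack trace settings inside the debug and diagnostic configurations.
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config)
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config)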

By default, stack traces are collected every 10 minutes. To change the interval between collections, set stack_trace_interval_seconds. For example, to collect stack traces every 5 minutes:

stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,
    stack_trace_interval_seconds=300)

Wrap the call to your main function in a diagnose() block to collect stack traces periodically while it runs:

with diagnostic.diagnose(diagnostic_config):
    run_main()

With this configuration, stack traces are written to the /tmp/debugging directory on each TPU VM. An agent running on every TPU VM uploads the traces from this temporary directory to Cloud Logging.
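
To check on a single VM whether traces are being written before they reach Cloud Logging, you could list the most recently modified files in that directory. The following is a minimal sketch. The function name newest_trace_files is hypothetical, and because this document does not specify how trace files are named or organized, the sketch makes no assumption beyond the /tmp/debugging path and simply lists whatever regular files it finds there.

```python
from pathlib import Path


def newest_trace_files(directory="/tmp/debugging", limit=5):
    """Return up to `limit` regular files under `directory`, newest first.

    Hypothetical helper for illustration; file names and layout inside the
    directory are not assumed.
    """
    root = Path(directory)
    if not root.is_dir():
        return []  # nothing collected yet, or the directory differs on this VM
    # Search recursively in case traces are stored in subdirectories.
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    for path in newest_trace_files():
        print(path)
```

An empty result can mean that no collection interval has elapsed yet or that the directory does not exist on this VM.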