
Announcing a new monitoring library to optimize TPU performance

July 18, 2025
Sohamn Chatterjee

Product Manager

Xingyao Leo Zhang

Software Engineer


For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads. And there is strong demand from customers for Cloud TPUs as well. When running advanced AI workloads, you need to be able to monitor and optimize the efficiency of your training and inference jobs, and swiftly diagnose performance bottlenecks, node degradation, and host interruptions. Ultimately, you need real-time optimization logic built into your training and inference pipelines so you can maximize the efficiency of your applications — whether you’re optimizing for ML Goodput, operational cost, or time-to-market. 

Today, we're thrilled to introduce a new monitoring library for Google Cloud TPUs: a set of observability and diagnostic tools that provides granular, integrated insights into performance and accelerator utilization, so you can continuously assess and improve the efficiency of your Cloud TPU workloads.

Note: If you have shell access to the TPU VM and just need some diagnostic information (e.g., to observe memory usage for a running process), you can use tpu-info, a command-line tool for viewing TPU metrics.
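
For example, a quick point-in-time check might look like this (assuming tpu-info is installed on the TPU VM; it is typically available via pip):

# Install and run the CLI on the TPU VM to view chip utilization and memory usage.
pip install tpu-info
tpu-info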

Unlocking dynamic optimization: Key metrics in action

The monitoring library provides snapshot-mode access to a rich set of metrics, such as Tensor core utilization, high-bandwidth memory (HBM) usage, and buffer transfer latency. Metrics are sampled every second (1 Hz) for consistency. See the documentation for a full list of metrics and how you can use them.

You can use these metrics directly in your code to dynamically optimize for greater efficiency. For instance, if your duty_cycle_pct (a measure of how busy the accelerator is) is consistently low, you can programmatically adjust your data pipeline or increase the batch size to better saturate the Tensor core. If hbm_capacity_usage approaches the HBM limit, your code could trigger a dynamic reduction in model size or activate memory-saving strategies to avoid out-of-memory errors. Similarly, hlo_exec_timing (how long operations take to execute on the accelerator) and hlo_queue_size (how many operations are waiting to be executed) can inform runtime adjustments to communication patterns or workload distribution based on observed bottlenecks.

Let’s see how to set up the library with a couple of realistic examples.

Getting started with the library

The TPU monitoring library ships as part of LibTPU. Here’s how to install it:

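A minimal install sketch, assuming the standalone libtpu wheel (the exact package name and index can differ by environment, so check the installation docs):

# On the TPU VM; skip this if jax[tpu] or torch_xla[tpu] already pulled LibTPU in.
pip install libtpu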

For JAX or PyTorch users, LibTPU is included automatically when you install jax[tpu] or torch_xla[tpu] (read more about PyTorch/XLA and JAX installation).

You can refer to the library in your Python code with from libtpu.sdk import tpumonitoring. You can then discover supported functionality with tpumonitoring.help() and list available metric names with tpumonitoring.list_supported_metrics().
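
Put together, a quick check from a Python shell on the TPU VM might look like this (a sketch using only the calls named above):

from libtpu.sdk import tpumonitoring

# Print the monitoring API surface and the metrics this LibTPU build supports.
tpumonitoring.help()
print(tpumonitoring.list_supported_metrics())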

Example 1. Monitoring TPU duty cycle during training for dynamic adjustment 

Integrate duty_cycle_pct logging into your JAX training loop to track how busy the TPUs are.

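Here’s a minimal sketch of that pattern. The model and training step are placeholders, and the get_metric(...).data() accessor is an assumption on our part, so confirm the exact call with tpumonitoring.help():

import jax
import jax.numpy as jnp
from libtpu.sdk import tpumonitoring

@jax.jit
def train_step(params, batch):
    # Placeholder update; a real step would compute gradients of your loss.
    return jax.tree_util.tree_map(lambda p: p - 1e-3 * jnp.mean(batch), params)

params = {"w": jnp.ones((1024, 1024))}

for step in range(1_000):
    batch = jnp.ones((128, 1024))  # stand-in for your input pipeline
    params = train_step(params, batch)

    if step % 100 == 0:
        # Assumed accessor: get_metric(...).data(); check tpumonitoring.help()
        # for the exact API exposed by your LibTPU version.
        duty_cycle = tpumonitoring.get_metric("duty_cycle_pct").data()
        print(f"step {step}: duty_cycle_pct = {duty_cycle}")
        # In a real pipeline you could re-shard data or grow the batch size
        # here when the duty cycle stays low.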

A consistently low duty cycle suggests potential CPU bottlenecks or inefficient data loading. This example simply prints the value, but in a real workload you could trigger re-sharding or other corrective actions.

Example 2. Checking HBM utilization before JAX inference for resource management 

While running JAX programs on Cloud TPUs, optimizing HBM usage is a significant opportunity. By proactively getting insight into the TPU memory that will be reserved during compilation, you can unlock greater efficiency and prevent out-of-memory (OOM) errors, which is especially crucial when scaling large models. Checking the hbm_capacity_usage metric from the monitoring library shows you how much HBM is in use, allowing you to dynamically adjust your inference strategy and mitigate memory errors.

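A sketch of such a pre-flight check. The accessor, return format, and units of hbm_capacity_usage are assumptions here, so verify them against the documentation before relying on the threshold:

import jax.numpy as jnp
from libtpu.sdk import tpumonitoring

# Assumed per-chip HBM budget for this deployment; adjust to your hardware
# and to the units the metric actually reports (bytes vs. percent).
HBM_BUDGET_BYTES = 90 * 1024**3

def worst_case_hbm_usage():
    """Read current HBM usage across chips (accessor is an assumption)."""
    usage = tpumonitoring.get_metric("hbm_capacity_usage").data()
    readings = usage if isinstance(usage, (list, tuple)) else [usage]
    return max(float(r) for r in readings)

batch_size = 64
if worst_case_hbm_usage() > 0.9 * HBM_BUDGET_BYTES:
    # Shrink the batch instead of risking an OOM at compile or run time.
    batch_size = 16

inputs = jnp.ones((batch_size, 2048))
# ... run your compiled inference function on `inputs` ...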

If HBM usage is unexpectedly high, you might consider optimizing your model size, batching strategy, or input data pipelines.

Maximize your TPU utilization

In this post, we showed two simple examples of how you can improve the efficiency of your TPU workloads with some proactive monitoring. The TPU monitoring library can help you improve the utilization of your accelerators, dynamically tune workloads to your use case, and keep your costs in check.

To learn more about the TPU monitoring library, please visit the documentation. To get started with Cloud TPUs, please visit our Intro to TPU documentation.
