
Announcing a new monitoring library to optimize TPU performance

July 18, 2025
Sohamn Chatterjee

Product Manager

Xingyao Leo Zhang

Software Engineer


For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads. And there is strong demand from customers for Cloud TPUs as well. When running advanced AI workloads, you need to be able to monitor and optimize the efficiency of your training and inference jobs, and swiftly diagnose performance bottlenecks, node degradation, and host interruptions. Ultimately, you need real-time optimization logic built into your training and inference pipelines so you can maximize the efficiency of your applications — whether you’re optimizing for ML Goodput, operational cost, or time-to-market. 

Today, we're thrilled to introduce a new monitoring library for Google Cloud TPUs: a set of observability and diagnostic tools that provides granular, integrated insights into performance and accelerator utilization, so you can continuously assess and improve the efficiency of your Cloud TPU workloads.

Note: If you have shell access to the TPU VM and just need some diagnostic information (e.g., to observe memory usage for a running process), you can use tpu-info, a command-line tool for viewing TPU metrics.
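
For example, a quick point-in-time check might look like this (assuming tpu-info is installed on the TPU VM; it is typically available via pip):

# Install and run the CLI on the TPU VM to view chip utilization and memory usage.
pip install tpu-info
tpu-info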

Unlocking dynamic optimization: Key metrics in action

The monitoring library provides snapshot-mode access to a rich set of metrics, such as Tensor core utilization, high-bandwidth memory (HBM) usage, and buffer transfer latency. Metrics are sampled every second (1 Hz) for consistency. See the documentation for a full list of metrics and how you can use them.

You can use these metrics directly in your code to dynamically optimize for greater efficiency. For instance, if your duty_cycle_pct (a measure of how busy the accelerator is) is consistently low, you can programmatically adjust your data pipeline or increase the batch size to better saturate the Tensor core. If hbm_capacity_usage approaches the HBM limit, your code could trigger a dynamic reduction in model size or activate memory-saving strategies to avoid out-of-memory errors. Similarly, hlo_exec_timing (how long operations take to execute on the accelerator) and hlo_queue_size (how many operations are waiting to be executed) can inform runtime adjustments to communication patterns or workload distribution based on observed bottlenecks.

Let’s see how to set up the library with a couple of realistic examples.

Getting started with the library

The TPU monitoring library ships as part of LibTPU. Here’s how to install it:

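A minimal install sketch, assuming the standalone libtpu wheel (the exact package name and index can differ by environment, so check the installation docs):

# On the TPU VM; skip this if jax[tpu] or torch_xla[tpu] already pulled LibTPU in.
pip install libtpu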

For JAX or PyTorch users, LibTPU is included automatically when you install jax[tpu] or torch_xla[tpu] (read more about PyTorch/XLA and JAX installation).

You can refer to the library in your Python code with from libtpu.sdk import tpumonitoring. You can then discover supported functionality with tpumonitoring.help() and list available metric names with tpumonitoring.list_supported_metrics().
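
Put together, a quick check from a Python shell on the TPU VM might look like this (a sketch using only the calls named above):

from libtpu.sdk import tpumonitoring

# Print the monitoring API surface and the metrics this LibTPU build supports.
tpumonitoring.help()
print(tpumonitoring.list_supported_metrics())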

Example 1. Monitoring TPU duty cycle during training for dynamic adjustment 

Integrate duty_cycle_pct logging into your JAX training loop to track how busy the TPUs are.

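Here’s a minimal sketch of that pattern. The model and training step are placeholders, and the get_metric(...).data() accessor is an assumption on our part, so confirm the exact call with tpumonitoring.help():

import jax
import jax.numpy as jnp
from libtpu.sdk import tpumonitoring

@jax.jit
def train_step(params, batch):
    # Placeholder update; a real step would compute gradients of your loss.
    return jax.tree_util.tree_map(lambda p: p - 1e-3 * jnp.mean(batch), params)

params = {"w": jnp.ones((1024, 1024))}

for step in range(1_000):
    batch = jnp.ones((128, 1024))  # stand-in for your input pipeline
    params = train_step(params, batch)

    if step % 100 == 0:
        # Assumed accessor: get_metric(...).data(); check tpumonitoring.help()
        # for the exact API exposed by your LibTPU version.
        duty_cycle = tpumonitoring.get_metric("duty_cycle_pct").data()
        print(f"step {step}: duty_cycle_pct = {duty_cycle}")
        # In a real pipeline you could re-shard data or grow the batch size
        # here when the duty cycle stays low.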

A consistently low duty cycle suggests potential CPU bottlenecks or inefficient data loading. This example simply prints the value, but in a real workload you could trigger re-sharding or other corrective actions.

Example 2. Checking HBM utilization before JAX inference for resource management 

While running JAX programs on Cloud TPUs, optimizing HBM usage is a significant opportunity. By proactively getting insight into the TPU memory that will be reserved during compilation, you can unlock greater efficiency and prevent out-of-memory (OOM) errors, which is especially crucial when scaling large models. Checking the hbm_capacity_usage metric from the monitoring library shows you how much HBM is in use, allowing you to dynamically adjust your inference strategy and mitigate memory errors.

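A sketch of such a pre-flight check. The accessor, return format, and units of hbm_capacity_usage are assumptions here, so verify them against the documentation before relying on the threshold:

import jax.numpy as jnp
from libtpu.sdk import tpumonitoring

# Assumed per-chip HBM budget for this deployment; adjust to your hardware
# and to the units the metric actually reports (bytes vs. percent).
HBM_BUDGET_BYTES = 90 * 1024**3

def worst_case_hbm_usage():
    """Read current HBM usage across chips (accessor is an assumption)."""
    usage = tpumonitoring.get_metric("hbm_capacity_usage").data()
    readings = usage if isinstance(usage, (list, tuple)) else [usage]
    return max(float(r) for r in readings)

batch_size = 64
if worst_case_hbm_usage() > 0.9 * HBM_BUDGET_BYTES:
    # Shrink the batch instead of risking an OOM at compile or run time.
    batch_size = 16

inputs = jnp.ones((batch_size, 2048))
# ... run your compiled inference function on `inputs` ...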

If HBM usage is unexpectedly high, you might consider optimizing your model size, batching strategy, or input data pipelines.

Maximize your TPU utilization

In this post, we showed two simple examples of how you can improve the efficiency of your TPU workloads with some proactive monitoring. The TPU monitoring library can help you improve the utilization of your accelerators, dynamically tune workloads to your use case, and keep your costs in check.

To learn more about the TPU monitoring library, please visit the documentation. To get started with Cloud TPUs, please visit our Intro to TPU documentation.
