Tensor Processing Units (TPUs)

Engineered for next-generation AI

Build, optimize, and scale training, inference, and reinforcement learning workloads to power autonomous reasoning agents 

Overview

A decade of Tensor Processing Units (TPUs)

TPUs are custom-designed accelerators purpose-built for AI workloads such as agents, code generation, large language models, media content generation, synthetic speech, vision services, recommendation engines, and personalization models. TPUs power Gemini and Google's AI-powered applications like Search, Photos, and Maps, each serving over a billion users.

Purpose-built for agentic AI

The shift to Agentic AI requires infrastructure capable of multi-step reasoning and continuous reinforcement learning. TPUs break the inference "memory wall" by hosting massive KV caches entirely on-silicon, utilizing expanded on-chip SRAM with TPU 8i. Combined with our SparseCore engine to offload communication tasks, this architecture reduces core idle time. The result is low-latency, predictable performance that powers complex reasoning loops.
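To make the memory-wall point concrete, here is a back-of-envelope KV-cache sizing calculation. The model dimensions below are illustrative assumptions for the example, not the specs of any particular TPU or model:

```python
# Rough KV-cache sizing for one sequence (illustrative numbers only):
# bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem
layers, kv_heads, head_dim = 48, 8, 128   # assumed model dimensions
seq_len = 32_768                          # assumed context length
bytes_per_elem = 2                        # bf16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache per sequence: {kv_bytes / 2**30:.1f} GiB")  # ~6.0 GiB
```

At multi-gigabyte sizes per sequence, keeping this cache close to the compute is what makes low-latency reasoning loops feasible.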

Performance without compromise

Reduce training timelines for frontier models and speed up your time to deployment. Cloud TPUs maximize goodput, ensuring that nearly every compute cycle is spent on active learning. This is supported by a high-speed Inter-Chip Interconnect, Optical Circuit Switching, and the Virgo network, so accelerators operate as a highly reliable, unified system.
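Goodput here means the fraction of wall-clock time spent on useful training work rather than recovering or waiting. A toy calculation, with all numbers invented for the example, illustrates the idea:

```python
# Illustrative goodput arithmetic (all numbers invented for the example).
step_time_s = 2.0      # time per productive training step
steps = 10_000         # steps completed in the run
overhead_s = 600.0     # restarts, checkpoint restores, data stalls
useful = step_time_s * steps
goodput = useful / (useful + overhead_s)
print(f"goodput = {goodput:.1%}")  # ~97.1%
```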

Sustainable economics at scale

TPUs are engineered to improve cost efficiency and reduce power consumption by focusing on the computational demands of AI, eliminating the operational overhead found in multi-purpose architectures. Integrated power management dynamically adjusts to real-time request volume, delivering high performance per watt and supporting complex AI workloads sustainably.

Open, flexible, and reliable operations

Build on an open ecosystem using familiar libraries and tools. Cloud TPUs provide native, high-performance support for PyTorch and JAX, and support the vLLM engine for fast inference. Manage and scale these deployments reliably across global clusters with Google Kubernetes Engine (GKE).
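As a minimal taste of the JAX path, the sketch below assumes a Cloud TPU VM with the TPU build of JAX installed; on such a host, jax.devices() reports TPU cores and jit-compiled code runs on them:

```python
# Minimal sketch: confirm JAX sees the TPU and run a jitted matmul on it.
# Assumes a Cloud TPU VM with the TPU build of JAX installed.
import jax
import jax.numpy as jnp

print(jax.devices())  # e.g. [TpuDevice(id=0), ...] on a TPU VM

@jax.jit  # XLA-compiles the function for the active (TPU) backend
def matmul(a, b):
    return a @ b

a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
b = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
print(matmul(a, b).sum())  # executes on the TPU
```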

Cloud TPU versions

TPU 8i

TPU 8i is optimized for post-training and inference, providing an 80% performance-per-dollar improvement over previous generations for low-latency inference on large MoE models.

Coming soon

TPU 8t

TPU 8t is built for large-scale pre-training and embedding-heavy workloads, scaling to 9,600 chips in a single superpod and providing up to a 2.7x performance-per-dollar improvement over Ironwood for large-scale training.

Coming soon

Ironwood

Seventh-generation, energy-efficient TPU engineered for large-scale training, reasoning, and inference. Each pod features 9,216 liquid-cooled chips, delivering 42.5 exaFLOPS and 4x better performance per chip than Trillium.

Ironwood is generally available in North America (Central region) and Europe (West region)

Trillium

Sixth-generation TPU featuring improved energy efficiency and peak compute performance for training and inference. It operates with 67% more energy efficiency and provides 4.7x higher peak compute performance per chip compared to the previous-generation TPU v5e.

Trillium is generally available in North America (US East region), Europe (West region), and Asia (Northeast region)

How It Works

Get an inside look at Google Cloud TPUs, including a rare view inside the data centers. Customers use Cloud TPUs to run some of the largest-scale AI workloads, and that capacity comes from much more than just a chip. In this video, take a look at the components of the TPU system, including data center networking, optical circuit switches, water cooling systems, biometric security verification, and more.


Common Uses

Run large-scale AI pre-training workloads

Speed up time-to-market for frontier models

Reduce pre-training timelines for massive foundation models. The TPU 8t provides high-performance compute power within a single pod and scales via the Virgo network. Paired with rapid storage access and Axion-powered NUMA isolation, the architecture achieves high goodput, ensuring compute cycles are spent on active model building rather than idling during data transfer or hardware resets.
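As a rough sketch of what scaling a step across chips looks like in user code (independent of TPU 8t specifics), JAX can shard a batch over all available devices and let XLA partition the computation:

```python
# Hedged sketch: shard a batch over every available TPU chip with
# jax.sharding; XLA then partitions the jitted step across the mesh.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("data",))

# Each chip holds one slice of the batch along the leading axis.
batch = jax.device_put(jnp.ones((4096, 1024)), NamedSharding(mesh, P("data")))

@jax.jit
def step(x):
    # Stand-in for a real training step; runs sharded across the mesh.
    return (x * x).mean()

print(step(batch))
```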

Efficient post-training and reinforcement learning

Scale reinforcement learning workloads efficiently

Build base models into intelligent agents through intensive post-training workflows. The 8th-generation TPU system rapidly processes continuous reinforcement learning trials, rewarding the best reasoning paths without the cycle delays common to previous generations. This allows you to efficiently fine-tune world models, enabling agents to refine their reasoning in simulated environments before executing in the real world.
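The hardware specifics above aside, the workload shape is familiar. As an illustration of the reinforcement-learning loop itself, here is a tiny REINFORCE-style policy-gradient sketch on a toy two-armed bandit; everything in it is invented for illustration and nothing is TPU-specific:

```python
# Toy REINFORCE loop: sample an action ("reasoning path"), observe a
# reward, and nudge the policy toward higher-reward choices.
import math, random

logits = [0.0, 0.0]        # policy over two candidate paths
true_reward = [0.2, 0.8]   # hidden payoff rate of each path (made up)
lr = 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(2000):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]   # sample a path
    r = 1.0 if random.random() < true_reward[a] else 0.0
    # Policy-gradient update: grad of log pi(a) wrt logits = onehot(a) - probs
    for i in range(2):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print([round(p, 2) for p in softmax(logits)])  # favors the better path
```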


Low-latency AI inference workloads at scale

High-performance, cost-efficient inference

Break the inference memory wall. The TPU 8i expands on-chip SRAM and high-bandwidth memory, hosting high-capacity KV caches entirely on-silicon. By using the SparseCore-Collectives Acceleration Engine (SC-CAE) to offload global communication tasks, this architecture significantly reduces on-chip latency, freeing the main compute cores for pure, low-latency token generation.
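For serving, the vLLM engine mentioned above exposes a simple offline API. This sketch assumes a host where vLLM is installed with TPU support, and the model id is illustrative:

```python
# Hedged sketch: offline batch inference with vLLM (model id illustrative;
# assumes vLLM installed with TPU support on this host).
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b")          # illustrative model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```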

Start your proof of concept

Try Cloud TPUs for free

Get a quick intro to using Cloud TPUs

Run PyTorch on TPUs (see the sketch after this list)

Run JAX on TPUs

Serve using vLLM on TPUs
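As a quick taste of the PyTorch path above, this sketch assumes a TPU VM with the torch_xla package installed; device placement and graph execution go through PyTorch/XLA:

```python
# Minimal sketch: a PyTorch matmul on a TPU core via PyTorch/XLA.
# Assumes a TPU VM with the `torch_xla` package installed.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                  # the XLA (TPU) device
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b                                 # recorded lazily as an XLA graph
xm.mark_step()                            # compile and run the graph on TPU
print(c.sum().item())
```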

Business Case


Autonomous reasoning agents

TPUs provide the memory bandwidth and low-latency inference required to run continuous, multi-step reasoning loops for real-time coding assistants, autonomous customer service, and security operations.

Foundation models and multimodal generative AI

TPUs deliver the sustained, high-throughput compute needed to efficiently build and serve massive foundation models across text, image, audio, and video modalities.

Precision science and healthcare

TPUs manage complex, matrix-heavy mathematics to accelerate computationally intensive simulations for structural biology, genomic sequencing, and drug discovery.



Physical AI

Build physical agents that interact with and adapt to the real world. Simulate and train robots, autonomous agents, and industrial machines faster and more efficiently with synthetic and real-world data.
