Build, optimize, and scale training, inference, and reinforcement learning workloads to power autonomous reasoning agents
Overview
TPUs are custom-designed accelerators purpose-built for AI workloads such as agents, code generation, large language models, media content generation, synthetic speech, vision services, recommendation engines, and personalization models. TPUs power Gemini and all of Google's AI-powered applications, like Search, Photos, and Maps, all serving over 1 billion users.
The shift to Agentic AI requires infrastructure capable of multi-step reasoning and continuous reinforcement learning. TPUs break the inference "memory wall" by hosting massive KV caches entirely on-silicon, utilizing expanded on-chip SRAM with TPU 8i. Combined with our SparseCore engine to offload communication tasks, this architecture reduces core idle time. The result is low-latency, predictable performance that powers complex reasoning loops.
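To see why the KV cache is the bottleneck this design targets, a back-of-envelope sizing helps. The sketch below is illustrative only: the model configuration is a hypothetical 70B-class transformer, not a TPU or Gemini specification.

```python
# Back-of-envelope KV-cache sizing for a decoder-only transformer.
# All configuration numbers below are illustrative assumptions.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Size of the K and V tensors cached across all layers.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes bf16/fp16 activations.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical config: 80 layers, 8 KV heads of dimension 128,
# an 8K context window, and a batch of 32 concurrent requests.
size_gb = kv_cache_bytes(80, 8, 128, 8192, 32) / 1e9
print(f"{size_gb:.1f} GB")  # ~85.9 GB of cache for this one batch
```

Even at modest batch sizes, the cache quickly outgrows off-chip memory budgets, which is why keeping it on-silicon changes the latency profile of long, multi-step reasoning loops.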
Speed up your deployment time by reducing training timelines for frontier models. Cloud TPUs maximize goodput, ensuring that nearly every compute cycle is spent on active learning. This is supported by a high-speed Inter-Chip Interconnect, Optical Circuit Switching, and the Virgo network, so accelerators operate as a highly reliable, unified system.
TPUs are engineered to improve value and power consumption by focusing on the computational demands of AI, eliminating the operational overhead found in multi-purpose architectures. Integrated power management dynamically adjusts to real-time request volume, delivering high performance-per-watt and supporting complex AI workloads sustainably.
Build on an open ecosystem using familiar libraries and tools. Cloud TPUs provide native, high-performance support for PyTorch and JAX, and support the vLLM engine for fast inference. Manage and scale these deployments reliably across global clusters with Google Kubernetes Engine (GKE).
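As a minimal sketch of what native framework support looks like in practice, the same JAX program runs unchanged on TPU, GPU, or CPU: `jax.jit` compiles it through XLA for whichever backend is present. The function and shapes here are arbitrary examples, not a recommended workload.

```python
import jax
import jax.numpy as jnp

# jax.jit compiles this function with XLA for the available backend
# (TPU cores on a Cloud TPU VM, otherwise GPU or CPU).
@jax.jit
def predict(weights, x):
    return jnp.tanh(x @ weights)

weights = jnp.ones((128, 16))
x = jnp.ones((4, 128))
out = predict(weights, x)

print(out.shape)      # (4, 16)
print(jax.devices())  # e.g. [TpuDevice(id=0), ...] on a TPU VM
```

Because the accelerator is selected at runtime, the same code can be developed locally and deployed to a TPU slice on GKE without modification.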
Cloud TPU versions
| Cloud TPU version | Description | Availability |
|---|---|---|
| TPU 8i | Optimized for post-training and inference; provides an 80% performance-per-dollar improvement over previous generations for low-latency inference on large MoE models. | Coming soon |
| TPU 8t | Built for large-scale pre-training and embedding-heavy workloads at a scale of 9,600 chips in a single superpod; provides up to a 2.7x performance-per-dollar improvement over Ironwood for large-scale training. | Coming soon |
| Ironwood | Seventh-generation, energy-efficient TPU engineered for large-scale training, reasoning, and inference. Features 9,216 liquid-cooled chips per pod, delivering 42.5 ExaFlops and 4x better performance per chip than Trillium. | Generally available in North America (Central region) and Europe (West region) |
| Trillium | Sixth-generation TPU featuring improved energy efficiency and peak compute performance for training and inference. Operates with 67% more energy efficiency and provides 4.7x higher peak compute performance per chip than the previous-generation TPU v5e. | Generally available in North America (US East region), Europe (West region), and Asia (Northeast region) |
How It Works
Get an inside look at the magic of Google Cloud TPUs, including a rare view inside the data centers. Customers use Cloud TPUs to run some of the largest-scale AI workloads, and that capacity comes from much more than just a chip. In this video, take a look at the components of the TPU system, including data center networking, optical circuit switches, water-cooling systems, biometric security verification, and more.
Reduce pre-training timelines for massive foundation models. The TPU 8t provides high-performance compute power within a single pod and scales via the Virgo network. Paired with rapid storage access and Axion-powered NUMA isolation, the architecture achieves high goodput, ensuring compute cycles are spent on active model building rather than idling during data transfer or hardware resets.
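Goodput here means the fraction of wall-clock time spent making real training progress, as opposed to stalls from data loading, checkpointing, or hardware restarts. A toy calculation (every number below is invented for illustration) shows why the metric matters at scale:

```python
# Toy goodput calculation; all numbers are illustrative assumptions.

def goodput(productive_hours, wall_clock_hours):
    """Fraction of wall-clock time spent on useful training steps."""
    return productive_hours / wall_clock_hours

# Suppose a 30-day run loses 3 days to restarts and data stalls.
wall_clock = 30 * 24
lost = 3 * 24
g = goodput(wall_clock - lost, wall_clock)
print(f"goodput = {g:.0%}")  # goodput = 90%

# The same 27 productive days at a lower goodput take longer overall:
def wall_clock_days(productive_days, g):
    return productive_days / g

print(wall_clock_days(27, 0.70))  # ~38.6 wall-clock days vs. 30
```

Small goodput improvements compound across thousands of chips, which is why interconnect reliability and fast restart paths show up directly in training timelines.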
Build base models into intelligent agents through intensive post-training workflows. The 8th-generation TPU systems rapidly process continuous reinforcement learning trials, rewarding the best reasoning paths without the cycle delays common to previous generations. This allows you to efficiently fine-tune world models, enabling agents to refine their reasoning in simulated environments before executing in the real world.
Break the inference memory wall. The TPU 8i expands on-chip SRAM and high-bandwidth memory, hosting high-capacity KV caches entirely on-silicon. By using the SparseCore-Collectives Acceleration Engine (SC-CAE) to offload global communication tasks, this architecture significantly reduces on-chip latency, freeing the main compute cores for pure, low-latency token generation.
Business Case
Autonomous reasoning agents
TPUs provide the memory bandwidth and low-latency inference required to run continuous, multi-step reasoning loops for real-time coding assistants, autonomous customer service, and security operations.
Foundation models and multimodal generative AI
Delivering continuous, high-throughput compute, TPUs efficiently build and serve massive foundation models across text, image, audio, and video modalities.
Precision science and healthcare
TPUs manage complex, matrix-heavy mathematics to accelerate computationally intensive simulations for structural biology, genomic sequencing, and drug discovery.
Physical AI
Build physical agents that interact with and adapt to the real world. Simulate and train robots, autonomous agents, and industrial machines faster and more efficiently with synthetic and real-world data.