Announcing Trillium, the sixth generation of Google Cloud TPU

May 14, 2024

https://storage.googleapis.com/gweb-cloudblog-publish/images/Most-Advanced-TPU_1.max-2500x2500.png

Amin Vahdat

VP/GM, ML, Systems, and Cloud AI

Generative AI is transforming how we interact with technology while simultaneously opening tremendous efficiency opportunities for business impact. But these advances require ever greater compute, memory, and communication to train and fine-tune the most capable models and to serve them interactively to a global user population. For more than a decade, we at Google have been developing custom AI-specific hardware, Tensor Processing Units, or TPUs, to push forward the frontier of what is possible in scale and efficiency.

This hardware supported a number of the innovations we announced today at Google I/O, including new models like Gemini 1.5 Flash, Imagen 3, and Gemma 2; all of these models have been trained on and are served using TPUs. To deliver the next frontier of models and enable you to do the same, we’re excited to announce Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date.

Trillium TPUs achieve an impressive 4.7X increase in peak compute performance per chip compared to TPU v5e. We doubled the High Bandwidth Memory (HBM) capacity and bandwidth, and also doubled the Interchip Interconnect (ICI) bandwidth over TPU v5e. Additionally, Trillium is equipped with third-generation SparseCore, a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads. Trillium TPUs make it possible to train the next wave of foundation models faster and serve those models with reduced latency and lower cost. Critically, our sixth-generation TPUs are also our most sustainable: Trillium TPUs are over 67% more energy-efficient than TPU v5e.

Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency pod. Beyond this pod-level scalability, with multislice technology and Titanium Intelligence Processing Units (IPUs), Trillium TPUs can scale to hundreds of pods, connecting tens of thousands of chips in a building-scale supercomputer interconnected by a multi-petabit-per-second datacenter network.
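To make the scaling arithmetic concrete, here is a minimal sketch of how pod counts translate into chip counts. The only figure taken from this post is the 256-chip pod size; the function names and the example pod counts are illustrative assumptions, and the sketch assumes every pod is fully populated.

```python
CHIPS_PER_POD = 256  # pod size stated in this post


def total_chips(num_pods: int, chips_per_pod: int = CHIPS_PER_POD) -> int:
    """Chips in a multi-pod deployment, assuming every pod is fully populated."""
    if num_pods < 0 or chips_per_pod <= 0:
        raise ValueError("num_pods must be >= 0 and chips_per_pod must be > 0")
    return num_pods * chips_per_pod


def pods_needed(target_chips: int, chips_per_pod: int = CHIPS_PER_POD) -> int:
    """Smallest number of full pods that provides at least target_chips."""
    if target_chips < 0 or chips_per_pod <= 0:
        raise ValueError("target_chips must be >= 0 and chips_per_pod must be > 0")
    return -(-target_chips // chips_per_pod)  # ceiling division on integers


if __name__ == "__main__":
    # Hypothetical pod counts, chosen only to illustrate "hundreds of pods".
    for pods in (1, 100, 200):
        print(f"{pods:>3} pods -> {total_chips(pods):,} chips")
```

Under these assumptions, a hundred pods already corresponds to more than twenty-five thousand chips, which is consistent with the "tens of thousands of chips" figure above.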

The next phase of AI innovation with Trillium

More than a decade ago, Google recognized the need for a first-of-its-kind chip for machine learning. In 2013, we began work on the world’s first purpose-built AI accelerator, TPU v1, followed by the first Cloud TPU in 2017. Without TPUs, many of Google’s most popular services — such as real-time voice search, photo object recognition, and interactive language translation, along with the state-of-the-art foundation models such as Gemini, Imagen, and Gemma — would not be possible. In fact, the scale and efficiency of TPUs enabled foundational work on Transformers in Google Research, the algorithmic underpinnings of modern generative AI.

4.7X increase in compute performance per Trillium chip

TPUs were designed from the ground up for neural networks, and we’re always working to improve training and serving times for AI workloads. Trillium achieves 4.7X peak compute per chip compared to TPU v5e. To achieve this level of performance, we’ve expanded the size of matrix multiply units (MXUs) and increased the clock speed. Additionally, SparseCores accelerate embedding-heavy workloads by strategically offloading random and fine-grained access from TensorCores.
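How much of the 4.7X peak-compute gain a given workload sees depends on whether it is limited by compute or by memory traffic. The sketch below uses a simplified roofline model to illustrate this, combining the 4.7X compute figure with the doubled HBM bandwidth this post describes. It is an idealized estimate, not a benchmark: all quantities are normalized to a hypothetical baseline chip, the function names are illustrative, and real speedups depend on many factors the model ignores (interconnect, compiler behavior, batch size, and so on).

```python
def step_time(flops: float, bytes_moved: float,
              peak_flops: float, mem_bw: float) -> float:
    """Idealized roofline time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bw)


def speedup(intensity: float,
            compute_gain: float = 4.7,   # peak compute ratio from this post
            bw_gain: float = 2.0) -> float:  # HBM bandwidth ratio from this post
    """Estimated speedup for a kernel with the given arithmetic intensity.

    `intensity` is FLOPs per byte, expressed relative to the baseline chip's
    ridge point (baseline peak = 1, baseline bandwidth = 1), so values below
    1 are memory-bound on the baseline and values above 1 are compute-bound.
    """
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    flops, bytes_moved = intensity, 1.0
    old = step_time(flops, bytes_moved, 1.0, 1.0)
    new = step_time(flops, bytes_moved, compute_gain, bw_gain)
    return old / new


if __name__ == "__main__":
    for intensity in (0.5, 1.5, 10.0):
        print(f"relative intensity {intensity:>4}: ~{speedup(intensity):.1f}x")
```

In this model, a memory-bound kernel gains only the 2X bandwidth improvement, a strongly compute-bound kernel approaches the full 4.7X, and kernels in between land somewhere in that range.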

2X ICI and High Bandwidth Memory (HBM) capacity and bandwidth

Doubling the HBM capacity and bandwidth allows Trillium to work with larger models with more weights and larger key-value caches. Next-generation HBM enables higher memory bandwidth, improved power efficiency, and a flexible channel architecture to increase memory throughput. This improves training time and serving latency for large models. That’s twice the model weights and key-value caches, accessed faster and with more compute capacity for accelerating ML workloads. Doubling the ICI bandwidth enables training and inference jobs to scale to tens of thousands of chips powered by a strategic combination of custom optical ICI interconnects with 256 chips in a pod and Google Jupiter Networking that extends scalability to hundreds of pods in a cluster.
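To see why HBM capacity matters for key-value caches, here is a rough sizing sketch. Every number in it is an assumption made for illustration: the model shape (32 layers, 8 key-value heads, head dimension 128, 2-byte values), the 8 GiB of resident weights, and the 16 GiB and 32 GiB capacities are hypothetical placeholders, not Trillium specifications. The estimate also ignores activations and other buffers that share HBM.

```python
GIB = 2**30


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes for cached keys and values (the leading factor of 2)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def max_cached_tokens(hbm_bytes: int, weight_bytes: int, num_layers: int,
                      num_kv_heads: int, head_dim: int,
                      bytes_per_value: int = 2) -> int:
    """Total tokens (batch * sequence length) that fit once weights are resident."""
    per_token = kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                               seq_len=1, batch_size=1,
                               bytes_per_value=bytes_per_value)
    free = max(hbm_bytes - weight_bytes, 0)
    return free // per_token


if __name__ == "__main__":
    model = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical
    weights = 8 * GIB                                           # hypothetical
    for capacity in (16 * GIB, 32 * GIB):                       # hypothetical
        tokens = max_cached_tokens(capacity, weights, **model)
        print(f"{capacity // GIB} GiB HBM -> {tokens:,} cached tokens")
```

With these placeholder numbers, doubling capacity more than doubles the token budget, because the weights occupy a fixed share of memory and all of the added capacity goes to the cache.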

Trillium will power the next generation of AI models

Trillium TPUs will power the next wave of AI models and agents, and we're looking forward to bringing these advanced capabilities to our customers. For example, Essential AI, whose mission is to deepen the partnership between humans and computers, is looking forward to using Trillium to reinvent how businesses operate. Nuro is dedicated to creating a better everyday life through robotics by training their models with Cloud TPUs; Deep Genomics is powering the future of drug discovery with AI and looking forward to how their next foundational model, powered by Trillium, will change the lives of patients; and Deloitte, Google Cloud Partner of the Year for AI, will offer Trillium to transform businesses with generative AI. Support for training and serving of long-context, multimodal models on Trillium TPUs will also enable Google DeepMind to train and serve the future generations of Gemini models faster, more efficiently, and with lower latency than ever before.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_jeff_dean.max-2100x2100.png

https://storage.googleapis.com/gweb-cloudblog-publish/images/Blog-post-5.max-2100x2100.png

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Andrew_Clare.max-2100x2100.png

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Brendan_Frey.max-2100x2100.png

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Matt_Lacey.max-2100x2100.png

Trillium and AI Hypercomputer

Trillium TPUs are a part of Google Cloud's AI Hypercomputer, a groundbreaking supercomputing architecture designed specifically for cutting-edge AI workloads. It integrates performance-optimized infrastructure (including Trillium TPUs), open-source software frameworks, and flexible consumption models. Our commitment to open-source libraries like JAX, PyTorch/XLA, and Keras 3 empowers developers. Support for JAX and XLA means that declarative model descriptions written for any previous generation of TPUs map directly to the new hardware and network capabilities of Trillium TPUs. We've also partnered with Hugging Face on Optimum-TPU for streamlined model training and serving.

“Our partnership with Google Cloud makes it easier for Hugging Face users to fine-tune and run open models on Google Cloud’s AI infrastructure, including TPUs. We are excited to further accelerate open source AI with the upcoming sixth-generation Trillium TPUs, and we expect open models to continue to deliver optimal performance thanks to the 4.7X increase in performance per chip compared to the previous generation. We will make the performance of Trillium easily available to all AI builders through our new Optimum-TPU library!” - Jeff Boudier, Head of Product, Hugging Face

SADA (An Insight Company) has been Partner of the Year each year since 2017 and delivers Google Cloud Services for maximum impact.

“As a proud Google Cloud Premier Partner, SADA has a 20-year long history with the world’s established AI pioneer. We are rapidly integrating AI for thousands of diverse customers. With our depth of experience and the AI Hypercomputer architecture, we can't wait to help our customers unlock the value of this next frontier of generative AI models with Trillium.” - Miles Ward, CTO, SADA

AI Hypercomputer also offers the flexible consumption models required for AI/ML workloads. Dynamic Workload Scheduler (DWS) makes it easier to access AI/ML resources and helps customers optimize their spend. Flex start mode can improve the experience of bursty workloads such as training, fine-tuning, or batch jobs, by scheduling all the accelerators needed simultaneously, regardless of your entry point: Vertex AI Training, Google Kubernetes Engine (GKE) or Google Cloud Compute Engine.

Lightricks is excited about the value it expects to gain from the increase in performance, coupled with the efficiency gains from AI Hypercomputer.

“We’ve been using TPUs for our text-to-image and text-to-video models since Cloud TPU v4. With TPU v5p and AI Hypercomputer efficiencies, we achieved a whopping 2.5X increase in training speed! The 6th generation of Trillium TPUs are incredible with a 4.7X increased compute performance per chip and 2X HBM Capacity and Bandwidth improvement over the previous generation. This came just in time for us as we scale our text-to-video models. We’re also looking forward to using Dynamic Workload Scheduler’s flex start mode to manage our batch inference jobs and to manage our future TPU reservations.” - Yoav HaCohen, PhD, Core Generative AI Research Team Lead, Lightricks

Learn more about Google Cloud Trillium TPUs

Google Cloud TPUs are the cutting edge of AI acceleration, custom-designed and optimized to empower large-scale artificial intelligence models. Exclusively available through Google Cloud, TPUs deliver unparalleled performance and cost-efficiency for training and serving AI solutions. Whether it's the complex intricacies of large language models or the creative potential of image generation, TPUs help enable developers and researchers to push the boundaries of what's possible in the world of artificial intelligence.

The sixth-generation Trillium TPUs are a culmination of over a decade of research and innovation and will be available later this year. To learn more about Trillium TPUs and AI Hypercomputer, please complete this form and our sales team will be in touch.
