This document describes the architecture for the hardware and software components of the Cloud TPU system.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning (ML) workloads. TPUs are designed from the ground up with the benefit of Google's deep experience and leadership in machine learning.
A single TPU board contains four TPU chips. Each chip contains two TPU cores. The
resources available on a TPU core vary by version. Each TPU core has scalar,
vector, and matrix multiplication units (MXUs). The MXUs provide the bulk of the
compute power in a TPU chip. Each MXU is capable of performing 16K
multiply-accumulate operations in each cycle at reduced bfloat16 precision.
bfloat16 is a 16-bit
floating point representation that provides better training and model accuracy
than the IEEE half-precision representation. Each of the cores on a TPU board can
execute user computations independently. High-bandwidth interconnects allow the
chips to communicate directly with each other. More detailed technical
information can be found in A Domain Specific Supercomputer for Training Deep Neural Networks.
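The advantage bfloat16 gives over IEEE half precision comes from its exponent width: bfloat16 keeps float32's 8 exponent bits and trades away mantissa bits, so it covers the same dynamic range as float32. The helpers below are a minimal sketch in pure Python, assuming simple bit truncation rather than the hardware's actual rounding mode:

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Convert a float32 to bfloat16 by keeping its top 16 bits.
    (Real hardware typically rounds to nearest even; truncation is a
    simplification for illustration.)"""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# bfloat16 keeps float32's 8 exponent bits, so large magnitudes survive
# with only mantissa precision lost:
big = 1e30
roundtrip = bfloat16_bits_to_float32(float32_to_bfloat16_bits(big))
print(roundtrip)  # within ~1% of 1e30

# IEEE half precision has only 5 exponent bits and cannot hold 1e30:
try:
    struct.pack(">e", big)
except OverflowError:
    print("float16 cannot represent 1e30")
```

The lost mantissa bits cost per-value precision, which deep learning training tolerates well; the preserved exponent range is what avoids the overflow and underflow problems that make float16 training fragile.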
TPUs were designed to be scaled out to a TPU Pod. A TPU Pod is a supercomputer that can have up to 2048 TPU cores, allowing you to distribute the processing load across multiple TPU boards. In a TPU Pod configuration, dedicated high-speed network interfaces connect multiple TPU devices together to provide a larger number of TPU cores and a larger pool of TPU memory for your machine learning workloads.
Each TPU board is connected to a high-performance CPU-based host machine for tasks such as loading and preprocessing data to feed to the TPUs.
Cloud TPU is a service that gives you access to TPUs through Google Cloud Platform (GCP). You can use Cloud TPU to run machine learning workloads on Google's TPU accelerator hardware. Cloud TPU is designed for maximum performance and flexibility to help researchers, developers, and businesses train ML workloads.
You can use the Cloud TPU API to automate TPU management for your TPUs. As a result, it is easy to scale up to massive compute clusters, run your workloads, and scale those clusters back down when your workloads are complete. The hardware support built into the chips results in effectively linear performance scaling across a broad range of deep learning workloads.
Cloud TPU provides access to different TPU configurations:
- A TPU Pod
- A TPU Slice
- A TPU board
In a TPU Pod, all of the TPU chips are connected directly to each other over a high-speed interconnect that bypasses the communication delays of going through the CPU hosts. The chips are connected in a 2-D torus, with each chip communicating directly with four neighbors. This architecture leads to especially high performance for common communication patterns in ML workloads like all-reduce. Architectural details and performance characteristics of TPU v2 and v3 are shown in the following table, with more information published in A Domain Specific Supercomputer for Training Deep Neural Networks.
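The 2-D torus wiring can be sketched as a neighbor function: because row and column indices wrap around, every chip, including those on the "edges" of the grid, has exactly four direct links. The 4x4 grid below is illustrative only, not a real slice shape:

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four direct neighbors of a chip in a 2-D torus.
    Indices wrap around at the edges, so every chip has exactly
    four links regardless of its position in the grid."""
    return [
        ((row - 1) % rows, col),  # up
        ((row + 1) % rows, col),  # down
        (row, (col - 1) % cols),  # left
        (row, (col + 1) % cols),  # right
    ]

# An illustrative 4x4 grid: even the corner chip (0, 0) has four
# neighbors thanks to the wrap-around links.
print(torus_neighbors(0, 0, rows=4, cols=4))
```

The wrap-around links are what make the torus efficient for all-reduce: data can circulate around each ring of chips without ever leaving the dedicated interconnect.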
| Feature | TPU v2 | TPU v3 |
|---|---|---|
| Network links x Gbits/s/Chip | 4 x 496 | 4 x 656 |
| Bisection bandwidth Terabits/full pod | 15.9 | 42 |
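As a quick worked check of the per-chip figures, the aggregate inter-chip bandwidth is simply the link count multiplied by the per-link rate:

```python
# Per-chip link figures from the table above: 4 links per chip at
# 496 Gbit/s (v2) and 656 Gbit/s (v3) each.
v2_links, v2_gbits_per_link = 4, 496
v3_links, v3_gbits_per_link = 4, 656

v2_total = v2_links * v2_gbits_per_link  # aggregate Gbit/s per v2 chip
v3_total = v3_links * v3_gbits_per_link  # aggregate Gbit/s per v3 chip
print(v2_total, v3_total)
```

So each v3 chip has roughly a third more raw inter-chip bandwidth than a v2 chip, while the full-pod bisection bandwidth grows by a larger factor because a v3 pod also contains more chips.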
A TPU Slice is a portion of a TPU Pod. If you do not need the resources of an entire Pod, you can use a Slice instead. A number of different Slice configurations are available; for more information, see the Cloud TPU Pricing section in Cloud TPU.
Single TPU Board
A single-board TPU configuration is a stand-alone board with 4 TPU chips (8 TPU cores) with no network connections to other TPU boards. Single board TPUs are not part of a TPU Pod configuration and do not occupy a portion of a TPU Pod. Read the TPU types page to see what single board TPU configurations are available.
The TPU version defines the architecture for each TPU core, the amount of high-bandwidth memory (HBM) for each TPU core, the interconnects between the cores on each TPU board, and the networking interfaces available for inter-device communication. The available TPU versions are v2 and v3.
A TPU v2 board contains four TPU chips each with two cores. There is 8 GiB of HBM for each TPU core and one MXU per core. A TPU v2 Pod has up to 512 TPU cores and 4 TiB of memory.
A TPU v3 board contains four TPU chips each with two cores. There is 16 GiB of HBM for each TPU core and two MXUs for each core. A TPU v3 Pod has up to 2048 TPU cores and 32 TiB of memory.
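The pod-level memory figures follow directly from the per-core numbers. A small consistency check, using only the values stated above:

```python
# Per-core HBM and pod scale, from the version descriptions above.
v2 = {"cores_per_pod": 512, "hbm_gib_per_core": 8, "mxus_per_core": 1}
v3 = {"cores_per_pod": 2048, "hbm_gib_per_core": 16, "mxus_per_core": 2}

def pod_memory_tib(version):
    """Total pod HBM in TiB: cores x per-core GiB, converted GiB -> TiB."""
    return version["cores_per_pod"] * version["hbm_gib_per_core"] / 1024

print(pod_memory_tib(v2))  # 4.0 TiB, matching the v2 Pod description
print(pod_memory_tib(v3))  # 32.0 TiB, matching the v3 Pod description
```

Note that v3 scales on three axes at once: four times the cores per pod, twice the HBM per core, and twice the MXUs per core.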
Performance benefits of TPU v3 over v2
The increased FLOPS per core and memory capacity in TPU v3 configurations can improve the performance of your models in the following ways:
- TPU v3 configurations provide significant performance benefits per core for compute-bound models. Memory-bound models on TPU v2 configurations might not achieve this same performance improvement if they are also memory-bound on TPU v3 configurations.
- In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can provide improved performance and reduced recomputation of intermediate values (re-materialization).
- TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2 configurations. For example, TPU v3 might allow deeper ResNets and larger images with RetinaNet.
- Models that are nearly input-bound ("infeed") on TPU v2 because training steps are waiting for input might also be input-bound with Cloud TPU v3. The pipeline performance guide can help you resolve infeed issues.
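The infeed point deserves a back-of-envelope check: faster compute does not help once the accelerator is waiting on data. The heuristic and the step-time numbers below are hypothetical, purely to illustrate how a model that is compute-bound on v2 can become input-bound on v3:

```python
def is_input_bound(infeed_ms_per_step, compute_ms_per_step, slack=0.05):
    """A step is effectively input-bound when the input pipeline takes
    longer than the on-device compute (plus some slack). This threshold
    is a hypothetical heuristic, not an official Cloud TPU metric."""
    return infeed_ms_per_step > compute_ms_per_step * (1 + slack)

# Hypothetical profile: v3 halves the compute time, but the input
# pipeline is unchanged, so the same model becomes input-bound.
print(is_input_bound(infeed_ms_per_step=40, compute_ms_per_step=60))  # v2-like
print(is_input_bound(infeed_ms_per_step=40, compute_ms_per_step=30))  # v3-like
```

In that situation, the gains from v3 only materialize after the input pipeline itself is sped up, which is what the pipeline performance guide addresses.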
Cloud TPU VM Architecture
Each TPU board is physically connected to a host machine (TPU Host).
In a TPU Pod, there is a TPU host for each TPU board.
How you interact with the TPU host (and the TPU board) depends upon the TPU VM architecture you are using: TPU Nodes or TPU VMs.
TPU Nodes are the original TPU experience. They require an extra user VM that communicates with the TPU host over gRPC; there is no direct access to the TPU host.
When you use TPU VMs, you SSH directly to a Google Compute Engine VM running on the TPU host. You get root access to the machine so you can run any code you wish. You get access to debug logs and error messages directly from the TPU compiler and runtime. TPU VMs support new use cases that aren't possible with TPU Nodes. For example, you can execute custom ops in the input pipeline and you can use local storage.
Because there is no user VM, there is no need for a network, Virtual Private Cloud, or firewall between your code and the TPU VM, which improves the performance of your input pipeline. Additionally, TPU VMs are cheaper because you don't have to pay for user VMs.
Frameworks like JAX, PyTorch, and TensorFlow access TPUs through a shared library,
libtpu, that is present on every TPU VM. This library includes the XLA
compiler used to compile TPU programs, the TPU runtime used to run compiled
programs, and the TPU driver used by the runtime for low-level access to the TPU.
With TPU VMs, your Python code runs directly on the TPU host instead of on a user VM.
For more information on TensorFlow and Cloud TPU, see Running TensorFlow Models on Cloud TPU.
The Cloud TPU Node system architecture was originally built for TensorFlow and its distributed programming model. The TPU hosts are inaccessible to the user and run only a headless copy of a TensorFlow server. They don't run Python or any user code not represented as a TensorFlow graph. User code runs in a separate, remote VM that communicates with the TPU hosts over the network.
With TPU VMs your PyTorch code runs directly on the TPU hosts.
For more information on PyTorch and Cloud TPU, see Running PyTorch Models on Cloud TPU.
PyTorch runs on the Cloud TPU node architecture using a library called XRT, which allows sending XLA graphs and runtime instructions over TensorFlow gRPC connections and executing them on the TensorFlow servers. A user VM is required for each TPU Host.
With TPU VMs there is no need for user VMs because you can run your code directly on the TPU hosts.
For more information on running JAX on Cloud TPU, see JAX quickstart.
JAX on Cloud TPU Nodes runs similarly to PyTorch: a separate user VM is required for each TPU host.
- Read Cloud Tensor Processing Units (TPUs) to compare Cloud TPU to other processors.