System Architecture

Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) designed by Google to accelerate machine learning workloads. Cloud TPU is a Google Cloud service that makes TPUs available as a scalable resource.

TPUs are designed to perform matrix operations quickly, making them ideal for machine learning workloads. You can run machine learning workloads on TPUs using frameworks such as TensorFlow, PyTorch, and JAX.

Cloud TPU terms

If you are new to Cloud TPUs, check out the TPU documentation home. The following sections explain terms and related concepts used in this document.

Batch inference

Batch or offline inference refers to running inference outside of production pipelines, typically on a large batch of inputs. Batch inference is used for offline tasks such as data labeling and for evaluating the trained model. Latency SLOs are not a priority for batch inference.

Inference

Inference is the process of using a trained model to make predictions on new data. It is used by the serving process.

Queued resource

A representation of TPU resources, used to enqueue and manage a request for a single-slice or multi-slice TPU environment. See the Queued Resources user guide for more information.

Serving

Serving is the process of deploying a trained machine learning model to a production environment where it can be used to make predictions or decisions. Latency and service-level availability are important for serving.

Single host and multi host

A TPU host is a VM that runs on a physical computer connected to TPU hardware. TPU workloads can use one or more hosts.

A single-host workload is limited to one TPU VM and can access 1, 4, or 8 TPU chips. A multi-host TPU v5e workload can access 8, 12, 16, 32, 64, 128, or 256 TPU chips with one TPU VM for every four TPU chips. Multi-host workloads distribute training across multiple TPU VMs.
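The host count for a multi-host v5e workload follows from the one-VM-per-four-chips rule above. The following is a minimal sketch of that arithmetic. The function and constant names are illustrative and are not part of any Cloud TPU API.

```python
# Chip counts quoted in the text above for multi-host TPU v5e workloads.
MULTI_HOST_CHIPS = {8, 12, 16, 32, 64, 128, 256}
CHIPS_PER_MULTI_HOST_VM = 4  # "one TPU VM for every four TPU chips"


def v5e_multi_host_vm_count(chips):
    """Return the number of TPU VMs behind a multi-host v5e workload."""
    if chips not in MULTI_HOST_CHIPS:
        raise ValueError(f"{chips} is not a multi-host v5e chip count")
    return chips // CHIPS_PER_MULTI_HOST_VM


for chips in sorted(MULTI_HOST_CHIPS):
    print(chips, "chips ->", v5e_multi_host_vm_count(chips), "TPU VMs")
```

For example, a 256-chip workload is spread across 64 TPU VMs under this rule.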

TPU v5e supports single-host and multi-host training and single-host inference. Multi-host inference is supported using Sax. For more information, see Large Language Model Serving.

Slices

A Pod slice is a collection of chips, all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI).

v5e slices are described with 2D slice shapes. Each number in the slice shape corresponds to the number of v5e chips in one dimension. For example, 4x2 describes an arrangement of 8 v5e chips in a 4 x 2 grid.

TPU v4 slices can be described in terms of v4 chips. v4 slices are described with 3D shapes. Each number in the slice shape corresponds to the number of v4 chips in a dimension. For example, 4x4x8 describes an arrangement of 128 v4 chips in a 4 x 4 x 8 cube.
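The chip count of a slice is the product of the numbers in its shape. A minimal sketch, using a hypothetical helper named chips_in_slice that works for both the 2D and 3D examples above:

```python
from math import prod


def chips_in_slice(shape):
    """Multiply the per-dimension chip counts in a shape such as '4x2'."""
    dims = [int(d) for d in shape.lower().split("x")]
    if any(d <= 0 for d in dims):
        raise ValueError(f"invalid slice shape: {shape!r}")
    return prod(dims)


print(chips_in_slice("4x2"))    # 2D v5e example from the text -> 8
print(chips_in_slice("4x4x8"))  # 3D v4 example from the text -> 128
```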

v4 slices can also be described in terms of the number of TensorCores in the slice. For example, v4-128 describes a v4 slice with 128 TensorCores. v4 slices are available with 16, 32, 64, 128, 256, 512, 1024, 2048, or 4096 TensorCores.
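Because the number in a name such as v4-128 counts TensorCores rather than chips, converting to chips means dividing by the TensorCores per chip. The sketch below assumes the two-TensorCores-per-chip figure given in the TensorCores section of this document. The helper name is hypothetical.

```python
# TensorCores per chip for the versions whose slices are named by TensorCore
# count, as stated in the TensorCores section of this document.
TENSORCORES_PER_CHIP = {"v2": 2, "v3": 2, "v4": 2}


def chips_from_tensorcore_name(name):
    """Convert a TensorCore-based slice name such as 'v4-128' to a chip count."""
    version, _, cores = name.partition("-")
    per_chip = TENSORCORES_PER_CHIP[version]
    tensorcores = int(cores)
    if tensorcores % per_chip:
        raise ValueError(f"{name!r} is not a whole number of chips")
    return tensorcores // per_chip


print(chips_from_tensorcore_name("v4-128"))  # 128 TensorCores / 2 per chip
```

Under this assumption, v4-128 corresponds to 64 v4 chips.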

TPU v3 slices are described in terms of TensorCores and are available with 32, 128, 512, 1024, or 2048 TensorCores. TPU v2 slices are also described in terms of TensorCores and are available with 32, 128, 256, or 512 TensorCores.

Chip shape and chip topology also refer to slice shapes.

See the table in the Accelerator Types section for a list of supported slice shapes for v5e.

TPU Pod

A TPU Pod is a contiguous set of TPUs grouped together over a specialized network. The number of TPU chips in a TPU Pod is dependent on the TPU version.

TPU VM

A virtual machine running Linux that has access to the underlying TPUs. For v5e TPUs, each TPU VM has direct access to 1, 4, or 8 chips depending on the user-specified accelerator type. For v4 and earlier, each TPU VM has access to 4 TPU chips. A TPU VM is also known as a worker.

TPU chip

A TPU chip contains one or more TensorCores. The number of TensorCores depends on the version of the TPU chip. Each TensorCore consists of one or more matrix-multiply units (MXUs), a vector unit, and a scalar unit.

An MXU is composed of 128 x 128 multiply-accumulators in a systolic array. MXUs provide the bulk of the compute power in a TensorCore. Each MXU is capable of performing 16K multiply-accumulate operations per cycle. All multiplies take bfloat16 inputs, but all accumulations are performed in FP32 number format.
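To make the numeric behavior concrete, the following sketch approximates in plain Python what a multiply-accumulate with bfloat16 inputs and FP32 accumulation does to the values involved. It is an illustration only, not a description of the hardware: the round-to-nearest-even conversion, the helper names, and the handling of partial sums are assumptions, and NaN and infinity are not handled.

```python
import struct

MXU_ROWS = MXU_COLS = 128
MACS_PER_CYCLE = MXU_ROWS * MXU_COLS  # 16,384: the "16K" in the text


def to_float32(x):
    """Round a Python float (a double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x):
    """Round x to the nearest bfloat16 value, ties to even (assumed rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the FP32 encoding.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mac_dot(a, b):
    """Dot product with bfloat16 multiplies and FP32-rounded partial sums."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_bfloat16(x) * to_bfloat16(y)
        acc = to_float32(acc + product)
    return acc


print(MACS_PER_CYCLE)
print(to_bfloat16(1.00390625))  # 1 + 2**-8 is not representable in bfloat16
print(mac_dot([1.0, 2.0], [3.0, 4.0]))
```

The inputs lose precision when rounded to bfloat16, while the running sum keeps FP32 precision, which is why accumulating many products does not compound the input rounding error.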

The vector unit is used for general computation such as activations and softmax. The scalar unit is used for control flow, calculating memory addresses, and other maintenance operations.

TensorCores

TPU chips have one or two TensorCores to run matrix multiplication. TPU v5e has one TensorCore per chip. TPU v2, v3, v4, and v5p have two TensorCores per chip. For more information about TensorCores, see this ACM article.

Worker

See TPU VM.

TPU versions

The exact layout of a TPU depends on the TPU version that you use. Architectural details and performance characteristics of TPU v2 and v3 are available in A Domain Specific Supercomputer for Training Deep Neural Networks.

TPU v5e

Each v5e chip contains one TensorCore. Each TensorCore has four matrix-multiply units (MXUs), a vector unit, and a scalar unit.

The following diagram illustrates a TPU v5e chip.

[Figure: v5e Pod chip]

The following table shows the key chip specifications and their values for v5e.

Key chip specifications         v5e values
Peak compute per chip (bf16)    197 TFLOPs
Peak compute per chip (Int8)    393 TFLOPs
HBM2 capacity and bandwidth     16 GB, 819 GBps
Interchip Interconnect BW       1600 Gbps

The following table shows Pod specifications and their values for v5e.

Key Pod specifications                   v5e values
TPU Pod size                             256 chips
Interconnect topology                    2D torus
Peak compute per Pod                     100 PetaOps (Int8)
All-reduce bandwidth per Pod             51.2 TB/s
Bisection bandwidth per Pod              1.6 TB/s
Data center network bandwidth per Pod    6.4 Tbps
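The Pod-level compute figure is consistent with multiplying the per-chip values by the Pod size. The sketch below reproduces that arithmetic from the numbers in the two v5e tables. The aggregate HBM and bf16 totals are derived here for illustration and are not quoted specifications.

```python
# Values copied from the v5e chip and Pod tables above.
POD_CHIPS = 256
PEAK_INT8_TERA_OPS_PER_CHIP = 393
PEAK_BF16_TFLOPS_PER_CHIP = 197
HBM_GB_PER_CHIP = 16

# 1 peta = 1000 tera.
pod_int8_peta_ops = POD_CHIPS * PEAK_INT8_TERA_OPS_PER_CHIP / 1000
pod_bf16_petaflops = POD_CHIPS * PEAK_BF16_TFLOPS_PER_CHIP / 1000
pod_hbm_gb = POD_CHIPS * HBM_GB_PER_CHIP

print(f"Int8 peak per Pod: {pod_int8_peta_ops:.1f} PetaOps")
print(f"bf16 peak per Pod: {pod_bf16_petaflops:.1f} petaflops (derived)")
print(f"HBM per Pod:       {pod_hbm_gb} GB (derived)")
```

The Int8 result, about 100.6 PetaOps, rounds to the 100 PetaOps listed in the Pod table.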

TPU v4

Each TPU v4 chip contains two TensorCores. Each TensorCore has four MXUs, a vector unit, and a scalar unit. The following table shows the key specifications for a v4 TPU Pod.

Key specifications               v4 Pod values
Peak compute per chip            275 teraflops (bf16 or int8)
HBM2 capacity and bandwidth      32 GiB, 1200 GBps
Measured min/mean/max power      90/170/192 W
TPU Pod size                     4096 chips
Interconnect topology            3D mesh
Peak compute per Pod             1.1 exaflops (bf16 or int8)
All-reduce bandwidth per Pod     1.1 PB/s
Bisection bandwidth per Pod      24 TB/s

The following diagram illustrates a TPU v4 chip.

[Figure: TPU v4 chip]

3D mesh and 3D torus

v4 TPUs have a direct connection to the nearest neighboring chips in 3 dimensions, resulting in a 3D mesh of networking connections. When the slice is equal to or larger than a single cube, the connections can be configured as a 3D torus. In general, the performance of a 3D torus configuration will be better than a 3D mesh configuration.
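One way to see why a torus tends to perform better is to compare worst-case hop counts between chips. The following sketch does this by brute force for the 4x4x8 shape used earlier. It assumes hop count is a reasonable proxy for communication cost and ignores routing and link bandwidth. The function names are illustrative.

```python
from itertools import product


def hops(a, b, shape, wraparound):
    """Count link hops between chips a and b, moving one axis at a time."""
    total = 0
    for x, y, n in zip(a, b, shape):
        d = abs(x - y)
        # A torus adds wraparound links, so the shorter way around is usable.
        total += min(d, n - d) if wraparound else d
    return total


def diameter(shape, wraparound):
    """Largest hop count between any pair of chips in the slice."""
    chips = list(product(*(range(n) for n in shape)))
    return max(hops(a, b, shape, wraparound) for a in chips for b in chips)


shape = (4, 4, 8)
print("3D mesh diameter: ", diameter(shape, wraparound=False))
print("3D torus diameter:", diameter(shape, wraparound=True))
```

For this shape, the worst-case path drops from 13 hops in the mesh to 8 hops in the torus.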

TPU v3

Each v3 TPU chip contains two TensorCores. Each TensorCore has two MXUs, a vector unit, and a scalar unit. The following table shows the key specifications and their values for a v3 TPU Pod.

Key specifications               v3 Pod values
Peak compute per chip            123 teraflops (bf16)
HBM2 capacity and bandwidth      32 GiB, 900 GBps
Measured min/mean/max power      123/220/262 W
TPU Pod size                     1024 chips
Interconnect topology            2D torus
Peak compute per Pod             126 petaflops (bf16)
All-reduce bandwidth per Pod     340 TB/s
Bisection bandwidth per Pod      6.4 TB/s

The following diagram illustrates a TPU v3 chip.

[Figure: TPU v3 chip]

Performance benefits of TPU v4 over v3

This section describes the performance benefits of TPU v4 over TPU v3.

Memory System:

Non-Uniform Memory Access (NUMA) is a computer memory architecture for machines that have multiple CPUs. Each CPU has direct access to a block of high-speed memory. A CPU and its memory are called a NUMA node. Each NUMA node is connected to the NUMA nodes directly adjacent to it. A CPU from one NUMA node can access memory in another NUMA node, but this access is slower than accessing memory within its own NUMA node.

Software running on a multi-CPU machine can place data needed by a CPU within its NUMA node, increasing memory throughput. For more information about NUMA, see Non Uniform Memory Access on Wikipedia.

You can take advantage of NUMA-locality benefits by binding your training script to NUMA Node 0.

To enable NUMA node binding:

  1. Install the numactl command line tool.

     $ sudo apt-get update
     $ sudo apt-get install numactl
    

  2. Use numactl --cpunodebind=0 when launching your training script. This binds your script code to NUMA Node 0.

     $ numactl --cpunodebind=0 python3 your-training-script
    

Enable NUMA node binding if:

  • Your workload depends heavily on CPU work (for example, image classification or recommendation workloads), regardless of framework.
  • You are using a TPU runtime version without a -pod suffix (for example, tpu-vm-tf-2.10.0-v4).
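If you launch training from a Python wrapper rather than directly from a shell, the numactl invocation shown above can be assembled as an argument list. This is a minimal sketch: the helper name numa_bound_command is hypothetical, and the script name is the same placeholder used in the shell example.

```python
import shlex
import shutil


def numa_bound_command(script, node=0):
    """Build the argv list for running a training script bound to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", "python3", script]


cmd = numa_bound_command("your-training-script")
print(shlex.join(cmd))

if shutil.which("numactl") is None:
    print("numactl was not found on PATH; install it before launching.")
# Otherwise, subprocess.run(cmd, check=True) would launch the bound script.
```

Building the command as a list avoids shell quoting problems when the script path contains spaces.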

Other memory system differences:

  • v4 TPU chips have a unified 32-GiB HBM memory space across the entire chip, enabling better coordination between the two on-chip TensorCores.
  • Improved HBM performance using latest memory standards and speeds.
  • Improved DMA performance profile with built-in support for high-performance striding at 512B granularities.

TensorCores:

  • Twice the number of MXUs and a higher clock rate delivering 275 max TFLOPS.
  • 2x transposition and permutation bandwidth.
  • Load-store memory access model for Common Memory (Cmem).
  • Faster MXU weight loading bandwidth and 8-bit mode support to allow lower batch sizes and improved inference latency.
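The first item in this list can be checked against the specification tables with some rough arithmetic. The sketch below estimates the clock rate implied by each chip's peak compute. It assumes that the peak figure comes entirely from the MXUs and that one multiply-accumulate counts as two floating-point operations. The results are estimates derived from numbers in this document, not published clock rates.

```python
MACS_PER_MXU_PER_CYCLE = 128 * 128  # from the MXU description in this document
FLOPS_PER_MAC = 2                   # assumption: one multiply plus one add


def implied_clock_ghz(peak_tflops, tensorcores, mxus_per_core):
    """Estimate the clock rate needed to reach a chip's quoted peak compute."""
    flops_per_cycle = (
        tensorcores * mxus_per_core * MACS_PER_MXU_PER_CYCLE * FLOPS_PER_MAC
    )
    return peak_tflops * 1e12 / flops_per_cycle / 1e9


# TensorCore and MXU counts and peak compute are taken from the v3 and v4 sections.
v3_ghz = implied_clock_ghz(123, tensorcores=2, mxus_per_core=2)
v4_ghz = implied_clock_ghz(275, tensorcores=2, mxus_per_core=4)

print(f"v3 implied clock: {v3_ghz:.2f} GHz")
print(f"v4 implied clock: {v4_ghz:.2f} GHz")
print(f"clock ratio v4/v3: {v4_ghz / v3_ghz:.2f}")
```

Under these assumptions, doubling the MXUs accounts for a 2x gain, and the remaining improvement corresponds to a clock rate roughly 12% higher.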

Inter-chip Interconnect:

Six interconnect links per chip to enable network topologies that have smaller network diameters.
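Six links per chip is the number needed for each chip to reach both neighbors along each of three axes. A minimal sketch of neighbor enumeration in a 3D torus, with illustrative names and the 4x4x8 shape used earlier in this document:

```python
def torus_neighbors(coord, shape):
    """Return the chips one hop away from coord in a torus of the given shape."""
    neighbors = set()
    for axis, n in enumerate(shape):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % n  # wrap around at the edges
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))  # only relevant when a dimension has size 1
    return neighbors


corner = torus_neighbors((0, 0, 0), (4, 4, 8))
print(len(corner))
print(sorted(corner))
```

Even a corner chip has six distinct neighbors here, because the wraparound links connect it to the opposite faces of the slice.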

Other:

  • x16 PCIe Gen3 interface to host (direct connect).
  • Improved security model.
  • Improved energy efficiency.

Performance benefits of TPU v3 over v2

The increased FLOPS per TensorCore and memory capacity in TPU v3 configurations can improve the performance of your models in the following ways:

  • TPU v3 configurations provide significant performance benefits per TensorCore for compute-bound models. Memory-bound models on TPU v2 configurations might not achieve this same performance improvement if they are also memory-bound on TPU v3 configurations.

  • In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can provide improved performance and reduced recomputation of intermediate values (rematerialization).

  • TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2 configurations. For example, TPU v3 might allow deeper ResNets and larger images with RetinaNet.
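The distinction between compute-bound and memory-bound models can be illustrated with a simple roofline estimate. The roofline model is not part of this document; it is brought in here as a common rule of thumb. The sketch uses the TPU v3 peak compute and HBM bandwidth from the v3 table, and the arithmetic intensities are made-up example values.

```python
V3_PEAK_TFLOPS = 123  # from the v3 specifications table
V3_HBM_GBPS = 900     # from the v3 specifications table


def attainable_tflops(flops_per_byte, peak_tflops, hbm_gbps):
    """Roofline estimate: limited by either compute or memory bandwidth."""
    memory_limited_tflops = flops_per_byte * hbm_gbps / 1000
    return min(peak_tflops, memory_limited_tflops)


# Intensity at which a model stops being memory-bound in this simple model.
ridge_flops_per_byte = V3_PEAK_TFLOPS * 1000 / V3_HBM_GBPS
print(f"ridge point: {ridge_flops_per_byte:.1f} FLOPs per byte")

for intensity in (10, 100, 1000):  # hypothetical model intensities
    tflops = attainable_tflops(intensity, V3_PEAK_TFLOPS, V3_HBM_GBPS)
    print(f"{intensity:>5} FLOPs/byte -> {tflops:.1f} TFLOPS attainable")
```

In this model, a workload below the ridge point is limited by memory bandwidth, so additional FLOPS alone do not speed it up. That matches the caveat above about models that remain memory-bound.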

Models that are nearly input-bound ("infeed") on TPU v2 because training steps are waiting for input might also be input-bound with Cloud TPU v3. The pipeline performance guide can help you resolve infeed issues.

Cloud TPU VM Architectures

How you interact with the TPU host (and the TPU board) depends on the Cloud TPU architecture you're using: TPU Node or TPU VM.

TPU Node Architecture

The TPU Node architecture consists of a user VM that communicates with the TPU host over gRPC. When using this architecture, you cannot directly access the TPU host, which makes it difficult to debug training and TPU errors.

[Figure: TPU Node architecture]

TPU VM Architecture

The TPU VM architecture lets you directly connect to the VM physically connected to the TPU device using SSH. You have root access to the VM, so you can run arbitrary code. You can access compiler and runtime debug logs and error messages.

[Figure: TPU VM architecture]

What's next