TPU v3

This document describes the architecture and supported configurations of Cloud TPU v3.

System architecture

Each v3 TPU chip contains two TensorCores. Each TensorCore has two matrix-multiply units (MXUs), a vector unit, and a scalar unit. The following table shows the key specifications and their values for a v3 TPU Pod.

Key specifications              v3 Pod values
Peak compute per chip           123 teraflops (bf16)
HBM2 capacity and bandwidth     32 GiB, 900 GB/s
Measured min/mean/max power     123/220/262 W
TPU Pod size                    1024 chips
Interconnect topology           2D torus
Peak compute per Pod            126 petaflops (bf16)
All-reduce bandwidth per Pod    340 TB/s
Bisection bandwidth per Pod     6.4 TB/s
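
The per-Pod figures can be cross-checked against the per-chip figures with simple arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, and it assumes that the HBM2 capacity in the table is a per-chip value.

```python
# Values copied from the specification table above.
PEAK_TFLOPS_PER_CHIP = 123   # bf16 teraflops per chip
HBM_GIB_PER_CHIP = 32        # assumption: the HBM2 capacity is per chip
CHIPS_PER_POD = 1024
TENSORCORES_PER_CHIP = 2     # from the system architecture description


def pod_totals():
    """Return aggregate figures for a full v3 Pod."""
    return {
        # teraflops -> petaflops (decimal prefix, divide by 1000)
        "peak_pflops": PEAK_TFLOPS_PER_CHIP * CHIPS_PER_POD / 1000,
        # GiB -> TiB (binary prefix, divide by 1024)
        "hbm_tib": HBM_GIB_PER_CHIP * CHIPS_PER_POD / 1024,
        "tensorcores": TENSORCORES_PER_CHIP * CHIPS_PER_POD,
    }


totals = pod_totals()
print(totals)
```

The computed peak of 125.952 petaflops rounds to the 126 petaflops listed in the table, and a full Pod exposes 2048 TensorCores.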

The following diagram illustrates a TPU v3 chip.

[Figure: Diagram of a TPU v3 chip]

Architectural details and performance characteristics of TPU v3 are available in the paper "A Domain-Specific Supercomputer for Training Deep Neural Networks".

Performance benefits of TPU v3 over v2

The increased FLOPS per TensorCore and memory capacity in TPU v3 configurations can improve the performance of your models in the following ways:

  • TPU v3 configurations provide significant performance benefits per TensorCore for compute-bound models. Memory-bound models on TPU v2 configurations might not achieve this same performance improvement if they are also memory-bound on TPU v3 configurations.

  • In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can provide improved performance and reduced recomputation of intermediate values (rematerialization).

  • TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2 configurations. For example, TPU v3 might allow deeper ResNet models and larger images with RetinaNet.
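
One rough way to reason about whether a model is compute-bound or memory-bound is a simple roofline estimate. The following Python sketch is not an official sizing method: it assumes that the peak compute and HBM bandwidth figures from the specification table apply per chip, that "GBps" means decimal gigabytes per second, and that the function names are hypothetical.

```python
# Per-chip figures from the v3 specification table.
PEAK_FLOPS = 123e12          # bf16 FLOP/s
HBM_BYTES_PER_SEC = 900e9    # assumption: 900 GBps is per chip, decimal units


def ridge_point():
    """Arithmetic intensity (FLOPs per byte) where compute and memory limits meet."""
    return PEAK_FLOPS / HBM_BYTES_PER_SEC


def attainable_flops(intensity):
    """Roofline bound: the lower of peak compute and bandwidth times intensity."""
    return min(PEAK_FLOPS, HBM_BYTES_PER_SEC * intensity)


def is_compute_bound(intensity):
    """True if a kernel with this FLOPs-per-byte ratio is limited by compute."""
    return intensity >= ridge_point()


print(f"ridge point: {ridge_point():.1f} FLOPs/byte")
print(is_compute_bound(10), is_compute_bound(1000))
```

Under these assumptions, an operation that performs fewer than roughly 137 floating-point operations per byte read from HBM is limited by memory bandwidth rather than by peak compute, so it benefits less from the higher FLOPS of TPU v3.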

Models that are nearly input-bound ("infeed"-bound) on TPU v2, where training steps wait for input data, might also be input-bound on Cloud TPU v3. The pipeline performance guide can help you resolve infeed issues.

Configurations

A TPU v3 Pod is composed of 1024 chips interconnected with high-speed links. To create a TPU v3 device or Pod slice, pass the --accelerator-type flag to the TPU creation command (gcloud compute tpus tpu-vm create). The accelerator type consists of the TPU version and the number of TensorCores. For example, for a single v3 TPU device with 8 TensorCores, use --accelerator-type=v3-8. For a v3 Pod slice with 128 TensorCores, use --accelerator-type=v3-128.

The following command shows how to create a v3 TPU Pod slice with 128 TensorCores:

  $ gcloud compute tpus tpu-vm create tpu-name \
    --zone=zone \
    --accelerator-type=v3-128 \
    --version=tpu-vm-tf-2.16.1-pjrt
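
If you script TPU creation, you might assemble the same command programmatically. The following Python sketch only builds and prints the command line; it does not run gcloud. The function name build_create_command is hypothetical, and the flags are limited to the ones shown in the command above.

```python
import shlex


def build_create_command(tpu_name, zone, accelerator_type, runtime_version):
    """Return the argument list for the TPU VM creation command shown above."""
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--zone={zone}",
        f"--accelerator-type={accelerator_type}",
        f"--version={runtime_version}",
    ]


# Placeholder name and zone, as in the example command.
cmd = build_create_command("tpu-name", "zone", "v3-128", "tpu-vm-tf-2.16.1-pjrt")
print(shlex.join(cmd))
```

Keeping the command as a list of arguments, rather than a single string, avoids quoting problems if you later pass it to a process launcher.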

The following table lists the supported v3 TPU types:

TPU type    Support ends
v3-8        (End date not yet set)
v3-32       (End date not yet set)
v3-128      (End date not yet set)
v3-256      (End date not yet set)
v3-512      (End date not yet set)
v3-1024     (End date not yet set)
v3-2048     (End date not yet set)
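
Because the number in each type name counts TensorCores and each v3 chip contains two TensorCores, you can derive the chip count of a slice from its name. The following Python sketch illustrates this; the describe function and its return fields are hypothetical, and the list of types is copied from the table above.

```python
# Supported types copied from the table above.
SUPPORTED_V3_TYPES = (
    "v3-8", "v3-32", "v3-128", "v3-256", "v3-512", "v3-1024", "v3-2048",
)
TENSORCORES_PER_CHIP = 2   # each v3 chip contains two TensorCores
CHIPS_PER_POD = 1024       # a full v3 Pod


def describe(accelerator_type):
    """Split a v3 accelerator type into version, TensorCores, and chips."""
    if accelerator_type not in SUPPORTED_V3_TYPES:
        raise ValueError(f"unsupported v3 accelerator type: {accelerator_type}")
    version, cores = accelerator_type.split("-")
    tensorcores = int(cores)
    chips = tensorcores // TENSORCORES_PER_CHIP
    return {
        "version": version,
        "tensorcores": tensorcores,
        "chips": chips,
        "pod_fraction": chips / CHIPS_PER_POD,
    }


for name in SUPPORTED_V3_TYPES:
    print(name, describe(name))
```

For example, v3-128 corresponds to 64 chips, and v3-2048 corresponds to all 1024 chips of a full Pod.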

For more information about managing TPUs, see Manage TPUs. For more information about the system architecture of Cloud TPU, see System architecture.