Tensor Processing Units (TPUs) are ML accelerators designed by Google. Cloud TPU makes TPUs available as a scalable Google Cloud resource. You can run machine learning workloads on Cloud TPUs using machine learning frameworks such as TensorFlow, PyTorch, and JAX.
A single TPU device contains 4 chips, each of which contains 2 TPU cores. A TPU core contains one or more Matrix Multiply Units (MXU), a Vector Processing Unit (VPU), and a Scalar Unit.
The MXU is composed of 128 x 128 multiply-accumulators in a systolic array. The MXUs provide the bulk of the compute power in a TPU chip. Each MXU can perform 16,384 multiply-accumulate operations per cycle using the bfloat16 number format.
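The 16K figure follows directly from the array dimensions. The back-of-the-envelope sketch below shows the arithmetic; the clock frequency is an illustrative assumption, not a published spec for any particular TPU version:

```python
# MACs per cycle follow from the 128 x 128 systolic array.
MXU_DIM = 128
MACS_PER_CYCLE = MXU_DIM * MXU_DIM  # one multiply-accumulate per array cell

# Hypothetical clock rate, for illustration only.
ASSUMED_CLOCK_HZ = 700e6

# Each multiply-accumulate counts as 2 floating-point operations.
peak_flops = MACS_PER_CYCLE * 2 * ASSUMED_CLOCK_HZ

print(MACS_PER_CYCLE)     # 16384 MACs per cycle, i.e. the "16K" above
print(peak_flops / 1e12)  # peak TFLOPS per MXU under the assumed clock
```
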
The VPU is used for general computation such as activations, softmax, and so on. The scalar unit is used for control flow, calculating memory addresses, and other maintenance operations.
The exact layout of a TPU device depends on the TPU version you use. Architectural details and performance characteristics of TPU v2 and v3 are available in A Domain Specific Supercomputer for Training Deep Neural Networks.
A TPU v2 board contains four TPU chips and 16 GiB of HBM. Each TPU chip contains two cores. Each core has an MXU, a vector unit, and a scalar unit.
A TPU v3 board contains four TPU chips and 32 GiB of HBM. Each TPU chip contains two cores. Each core has an MXU, a vector unit, and a scalar unit.
Cloud TPU provides the following TPU configurations:
- A single TPU device
- A TPU Pod - a group of TPU devices connected by high-speed interconnects
- A TPU slice - a subdivision of a TPU Pod
Performance benefits of TPU v3 over v2
The increased FLOPS per core and memory capacity in TPU v3 configurations can improve the performance of your models in the following ways:
TPU v3 configurations provide significant per-core performance benefits for compute-bound models. Models that are memory-bound on TPU v2 might not see the same improvement if they remain memory-bound on TPU v3.
In cases where data does not fit into memory on TPU v2 configurations, TPU v3 can provide improved performance and reduced recomputation of intermediate values (re-materialization).
TPU v3 configurations can run new models with batch sizes that did not fit on TPU v2 configurations. For example, TPU v3 might allow deeper ResNets and larger images with RetinaNet.
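As a rough illustration of the batch-size point, you can estimate the largest per-core batch from the per-example activation footprint and the memory left over after the model itself. All numbers below (HBM per core, model size, bytes per example) are hypothetical placeholders, not published TPU specifications:

```python
def max_batch_size(hbm_bytes_per_core, bytes_per_example, model_bytes):
    """Largest per-core batch whose activations fit beside the model."""
    available = hbm_bytes_per_core - model_bytes
    return max(available // bytes_per_example, 0)

GIB = 1024 ** 3

# Hypothetical model: 1 GiB of parameters and optimizer state,
# 24 MiB of activations per example.
model_bytes = 1 * GIB
bytes_per_example = 24 * 1024 ** 2

small_hbm = max_batch_size(8 * GIB, bytes_per_example, model_bytes)
large_hbm = max_batch_size(16 * GIB, bytes_per_example, model_bytes)

print(small_hbm, large_hbm)  # 298 vs 640: doubling memory roughly doubles the batch
```

A model whose minimum viable batch size exceeds the smaller figure simply does not fit on the smaller-memory configuration, which is the situation described above.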
Models that are nearly input-bound ("infeed-bound") on TPU v2 because training steps spend time waiting for input might also be input-bound with Cloud TPU v3. The pipeline performance guide can help you resolve infeed issues.
TPUs are available in the following configurations:
- A single TPU board
- A TPU Pod
- A TPU Pod slice
Single TPU Board
A single-board TPU configuration is a standalone board with four TPU chips (eight TPU cores) and no network connections to other TPU boards. Single-board TPUs are not part of a TPU Pod configuration and do not occupy a portion of a TPU Pod.
TPU Pods and Slices
In a TPU Pod or TPU Pod slice, TPU chips are connected using a high-speed interconnect and each TPU chip communicates directly with the other chips on the TPU device. The TPU runtime automatically handles distributing data to each TPU core in a Pod or slice. Pod slices are available with 32, 128, 512, 1024, or 2048 cores.
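The runtime's per-core data distribution can be pictured as simple batch sharding. The sketch below is a conceptual illustration in plain Python, not the actual Cloud TPU runtime logic:

```python
def shard_batch(batch, num_cores):
    """Split a global batch into equal per-core shards (data parallelism)."""
    if len(batch) % num_cores != 0:
        raise ValueError("global batch size must divide evenly across cores")
    per_core = len(batch) // num_cores
    return [batch[i * per_core:(i + 1) * per_core] for i in range(num_cores)]

# A 1,024-example global batch on a 32-core Pod slice: 32 examples per core.
global_batch = list(range(1024))
shards = shard_batch(global_batch, 32)

print(len(shards), len(shards[0]))  # 32 shards of 32 examples each
```

Because the available slice sizes (32, 128, 512, 1024, 2048 cores) are powers of two, the usual powers-of-two global batch sizes divide evenly across any slice.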
Cloud TPU VM Architectures
TPUs can only perform matrix operations, so each TPU board is connected to a CPU-based host machine to perform operations that cannot be executed on the TPU. The host machines are responsible for loading data from Cloud Storage, preprocessing data, and sending data to the TPU.
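Conceptually, the host-side work is an input pipeline: read, preprocess, batch, feed. The sketch below is a framework-agnostic illustration in plain Python; `load_record` and `preprocess` are hypothetical stand-ins for real Cloud Storage reads and host-side transforms:

```python
def load_record(i):
    """Hypothetical stand-in for reading one record from Cloud Storage."""
    return {"pixels": [i] * 4, "label": i % 10}

def preprocess(record):
    """Hypothetical stand-in for host-side preprocessing (e.g. normalization)."""
    record["pixels"] = [p / 255.0 for p in record["pixels"]]
    return record

def input_pipeline(num_records, batch_size):
    """Yield preprocessed batches, as the host would before feeding the TPU."""
    batch = []
    for i in range(num_records):
        batch.append(preprocess(load_record(i)))
        if len(batch) == batch_size:
            yield batch
            batch = []

batches = list(input_pipeline(num_records=8, batch_size=4))
print(len(batches), len(batches[0]))  # 2 batches of 4 records each
```

If this loop cannot produce batches as fast as the TPU consumes them, the model becomes input-bound (the "infeed" situation described earlier).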
In a TPU Pod, there is a TPU host for each TPU board.
How you interact with the TPU host (and the TPU board) depends upon the TPU VM architecture you are using: TPU Nodes or TPU VMs.
TPU Nodes are the original TPU experience. They require an extra user VM that communicates with the TPU host over gRPC; there is no direct access to the TPU host.
When you use TPU VMs, you SSH directly into the Google Compute Engine VM that is physically connected to the TPU device. You have root access to the VM, so you can run arbitrary code, and you can access compiler and runtime debug logs and error messages.
Frameworks like JAX, PyTorch, and TensorFlow access TPUs through a shared library, libtpu, that is present on every TPU VM. This library includes the XLA compiler used to compile TPU programs, the TPU runtime used to run compiled programs, and the TPU driver used by the runtime for low-level access to the TPU.
With TPU VMs, your Python code runs directly on the TPU host instead of on a separate user VM.
For more information on TensorFlow and Cloud TPU, see Running TensorFlow Models on Cloud TPU.
The Cloud TPU Node system architecture was originally built for TensorFlow. The TPU hosts are inaccessible to the user and run a headless copy of the TensorFlow server. They don't run Python or any user code that is not represented as a TensorFlow graph. User code runs in a separate, remote VM that communicates with the TPU hosts over gRPC.
With TPU VMs your PyTorch code runs directly on the TPU hosts.
For more information on PyTorch and Cloud TPU, see Running PyTorch Models on Cloud TPU.
PyTorch runs on the Cloud TPU Node architecture using a library called XRT, which sends XLA graphs and runtime instructions over TensorFlow gRPC connections for execution on the TensorFlow servers. A user VM is required for each TPU host.
With TPU VMs there is no need for user VMs because you can run your code directly on the TPU hosts.
For more information on running JAX on Cloud TPU, see JAX quickstart.
JAX on Cloud TPU Nodes runs similarly to PyTorch: a separate user VM is required for each TPU host.