System architecture

Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) designed by Google to accelerate machine learning workloads. Cloud TPU is a Google Cloud service that makes TPUs available as a scalable resource.

TPUs are designed to perform matrix operations quickly, making them well suited to machine learning workloads. You can run these workloads on TPUs using frameworks such as TensorFlow, PyTorch, and JAX.

Cloud TPU terms

If you are new to Cloud TPUs, check out the TPU documentation home. The following sections explain terms and related concepts used in this document.

Batch inference

Batch (or offline) inference means running inference outside of production pipelines, typically on a large set of inputs at once. It is used for offline tasks such as data labeling and for evaluating a trained model. Latency SLOs are not a priority for batch inference.
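
As a rough illustration of the pattern, the sketch below runs a stand-in model over a bulk of inputs in fixed-size chunks. The names `batched`, `run_batch_inference`, and `toy_model` are hypothetical, and the model is a placeholder rather than anything TPU-specific.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[float], batch_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batch_inference(
    model: Callable[[Sequence[float]], List[float]],
    inputs: Sequence[float],
    batch_size: int,
) -> List[float]:
    """Run the model over all inputs offline, one chunk at a time."""
    predictions: List[float] = []
    for chunk in batched(inputs, batch_size):
        predictions.extend(model(chunk))
    return predictions


def toy_model(batch: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained model: doubles each input."""
    return [2.0 * x for x in batch]
```

Because latency is not the goal here, the batch size would normally be chosen for throughput rather than for response time.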

TPU chip

A TPU chip contains one or more TensorCores. The number of TensorCores depends on the version of the TPU chip. Each TensorCore consists of one or more matrix-multiply units (MXUs), a vector unit, and a scalar unit.

An MXU is composed of 128 x 128 multiply-accumulators in a systolic array. MXUs provide the bulk of the compute power in a TensorCore. Each MXU can perform 16K (128 x 128 = 16,384) multiply-accumulate operations per cycle. All multiplies take bfloat16 inputs, but all accumulations are performed in FP32.
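
To make the numbers concrete, the following pure-Python sketch computes the multiply-accumulate count for a 128 x 128 array and emulates a dot product with bfloat16 inputs and FP32 accumulation. It is an illustration only, not TPU code: the helper names are hypothetical, and bfloat16 conversion is modeled by truncating the low 16 bits of the float32 encoding, which is a simplifying assumption about rounding.

```python
import struct

MXU_ROWS = 128
MXU_COLS = 128


def macs_per_cycle(rows: int = MXU_ROWS, cols: int = MXU_COLS) -> int:
    """One multiply-accumulate per cell of the systolic array per cycle."""
    return rows * cols


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: truncation (round toward zero) stands in for the real rounding mode.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot_bf16_fp32(a, b) -> float:
    """Dot product with bfloat16 inputs and a float32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two bfloat16 values needs at most 16 significant bits,
        # so it is exact here; only the running sum is rounded to float32.
        product = to_bfloat16(x) * to_bfloat16(y)
        acc = to_float32(acc + product)
    return acc
```

For example, `to_bfloat16(1.00390625)` (that is, 1 + 2^-8) returns `1.0`, because bfloat16 keeps only 7 explicit mantissa bits, while the accumulator retains float32 precision.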

The vector unit is used for general computation such as activations and softmax. The scalar unit is used for control flow, calculating memory addresses, and other maintenance operations.

TPU cube

A TPU cube is a 4x4x4 topology of TPU chips. The term applies only to 3D topologies (beginning with the v4 TPU version).

Inference

Inference is the process of using a trained model to make predictions on new data. It is used by the serving process.

Multislice versus single slice

Multislice is a group of slices, extending TPU connectivity beyond the inter-chip interconnect (ICI) connections and leveraging the data-center network (DCN) for transmitting data beyond a slice. Data within each slice is still transmitted by ICI. Using this hybrid connectivity, Multislice enables parallelism across slices and lets you use a greater number of TPU cores for a single job than what a single slice can accommodate.

TPUs can be used to run a job either on a single slice or multiple slices. Refer to the Multislice introduction for more details.
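
The hybrid connectivity can be pictured with a small sketch: traffic between two chips in the same slice stays on ICI, while traffic between slices crosses the DCN. The `ChipLocation` type and both functions are hypothetical names used only for illustration; they are not part of any Cloud TPU API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipLocation:
    """Hypothetical address of a chip: which slice it is in, and its index there."""
    slice_id: int
    chip_id: int


def network_for(src: ChipLocation, dst: ChipLocation) -> str:
    """Return which network would carry data between two chips."""
    if src == dst:
        return "local"
    # Same slice: inter-chip interconnect. Different slices: data-center network.
    return "ICI" if src.slice_id == dst.slice_id else "DCN"


def total_cores(cores_per_slice: int, num_slices: int) -> int:
    """Cores available to one job when it spans several equally sized slices."""
    return cores_per_slice * num_slices
```

Under these assumptions, a job spanning four slices of 128 cores each would see `total_cores(128, 4)` cores, more than any one of those slices provides.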

Cloud TPU ICI resiliency

ICI resiliency helps improve fault tolerance of the optical links and optical circuit switches (OCS) that connect TPUs between cubes. (ICI connections within a cube use copper links, which are not affected.) ICI resiliency allows ICI connections to be routed around OCS and optical ICI faults. As a result, it improves the scheduling availability of TPU slices, at the cost of a temporary degradation in ICI performance.

As with Cloud TPU v4, ICI resiliency is enabled by default for v5p slices that are one cube or larger:

v5p-128 when specifying accelerator type
4x4x4 when specifying accelerator config
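
The two notations describe the same slice size. The sketch below shows the arithmetic, assuming two TensorCores per v5p chip (which is what makes a 4x4x4 topology of 64 chips correspond to v5p-128). The function names are hypothetical, and the default-enabled check simply encodes the "one cube or larger" rule stated above.

```python
from math import prod

CUBE_CHIPS = 4 * 4 * 4  # one cube = 64 chips


def chips_in_topology(topology: str) -> int:
    """Number of chips in a topology string such as '4x4x4'."""
    return prod(int(dim) for dim in topology.split("x"))


def accelerator_type(version: str, topology: str, cores_per_chip: int) -> str:
    """Build a '<version>-<TensorCore count>' name from a topology.

    Assumption: the numeric suffix counts TensorCores, as it does for v5p.
    """
    return f"{version}-{chips_in_topology(topology) * cores_per_chip}"


def ici_resiliency_on_by_default(topology: str) -> bool:
    """True when the slice is at least one cube (64 chips) in size."""
    return chips_in_topology(topology) >= CUBE_CHIPS
```

With these helpers, `accelerator_type("v5p", "4x4x4", 2)` yields `"v5p-128"`, matching the two entries in the list.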

Queued resource

A representation of TPU resources, used to enqueue and manage a request for a single-slice or multi-slice TPU environment. See the Queued Resources user guide for more information.

Serving

Serving is the process of deploying a trained machine learning model to a production environment where it can be used to make predictions or decisions. Latency and service-level availability are important for serving.

Single host and multi host

A TPU host is a VM that runs on a physical computer connected to TPU hardware. TPU workloads can use one or more hosts.

A single-host workload is limited to one TPU VM. A multi-host workload distributes training across multiple TPU VMs.

Slices

A Pod slice is a collection of chips, all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI). Slices are described in terms of chips or TensorCores, depending on the TPU version.

Chip shape and chip topology also refer to slice shapes.

SparseCore

v5p includes four SparseCores per chip. SparseCores are dataflow processors that accelerate models relying on embeddings, such as those found in recommendation models.

TPU Pod

A TPU Pod is a contiguous set of TPUs grouped together over a specialized network. The number of TPU chips in a TPU Pod is dependent on the TPU version.

TPU VM or worker

A virtual machine running Linux that has access to the underlying TPUs. A TPU VM is also known as a worker.

TensorCores

TPU chips have one or two TensorCores to run matrix multiplication. For more information about TensorCores, see this ACM article.

Worker

See TPU VM.

TPU versions

The exact architecture of a TPU chip depends on the TPU version that you use. Each TPU version also supports different slice sizes and configurations. For more information about the system architecture and supported configurations, see the following pages:

Cloud TPU VM architectures

How you interact with the TPU host (and the TPU board) depends upon the TPU VM architecture you're using: TPU Nodes or TPU VMs.

TPU VM architecture

The TPU VM architecture lets you directly connect to the VM physically connected to the TPU device using SSH. You have root access to the VM, so you can run arbitrary code. You can access compiler and runtime debug logs and error messages.

TPU Node architecture

The TPU Node architecture consists of a user VM that communicates with the TPU host over gRPC. When using this architecture, you cannot directly access the TPU host, which makes it difficult to debug training and TPU errors.