System architecture

Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) designed by Google to accelerate machine learning workloads. Cloud TPU is a Google Cloud service that makes TPUs available as a scalable resource.

TPUs are designed to perform matrix operations quickly, making them ideal for machine learning workloads. You can run machine learning workloads on TPUs using frameworks such as TensorFlow, PyTorch, and JAX.
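
As a quick check, the following minimal sketch (assuming JAX with TPU support is installed on a Cloud TPU VM) lists the TPU devices that JAX can see:

    # Minimal sketch: list the TPU devices visible to JAX.
    # Assumes JAX with TPU support is installed on a Cloud TPU VM.
    import jax

    devices = jax.devices()
    print(len(devices), "devices; platform:", devices[0].platform)  # platform: tpu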

Cloud TPU terms

If you are new to Cloud TPUs, check out the TPU documentation home. The following sections explain terms and related concepts used in this document.

Batch inference

Batch or offline inference refers to doing inference outside of production pipelines, typically on bulk inputs. Batch inference is used for offline tasks such as data labeling and for evaluating the trained model. Latency SLOs are not a priority for batch inference.
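
As an illustration, the following hedged sketch runs batch inference over bulk inputs with JAX; the model function and its parameters are hypothetical placeholders, not a real trained model:

    import jax
    import jax.numpy as jnp

    def model(params, x):
        # Hypothetical "trained model": a single linear layer.
        return x @ params["w"] + params["b"]

    predict = jax.jit(model)  # compile once, reuse for every batch

    params = {"w": jnp.ones((16, 4)), "b": jnp.zeros(4)}
    inputs = jnp.ones((1024, 16))  # bulk inputs collected offline

    # Iterate over fixed-size batches; throughput matters here, not latency.
    predictions = [predict(params, inputs[i : i + 256])
                   for i in range(0, inputs.shape[0], 256)]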

TPU chip

A TPU chip contains one or more TensorCores. The number of TensorCores depends on the version of the TPU chip. Each TensorCore consists of one or more matrix-multiply units (MXUs), a vector unit, and a scalar unit.

An MXU is composed of either 256 x 256 (TPU v6e) or 128 x 128 (TPU versions prior to v6e) multiply-accumulators in a systolic array. MXUs provide the bulk of the compute power in a TensorCore. Each MXU is capable of performing 16K multiply-accumulate operations per cycle. All multiplies take bfloat16 inputs, but all accumulations are performed in FP32 number format.

The vector unit is used for general computation such as activations and softmax. The scalar unit is used for control flow, calculating memory addresses, and other maintenance operations.
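
You can observe the bfloat16-input, FP32-accumulation behavior described above from JAX. This is a minimal sketch, assuming a recent JAX version in which jnp.matmul accepts preferred_element_type:

    import jax.numpy as jnp

    a = jnp.ones((128, 128), dtype=jnp.bfloat16)
    b = jnp.ones((128, 128), dtype=jnp.bfloat16)

    # bfloat16 inputs; ask XLA to accumulate (and return) in float32.
    c = jnp.matmul(a, b, preferred_element_type=jnp.float32)
    print(c.dtype)  # float32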

TPU cube

A 4x4x4 topology (64 chips). Cubes apply only to 3D topologies, beginning with the v4 TPU version.

Inference

Inference is the process of using a trained model to make predictions on new data. Inference is performed as part of the serving process.

Multislice versus single slice

Multislice is a group of slices, extending TPU connectivity beyond the inter-chip interconnect (ICI) connections and leveraging the data-center network (DCN) to transmit data beyond a slice. Data within each slice is still transmitted over ICI. With this hybrid connectivity, Multislice enables parallelism across slices and lets you use more TPU cores for a single job than a single slice can accommodate.

TPUs can run a job on either a single slice or multiple slices. Refer to the Multislice introduction for more details.
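
In JAX, a Multislice environment can be expressed as a hybrid device mesh whose inner axes map to ICI within a slice and whose outer axes map to DCN across slices. The following is a hedged sketch, assuming a Multislice job where the JAX distributed runtime is already initialized; the mesh shapes are illustrative only:

    import jax
    from jax.experimental import mesh_utils
    from jax.sharding import Mesh

    # Illustrative shapes: 2 slices connected over DCN,
    # each slice an ICI-connected 4x4 device mesh.
    devices = mesh_utils.create_hybrid_device_mesh(
        mesh_shape=(4, 4),      # per-slice (ICI) axes
        dcn_mesh_shape=(2, 1),  # across-slice (DCN) axes
    )
    mesh = Mesh(devices, axis_names=("data", "model"))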

Cloud TPU ICI resiliency

ICI resiliency helps improve the fault tolerance of the optical links and optical circuit switches (OCS) that connect TPUs between cubes. (ICI connections within a cube use copper links, which are not affected.) ICI resiliency allows ICI connections to be routed around OCS and optical ICI faults. As a result, it improves the scheduling availability of TPU slices, with the trade-off of temporarily degraded ICI performance.

Similar to Cloud TPU v4, ICI resiliency is enabled by default for v5p slices that are one cube or larger:

  • v5p-128 when specifying accelerator type
  • 4x4x4 when specifying accelerator config

Because each v5p chip has two TensorCores, v5p-128 (128 TensorCores) corresponds to the 64 chips of a 4x4x4 cube.

Queued resource

A representation of TPU resources, used to enqueue and manage a request for a single-slice or multi-slice TPU environment. See the Queued Resources user guide for more information.

Serving

Serving is the process of deploying a trained machine learning model to a production environment where it can be used to make predictions or decisions. Latency and service-level availability are important for serving.

Single host, multi host, and sub host

A TPU host is a VM that runs on a physical computer connected to TPU hardware. TPU workloads can use one or more hosts.

A single-host workload is limited to one TPU VM. A multi-host workload distributes training across multiple TPU VMs. A sub-host workload doesn't use all of the chips on a TPU VM.
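
From JAX, you can inspect this layout directly; in a multi-host workload, each TPU VM runs its own copy of the script and reports its own local devices:

    import jax

    print("hosts (processes):       ", jax.process_count())
    print("devices on this host:    ", jax.local_device_count())
    print("devices across all hosts:", jax.device_count())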

Slices

A Pod slice is a collection of chips all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI). Slices are described in terms of chips or TensorCores, depending on the TPU version.

The terms chip shape and chip topology also refer to slice shapes.

SparseCore

SparseCores are dataflow processors that accelerate models that rely on embeddings, such as recommendation models. v5p includes four SparseCores per chip, and v6e includes two SparseCores per chip.
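
For context, an embedding lookup is the kind of sparse, memory-bound operation these models rely on. The following minimal JAX sketch illustrates the workload pattern only, not a SparseCore-specific API; the table and ids are hypothetical:

    import jax.numpy as jnp

    # Hypothetical embedding table: 10,000 ids, 64-dimensional vectors.
    table = jnp.zeros((10_000, 64))
    ids = jnp.array([3, 17, 42, 9])  # sparse feature ids for one example

    # Gather the embedding rows and pool them into one feature vector.
    pooled = jnp.take(table, ids, axis=0).mean(axis=0)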

TPU Pod

A TPU Pod is a contiguous set of TPUs grouped together over a specialized network. The number of TPU chips in a TPU Pod depends on the TPU version.

TPU VM or worker

A virtual machine running Linux that has access to the underlying TPUs. A TPU VM is also known as a worker.

TensorCores

TPU chips have one or two TensorCores to run matrix multiplication. For more information about TensorCores, see this ACM article.

Worker

See TPU VM.

TPU versions

The exact architecture of a TPU chip depends on the TPU version that you use. Each TPU version also supports different slice sizes and configurations. For more information about the system architecture and supported configurations, see the system architecture page for each TPU version.

TPU architectures

There have been two TPU architectures describing how a VM is physically connected to the TPU device: TPU Node and TPU VM. TPU Node was the original TPU architecture for the v2 and v3 TPU versions. With v4, TPU VM became the default architecture, though both architectures were supported. The TPU Node architecture is now deprecated, and only TPU VM is supported. If you are using TPU Nodes, see Moving from TPU Node to TPU VM architecture to convert to the TPU VM architecture.

TPU VM architecture

The TPU VM architecture lets you connect directly over SSH to the VM that is physically connected to the TPU device. You have root access to the VM, so you can run arbitrary code. You can access compiler and runtime debug logs and error messages.

TPU Node architecture

The TPU Node architecture consists of a user VM that communicates with the TPU host over gRPC. When using this architecture, you cannot directly access the TPU host, which makes it difficult to debug training and TPU errors.

Moving from TPU Node to TPU VM architecture

If you have TPUs using the TPU Node architecture, use the following steps to identify, delete, and re-provision them as TPU VMs.

  1. In the Google Cloud console, go to the TPUs page.
  2. Locate your TPU and find its architecture under the Architecture heading. If the architecture is "TPU VM", you don't need to take any action. If the architecture is "TPU Node", you need to delete and re-provision the TPU.
  3. Delete and re-provision the TPU. See Managing TPUs for instructions on deleting and re-provisioning TPUs.

What's next