Introduction to Cloud TPU

Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. For more detailed information about TPU hardware, see System Architecture. Cloud TPU is a web service that makes TPUs available as scalable computing resources on Google Cloud.

TPUs train your models more efficiently using hardware designed for the large matrix operations found in many machine learning algorithms. TPUs have on-chip high-bandwidth memory (HBM), which lets you use larger models and batch sizes. TPUs can be connected in groups called Pods that scale up your workloads with little to no code changes.
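
For example, in JAX (one of the ML frameworks supported on Cloud TPU), a toy training step can be replicated across every TPU core on the host with a single pmap call. This is only a minimal sketch that assumes a Cloud TPU VM with JAX installed; the step function is an illustrative stand-in for a real training step.

    import jax
    import jax.numpy as jnp

    print(jax.devices())  # lists the TPU cores available to this host

    # A toy "training step": just a matrix multiply followed by a mean.
    def step(x, w):
        return jnp.mean(x @ w)

    # pmap replicates the step across all local TPU cores; the same code
    # runs unchanged on a single device or on a larger Pod slice.
    parallel_step = jax.pmap(step, in_axes=(0, None))

    n = jax.local_device_count()
    x = jnp.ones((n, 128, 256))   # one batch shard per core
    w = jnp.ones((256, 128))      # parameters broadcast to every core
    print(parallel_step(x, w))    # one result per core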

How does it work?

To understand how TPUs work, it helps to understand how other processors address the computational challenges of training ML models.

How a CPU works

A CPU is a general-purpose processor based on the von Neumann architecture. That means a CPU works with software and memory like this:

An illustration of how a CPU works

The greatest benefit of CPUs is their flexibility. You can load any kind of software on a CPU for many different types of applications. For example, you can use a CPU for word processing on a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

For every calculation, a CPU loads values from memory, performs a calculation on the values, and stores the result back in memory. Memory access is slow when compared to the calculation speed and can limit the total throughput of CPUs. This is often referred to as the von Neumann bottleneck.

How a GPU works

To gain higher throughput, GPUs contain thousands of Arithmetic Logic Units (ALUs) in a single processor. A modern GPU usually contains between 2,500 and 5,000 ALUs. The large number of ALUs means you can execute thousands of multiplications and additions simultaneously.

An illustration of how a GPU works

This GPU architecture works well on applications with massive parallelism, such as matrix operations in a neural network. In fact, on a typical training workload for deep learning, a GPU can provide an order of magnitude higher throughput than a CPU.

But the GPU is still a general-purpose processor that has to support many different applications and software. Therefore, GPUs have the same problem as CPUs: for every calculation in the thousands of ALUs, a GPU must access registers or shared memory to read operands and store the intermediate calculation results.

How a TPU works

Google designed Cloud TPUs as a matrix processor specialized for neural network workloads. TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive matrix operations used in neural networks at very high speeds.

The primary task for TPUs is matrix processing, which is a combination of multiply and accumulate operations. TPUs contain thousands of multiply-accumulators that are directly connected to each other to form a large physical matrix. This is called a systolic array architecture. A Cloud TPU v3 contains two systolic arrays of 128 x 128 ALUs on a single processor.
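
To make the term concrete, the short Python sketch below computes each output element of a matrix multiply as a chain of multiply-accumulate steps, which is the operation the hardware performs. It illustrates only the arithmetic, not how the systolic array is wired.

    import numpy as np

    # y[i, j] is built by repeatedly multiplying and accumulating:
    # y[i, j] = sum over k of x[i, k] * w[k, j]
    def mac_dot(row, col):
        acc = 0.0
        for a, b in zip(row, col):
            acc += a * b          # one multiply-accumulate step
        return acc

    x = np.random.rand(4, 3)
    w = np.random.rand(3, 5)
    y = np.array([[mac_dot(x[i], w[:, j]) for j in range(5)] for i in range(4)])
    assert np.allclose(y, x @ w)  # matches an ordinary matrix multiply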

The TPU host streams data into an infeed queue. The TPU loads data from the infeed queue and stores it in HBM memory. When the computation is completed, the TPU loads the results into the outfeed queue. The TPU host then reads the results from the outfeed queue and stores them in the host's memory.

To perform the matrix operations, the TPU loads the parameters from HBM memory into the Matrix Multiplication Unit (MXU).

An illustration of how a TPU loads parameters from memory

Then, the TPU loads data from HBM memory. As each multiplication is executed, the result is passed to the next multiply-accumulator. The output is the summation of all multiplication results between the data and parameters. No memory access is required during the matrix multiplication process.

An illustration of how a TPU loads data from memory

As a result, TPUs can achieve high computational throughput on neural network calculations.
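
In a framework such as JAX, this flow stays mostly implicit, but you can see its outline. The following minimal sketch, assuming a Cloud TPU VM with JAX installed, moves operands from host memory onto the TPU's HBM, runs a compiled matrix multiply that executes on the MXU, and copies the result back to the host.

    import jax
    import jax.numpy as jnp

    tpu0 = jax.devices("tpu")[0]

    # Move the operands from host memory into the TPU's HBM.
    x = jax.device_put(jnp.ones((128, 256)), tpu0)
    w = jax.device_put(jnp.ones((256, 128)), tpu0)

    # The jitted function is compiled by XLA; the matrix multiply itself
    # runs on the MXU, reading its operands from HBM.
    matmul = jax.jit(lambda a, b: a @ b)
    y = matmul(x, w)

    # Copy the result back into the host's memory.
    y_host = jax.device_get(y)
    print(y_host.shape)  # (128, 128)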

XLA compiler

Code that runs on TPUs must be compiled by the Accelerated Linear Algebra (XLA) compiler. XLA is a just-in-time compiler that takes the graph emitted by an ML framework application and compiles the linear algebra, loss, and gradient components of the graph into TPU machine code. The rest of the program runs on the TPU host machine. The XLA compiler is part of the TPU runtime that runs on a TPU host machine.
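
For example, in JAX the just-in-time behavior is easy to observe. The sketch below, which assumes a TPU-backed JAX installation, defines a loss and its gradient; XLA compiles them into a TPU executable on the first call and reuses that executable for later calls with the same shapes.

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        pred = x @ w                      # linear algebra
        return jnp.mean((pred - y) ** 2)  # loss

    grad_fn = jax.jit(jax.grad(loss))     # loss + gradient compiled by XLA

    w = jnp.zeros((256, 1))
    x = jnp.ones((32, 256))
    y = jnp.ones((32, 1))

    # The first call triggers XLA compilation for these shapes; subsequent
    # calls with the same shapes reuse the compiled TPU executable.
    g = grad_fn(w, x, y)
    print(g.shape)  # (256, 1)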

When to use TPUs

Cloud TPUs are optimized for specific workloads. In some situations, you might want to use GPUs or CPUs on Compute Engine instances to run your machine learning workloads. In general, you can decide what hardware is best for your workload based on the following guidelines:

CPUs

  • Quick prototyping that requires maximum flexibility
  • Simple models that do not take long to train
  • Small models with small, effective batch sizes
  • Models that contain many custom TensorFlow operations written in C++
  • Models that are limited by available I/O or the networking bandwidth of the host system

GPUs

  • Models with a significant number of custom TensorFlow/PyTorch/JAX operations that must run at least partially on CPUs
  • Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)
  • Medium-to-large models with larger effective batch sizes

TPUs

  • Models dominated by matrix computations
  • Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
  • Models that train for weeks or months
  • Large models with large effective batch sizes

Cloud TPUs are not suited to the following workloads:

  • Linear algebra programs that require frequent branching or contain many element-wise algebra operations
  • Workloads that access memory in a sparse manner
  • Workloads that require high-precision arithmetic
  • Neural network workloads that contain custom operations in the main training loop

Best practices for model development

A program whose computation is dominated by non-matrix operations, such as add, reshape, or concatenate, will likely not achieve high MXU utilization. The following are some guidelines to help you choose and build models that are suitable for Cloud TPU.

Layout

The XLA compiler performs code transformations, including tiling a matrix multiply into smaller blocks, to efficiently execute computations on the matrix unit (MXU). To tile efficiently, the compiler uses the structure of the MXU hardware, a 128x128 systolic array, and the design of the TPU's memory subsystem, which prefers dimensions that are multiples of 8. Consequently, certain layouts are more conducive to tiling, while others require reshapes to be performed before they can be tiled. Reshape operations are often memory bound on the Cloud TPU.
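
The blocked matrix multiply below shows what tiling means in plain Python: the full product is assembled from 128x128 blocks, which is the shape the MXU consumes. It is a sketch of the idea, not the compiler's actual transformation.

    import numpy as np

    def blocked_matmul(x, w, tile=128):
        # Assemble the product from tile x tile blocks, the way a matrix
        # multiply is mapped onto the 128x128 MXU.
        m, k = x.shape
        _, n = w.shape
        y = np.zeros((m, n))
        for i in range(0, m, tile):
            for j in range(0, n, tile):
                for p in range(0, k, tile):
                    y[i:i+tile, j:j+tile] += x[i:i+tile, p:p+tile] @ w[p:p+tile, j:j+tile]
        return y

    x = np.random.rand(256, 384)
    w = np.random.rand(384, 512)
    assert np.allclose(blocked_matmul(x, w), x @ w)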

Shapes

The XLA compiler compiles an ML graph just in time for the first batch. If any subsequent batches have different shapes, the model doesn't work. (Re-compiling the graph every time the shape changes is too slow.) Therefore, any model that has tensors with dynamic shapes that change at runtime isn't well suited to TPUs.
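
The JAX sketch below makes the trade-off visible: the jitted function is compiled once per distinct input shape, so feeding it varying shapes forces repeated compilation. A common workaround is to pad variable-sized batches up to a fixed bucket size; pad_to_bucket is an illustrative helper, not a Cloud TPU API.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def score(x):
        return jnp.sum(x * x, axis=-1)

    score(jnp.ones((32, 128)))   # compiles for shape (32, 128)
    score(jnp.ones((32, 128)))   # same shape: reuses the compiled executable
    score(jnp.ones((48, 128)))   # new shape: triggers another compilation

    # Workaround: pad every batch up to a fixed bucket size so the jitted
    # function always sees the same static shape.
    def pad_to_bucket(x, bucket=64):
        return jnp.pad(x, ((0, bucket - x.shape[0]), (0, 0)))

    score(pad_to_bucket(jnp.ones((48, 128))))  # always shape (64, 128)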

Padding

A high-performing Cloud TPU program is one where the dense compute can be easily tiled into 128x128 chunks. When a matrix computation cannot occupy an entire MXU, the compiler pads tensors with zeroes. There are two drawbacks to padding:

  • Tensors padded with zeroes under-utilize the TPU core.
  • Padding increases the amount of on-chip memory storage required for a tensor and can lead to an out-of-memory error in the extreme case.

While the XLA compiler performs padding automatically when necessary, you can use the op_profile tool to determine how much padding has been performed. You can avoid padding by picking tensor dimensions that are well suited to TPUs.
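
The rough calculation below shows why padding matters. It assumes, for illustration only, that the feature dimension is padded up to a multiple of 128 and the batch up to a multiple of 8; the compiler's actual padding depends on the chosen layout.

    def padding_overhead(batch, features):
        # Illustrative only: assume the feature dimension is padded up to a
        # multiple of 128 and the batch up to a multiple of 8.
        def round_up(value, multiple):
            return -(-value // multiple) * multiple
        padded = round_up(batch, 8) * round_up(features, 128)
        return padded / (batch * features)

    print(padding_overhead(100, 100))  # ~1.33x more memory; the extra work is zeros
    print(padding_overhead(128, 256))  # 1.0x -> no padding needed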

Dimensions

Choosing suitable tensor dimensions goes a long way toward extracting maximum performance from the TPU hardware, particularly the MXU. The XLA compiler attempts to use either the batch size or the feature dimension to maximally utilize the MXU, so one of these should be a multiple of 128; otherwise, the compiler pads one of them up to a multiple of 128. Ideally, both the batch size and the feature dimensions should be multiples of 8, which enables extracting high performance from the memory subsystem.
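
As a quick illustration of that guidance, the hypothetical helper below rounds a batch size up to a multiple of 8 and a feature dimension up to a multiple of 128; it is a heuristic sketch, not a Cloud TPU utility.

    def mxu_friendly(batch, features):
        # Round the batch up to a multiple of 8 and the feature dimension up
        # to a multiple of 128, per the guidance above (illustrative heuristic).
        def round_up(value, multiple):
            return -(-value // multiple) * multiple
        return round_up(batch, 8), round_up(features, 128)

    print(mxu_friendly(100, 300))  # (104, 384)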

VPC Service Controls integration

Cloud TPU VPC Service Controls enables you to define security perimeters around your Cloud TPU resources and control the movement of data across the perimeter boundary. To learn more about VPC Service Controls, see VPC Service Controls overview. To learn about the limitations in using Cloud TPU with VPC Service Controls, see supported products and limitations.

Edge TPU

Machine learning models trained in the cloud increasingly need to run inferencing "at the edge"—that is, on devices that operate on the edge of the Internet of Things (IoT). These devices include sensors and other smart devices that gather real-time data, make intelligent decisions, and then take action or communicate their information to other devices or the cloud.

Because such devices must operate on limited power (including battery power), Google designed the Edge TPU coprocessor to accelerate ML inferencing on low-power devices. An individual Edge TPU can perform 4 trillion operations per second (4 TOPS), using only 2 watts of power—in other words, you get 2 TOPS per watt. For example, the Edge TPU can execute state-of-the-art mobile vision models such as MobileNet V2 at almost 400 frames per second, and in a power efficient manner.

This low-power ML accelerator augments Cloud TPU and Cloud IoT to provide an end-to-end (cloud-to-edge, hardware + software) infrastructure that facilitates your AI-based solutions.

The Edge TPU is available for your own prototyping and production devices in several form-factors, including a single-board computer, a system-on-module, a PCIe/M.2 card, and a surface-mounted module. For more information about the Edge TPU and all available products, visit coral.ai.

Getting started with Cloud TPU

Requesting help

Contact Cloud TPU support. If you have an active Google Cloud project, be prepared to provide the following information:

  • Your Google Cloud project ID
  • Your TPU node name, if one exists
  • Other information you want to provide

What's next?

Looking to learn more about Cloud TPU? The following resources may help: