Introduction to Cloud TPU

Tensor Processing Units (TPUs) are hardware accelerators designed by Google for machine learning workloads. For more detailed information about TPU hardware, see System Architecture. Cloud TPU is a web service that makes TPUs available as scalable computing resources on Google Cloud Platform.

TPUs train your models more efficiently using hardware designed for performing the large matrix operations often found in machine learning algorithms. TPUs have on-chip high-bandwidth memory (HBM), allowing you to use larger models and batch sizes. TPUs can be connected in groups called Pods that allow you to scale up your workloads with little to no code changes.

We recommend starting your machine learning project with a single TPU and scaling out to a TPU Pod for production. To get you started, you can take advantage of a set of open-source reference models that Google's research and engineering teams optimize for use with TPUs. For more information, see Reference models.

How does it work?

To understand how TPUs work, it helps to understand how other accelerators address the computational challenges of training ML models.

How a CPU works

A CPU is a general purpose processor based on the von Neumann architecture. That means a CPU works with software and memory like this:

An illustration of how a CPU works

The greatest benefit of CPUs is their flexibility. You can load any kind of software on a CPU for many different types of applications. For example, you can use a CPU for word processing on a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

For every calculation, a CPU loads values from memory, performs the calculation, and stores the result back in memory. Memory access is slow compared to the calculation speed, which limits the total throughput of CPUs. This is often referred to as the von Neumann bottleneck.
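
As an illustration only (this toy example is not from the page itself), the same load-compute-store pattern looks like this in Python: every loop iteration has to fetch its operands from memory before it can do any arithmetic, which is the von Neumann bottleneck in miniature.

    # Dot product computed one element at a time, the way a single CPU core
    # steps through it: read two operands, multiply, update the running total.
    def dot_product(a, b):
        total = 0.0
        for i in range(len(a)):
            total += a[i] * b[i]  # load a[i] and b[i], multiply, accumulate
        return total

    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0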

How a GPU works

To gain higher throughput, GPUs contain thousands of Arithmetic Logic Units (ALUs) in a single processor. A modern GPU usually contains between 2,500 and 5,000 ALUs. This large number of ALUs means you can execute thousands of multiplications and additions simultaneously.

An illustration of how a GPU works

This GPU architecture works well on applications with massive parallelism, such as matrix operations in a neural network. In fact, on a typical training workload for deep learning, a GPU can provide an order of magnitude higher throughput than a CPU. This is why the GPU is the most popular processor architecture used in deep learning.
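
As a hedged illustration (again not from the original page), the same work can be expressed as a single data-parallel operation. The snippet below uses jax.numpy, which dispatches to whatever accelerator is attached and falls back to the CPU otherwise; on a GPU, the element-wise multiplications inside the dot product are spread across many ALUs and run simultaneously.

    import numpy as np
    import jax.numpy as jnp

    a = jnp.asarray(np.random.rand(4096), dtype=jnp.float32)
    b = jnp.asarray(np.random.rand(4096), dtype=jnp.float32)

    # One vectorized multiply-and-sum instead of a Python loop: the per-element
    # products can be computed in parallel across the accelerator's ALUs.
    result = jnp.dot(a, b)
    print(float(result))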

However, a GPU is still a general-purpose processor that has to support many different applications and software. This means GPUs have the same problem as CPUs: for every calculation in the thousands of ALUs, a GPU must access registers or shared memory to read operands and store the intermediate calculation results.

How a TPU works

Google designed Cloud TPUs as matrix processors specialized for neural network workloads. TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive matrix operations used in neural networks at very fast speeds.

The primary task for TPUs is matrix processing, which is a combination of multiply and accumulate operations. TPUs contain thousands of multiply-accumulators that are directly connected to each other to form a large physical matrix. This is called a systolic array architecture. In the case of Cloud TPU v3, there are two systolic arrays of 128 x 128 ALUs on a single processor.

The TPU host streams data into an infeed queue. The TPU loads the data from the infeed queue and stores it in HBM. When the computation is completed, the TPU loads the results into the outfeed queue. The TPU host then reads the results from the outfeed queue and stores them in the host's memory.
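
In practice the ML framework manages these queues for you. As a rough, framework-level sketch of the same round trip (host memory to device and back), assuming JAX with a TPU runtime attached:

    import numpy as np
    import jax

    # Host-side data, living in the host's memory.
    x_host = np.random.rand(128, 128).astype(np.float32)

    # Transfer the data to the first accelerator device (host -> device memory).
    device = jax.devices()[0]
    x_device = jax.device_put(x_host, device)

    # Compile and run a computation on the device.
    matmul = jax.jit(lambda a: a @ a.T)
    y_device = matmul(x_device)

    # Read the result back into the host's memory (device -> host).
    y_host = np.asarray(y_device)
    print(y_host.shape)  # (128, 128)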

To perform the matrix operations, the TPU loads the parameters from HBM into the Matrix Multiply Unit (MXU).

An illustration of how a TPU loads parameters from memory

Then, the TPU loads the data from HBM. As each multiplication is executed, the result is passed to the next multiply-accumulator, so the output is the summation of all multiplication results between the data and the parameters. No memory access is required during the matrix multiplication process.

An illustration of how a TPU loads data from memory

As a result, TPUs can achieve a high computational throughput on neural network calculations.
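
Purely as an illustrative model (not how you would program a TPU), the multiply-accumulate dataflow can be sketched in Python: each partial sum is handed straight to the next accumulation step, and the result is written out only once at the end.

    # Matrix multiply written as chains of multiply-accumulate steps. On a TPU,
    # each partial sum flows directly to the next multiply-accumulator in the
    # systolic array instead of being written back to memory between steps;
    # this loop only models that dataflow.
    def matmul_mac(params, data):
        n, k, m = len(params), len(params[0]), len(data[0])
        out = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                acc = 0.0  # running partial sum, kept "in flight"
                for p in range(k):
                    acc += params[i][p] * data[p][j]  # multiply-accumulate
                out[i][j] = acc  # written out once, at the end of the chain
        return out

    # 2 x 2 example: [[1, 2], [3, 4]] times [[5, 6], [7, 8]] = [[19, 22], [43, 50]]
    print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))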

When to use TPUs

Cloud TPUs are optimized for specific workloads. In some situations, you might want to use GPUs or CPUs on Compute Engine instances to run your machine learning workloads. In general, you can decide what hardware is best for your workload based on the following guidelines:

CPUs

  • Quick prototyping that requires maximum flexibility
  • Simple models that do not take long to train
  • Small models with small, effective batch sizes
  • Models that are dominated by custom TensorFlow/PyTorch/JAX operations written in C++
  • Models that are limited by available I/O or the networking bandwidth of the host system

GPUs

  • Models with a significant number of custom TensorFlow/PyTorch/JAX operations that must run at least partially on CPUs
  • Models with TensorFlow/PyTorch ops that are not available on Cloud TPU
  • Medium-to-large models with larger effective batch sizes

TPUs

  • Models dominated by matrix computations (a minimal example is sketched after this list)
  • Models with no custom TensorFlow/PyTorch/JAX operations inside the main training loop
  • Models that train for weeks or months
  • Large models with large effective batch sizes
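
As a minimal sketch of such a workload (assuming JAX; the shapes, loss, and learning rate here are made up for illustration), a training step dominated by matrix math, with no custom operations and a large effective batch, looks like this:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def train_step(w, x, y, lr=0.1):
        # Pure jax.numpy matrix math inside the step keeps it TPU-friendly.
        def loss_fn(w):
            pred = x @ w  # dominated by matrix multiplication
            return jnp.mean((pred - y) ** 2)
        grads = jax.grad(loss_fn)(w)
        return w - lr * grads

    key = jax.random.PRNGKey(0)
    w = jax.random.normal(key, (512, 10))
    x = jax.random.normal(key, (1024, 512))  # large effective batch
    y = jax.random.normal(key, (1024, 10))
    w = train_step(w, x, y)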

Cloud TPUs are not suited to the following workloads:

  • Linear algebra programs that require frequent branching or are dominated by element-wise algebra
  • Workloads that access memory in a sparse manner
  • Workloads that require high-precision arithmetic
  • Neural network workloads that contain custom operations in the main training loop

VPC Service Controls integration

VPC Service Controls lets you define security perimeters around your Cloud TPU resources and control the movement of data across the perimeter boundary. To learn more about VPC Service Controls, see VPC Service Controls overview. To learn about the limitations in using Cloud TPU with VPC Service Controls, see supported products and limitations.

Getting started with Cloud TPU

  • Set up a GCP account: Before you can use Cloud TPU resources, you must create a Google Cloud Platform account and project.
  • Activate Cloud TPU APIs: To train a model, you must activate the Compute Engine and Cloud TPU APIs.
  • Grant Cloud TPU access to your Cloud Storage buckets: Cloud TPU requires access to the Cloud Storage bucket(s) where you store your datasets.
  • Train your model: Read one of the Cloud TPU quickstarts or tutorials to get started. (A quick device check is sketched after this list.)
  • Analyze your model: Use TensorBoard or other tools to visualize your model and track key metrics during the model training process, such as learning rate, loss, and accuracy.
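
Once the setup steps are done, one quick sanity check (a sketch, assuming JAX is installed on a Cloud TPU VM) is to list the devices the framework can see; on a working TPU VM the output should include TPU devices rather than only the host CPU.

    import jax

    # List the accelerator devices visible to the framework.
    print(jax.devices())
    print("device count:", jax.device_count())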

What's next?

Looking to learn more about Cloud TPU? The following resources may help.

  • Quickstart using Compute Engine: Try training a model using Cloud TPU with one of our quickstarts.
  • TPU Colabs: Experiment with Cloud TPU using a variety of free Colabs.
  • Cloud TPU Tutorials: Test out Cloud TPU using a variety of ML models.
  • Cloud TPU system architecture: Find more in-depth information about TPUs.
  • Pricing: Get a sense of how Cloud TPU can process your machine learning workloads in a cost-effective manner.
  • Contact sales: Have a specific implementation or application that you want to discuss? Reach out to our sales department.