# Cloud TPU beginner's guide

Train machine learning (ML) models faster and more cost-effectively using custom application-specific integrated circuits (ASICs) designed by Google.

Tensor Processing Units (TPUs) are ASIC devices designed specifically to handle the computational demands of machine learning applications. The Cloud TPU family of products makes the benefits of TPUs available as a scalable and easy-to-use cloud computing resource for all ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models on Google Cloud. With the ability to scale from a TPU v2 node with 8 cores to a full TPU v3 node with 2048 cores, Cloud TPU can provide over 100 petaflops of performance.

## How does it work?

To better understand how Cloud TPU can benefit you and your machine learning applications, it helps to review how neural networks work within machine learning applications.

Many of the most impressive artificial intelligence (AI) breakthroughs over the past several years have been achieved with so-called deep neural networks. The behavior of these networks is loosely inspired by findings from neuroscience, but the term neural network is now applied to a broad class of mathematical structures that are not constrained to match biological findings. Many of the most popular neural network structures, or architectures, are organized in a hierarchy of layers. The most accurate and useful models tend to contain many layers, which is where they get the term deep. Most of these deep neural networks accept input data, such as images, audio, text, or structured data, apply a series of transformations, and then produce output that can be used to make predictions.

For example, consider a single-layer neural network for recognizing a hand-written digit image, as shown in the following diagram:

In this example, the input image is a grid of 28 x 28 grayscale pixels. As a first step, each image is translated into a string of 784 numerical values, described more formally as a vector with 784 dimensions. In this example, the output neuron that corresponds to the digit `8` directly accepts the input pixels, multiplies them by a set of parameter values known as weights, and passes along the result. There is an individual weight for each of the red lines in the diagram above.
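The flatten-and-score computation described above can be sketched in a few lines of NumPy. This is a hedged toy sketch, not Cloud TPU code: the random image and weights are stand-ins for a real hand-written digit and trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a 28 x 28 grayscale hand-written digit image.
image = rng.random((28, 28))
pixels = image.reshape(784)   # flatten into a 784-dimensional vector

# One weight per red line in the diagram: a parameter for every pixel.
weights = rng.random(784)

# The output neuron multiplies each pixel by its weight and sums the
# results, producing a single score for the digit 8.
score = pixels @ weights
print(score)
```

The single dot product here is exactly the multiply-and-sum operation that the neuron performs on its inputs.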

By matching its weights against the input it receives, the neuron is acting as a similarity filter, as illustrated here:

While this is a basic example, it illustrates the core behavior of much more complex neural networks containing many layers that perform many intermediate transformations of the input data they receive. At each layer, the incoming input data, which may have been heavily altered by preceding layers, is matched against the weights of each neuron in the network. Those neurons then pass along their responses as input to the next layer.
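The layer-by-layer behavior described above can be modeled as a chain of matrix products. The layer sizes below (784 inputs, a 128-unit hidden layer, 10 digit scores) and the ReLU activation are illustrative assumptions, not taken from the diagram.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical layer sizes: 784 inputs -> 128 hidden units -> 10 digits.
w1 = rng.standard_normal((784, 128))
w2 = rng.standard_normal((128, 10))

def forward(x):
    # Each layer matches its incoming data against its weights, applies a
    # non-linear transformation, and passes the responses to the next layer.
    hidden = np.maximum(x @ w1, 0.0)   # ReLU activation
    return hidden @ w2                 # raw scores for digits 0-9

x = rng.random(784)        # a flattened input image
scores = forward(x)
print(scores.shape)        # one score per digit class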

How are these weights determined for every neuron? That takes place through a training process. This process often involves processing very large labeled datasets over and over. These datasets may contain millions or even billions of labeled examples! Training state-of-the-art ML models on large datasets can take weeks even on powerful hardware. Google designed and built TPUs to increase productivity by making it possible to complete massive computational workloads like these in minutes or hours instead of weeks.
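The training loop that sets the weights can be sketched with plain gradient descent. This is a minimal toy example, assuming a linear model and a synthetic labeled dataset generated from a known rule; real training pipelines use far larger datasets and models.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy labeled dataset: 100 examples whose targets follow a known rule.
x = rng.random((100, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w

w = np.zeros(4)    # weights start untrained
lr = 0.5           # learning rate

# Training repeatedly processes the labeled examples, nudging the
# weights to reduce the prediction error a little each pass.
for _ in range(1000):
    pred = x @ w
    grad = x.T @ (pred - y) / len(x)   # gradient of mean squared error
    w -= lr * grad

print(w)   # approaches true_w as training converges
```

Each pass over the data is one round of the "over and over" processing described above; at production scale those passes dominate the computational cost.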

### How a CPU works

The last section provided a working definition of neural networks and the type of computations they involve. To understand the TPU's role in these networks, it helps to understand how other hardware devices address these computational challenges. To start, consider the CPU.

The CPU is a general purpose processor based on the von Neumann architecture. That means a CPU works with software and memory, like this:

The greatest benefit of the CPU is its flexibility. With its von Neumann architecture, you can load any kind of software for millions of different applications. You could use a CPU for word processing in a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

But because the CPU is so flexible, the hardware doesn't know what the next calculation will be until it reads the next instruction from the software. A CPU has to store the result of every single calculation in its registers or L1 cache. This memory access is the downside of the CPU architecture, known as the von Neumann bottleneck. Even though the huge scale of neural network calculations makes the future steps entirely predictable, a CPU cannot exploit that predictability: its Arithmetic Logic Units (ALUs), the components that hold and control multipliers and adders, execute only one calculation at a time. Each time, the CPU has to access memory, which limits the total throughput and consumes significant energy.
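The one-operation-at-a-time pattern can be made concrete with a sequential multiply-accumulate loop. This is an illustrative model of the execution style, not of any particular CPU:

```python
# Sequential multiply-accumulate, as a single ALU would execute it:
# one multiplication, then one addition, with the running total stored
# back (here, in a variable) before the next step can begin.
def dot_sequential(xs, ws):
    total = 0.0
    for x, w in zip(xs, ws):
        product = x * w    # one multiplication at a time
        total += product   # one addition, then store the result
    return total

print(dot_sequential([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

For a neural network with millions of weights, this store-then-fetch cycle repeats for every single multiply-add.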

### How a GPU works

To gain higher throughput than a CPU, a GPU uses a simple strategy: employ thousands of ALUs in a single processor. In fact, a modern GPU usually has between 2,500 and 5,000 ALUs in a single processor. This large number of ALUs means a GPU can execute thousands of multiplications and additions simultaneously.

This GPU architecture works well on applications with massive parallelism, such as matrix multiplication in a neural network. In fact, on a typical training workload for deep learning, a GPU can provide an order of magnitude higher throughput than a CPU. This is why the GPU is the most popular processor architecture used in deep learning.
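The kind of massively parallel workload a GPU excels at is visible even in NumPy, where a single matrix-multiplication call dispatches all the multiply-adds as one batch. The sizes below are arbitrary; on a GPU the same operation would be spread across thousands of ALUs at once.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.random((256, 512))
b = rng.random((512, 128))

# One call expresses 256 * 128 independent dot products (each of length
# 512) -- exactly the kind of parallelism thousands of ALUs can exploit.
c = a @ b
print(c.shape)
```

Every output element of `c` can be computed independently of the others, which is why this operation parallelizes so well.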

But the GPU is still a general purpose processor that has to support millions of different applications and software. This means GPUs have the same problem as CPUs: the von Neumann bottleneck. For every single calculation in the thousands of ALUs, a GPU must access registers or shared memory to read and store the intermediate calculation results. Because the GPU performs more parallel calculations on its thousands of ALUs, it also spends proportionally more energy accessing memory, and the complex wiring required increases the GPU's physical footprint.

### How a TPU works

Google designed Cloud TPUs as a matrix processor specialized for neural network workloads. TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive multiplications and additions for neural networks at very fast speeds while consuming much less power within a smaller physical footprint.

One benefit TPUs have over other devices is a major reduction of the von Neumann bottleneck. Because the primary task for this processor is matrix processing, the TPU's hardware designers knew every calculation step required to perform that operation. They were therefore able to place thousands of multipliers and adders and connect them to each other directly to form a large physical matrix of those operators. This is called a systolic array architecture. In the case of Cloud TPU v2, there are two systolic arrays of 128 x 128, aggregating 32,768 ALUs for 16-bit floating point values in a single processor.

Let's see how a systolic array executes the neural network calculations. At first, the TPU loads the parameters from memory into the matrix of multipliers and adders.

Then, the TPU loads data from memory. As each multiplication is executed, its result is passed directly to the next multiplier while the summation is taken at the same time. The output is therefore the summation of all multiplication results between the data and the parameters. During this entire process of massive calculations and data passing, no memory access is required at all.
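The dataflow above can be modeled with a small functional sketch of a weight-stationary systolic array computing a matrix-vector product. This models only the multiply-accumulate chaining (a cycle-accurate model would also skew the inputs in time as they flow through the array); it is an illustration, not TPU code.

```python
# Functional sketch of a weight-stationary systolic array computing
# y = W @ x. The weights are preloaded into the cells once; each cell
# multiplies the input flowing past it, adds the partial sum arriving
# from its neighbor, and hands the new partial sum directly to the next
# cell -- no intermediate result ever returns to memory.
def systolic_matvec(W, x):
    outputs = []
    for row in W:
        partial_sum = 0.0
        for w, xj in zip(row, x):
            partial_sum = partial_sum + w * xj  # multiply, add, pass along
        outputs.append(partial_sum)             # final sum exits the array
    return outputs

W = [[1.0, 2.0], [3.0, 4.0]]   # parameters, loaded once
x = [5.0, 6.0]                 # data streamed through the array
print(systolic_matvec(W, x))   # [17.0, 39.0]
```

The key property is that `partial_sum` travels from cell to cell instead of being written back to memory after every multiply-add, which is what removes the von Neumann bottleneck for this operation.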

As a result, TPUs can achieve high computational throughput on neural network calculations with much less power consumption and a smaller footprint.

## Looking for other machine learning services?

Cloud TPU is one of many machine learning services available on Google Cloud. Other resources that you might find helpful include: