Cloud TPU beginner's guide

Train machine learning (ML) models cost-effectively and faster using custom application-specific integrated circuits (ASICs) designed by Google.

Tensor Processing Units (TPUs) are ASIC devices designed specifically to handle the computational demands of machine learning applications. The Cloud TPU family of products makes the benefits of TPUs available as a scalable, easy-to-use cloud computing resource for ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models on Google Cloud. With the ability to scale from a single TPU device to a full Cloud TPU v2 Pod, which contains 64 devices, Cloud TPU can provide up to 11.5 petaflops of performance.

Get started with Cloud TPU on Compute Engine

Key capabilities

Accelerate machine learning applications. Cloud TPU and Cloud TPU Pods are built around Google-designed custom ASIC chips built specifically to accelerate deep learning computations.
Scale your application quickly. Start prototyping inexpensively with a single Cloud TPU device (180 teraflops) and then scale up, without code changes, to larger and larger slices of Cloud TPU Pods.
Cost-effectively manage your machine learning workloads. Cloud TPU offers pricing that can significantly reduce the cost of training and running your machine learning models.
Start with well-optimized, open source reference models. Take advantage of an ever-growing set of open source reference models that Google's research and engineering teams publish, optimize, and continuously test, including Mask R-CNN, AmoebaNet, and many other state-of-the-art models.

How does it work?

To better understand how Cloud TPU can benefit you and your machine learning applications, it helps to review how neural networks work within machine learning applications.

Many of the most impressive artificial intelligence (AI) breakthroughs over the past several years have been achieved with so-called deep neural networks. The behavior of these networks is loosely inspired by findings from neuroscience, but the term neural network is now applied to a broad class of mathematical structures that are not constrained to match biological findings. Many of the most popular neural network structures, or architectures, are organized in a hierarchy of layers. The most accurate and useful models tend to contain many layers, which is where they get the term deep. Most of these deep neural networks accept input data, such as images, audio, text, or structured data, apply a series of transformations, and then produce output that can be used to make predictions.

For example, consider a single-layer neural network for recognizing a hand-written digit image, as shown in the following diagram:

A representation of a neural network for digits

In this example, the input image is a grid of 28 x 28 grayscale pixels. As a first step, each image is translated into a string of 784 numerical values, described more formally as a vector with 784 dimensions. In this example, the output neuron that corresponds to the digit 8 directly accepts the input pixels, multiplies them by a set of parameter values known as weights, and passes along the result. There is an individual weight for each of the red lines in the diagram above.
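To make that arithmetic concrete, here is a minimal NumPy sketch of that single output neuron: a 784-element input vector multiplied element-wise by 784 weights and summed. The weight values below are random placeholders for illustration; real weights come from training.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 28 x 28 grayscale image flattened into a 784-dimensional vector.
image = rng.random((28, 28))
x = image.reshape(784)

# One weight per input pixel for the neuron that detects the digit 8
# (random placeholders here; trained values would come from data).
w = rng.standard_normal(784)

# The neuron's response: a weighted sum of all 784 input pixels.
response = np.dot(w, x)
print(response)
```

The single `np.dot` call hides 784 multiplications and 783 additions, which is exactly the kind of workload the rest of this page is about.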

By matching its weights against the input it receives, the neuron is acting as a similarity filter, as illustrated here:

An illustration representing how parameters work

While this is a basic example, it illustrates the core behavior of much more complex neural networks containing many layers that perform many intermediate transformations of the input data they receive. At each layer, the incoming input data, which may have been heavily altered by preceding layers, is matched against the weights of each neuron in the network. Those neurons then pass along their responses as input to the next layer.
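That layered matching-and-passing behavior can be sketched as a tiny two-layer forward pass. The layer sizes and weight values below are illustrative placeholders, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.random(784)                   # flattened 28 x 28 input image

W1 = rng.standard_normal((784, 128))  # layer 1: 128 neurons
W2 = rng.standard_normal((128, 10))   # layer 2: 10 output neurons

# Each layer matches its incoming data against its neurons' weights,
# then passes its responses along as input to the next layer.
h = np.maximum(0.0, x @ W1)  # ReLU keeps only positive responses
scores = h @ W2              # one score per digit, 0 through 9
print(scores.shape)
```

Deep networks simply stack many more of these matrix multiplications, which is why matrix hardware matters so much for them.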

How are the weights for every neuron calculated? Through a training process. This process often involves processing very large labeled datasets over and over; these datasets may contain millions or even billions of labeled examples. Training state-of-the-art ML models on large datasets can take weeks, even on powerful hardware. Google designed and built TPUs to increase productivity by making it possible to complete massive computational workloads like these in minutes or hours instead of weeks.
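As a rough illustration of what "training" means here, the sketch below fits a single neuron's weights by gradient descent on a tiny synthetic labeled dataset. Every name and number is illustrative; real training uses a framework such as TensorFlow on far larger datasets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny synthetic labeled dataset: 256 "images" of 784 pixels each,
# with binary labels derived from a hidden true weight vector.
X = rng.random((256, 784))
true_w = rng.standard_normal(784)
y = (X @ true_w > 0).astype(float)

w = np.zeros(784)  # start from untrained weights
lr = 0.1

for step in range(500):
    logits = X @ w
    preds = 1.0 / (1.0 + np.exp(-logits))  # sigmoid activation
    grad = X.T @ (preds - y) / len(y)      # gradient of cross-entropy loss
    w -= lr * grad                         # gradient descent update

accuracy = ((X @ w > 0) == (y == 1)).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Each training step is dominated by the matrix products `X @ w` and `X.T @ (...)`; repeated over billions of examples, those products are the "massive computational workload" TPUs are built to accelerate.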

How a CPU works

The last section provided a working definition of neural networks and the type of computations they involve. To understand the TPU's role in these networks, it helps to understand how other hardware devices address these computational challenges. To start, consider the CPU.

The CPU is a general purpose processor based on the von Neumann architecture. That means a CPU works with software and memory, like this:

An illustration of how a CPU works

The greatest benefit of CPU is its flexibility. With its von Neumann architecture, you can load any kind of software for millions of different applications. You could use a CPU for word processing in a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

But because the CPU is so flexible, the hardware doesn't know what the next calculation is until it reads the next instruction from the software. A CPU has to store the result of every single calculation in its registers or L1 cache. This memory access is the downside of the CPU architecture, known as the von Neumann bottleneck, and it is especially wasteful for neural networks, where the huge scale of the calculations makes the future steps entirely predictable. A CPU's Arithmetic Logic Units (ALUs), the components that hold and control multipliers and adders, can execute only one calculation at a time, and each calculation requires a memory access, which limits the total throughput and consumes significant energy.
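This sequential style can be pictured as a scalar loop: one multiply and one add per step, with operands read from memory and the result stored back each time. This is a schematic of the execution model, not a cycle-accurate description of any CPU:

```python
def dot_product_scalar(xs, ws):
    """Von Neumann-style dot product: one multiply-add at a time."""
    total = 0.0
    for i in range(len(xs)):
        a = xs[i]         # read an input value from memory
        b = ws[i]         # read a weight from memory
        total += a * b    # one multiply and one add per step;
                          # the running result is stored back each time
    return total

print(dot_product_scalar([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

For the 784-pixel example above, this loop would make 784 trips through the read-compute-store cycle, one after another.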

How a GPU works

To gain higher throughput than a CPU, a GPU uses a simple strategy: employ thousands of ALUs in a single processor. In fact, a modern GPU usually has between 2,500 and 5,000 ALUs in a single processor. This large number of ALUs means you can execute thousands of multiplications and additions simultaneously.

An illustration of how a GPU works

This GPU architecture works well on applications with massive parallelism, such as matrix multiplication in a neural network. In fact, on a typical training workload for deep learning, a GPU can provide an order of magnitude higher throughput than a CPU. This is why the GPU is the most popular processor architecture used in deep learning.

But the GPU is still a general purpose processor that has to support millions of different applications and software. This means GPUs have the same problem as CPUs: the von Neumann bottleneck. For every single calculation in the thousands of ALUs, a GPU must access registers or shared memory to read and store the intermediate calculation results. Because the GPU performs more parallel calculations on its thousands of ALUs, it also spends proportionally more energy accessing memory, and the complex wiring required increases the GPU's physical footprint.
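NumPy's vectorized operations give a feel for this data-parallel style: a whole matrix multiplication is expressed as a single call, which the underlying library (or, on a GPU, the hardware) can fan out across many arithmetic units instead of an explicit element-by-element loop. The shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
inputs = rng.random((512, 784))    # a batch of 512 flattened images
weights = rng.random((784, 10))    # weights for 10 output neurons

# One expression describes 512 x 784 x 10 multiply-adds; parallel
# hardware can execute large groups of them simultaneously.
outputs = inputs @ weights
print(outputs.shape)  # (512, 10)
```

Note that the intermediate results still travel through registers and shared memory between operations, which is the bottleneck described above.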

How a TPU works

Google designed Cloud TPUs as a matrix processor specialized for neural network workloads. TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive multiplications and additions for neural networks at very fast speeds while consuming much less power and inside a smaller physical footprint.

One benefit TPUs have over other devices is a major reduction of the von Neumann bottleneck. Because the primary task for this processor is matrix processing, the TPU's hardware designers knew every calculation step required to perform that operation. They were therefore able to place thousands of multipliers and adders and connect them directly to each other to form a large physical matrix of those operators. This is called a systolic array architecture. In the case of Cloud TPU v2, there are two systolic arrays of 128 x 128, aggregating 32,768 ALUs for 16-bit floating point values in a single processor.

Let's see how a systolic array executes the neural network calculations. At first, the TPU loads the parameters from memory into the matrix of multipliers and adders.

An illustration of how a TPU loads parameters from memory

Then, the TPU loads data from memory. As each multiplication is executed, its result is passed along to the next multiplier while a running summation is taken at the same time. The output is therefore the sum of all the multiplication results between the data and parameters. Throughout this entire process of massive calculations and data passing, no memory access is required at all.

An illustration of how a TPU loads data from memory

As a result, TPUs can achieve a high computational throughput on neural network calculations with much less power consumption and smaller footprint.
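To show the flavor of that dataflow, here is a small pure-Python sketch of a weight-stationary systolic matrix-vector product: each cell holds one pre-loaded weight, inputs stream across the rows, and partial sums flow down the columns without ever being written back to memory. This is a simplified model of the dataflow, not the actual TPU microarchitecture:

```python
def systolic_matvec(x, W):
    """Weight-stationary systolic sketch of y = x @ W.

    Cell (i, j) holds weight W[i][j]. Input x[i] streams across row i,
    and each column's partial sum flows from cell to cell, picking up
    one multiply-add at each stop; the result emerges at the bottom.
    """
    rows, cols = len(W), len(W[0])
    y = [0.0] * cols
    for j in range(cols):                  # each column of cells
        partial = 0.0                      # partial sum entering the column
        for i in range(rows):              # flows down through the cells
            partial += x[i] * W[i][j]      # multiply, add, pass to neighbor
        y[j] = partial                     # output at the bottom of the column
    return y

x = [1.0, 2.0]
W = [[1.0, 2.0],
     [3.0, 4.0]]
print(systolic_matvec(x, W))  # [7.0, 10.0]
```

In real hardware all columns (and the diagonal wavefronts within them) operate concurrently, so the intermediate `partial` values never round-trip through memory at all.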

Implementation path

Choose a TPU service. Select the Google Cloud Platform service that you want to use to launch and manage Cloud TPUs. You can select from Compute Engine, Google Kubernetes Engine, or AI Platform.
Choose a TPU version. Cloud TPUs come in a couple of different versions. Choose the one that has the best combination of power and cost effectiveness.
Choose between using TPU devices or pods. Cloud TPUs scale to meet the needs of your machine learning application. Select from using a single device slice or all the way up to a full pod.
Configure your tools. Use ctpu, a powerful command line tool for managing Compute Engine and TPUs simultaneously. You can also use tools you may already be familiar with, such as TensorBoard.
Store your data. Easily store your data using Cloud Storage.
Run your model. Use one of the many tutorials we've created to select a machine learning model that best fits your application.

Looking for other machine learning services?

Cloud TPU is one of many machine learning services available in the Google Cloud Platform. Other resources that you might find helpful include:

Cloud Video Intelligence Cloud Video Intelligence makes videos searchable and discoverable by extracting metadata, identifying key nouns, and annotating the content of the video. By calling an easy-to-use REST API, you can now search every moment of every video file in your catalog and find each occurrence of key nouns as well as its significance. Separate signal from noise by retrieving relevant information by video, shot, or frame.
Cloud AutoML Vision Cloud AutoML Vision enables you to derive insight from your images with our powerful pretrained API models or easily train custom vision models with Cloud AutoML Vision Beta. The API quickly classifies images into thousands of categories (such as "sailboat" or "Eiffel Tower"), detects individual objects and faces within images, and finds and reads printed words contained within images. Cloud AutoML Vision lets you build and train custom ML models with minimal ML expertise to meet domain-specific business needs.
Cloud Speech-to-Text Cloud Speech-to-Text enables developers to convert audio to text by applying neural network models in an easy-to-use API. The API recognizes 120 languages and variants, to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or prerecorded audio, using Google's machine learning technology.
Cloud Text-to-Speech Cloud Text-to-Speech enables developers to synthesize natural-sounding speech with 32 voices, available in multiple languages and variants. It applies DeepMind's groundbreaking research in WaveNet and Google's neural networks to deliver the highest fidelity possible. With this easy-to-use API, you can create lifelike interactions with your users, across many applications and devices.
Cloud Natural Language Cloud Natural Language reveals the structure and meaning of text by offering powerful machine learning models in an easy-to-use REST API. And with Cloud AutoML Natural Language Beta you can build and train ML models easily, without extensive ML expertise. You can use Cloud Natural Language to extract information about people, places, events, and much more mentioned in text documents, news articles, or blog posts. You can also use it to understand sentiment about your product on social media or parse intent from customer conversations happening in a call center or a messaging app.
Cloud Translation Cloud Translation provides a simple programmatic interface for translating an arbitrary string into any supported language. The Cloud Translation API is highly responsive, so websites and applications can integrate with it for fast, dynamic translation of source text from the source language to a target language (e.g., French to English). In addition to the API, you can also use Cloud AutoML Translation Beta to quickly and easily build and train high-quality models that are specific to your project or domain.

For even more options, check out the Cloud AI Products page.

What's next?

Looking to learn more about Cloud TPU? The following resources may help.

Quickstart using Compute Engine Take a few minutes to learn how to set up and use Cloud TPU using Google Cloud Platform.
TPU Colabs Experiment with Cloud TPU using a variety of free Colabs.
Cloud TPU Tutorials Test out Cloud TPU using a variety of ML models.
Pricing Get a sense of how Cloud TPU can process your machine learning workloads in a cost-effective manner.
Contact sales Have a specific implementation or application that you want to discuss? Reach out to our sales department.