Cloud TPU beginner's guide

Train machine learning (ML) models faster and more cost-effectively using custom application-specific integrated circuits (ASICs) designed by Google.

Tensor Processing Units (TPUs) are ASIC devices designed specifically to handle the computational demands of machine learning applications. The Cloud TPU family of products makes the benefits of TPUs available as a scalable, easy-to-use cloud computing resource for all ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models on Google Cloud. With the ability to scale from a TPU v2 node with 8 cores to a full TPU v3 node with 2048 cores, Cloud TPU can provide over 100 petaflops of performance.

Get started with Cloud TPU on Compute Engine

Key capabilities

Accelerate machine learning applications: Cloud TPUs are built around Google-designed custom ASIC chips specifically built to accelerate deep learning computations.
Scale your application quickly: Start prototyping inexpensively with a single Cloud TPU device (180 teraflops) and then scale up without code changes on larger Cloud TPU nodes.
Cost-effectively manage your machine learning workloads: Cloud TPU offers pricing that can significantly reduce the cost of training and running your machine learning models.
Start with well-optimized, open source reference models: Take advantage of an ever-growing set of open source reference models that Google's research and engineering teams publish, optimize, and continuously test, including Mask R-CNN, AmoebaNet, and many other state-of-the-art models.

How does it work?

To better understand how Cloud TPU can benefit you and your machine learning applications, it helps to review how neural networks work within machine learning applications.

Many of the most impressive artificial intelligence (AI) breakthroughs over the past several years have been achieved with so-called deep neural networks. The behavior of these networks is loosely inspired by findings from neuroscience, but the term neural network is now applied to a broad class of mathematical structures that are not constrained to match biological findings. Many of the most popular neural network structures, or architectures, are organized in a hierarchy of layers. The most accurate and useful models tend to contain many layers, which is where they get the term deep. Most of these deep neural networks accept input data, such as images, audio, text, or structured data, apply a series of transformations, and then produce output that can be used to make predictions.

For example, consider a single-layer neural network for recognizing a hand-written digit image, as shown in the following diagram:

A representation of a neural network for digits

In this example, the input image is a grid of 28 x 28 grayscale pixels. As a first step, each image is translated into a string of 784 numerical values, described more formally as a vector with 784 dimensions. In this example, the output neuron that corresponds to the digit 8 directly accepts the input pixels, multiplies them by a set of parameter values known as weights, and passes along the result. There is an individual weight for each of the red lines in the diagram above.
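The weighted-sum step described above can be sketched in a few lines of NumPy. The pixel values and weights here are random placeholders for illustration, not a trained model:

```python
import numpy as np

# Hypothetical single output neuron for the digit 8: the flattened
# 28 x 28 image is multiplied element-wise by one weight per pixel
# (one weight per red line in the diagram) and summed.
pixels = np.random.rand(784)          # flattened 28 x 28 grayscale image
weights_8 = np.random.rand(784)       # this neuron's weights (untrained)
response = np.dot(pixels, weights_8)  # weighted sum passed along
```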

By matching its weights against the input it receives, the neuron is acting as a similarity filter, as illustrated here:

An illustration representing how parameters work

While this is a basic example, it illustrates the core behavior of much more complex neural networks containing many layers that perform many intermediate transformations of the input data they receive. At each layer, the incoming input data, which may have been heavily altered by preceding layers, is matched against the weights of each neuron in the network. Those neurons then pass along their responses as input to the next layer.
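As a minimal sketch of that layered matching, here is a forward pass through two layers of random weights. The layer shapes and the simple nonlinearity are illustrative assumptions, not any specific production model:

```python
import numpy as np

def forward(x, layer_weights):
    """Match the input against each layer's weights and pass the
    responses along as input to the next layer."""
    for W in layer_weights:
        x = np.maximum(x @ W, 0.0)  # weighted sums, then a simple nonlinearity
    return x

# Two illustrative layers: 784 pixels -> 128 hidden units -> 10 digit scores
layers = [np.random.rand(784, 128), np.random.rand(128, 10)]
scores = forward(np.random.rand(784), layers)
```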

How are these weights determined for every neuron? That takes place through a training process, which often involves processing very large labeled datasets over and over. These datasets may contain millions or even billions of labeled examples! Training state-of-the-art ML models on large datasets can take weeks even on powerful hardware. Google designed and built TPUs to increase productivity by making it possible to complete massive computational workloads like these in minutes or hours instead of weeks.
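To make the training process concrete, here is a toy gradient-descent loop on synthetic labeled data, assuming a plain linear model with squared-error loss; real training runs use far larger datasets and far more complex models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 784))   # 256 labeled examples, 784 features each
true_w = rng.normal(size=784)
y = X @ true_w                    # labels for the synthetic task

w = np.zeros(784)                 # weights to be learned
lr = 0.001                        # learning rate
for step in range(200):           # process the dataset over and over
    grad = X.T @ (X @ w - y) / len(X)  # gradient of mean squared error
    w -= lr * grad                     # nudge weights downhill
```

Each pass over the data is dominated by matrix multiplications like `X @ w`, which is exactly the kind of work TPUs are built to accelerate.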

How a CPU works

The last section provided a working definition of neural networks and the type of computations they involve. To understand the TPU's role in these networks, it helps to understand how other hardware devices address these computational challenges. To start, consider the CPU.

The CPU is a general purpose processor based on the von Neumann architecture. That means a CPU works with software and memory, like this:

An illustration of how a CPU works

The greatest benefit of the CPU is its flexibility. With its von Neumann architecture, you can load any kind of software for millions of different applications. You could use a CPU for word processing on a PC, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

But because the CPU is so flexible, the hardware doesn't know what the next calculation is until it reads the next instruction from the software. A CPU has to store the calculation results in registers or L1 cache for every single calculation. This memory access is the downside of the CPU architecture, called the von Neumann bottleneck. Even though the huge scale of neural network calculations means that future steps are entirely predictable, each CPU's Arithmetic Logic Units (ALUs), the components that hold and control multipliers and adders, can execute only one calculation at a time. Each time, the CPU has to access memory, which limits the total throughput and consumes significant energy.
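A sequential dot product mimics this one-operation-at-a-time behavior: every loop iteration stands in for one ALU step, with the running result stored before the next step can begin.

```python
def dot_product_sequential(data, weights):
    """CPU-style computation: one multiply-add at a time, storing the
    intermediate result after every single calculation."""
    total = 0.0
    for x, w in zip(data, weights):
        total += x * w  # one ALU operation, then a write-back
    return total
```

For example, `dot_product_sequential([1, 2, 3], [4, 5, 6])` performs three separate multiply-add steps to compute 1*4 + 2*5 + 3*6.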

How a GPU works

To gain higher throughput than a CPU, a GPU uses a simple strategy: employ thousands of ALUs in a single processor. A modern GPU usually has between 2,500 and 5,000 ALUs, which means it can execute thousands of multiplications and additions simultaneously.

An illustration of how a GPU works

This GPU architecture works well on applications with massive parallelism, such as matrix multiplication in a neural network. In fact, on a typical training workload for deep learning, a GPU can provide an order of magnitude higher throughput than a CPU. This is why the GPU is the most popular processor architecture used in deep learning.

But the GPU is still a general purpose processor that has to support millions of different applications and software. This means GPUs have the same problem as CPUs: the von Neumann bottleneck. For every single calculation in the thousands of ALUs, a GPU must access registers or shared memory to read and store intermediate results. And because the GPU performs more parallel calculations on its thousands of ALUs, it also spends proportionally more energy accessing memory, and the complex wiring required increases the GPU's physical footprint.
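The contrast with the sequential CPU style is visible even from Python: a library such as NumPy dispatches a whole matrix multiplication as one bulk operation rather than one multiply-add at a time. The batch and layer sizes below are illustrative:

```python
import numpy as np

batch = np.random.rand(64, 784)    # 64 flattened input images
weights = np.random.rand(784, 10)  # one layer's weights
out = batch @ weights              # all 64 x 10 weighted sums in one call
```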

How a TPU works

Google designed Cloud TPUs as matrix processors specialized for neural network workloads. TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive multiplications and additions for neural networks at very high speeds while consuming much less power within a smaller physical footprint.

One benefit TPUs have over other devices is a major reduction of the von Neumann bottleneck. Because the primary task of this processor is matrix processing, the TPU's hardware designers knew every calculation step needed to perform that operation. They were therefore able to place thousands of multipliers and adders and connect them directly to each other to form a large physical matrix of those operators. This is called a systolic array architecture. In the case of Cloud TPU v2, there are two systolic arrays of 128 x 128, aggregating 32,768 ALUs for 16-bit floating point values in a single processor.

Let's see how a systolic array executes the neural network calculations. At first, the TPU loads the parameters from memory into the matrix of multipliers and adders.

An illustration of how a TPU loads parameters from memory

Then, the TPU loads data from memory. As each multiplication is executed, the result is passed to the next multiplier while a running sum is taken at the same time. The output is therefore the summation of all multiplication results between the data and parameters. No memory access is required during this entire process of massive calculations and data passing.

An illustration of how a TPU loads data from memory

As a result, TPUs can achieve a high computational throughput on neural network calculations with much less power consumption and smaller footprint.
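A toy simulation can illustrate the data flow described above, with a hypothetical 3 x 3 array standing in for the 128 x 128 systolic arrays: each cell holds one preloaded weight, inputs stream across the columns, and partial sums accumulate as they flow through, with no intermediate result written back to memory.

```python
import numpy as np

# Toy model of a 3 x 3 systolic array computing y = W @ x.
W = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])  # weights preloaded, one per cell
x = np.array([1., 0., 2.])    # input data streamed in

y = np.zeros(3)               # partial sums flowing through the array
for col in range(3):          # x[col] streams into column `col`
    for row in range(3):      # each cell multiplies and passes the sum on
        y[row] += W[row, col] * x[col]

# y now matches W @ x, built entirely from multiply-and-pass-along steps
```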

Cloud TPU Prerequisites

Set up a GCP account: Before you can use Cloud TPU resources to train models, you must have a Google Cloud (GCP) account and project set up.
Activate Cloud TPU APIs: To train a model, you must activate the Compute Engine and Cloud TPU APIs.
Grant Cloud TPU access to your Cloud Storage buckets: Cloud Storage buckets allow you to store your datasets.
Choose a TPU service: Select the Google Cloud service that you want to use to launch and manage Cloud TPUs. You can select from Compute Engine, Google Kubernetes Engine, or AI Platform.
Choose a TPU type: Cloud TPUs come in several different types. Choose the one that has the best combination of power and cost effectiveness. You can save money by using preemptible TPUs. For more information, see Using Preemptible TPUs.
Create and delete Cloud TPUs: Running a machine learning (ML) model on Cloud TPUs requires a Compute Engine VM and Cloud TPU resources.
Run your model: Use one of the many tutorials we've created to select a machine learning model that best fits your application.
Analyze your model: Use TensorBoard or other tools to visualize your model and track key metrics during training, such as learning rate, loss, and accuracy.

Looking for other machine learning services?

Cloud TPU is one of many machine learning services available on Google Cloud. Other resources that you might find helpful include:

Video Intelligence API: Video Intelligence API makes videos searchable and discoverable by extracting metadata, identifying key nouns, and annotating the content of the video. By calling an easy-to-use REST API, you can search every moment of every video file in your catalog and find each occurrence of key nouns as well as their significance. Separate signal from noise by retrieving relevant information at the video, shot, or frame level.
Cloud Vision: Cloud Vision enables you to derive insight from your images with our powerful pretrained API models or easily train custom vision models with AutoML Vision Beta. The API quickly classifies images into thousands of categories (such as "sailboat" or "Eiffel Tower"), detects individual objects and faces within images, and finds and reads printed words contained within images. AutoML Vision lets you build and train custom ML models with minimal ML expertise to meet domain-specific business needs.
Speech-to-Text: Speech-to-Text enables developers to convert audio to text by applying neural network models in an easy-to-use API. The API recognizes 120 languages and variants to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or prerecorded audio, using Google's machine learning technology.
Text-to-Speech: Text-to-Speech enables developers to synthesize natural-sounding speech with 32 voices, available in multiple languages and variants. It applies DeepMind's groundbreaking research in WaveNet and Google's neural networks to deliver the highest fidelity possible. With this easy-to-use API, you can create lifelike interactions with your users across many applications and devices.
Cloud Natural Language API: Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy-to-use REST API. And with AutoML Natural Language Beta, you can build and train ML models easily, without extensive ML expertise. You can use Cloud Natural Language API to extract information about people, places, events, and much more mentioned in text documents, news articles, or blog posts. You can also use it to understand sentiment about your product on social media or parse intent from customer conversations happening in a call center or a messaging app.
Cloud Translation: Cloud Translation provides a simple programmatic interface for translating an arbitrary string into any supported language. The Cloud Translation API is highly responsive, so websites and applications can integrate with it for fast, dynamic translation of source text from a source language to a target language (e.g., French to English). In addition to the API, you can also use AutoML Translation Beta to quickly and easily build and train high-quality models that are specific to your project or domain.

For even more options, check out the Cloud AI Products page.

What's next?

Looking to learn more about Cloud TPU? The following resources may help.

Quickstart using Compute Engine: Take a few minutes to learn how to set up and use Cloud TPU using Google Cloud.
TPU Colabs: Experiment with Cloud TPU using a variety of free Colabs.
Cloud TPU Tutorials: Test out Cloud TPU using a variety of ML models.
Pricing: Get a sense of how Cloud TPU can process your machine learning workloads in a cost-effective manner.
Contact sales: Have a specific implementation or application that you want to discuss? Reach out to our sales department.