This document covers the architecture of the Cloud TPU system.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. These TPUs are designed from the ground up with the benefit of Google's deep experience and leadership in machine learning.
Cloud TPU enables you to run machine learning workloads on Google's TPU accelerator hardware using TensorFlow. Cloud TPU is designed for maximum performance and flexibility to help researchers, developers, and businesses build TensorFlow compute clusters that can use CPUs, GPUs, and TPUs. High-level TensorFlow APIs make it easy to run replicated models on the Cloud TPU hardware.
Cloud TPU is accessible to user programs expressed in TensorFlow on a Compute Engine VM. When the program is run, a TensorFlow computation graph is generated and sent to the Cloud TPU over gRPC. The Cloud TPU server compiles the computation graph just in time and sends the program binary to the Cloud TPU for execution. Inputs to the model are often stored in Cloud Storage. These inputs are fed directly to the Cloud TPU server, which then streams them to a Cloud TPU for consumption.
Cloud TPU chips are interconnected, so communication between chips does not have to involve the host CPU or host networking. The APIs used to program Cloud TPU can take advantage of Cloud TPU Pods without code changes. As a result, it is easy to scale up to massive compute clusters. The hardware support built into the chips results in effectively linear performance scaling across a broad range of deep learning workloads.
The system architecture described above is provided for informational purposes and to inform the high-level design of your programs and models. In practice, the Cloud TPU software stack removes the complexity of generating, running, and feeding Cloud TPU programs. The next section describes the Cloud TPU software stack.
The block diagram below shows the Cloud TPU software architecture, consisting of the neural network model, TPU Estimator and TensorFlow client, TensorFlow server and XLA compiler.
TPU Estimators are a set of high-level APIs that build upon Estimators. They simplify building models for Cloud TPU and extract maximum TPU performance. When writing a neural network model that uses Cloud TPU, you should use the TPU Estimator APIs.
TPU Estimators translate your programs into TensorFlow operations, which are then converted into a computational graph by a TensorFlow client. A TensorFlow client communicates the computational graph to a TensorFlow server.
A TensorFlow server runs on a Cloud TPU server. When the server receives a computational graph from the TensorFlow client, the server performs the following actions:
- loads inputs from Cloud Storage.
- partitions the graph into portions that can run on a Cloud TPU and those that must run on a CPU.
- generates XLA operations corresponding to the sub-graph that is to run on Cloud TPU.
- invokes the XLA compiler.
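The partition, lower, and compile steps listed above can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Op` class, the `TPU_SUPPORTED` set, and the `partition_graph` and `handle_graph` functions are hypothetical names invented here, not part of TensorFlow or XLA, and the real server's placement rules are not described in this document.

```python
# Hypothetical sketch: none of these names come from TensorFlow or XLA.
from dataclasses import dataclass

# Assumed set of op types the accelerator can run; purely illustrative.
TPU_SUPPORTED = {"MatMul", "Conv2D", "Relu", "Add"}


@dataclass(frozen=True)
class Op:
    name: str
    op_type: str


def partition_graph(ops):
    """Split ops into those that can run on the TPU and those left on the CPU."""
    tpu_ops = [op for op in ops if op.op_type in TPU_SUPPORTED]
    cpu_ops = [op for op in ops if op.op_type not in TPU_SUPPORTED]
    return tpu_ops, cpu_ops


def handle_graph(ops):
    """Mirror the listed steps: partition, generate XLA-style ops, compile."""
    tpu_ops, cpu_ops = partition_graph(ops)
    # Stand-in for "generates XLA operations" for the TPU sub-graph.
    xla_ops = [f"xla::{op.op_type}" for op in tpu_ops]
    # Stand-in for "invokes the XLA compiler": a fake binary descriptor.
    binary = {"num_instructions": len(xla_ops), "ops": xla_ops}
    return binary, cpu_ops


graph = [
    Op("decode", "DecodeJpeg"),
    Op("conv1", "Conv2D"),
    Op("relu1", "Relu"),
    Op("summary", "PrintV2"),
]
binary, cpu_ops = handle_graph(graph)
print(binary["ops"], [op.name for op in cpu_ops])
```

In this toy graph the convolution and activation land in the TPU portion, while the decode and print ops stay on the CPU.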
XLA is a just-in-time compiler that takes as input High Level Optimizer (HLO) operations that are produced by the TensorFlow server. XLA generates binary code to be run on Cloud TPU, including orchestration of data from on-chip memory to hardware execution units and inter-chip communication. The generated binary is loaded onto Cloud TPU using PCIe connectivity between the Cloud TPU server and the Cloud TPU and is then launched for execution.
Cloud TPU hardware consists of four independent chips. The following block diagram describes the components of a single chip. Each chip consists of two compute cores called Tensor Cores. A Tensor Core consists of scalar, vector, and matrix (MXU) units. In addition, each Tensor Core has 8 GB of on-chip memory (HBM) on Cloud TPU v2 and 16 GB on Cloud TPU v3.
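The per-device totals implied by these figures follow from simple arithmetic. The sketch below uses only the numbers stated above; the constant and function names are made up for illustration.

```python
# Figures taken from the text; names are illustrative only.
CHIPS_PER_DEVICE = 4
CORES_PER_CHIP = 2
HBM_GB_PER_CORE = {"v2": 8, "v3": 16}


def total_cores():
    """Number of Tensor Cores on one Cloud TPU device."""
    return CHIPS_PER_DEVICE * CORES_PER_CHIP


def total_hbm_gb(version):
    """Total HBM across all Tensor Cores for the given TPU version."""
    return total_cores() * HBM_GB_PER_CORE[version]


for version in ("v2", "v3"):
    print(version, total_cores(), "cores,", total_hbm_gb(version), "GB HBM")
```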
The bulk of the compute horsepower in a Cloud TPU is provided by the MXU. Each MXU is capable of performing 16K multiply-accumulate operations in each cycle. While the MXU's inputs and outputs are 32-bit floating point values, the MXU performs multiplies at reduced bfloat16 precision. Bfloat16 is a 16-bit floating point representation that provides better training and model accuracy than the IEEE half-precision representation.
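To make the precision trade-off concrete, the sketch below emulates bfloat16 in plain Python by keeping the top 16 bits of a 32-bit float. Two assumptions are worth flagging: the conversion here truncates rather than rounds (the hardware's rounding mode is not described in this document), and Python accumulates in 64-bit floats rather than 32-bit. The function names `to_bfloat16` and `mxu_style_dot` are hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its top 16 bits.

    Assumption: truncation is used for simplicity; real hardware may round.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mxu_style_dot(a, b):
    """Multiply at bfloat16 precision, accumulate at higher precision."""
    acc = 0.0  # Python float (64-bit), standing in for a wider accumulator
    for x, y in zip(a, b):
        acc += to_bfloat16(x) * to_bfloat16(y)
    return acc


print(to_bfloat16(1.001))   # small mantissa detail is lost
print(to_bfloat16(3.0e38))  # but the float32 exponent range is kept
print(mxu_style_dot([1.0, 2.0], [3.0, 4.0]))
```

Because bfloat16 keeps the same 8-bit exponent as float32, very large and very small magnitudes survive the conversion, whereas IEEE half precision has only 5 exponent bits and a much narrower range.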
From a software perspective, each of the 8 cores on a Cloud TPU can execute user computations (XLA ops) independently. High-bandwidth interconnects allow the chips to communicate directly with each other.
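One common way to use independent cores is to run a replica of the model on each core and give every replica a slice of the batch. The sketch below illustrates that idea in plain Python. It is an assumption about how a replicated model might divide work, not a description of the TPU Estimator implementation, and `shard_batch` and `run_replicated` are hypothetical helpers.

```python
NUM_CORES = 8  # from the text: 4 chips x 2 Tensor Cores


def shard_batch(batch, num_cores=NUM_CORES):
    """Split a global batch into equal per-core shards."""
    if len(batch) % num_cores != 0:
        raise ValueError("batch size must be divisible by the number of cores")
    per_core = len(batch) // num_cores
    return [batch[i * per_core:(i + 1) * per_core] for i in range(num_cores)]


def run_replicated(batch, step_fn):
    """Run step_fn on each shard independently, then average the results.

    The final average stands in for a cross-core reduction over the
    chip-to-chip interconnect.
    """
    results = [step_fn(shard) for shard in shard_batch(batch)]
    return sum(results) / len(results)


shards = shard_batch(list(range(16)))
mean = run_replicated(list(range(16)), lambda shard: sum(shard) / len(shard))
print(len(shards), shards[0], mean)
```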