AI & Machine Learning

Cloud TPU machine learning accelerators now available in beta

February 12, 2018

John Barrus

Product Manager for Cloud TPUs, Google Cloud

Zak Stone

Senior Product Manager for Cloud TPUs, Google Brain Team

Starting today, Cloud TPUs are available in beta on Google Cloud to help machine learning (ML) experts train and run their ML models more quickly.

https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud-tpu-betaup22.max-700x700.PNG

Cloud TPUs are a family of Google-designed hardware accelerators that are optimized to speed up and scale up specific ML workloads programmed with TensorFlow. Built with four custom ASICs, each Cloud TPU packs up to 180 teraflops of floating-point performance and 64 GB of high-bandwidth memory onto a single board. These boards can be used alone or connected together via an ultra-fast, dedicated network to form multi-petaflop ML supercomputers that we call “TPU pods.” We will offer these larger supercomputers on GCP later this year.

We designed Cloud TPUs to deliver differentiated performance per dollar for targeted TensorFlow workloads and to enable ML engineers and researchers to iterate more quickly. For example:

Instead of waiting for a job to schedule on a shared compute cluster, you can have interactive, exclusive access to a network-attached Cloud TPU via a Google Compute Engine VM that you control and can customize.
Rather than waiting days or weeks to train a business-critical ML model, you can train several variants of the same model overnight on a fleet of Cloud TPUs and deploy the most accurate trained model in production the next day.
Using a single Cloud TPU and following this tutorial, you can train ResNet-50 to the expected accuracy on the ImageNet benchmark challenge in less than a day, all for well under $200!

ML model training, made easy

Traditionally, writing programs for custom ASICs and supercomputers has required deeply specialized expertise. By contrast, you can program Cloud TPUs with high-level TensorFlow APIs, and we have open-sourced a set of reference high-performance Cloud TPU model implementations to help you get started right away:

ResNet-50 and other popular models for image classification
Transformer for machine translation and language modeling
RetinaNet for object detection

To save you time and effort, we continuously test these model implementations both for performance and for convergence to the expected accuracy on standard datasets.

Over time, we'll open-source additional model implementations. Adventurous ML experts may be able to optimize other TensorFlow models for Cloud TPUs on their own using the documentation and tools we provide.

By getting started with Cloud TPUs now, you’ll be able to benefit from dramatic time-to-accuracy improvements when we introduce TPU pods later this year. As we announced at NIPS 2017, both ResNet-50 and Transformer training times drop from the better part of a day to under 30 minutes on a full TPU pod, no code changes required.

Two Sigma, a leading investment management firm, is impressed with the performance and ease of use of Cloud TPUs.

We made a decision to focus our deep learning research on the cloud for many reasons, but mostly to gain access to the latest machine learning infrastructure. Google Cloud TPUs are an example of innovative, rapidly evolving technology to support deep learning, and we found that moving TensorFlow workloads to TPUs has boosted our productivity by greatly reducing both the complexity of programming new models and the time required to train them. Using Cloud TPUs instead of clusters of other accelerators has allowed us to focus on building our models without being distracted by the need to manage the complexity of cluster communication patterns.

— Alfred Spector, Chief Technology Officer, Two Sigma

Tweet this quote

A scalable ML platform

Cloud TPUs also simplify planning and managing ML computing resources:

You can provide your teams with state-of-the-art ML acceleration and adjust your capacity dynamically as their needs change.
Instead of committing the capital, time and expertise required to design, install and maintain an on-site ML computing cluster with specialized power, cooling, networking and storage requirements, you can benefit from large-scale, tightly-integrated ML infrastructure that has been heavily optimized at Google over many years.
There’s no more struggling to keep drivers up-to-date across a large collection of workstations and servers. Cloud TPUs are preconfigured—no driver installation required!
You are protected by the same sophisticated security mechanisms and practices that safeguard all Google Cloud services.

Since working with Google Cloud TPUs, we’ve been extremely impressed with their speed—what could normally take days can now take hours. Deep learning is fast becoming the backbone of the software running self-driving cars. The results get better with more data, and there are major breakthroughs coming in algorithms every week. In this world, Cloud TPUs help us move quickly by incorporating the latest navigation-related data from our fleet of vehicles and the latest algorithmic advances from the research community.

— Anantha Kancherla, Head of Software, Self-Driving Level 5, Lyft

Tweet this quote

Here at Google Cloud, we want to provide customers with the best cloud for every ML workload and will offer a variety of high-performance CPUs (including Intel Skylake) and GPUs (including NVIDIA’s Tesla V100) alongside Cloud TPUs.