Cloud TPU now offers preemptible pricing and global availability

June 19, 2018

Brennan Saeta

TensorFlow Tech Lead for Cloud TPUs

Deep neural networks have enabled breakthroughs across a variety of business and research challenges, including translating text between languages, transcribing speech, classifying image content, and mastering the game of Go. Because training and running deep learning models can be extremely computationally demanding, we rely on our custom-built Tensor Processing Units (TPUs) to power several of our major products, including Translate, Photos, Search, Assistant, and Gmail.

Cloud TPUs allow businesses everywhere to transform their own products and services with machine learning, and we’re working hard to make Cloud TPUs as widely available and as affordable as possible. As of today, Cloud TPUs are available in two new regions in Europe and Asia, and we are also introducing preemptible pricing for Cloud TPUs that is 70% lower than the normal price.

Cloud TPUs are available in the United States, Europe, and Asia at the following rates, and you can get started in minutes via our Quickstart guide:

https://storage.googleapis.com/gweb-cloudblog-publish/images/cloud-tpusjlll.max-700x700.PNG

https://storage.googleapis.com/gweb-cloudblog-publish/images/gcp-cloud-TPU.max-700x700.jpg

One Cloud TPU (v2-8) can deliver up to 180 teraflops and includes 64 GB of high-bandwidth memory. The colorful cables link multiple TPU devices together over a custom 2-D mesh network to form Cloud TPU Pods. These accelerators are programmed via TensorFlow and are widely available today on Google Cloud Platform.

Benchmarking Cloud TPU performance-per-dollar

Training a machine learning model is analogous to compiling code: ML training needs to happen fast for engineers, researchers, and data scientists to be productive, and ML training needs to be affordable for models to be trained over and over as a production application is built, deployed, and refined. Key metrics include time-to-accuracy and training cost.

Researchers at Stanford recently hosted an open benchmarking competition called DAWNBench that focused on time-to-accuracy and training cost, and Cloud TPUs won first place in the large-scale ImageNet Training Cost category. On a single Cloud TPU, our open-source AmoebaNet reference model cost only $49.30 to reach the target accuracy, and our open-source ResNet-50 model cost just $58.53. Our TPU Pods also won the ImageNet Training Time category: the same ResNet-50 code running on just half of a TPU Pod was nearly six times faster than any non-TPU submission, reaching the target accuracy in approximately 30 minutes!

Although we restricted ourselves to standard algorithms and standard learning regimes for the competition, another DAWNBench submission from fast.ai (3rd place in ImageNet Training Cost, 4th place in ImageNet Training Time) altered the standard ResNet-50 training procedure in two clever ways to achieve faster convergence (fast.ai has published its GPU implementation). After DAWNBench was over, we easily applied the same optimizations to our Cloud TPU ResNet-50 implementation. This reduced ResNet-50 training time on a single Cloud TPU from 8.9 hours to 3.5 hours, a 2.5X improvement, which made it possible to train ResNet-50 for just $25 with normal pricing.
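As a quick sanity check on the numbers above, the short Python sketch below recomputes the speedup from the 8.9-hour and 3.5-hour figures and scales the $58.53 DAWNBench ResNet-50 cost by the same ratio. It assumes that training cost is simply proportional to training hours on a single Cloud TPU, which is our simplification for illustration; the variable names are hypothetical.

```python
# Figures quoted in this post.
HOURS_BEFORE = 8.9           # ResNet-50 training time before the fast.ai optimizations
HOURS_AFTER = 3.5            # training time after applying them
DAWNBENCH_COST_USD = 58.53   # ResNet-50 cost from the DAWNBench submission

# Speedup is the ratio of old to new training time.
speedup = HOURS_BEFORE / HOURS_AFTER

# Assumption: cost scales linearly with hours on the same hardware.
estimated_cost_usd = DAWNBENCH_COST_USD * HOURS_AFTER / HOURS_BEFORE

print(f"Speedup: {speedup:.2f}x")
print(f"Estimated cost after optimizations: ${estimated_cost_usd:.2f}")
```

Under that assumption the estimate lands in the low-to-mid $20s, the same ballpark as the roughly $25 quoted above; the remaining gap may come from rounding or from factors this simple proportional model does not capture.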

Preemptible Cloud TPUs make the Cloud TPU platform even more affordable. You can now train ResNet-50 on ImageNet from scratch for just $7.50. Preemptible Cloud TPUs allow fault-tolerant workloads to run more cost-effectively than ever before; these TPUs behave similarly to Preemptible VMs. And because TensorFlow has built-in support for saving and restoring from checkpoints, deadline-insensitive workloads can easily take advantage of preemptible pricing. This means you can train cutting-edge deep learning models to achieve DAWNBench-level accuracy for less than you might pay for lunch!
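To make the checkpointing idea concrete, here is a minimal, framework-free Python sketch of the save-and-resume pattern that lets a job survive preemption. It is not TensorFlow's checkpoint API: the function `train_with_checkpoints`, the JSON file format, and the simulated preemption are all hypothetical stand-ins chosen so the example runs anywhere.

```python
import json
import os
import tempfile


def train_with_checkpoints(ckpt_path, total_steps, save_every, preempt_at=None):
    """Toy training loop that resumes from ckpt_path if it exists.

    Returns the final state, or None if the simulated preemption fired.
    """
    # Restore the last saved state, or start from scratch.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weight": 0.0}

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            # Simulated preemption: progress since the last checkpoint is lost.
            return None

        state["weight"] += 0.5  # stand-in for one real training step
        state["step"] += 1

        if state["step"] % save_every == 0:
            # Write to a temporary file and rename it, so an interruption
            # in the middle of a write cannot corrupt the last good checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)

    return state


ckpt = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.json")

# First run is preempted at step 25; the last checkpoint was written at step 20.
first_run = train_with_checkpoints(ckpt, total_steps=100, save_every=10, preempt_at=25)

# Second run picks up from step 20 and finishes the job.
second_run = train_with_checkpoints(ckpt, total_steps=100, save_every=10)
print(first_run, second_run)
```

The trade-off to tune is the checkpoint interval: saving more often wastes less work per preemption but spends more time writing checkpoints.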

Select Open-Source Reference Models	Normal training cost (TF 1.8)	Preemptible training cost (TF 1.8)
ResNet-50 (with optimizations from fast.ai): Image classification	~$25	~$7.50
ResNet-50 (original implementation): Image classification	~$59	~$18
AmoebaNet: Image classification (model architecture evolved from scratch on TPUs to maximize accuracy)	~$49	~$15
RetinaNet: Object detection	~$40	~$12
Transformer: Neural machine translation	~$41	~$13
ASR Transformer: Speech recognition (transcribe speech to text)	~$86	~$27
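For readers who want to compare the two columns, the Python sketch below computes the implied savings for each row of the table. The dollar figures are the approximate values from the table; the dictionary name and the shortened model labels are ours.

```python
# (normal cost, preemptible cost) in USD, copied from the table above.
TRAINING_COSTS = {
    "ResNet-50 (fast.ai optimizations)": (25.0, 7.50),
    "ResNet-50 (original)": (59.0, 18.0),
    "AmoebaNet": (49.0, 15.0),
    "RetinaNet": (40.0, 12.0),
    "Transformer": (41.0, 13.0),
    "ASR Transformer": (86.0, 27.0),
}

# Fraction saved by switching from normal to preemptible pricing.
savings = {
    model: 1.0 - preemptible / normal
    for model, (normal, preemptible) in TRAINING_COSTS.items()
}

for model, fraction in savings.items():
    print(f"{model}: {fraction:.0%} cheaper with preemptible pricing")
```

Because the table values are rounded, the per-row savings come out close to, rather than exactly at, the 70% discount announced in this post.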

Start using Cloud TPUs today

We aim for Google Cloud to be the best place to run all of your machine learning workloads. Cloud TPUs offer great performance-per-dollar for training and batch inference across a variety of machine learning applications, and we also offer top-of-the-line GPUs with recently improved preemptible pricing.

We’re excited to see what you build! To get started, please check out the Cloud TPU Quickstart, try our open source reference models, and be sure to sign up for a free trial to start with $300 in cloud credits. Finally, we encourage you to watch our Cloud-TPU-related sessions from Google I/O and the TensorFlow Dev Summit: “Effective machine learning with Cloud TPUs” and “Training Performance: A user’s guide to converge faster.”