AI & Machine Learning

AI Platform adds Cloud TPU support for training

May 21, 2018

Nikhil Kothari

Engineering Lead, Dataplex and Open Data Analytics

Starting today, AI Platform offers the option to accelerate training with Cloud TPUs as a beta feature. Getting started is easy, since Cloud TPU quota is now available to all GCP customers.

AI Platform enables you to train and deploy machine learning models on datasets of many types and sizes, using the flexibility and production-readiness of TensorFlow. As a managed service, AI Platform handles the infrastructure, compute resources, and job scheduling on your behalf, allowing you to focus on data and modeling.

In March 2017, we launched AI Platform (then known as Cloud Machine Learning Engine) to provide a managed TensorFlow service, with the ability to scale machine learning workloads using distributed training and GPU acceleration. Over the last year, we have continued to release new features and improvements, including beta support for NVIDIA V100 GPUs, online prediction as a deployment capability, and improvements to the hyperparameter tuning feature.

Today, we are adding support for Cloud TPUs, enabling you to train a variety of high-performance, open-source reference models with differentiated performance per dollar. You can also accelerate your own models written with high-level TensorFlow APIs.

Cloud TPUs, which recently launched in beta, are a family of Google-designed hardware accelerators built from the ground up for machine learning. They won the ImageNet Training Cost category of Stanford’s DAWNBench competition, and their performance and cost advantages have been analyzed in detail.

Getting started with Cloud TPU on AI Platform

AI Platform automatically handles provisioning and management of Cloud TPU nodes, so you can use TPUs just as easily as CPUs and GPUs. You can also use AI Platform’s hyperparameter tuning feature in your Cloud TPU jobs, combining scale, performance, and algorithms to improve your models. Finally, you can deploy the resulting models with AI Platform to serve online prediction requests or to run batch prediction jobs.
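To make the "just as easily as CPUs and GPUs" point concrete, here is a minimal sketch of a training job request body that asks for a Cloud TPU. It assumes the `BASIC_TPU` scale tier and the `trainingInput` field names from the AI Platform training job resource. The helper name `build_tpu_training_job`, the bucket paths, the module name, and the runtime version are hypothetical placeholders, not values from this post. The sketch only builds and prints the request; it does not submit anything.

```python
import json


def build_tpu_training_job(job_id, package_uri, python_module,
                           region, job_dir, runtime_version):
    """Build a training job request body that requests a Cloud TPU.

    Hypothetical helper for illustration. The only TPU-specific part is
    the scale tier; the remaining fields look the same as for a CPU or
    GPU job, since the service provisions the TPU node itself.
    """
    return {
        "jobId": job_id,
        "trainingInput": {
            # Assumption: BASIC_TPU is the scale tier that attaches a Cloud TPU.
            "scaleTier": "BASIC_TPU",
            "packageUris": [package_uri],
            "pythonModule": python_module,
            "region": region,
            "jobDir": job_dir,
            # Assumption: must be a runtime version with TPU support.
            "runtimeVersion": runtime_version,
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    job = build_tpu_training_job(
        job_id="tpu_training_example",
        package_uri="gs://example-bucket/packages/trainer-0.1.tar.gz",
        python_module="trainer.task",
        region="us-central1",
        job_dir="gs://example-bucket/output",
        runtime_version="1.8",
    )
    print(json.dumps(job, indent=2))
```

Under these assumptions, switching an existing job from CPUs or GPUs to a TPU comes down to changing the scale tier, provided the model code itself is written to run on TPUs.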

Read this guide to learn more about how you can use Cloud TPUs with AI Platform for training jobs.
