Hyperparameter tuning using TPUs in AI Platform
Yu-Han Liu
Developer Programs Engineer, Google Cloud AI
Puneith Kaul
Software Engineering Manager
Hyperparameter tuning is one of the cornerstones of building machine learning models, and it's crucial to the successful training of any model. AI Platform is a managed service that provides out-of-the-box support for hyperparameter tuning using Google Vizier. Because the service uses a technique called Bayesian Optimization, you can step back from the details of the tuning and focus on your model architecture. What is important to note here, though, is that you can deploy cloud compute resources to handle such a tuning task, and with Cloud TPU, that task has become much faster and easier: you can successfully and efficiently fine-tune your own training process. In this post we will walk you through the details of performing hyperparameter tuning using Tensor Processing Units (TPUs) on AI Platform.
How to use hyperparameter tuning on TPUs
If you are not familiar with running hyperparameter tuning on CPUs or GPUs, we recommend following the existing hyperparameter tuning tutorial before you run hyperparameter tuning on TPUs. In this post, we explain how to run your hyperparameter tuning job on TPUs. The steps are as follows:
1. Create a YAML configuration file
2. Make sure your training code writes evaluation metrics periodically
3. Submit the hyperparameter job using the trainer code
Create a YAML configuration file
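As an illustrative sketch (the parameter names, value ranges, and trial counts below are placeholders, not necessarily the values used in the official sample), a config.yaml for a TPU hyperparameter tuning job might look like this:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: cloud_tpu   # request a Cloud TPU worker
  workerCount: 1          # only one worker is allowed with TPUs
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: top_5_accuracy
    maxTrials: 10
    maxParallelTrials: 3
    params:
      - parameterName: learning-rate   # illustrative hyperparameter
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
```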
You can find the full code sample on GitHub.
There are two differences between a typical config.yaml for a training job on CPUs or GPUs and the one shown above:
- We are using cloud_tpu for the worker type. Note that you can deploy only one worker when using TPUs.
- We have a hyperparameters section, signaling to Cloud ML Engine that this is a hyperparameter tuning job.
Each parallel trial runs independently of the others on its own cluster. For example, with the configuration above, we would be running three clusters in parallel, each with one Cloud ML Engine master instance and one Cloud TPU worker instance.
Evaluation metrics
The configuration YAML file above specifies a hyperparameterMetricTag to be monitored by CMLE's Hyperparameter Tuning service, in this case top_5_accuracy. We need to calculate and report this metric in our code so that the Hyperparameter Tuning service can suggest hyperparameters for the next trials based on the values from past trials. You can do this by including the metric in eval_metrics in your model_fn definition, as in the sketch below.
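Here is a minimal model_fn sketch assuming a TPUEstimator-based trainer; build_network and the learning_rate parameter are hypothetical stand-ins for your own model code:

```python
import tensorflow as tf  # TensorFlow 1.x, as used with TPUEstimator

def model_fn(features, labels, mode, params):
    logits = build_network(features, params)  # hypothetical model-building function
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.EVAL:
        def metric_fn(labels, logits):
            # The dict key must match hyperparameterMetricTag in config.yaml
            # so the Hyperparameter Tuning service can read the value.
            top_5 = tf.cast(
                tf.nn.in_top_k(predictions=logits, targets=labels, k=5),
                tf.float32)
            return {'top_5_accuracy': tf.metrics.mean(top_5)}

        return tf.contrib.tpu.TPUEstimatorSpec(
            mode=mode, loss=loss, eval_metrics=(metric_fn, [labels, logits]))

    optimizer = tf.train.AdamOptimizer(learning_rate=params['learning_rate'])
    # Aggregate gradients across TPU cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)
```

You can find the full code sample on GitHub.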
Submitting the hyperparameter job
You can use the gcloud SDK to submit the hyperparameter job to Cloud ML Engine with the following commands:
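The invocation below is a sketch; the job name, bucket, package path, module name, region, and runtime version are placeholders you would replace with your own values:

```bash
JOB_NAME=tpu_hptuning_$(date +%Y%m%d_%H%M%S)

gcloud ml-engine jobs submit training $JOB_NAME \
  --staging-bucket gs://$BUCKET \
  --runtime-version 1.9 \
  --config config.yaml \
  --module-name trainer.task \
  --package-path trainer/ \
  --region us-central1
```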
You can find the full code sample on GitHub.
You can monitor the hyperparameter tuning job in the Cloud Console. Each time a trial completes, you can see its outcome at the following URL:
https://console.cloud.google.com/mlengine/jobs/$JOB_NAME
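If you prefer the command line, you can also describe the job with gcloud; for hyperparameter tuning jobs, completed trials and their final metric values appear under trainingOutput.trials in the response:

```bash
# Show job status; for hyperparameter tuning jobs, finished trials
# are listed under trainingOutput.trials.
gcloud ml-engine jobs describe $JOB_NAME
```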
TPU hyperparameter tuning using TensorFlow
Cloud ML Engine provides hyperparameter tuning as a service for CPUs, GPUs, and now also TPUs. Faster training means you can run more hyperparameter tuning trials in the same amount of time, or at the same cost, resulting in better hyperparameters. Note that currently, you can only run TPU-based hyperparameter tuning with TensorFlow.
See Stanford's DAWNBench benchmark to learn more about Cloud TPU's training speed in terms of cost.
Conclusion
Hyperparameter tuning is often a necessary step in training a more accurate model. We hope that this post gives you enough guidance on how to do hyperparameter tuning on TPUs, and demonstrates how easy it is to tune your model on AI Platform.