Hyperparameter tuning using TPUs in Cloud ML Engine
Hyperparameter tuning is one of the cornerstones of building successful machine learning models, and it’s crucial to training any model well. Cloud Machine Learning (ML) Engine is a managed service that provides out-of-the-box support for hyperparameter tuning using Google Vizier. Because Vizier applies a technique called Bayesian optimization, you can step back from the details of the tuning and focus on your model architecture. You can also deploy cloud compute resources to handle the tuning task, and with Cloud TPU that task has become much faster and easier, letting you fine-tune your training process efficiently. In this post we walk you through the details of performing hyperparameter tuning with Tensor Processing Units (TPUs) on Cloud ML Engine.
How to use hyperparameter tuning on TPUs
If you are not familiar with running hyperparameter tuning on CPUs or GPUs, follow that tutorial first before you run hyperparameter tuning on TPUs. In this post we explain how to run your hyperparameter tuning job on TPUs. The steps are as follows:
Create a YAML configuration file
Make sure your training code writes evaluation metrics periodically
Submit the hyperparameter job using the trainer code
Create a YAML configuration file
- parameterName: base_learning_rate
Find the above code sample on GitHub here.
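As a fuller illustration, a config.yaml for a TPU tuning job might look like the following sketch. The metric tag, the cloud_tpu worker, and the three parallel trials match the text of this post; the trial count, parameter bounds, and scale type are assumed values:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: cloud_tpu
  workerCount: 1            # only one TPU worker can be deployed
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: top_5_accuracy
    maxTrials: 10           # assumed value
    maxParallelTrials: 3
    params:
      - parameterName: base_learning_rate
        type: DOUBLE
        minValue: 0.01      # assumed bounds
        maxValue: 0.5
        scaleType: UNIT_LOG_SCALE
```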
There are two differences between a typical config.yaml for a training job on CPUs or GPUs and the one shown above:
- We are using cloud_tpu for the worker type. Note that you can deploy only one worker when using TPUs.
- We have a hyperparameters section, signaling to Cloud ML Engine that this is a hyperparameter tuning job.
Each parallel trial will be run independently of the others on its own cluster. For example, with the configuration above, we would be running 3 clusters in parallel, each with one master Cloud ML Engine instance and one Cloud TPU worker instance.
The configuration YAML file above specifies a hyperparameterMetricTag to be monitored by Cloud ML Engine’s Hyperparameter Tuning service, in this case top_5_accuracy. We need to calculate and report this metric in our code so that the Hyperparameter Tuning service can suggest hyperparameters for the next trials based on the values from past trials.
You can report the metric through the evaluation metrics in your model function, for example:
def model_fn(features, labels, mode, params):
  eval_metrics = None
  if mode == tf.estimator.ModeKeys.EVAL:
    def metric_fn(labels, logits):
      predictions = tf.argmax(logits, axis=1)
      top_1_accuracy = tf.metrics.accuracy(labels, predictions)
      in_top_5 = tf.cast(tf.nn.in_top_k(logits, labels, 5), tf.float32)
      top_5_accuracy = tf.metrics.mean(in_top_5)
      # The metric names here must match hyperparameterMetricTag in config.yaml.
      return {
          'top_1_accuracy': top_1_accuracy,
          'top_5_accuracy': top_5_accuracy,
      }
    # eval_metrics is passed to the TPUEstimatorSpec returned by model_fn.
    eval_metrics = (metric_fn, [labels, logits])
Once your configuration and trainer code are ready, submit the hyperparameter tuning job with gcloud:

gcloud ml-engine jobs submit training $JOB_NAME \
  --staging-bucket $STAGING_BUCKET \
  --runtime-version 1.8 \
  --module-name resnet.resnet_main \
  --package-path resnet/ \
  --region $REGION \
  --config config_resnet.yaml
Find the above code sample on GitHub here.
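The submit command above assumes a few environment variables are set beforehand; a minimal setup might look like this (the bucket name and region are placeholders, not values from the post):

```shell
# Placeholder values -- substitute your own project's resources.
JOB_NAME=resnet_hp_tuning_$(date +%Y%m%d_%H%M%S)  # unique job name
STAGING_BUCKET=gs://my-staging-bucket             # hypothetical bucket
REGION=us-central1                                # a region with Cloud TPUs
```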
You can monitor the hyperparameter tuning job in the Cloud Console. Each time a trial completes, you can see its outcome at the following URL:
TPU hyperparameter tuning using TensorFlow
Cloud ML Engine provides hyperparameter tuning as a service for CPUs, GPUs, and now also TPUs. Faster training means you can run more hyperparameter tuning trials in the same amount of time or at the same cost, resulting in better hyperparameters. Note that currently you can only run TPU-based hyperparameter tuning with TensorFlow.
See Stanford’s DAWNbench benchmark to learn more about Cloud TPU’s training speed in terms of cost.
Hyperparameter tuning is often a necessary step in training a more accurate model. We hope that this post gives you enough guidance on how to do hyperparameter tuning on TPUs, and demonstrates how easy it is to tune your model on Cloud ML Engine.