AI & Machine Learning

Hyperparameter tuning using TPUs in Cloud ML Engine

TPUv2 Bird's-eye View with Heatsinks

Hyperparameter tuning is one of the cornerstones of building successful machine learning models, and choosing good hyperparameter values is crucial to training any model well. Cloud Machine Learning (ML) Engine is a managed service that provides out-of-the-box support for hyperparameter tuning using Google Vizier. Because Vizier uses a technique called Bayesian optimization, you can step back from the details of the tuning and focus on your model architecture. What is important to note here, though, is that you can deploy cloud compute resources to handle the tuning task, and with Cloud TPU that task has become much faster and easier, so you can fine-tune your own training process successfully and efficiently. In this post we will walk you through the details of performing hyperparameter tuning using Tensor Processing Units (TPUs) on Cloud ML Engine.

How to use hyperparameter tuning on TPUs

If you are not familiar with running hyperparameter tuning on CPUs or GPUs, we recommend following the tutorial for those platforms first, before running hyperparameter tuning on TPUs. In this post we will explain how to run your hyperparameter tuning job on TPUs. The steps are as follows:

  1. Create a YAML configuration file

  2. Make sure your training code writes evaluation metrics periodically

  3. Submit the hyperparameter job using the trainer code

Create a YAML configuration file

  trainingInput:
    scaleTier: CUSTOM
    masterType: standard
    workerCount: 1
    workerType: cloud_tpu
    hyperparameters:
      goal: MAXIMIZE
      enableTrialEarlyStopping: true
      hyperparameterMetricTag: top_5_accuracy
      maxTrials: 32
      maxParallelTrials: 2
      params:
        - parameterName: base_learning_rate
          type: DOUBLE
          scaleType: UNIT_REVERSE_LOG_SCALE
          minValue: 0.02
          maxValue: 0.5
  ...

Find the above code sample on GitHub here.

There are two differences between a typical config.yaml for a training job on CPUs or GPUs and the one shown above:

  1. We are using cloud_tpu for the worker type. Note that you can deploy only one worker when using TPUs.
  2. We have a hyperparameters section, signaling to Cloud ML Engine that this is a hyperparameter tuning job.

Each parallel trial runs independently of the others on its own cluster. For example, with the configuration above (maxParallelTrials: 2), we would be running two clusters in parallel, each with one master Cloud ML Engine instance and one Cloud TPU worker instance.
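Inside the trainer, each trial receives its sampled hyperparameter values as ordinary command-line flags, and the trial number is exposed through the TF_CONFIG environment variable set by Cloud ML Engine. As a rough sketch (the flag name base_learning_rate matches the config above; the bucket path and the hard-coded argument list are illustrative only), the trainer side might look like this:

```python
import argparse
import json
import os

# Each trial receives the sampled hyperparameter values as ordinary
# command-line flags, so the trainer only needs to declare them.
parser = argparse.ArgumentParser()
parser.add_argument('--base_learning_rate', type=float, default=0.1)
# In a real job you would call parser.parse_args() with no arguments;
# the value below stands in for what the tuning service would supply.
args = parser.parse_args(['--base_learning_rate', '0.08'])

# The trial number is available under task.trial in TF_CONFIG, which is
# handy for giving each trial its own checkpoint directory.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial = tf_config.get('task', {}).get('trial', '')
model_dir = os.path.join('gs://my-gcs-bucket/job', trial)

print(args.base_learning_rate)  # 0.08
```

Writing each trial's checkpoints to a distinct model_dir keeps trials from overwriting each other's summaries and checkpoints.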

Evaluation metrics

The configuration YAML file above specifies a hyperparameterMetricTag (in this case top_5_accuracy) for Cloud ML Engine's hyperparameter tuning service to monitor. We need to calculate and report this metric in our code so the hyperparameter tuning service can suggest hyperparameters for the next trials based on the values from past trials.


You can report the metric through eval_metrics in your model_fn definition:


Language: Python

  def model_fn(features, labels, mode, params):
    eval_metrics = None
    if mode == tf.estimator.ModeKeys.EVAL:
      def metric_fn(labels, logits):
        # Top-1 accuracy: the highest-scoring class matches the label.
        predictions = tf.argmax(logits, axis=1)
        top_1_accuracy = tf.metrics.accuracy(labels, predictions)
        # Top-5 accuracy: the label appears among the five highest logits.
        in_top_5 = tf.cast(tf.nn.in_top_k(logits, labels, 5), tf.float32)
        top_5_accuracy = tf.metrics.mean(in_top_5)

        return {
            'top_1_accuracy': top_1_accuracy,
            'top_5_accuracy': top_5_accuracy,
        }

      # TPUEstimator expects a (metric_fn, tensors) tuple rather than a dict.
      eval_metrics = (metric_fn, [labels, logits])

    return tpu_estimator.TPUEstimatorSpec(
        ...
        eval_metrics=eval_metrics)

Find the above code sample on GitHub here.
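To see concretely what tf.nn.in_top_k measures, here is a small NumPy-only sketch of top-k accuracy (the function name and the example logits are illustrative, not from the post): an example counts as correct if its true label appears among the k highest-scoring classes.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of examples whose true label is among the k highest logits."""
    top_k = np.argsort(logits, axis=1)[:, -k:]  # indices of the k largest logits per row
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))

# Two examples over 10 classes, with distinct logit values per row.
logits = np.array([
    [0.05, 0.9, 0.2, 0.3, 0.4, 0.5, 0.1, 0.15, 0.25, 0.35],  # top-5: 1, 5, 4, 9, 3
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.2, 0.05, 0.3, 0.4],     # top-5: 0, 1, 2, 3, 4
])
labels = np.array([1, 7])  # first label is in its top-5, second is not

print(top_k_accuracy(logits, labels))  # 0.5
```

This mirrors what the metric_fn above reports to the tuning service: the mean of per-example top-5 hits over the evaluation set.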

Submitting the hyperparameter job

You can use the gcloud SDK to submit the hyperparameter job to Cloud ML Engine with the following commands:

  now=$(date +"%Y%m%d_%H%M%S")
  BUCKET="gs://my-gcs-bucket"

  JOB_NAME="tpu_hptuning_${now}"
  JOB_DIR="${BUCKET}/${JOB_NAME}"

  STAGING_BUCKET=$BUCKET
  REGION=us-central1
  DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
  OUTPUT_PATH=$JOB_DIR

  gcloud ml-engine jobs submit training $JOB_NAME \
      --staging-bucket $STAGING_BUCKET \
      --runtime-version 1.8 \
      --module-name resnet.resnet_main \
      --package-path resnet/ \
      --region $REGION \
      --config config_resnet.yaml \
      -- \
      --data_dir=$DATA_DIR \
      --model_dir=$OUTPUT_PATH \
      --resnet_depth=50

Find the above code sample on GitHub here.

You can monitor the hyperparameter tuning job in the Cloud Console. Each time a trial completes, you can see its outcome at the following URL:

https://console.cloud.google.com/mlengine/jobs/$JOB_NAME

TPU hyperparameter tuning using TensorFlow

Cloud ML Engine provides hyperparameter tuning as a service for CPUs, GPUs, and now TPUs as well. Faster training means you can run more hyperparameter tuning trials in the same amount of time, or at the same cost, which in turn yields better hyperparameters. Note that currently you can only execute TPU-based hyperparameter tuning with TensorFlow.

See Stanford’s DAWNbench benchmark to learn more about Cloud TPU’s training speed in terms of cost.

Conclusion

Hyperparameter tuning is often a necessary step in training a more accurate model. We hope that this post gives you enough guidance on how to do hyperparameter tuning on TPUs, and demonstrates how easy it is to tune your model on Cloud ML Engine.