Using Hyperparameter Tuning

This page shows you how to use Cloud Machine Learning Engine hyperparameter tuning when training your model. Hyperparameter tuning optimizes a target variable that you specify. The target variable is called the hyperparameter metric. When you start a job with hyperparameter tuning, you establish the name of your hyperparameter metric. This is the name you assign to the scalar summary that you add to your trainer.

The steps involved in hyperparameter tuning

To use hyperparameter tuning in your training job, you must perform the following steps:

  1. Specify the hyperparameter tuning configuration for your training job by including a HyperparameterSpec in your TrainingInput object.

  2. Include the following code in your training application:

    • Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial.
    • Add your hyperparameter metric to the summary for your graph.

Below are more details of each step.

Specify the hyperparameter tuning configuration for your training job

Create a HyperparameterSpec object to hold the hyperparameter tuning configuration for your training job, and add the HyperparameterSpec as the hyperparameters object in your TrainingInput object.

In your HyperparameterSpec, set the hyperparameterMetricTag to a value representing your chosen metric. For example: metric1. If you don't specify a hyperparameterMetricTag, Cloud ML Engine looks for a metric with the name training/hptuning/metric.

gcloud

Add your hyperparameter configuration information to your configuration YAML file. Below is an example. For a working config file, see hptuning_config.yaml in the census estimator sample.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell

Python

Make a dictionary representing your HyperparameterSpec and add it to your training input. The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.

# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': []}

hyperparams['params'].append({
    'parameterName':'hidden1',
    'type':'INTEGER',
    'minValue': 40,
    'maxValue': 400,
    'scaleType': 'UNIT_LINEAR_SCALE'})

hyperparams['params'].append({
    'parameterName':'numRnnCells',
    'type':'DISCRETE',
    'discreteValues': [1, 2, 3, 4]})

hyperparams['params'].append({
    'parameterName':'rnnCellType',
    'type': 'CATEGORICAL',
    'categoricalValues': [
        'BasicLSTMCell',
        'BasicRNNCell',
        'GRUCell',
        'LSTMCell',
        'LayerNormBasicLSTMCell'
    ]
})

# Add hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
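
Once the job spec is complete, you submit it as you would any training job. The following is a minimal sketch, assuming the google-api-python-client library and a hypothetical PROJECT_ID placeholder; see the training job configuration guide for the full submission flow.

# A sketch of submitting the job spec built above. PROJECT_ID is a
# hypothetical placeholder for your project name.
from googleapiclient import discovery

project_id = 'projects/PROJECT_ID'

ml = discovery.build('ml', 'v1')
request = ml.projects().jobs().create(parent=project_id, body=job_spec)
response = request.execute()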

Check the code in your training application

In your application, handle the command-line arguments for the hyperparameters and add your metric to the graph summary.

Handle the command-line arguments for the hyperparameters you want to tune

Cloud ML Engine sets command-line arguments when it calls your training application. Make use of the command-line arguments in your code:

  1. Define a name for each hyperparameter argument and parse it using whatever argument parser you prefer (typically argparse). The argument names must match the parameter names that you specified in the job configuration, as described above.

  2. Assign the values from the command-line arguments to the hyperparameters in your graph, as shown in the sketch after this list.
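
The following is a minimal sketch of the argument handling, assuming the parameter names from the example configuration above (hidden1, numRnnCells, rnnCellType):

# A sketch of parsing hyperparameter arguments with argparse. The argument
# names must match the parameterName values in the job configuration.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--hidden1', type=int, default=64)
parser.add_argument('--numRnnCells', type=int, default=1)
parser.add_argument('--rnnCellType', type=str, default='BasicLSTMCell')

# The service may pass additional arguments (such as --job-dir), so ignore
# anything this parser doesn't define.
args, _ = parser.parse_known_args()

# Use the parsed values when you build the graph for this trial, for example
# by passing args.hidden1 as the size of the first hidden layer.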

Add your hyperparameter metric to the graph summary

Cloud ML Engine looks for your hyperparameter metric when the graph's summary writer is called. Note: The canned TensorFlow estimators use the same metric name for both training and evaluation, so you need a separate metric for hyperparameter tuning to ensure that Cloud ML Engine can determine the source of the metric.

Your code depends on whether you're using the TensorFlow Estimator API or the core TensorFlow APIs. Below are examples for both situations:

Estimator

Use the following code to add your hyperparameter metric to the summary for your graph. The example assumes that the name of your metric is metric1:

# Create metric for hyperparameter tuning
def my_metric(labels, predictions):
    pred_values = predictions['predictions']
    return {'metric1': tf.metrics.root_mean_squared_error(labels, pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):

    estimator = tf.estimator.DNNLinearCombinedRegressor(...)

    # Add the hyperparameter tuning metric to the estimator's evaluation metrics.
    estimator = tf.contrib.estimator.add_metrics(estimator, my_metric)

    train_spec = ...
    exporter = ...
    eval_spec = tf.estimator.EvalSpec(
        input_fn = ...,
        start_delay_secs = 60, # start evaluating after N seconds
        throttle_secs = 300,  # evaluate every N seconds
        exporters = exporter)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

See the TensorFlow Estimator census sample.

TensorFlow core

Create a summary writer, tf.summary.FileWriter, and add a summary to the writer with your metric as a tag. The following example assumes that the name of your metric is metric1:

import os

import tensorflow as tf
from tensorflow.core.framework.summary_pb2 import Summary
...
summary = Summary(value=[Summary.Value(tag='metric1', simple_value=loss_val)])
eval_path = os.path.join(args['job_dir'], 'metric1')
summary_writer = tf.summary.FileWriter(eval_path)

# Note: adding the summary to the writer is enough for hyperparameter tuning.
# ML Engine looks for any summary added with the hyperparameter metric tag.
summary_writer.add_summary(summary)
summary_writer.flush()

See the TensorFlow core census sample.

Getting details of a hyperparameter tuning job while running

You can monitor hyperparameter tuning by getting the detailed status of your running training job.

The TrainingOutput object in the response's Job resource has the following values set during a training job with hyperparameter tuning:

  • isHyperparameterTuningJob set to True.

  • trials is present and contains a list of HyperparameterOutput objects, one per trial.

You can also retrieve the trial ID from the TF_CONFIG environment variable. See the guide to getting details from TF_CONFIG.
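
For example, a minimal sketch of reading the trial ID from within a running trial, assuming TF_CONFIG contains a task.trial field as described in that guide:

# A sketch of reading the current trial ID from TF_CONFIG.
import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial_id = tf_config.get('task', {}).get('trial', '')
print('Running trial {}'.format(trial_id))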

Getting hyperparameter tuning results

When the training runs are complete, you can call projects.jobs.get to get the results. The TrainingOutput object in the job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.

To get this information, use the same detailed status request that you use to monitor the job while it runs.

You can see the results from each trial in the job description. Find the trial that yielded the most desirable value for your hyperparameter metric. If the trial meets your standard for success of the model, you can use the hyperparameter values shown for that trial in subsequent runs of your model.
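
The following is a minimal sketch of retrieving the results programmatically, assuming the google-api-python-client library, a MAXIMIZE goal, and hypothetical PROJECT_ID and my_job_name placeholders; the field names follow the TrainingOutput and HyperparameterOutput resources:

# A sketch of finding the best trial from a completed tuning job.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
job_name = 'projects/PROJECT_ID/jobs/{}'.format(my_job_name)
job = ml.projects().jobs().get(name=job_name).execute()

# Keep only trials that reported a final metric.
trials = [t for t in job['trainingOutput']['trials'] if 'finalMetric' in t]

# With goal MAXIMIZE, the best trial has the largest objective value.
best = max(trials, key=lambda t: float(t['finalMetric']['objectiveValue']))
print('Best trial: {}'.format(best['trialId']))
print('Hyperparameters: {}'.format(best['hyperparameters']))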

Sometimes multiple trials give identical results for your tuning metric. In such a case, you should determine which of the hyperparameter values are most advantageous by other measures. For example, if you are tuning the number of nodes in a hidden layer and you get identical results when the value is set to 8 as when it's set to 20, you should use 8, because more nodes means more processing and cost for no improvement in your model.

Setting a limit to the number of trials

You should decide how many trials you want to allow the service to run and set the maxTrials value in the HyperparameterSpec object.

There are two competing interests to consider when deciding how many trials to allow:

  • time (and consequently cost)
  • accuracy

Increasing the number of trials generally yields better results, but this is not always the case. In most cases there is a point of diminishing returns after which additional trials have little or no effect on accuracy. Before starting a job with a large number of trials, you may want to start with a small number of trials to gauge the effect your chosen hyperparameters have on your model's accuracy.

To get the most out of hyperparameter tuning, you shouldn't set your maximum value lower than ten times the number of hyperparameters you use.

Running parallel trials

You can specify a number of trials to run in parallel by setting maxParallelTrials in the HyperparameterSpec object.
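
For example, using the Python configuration from earlier on this page:

# Allow up to five trials to run at the same time.
hyperparams['maxParallelTrials'] = 5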

Running parallel trials has the benefit of reducing the time the training job takes in elapsed (wall-clock) time; the total processing time required is typically unchanged. However, running in parallel can reduce the effectiveness of the tuning job overall. That is because hyperparameter tuning uses the results of previous trials to inform the values to assign to the hyperparameters of subsequent trials. When trials run in parallel, some start without the benefit of the results of trials that are still running.

If you use parallel trials, the training service provisions multiple training processing clusters (or multiple individual machines in the case of a single-process trainer). The scale tier that you set for your job is used for each individual training cluster.

Stopping trials early

You can configure Cloud ML Engine to automatically stop a trial that has become clearly unpromising. This saves you the cost of continuing a trial that is unlikely to be useful.

To permit stopping a trial early, set the enableTrialEarlyStopping value in the HyperparameterSpec to TRUE.

Resuming a completed hyperparameter tuning job

You can continue a completed hyperparameter tuning job so that the new job starts from a partially optimized state. This makes it possible to reuse the knowledge gained in the previous hyperparameter tuning job.

To resume a hyperparameter tuning job, submit a new hyperparameter tuning job with the following configuration:

  • Set the resumePreviousJobId value in the HyperparameterSpec to the job ID of the previous hyperparameter tuning job.
  • Specify values for maxTrials and maxParallelTrials.

Cloud ML Engine uses the previous job ID to find and reuse the same goal, params, and hyperparameterMetricTag values to continue the hyperparameter tuning job.

Use a consistent hyperparameterMetricTag name and consistent params for similar jobs, even when the jobs have different parameters. This practice makes it possible for Cloud ML Engine to improve optimization over time.

The following examples show the use of the resumePreviousJobId configuration:

gcloud

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]

Python

# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}

# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Hyperparameter tuning with Cloud TPU

If you're running your hyperparameter tuning job with Cloud TPU on Cloud ML Engine, best practice is to use the eval_metrics property in TPUEstimatorSpec.
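
The following is a minimal sketch, assuming TensorFlow 1.x with the tf.contrib.tpu APIs, a metric named metric1 (matching the hyperparameterMetricTag used earlier on this page), and a hypothetical build_model helper:

# A sketch of reporting a tuning metric through TPUEstimatorSpec.
import tensorflow as tf

def metric_fn(labels, predictions):
    # eval_metrics functions return a dict of (value_op, update_op) pairs
    # produced by tf.metrics.
    return {'metric1': tf.metrics.root_mean_squared_error(labels, predictions)}

def model_fn(features, labels, mode, params):
    # build_model is a hypothetical helper that returns the prediction,
    # loss, and training op tensors for your model.
    predictions, loss, train_op = build_model(features, labels, params)
    return tf.contrib.tpu.TPUEstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        # eval_metrics is a (metric_fn, tensors) tuple; the tensors are
        # passed to metric_fn during evaluation.
        eval_metrics=(metric_fn, [labels, predictions]))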

See the ResNet-50 TPU hyperparameter tuning sample for a working example of hyperparameter tuning with Cloud TPU.

As an alternative to using the eval_metrics property, you can report your metric to the hyperparameter tuning service by calling tf.summary in host_call. For details, see TPUEstimatorSpec.
