Using Hyperparameter Tuning

This page shows you how to use Cloud Machine Learning Engine hyperparameter tuning when training your model. The process involves making some changes to your TensorFlow application code and adding configuration information when you submit your training job. For more background on this feature, see the hyperparameter tuning overview.

Hyperparameter tuning optimizes a single target variable that you specify. The target variable is called the hyperparameter metric.

The steps involved in hyperparameter tuning

To use hyperparameter tuning in your training job, you must perform the following steps:

  1. Decide which hyperparameters you want to tune.

  2. Ensure your training application includes the following:

    1. Add command-line arguments for each hyperparameter you want to tune.

    2. Use the values passed in those arguments to set the value of the hyperparameters for your training trial.

    3. Add your hyperparameter metric to the summary for your graph.

  3. Specify the hyperparameters to tune by including a HyperparameterSpec with your training job's configuration data.

Decide which hyperparameters to tune

Before you make any changes to your application code, think about which hyperparameters have the most impact on your target value. Remember that each additional hyperparameter you tune can significantly increase the time your tuning job takes.

Check the code in your training application

The changes you need to make depend on whether you're using the TensorFlow Estimator API or the core TensorFlow APIs. For complete examples of using the Estimator API and the core TensorFlow APIs on Cloud ML Engine, see the sample applications on GitHub.

Add command-line arguments for the hyperparameters you want to tune

Cloud ML Engine sets command-line arguments when it calls your training application. Define a name for each hyperparameter argument and parse it in your application using whatever argument parser you prefer (typically argparse).

You must use the same argument names when you configure your training job, as described below.
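For example, here is a minimal sketch of the argument parsing using argparse. The hyperparameter names hidden1 and numRnnCells mirror the configuration shown later on this page; the defaults are illustrative, not required:

import argparse

parser = argparse.ArgumentParser()

# One argument per hyperparameter to tune. Cloud ML Engine passes each
# trial's value as --<parameterName>, matching the names in your config.
parser.add_argument('--hidden1', type=int, default=100,
                    help='Number of units in the first hidden layer')
parser.add_argument('--numRnnCells', type=int, default=1,
                    help='Number of stacked RNN cells')
parser.add_argument('--job-dir', default='/tmp/output',
                    help='Base output location for the job')

args, _ = parser.parse_known_args()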

Set your hyperparameters to the values received

Assign the values from the command-line arguments to the hyperparameters in your graph.
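For example, continuing the argparse sketch above, you might pass the tuned value directly to your model when you build it. This is a hedged sketch; the feature column is illustrative:

import tensorflow as tf

# Build the model with the value Cloud ML Engine chose for this trial.
feature_columns = [tf.feature_column.numeric_column('x')]  # illustrative feature
estimator = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[args.hidden1],   # hyperparameter value from the command line
    model_dir=args.job_dir)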

Add your hyperparameter metric to the graph summary

Cloud ML Engine looks for your hyperparameter metric in the summaries that your training application writes. Note: The canned TensorFlow Estimators use the same metric names for training and evaluation, so you should add a separately named metric for hyperparameter tuning to ensure that Cloud ML Engine can identify the source of the metric.

If you're using the TensorFlow Estimator API, use the following code to add your hyperparameter metric to the summary for your graph. The example assumes that the name of your metric is metric1:

import tensorflow as tf

# Create a metric for hyperparameter tuning.
def my_metric(labels, predictions):
    pred_values = predictions['predictions']
    return {'metric1': tf.metrics.root_mean_squared_error(labels, pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):

    estimator = tf.estimator.DNNLinearCombinedRegressor(...)

    estimator = tf.contrib.estimator.add_metrics(estimator, my_metric)

    train_spec = ...
    exporter = ...
    eval_spec = tf.estimator.EvalSpec(
        input_fn = ...,
        start_delay_secs = 60, # start evaluating after N seconds
        throttle_secs = 300,  # evaluate every N seconds
        exporters = exporter)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

If you're using the core TensorFlow APIs, create a summary writer, tf.summary.FileWriter, and add a summary to the writer with your metric as a tag. The following example assumes that the name of your metric is metric1:

import os
import tensorflow as tf
from tensorflow.core.framework.summary_pb2 import Summary
...
# Write the metric value with a tag that matches the hyperparameterMetricTag
# you specify in your training job configuration (here, 'metric1').
summary = Summary(value=[Summary.Value(tag='metric1', simple_value=loss_val)])
eval_path = os.path.join(args['job_dir'], 'metric1')
summary_writer = tf.summary.FileWriter(eval_path)

# Note: adding the summary to the writer is enough for hyperparameter tuning.
# Cloud ML Engine looks for any summary added with the hyperparameter metric tag.
summary_writer.add_summary(summary)
summary_writer.flush()

If you don't specify a hyperparameterMetricTag, Cloud ML Engine looks for a metric with the name training/hptuning/metric.

Manage the output file location

Note: If you are using the --job-dir argument to specify where your training job stores its model, you can skip this section. The hyperparameter tuning trial number is automatically appended to the --job-dir value as a subdirectory for each trial.

You should write your application to output to a different subdirectory for each hyperparameter tuning trial. If you don't, each trial overwrites the previous one and you lose your data.

We recommend using a base output location with the hyperparameter tuning trial number appended to it. The number of the trial running on a given replica is stored in the TF_CONFIG environment variable, as the trial member of the task object. The following example shows how you might construct an output path in individual replicas of your training job.

import json
import os

def makeTrialOutputPath(output_path):
    '''
    For a given static output path, returns a path with
    the hyperparameter tuning trial number appended.
    '''
    # Get the configuration data from the TF_CONFIG environment variable.
    env = json.loads(os.environ.get('TF_CONFIG', '{}'))

    # Get the task information.
    taskInfo = env.get('task')

    if taskInfo:
        # The trial number is stored as a string in the task object.
        trial = taskInfo.get('trial', '')
        if trial:
            return os.path.join(output_path, trial)

    return output_path
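For example, you might call this helper once at startup to resolve the directory your application writes to (the bucket path below is illustrative; if you rely on --job-dir, this step isn't needed, as noted above):

# Resolve a per-trial output directory from a base output location.
output_dir = makeTrialOutputPath('gs://my-bucket/output')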

Specify the hyperparameter tuning configuration for your training job

In addition to coding your training application to handle hyperparameter tuning, you must include hyperparameter tuning configuration when you start the training job. Configure your hyperparameter tuning information in a HyperparameterSpec object and add it to your TrainingInput object as the hyperparameters object.

Set the hyperparameterMetricTag member in your HyperparameterSpec to a value representing your chosen metric. For example: metric1.

If you don't specify a hyperparameterMetricTag, Cloud ML Engine looks for a metric with the name training/hptuning/metric.

gcloud

Add your hyperparameter configuration information to your configuration YAML file. Below is an example. For a working config file, see hptuning_config.yaml in the census estimator sample.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell
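You can then pass this file with the --config flag when you submit the training job. The following command is a sketch; the job name, package path, module name, bucket path, and region are placeholders for your own values:

gcloud ml-engine jobs submit training my_hptuning_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir gs://my-bucket/hptuning_output \
    --region us-central1 \
    --config hptuning_config.yaml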

Python

When configuring your training job in Python code, you make a dictionary representing your HyperparameterSpec and add it to your training input.

The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.

# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': []}

hyperparams['params'].append({
    'parameterName':'hidden1',
    'type':'INTEGER',
    'minValue': 40,
    'maxValue': 400,
    'scaleType': 'UNIT_LINEAR_SCALE'})

hyperparams['params'].append({
    'parameterName':'numRnnCells',
    'type':'DISCRETE',
    'discreteValues': [1, 2, 3, 4]})

hyperparams['params'].append({
    'parameterName':'rnnCellType',
    'type': 'CATEGORICAL',
    'categoricalValues': [
        'BasicLSTMCell',
        'BasicRNNCell',
        'GRUCell',
        'LSTMCell',
        'LayerNormBasicLSTMCell'
    ]
})

# Add hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
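You can then submit job_spec with the Google API Python client, as in the training job configuration guide. This is a sketch; project_id is assumed to be set elsewhere in your code:

from googleapiclient import discovery

# Build a client for the Cloud ML Engine API and submit the job.
ml = discovery.build('ml', 'v1')
request = ml.projects().jobs().create(
    body=job_spec,
    parent='projects/{}'.format(project_id))
response = request.execute()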

You can get more details about hyperparameter types and values in the hyperparameter tuning overview.

Monitoring hyperparameter tuning in progress

You can monitor hyperparameter tuning by getting the detailed status of your running training job.

The TrainingOutput object in the response's Job resource has the following values set during a training job with hyperparameter tuning:

  • isHyperparameterTuningJob set to True.

  • trials is present and contains a list of HyperparameterOutput objects, one per trial.

Getting hyperparameter tuning results

When the training runs are complete, you can call projects.jobs.get to get the results. The TrainingOutput object in the job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.

Use the same detailed status request that you do to monitor the job during processing to get this information.
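For example, using the Google API Python client, you might fetch the job and inspect the per-trial results. This is a sketch; project_id and my_job_name are assumed to be defined in your own code:

from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
job_name = 'projects/{}/jobs/{}'.format(project_id, my_job_name)

# projects.jobs.get returns the Job resource, including trainingOutput.
job = ml.projects().jobs().get(name=job_name).execute()

# Each entry in 'trials' is a HyperparameterOutput with the trial's
# hyperparameter values and its final objective metric.
for trial in job['trainingOutput'].get('trials', []):
    print(trial.get('trialId'), trial.get('hyperparameters'), trial.get('finalMetric'))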

You'll get the results from each trial in the job description. Find the trial that yielded the most desirable value for your hyperparameter metric. If the trial meets your standard for success of the model, you can use the hyperparameter values shown for that trial in subsequent runs of your model.

Sometimes you will find multiple trials that give identical results for your tuning metric. In such a case, determine which of the hyperparameter values are most advantageous by other measures. For example, if you are tuning the number of nodes in a hidden layer and you get identical results with the value set to 8 as with the value set to 20, you should use 8, because more nodes mean more processing and cost with no improvement in your model.

Continuing a completed hyperparameter tuning job

You can continue a completed hyperparameter tuning job. This makes it possible to reuse the knowledge gained in the previous hyperparameter tuning job and start from a state that is partially optimized.

To resume a hyperparameter tuning job, set the resumePreviousJobId value of the HyperparameterSpec object to the ID of the previous hyperparameter tuning job, and specify values for maxTrials and maxParallelTrials.

Cloud ML Engine then uses the previous job ID to find and reuse the same goal, params, and hyperparameterMetricTag values to continue the hyperparameter tuning job.

gcloud

The following example adds hyperparameter tuning configuration to the example YAML file shown in the training configuration instructions.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]

Python

When configuring your training job in Python code, you make a dictionary representing your HyperparameterSpec and add it to your training input.

The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.

# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}

# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
