Using hyperparameter tuning

This page shows you how to use AI Platform Training hyperparameter tuning when training your model. Hyperparameter tuning optimizes a target variable that you specify. The target variable is called the hyperparameter metric. When you start a job with hyperparameter tuning, you establish the name of your hyperparameter metric. This is the name you assign to the scalar summary that you add to your trainer.

The steps involved in hyperparameter tuning

To use hyperparameter tuning in your training job, you must perform the following steps:

  1. Specify the hyperparameter tuning configuration for your training job by including a HyperparameterSpec in your TrainingInput object.

  2. Include the following code in your training application:

    • Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial.
    • Add your hyperparameter metric to the summary for your graph.

Below are more details of each step.

Specify the hyperparameter tuning configuration for your training job

Create a HyperparameterSpec object to hold the hyperparameter tuning configuration for your training job, and add the HyperparameterSpec as the hyperparameters object in your TrainingInput object.

The hyperparameter tuning job creates trial jobs. To accelerate the training trial process, you can specify a custom machine type in the TrainingInput object. For example, to create trial jobs that each use an n1-standard-8 VM, specify masterType as n1-standard-8 and leave the worker configuration empty.
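For example, with the Python client flow shown later on this page, a minimal sketch of such a TrainingInput dictionary might look like the following; the bucket, module, region, and runtime version values are placeholders rather than values from this guide, and the hyperparameters object is added in the next step:

# Minimal sketch of a TrainingInput dictionary whose trial jobs each run on
# a single n1-standard-8 VM (no worker configuration is specified).
# All paths and versions below are placeholders.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    'packageUris': ['gs://your-bucket/trainer-0.1.tar.gz'],
    'pythonModule': 'trainer.task',
    'region': 'us-central1',
    'runtimeVersion': '2.11',
}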

In your HyperparameterSpec, set the hyperparameterMetricTag to a value representing your chosen metric. If you don't specify a hyperparameterMetricTag, AI Platform Training looks for a metric with the name training/hptuning/metric. The following example shows how to create a configuration for a metric named metric1:

gcloud

Add your hyperparameter configuration information to your configuration YAML file. Below is an example. For a working config file, see hptuning_config.yaml in the census estimator sample.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell

Python

Make a dictionary representing your HyperparameterSpec and add it to your training input. The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.

# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': []}

hyperparams['params'].append({
    'parameterName':'hidden1',
    'type':'INTEGER',
    'minValue': 40,
    'maxValue': 400,
    'scaleType': 'UNIT_LINEAR_SCALE'})

hyperparams['params'].append({
    'parameterName':'numRnnCells',
    'type':'DISCRETE',
    'discreteValues': [1, 2, 3, 4]})

hyperparams['params'].append({
    'parameterName':'rnnCellType',
    'type': 'CATEGORICAL',
    'categoricalValues': [
        'BasicLSTMCell',
        'BasicRNNCell',
        'GRUCell',
        'LSTMCell',
        'LayerNormBasicLSTMCell'
    ]
})

# Add hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Check the code in your training application

In your application, handle the command-line arguments for the hyperparameters and report your hyperparameter metric to AI Platform Training.

Handle the command-line arguments for the hyperparameters you want to tune

AI Platform Training sets command-line arguments when it calls your training application. Make use of the command-line arguments in your code:

  1. Define a name for each hyperparameter argument and parse it using whatever argument parser you prefer (typically argparse). The argument names must match the parameter names that you specified in the job configuration, as described above.

  2. Assign the values from the command-line arguments to the hyperparameters in your training code, as shown in the sketch after this list.
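For example, here is a minimal sketch of this pattern using argparse. The argument names match the parameterName values from the configuration example above; the build_model call is a hypothetical placeholder for your own model code.

import argparse

def get_args():
    """Parses the hyperparameter arguments that AI Platform Training passes in."""
    parser = argparse.ArgumentParser()
    # The argument names must match the parameterName values in HyperparameterSpec.
    parser.add_argument('--hidden1', type=int, default=64)
    parser.add_argument('--numRnnCells', type=int, default=1)
    parser.add_argument('--rnnCellType', type=str, default='LSTMCell')
    args, _ = parser.parse_known_args()
    return args

if __name__ == '__main__':
    args = get_args()
    # Use the parsed values to configure this trial, for example:
    # model = build_model(hidden_units=args.hidden1,
    #                     num_cells=args.numRnnCells,
    #                     cell_type=args.rnnCellType)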

Report your hyperparameter metric to AI Platform Training

The way to report your hyperparameter metric to the AI Platform Training service depends on whether you are using TensorFlow for training or not. It also depends on whether you are using a runtime version or a custom container for training.

We recommend that your training code reports your hyperparameter metric to AI Platform Training frequently in order to take advantage of early stopping.

TensorFlow with a runtime version

If you use an AI Platform Training runtime version and train with TensorFlow, then you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary. Use one of the following functions:

  • tf.summary.scalar in TensorFlow 2.x
  • tf.compat.v1.summary.scalar in TensorFlow 1.x

Using a different TensorFlow API that calls one of the preceding functions, as in the following Estimator example, also reports the hyperparameter metric to AI Platform Training.

The following examples show the basics of two different ways to write your hyperparameter metric to a summary. Both examples assume you are training a regression model, and they write the root-mean-square-error between ground-truth labels and evaluation predictions as a hyperparameter metric named metric1.

Keras

The following example uses a custom Keras callback to write a scalar summary at the end of each training epoch:

from datetime import datetime

import tensorflow as tf


class MyMetricCallback(tf.keras.callbacks.Callback):
    """Writes the hyperparameter metric as a scalar summary after each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        # The key in `logs` matches the name of the compiled metric.
        tf.summary.scalar('metric1', logs['root_mean_squared_error'], step=epoch)


# Write summaries to a timestamped log directory so each run is kept separate.
logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(logdir + "/metrics")
file_writer.set_as_default()

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation='linear', input_dim=784)])
model.compile(
    optimizer='rmsprop',
    loss='mean_squared_error',
    metrics=[tf.keras.metrics.RootMeanSquaredError()])

# x_train and y_train are assumed to be loaded elsewhere in your trainer.
model.fit(
    x_train,
    y_train,
    batch_size=64,
    epochs=10,
    steps_per_epoch=5,
    verbose=0,
    callbacks=[MyMetricCallback()])

Estimator

The following example uses tf.estimator.add_metrics to add your hyperparameter metric to the summary for your graph.

Note that Estimators write a graph summary every time their evaluate method runs. This example uses tf.estimator.EvalSpec with tf.estimator.train_and_evaluate to configure the estimator to evaluate and write summaries every 300 seconds during training.

# Create metric for hyperparameter tuning
def my_metric(labels, predictions):
    # Note that different types of estimator provide different
    # keys on the predictions Tensor. predictions['predictions'] is for
    # regression output.
    pred_values = predictions['predictions']
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(labels, pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):

    estimator = tf.estimator.DNNLinearCombinedRegressor(...)

    estimator = tf.estimator.add_metrics(estimator, my_metric)

    train_spec = ...
    eval_spec = tf.estimator.EvalSpec(
        start_delay_secs = 60, # start evaluating after 60 seconds
        throttle_secs = 300,  # evaluate every 300 seconds
        ...)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Other machine learning frameworks or custom containers

If you use a custom container for training or if you want to perform hyperparameter tuning with a framework other than TensorFlow, then you must use the cloudml-hypertune Python package to report your hyperparameter metric to AI Platform Training.

See an example of using cloudml-hypertune.
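For example, here is a minimal sketch of reporting a metric with cloudml-hypertune, assuming the package is installed in your container or declared as a dependency of your training package, and that your evaluation code computes a root-mean-square error and the current training step:

import hypertune

def report_metric(rmse, step):
    """Reports the evaluation metric to the hyperparameter tuning service."""
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='metric1',  # must match hyperparameterMetricTag
        metric_value=rmse,
        global_step=step)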

Getting details of a hyperparameter tuning job while running

You can monitor hyperparameter tuning by getting the detailed status of your running training job.

The TrainingOutput object in the response's Job resource has the following values set during a training job with hyperparameter tuning:

  • isHyperparameterTuningJob set to True.

  • trials is present and contains a list of HyperparameterOutput objects, one per trial.

You can also retrieve the trial ID from the TF_CONFIG environment variable. See the guide to getting details from TF_CONFIG.
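For example, here is a minimal sketch of checking these values with the Google API Python client; the project and job names are placeholders:

from googleapiclient import discovery

# Get the detailed status of the running job and inspect its tuning output.
ml = discovery.build('ml', 'v1')
job_name = 'projects/your-project-id/jobs/your_job_name'  # placeholders
job = ml.projects().jobs().get(name=job_name).execute()

output = job.get('trainingOutput', {})
if output.get('isHyperparameterTuningJob'):
    for trial in output.get('trials', []):
        print('Trial', trial['trialId'], trial.get('state'),
              trial.get('finalMetric', {}).get('objectiveValue'))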

Getting hyperparameter tuning results

When the training runs are complete, you can view the results of each trial in the Google Cloud console. Alternatively, you can call projects.jobs.get to get the results. The TrainingOutput object in the job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.

Use the same detailed status request as you use to monitor the job during processing to get this information.

You can see the results from each trial in the job description. In the Google Cloud console, you can filter trials by rmse, learning_rate, and training steps. Find the trial that yielded the most desirable value for your hyperparameter metric. If the trial meets your standard for success of the model, you can use the hyperparameter values shown for that trial in subsequent runs of your model.
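For example, with the job response retrieved as in the monitoring sketch above, and assuming a MAXIMIZE goal, you could pick the best trial like this:

# Select the trial with the highest objective value (MAXIMIZE goal assumed).
trials = job['trainingOutput']['trials']
completed = [t for t in trials if 'finalMetric' in t]
best = max(completed, key=lambda t: t['finalMetric']['objectiveValue'])
print('Best trial:', best['trialId'])
print('Hyperparameters:', best['hyperparameters'])
print('metric1:', best['finalMetric']['objectiveValue'])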

Sometimes multiple trials give identical results for your tuning metric. In such a case, you should determine which of the hyperparameter values are most advantageous by other measures. For example, if you are tuning the number of nodes in a hidden layer and you get identical results when the value is set to 8 as when it's set to 20, you should use 8, because more nodes means more processing and cost for no improvement in your model.

The FAILED trial state in HyperparameterOutput can either mean that training failed for that trial, or that the trial failed to report the hyperparameter tuning metric. In the latter case, the parent job can succeed even if the trial fails. You may view the trial log to see whether training failed for the trial.

Setting a limit to the number of trials

You should decide how many trials you want to allow the service to run and set the maxTrials value in the HyperparameterSpec object.

There are two competing interests to consider when deciding how many trials to allow:

  • time (and consequently cost)
  • accuracy

Increasing the number of trials generally yields better results, but this is not always the case. In most cases there is a point of diminishing returns after which additional trials have little or no effect on accuracy. It may be best to start with a small number of trials to gauge the effect your chosen hyperparameters have on your model's accuracy before starting a job with a large number of trials.

To get the most out of hyperparameter tuning, you shouldn't set your maximum value lower than ten times the number of hyperparameters you use.

Handling failed trials

If your hyperparameter tuning trials exit with errors, you might want to end the training job early. Set the maxFailedTrials field in the HyperparameterSpec to the number of failed trials that you want to allow. After this number of trials fails, AI Platform Training ends the training job. The maxFailedTrials value must be less than or equal to maxTrials.
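For example, building on the Python hyperparams dictionary shown earlier, a sketch of allowing up to five failed trials:

# End the training job after 5 trials have failed.
hyperparams['maxFailedTrials'] = 5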

If you do not set maxFailedTrials, or if you set it to 0, AI Platform Training uses the following rules to handle failing trials:

  • If the first trial of your job fails, AI Platform Training ends the job immediately. Failure during the first trial suggests a problem in your training code, so further trials are also likely to fail. Ending the job gives you the opportunity to diagnose the problem without waiting for more trials and incurring greater costs.
  • If the first trial succeeds, AI Platform Training might end the job after failures during subsequent trials based on one of the following criteria:
    • The number of failed trials has grown too high.
    • The ratio of failed trials to successful trials has grown too high.

These internal thresholds are subject to change. To ensure a specific behavior, set the maxFailedTrials field.

Running parallel trials

You can specify a number of trials to run in parallel by setting maxParallelTrials in the HyperparameterSpec object.
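For example, building on the Python hyperparams dictionary shown earlier, a sketch of running up to five trials at a time:

# Run up to 5 trials concurrently; each trial gets its own training cluster.
hyperparams['maxParallelTrials'] = 5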

Running parallel trials has the benefit of reducing the time the training job takes in real time; the total processing time required typically does not change. However, running in parallel can reduce the effectiveness of the tuning job overall. That is because hyperparameter tuning uses the results of previous trials to inform the values to assign to the hyperparameters of subsequent trials. When running in parallel, some trials start without the benefit of the results of any trials that are still running.

If you use parallel trials, the training service provisions multiple training processing clusters (or multiple individual machines in the case of a single-process trainer). The scale tier that you set for your job is used for each individual training cluster.

Stopping trials early

You can specify that AI Platform Training automatically stop a trial that has become clearly unpromising. This saves you the cost of continuing a trial that is unlikely to be useful.

To permit stopping a trial early, set the enableTrialEarlyStopping value in the HyperparameterSpec to TRUE.

Resuming a completed hyperparameter tuning job

You can continue a completed hyperparameter tuning job so that the new job starts from a state that is partially optimized. This makes it possible to reuse the knowledge gained in the previous hyperparameter tuning job.

To resume a hyperparameter tuning job, submit a new hyperparameter tuning job with the following configuration:

  • Set the resumePreviousJobId value in the HyperparameterSpec to the job ID of the previous hyperparameter tuning job.
  • Specify values for maxTrials and maxParallelTrials.

AI Platform Training uses the previous job ID to find and reuse the same goal, params, and hyperparameterMetricTag values to continue the hyperparameter tuning job.

Use a consistent hyperparameterMetricTag name and consistent params for similar jobs, even when the jobs have different parameters. This practice makes it possible for AI Platform Training to improve its optimization over time.

The following examples show the use of the resumePreviousJobId configuration:

gcloud

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]

Python

# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}

# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Hyperparameter tuning with Cloud TPU

If you're running your hyperparameter tuning job with Cloud TPU on AI Platform Training, best practice is to use the eval_metrics property in TPUEstimatorSpec.
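For example, here is a minimal sketch of the eval-mode branch of a model_fn that returns a TPUEstimatorSpec with eval_metrics; the model and loss are placeholders, and only the eval_metrics wiring is the point:

import tensorflow as tf

def metric_fn(labels, predictions):
    """Computes the hyperparameter tuning metric on the CPU host."""
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(
        labels, predictions)}

def model_fn(features, labels, mode, params):
    predictions = ...  # placeholder: build your model here
    loss = ...         # placeholder: compute your loss here
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(
            mode=mode,
            loss=loss,
            eval_metrics=(metric_fn, [labels, predictions]))
    # ... handle TRAIN and PREDICT modes here ...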

See the ResNet-50 TPU hyperparameter tuning sample for a working example of hyperparameter tuning with Cloud TPU.

As an alternative to using the eval_metrics property, you can call tf.summary in host_call to use the hyperparameter tuning service. For details, see TPUEstimatorSpec.

What's next