This page shows you how to use AI Platform Training hyperparameter tuning when training your model. Hyperparameter tuning optimizes a target variable that you specify. The target variable is called the hyperparameter metric. When you start a job with hyperparameter tuning, you establish the name of your hyperparameter metric. This is the name you assign to the scalar summary that you add to your trainer.
The steps involved in hyperparameter tuning
To use hyperparameter tuning in your training job, you must perform the following steps:
- Specify the hyperparameter tuning configuration for your training job by including a HyperparameterSpec in your TrainingInput object.
- Include the following code in your training application:
  - Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial.
  - Add your hyperparameter metric to the summary for your graph.
Below are more details of each step.
Specify the hyperparameter tuning configuration for your training job
Create a HyperparameterSpec object to hold the hyperparameter tuning configuration for your training job, and add the HyperparameterSpec as the hyperparameters object in your TrainingInput object.
The hyperparameter tuning job creates trial jobs. To accelerate these trials, you can specify a custom machine type in the TrainingInput object. For example, to run each trial job on an n1-standard-8 VM, specify masterType as n1-standard-8 and leave the worker config empty.
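A minimal sketch of such a configuration, assuming you combine it with the hyperparameters block shown in the next example, might look like this:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  # No workerType or workerCount: each trial runs on the master VM only.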
In your HyperparameterSpec, set the hyperparameterMetricTag to a value representing your chosen metric. If you don't specify a hyperparameterMetricTag, AI Platform Training looks for a metric with the name training/hptuning/metric. The following example shows how to create a configuration for a metric named metric1:
gcloud
Add your hyperparameter configuration information to your configuration YAML file. Below is an example. For a working config file, see hptuning_config.yaml in the census estimator sample.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell
Python
Make a dictionary representing your HyperparameterSpec and add it to your training input. The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.
# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': []}

hyperparams['params'].append({
    'parameterName': 'hidden1',
    'type': 'INTEGER',
    'minValue': 40,
    'maxValue': 400,
    'scaleType': 'UNIT_LINEAR_SCALE'})

hyperparams['params'].append({
    'parameterName': 'numRnnCells',
    'type': 'DISCRETE',
    'discreteValues': [1, 2, 3, 4]})

hyperparams['params'].append({
    'parameterName': 'rnnCellType',
    'type': 'CATEGORICAL',
    'categoricalValues': [
        'BasicLSTMCell',
        'BasicRNNCell',
        'GRUCell',
        'LSTMCell',
        'LayerNormBasicLSTMCell'
    ]
})

# Add hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
Check the code in your training application
In your application, handle the command-line arguments for the hyperparameters and report your hyperparameter metric to AI Platform Training.
Handle the command-line arguments for the hyperparameters you want to tune
AI Platform Training sets command-line arguments when it calls your training application. Make use of the command-line arguments in your code:
- Define a name for each hyperparameter argument and parse it using whatever argument parser you prefer (typically argparse). The argument names must match the parameter names that you specified in the job configuration, as described above.
- Assign the values from the command-line arguments to the hyperparameters in your training code, as shown in the sketch after this list.
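The following is a minimal sketch using argparse for the parameters defined in the example configuration above (hidden1, numRnnCells, and rnnCellType); the default values and the build_model call are placeholders for your own code:

import argparse

def get_args():
    """Parse the hyperparameter arguments passed by AI Platform Training."""
    parser = argparse.ArgumentParser()
    # These names must match the parameterName values in the HyperparameterSpec.
    parser.add_argument('--hidden1', type=int, default=64)
    parser.add_argument('--numRnnCells', type=int, default=1)
    parser.add_argument('--rnnCellType', type=str, default='BasicLSTMCell')
    # parse_known_args ignores any other arguments the service passes.
    args, _ = parser.parse_known_args()
    return args

args = get_args()
# Use the parsed values when building the model for this trial, for example:
# model = build_model(hidden1=args.hidden1,
#                     num_rnn_cells=args.numRnnCells,
#                     rnn_cell_type=args.rnnCellType)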
Report your hyperparameter metric to AI Platform Training
The way to report your hyperparameter metric to the AI Platform Training service depends on whether you are using TensorFlow for training or not. It also depends on whether you are using a runtime version or a custom container for training.
We recommend that your training code reports your hyperparameter metric to AI Platform Training frequently in order to take advantage of early stopping.
TensorFlow with a runtime version
If you use an AI Platform Training runtime version and train with TensorFlow, then you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary. Use one of the following functions:
- tf.compat.v1.summary.FileWriter.add_summary (also known as tf.summary.FileWriter.add_summary in TensorFlow 1.x)
- tf.summary.scalar (only in TensorFlow 2.x)
Using a different TensorFlow API that calls one of the preceding functions, as in the following Estimator example, also reports the hyperparameter metric to AI Platform Training.
The following examples show the basics of two different ways to write your hyperparameter metric to a summary. Both examples assume you are training a regression model, and they write the root-mean-square error between ground-truth labels and evaluation predictions as a hyperparameter metric named metric1.
Keras
The following example uses a custom Keras callback to write a scalar summary at the end of each training epoch:
from datetime import datetime
import tensorflow as tf

class MyMetricCallback(tf.keras.callbacks.Callback):
    """Reports the hyperparameter metric at the end of each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)

# Write summaries to a timestamped log directory.
logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(logdir + "/metrics")
file_writer.set_as_default()

model = tf.keras.Sequential(
    tf.keras.layers.Dense(1, activation='linear', input_dim=784))
model.compile(
    optimizer='rmsprop',
    loss='mean_squared_error',
    metrics=['RootMeanSquaredError'])

model.fit(
    x_train,
    y_train,
    batch_size=64,
    epochs=10,
    steps_per_epoch=5,
    verbose=0,
    callbacks=[MyMetricCallback()])
Estimator
The following example uses tf.estimator.add_metrics to add your hyperparameter metric to the summary for your graph.
Note that Estimators write a graph summary every time their evaluate method runs. This example uses tf.estimator.EvalSpec with tf.estimator.train_and_evaluate to configure the estimator to evaluate and write summaries every 300 seconds during training.
# Create metric for hyperparameter tuning
def my_metric(labels, predictions):
    # Note that different types of estimator provide different keys on the
    # predictions Tensor. predictions['predictions'] is for regression output.
    pred_values = predictions['predictions']
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(labels, pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
    estimator = tf.estimator.DNNLinearCombinedRegressor(...)
    estimator = tf.estimator.add_metrics(estimator, my_metric)
    train_spec = ...
    eval_spec = tf.estimator.EvalSpec(
        start_delay_secs=60,  # start evaluating after 60 seconds
        throttle_secs=300,    # evaluate every 300 seconds
        ...)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
Other machine learning frameworks or custom containers
If you use a custom container for training, or if you want to perform hyperparameter tuning with a framework other than TensorFlow, then you must use the cloudml-hypertune Python package to report your hyperparameter metric to AI Platform Training.
See an example of using cloudml-hypertune.
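For reference, a minimal sketch of reporting a metric with cloudml-hypertune looks like the following; the rmse and training_step variables stand in for values computed by your own training loop:

import hypertune

hpt = hypertune.HyperTune()

# Report the metric periodically so the service can apply early stopping.
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='metric1',
    metric_value=rmse,
    global_step=training_step)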
Getting details of a hyperparameter tuning job while running
You can monitor hyperparameter tuning by getting the detailed status of your running training job.
The TrainingOutput object in the response's Job resource has the following values set during a training job with hyperparameter tuning:
- isHyperparameterTuningJob set to True.
- trials is present and contains a list of HyperparameterOutput objects, one per trial.
You can also retrieve the trial ID from the TF_CONFIG environment variable. See the guide to getting details from TF_CONFIG.
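As a minimal sketch, assuming the trial identifier is exposed in the task entry as described in that guide, you can read it like this:

import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
# During hyperparameter tuning, the task entry includes the trial identifier.
trial_id = tf_config.get('task', {}).get('trial', '')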
Getting hyperparameter tuning results
When the training runs are complete, you can view the results of each trial in the Google Cloud console. Alternatively, you can call projects.jobs.get to get the results. The TrainingOutput object in the job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.
Use the same detailed status request as you use to monitor the job during processing to get this information.
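For example, the following sketch uses the Google API Client Library for Python to fetch the job and inspect its TrainingOutput; the project ID is a placeholder, and my_job_name is the job ID you used when submitting the job:

from googleapiclient import discovery

ml = discovery.build('ml', 'v1')

# Placeholder project ID; my_job_name is the job ID you submitted.
job_name = 'projects/{}/jobs/{}'.format('my-project', my_job_name)
job = ml.projects().jobs().get(name=job_name).execute()

training_output = job.get('trainingOutput', {})
for trial in training_output.get('trials', []):
    print(trial.get('trialId'), trial.get('hyperparameters'), trial.get('finalMetric'))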
You can see the results from each trial in the job description. In the Google Cloud console, you can filter trials by rmse, learning_rate, and training steps. Find the trial that yielded the most desirable value for your hyperparameter metric. If the trial meets your standard for success of the model, you can use the hyperparameter values shown for that trial in subsequent runs of your model.
Sometimes multiple trials give identical results for your tuning metric. In such a case, you should determine which of the hyperparameter values are most advantageous by other measures. For example, if you are tuning the number of nodes in a hidden layer and you get identical results when the value is set to 8 as when it's set to 20, you should use 8, because more nodes means more processing and cost for no improvement in your model.
The FAILED trial state in HyperparameterOutput can mean either that training failed for that trial, or that the trial failed to report the hyperparameter tuning metric. In the latter case, the parent job can succeed even if the trial fails. You can view the trial log to see whether training failed for the trial.
Setting a limit to the number of trials
You should decide how many trials you want to allow the service to run and set the maxTrials value in the HyperparameterSpec object.
There are two competing interests to consider when deciding how many trials to allow:
- time (and consequently cost)
- accuracy
Increasing the number of trials generally yields better results, but this is not always the case. In most cases there is a point of diminishing returns after which additional trials have little or no effect on the accuracy. It may be best to start with a small number of trials to gauge the effect your chosen hyperparameters have on your model's accuracy before starting a job with a large number of trials.
To get the most out of hyperparameter tuning, you shouldn't set your maximum value lower than ten times the number of hyperparameters you use.
Handling failed trials
If your hyperparameter tuning trials exit with errors, you might want to end the training job early. Set the maxFailedTrials field in the HyperparameterSpec to the number of failed trials that you want to allow. After this number of trials fails, AI Platform Training ends the training job. The maxFailedTrials value must be less than or equal to maxTrials.
If you do not set maxFailedTrials, or if you set it to 0, AI Platform Training uses the following rules to handle failing trials:
- If the first trial of your job fails, AI Platform Training ends the job immediately. Failure during the first trial suggests a problem in your training code, so further trials are also likely to fail. Ending the job gives you the opportunity to diagnose the problem without waiting for more trials and incurring greater costs.
- If the first trial succeeds, AI Platform Training might end the job after failures during subsequent trials based on one of the following criteria:
  - The number of failed trials has grown too high.
  - The ratio of failed trials to successful trials has grown too high.
These internal thresholds are subject to change. To ensure a specific behavior, set the maxFailedTrials field.
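For example, extending the Python job configuration shown earlier, you could allow up to five failed trials; the number here is only illustrative:

# End the job after 5 trials fail; the value must not exceed maxTrials.
hyperparams['maxFailedTrials'] = 5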
Running parallel trials
You can specify a number of trials to run in parallel by setting maxParallelTrials in the HyperparameterSpec object.
Running parallel trials has the benefit of reducing the time the training job takes (real time; the total processing time required typically does not change). However, running in parallel can reduce the effectiveness of the tuning job overall. That is because hyperparameter tuning uses the results of previous trials to inform the values to assign to the hyperparameters of subsequent trials. When running in parallel, some trials start without the benefit of the results of any trials that are still running.
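For example, to trade some tuning quality for shorter wall-clock time, you could raise the value used in the earlier configuration; the number here is only illustrative:

# Run up to 5 trials at a time; later trials see fewer completed results.
hyperparams['maxParallelTrials'] = 5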
If you use parallel trials, the training service provisions multiple training processing clusters (or multiple individual machines in the case of a single-process trainer). The scale tier that you set for your job is used for each individual training cluster.
Stopping trials early
You can specify that AI Platform Training must automatically stop a trial that has become clearly unpromising. This saves you the cost of continuing a trial that is unlikely to be useful.
To permit stopping a trial early, set the enableTrialEarlyStopping value in the HyperparameterSpec to TRUE.
Resuming a completed hyperparameter tuning job
You can continue a completed hyperparameter tuning job to start from a state that is partially optimized. This makes it possible to reuse the knowledge gained in the previous hyperparameter tuning job.
To resume a hyperparameter tuning job, submit a new hyperparameter tuning job with the following configuration:
- Set the resumePreviousJobId value in the HyperparameterSpec to the job ID of the previous trial.
- Specify values for maxTrials and maxParallelTrials.
AI Platform Training uses the previous job ID to find and reuse the same goal, params, and hyperparameterMetricTag values to continue the hyperparameter tuning job.
Use a consistent hyperparameterMetricTag name and params for similar jobs, even when the jobs have different parameters. This practice makes it possible for AI Platform Training to improve optimization over time.
The following examples show the use of the resumePreviousJobId configuration:
gcloud
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]
Python
# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}

# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
Hyperparameter tuning with Cloud TPU
If you're running your hyperparameter tuning job with Cloud TPU on AI Platform Training, best practice is to use the eval_metrics property in TPUEstimatorSpec.
See the ResNet-50 TPU hyperparameter tuning sample for a working example of hyperparameter tuning with Cloud TPU.
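As a rough sketch of the eval_metrics approach (the model function, loss, and predictions here are placeholders rather than parts of the ResNet-50 sample), the metric function returns a dictionary keyed by your hyperparameterMetricTag:

import tensorflow as tf

def metric_fn(labels, predictions):
    # The key must match the hyperparameterMetricTag in your HyperparameterSpec.
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(labels, predictions)}

def model_fn(features, labels, mode, params):
    # ... build the model and compute loss and predictions ...
    return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(
        mode=mode,
        loss=loss,
        eval_metrics=(metric_fn, [labels, predictions]))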
As an alternative to using the eval_metrics property with the hyperparameter tuning service, you can call tf.summary in host_call. For details, see TPUEstimatorSpec.
What's next
- Learn more about the concepts involved in hyperparameter tuning.
- Read a blog post measuring the impact of recent improvements to hyperparameter tuning.
- Read a blog post about Bayesian optimization and hyperparameter tuning.