This page shows you how to use Cloud Machine Learning Engine hyperparameter tuning when training your model. The process involves making some changes to your TensorFlow application code and adding some configuration information when you submit your training job. You can learn more about this feature in the hyperparameter tuning overview in this documentation.
Hyperparameter tuning optimizes a single target variable that you specify. The target variable is called the hyperparameter metric.
The steps involved in hyperparameter tuning
To use hyperparameter tuning in your training job you must perform the following steps:

- Decide which hyperparameters you want to tune.
- Ensure your training application includes the following:
  - Add command-line arguments for each hyperparameter you want to tune.
  - Use the values passed in those arguments to set the value of the hyperparameters for your training trial.
  - Add your hyperparameter metric to the summary for your graph.
- Specify the hyperparameters to tune by including a HyperparameterSpec with your training job's configuration data.
Decide which hyperparameters to tune
Before you make any changes to your application code, think about which hyperparameters have the greatest effect on your target metric. Remember that each additional hyperparameter you tune significantly increases the time a tuning job takes.
Check the code in your training application
What you need to add to your code depends on whether you're using the TensorFlow Estimator API or the core TensorFlow APIs. Complete examples of using both on Cloud ML Engine are available in the sample applications on GitHub.
Add command-line arguments for the hyperparameters you want to tune
Cloud ML Engine sets command-line arguments when it calls your training application. Define a name for each hyperparameter argument and parse it in your application using whatever argument parser you prefer (typically argparse). You must use the same argument names when you configure your training job, as described below.
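For example, a minimal sketch of this parsing with argparse might look like the following. The argument names (--hidden1, --learning-rate) are illustrative, not prescribed:

import argparse

# Parse the hyperparameter arguments that Cloud ML Engine passes on the
# command line. The names here are examples; they must match the
# parameterName values in your HyperparameterSpec.
parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', type=str, default='',
                    help='Output directory for the training job.')
parser.add_argument('--hidden1', type=int, default=100,
                    help='Number of units in the first hidden layer.')
parser.add_argument('--learning-rate', type=float, default=0.01,
                    help='Initial learning rate.')
args = parser.parse_args()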
Set your hyperparameters to the values received
Assign the values from the command-line arguments to the hyperparameters in your graph.
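Continuing the argparse sketch above, the assignment might look like the following; the estimator type and feature column are assumptions chosen to keep the sketch self-contained:

import tensorflow as tf

# A single numeric feature column, for illustration only.
feature_cols = [tf.feature_column.numeric_column('x')]

# Feed the parsed hyperparameter values into the model.
estimator = tf.estimator.DNNLinearCombinedRegressor(
    linear_feature_columns=feature_cols,
    dnn_feature_columns=feature_cols,
    dnn_hidden_units=[args.hidden1, max(args.hidden1 // 2, 1)],
    dnn_optimizer=tf.train.AdamOptimizer(learning_rate=args.learning_rate))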
Add your hyperparameter metric to the graph summary
Cloud ML Engine looks for your hyperparameter metric when the graph's summary writer is called. Note: Canned TensorFlow estimators use the same metric name for training and evaluation. You need a separate metric for hyperparameter tuning so that Cloud ML Engine can determine the source of the metric.
If you're using the TensorFlow Estimator API, use the following code to add your hyperparameter metric to the summary for your graph. The example assumes that the name of your metric is metric1:
# Create metric for hyperparameter tuning
def my_metric(labels, predictions):
    pred_values = predictions['predictions']
    return {'metric1': tf.metrics.root_mean_squared_error(labels, pred_values)}

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
    estimator = tf.estimator.DNNLinearCombinedRegressor(...)
    estimator = tf.contrib.estimator.add_metrics(estimator, my_metric)
    train_spec = ...
    exporter = ...
    eval_spec = tf.estimator.EvalSpec(
        input_fn = ...,
        start_delay_secs = 60,  # start evaluating after N seconds
        throttle_secs = 300,    # evaluate every N seconds
        exporters = exporter)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
If you're using the core TensorFlow APIs, create a summary writer, tf.summary.FileWriter, and add a summary to the writer with your metric as a tag. The following example assumes that the name of your metric is metric1:
import os
import tensorflow as tf
from tensorflow.core.framework.summary_pb2 import Summary
...
# The tag must match the hyperparameterMetricTag you set in your job
# configuration; here it is the metric name assumed above, metric1.
summary = Summary(value=[Summary.Value(tag='metric1', simple_value=loss_val)])
eval_path = os.path.join(args['job_dir'], 'metric1')
summary_writer = tf.summary.FileWriter(eval_path)

# Note: adding the summary to the writer is enough for hyperparameter tuning.
# ML Engine looks for any summary added with the hyperparameter metric tag.
summary_writer.add_summary(summary)
summary_writer.flush()
If you don't specify a hyperparameterMetricTag, Cloud ML Engine looks for a metric with the name training/hptuning/metric.
Manage the output file location
Note: If you are using the --job-dir argument to specify where the training job must store its model, you can skip this section. The hyperparameter tuning trial number is automatically appended to the --job-dir argument as a subdirectory for each trial.
You should write your application to output to a different subdirectory for each hyperparameter tuning trial. If you don't, each trial overwrites the previous one and you lose your data.
We recommend using a base output location with the hyperparameter tuning trial number appended to it. The number of the trial running on a given replica is stored in the TF_CONFIG environment variable as the trial member of the task object.
The following example shows how you might construct an output path in individual
replicas of your training job.
def makeTrialOutputPath(output_path):
    '''
    For a given static output path, returns a path with
    the hyperparameter tuning trial number appended.
    Dependencies: os, json
    '''
    # Get the configuration data from the environment variable.
    env = json.loads(os.environ.get('TF_CONFIG', '{}'))
    # Get the task information.
    taskInfo = env.get('task')
    if taskInfo:
        trial = taskInfo.get('trial', '')
        if trial:
            return os.path.join(output_path, trial)
    return output_path
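For reference, a quick usage sketch; the TF_CONFIG contents below are illustrative and show only the task fields this function reads:

import json
import os

# Simulate the environment a replica would see during trial 7.
os.environ['TF_CONFIG'] = json.dumps(
    {'task': {'type': 'master', 'index': 0, 'trial': '7'}})

print(makeTrialOutputPath('gs://my-bucket/output'))
# gs://my-bucket/output/7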
Specify the hyperparameter tuning configuration for your training job
With your training application coded to handle hyperparameter tuning, you must also include the specific configuration to use when you start a training job. Configure your hyperparameter tuning information in a HyperparameterSpec object and add it to your TrainingInput object as the hyperparameters object.
Set the hyperparameterMetricTag member in your HyperparameterSpec to a value representing your chosen metric, for example metric1. If you don't specify a hyperparameterMetricTag, Cloud ML Engine looks for a metric with the name training/hptuning/metric.
gcloud
Add your hyperparameter configuration information to your configuration YAML file. Below is an example. For a working config file, see hptuning_config.yaml in the census estimator sample.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell
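You would then pass this file with the --config flag when submitting the job. The command below is a sketch: the job name, module, package path, region, and bucket are placeholders, and it assumes the configuration above is saved as hptuning_config.yaml.

gcloud ml-engine jobs submit training my_tuning_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --job-dir gs://my-bucket/output \
    --config hptuning_config.yaml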
Python
When configuring your training job in Python code, you make a dictionary representing your HyperparameterSpec and add it to your training input.

The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.
# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': []}

hyperparams['params'].append({
    'parameterName': 'hidden1',
    'type': 'INTEGER',
    'minValue': 40,
    'maxValue': 400,
    'scaleType': 'UNIT_LINEAR_SCALE'})

hyperparams['params'].append({
    'parameterName': 'numRnnCells',
    'type': 'DISCRETE',
    'discreteValues': [1, 2, 3, 4]})

hyperparams['params'].append({
    'parameterName': 'rnnCellType',
    'type': 'CATEGORICAL',
    'categoricalValues': [
        'BasicLSTMCell',
        'BasicRNNCell',
        'GRUCell',
        'LSTMCell',
        'LayerNormBasicLSTMCell'
    ]})

# Add hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}
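From there, you might submit the job spec with the Google API Python client. This is a sketch: my_project_id is a placeholder, and it assumes you have application default credentials configured.

from googleapiclient import discovery

# Build a representation of the Cloud ML Engine API and submit the job.
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent='projects/{}'.format(my_project_id))
response = request.execute()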
You can get more details about hyperparameter types and values in the hyperparameter tuning overview.
Monitoring hyperparameter tuning in progress
You can monitor hyperparameter tuning by getting the detailed status of your running training job.
The TrainingOutput object in the response's Job resource has the following values set during a training job with hyperparameter tuning:
- isHyperparameterTuningJob set to True.
- trials is present and contains a list of HyperparameterOutput objects, one per trial.
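For example, a sketch of retrieving that status with the Google API Python client; the project and job names are placeholders:

from googleapiclient import discovery

cloudml = discovery.build('ml', 'v1')
job_name = 'projects/my-project/jobs/my_tuning_job'
job = cloudml.projects().jobs().get(name=job_name).execute()

training_output = job.get('trainingOutput', {})
print(training_output.get('isHyperparameterTuningJob'))  # True
for trial in training_output.get('trials', []):
    # Each HyperparameterOutput reports the trial's hyperparameter values
    # and, once available, its final objective metric.
    print(trial.get('trialId'), trial.get('hyperparameters'),
          trial.get('finalMetric'))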
Getting hyperparameter tuning results
When the training runs are complete, you can call projects.jobs.get to get the results. The TrainingOutput object in the job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.
Use the same detailed status request that you do to monitor the job during processing to get this information.
You'll get the results from each trial in the job description. Find the trial that yielded the most desirable value for your hyperparameter metric. If the trial meets your standard for success of the model, you can use the hyperparameter values shown for that trial in subsequent runs of your model.
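Continuing the status sketch above, selecting the best trial might look like this; it assumes a goal of MAXIMIZE and that every listed trial reported a finalMetric:

# Pick the trial with the highest objective value.
trials = job['trainingOutput']['trials']
best = max(trials, key=lambda t: float(t['finalMetric']['objectiveValue']))
print(best['trialId'], best['hyperparameters'])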
Sometimes you will find multiple trials that give identical results for your tuning metric. In such a case, you should determine which of the hyperparameter values are most advantageous by other measures. For example, if you are tuning the number of nodes in a hidden layer and you get identical results with the value set to 8 as with it set to 20, you should use 8, because more nodes means more processing and cost for no improvement in your model.
Continuing a completed hyperparameter tuning job
You can continue a completed hyperparameter tuning job. This makes it possible to reuse the knowledge gained in the previous hyperparameter tuning job and start from a state that is partially optimized.
To resume a hyperparameter tuning job, set the resumePreviousJobId value of the HyperparameterSpec object to the job ID of the previous tuning job, and specify maxTrials and maxParallelTrials values.

Cloud ML Engine then uses the previous job ID to find and reuse the same goal, params, and hyperparameterMetricTag values to continue the hyperparameter tuning job.
gcloud
The following example adds the configuration for resuming a previous hyperparameter tuning job to the example YAML file shown in the training configuration instructions.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]
Python
When configuring your training job in Python code, you make a dictionary representing your HyperparameterSpec and add it to your training input.
The following example assumes that you have already created a TrainingInput dictionary (in this case named training_inputs) as shown in the training job configuration guide.
# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}
# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams
# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}