BigQuery ML supports hyperparameter tuning when training ML models using
CREATE MODEL
statements. Hyperparameter tuning is commonly used to improve
model performance by searching for the optimal hyperparameters. Hyperparameter
tuning supports the model types listed in the Hyperparameters and Objectives table below.
For answers to common questions about fine-tuning machine learning models using BigQuery ML hyperparameter tuning, see the Q&A section below.
For information about BigQuery ML hyperparameter tuning, see the Hyperparameter tuning overview.
For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, see End-to-end user journey for each model.
For the locations where BigQuery ML hyperparameter tuning is available, see Locations.
CREATE MODEL syntax
To run hyperparameter tuning, add the num_trials training option to the
CREATE MODEL statement to specify the maximum number of submodels to train.
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(Existing Training Options,
  NUM_TRIALS = int64_value
  [, MAX_PARALLEL_TRIALS = int64_value ]
  [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]
  [, hyperparameter = { HPARAM_RANGE(min, max) | HPARAM_CANDIDATES([candidates]) } ... ]
  [, HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... } ]
  [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
  [, DATA_SPLIT_COL = string_value ]
  [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
  [, DATA_SPLIT_TEST_FRACTION = float64_value ]
)
AS query_statement
Existing Training Options
Hyperparameter tuning supports most training options, with the limitation that a training option can't be both explicitly set and declared as a tunable hyperparameter. For example, the following combination is not valid:
l1_reg=0.1, l1_reg=hparam_range(0, 10)
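By contrast, a minimal valid sketch might look like the following (the dataset, table, and column names are hypothetical): l1_reg is tuned over a range, while l2_reg is explicitly set and therefore not tuned.

CREATE OR REPLACE MODEL `mydataset.tuned_linear_reg`
OPTIONS(
  model_type = 'LINEAR_REG',
  input_label_cols = ['label'],
  num_trials = 20,
  l1_reg = hparam_range(0, 10),  -- tuned over a continuous search space
  l2_reg = 0.5                   -- explicitly set, so not tunable
) AS
SELECT * FROM `mydataset.training_data`;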
NUM_TRIALS
Syntax
NUM_TRIALS = int64_value
Description
The maximum number of submodels to train. The tuning will stop when num_trials
submodels are trained, or when the hyperparameter search space is exhausted.
The maximum value is 100.
Arguments
int64_value is an INT64 value. Allowed values are 1 to 100.
MAX_PARALLEL_TRIALS
Syntax
MAX_PARALLEL_TRIALS = int64_value
Description
The maximum number of trials to run at the same time. The default value is 1 and the maximum value is 5.
Arguments
int64_value is an INT64 value. Allowed values are 1 to 5.
HPARAM_TUNING_ALGORITHM
Syntax
HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' }
Description
The algorithm used to tune the hyperparameters.
Arguments
HPARAM_TUNING_ALGORITHM accepts the following values:
'VIZIER_DEFAULT' (default and recommended): Uses the default algorithm in Vertex AI Vizier to tune hyperparameters. This is the most powerful tuning algorithm; it performs a mixture of advanced search algorithms, including Bayesian optimization with Gaussian processes, and uses transfer learning to take advantage of previously tuned models.
'RANDOM_SEARCH': Uses random search to explore the search space.
'GRID_SEARCH': Uses grid search to explore the search space. This algorithm is only available when every hyperparameter's search space is discrete (see the sketch below).
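For instance, a hedged sketch of a grid search configuration (the dataset and table names are hypothetical): every tuned hyperparameter uses discrete candidates, which is what 'GRID_SEARCH' requires.

CREATE OR REPLACE MODEL `mydataset.tuned_boosted_tree`
OPTIONS(
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  input_label_cols = ['label'],
  num_trials = 9,  -- 3 learn_rate candidates x 3 max_tree_depth candidates
  hparam_tuning_algorithm = 'GRID_SEARCH',
  learn_rate = hparam_candidates([0.01, 0.1, 0.3]),
  max_tree_depth = hparam_candidates([4, 6, 8])
) AS
SELECT * FROM `mydataset.training_data`;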
HYPERPARAMETER
Syntax
hyperparameter={HPARAM_RANGE(min, max) | HPARAM_CANDIDATES([candidates]) }...
Description
The configuration of the search space of a hyperparameter. See Hyperparameters and Objectives for each model type to see its supported tunable hyperparameters.
Arguments
Accepts one of the following arguments:
HPARAM_RANGE(min, max): Use this argument to specify the search space of continuous values for a hyperparameter, for example learn_rate = HPARAM_RANGE(0.0001, 1.0).
HPARAM_CANDIDATES([candidates]): Use this argument to specify a hyperparameter with discrete values, for example OPTIMIZER = HPARAM_CANDIDATES(['adagrad', 'sgd', 'ftrl']).
HPARAM_TUNING_OBJECTIVES
Syntax
HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... }
Description
The objective metrics for the model. The candidates are a subset of the model evaluation metrics. Currently only one objective is supported.
Arguments
See Hyperparameters and Objectives for each model type to see its supported arguments and defaults.
DATA_SPLIT_TEST_FRACTION
Syntax
DATA_SPLIT_TEST_FRACTION = float64_value
Description
This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used as test data for the final evaluation metrics reporting. The fraction is accurate to two decimal places. See the Data Split section for more details.
Arguments
float64_value
is a FLOAT64
that specifies the
fraction of the data used as test data for the final evaluation metrics
reporting. The default value is 0.0.
Hyperparameters and Objectives
The following table lists the supported hyperparameters and their objectives for each model type:
Model type | Hyperparameter Objectives | Hyperparameter | Valid Range | Default Range | Scale Type |
---|---|---|---|---|---|
LINEAR_REG | mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance | l1_reg | (0, ∞] | (0, 10] | LOG |
| | l2_reg | (0, ∞] | (0, 10] | LOG |
LOGISTIC_REG | precision, recall, accuracy, f1_score, log_loss, roc_auc (default) | l1_reg | (0, ∞] | (0, 10] | LOG |
| | l2_reg | (0, ∞] | (0, 10] | LOG |
KMEANS | davies_bouldin_index | num_clusters | [2, 100] | [2, 10] | LINEAR |
MATRIX_FACTORIZATION (explicit) | mean_squared_error | num_factors | [2, 200] | [2, 20] | LINEAR |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
MATRIX_FACTORIZATION (implicit) | mean_average_precision (default), mean_squared_error, normalized_discounted_cumulative_gain, average_rank | num_factors | [2, 200] | [2, 20] | LINEAR |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | wals_alpha | [0, ∞) | [0, 100] | LINEAR |
AUTOENCODER | mean_absolute_error, mean_squared_error, mean_squared_log_error | learn_rate | [0, 1] | [0, 1] | LOG |
| | batch_size | (0, ∞) | [16, 1024] | LOG |
| | l1_reg | (0, ∞) | (0, 10] | LOG |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | l1_reg_activation | (0, ∞) | (0, 10] | LOG |
| | dropout | [0, 1) | [0, 0.8] | LINEAR |
| | hidden_units | Array of [1, ∞) | N/A | N/A |
| | optimizer | { adam, adagrad, ftrl, rmsprop, sgd } | { adam, adagrad, ftrl, rmsprop, sgd } | N/A |
| | activation_fn | { relu, relu6, crelu, elu, selu, sigmoid, tanh } | N/A | N/A |
DNN_CLASSIFIER | precision, recall, accuracy, f1_score, log_loss, roc_auc (default) | batch_size | (0, ∞) | [16, 1024] | LOG |
| | dropout | [0, 1) | [0, 0.8] | LINEAR |
| | hidden_units | Array of [1, ∞) | N/A | N/A |
| | learn_rate | [0, 1] | [0, 1] | LINEAR |
| | optimizer | { adam, adagrad, ftrl, rmsprop, sgd } | { adam, adagrad, ftrl, rmsprop, sgd } | N/A |
| | l1_reg | (0, ∞) | (0, 10] | LOG |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | activation_fn | { relu, relu6, crelu, elu, selu, sigmoid, tanh } | N/A | N/A |
DNN_REGRESSOR | mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance | Same hyperparameters, ranges, and scale types as DNN_CLASSIFIER | | | |
DNN_LINEAR_COMBINED_CLASSIFIER | precision, recall, accuracy, f1_score, log_loss, roc_auc (default) | batch_size | (0, ∞) | [16, 1024] | LOG |
| | dropout | [0, 1) | [0, 0.8] | LINEAR |
| | hidden_units | Array of [1, ∞) | N/A | N/A |
| | l1_reg | (0, ∞) | (0, 10] | LOG |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | activation_fn | { relu, relu6, crelu, elu, selu, sigmoid, tanh } | N/A | N/A |
DNN_LINEAR_COMBINED_REGRESSOR | mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance | Same hyperparameters, ranges, and scale types as DNN_LINEAR_COMBINED_CLASSIFIER | | | |
BOOSTED_TREE_CLASSIFIER | precision, recall, accuracy, f1_score, log_loss, roc_auc (default) | learn_rate | [0, ∞) | [0, 1] | LINEAR |
| | l1_reg | (0, ∞) | (0, 10] | LOG |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | dropout | [0, 1] | N/A | LINEAR |
| | max_tree_depth | [1, 20] | [1, 10] | LINEAR |
| | subsample | (0, 1] | (0, 1] | LINEAR |
| | min_split_loss | [0, ∞) | N/A | LINEAR |
| | num_parallel_tree | [1, ∞) | N/A | LINEAR |
| | min_tree_child_weight | [0, ∞) | N/A | LINEAR |
| | colsample_bytree | [0, 1] | N/A | LINEAR |
| | colsample_bylevel | [0, 1] | N/A | LINEAR |
| | colsample_bynode | [0, 1] | N/A | LINEAR |
| | booster_type | { gbtree, dart } | N/A | N/A |
| | dart_normalize_type | { tree, forest } | N/A | N/A |
| | tree_method | { auto, exact, approx, hist } | N/A | N/A |
BOOSTED_TREE_REGRESSOR | mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance | Same hyperparameters, ranges, and scale types as BOOSTED_TREE_CLASSIFIER | | | |
RANDOM_FOREST_CLASSIFIER | precision, recall, accuracy, f1_score, log_loss, roc_auc (default) | l1_reg | (0, ∞) | (0, 10] | LOG |
| | l2_reg | (0, ∞) | (0, 10] | LOG |
| | max_tree_depth | [1, 20] | [1, 20] | LINEAR |
| | subsample | (0, 1) | (0, 1) | LINEAR |
| | min_split_loss | [0, ∞) | N/A | LINEAR |
| | num_parallel_tree | [2, ∞) | [2, 200] | LINEAR |
| | min_tree_child_weight | [0, ∞) | N/A | LINEAR |
| | colsample_bytree | [0, 1] | N/A | LINEAR |
| | colsample_bylevel | [0, 1] | N/A | LINEAR |
| | colsample_bynode | [0, 1] | N/A | LINEAR |
| | tree_method | { auto, exact, approx, hist } | N/A | N/A |
RANDOM_FOREST_REGRESSOR | mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance | Same hyperparameters, ranges, and scale types as RANDOM_FOREST_CLASSIFIER | | | |
The default hyperparameters and their search spaces are applied if no explicit search space is specified using hparam_range or hparam_candidates for any parameter. The default hyperparameter tuning objective is applied if hparam_tuning_objectives is not explicitly specified.
Most LOG scale hyperparameters use an open lower boundary of 0. You can still set 0 as the lower boundary in the hparam_range function; it is converted to 1e-14.
Conditional hyperparameters are supported. For example, dart_normalize_type is only tunable when booster_type is 'dart'. In that case, you can still specify both search spaces, and the conditions are handled automatically:
booster_type=hparam_candidates(['dart', 'gbtree'])
dart_normalize_type=hparam_candidates(['tree', 'forest'])
Search Starting Point
If no explicit search space is specified using hparam_range or hparam_candidates for any parameter of the hyperparameter tuning model, the search starts from the default hyperparameters of non-hyperparameter-tuning models.
If a search space is specified using hparam_range or hparam_candidates, the search starting point depends on whether the search space includes the default hyperparameter value of non-hyperparameter-tuning models:
- If it includes the default value: the search starts from the default hyperparameters of non-hyperparameter-tuning models.
- If it doesn't include the default value: the search starts from the value in the search space that is nearest to the default, as illustrated below.
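As a hedged illustration, assuming the non-tuning default for l1_reg is 0:

-- Default (0) lies inside the range, so the search starts at 0.
l1_reg = hparam_range(0, 10)

-- Default (0) lies outside the range, so the search starts at the
-- nearest in-range value, 3.
l1_reg = hparam_range(3, 10)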
Data Split
The training and evaluation data split differs from non-hyperparameter-tuning models: hyperparameter tuning introduces a three-way split that divides data into training, evaluation, and test sets. The training and evaluation sets are used in each trial as in normal models, and the trial hyperparameter suggestions are calculated based on the evaluation data metrics. At the end of each trial, the third set, the test set, is used to evaluate the trial and record its metrics in the model, which keeps the final reported evaluation metrics objective because they come from a final, unseen test set. In short, evaluation data is used to calculate the intermediate metrics for hyperparameter suggestion, while test data is used to calculate the final objective model metrics.
All hyperparameter tuning uses a 3-way data split unless
data_split_method='NO_SPLIT'
is specified. Use 'NO_SPLIT'
with caution when
tuning num_factors
in a matrix factorization model. Although 'NO_SPLIT'
allows better performance with larger num_factors, the validity of the results
might be degraded.
The following table displays behavior for non-hyperparameter-tuning and hyperparameter tuning training:
Data split method | Non-hyperparameter-tuning | Hyperparameter tuning |
---|---|---|
NO_SPLIT | No data split. | No data split. |
RANDOM | Works with DATA_SPLIT_EVAL_FRACTION. | Adds a DATA_SPLIT_TEST_FRACTION to specify test data fraction. |
SEQUENTIAL | Works with DATA_SPLIT_EVAL_FRACTION. | Adds a DATA_SPLIT_TEST_FRACTION to specify test data fraction. Data is sequentially split into training, evaluation, and test. |
CUSTOM | Works with a BOOL DATA_SPLIT_COL: TRUE indicates evaluation data; FALSE indicates training data. | Works with a STRING DATA_SPLIT_COL: 'TRAIN' indicates training data, 'EVAL' indicates evaluation data, and 'TEST' indicates test data. |
AUTO | With fewer than 500 rows in the input data, all rows are used as training data. With between 500 and 50,000 rows, 20% of the data is used as evaluation data in a RANDOM split. With more than 50,000 rows, only 10,000 rows are used as evaluation data in a RANDOM split. | With fewer than 500 rows in the input data, all rows are used as training data. With more than 500 rows, 10% of the data is used as evaluation data and another 10% is used as test data in a RANDOM split. |
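For example, a hedged sketch of the CUSTOM split for a hyperparameter tuning job (the dataset, table, and column names are hypothetical); the split column holds the string values listed in the table above:

CREATE OR REPLACE MODEL `mydataset.tuned_logistic_reg`
OPTIONS(
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['label'],
  num_trials = 20,
  l1_reg = hparam_range(0, 10),
  data_split_method = 'CUSTOM',
  data_split_col = 'split_col'  -- each row contains 'TRAIN', 'EVAL', or 'TEST'
) AS
SELECT * FROM `mydataset.training_data`;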
If the 3-way split is not desired, you can use a 2-way split by setting
DATA_SPLIT_TEST_FRACTION to 0. When the test set is empty, the evaluation
set is used as the test set for the final evaluation metrics reporting.
The metrics from models generated by a normal training job and a hyperparameter tuning job are only comparable when the data split fractions are equal. For example:
Non-hyperparameter-tuning,
data_split_method='random', data_split_eval_fraction=0.2
Hyperparameter tuning,
data_split_method='random', data_split_eval_fraction=0.2, data_split_test_fraction=0
Performance Guarantee
When the default search space is used, the model produced by hyperparameter tuning is guaranteed to perform no worse than the non-hyperparameter-tuning model, because the first trial always uses the default hyperparameters.
To confirm the performance improvements provided by hyperparameter tuning, compare the optimal trial to the first trial.
Neural Architecture Search
Neural architecture search for neural network models is supported by specifying
hidden_units candidates of type ARRAY<STRUCT<ARRAY<INT64>>>. Each STRUCT in the
array represents a candidate neural architecture, where the hidden units of
each layer are specified as an array of INT64.
For example:
hidden_units=hparam_candidates([struct([8]), struct([8, 16]), struct([16, 32, 64])])
represents a neural architecture search with three candidates: a single layer of 8 neurons; two layers with 8 and 16 neurons; and three layers with 16, 32, and 64 neurons.
Transfer Learning
Transfer learning is enabled by default when hparam_tuning_algorithm is
'VIZIER_DEFAULT'. Hyperparameter tuning benefits from previously tuned models
of the same model_type in the same project when the hyperparameter search space
is the same as, or a subset of, the search space of the previous tuning jobs.
Transfer learning doesn't require the input data to be the same.
Transfer learning helps solve the cold start problem where the system performs random exploration during the first trial batch. Transfer learning provides the system with some initial knowledge about the hyperparameters and their objectives. To continuously improve the model quality, always train a new hyperparameter tuning model with the same or a subset of hyperparameters.
"Subset” refers to hyperparameter names and types, but not their ranges. For example,
(a:[0, 10])
is considered as a subset of(a:[-1, 1], b:[0, 1])
.Transfer learning helps hyperparameter tuning converge faster, instead of helping submodels to converge.
Error Handling
Hyperparameter tuning handles errors in the following ways:
Cancellation: If a training job is cancelled while running, then all successful trials remain usable.
Invalid input: If the user input is invalid, then a user error is returned.
Invalid hyperparameters: If the hyperparameters are invalid for a trial, then the trial is skipped and marked as INFEASIBLE in the ML.TRIAL_INFO function.
Trial internal error: If more than 10% of num_trials fail due to INTERNAL_ERROR, the training job stops and a user error is returned. If less than 10% of num_trials fail due to INTERNAL_ERROR, training continues, and the failed trials are marked as FAILED in the ML.TRIAL_INFO function, as shown in the query sketch below.
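To inspect how each trial ended, you can query ML.TRIAL_INFO; a hedged sketch with a hypothetical model name:

-- The output reports each trial, including trials marked INFEASIBLE or FAILED.
SELECT *
FROM ML.TRIAL_INFO(MODEL `mydataset.tuned_linear_reg`);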
Model Serving
Output models from hyperparameter tuning can be used by a number of existing model serving functions as listed below.
Existing functions
To use existing serving functions, follow these rules:
When the function takes input data, only the result from one trial (either the optimal trial by default, or a selected one) is returned:
ML.CONFUSION_MATRIX(MODEL model_name, input_data, STRUCT(<T> AS threshold)[, STRUCT(trial_id AS trial_id)])
ML.EVALUATE(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
ML.PREDICT(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
ML.RECOMMEND(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
ML.ROC_CURVE(MODEL model_name, input_data[, GENERATE_ARRAY(thresholds)][, STRUCT(trial_id AS trial_id)])
When the function takes no input data, all trial results are returned where the first output column is the trial_id:
ML.CENTROIDS(MODEL model_name)
ML.EVALUATE(MODEL model_name)
ML.FEATURE_INFO doesn't change because all trials share the same input data.
ML.WEIGHTS(MODEL model_name)
Evaluation metrics from ML.EVALUATE
and ML.TRIAL_INFO
are likely different when the 3-way data split is applied. By default, ML.EVALUATE
runs against the test data; whereas ML.TRIAL_INFO
runs against the eval data. See the Data Split section for more details.
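For example, a hedged sketch (with hypothetical model and table names) of serving a specific trial rather than the optimal one:

-- Predict with trial 3; omit the STRUCT argument to use the optimal trial.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.tuned_linear_reg`,
  TABLE `mydataset.new_data`,
  STRUCT(3 AS trial_id));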
Disallowed functions
ML.TRAINING_INFO returns information for each iteration, and iteration results
are not saved in hyperparameter tuning models. Trial results are saved instead,
and you can use ML.TRIAL_INFO to fetch them.
Model export
You can export models created with hyperparameter tuning to Cloud Storage
locations using the EXPORT MODEL
statement. You can export the default optimal trial or any specified trial.
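For example, a minimal sketch, assuming a hypothetical model name and Cloud Storage bucket; by default, the optimal trial is exported:

EXPORT MODEL `mydataset.tuned_linear_reg`
OPTIONS(URI = 'gs://my_bucket/exported_model/');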
Pricing
The cost of hyperparameter tuning training is the sum of the costs of all executed trials. The pricing of a trial is consistent with the existing BigQuery ML pricing model.
Q&A
How many trials do I need to tune a model?
The rule of thumb is to use at least 10 trials per hyperparameter, so the total number of trials should be at least 10 * num_hyperparameters. For example, tuning two hyperparameters such as l1_reg and l2_reg calls for at least 20 trials. If you are using the default search space, see the Hyperparameters and Objectives table above for the number of hyperparameters tuned by default.
What if I don’t see performance improvements by using hyperparameter tuning?
First, make sure you follow the guidance above for a fair comparison. If you still don't see performance improvements, the default hyperparameters may already work well for you. You might want to focus on feature engineering or try other model types before another round of hyperparameter tuning.
What if I want to continue tuning a model?
You can just train a new hyperparameter tuning model with the same search space. The built-in transfer learning will help to continue tuning based on your previously tuned models.
Do I need to retrain the model with all data and the optimal hyperparameters?
It depends on various factors:
- KMEANS: All data is already used as training data, so there is no need to retrain the model.
- MATRIX_FACTORIZATION: You can retrain the model with the selected hyperparameters and all of the input data for better coverage of users and items.
- Other model types: In most cases you don't need to retrain. The default random data split already keeps 80% of the data as training data. If your dataset is small, you can still retrain the model with more training data and the selected hyperparameters, but note that leaving little evaluation data for early stopping can worsen overfitting.
Why aren't there training and evaluation results in the Google Cloud console?
- Confirm the Enable Editor Tabs option is enabled in the Google Cloud console.