Hyperparameter tuning for CREATE MODEL statements

BigQuery ML supports hyperparameter tuning when training ML models using CREATE MODEL statements. Hyperparameter tuning is commonly used to improve model performance by searching for the optimal hyperparameters. For the supported model types, see the Hyperparameters and Objectives section below.

For answers to common questions about using BigQuery ML hyperparameter tuning to fine-tune machine learning models, see the Q&A section below.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

To run hyperparameter tuning, add the num_trials training option to the CREATE MODEL statement to specify the maximum number of submodels to train.

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(Existing Training Options,
        NUM_TRIALS = int64_value,
        MAX_PARALLEL_TRIALS = int64_value,
        HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' },
        hyperparameter = { HPARAM_RANGE(min, max) | HPARAM_CANDIDATES([candidates]) } [, ...],
        HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... },
        DATA_SPLIT_TEST_FRACTION = float64_value)
AS query_statement
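
For example, the following statement is a minimal sketch of a tuning run. The dataset, table, and label column names are hypothetical, and because hparam_tuning_objectives is not set, the default objective for the model type is used:

CREATE OR REPLACE MODEL `mydataset.tuned_linear_reg`
OPTIONS(model_type='LINEAR_REG',
        input_label_cols=['label'],
        num_trials=20,
        max_parallel_trials=2,
        hparam_tuning_algorithm='VIZIER_DEFAULT',
        l1_reg=hparam_range(0, 10),
        l2_reg=hparam_range(0, 10),
        data_split_method='RANDOM',
        data_split_eval_fraction=0.1,
        data_split_test_fraction=0.1)
AS SELECT * FROM `mydataset.training_data`;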

Existing Training Options

Hyperparameter tuning supports most training options with the limitation that once a training option is explicitly set, it can’t be treated as a tunable hyperparameter. For example, the combination below is not valid:

l1_reg=0.1, l1_reg=hparam_range(0, 10)
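
Either setting on its own is valid. A sketch of the two mutually exclusive forms:

-- Fix l1_reg to a constant value (not tuned):
l1_reg=0.1

-- Or tune l1_reg over a search space (not fixed):
l1_reg=hparam_range(0, 10)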

NUM_TRIALS

Syntax

NUM_TRIALS = int64_value

Description

The maximum number of submodels to train. The tuning will stop when num_trials submodels are trained, or when the hyperparameter search space is exhausted. The maximum value is 100.

Arguments

int64_value is an 'INT64'. Allowed values are 1 to 100.

MAX_PARALLEL_TRIALS

Syntax

MAX_PARALLEL_TRIALS = int64_value

Description

The maximum number of trials to run at the same time. The default value is 1 and the maximum value is 5.

Arguments

int64_value is an 'INT64'. Allowed values are 1 to 5.

HPARAM_TUNING_ALGORITHM

Syntax

HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' }

Description

The algorithm used to tune the hyperparameters.

Arguments

HPARAM_TUNING_ALGORITHM accepts the following values:

  • 'VIZIER_DEFAULT' (default and recommended): Uses the default algorithm in Vertex Vizier to tune hyperparameters. This algorithm is the most powerful of the three and performs a mixture of advanced search algorithms, including Bayesian optimization with Gaussian processes. It also uses transfer learning to leverage previously tuned models.

  • 'RANDOM_SEARCH': Uses random search to explore the search space.

  • 'GRID_SEARCH': Uses grid search to explore the search space. This algorithm is only available when every hyperparameter's search space is discrete.
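
For example, in a DNN model the following hypothetical option fragment keeps every tuned hyperparameter discrete, so 'GRID_SEARCH' is allowed:

hparam_tuning_algorithm='GRID_SEARCH',
optimizer=hparam_candidates(['adam', 'sgd']),
activation_fn=hparam_candidates(['relu', 'tanh'])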

HYPERPARAMETER

Syntax

hyperparameter={HPARAM_RANGE(min, max) | HPARAM_CANDIDATES([candidates]) }...

Description

The configuration of the search space of a hyperparameter. See the Hyperparameters and Objectives section for the tunable hyperparameters supported by each model type.

Arguments

Accepts one of the following arguments:

  • HPARAM_RANGE(min, max): Specifies a continuous search space for a hyperparameter, for example learn_rate = HPARAM_RANGE(0.0001, 1.0).

  • HPARAM_CANDIDATES([candidates]): Specifies a hyperparameter with discrete values, for example OPTIMIZER = HPARAM_CANDIDATES(['adagrad', 'sgd', 'ftrl']).

HPARAM_TUNING_OBJECTIVES

Syntax

HPARAM_TUNING_OBJECTIVES = { 'R2_SCORE' | 'ROC_AUC' | ... }

Description

The objective metrics for the model. The candidates are a subset of the model evaluation metrics. Currently only one objective is supported.

Arguments

See the Hyperparameters and Objectives section for the supported objectives and their defaults for each model type.

DATA_SPLIT_TEST_FRACTION

Syntax

DATA_SPLIT_TEST_FRACTION = float64_value

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used as test data for the final evaluation metrics reporting. The fraction is accurate to two decimal places. See the Data Split section for more details.

Arguments

float64_value is a FLOAT64 that specifies the fraction of the data used as test data for the final evaluation metrics reporting. The default value is 0.0.

Hyperparameters and Objectives

The following lists show the supported hyperparameters and tuning objectives for each model type. Each hyperparameter entry lists, in order, its valid range, its default search range, and its scale type.

LINEAR_REG
  Hyperparameters:
    • l1_reg: (0, ∞]; (0, 10]; LOG
    • l2_reg: (0, ∞]; (0, 10]; LOG
  Objectives: mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance

LOGISTIC_REG
  Hyperparameters:
    • l1_reg: (0, ∞]; (0, 10]; LOG
    • l2_reg: (0, ∞]; (0, 10]; LOG
  Objectives: precision, recall, accuracy, f1_score, log_loss, roc_auc (default)

KMEANS
  Hyperparameters:
    • num_clusters: [2, 100]; [2, 10]; LINEAR
  Objectives: davies_bouldin_index

MATRIX_FACTORIZATION (explicit)
  Hyperparameters:
    • num_factors: [2, 200]; [2, 20]; LINEAR
    • l2_reg: (0, ∞); (0, 10]; LOG
  Objectives: mean_squared_error

MATRIX_FACTORIZATION (implicit)
  Hyperparameters:
    • num_factors: [2, 200]; [2, 20]; LINEAR
    • l2_reg: (0, ∞); (0, 10]; LOG
    • wals_alpha: [0, ∞); [0, 100]; LINEAR
  Objectives: mean_average_precision (default), mean_squared_error, normalized_discounted_cumulative_gain, average_rank

DNN_CLASSIFIER
  Hyperparameters:
    • batch_size: (0, ∞); [16, 1024]; LOG
    • dropout: [0, ∞); [0, 0.8]; LINEAR
    • hidden_units: Array of [1, ∞); N/A; N/A
    • learn_rate: [0, 1]; [0, 1]; LINEAR
    • optimizer: {adam, adagrad, ftrl, rmsprop, sgd}; {adam, adagrad, ftrl, rmsprop, sgd}; N/A
    • activation_fn: {relu, relu6, crelu, elu, selu, sigmoid, tanh}; N/A; N/A
  Objectives: precision, recall, accuracy, f1_score, log_loss, roc_auc (default)

DNN_REGRESSOR
  Hyperparameters:
    • batch_size: (0, ∞); [16, 1024]; LOG
    • dropout: [0, ∞); [0, 0.8]; LINEAR
    • hidden_units: Array of [1, ∞); N/A; N/A
    • learn_rate: [0, 1]; [0, 1]; LINEAR
    • optimizer: {adam, adagrad, ftrl, rmsprop, sgd}; {adam, adagrad, ftrl, rmsprop, sgd}; N/A
    • activation_fn: {relu, relu6, crelu, elu, selu, sigmoid, tanh}; N/A; N/A
  Objectives: mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance

BOOSTED_TREE_CLASSIFIER
  Hyperparameters:
    • learn_rate: [0, ∞); [0, 1]; LINEAR
    • l1_reg: (0, ∞); (0, 10]; LOG
    • l2_reg: (0, ∞); (0, 10]; LOG
    • dropout: [0, 1]; N/A; LINEAR
    • max_tree_depth: [0, ∞); [1, 10]; LINEAR
    • subsample: (0, 1]; (0, 1]; LINEAR
    • min_split_loss: [0, ∞); N/A; LINEAR
    • num_parallel_tree: [1, ∞); N/A; LINEAR
    • min_tree_child_weight: [0, ∞); N/A; LINEAR
    • colsample_bytree: [0, 1]; N/A; LINEAR
    • colsample_bylevel: [0, 1]; N/A; LINEAR
    • colsample_bynode: [0, 1]; N/A; LINEAR
    • booster_type: {gbtree, dart}; N/A; N/A
    • dart_normalize_type: {tree, forest}; N/A; N/A
    • tree_method: {auto, exact, approx, hist}; N/A; N/A
  Objectives: precision, recall, accuracy, f1_score, log_loss, roc_auc (default)

BOOSTED_TREE_REGRESSOR
  Hyperparameters:
    • learn_rate: [0, ∞); [0, 1]; LINEAR
    • l1_reg: (0, ∞); (0, 10]; LOG
    • l2_reg: (0, ∞); (0, 10]; LOG
    • dropout: [0, 1]; N/A; LINEAR
    • max_tree_depth: [0, ∞); [1, 10]; LINEAR
    • subsample: (0, 1]; (0, 1]; LINEAR
    • min_split_loss: [0, ∞); N/A; LINEAR
    • num_parallel_tree: [1, ∞); N/A; LINEAR
    • min_tree_child_weight: [0, ∞); N/A; LINEAR
    • colsample_bytree: [0, 1]; N/A; LINEAR
    • colsample_bylevel: [0, 1]; N/A; LINEAR
    • colsample_bynode: [0, 1]; N/A; LINEAR
    • booster_type: {gbtree, dart}; N/A; N/A
    • dart_normalize_type: {tree, forest}; N/A; N/A
    • tree_method: {auto, exact, approx, hist}; N/A; N/A
  Objectives: mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score (default), explained_variance

If you don't explicitly specify a search space for a hyperparameter by using hparam_range or hparam_candidates, the default hyperparameters and their default search ranges from the lists above are applied. The default hparam_tuning_objectives value is applied if hparam_tuning_objectives is not explicitly specified.

Most LOG scale hyperparameters have an open lower boundary at 0. You can still set 0 as the lower boundary in the hparam_range function; it is converted to 1e-14.

Conditional hyperparameters are supported. For example, dart_normalize_type is only tunable when the value of booster_type is 'dart'. In that case, you can still specify search spaces for both hyperparameters, and the conditions are handled automatically, as in the example below and the sketch that follows it:

  • booster_type=hparam_candidates(['dart', 'gbtree'])
  • dart_normalize_type=hparam_candidates(['tree', 'forest'])
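
A sketch of how this might look in the options of a boosted tree classifier tuning job (only the tuning-related options are shown, and the trial count is hypothetical):

model_type='BOOSTED_TREE_CLASSIFIER',
num_trials=30,
booster_type=hparam_candidates(['dart', 'gbtree']),
dart_normalize_type=hparam_candidates(['tree', 'forest'])
-- dart_normalize_type is only explored in trials where booster_type is 'dart'.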

Data Split

The training and evaluation split differs from non-hyperparameter-tuning models: hyperparameter tuning introduces a 3-way data split that divides the data into training, evaluation, and test sets. In each trial, the training and evaluation sets are used as in normal models, and the trial hyperparameter suggestions are calculated from the evaluation data metrics. At the end of each trial, the third (test) set is used to evaluate the trial and record its metrics in the model, which ensures the objectivity of the final reported evaluation metrics by using an unseen test set. In short, evaluation data is used to calculate the intermediate metrics for hyperparameter suggestion, while test data is used to calculate the final, objective model metrics.

All hyperparameter tuning uses a 3-way data split unless 'NO_SPLIT' is specified. Use 'NO_SPLIT' with caution when tuning num_factors in a matrix factorization model. Although 'NO_SPLIT' allows better performance with larger num_factors, the validity of the results may be degraded.

The following list describes the data split behavior of each data split method, for non-hyperparameter-tuning training and for hyperparameter tuning training:

  • NO_SPLIT
    Non-hyperparameter-tuning: No data split.
    Hyperparameter tuning: No data split.

  • RANDOM
    Non-hyperparameter-tuning: Works with DATA_SPLIT_EVAL_FRACTION.
    Hyperparameter tuning: Adds DATA_SPLIT_TEST_FRACTION to specify the test data fraction.

  • SEQUENTIAL
    Non-hyperparameter-tuning: Works with DATA_SPLIT_EVAL_FRACTION.
    Hyperparameter tuning: Adds DATA_SPLIT_TEST_FRACTION to specify the test data fraction. Data is split sequentially into training, evaluation, and test sets.

  • CUSTOM
    Non-hyperparameter-tuning: Works with a BOOL DATA_SPLIT_COL: TRUE marks evaluation data and FALSE marks training data.
    Hyperparameter tuning: Works with a STRING DATA_SPLIT_COL: 'TRAIN' marks training data, 'EVAL' marks evaluation data, and 'TEST' marks test data.

  • AUTO
    Non-hyperparameter-tuning: With fewer than 500 rows in the input data, all rows are used as training data. With between 500 and 50,000 rows, 20% of the data is used as evaluation data in a RANDOM split. With more than 50,000 rows, only 10,000 rows are used as evaluation data in a RANDOM split.
    Hyperparameter tuning: With fewer than 500 rows in the input data, all rows are used as training data. With more than 500 rows, 10% of the data is used as evaluation data and another 10% is used as test data in a RANDOM split.

If you don't want the 3-way split, you can use a 2-way split by setting DATA_SPLIT_TEST_FRACTION to 0. When the test set is empty, the evaluation set is used as the test set for the final evaluation metrics reporting.

The metrics from models that are generated by a normal training job and by a hyperparameter tuning job are only comparable when the data split fractions are equal. For example, the following configurations produce comparable metrics (see also the sketch after this list):

  • Non-hyperparameter-tuning, data_split_method='random', data_split_eval_fraction=0.2

  • Hyperparameter tuning, data_split_method='random', data_split_eval_fraction=0.2, data_split_test_fraction=0
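
In OPTIONS form, the comparison might look like the following sketch. The model and table names are hypothetical, and both jobs see the same 20% evaluation fraction with no separate test set:

-- Normal training job:
CREATE MODEL `mydataset.baseline_model`
OPTIONS(model_type='LINEAR_REG',
        input_label_cols=['label'],
        data_split_method='RANDOM',
        data_split_eval_fraction=0.2)
AS SELECT * FROM `mydataset.training_data`;

-- Hyperparameter tuning job with a comparable split:
CREATE MODEL `mydataset.tuned_model`
OPTIONS(model_type='LINEAR_REG',
        input_label_cols=['label'],
        num_trials=20,
        l1_reg=hparam_range(0, 10),
        data_split_method='RANDOM',
        data_split_eval_fraction=0.2,
        data_split_test_fraction=0)
AS SELECT * FROM `mydataset.training_data`;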

Neural architecture search for neural network models is supported by using an ARRAY<STRUCT<ARRAY<INT64>>> as the hidden_units search space. Each STRUCT in the array represents a candidate neural architecture, whose hidden layer sizes are specified as an ARRAY<INT64>.

For example:

hidden_units=hparam_candidates([struct([8]), struct([8, 16]), struct([16, 32, 64])])

represents a neural architecture search with three candidates: a single layer of 8 neurons; two layers of 8 and 16 neurons; and three layers of 16, 32, and 64 neurons.
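
A sketch of how this might appear in a full statement, with hypothetical model, table, and label column names and all other DNN options left at their defaults:

CREATE MODEL `mydataset.tuned_dnn`
OPTIONS(model_type='DNN_CLASSIFIER',
        input_label_cols=['label'],
        num_trials=30,
        hidden_units=hparam_candidates([struct([8]), struct([8, 16]), struct([16, 32, 64])]))
AS SELECT * FROM `mydataset.training_data`;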

Transfer Learning

Transfer learning is enabled by default when the hparam_tuning_algorithm is 'VIZIER_DEFAULT'. Hyperparameter tuning benefits from previously tuned models with the same model_type in the same project when the hyperparameter search space is the same as, or a subset of, the search space used in previous tuning jobs. Transfer learning doesn't require the input data to be the same.

Transfer learning helps solve the cold start problem, in which the system performs random exploration during the first trial batch. Transfer learning provides the system with some initial knowledge about the hyperparameters and their objectives. To continuously improve model quality, always train a new hyperparameter tuning model with the same set of hyperparameters or a subset of it; a sketch follows the list below.

  • "Subset" refers to the hyperparameter names and types, but not their ranges. For example, (a:[0, 10]) is considered a subset of (a:[-1, 1], b:[0, 1]).

  • Transfer learning helps hyperparameter tuning converge faster, rather than helping individual submodels converge.
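
For example, a follow-up tuning job might reuse a subset of an earlier search space. In the following sketch, the model and table names are hypothetical, and an earlier job in the same project is assumed to have tuned both l1_reg and l2_reg:

-- This job tunes only l1_reg, a subset of the earlier search space,
-- so with 'VIZIER_DEFAULT' it can benefit from the earlier trials.
CREATE MODEL `mydataset.tuned_model_v2`
OPTIONS(model_type='LINEAR_REG',
        input_label_cols=['label'],
        num_trials=20,
        hparam_tuning_algorithm='VIZIER_DEFAULT',
        l1_reg=hparam_range(0, 10))
AS SELECT * FROM `mydataset.training_data`;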

Error Handling

Hyperparameter tuning handles errors in the following ways:

  • Cancellation: If a training job is cancelled while running, then all successful trials remain usable.

  • Invalid input: If the user input is invalid, then a user error is returned.

  • Invalid hyperparameters: If the hyperparameters are invalid for a trial, then the trial is skipped and marked as INFEASIBLE in the ML.TRIAL_INFO function.

  • Trial internal error: If more than 10% of num_trials fail due to INTERNAL_ERROR, then the training job stops and a user error is returned.

  • If fewer than 10% of num_trials fail due to INTERNAL_ERROR, training continues, and the failed trials are marked as FAILED in the ML.TRIAL_INFO function.

Model Serving

Output models from hyperparameter tuning can be used by a number of existing model serving functions as listed below.

Existing functions

To use existing serving functions, follow these rules:

  • When the function takes input data, only the result from one trial is returned (either the optimal trial by default, or a selected one); see the sketch after this list:

    • ML.CONFUSION_MATRIX(MODEL model_name, input_data, STRUCT(<T> AS threshold)[, STRUCT(trial_id AS trial_id)])
    • ML.EVALUATE(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
    • ML.PREDICT(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
    • ML.RECOMMEND(MODEL model_name, input_data[, STRUCT(trial_id AS trial_id)])
    • ML.ROC_CURVE(MODEL model_name, input_data[, GENERATE_ARRAY(thresholds)][, STRUCT(trial_id AS trial_id)])
  • When the function takes no input data, all trial results are returned where the first output column is the trial_id:

    • ML.CENTROIDS(MODEL model_name)
    • ML.EVALUATE(MODEL model_name)
    • ML.FEATURE_IMPORTANCE(MODEL model_name)
    • ML.FEATURE_INFO doesn’t change because all trials share the same input data.
    • ML.WEIGHTS(MODEL model_name)
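
For example, a sketch of calling ML.PREDICT against the optimal trial and against a specific trial. The model and table names are hypothetical:

-- Predict with the optimal trial (the default):
SELECT * FROM ML.PREDICT(MODEL `mydataset.tuned_model`,
                         TABLE `mydataset.new_data`);

-- Predict with a specific trial, for example trial 3:
SELECT * FROM ML.PREDICT(MODEL `mydataset.tuned_model`,
                         TABLE `mydataset.new_data`,
                         STRUCT(3 AS trial_id));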

Disallowed functions

ML.TRAINING_INFO returns information for each iteration, and iteration results are not saved in hyperparameter tuning models. Trial results are saved instead, and you can use the ML.TRIAL_INFO function to fetch them.
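
For example, the following sketch lists one row per trial for a hypothetical tuned model, including the per-trial status referenced in the Error Handling section:

SELECT * FROM ML.TRIAL_INFO(MODEL `mydataset.tuned_model`);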

Model export

Output models from hyperparameter tuning can be exported to Cloud Storage locations using the EXPORT MODEL statement.
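
A sketch of exporting a tuned model; the model name and Cloud Storage path are hypothetical:

EXPORT MODEL `mydataset.tuned_model`
OPTIONS(URI = 'gs://my_bucket/exported_model/');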

Pricing

The cost of hyperparameter tuning training is the sum of the costs of all executed trials. The pricing of a trial is consistent with the existing BigQuery ML pricing model.

Q&A

How many trials do I need to tune a model?

The rule of thumb is to use at least 10 trials per hyperparameter, so the total number of trials should be at least 10 * num_hyperparameters. If you use the default search space, refer to the Hyperparameters and Objectives section for the number of hyperparameters tuned by default for your model type.
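
For example, tuning two hyperparameters under this rule of thumb calls for at least 20 trials, as in this hypothetical option fragment:

-- 2 hyperparameters * 10 trials each => num_trials of at least 20
num_trials=20,
l1_reg=hparam_range(0, 10),
l2_reg=hparam_range(0, 10)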

What if I don’t see performance improvements by using hyperparameter tuning?

First, make sure you follow the instructions above for a fair comparison. If you still don't see performance improvements, it may mean that the default hyperparameters already work well for you. You may want to focus on feature engineering or try other model types before another round of hyperparameter tuning.

What if I want to continue tuning a model?

You can just train a new hyperparameter tuning model with the same search space. The built-in transfer learning will help to continue tuning based on your previously tuned models.

Do I need to retrain the model with all data and the optimal hyperparameters?

It depends on various factors:

  • KMEANS: All input data is already used as training data, so there is no need to retrain the model.

  • MATRIX_FACTORIZATION: You can retrain the model with the selected hyperparameters and all input data for a better coverage of users and items.

  • Other model types: In most cases you don't need to. The default random data split already keeps 80% of the data as training data. You can still retrain the model with more training data and the selected hyperparameters if your dataset is small, but note that leaving too little evaluation data for early stopping may worsen overfitting. If you do retrain, see the sketch after this list.
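
If you do retrain, one pattern is to read the optimal trial's hyperparameter values with ML.TRIAL_INFO and then run a normal (non-tuning) CREATE MODEL statement with those values set explicitly. The following is a sketch; the model names, the hyperparameter values, and the use of 'NO_SPLIT' to train on all data are hypothetical choices:

-- Retrain on all input data with the values selected from the optimal trial:
CREATE OR REPLACE MODEL `mydataset.final_model`
OPTIONS(model_type='LINEAR_REG',
        input_label_cols=['label'],
        l1_reg=2.5,
        l2_reg=0.3,
        data_split_method='NO_SPLIT')
AS SELECT * FROM `mydataset.training_data`;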

Why aren't there training and evaluation results in the Google Cloud Console?

  • Confirm the Enable Editor Tabs option is enabled in the Google Cloud Console.
