The CREATE MODEL statement for autoencoder models
This document describes the CREATE MODEL
statement for creating
autoencoder
models in BigQuery. Autoencoder models support
hyperparameter tuning.
You can use autoencoder models with the
ML.PREDICT
or
ML.GENERATE_EMBEDDING
functions to embed data into a lower-dimensional space, and with the
ML.DETECT_ANOMALIES
function
to perform anomaly detection.
For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(model_option_list) AS query_statement model_option_list: MODEL_TYPE = { 'AUTOENCODER' } ] [, L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, L1_REG_ACTIVATION = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, OPTIMIZER = { { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } | HPARAM_CANDIDATES([candidates]) } ] [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } | HPARAM_CANDIDATES([candidates]) } ] [, BATCH_SIZE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, HIDDEN_UNITS = { int_array | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, TF_VERSION = { '1.15' | '2.8.0' } ] [, EARLY_STOP = { TRUE | FALSE } ] [, MIN_REL_PROGRESS = float64_value ] [, MAX_ITERATIONS = int64_value ] [, WARM_START = { TRUE | FALSE } ] [, NUM_TRIALS = int64_value ] [, MAX_PARALLEL_TRIALS = int64_value ] [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ] [, HPARAM_TUNING_OBJECTIVES = { MEAN_SQUARED_ERROR | ... } ] [, KMS_KEY_NAME = string_value ]
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = { 'AUTOENCODER' }
Description
The model type. This option is required.
L1_REG
Syntax
L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The L1 regularization strength of the OPTIMIZER
. You can only use
this option when OPTIMIZER
is set to one of the following values:
ADAGRAD
— Implements the ProximalAdagradOptimizer algorithmFTRL
— Implements the FtrlOptimizer algorithmSGD
— Implements the ProximalGradientDescentOptimizer algorithm
Arguments
If you aren't running hyperparameter tuning, then you can specify a
FLOAT64
value. The default value is 0
.
If you are running hyperparameter tuning, then you can use one of the following options:
- The
HPARAM_RANGE
keyword and twoFLOAT64
values that define the range to use for the hyperparameter. For example,L1_REG = HPARAM_RANGE(0, 5.0)
. - The
HPARAM_CANDIDATES
keyword and an array ofFLOAT64
values that provide discrete values to use for the hyperparameter. For example,L1_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0])
.
When running hyperparameter tuning, the valid range is
(0, ∞)
, the default range is (0, 10.0]
, and the scale
type is LOG
.
L2_REG
Syntax
L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The L2 regularization strength of the OPTIMIZER
. You can only
use this option when OPTIMIZER
is set to one of the following values:
ADAGRAD
— Implements the ProximalAdagradOptimizer algorithmFTRL
— Implements the FtrlOptimizer algorithmSGD
— Implements the ProximalGradientDescentOptimizer algorithm
Arguments
If you aren't running hyperparameter tuning, then you can specify a
FLOAT64
value. The default value is 0
.
If you are running hyperparameter tuning, then you can use one of the following options:
- The
HPARAM_RANGE
keyword and twoFLOAT64
values that define the range to use for the hyperparameter. For example,L2_REG = HPARAM_RANGE(1.5, 5.0)
. - The
HPARAM_CANDIDATES
keyword and an array ofFLOAT64
values that provide discrete values to use for the hyperparameter. For example,L2_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0])
.
When running hyperparameter tuning, the valid range is (0, ∞)
, the
default range is (0, 10.0]
, and the scale type is LOG
.
L1_REG_ACTIVATION
Syntax
L1_REG_ACTIVATION = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
Description
The L1 regularizer to apply a penalty on the activations at the latent space. This is a common approach to to achieve high sparsity of data representations after dimensionality reduction.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64
value. The default value is 0
.
If you are running hyperparameter tuning, use one of the following options:
- The
HPARAM_RANGE
keyword and twoFLOAT64
values that define the range of the hyperparameter. For example,L1_REG_ACTIVATION = HPARAM_RANGE(0.01, 1.0)
. - The
HPARAM_CANDIDATES
keyword and an array ofFLOAT64
values that provide discrete values to use for the hyperparameter. For example,L1_REG_ACTIVATION = HPARAM_CANDIDATES([0, 0.001, 0.01, 0.1, 1.0])
.
When running hyperparameter tuning, the valid range is
(0, ∞)
, the default range is (0, 10.0]
, and the scale
type is LOG
.
LEARN_RATE
Syntax
LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The initial learn rate for training.
Arguments
If you aren't running hyperparameter tuning, then you can specify a
FLOAT64
value. The default value is 0.001
.
If you are running hyperparameter tuning, use one of the following options:
- Use the
HPARAM_RANGE
keyword and specify twoFLOAT64
values that define the range to use for the hyperparameter. For example,LEARN_RATE = HPARAM_RANGE(0.001, 0.005)
. - Use the
HPARAM_CANDIDATES
keyword and specify an array ofFLOAT64
values to provide discrete values to use for the hyperparameter. For example,LEARN_RATE = HPARAM_CANDIDATES([0, 0.001, 0.01, 0.1])
.
When running hyperparameter tuning, the valid range is [0, 1.0]
, the
default range is [0, 1.0]
, and the scale type is LOG
.
OPTIMIZER
Syntax
OPTIMIZER = { { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } | HPARAM_CANDIDATES([candidates]) }
Description
The optimizer for training the model.
Arguments
This option accepts the following values:
ADAM
— Implements the Adam algorithm. This is the default.ADAGRAD
— Implements the Adagrad algorithmFTRL
— Implements the FTRL algorithmRMSPROP
— Implements the RMSProp algorithmSGD
— Implements the gradient descent algorithm
If you are running hyperparameter training, you can provide more than one
value for this option by using HPARAM_CANDIDATES
and specifying an array.
For example, OPTIMIZER = HPARAM_CANDIDATES(['ADAM', 'FTRL', 'SGD'])
.
ACTIVATION_FN
Syntax
ACTIVATION_FN = { { 'RELU' | 'RELU6' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } | HPARAM_CANDIDATES([candidates]) }
Description
The activation function of the neural network.
Arguments
This option accepts the following values:
RELU
— Rectified linear. This is the default.RELU6
— Rectified linear 6ELU
— Exponential linearSELU
— Scaled exponential linearSIGMOID
— Sigmoid activationTANH
— Tanh activation
If you are running hyperparameter training, then you can provide more than one
value for this option by using HPARAM_CANDIDATES
and specifying an array.
For example, ACTIVATION_FN = HPARAM_CANDIDATES(['RELU', 'RELU6', 'TANH'])
.
BATCH_SIZE
Syntax
BATCH_SIZE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The mini batch size of samples that are fed to the neural network.
Arguments
If you aren't running hyperparameter tuning, specify an INT64
value that
is positive and is less than or equal to 8192
. The default value is 32
or the number of samples, whichever is smaller.
If you are running hyperparameter tuning, use one of the following options:
- The
HPARAM_RANGE
keyword and twoFLOAT64
values that define the range to use for the hyperparameter. For example,BATCH_SIZE = HPARAM_RANGE(16, 64)
. - The
HPARAM_CANDIDATES
keyword and an array ofFLOAT64
values that provide discrete values to use for the hyperparameter. For example,BATCH_SIZE = HPARAM_CANDIDATES([32, 64, 256, 1024])
.
When running hyperparameter tuning, the valid range is (0, ∞)
,
the default range is [16, 1024]
, and the scale type is LOG
.
DROPOUT
Syntax
DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The dropout rate of units in the neural network.
Arguments
If you aren't running hyperparameter tuning, then you can specify a FLOAT64
value that is positive and is less than or equal to 1.0
. The default value is
0
.
If you are running hyperparameter tuning, then you must use one of the following options:
- The
HPARAM_RANGE
keyword and twoFLOAT64
values that define the range to use for the hyperparameter. For example,DROPOUT = HPARAM_RANGE(0, 0.6)
. - The
HPARAM_CANDIDATES
keyword and an array ofFLOAT64
values that provide discrete values to use for the hyperparameter. For example,DROPOUT = HPARAM_CANDIDATES([0.1, 0.3, 0.6])
.
When running hyperparameter tuning, the valid range is [0, 1.0)
, the
default range is [0, 0.8]
, and the scale type is LINEAR
.
HIDDEN_UNITS
Syntax
HIDDEN_UNITS = { int_array | HPARAM_CANDIDATES([candidates]) }
Description
The hidden layers of the neural network.
Arguments
An array of integers that represents the architecture of the hidden layers. If
not specified, BigQuery ML applies a single hidden layer that
contains no more than 128 units. The number of units is calculated as [min(128,
num_samples / (10 * (num_input_units + num_output_units)))]
. The upper bound of
the rule ensures that the model isn't over fitting.
The number in the middle of the array defines the shape of the latent space. For
example, hidden_units=[128, 64, 4, 64, 128]
defines a four-dimensional latent
space.
The number of layers in hidden_units
must be odd, and we recommend that the
sequence be symmetrical.
The following example defines a model architecture that uses three hidden layers with 256, 128, and 64 nodes, respectively.
HIDDEN_UNITS = [256, 128, 64]
If you are running hyperparameter tuning, then you must use the
HPARAM_CANDIDATES
keyword and specify an array in the form
ARRAY<STRUCT<ARRAY<INT64>>>
to provide discrete values to use for the
hyperparameter. Each struct value in the outer array represents a candidate
neural architecture. The array of INT64
values in each struct represents a
hidden layer.
The following example represents a neural architecture search with three candidates, which include a single layer of 8 neurons, two layers of neurons with 8 and 16 in sequence, and three layers of neurons with 16, 32 and 64 in sequence, respectively.
hidden_units=hparam_candidates([struct([8]), struct([8, 16]), struct([16, 32, 64])])
The valid range for the INT64
arrays is [1, ∞)
.
TF_VERSION
Syntax
TF_VERSION = { '1.15' | '2.8.0' }
Description
Specifies the TensorFlow version for model training. The default
value is 1.15
.
Set TF_VERSION
to 2.8.0
to use TensorFlow2 with the Keras API.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Determines whether training should stop after the first iteration in which the
relative loss improvement is less than the value specified for
MIN_REL_PROGRESS
.
Arguments
A BOOL
value. The default value is TRUE
.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training
when EARLY_STOP
is set to TRUE
. For example, a value of
0.01
specifies that each iteration must reduce the loss by 1% for training
to continue.
Arguments
A FLOAT64
value. The default value is 0.01
.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations, where one iteration represents a single pass of the entire training data.
Arguments
An INT64
value. The default value is 20
.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Determines whether to train a model with new training data, new model options, or both. Unless you explicitly override them, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. Use
the training run or iteration information returned by the
ML.TRAINING_INFO
function
to distinguish the warm start run from the original run.
The values of the MODEL_TYPE
, the HIDDEN_UNITS
options, and the model
retraining data schema must all remain the same as they were in the
previous training job.
Arguments
A BOOL
value. The default value is FALSE
.
NUM_TRIALS
Syntax
NUM_TRIALS = int64_value
Description
The maximum number of submodels to train. The tuning stops when NUM_TRIALS
submodels are trained, or when the hyperparameter search space is exhausted.
You must specify this option in order to use hyperparameter tuning.
Arguments
An INT64
value between 1
and 100
, inclusive.
MAX_PARALLEL_TRIALS
Syntax
MAX_PARALLEL_TRIALS = int64_value
Description
The maximum number of trials to run at the same time. If you specify a value
for this option, you must also specify a value for
NUM_TRIALS
.
Arguments
An INT64
value between 1
and 5
, inclusive. The default value is 1
.
HPARAM_TUNING_ALGORITHM
Syntax
HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' }
Description
The algorithm used to tune the hyperparameters. If you specify a value
for this option, you must also specify a value for NUM_TRIALS
.
Arguments
Specify one of the following values:
VIZIER_DEFAULT
: Use the default algorithm in Vertex AI Vizier to tune hyperparameters. This algorithm is the most powerful algorithm of those offered. It performs a mixture of advanced search algorithms, including Bayesian optimization with Gaussian processes. It also uses transfer learning to take advantage of previously tuned models. This is the default, and also the recommended approach.RANDOM_SEARCH
: Use random search to explore the search space.GRID_SEARCH
: Use grid search to explore the search space. You can only use this algorithm when every hyperparameter's search space is discrete.
HPARAM_TUNING_OBJECTIVES
Syntax
HPARAM_TUNING_OBJECTIVES = { 'MEAN_ABSOLUTE_ERROR' | 'MEAN_SQUARED_ERROR' | 'MEAN_SQUARED_LOG_ERROR' }
Description
The hyperparameter tuning objective for the model; only one objective is
supported. If you specify a value for this option, you must also specify a
value for NUM_TRIALS
.
Arguments
The possible objectives are a subset of the
model evaluation metrics
for the model type. If you aren't running hyperparameter tuning, or if you are
and you don't specify an objective, the MEAN_SQUARED_ERROR
objective is
used.
KMS_KEY_NAME
Syntax
KMS_KEY_NAME = string_value
Description
The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.
Arguments
A STRING
value containing the fully-qualified name of the CMEK. For example,
'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'
Internal parameter defaults
BigQuery ML uses the following default values when building models:
loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE
batch_norm = False
query_statement
The AS query_statement
clause specifies the
GoogleSQL query used to generate the training data. See the
GoogleSQL query syntax
page for the supported SQL syntax of the query_statement
clause.
Supported machine learning functions
The following machine learning functions are supported for the Autoencoder model, including:
ML.EVALUATE
for evaluating model metrics.ML.FEATURE_INFO
for reviewing information about the input features used to train a model.ML.PREDICT
for dimensionality reduction.ML.RECONSTRUCTION_LOSS
for anomaly detection and data sanitation purposes.ML.TRAINING_INFO
for tracking information about the training iterations of a model.
Example
The following example trains an autoencoder model against the table mytable
.
CREATE MODEL `project_id.mydataset.mymodel` OPTIONS(MODEL_TYPE='AUTOENCODER', ACTIVATION_FN = 'RELU', BATCH_SIZE = 16, DROPOUT = 0.1, EARLY_STOP = FALSE, HIDDEN_UNITS = [128, 64, 8, 64, 128], LEARN_RATE=0.001, MAX_ITERATIONS = 50, OPTIMIZER = 'ADAGRAD') AS SELECT * FROM `project_id.mydataset.mytable`;
Supported regions
Training Autoencoder models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see BigQuery ML locations.
Pricing
Model training does not incur any charges if you are using on-demand pricing. If you are using capacity-based pricing, model training will use reserved slots. All prediction queries are billable, regardless of the pricing model.