CREATE MODEL statement for Autoencoder models
To create an Autoencoder model in BigQuery, use the BigQuery ML CREATE MODEL statement with the AUTOENCODER model type. A sample query can be found in the Example section below.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(
  MODEL_TYPE = { 'AUTOENCODER' },
  ACTIVATION_FN = { 'RELU' | 'RELU6' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' },
  BATCH_SIZE = int64_value,
  DROPOUT = float64_value,
  EARLY_STOP = { TRUE | FALSE },
  HIDDEN_UNITS = int_array,
  L1_REG_ACTIVATION = float64_value,
  LEARN_RATE = float64_value,
  MAX_ITERATIONS = int64_value,
  MIN_REL_PROGRESS = float64_value,
  OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' },
  WARM_START = { TRUE | FALSE }
)];
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. If a model with the same name already exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.
Model options
BigQuery ML supports the following options. model_name and MODEL_TYPE are required; all other options are optional.
model_name
model_name is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique in its dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Uppercase and lowercase letters, numbers, and underscores
model_name is not case-sensitive.
If you do not have a default project configured, then prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
MODEL_TYPE
Syntax
MODEL_TYPE = { 'AUTOENCODER' }
Description
Specifies the model type. This option is required.
ACTIVATION_FN
Syntax
ACTIVATION_FN = { 'RELU' | 'RELU6' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }
Description
For Autoencoder models, specifies the activation function of the neural network.
Arguments
The following options are available:
- 'RELU': Rectified linear
- 'RELU6': Rectified linear 6
- 'ELU': Exponential linear
- 'SELU': Scaled exponential linear
- 'SIGMOID': Sigmoid activation
- 'TANH': Tanh activation
The default value is 'RELU'.
BATCH_SIZE
Syntax
BATCH_SIZE = int64_value
Description
For Autoencoder models, specifies the mini batch size of samples that are fed to the neural network.
Arguments
A positive INT64 value with a maximum value of 8192. The default value is 32 or the number of samples, whichever is smaller.
DROPOUT
Syntax
DROPOUT = float64_value
Description
For Autoencoder models, specifies the dropout rate of units in the neural network.
Arguments
A FLOAT64 value that must be non-negative and no greater than 1.0. The default value is 0.0.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Specifies whether training stops after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.
Arguments
The value is a BOOL. The default value is TRUE.
HIDDEN_UNITS
Syntax
HIDDEN_UNITS = int_array
Description
For Autoencoder models, specifies the hidden layers of the neural network.
Arguments
An array of integers that represents the architecture of the hidden layers, where all layers are fully connected. If not specified, BigQuery ML applies a single hidden layer that contains no more than 128 units, with the number of units calculated as min(128, num_samples / (10 * (num_input_units + num_output_units))). The upper bound of this rule helps ensure that the model doesn't overfit.
The number in the middle of the array defines the dimensionality of the latent space. For example, hidden_units = [128, 64, 4, 64, 128] defines a four-dimensional latent space. The number of layers in hidden_units must be odd, and we recommend that the sequence be symmetrical.
Example
HIDDEN_UNITS = [128, 64, 8, 64, 128]
This example presents an architecture of five hidden layers, where the encoder layers are [128, 64] in sequence, the decoder layers are [64, 128], and the latent space has a dimensionality of 8.
L1_REG_ACTIVATION
Syntax
L1_REG_ACTIVATION = float64_value
Description
An L1 regularizer that applies a penalty to the activations in the latent space. This is a common approach to achieving highly sparse data representations after dimensionality reduction.
Arguments
float64_value is a FLOAT64. The default value is 0.
LEARN_RATE
Syntax
LEARN_RATE = float64_value
Description
The initial learning rate for training.
Arguments
float64_value is a FLOAT64. The default value is 0.001.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations, where one iteration represents a single pass of all of the training data.
Arguments
int64_value is an INT64. The default value is 20.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to TRUE. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.
Arguments
float64_value is a FLOAT64. The default value is 0.01.
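Taken together, EARLY_STOP and MIN_REL_PROGRESS control when training halts early. As a sketch (the model and table names below are placeholders), the following statement stops training as soon as an iteration improves the loss by less than 0.5%:

```sql
-- Placeholder names: `mydataset.stop_early_model`, `mydataset.mytable`.
CREATE MODEL `mydataset.stop_early_model`
OPTIONS(
  MODEL_TYPE = 'AUTOENCODER',
  EARLY_STOP = TRUE,          -- stop when progress stalls
  MIN_REL_PROGRESS = 0.005,   -- require at least 0.5% loss improvement per iteration
  MAX_ITERATIONS = 100)       -- hard cap in case early stop never triggers
AS SELECT * FROM `mydataset.mytable`;
```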
OPTIMIZER
Syntax
OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' }
Description
For Autoencoder models, specifies the optimizer for training the model.
Arguments
The following options are available:
- 'ADAGRAD': Implements the Adagrad algorithm
- 'ADAM': Implements the Adam algorithm
- 'FTRL': Implements the FTRL algorithm
- 'RMSPROP': Implements the RMSProp algorithm
- 'SGD': Implements the stochastic gradient descent algorithm
The default value is 'ADAM'.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Retrains a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.
Also, in a warm start run, the values of the MODEL_TYPE and HIDDEN_UNITS options and the schema of the model retraining data must all remain the same as they were in the previous training job.
The warm_start option is only supported for retraining LINEAR_REG, LOGISTIC_REG, KMEANS, DNN_REGRESSOR, DNN_CLASSIFIER, and AUTOENCODER models.
Arguments
Accepts a BOOL. The default value is FALSE.
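As a sketch of a warm start run (model and table names are placeholders), the statement below retrains an existing model on fresh data while keeping MODEL_TYPE and HIDDEN_UNITS identical to the original run, as required:

```sql
-- Placeholder names; HIDDEN_UNITS must match the original training job.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS(
  MODEL_TYPE = 'AUTOENCODER',
  HIDDEN_UNITS = [128, 64, 8, 64, 128],
  WARM_START = TRUE)
AS SELECT * FROM `mydataset.new_training_data`;
```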
Internal default options
BigQuery ML also internally uses the following default values when building Autoencoder models:
loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE
batch_norm = False
Supported machine learning functions
The following machine learning functions are supported for Autoencoder models:
ML.EVALUATE for evaluating model metrics.
ML.FEATURE_INFO for reviewing information about the input features used to train a model.
ML.PREDICT for dimensionality reduction.
ML.RECONSTRUCTION_LOSS for anomaly detection and data sanitation purposes.
ML.TRAINING_INFO for tracking information about the training iterations of a model.
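As a sketch of how two of these functions are typically invoked (the model and table names are placeholders):

```sql
-- Dimensionality reduction: ML.PREDICT returns the latent-space
-- representation of each input row.
SELECT *
FROM ML.PREDICT(MODEL `mydataset.mymodel`,
                TABLE `mydataset.mytable`);

-- Anomaly detection: rows that the model reconstructs poorly
-- (high reconstruction loss) are candidates for anomalies.
SELECT *
FROM ML.RECONSTRUCTION_LOSS(MODEL `mydataset.mymodel`,
                            TABLE `mydataset.mytable`);
```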
Example
The following example trains an Autoencoder model against 'mytable'.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(
  MODEL_TYPE = 'AUTOENCODER',
  ACTIVATION_FN = 'RELU',
  BATCH_SIZE = 16,
  DROPOUT = 0.1,
  EARLY_STOP = FALSE,
  HIDDEN_UNITS = [128, 64, 8, 64, 128],
  LEARN_RATE = 0.001,
  MAX_ITERATIONS = 50,
  OPTIMIZER = 'ADAGRAD')
AS SELECT * FROM `project_id.mydataset.mytable`;
Supported regions
Training Autoencoder models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see the Locations page.
Pricing
Model training does not incur any charges if you use on-demand pricing. If you use flat-rate pricing, model training consumes your reserved slots. All prediction queries are billable, regardless of the pricing model.