CREATE MODEL statement for DNN models
To create a Deep Neural Network (DNN) model in BigQuery, use the BigQuery ML CREATE MODEL statement with the DNN_CLASSIFIER or DNN_REGRESSOR model types. These models are built using TensorFlow estimators.
For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, see End-to-end user journey for each model.
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' },
    ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' },
    AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
    BATCH_SIZE = int64_value,
    CLASS_WEIGHTS = struct_array,
    DROPOUT = float64_value,
    EARLY_STOP = { TRUE | FALSE },
    HIDDEN_UNITS = int_array,
    L1_REG = float64_value,
    L2_REG = float64_value,
    LEARN_RATE = float64_value,
    INPUT_LABEL_COLS = string_array,
    MAX_ITERATIONS = int64_value,
    MIN_REL_PROGRESS = float64_value,
    OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' },
    WARM_START = { TRUE | FALSE },
    DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
    DATA_SPLIT_EVAL_FRACTION = float64_value,
    DATA_SPLIT_COL = string_value,
    ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
    INTEGRATED_GRADIENTS_NUM_STEPS = int64_value,
)];
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. If the model name exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.
Model Options
BigQuery ML currently supports the following options. model_name and model_type are required; others are optional.
model_name
model_name is the name of the BigQuery ML model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Letters of either case, numbers, and underscores
model_name is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
MODEL_TYPE
Syntax
MODEL_TYPE = { 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' }
Description
Specifies the model type. This option is required.
ACTIVATION_FN
Syntax
ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }
Description
For DNN model types, specifies the activation function of the neural network.
Arguments
The following options are available:
- 'RELU' — Rectified linear
- 'RELU6' — Rectified linear 6
- 'CRELU' — Concatenated ReLU
- 'ELU' — Exponential linear
- 'SELU' — Scaled exponential linear
- 'SIGMOID' — Sigmoid activation
- 'TANH' — Tanh activation
The default value is 'RELU'.
AUTO_CLASS_WEIGHTS
Syntax
AUTO_CLASS_WEIGHTS = { TRUE | FALSE }
Description
Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.
Use only with the DNN_CLASSIFIER model.
By default, the training data that is used to create a multiclass DNN Classifier model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.
The weights used to balance every class are calculated using the following formula:
total_input_rows / (input_rows_for_class_n * number_of_unique_classes)
Arguments
Accepts a BOOL. The default value is FALSE.
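As an illustration of the formula above, assume a hypothetical training set of 1,000 rows with two classes, where class 'a' appears in 100 rows and class 'b' in 900 rows. The resulting weights would be 1000 / (100 * 2) = 5.0 for 'a' and 1000 / (900 * 2) ≈ 0.56 for 'b'. A minimal sketch of enabling this behavior (model, table, and column names are placeholders):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_CLASSIFIER',
        INPUT_LABEL_COLS = ['mylabel'],
        AUTO_CLASS_WEIGHTS = TRUE)  -- weights are derived from the formula above
AS SELECT * FROM mydataset.mytable;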
BATCH_SIZE
Syntax
BATCH_SIZE = int64_value
Description
For DNN model types, specifies the mini batch size of samples that are fed to the neural network.
Arguments
A positive number that is no greater than 8192.
The default value is 32 or the number of samples, whichever is smaller.
CLASS_WEIGHTS
Syntax
CLASS_WEIGHTS = struct_array
Description
The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is TRUE.
Arguments
struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING that contains the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:
CLASS_WEIGHTS = [STRUCT('example_label', .2)]
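Because a weight must be supplied for every class label, a model with two label values would list one STRUCT per label. A sketch with hypothetical label values:
CLASS_WEIGHTS = [STRUCT('positive', 0.8), STRUCT('negative', 0.2)]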
DROPOUT
Syntax
DROPOUT = float64_value
Description
For DNN model types, specifies the dropout rate of units in the neural network.
Arguments
The value must be non-negative and less than 1.0. The default value is 0.0.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.
Arguments
The value is a BOOL. The default value is TRUE.
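For illustration, the following sketch pairs EARLY_STOP with MIN_REL_PROGRESS so that training stops once an iteration improves the loss by less than 0.5% (option values and names are illustrative):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_REGRESSOR',
        INPUT_LABEL_COLS = ['mylabel'],
        EARLY_STOP = TRUE,
        MIN_REL_PROGRESS = 0.005,  -- stop when an iteration improves loss by less than 0.5%
        MAX_ITERATIONS = 50)
AS SELECT * FROM mydataset.mytable;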
HIDDEN_UNITS
Syntax
HIDDEN_UNITS = int_array
Description
For DNN model types, specifies the hidden layers of the neural network.
Arguments
An array of integers that represents the architecture of the hidden layers. If not specified, BigQuery ML applies a single hidden layer that contains no more than 128 units. The number of units is calculated based on a variety of aspects, such as feature columns and number of categorical values.
Example
HIDDEN_UNITS = [256, 128, 64]
This example represents an architecture of 3 hidden layers with 256, 128, and 64 nodes respectively.
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array
Description
The label column name(s) in the training data.
Arguments
string_array is an ARRAY of STRINGs. The DNN model types only support string_array values that contain one element. Defaults to 'label'.
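For example, if the training table stored its target in a column named income_bracket (a hypothetical column name), the option would be written as:
INPUT_LABEL_COLS = ['income_bracket']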
L1_REG
Syntax
L1_REG = float64_value
Description
The L1 regularization strength of the OPTIMIZER. This option is only available when OPTIMIZER is set to one of the following:
- 'ADAGRAD' — Implements the ProximalAdagradOptimizer algorithm
- 'FTRL' — Implements the FtrlOptimizer algorithm
- 'SGD' — Implements the ProximalGradientDescentOptimizer algorithm
Arguments
float64_value is a FLOAT64. The default value is 0.
L2_REG
Syntax
L2_REG = float64_value
Description
The L2 regularization strength of the OPTIMIZER. This option is only available when OPTIMIZER is set to one of the following:
- 'ADAGRAD' — Implements the ProximalAdagradOptimizer algorithm
- 'FTRL' — Implements the FtrlOptimizer algorithm
- 'SGD' — Implements the ProximalGradientDescentOptimizer algorithm
Arguments
float64_value is a FLOAT64. The default value is 0.
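As a sketch of how L1_REG and L2_REG pair with a compatible optimizer (the regularization strengths shown are illustrative, not recommendations):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_CLASSIFIER',
        INPUT_LABEL_COLS = ['mylabel'],
        OPTIMIZER = 'FTRL',  -- one of the optimizers that supports L1_REG and L2_REG
        L1_REG = 0.01,
        L2_REG = 0.01)
AS SELECT * FROM mydataset.mytable;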
LEARN_RATE
Syntax
LEARN_RATE = float64_value
Description
The initial learn rate for training.
Arguments
float64_value is a FLOAT64. The default value is 0.001.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations, where one iteration represents a single pass of the entire training data.
Arguments
int64_value is an INT64. The default value is 20.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to TRUE. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.
Arguments
float64_value is a FLOAT64. The default value is 0.01.
OPTIMIZER
Syntax
OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' }
Description
For DNN model types, specifies the optimizer for training the model.
Arguments
The following options are available:
- 'ADAGRAD' — Implements the Adagrad algorithm
- 'ADAM' — Implements the Adam algorithm
- 'FTRL' — Implements the FTRL algorithm
- 'RMSPROP' — Implements the RMSProp algorithm
- 'SGD' — Implements the gradient descent algorithm
The default value is 'ADAM'.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.
The values of the MODEL_TYPE, LABELS, and HIDDEN_UNITS options, and the training data schema, must remain the same as they were in the previous training job when warm starting. The warm_start option is only supported for retraining LINEAR_REG, LOGISTIC_REG, KMEANS, DNN_REGRESSOR, DNN_CLASSIFIER, and AUTOENCODER models.
Arguments
Accepts a BOOL. The default value is FALSE.
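A minimal sketch of a warm start retraining run. It assumes mydataset.mymodel already exists, keeps MODEL_TYPE and HIDDEN_UNITS identical to the original run, and points at refreshed training data (model and table names are placeholders):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_CLASSIFIER',
        HIDDEN_UNITS = [128, 64],  -- must match the original training run
        INPUT_LABEL_COLS = ['mylabel'],
        WARM_START = TRUE)
AS SELECT * FROM mydataset.mytable;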
DATA_SPLIT_METHOD
Syntax
DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }
Description
The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting due to early stopping.
Arguments
Accepts the following values:
'AUTO_SPLIT'
The automatic split strategy is as follows:
- When there are fewer than 500 rows in the input data, all rows are used as training data.
- When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
- When there are more than 50,000 rows in the input data, only 10,000 rows are used as evaluation data in a RANDOM split.
'RANDOM'
Split data randomly. A random split is deterministic: different
training runs produce the same split results if the underlying training
data remains the same.
'CUSTOM'
Split data using a customer-provided column of type BOOL. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.
'SEQ'
Split data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, BIGNUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows, including NULL values, are used as evaluation data.
'NO_SPLIT'
Use all data as training data.
DATA_SPLIT_EVAL_FRACTION
Syntax
DATA_SPLIT_EVAL_FRACTION = float64_value
Description
This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.
Arguments
float64_value is a FLOAT64. The default value is 0.2.
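For instance, the following sketch reserves roughly 30% of the input rows for evaluation with a random split (values and names are illustrative):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_REGRESSOR',
        INPUT_LABEL_COLS = ['mylabel'],
        DATA_SPLIT_METHOD = 'RANDOM',
        DATA_SPLIT_EVAL_FRACTION = 0.3)
AS SELECT * FROM mydataset.mytable;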
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value
Description
Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.
- When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.
- When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.
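A minimal sketch of a custom split, assuming the training table has a BOOL column named is_eval (a hypothetical column) that flags evaluation rows:
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_CLASSIFIER',
        INPUT_LABEL_COLS = ['mylabel'],
        DATA_SPLIT_METHOD = 'CUSTOM',
        DATA_SPLIT_COL = 'is_eval')  -- rows with is_eval = TRUE become evaluation data
AS SELECT * FROM mydataset.mytable;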
ENABLE_GLOBAL_EXPLAIN
Syntax
ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE }
Description
Specifies whether to compute global explanations using explainable AI to evaluate global feature importance to the model.
Arguments
The value is a BOOL. The default value is FALSE.
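A sketch of training with global explanations enabled and then reading the resulting feature importances with ML.GLOBAL_EXPLAIN (model and table names are placeholders):
CREATE OR REPLACE MODEL mydataset.mymodel
OPTIONS(MODEL_TYPE = 'DNN_CLASSIFIER',
        INPUT_LABEL_COLS = ['mylabel'],
        ENABLE_GLOBAL_EXPLAIN = TRUE)
AS SELECT * FROM mydataset.mytable;

-- After training completes, retrieve global feature importance.
SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL mydataset.mymodel);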
INTEGRATED_GRADIENTS_NUM_STEPS
Syntax
INTEGRATED_GRADIENTS_NUM_STEPS = int64_value
Description
Specifies the number of steps to sample between the example being explained and its baseline for approximating the integral in integrated gradients attribution methods.
Arguments
The value is an INT64. The default value is 15. The value must be greater than 0 and less than or equal to 30.
INTERNAL DEFAULT OPTIONS
BigQuery ML also uses the following default values when building DNN models internally.
loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE
batch_norm = False
Example
The following example trains a DNN classifier model against 'mytable' with 'mylabel' as the label column.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='DNN_CLASSIFIER',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 16,
DROPOUT = 0.1,
EARLY_STOP = FALSE,
HIDDEN_UNITS = [128, 128, 128],
INPUT_LABEL_COLS = ['mylabel'],
LEARN_RATE=0.001,
MAX_ITERATIONS = 50,
OPTIMIZER = 'ADAGRAD')
AS SELECT * FROM `project_id.mydataset.mytable`;
Supported regions
Training DNN models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see the Locations page.