CREATE MODEL
statement
To create a model in BigQuery, use the BigQuery ML CREATE
MODEL
statement. This statement is similar to the
CREATE TABLE
DDL statement. When you run a standard SQL query that contains a CREATE MODEL
statement, a query job is generated for you that
processes the query.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[TRANSFORM (select_list)]
[OPTIONS(model_option_list)]
[AS query_statement]

model_option_list:
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' |
               'MATRIX_FACTORIZATION' | 'AUTOML_CLASSIFIER' | 'AUTOML_REGRESSOR' |
               'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' |
               'DNN_CLASSIFIER' | 'DNN_REGRESSOR' | 'ARIMA' }
    [, INPUT_LABEL_COLS = string_array ]
    [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
    [, L1_REG = float64_value ]
    [, L2_REG = float64_value ]
    [, MAX_ITERATIONS = int64_value ]
    [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
    [, LEARN_RATE = float64_value ]
    [, EARLY_STOP = { TRUE | FALSE } ]
    [, MIN_REL_PROGRESS = float64_value ]
    [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
    [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
    [, DATA_SPLIT_COL = string_value ]
    [, LS_INIT_LEARN_RATE = float64_value ]
    [, WARM_START = { TRUE | FALSE } ]
    [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
    [, CLASS_WEIGHTS = struct_array ]
    [, NUM_CLUSTERS = int64_value ]
    [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]
    [, KMEANS_INIT_COL = string_value ]
    [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
    [, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
    [, MODEL_PATH = string_value ]
    [, BUDGET_HOURS = float64_value ]
    [, FEEDBACK_TYPE = { 'EXPLICIT' | 'IMPLICIT' } ]
    [, NUM_FACTORS = int64_value ]
    [, USER_COL = string_value ]
    [, ITEM_COL = string_value ]
    [, RATING_COL = string_value ]
    [, WALS_ALPHA = float64_value ]
    [, BOOSTER_TYPE = { 'gbtree' | 'dart' } ]
    [, NUM_PARALLEL_TREE = int64_value ]
    [, DART_NORMALIZE_TYPE = { 'tree' | 'forest' } ]
    [, TREE_METHOD = { 'auto' | 'exact' | 'approx' | 'hist' } ]
    [, MIN_TREE_CHILD_WEIGHT = float64_value ]
    [, COLSAMPLE_BYTREE = float64_value ]
    [, COLSAMPLE_BYLEVEL = float64_value ]
    [, COLSAMPLE_BYNODE = float64_value ]
    [, MIN_SPLIT_LOSS = float64_value ]
    [, MAX_TREE_DEPTH = int64_value ]
    [, SUBSAMPLE = float64_value ]
    [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } ]
    [, BATCH_SIZE = int64_value ]
    [, DROPOUT = float64_value ]
    [, HIDDEN_UNITS = int_array ]
    [, OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } ]
    [, TIME_SERIES_TIMESTAMP_COL = string_value ]
    [, TIME_SERIES_DATA_COL = string_value ]
    [, TIME_SERIES_ID_COL = string_value ]
    [, HORIZON = int64_value ]
    [, AUTO_ARIMA = { TRUE | FALSE } ]
    [, AUTO_ARIMA_MAX_ORDER = int64_value ]
    [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
    [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
    [, INCLUDE_DRIFT = { TRUE | FALSE } ]
    [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
Not all options are applicable for every model type. For TensorFlow models, see the CREATE MODEL statement for TensorFlow models. For AutoML Tables models, see the CREATE MODEL statement for AutoML Tables models.
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
model_name
is the name of the model you're creating or replacing. The model
name must be unique per dataset: no other model or table can have the same name.
The model name must follow the same naming rules as a BigQuery table. A
model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`; for example, `myproject.mydataset.mymodel`.
TRANSFORM
TRANSFORM lets you specify all preprocessing during model creation and have it automatically applied during prediction and evaluation.
For example, you can create the following model:
CREATE OR REPLACE MODEL m
TRANSFORM(ML.FEATURE_CROSS(STRUCT(f1, f2)) as cross_f,
ML.QUANTILE_BUCKETIZE(f3) OVER() as buckets,
label_col)
OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT * FROM t
During prediction, you don't need to preprocess the input again, and the same transformations are automatically restored:
SELECT * FROM ML.PREDICT(MODEL m, (SELECT f1, f2, f3 FROM table))
When the TRANSFORM
clause is present, only output columns from the
TRANSFORM
clause are used in training. Any results from
query_statement
that don't appear in the TRANSFORM
clause are ignored.
The input columns of the TRANSFORM
clause are the result of query_statement
.
So, the final input used in training is the set of columns generated by the
following query:
SELECT (select_list) FROM (query_statement);
Input columns of the TRANSFORM
clause can be of any SIMPLE type or ARRAY of
SIMPLE type. SIMPLE types are non-STRUCT and non-ARRAY data types.
In prediction (ML.PREDICT
), users only need to pass in the original
columns from the query_statement
that are used inside the TRANSFORM
clause.
The columns dropped in TRANSFORM
don't need to be provided during prediction.
TRANSFORM
is automatically applied to the input data during prediction,
including the statistics used in ML analytic functions (for example,
ML.QUANTILE_BUCKETIZE
).
select_list
You can pass columns from query_statement
through to model training without
transformation by either using *
, * EXCEPT()
, or by listing
the column names directly.
Not all columns from query_statement
are required to appear in the TRANSFORM
clause, so you can drop columns appearing in query_statement
by omitting
them from the TRANSFORM
clause.
You can transform inputs from query_statement
by using expressions in
select_list
. select_list
is similar to a normal SELECT
statement.
select_list
supports the following syntax:
*
* EXCEPT()
* REPLACE()
expression
expression.*
The following cannot appear inside select_list
:
- Aggregation functions.
- Non-ML analytic functions. You can find all ML analytic functions in preprocessing functions.
- UDFs.
- Subqueries.
- Anonymous columns. For example, "a + b as c" is allowed, while "a + b" is not.
The output columns of select_list
can be of any BigQuery ML
supported data type.
If present, the following columns must appear in select_list
without
transformation:
- label
- data_split_col
- kmeans_init_col
If these columns are returned by query_statement
, you must reference them in
select_list
by column name outside of any expression, or by using *
. Using
an alias with these columns is disallowed.
model_option_list
In the model_option_list
, the model_type
option is required. All others
are optional.
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' |
'MATRIX_FACTORIZATION' | 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' |
'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' | 'DNN_CLASSIFIER' |
'DNN_REGRESSOR' | 'ARIMA' }
Description
Specify the model type. This option is required.
Arguments
'LINEAR_REG'
Linear regression for forecasting; for example, the sales of an
item on a given day. Labels are real-valued: they cannot be +/- infinity or
NaN
.
'LOGISTIC_REG'
Logistic regression for classification; for example, determining whether a
customer will make a purchase. This model can be of two types:
- Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
- Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.
'KMEANS'
K-means clustering for data segmentation; for example,
identifying customer segments. K-means is an unsupervised learning technique, so
model training does not require labels nor split data for training or
evaluation.
'TENSORFLOW'
Create a model by importing a TensorFlow model into
BigQuery ML. See
the CREATE MODEL statement for TensorFlow models
for more information.
'MATRIX_FACTORIZATION'
Matrix factorization for recommendation systems.
For example, given a set of users, items, and some preferences or ratings for
a subset of the items, creates a model to predict a user's preference for items
they have not rated or interacted with before. For more information, see
the CREATE MODEL statement for matrix factorization models.
'AUTOML_REGRESSOR'
(beta) Create a regression model using AutoML Tables. See
the CREATE MODEL statement for AutoML Tables models
for more information.
'AUTOML_CLASSIFIER'
(beta) Create a classification model using AutoML Tables. See
the CREATE MODEL statement for AutoML Tables models
for more information.
'BOOSTED_TREE_CLASSIFIER'
(beta) Create a Boosted Tree Classifier model using the XGBoost library. See
the CREATE MODEL statement for boosted tree models
for more information.
'BOOSTED_TREE_REGRESSOR'
(beta) Create a Boosted Tree Regressor model using the XGBoost library. See
the CREATE MODEL statement for boosted tree models
for more information.
'DNN_CLASSIFIER'
(beta) Create a Deep Neural Network Classifier model. See
the CREATE MODEL statement for DNN models
for more information.
'DNN_REGRESSOR'
(beta) Create a Deep Neural Network Regressor model. See
the CREATE MODEL statement for DNN models
for more information.
'ARIMA'
Univariate ARIMA-based time series model for time series
forecasting.
For example, given one or multiple time series, creates a model or a set of
time series models at once to forecast future data points. For more information,
see
the CREATE MODEL statement for time series models.
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array
Description
The label column name(s) in the training data.
Arguments
string_array
is an ARRAY
of STRING
s. The linear regression and
logistic regression model types only support string_array values
that contain one element.
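For example, a statement like the following sketch names a single label column for a logistic regression model; the table and column names are illustrative, not part of this reference:

-- mydataset.mytable, feature1, feature2, and purchased are hypothetical names.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    INPUT_LABEL_COLS=['purchased'] ) AS
SELECT
  feature1,
  feature2,
  purchased
FROM
  `mydataset.mytable`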
OPTIMIZE_STRATEGY
Syntax
OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }
Description
The strategy to train linear regression models.
Arguments
Accepts the following values:
'AUTO_STRATEGY'
Determine the training strategy as follows:
- If l1_reg or warm_start is specified, the batch_gradient_descent strategy is used.
- If the total cardinality of the training features is more than 10,000, the batch_gradient_descent strategy is used.
- If there is an over-fitting issue, that is, the number of training examples is less than 10x, where x is the total cardinality, the batch_gradient_descent strategy is used.
- The NORMAL_EQUATION strategy is used for all other cases.
'BATCH_GRADIENT_DESCENT'
Train the model using the
batch gradient descent
method, which optimizes the loss function using the gradient function.
'NORMAL_EQUATION'
Directly compute the
least square solution of the linear regression problem with the analytical
formula. The normal equation cannot be used in the following cases:
- l1_reg is specified.
- warm_start is specified.
- The total cardinality of the training features is more than 10,000.
The default value is 'AUTO_STRATEGY'
.
L1_REG
Syntax
L1_REG = float64_value
Description
The amount of L1 regularization applied.
Arguments
float64_value
is a FLOAT64
. The default value is 0.
L2_REG
Syntax
L2_REG = float64_value
Description
The amount of L2 regularization applied.
Arguments
float64_value
is a FLOAT64
. The default value is 0.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations or steps.
Arguments
int64_value
is an INT64
. The default value is 20.
LEARN_RATE_STRATEGY
Syntax
LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }
Description
The strategy for specifying the learning rate during training.
Arguments
'LINE_SEARCH'
Use the line
search method to calculate the learning rate. The line search initial learn
rate is the value specified for LS_INIT_LEARN_RATE
.
Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.
'CONSTANT'
Set learning rate to the value specified for
LEARN_RATE.
The default value is 'LINE_SEARCH'
.
LEARN_RATE
Syntax
LEARN_RATE = float64_value
Description
The learn rate for
gradient descent when LEARN_RATE_STRATEGY
is set to CONSTANT
. If
LEARN_RATE_STRATEGY
is set to 'LINE_SEARCH'
, an error is returned.
Arguments
float64_value
is a FLOAT64
. The default value is 0.1.
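For example, to disable line search and train with a fixed learning rate, you could write something like the following sketch; the table and column names are illustrative:

-- Hypothetical table and columns; trains with a constant learning rate of 0.05.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LEARN_RATE_STRATEGY='CONSTANT',
    LEARN_RATE=0.05 ) AS
SELECT
  feature1,
  feature2,
  label
FROM
  `mydataset.mytable`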
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Whether training should stop after the first iteration in which the relative
loss improvement is less than the value specified for
MIN_REL_PROGRESS
.
Arguments
The value is a BOOL
. The default value is TRUE
.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training
when EARLY_STOP
is set to true. For example, a value of 0.01 specifies
that each iteration must reduce the loss by 1% for training to continue.
Arguments
float64_value
is a FLOAT64
. The default value is 0.01.
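As an illustration, the following sketch keeps early stopping enabled but requires only a 0.5% relative loss improvement per iteration before training stops; the table and column names are hypothetical:

-- Hypothetical table and columns; stops when an iteration improves loss by less than 0.5%.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    EARLY_STOP=TRUE,
    MIN_REL_PROGRESS=0.005 ) AS
SELECT
  feature1,
  feature2,
  label
FROM
  `mydataset.mytable`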
DATA_SPLIT_METHOD
Syntax
DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }
Description
The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping.
Arguments
Accepts the following values:
'AUTO_SPLIT'
The automatic split strategy is as follows:
- When there are fewer than 500 rows in the input data, all rows are used as training data.
- When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
- When there are more than 50,000 rows in the input data, only 10,000 rows are used as evaluation data in a RANDOM split.
'RANDOM'
Split data randomly. A random split is deterministic: different
training runs produce the same split results if the underlying training
data remains the same.
'CUSTOM'
Split data using a customer-provided column of type BOOL
. The rows
with a value of TRUE
are used as evaluation data. The rows with a value of
FALSE
are used as training data.
'SEQ'
Split data sequentially using a customer-provided column. The column can
have any orderable data type: NUMERIC
, BIGNUMERIC
(Preview), STRING
, or
TIMESTAMP
. All rows with split values smaller than the threshold are used as
training data. The remaining rows including NULLs
are used as evaluation data.
'NO_SPLIT'
Use all data as training data.
The default value is 'AUTO_SPLIT'
for linear regression and logistic
regression models.
The default value is 'NO_SPLIT'
for matrix factorization models. If you decide
to use a different split method, please ensure all users and all items have
sufficient rows in the training set to build a good model.
This option is not supported by TensorFlow or k-means models.
DATA_SPLIT_EVAL_FRACTION
Syntax
DATA_SPLIT_EVAL_FRACTION = float64_value
Description
This option is used with 'RANDOM'
and 'SEQ'
splits. It specifies the
fraction of the data used for evaluation, accurate to two decimal places.
Arguments
float64_value
is a FLOAT64
. The default value is 0.2.
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value
Description
Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.
- When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.
- When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.
For information on supported input types, see Supported
input types for DATA_SPLIT_COL
.
For more information about using DATA_SPLIT_COL
with AutoML model types, see
the CREATE MODEL statement for AutoML Tables models.
Arguments
string_value
is a STRING
.
LS_INIT_LEARN_RATE
Syntax
LS_INIT_LEARN_RATE = float64_value
Description
Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH'
uses.
This option can only be used if LINE_SEARCH
is
specified.
If the model LEARN_RATE
appears to be doubling every
iteration as indicated by ML.TRAINING_INFO
, try setting
LS_INIT_LEARN_RATE
to the last doubled learn rate. The
optimal initial learn rate is different for every model. A good initial
learn rate for one model might not be a good initial learn rate for another.
Arguments
float64_value
is a FLOAT64
.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. The
TRAINING_RUN
number or the TIMESTAMP
columns can be
used to distinguish the warm start run from the original run.
The value of the MODEL_TYPE
and LABELS
options and the training data schema
must remain constant in a warm start. The warm_start
option is only supported for
LINEAR_REG
, LOGISTIC_REG
, KMEANS
, DNN_REGRESSOR
, and DNN_CLASSIFIER
models.
Arguments
Accepts a BOOL
. The default value is FALSE
.
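For example, assuming a model named `mydataset.mymodel` already exists and was originally trained as a linear regression on the same schema, a retraining run on new data might look like the following sketch; the table name is illustrative:

-- Retrains the existing model; the model type and training data schema must match the original run.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    WARM_START=TRUE ) AS
SELECT
  feature1,
  feature2,
  label
FROM
  `mydataset.new_training_table`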
AUTO_CLASS_WEIGHTS
Syntax
AUTO_CLASS_WEIGHTS = { TRUE | FALSE }
Description
Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.
Use this option with BOOSTED_TREE_CLASSIFIER, DNN_CLASSIFIER, and binary or multiclass LOGISTIC_REG models.
By default, the training data used to create a multiclass logistic regression, boosted tree classifier, and dnn classifier model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.
To balance every class, set AUTO_CLASS_WEIGHTS
to TRUE
. Balance is
accomplished using the following formula:
total_input_rows / (input_rows_for_class_n * number_of_unique_classes)
Arguments
Accepts a BOOL
. The default value is FALSE
.
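As a rough illustration of the formula, suppose the training data contains 100 rows with two classes: 90 rows labeled 'a' and 10 rows labeled 'b'. The weight for 'a' is 100 / (90 * 2) ≈ 0.56 and the weight for 'b' is 100 / (10 * 2) = 5, so the under-represented class contributes proportionally more to the loss during training.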
CLASS_WEIGHTS
Syntax
CLASS_WEIGHTS = struct_array
Description
The weights to use for each class label. This option cannot be specified if
AUTO_CLASS_WEIGHTS
is specified.
Arguments
struct_array
is an ARRAY
of STRUCT
s; each STRUCT
contains a
STRING
that contains the class label and a FLOAT64
containing the weight for
that class label. A weight must be present for every class label. The weights
are not required to add up to one. For example:
CLASS_WEIGHTS = [STRUCT('example_label', .2)]
NUM_CLUSTERS
Syntax
NUM_CLUSTERS = int64_value
Description
For a k-means model, the number of clusters to identify in the input data.
Arguments
int64_value
is an INT64
. Allowed values are 2-100. The default
value is log10(n)
, where n
is the number of training
examples.
KMEANS_INIT_METHOD
Syntax
KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }
Description
For a k-means model, the method of initializing the clusters.
To use the same centroids in repeated CREATE MODEL
queries, specify the option
'CUSTOM'
.
Arguments
'RANDOM'
: Initializes the centroids by randomly selecting
NUM_CLUSTERS
data points from the input data.
'KMEANS++'
: Initializes NUM_CLUSTERS
centroids by using
the KMEANS++
algorithm. 'KMEANS++'
trains a better model than 'RANDOM'
cluster
initialization.
'CUSTOM'
: Initializes the centroids using a provided column of type BOOL
.
This column is specified by the option 'KMEANS_INIT_COL'
.
BigQuery ML uses the rows with a value of TRUE
as the initial centroids.
When this option is present and the values in 'KMEANS_INIT_COL'
are constant,
repeated CREATE MODEL
queries use the same centroids.
The default value is 'RANDOM'.
KMEANS_INIT_COL
Syntax
KMEANS_INIT_COL = string_value
Description
For a k-means model, identifies the column used to initialize the centroids.
If this column contains a value of TRUE
for a given row, then BigQuery ML
uses that row as an initial centroid.
This option can only be specified when
'KMEANS_INIT_METHOD'
has the value 'CUSTOM'
. The
corresponding column must be of type BOOL
. Model option
NUM_CLUSTERS
must be present in the query and its value must
equal the total number of TRUE
rows in this column. BigQuery ML cannot
use this column as a feature and excludes it from features automatically.
Arguments
string_value
is a STRING
.
DISTANCE_TYPE
Syntax
DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }
Description
For a k-means model, the type of metric to compute the distance between two points.
Arguments
Accepts the following values:
'EUCLIDEAN'
Use the following equation to calculate the distance between point x
and y:

\( \lVert x-y\rVert_{2} \)

'COSINE'
Use the following equation to calculate the distance:

\( 1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}} \)

where \( \lVert x\rVert_{2} \) represents the L2 norm for x
.
The default value is 'EUCLIDEAN'
.
STANDARDIZE_FEATURES
Syntax
STANDARDIZE_FEATURES = { TRUE | FALSE }
Description
For a kmeans
model, whether to
standardize numerical
features.
Arguments
Accepts a BOOL
. The default value is TRUE
.
MODEL_PATH
Syntax
MODEL_PATH = string_value
Description
For TensorFlow model types, specifies the location of the TensorFlow model to import.
string_value
is the location of a Google Cloud Storage bucket that contains
the model to import.
See the CREATE MODEL statement for TensorFlow models for more information.
Example
MODEL_PATH = 'gs://bucket/path/to/saved_model/*'
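A complete import statement might look like the following sketch; the Cloud Storage path is a placeholder, and no AS query_statement clause is used when importing a TensorFlow model:

-- gs://bucket/path/to/saved_model/* is a placeholder path to a SavedModel directory.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='TENSORFLOW',
    MODEL_PATH='gs://bucket/path/to/saved_model/*' )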
FEEDBACK_TYPE
Syntax
FEEDBACK_TYPE = { 'EXPLICIT' | 'IMPLICIT' }
Description
Specifies the feedback type for matrix factorization models, which changes the algorithm that is used during training.
For more information, see matrix factorization models.
Arguments
The default value is 'EXPLICIT'
.
NUM_FACTORS
Syntax
NUM_FACTORS = int64_value
Description
Specifies the number of latent factors to use for matrix factorization models.
Arguments
int64_value
is an 'INT64'
. Allowed values are 2-200.
The default value is log2(n)
, where n
is the number
of training examples.
USER_COL
Syntax
USER_COL = string_value
Description
The user column name for matrix factorization models.
Arguments
string_value
is a 'STRING'
. The default value is
'user'
.
ITEM_COL
Syntax
ITEM_COL = string_value
Description
The item column name for matrix factorization models.
Arguments
string_value
is a 'STRING'
. The default value is
'item'
.
RATING_COL
Syntax
RATING_COL = string_value
Description
The rating column name for matrix factorization models.
Arguments
string_value
is a 'STRING'
. The default value is
'rating'
.
WALS_ALPHA
Syntax
WALS_ALPHA = float64_value
Description
A hyperparameter for 'IMPLICIT'
matrix factorization model. For more
information, see
matrix factorization models.
Arguments
float64_value
is a 'FLOAT64'
. The default value is 40.
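Putting the matrix factorization options together, an explicit-feedback model could be sketched as follows; the table and column names are illustrative:

-- movie_ratings, user_id, movie_id, and rating are hypothetical names.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='MATRIX_FACTORIZATION',
    FEEDBACK_TYPE='EXPLICIT',
    NUM_FACTORS=16,
    USER_COL='user_id',
    ITEM_COL='movie_id',
    RATING_COL='rating' ) AS
SELECT
  user_id,
  movie_id,
  rating
FROM
  `mydataset.movie_ratings`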
BUDGET_HOURS
Syntax
BUDGET_HOURS = float64_value
Description
Sets the training budget for AutoML Tables training, specified in hours.
Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used
if MODEL_TYPE
is AUTOML_REGRESSOR
or
AUTOML_CLASSIFIER
.
After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.
For more information, see the CREATE MODEL statement for AutoML Tables models.
Arguments
float64_value is a FLOAT64
.
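For example, an AutoML Tables classifier with a two-hour training budget might be specified as in the following sketch; the table and label column are illustrative:

-- Hypothetical table; the label column is named 'label'.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='AUTOML_CLASSIFIER',
    BUDGET_HOURS=2.0,
    INPUT_LABEL_COLS=['label'] ) AS
SELECT
  *
FROM
  `mydataset.mytable`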
BOOSTER_TYPE
Syntax
BOOSTER_TYPE = { 'GBTREE' | 'DART'}
Description
For boosted tree models, specify the booster type to use, with default value
GBTREE
.
Arguments
The following options are available:
- 'GBTREE': tree booster
- 'DART': dart booster
NUM_PARALLEL_TREE
Syntax
NUM_PARALLEL_TREE = int64_value
Description
Number of parallel trees constructed during each iteration. Default value is 1. To train a boosted random forest, set this value larger than 1.
DART_NORMALIZE_TYPE
Syntax
DART_NORMALIZE_TYPE = { 'TREE' | 'FOREST'}
Description
Type of normalization algorithm for DART booster. Default value is TREE
.
TREE_METHOD
Syntax
TREE_METHOD = { 'AUTO' | 'EXACT' | 'APPROX' | 'HIST'}
Description
Type of tree construction algorithm. Default value is AUTO
.
HIST
is recommended for large datasets in order to achieve faster training
speed and lower resource consumption. For more information, see
tree booster.
MIN_TREE_CHILD_WEIGHT
Syntax
MIN_TREE_CHILD_WEIGHT = float64_value
Description
Minimum sum of instance weight needed in a child for further partitioning.
If the tree partition step results in a leaf node with the sum of instance
weight less than min_tree_child_weight
, then the building process will give
up further partitioning. The larger the value of min_tree_child_weight
, the more
conservative the algorithm will be. The value should be greater than or equal to 0,
with default value 1.
COLSAMPLE_BYTREE
Syntax
COLSAMPLE_BYTREE = float64_value
Description
Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. The value should be between 0 and 1, with default value 1.
COLSAMPLE_BYLEVEL
Syntax
COLSAMPLE_BYLEVEL = float64_value
Description
Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree. The value should be between 0 and 1, with default value 1.
COLSAMPLE_BYNODE
Syntax
COLSAMPLE_BYNODE = float64_value
Description
Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. The value should be between 0 and 1, with default value 1.
MIN_SPLIT_LOSS
Syntax
MIN_SPLIT_LOSS = float64_value
Description
Minimum loss reduction required to make a further partition on a leaf node of
the tree. The larger the value of min_split_loss
, the more conservative the
algorithm will be. Default value is 0.
MAX_TREE_DEPTH
Syntax
MAX_TREE_DEPTH = int64_value
Description
Maximum depth of a tree. Default value is 6.
SUBSAMPLE
Syntax
SUBSAMPLE = float64_value
Description
Subsample ratio of the training instances. Setting this value to 0.5 means that training randomly samples half of the training data prior to growing trees, which prevents overfitting. Subsampling will occur once in every boosting iteration. This is independent of the training-test data split used in the training options (80/20 random by default). The test data is not used in any iteration irrespective of subsample; subsample is only applied to the training data. The value should be between 0 and 1, with default value 1.0 (use all of the training data in each iteration).
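The tree options above can be combined in a single statement. The following sketch trains a boosted tree classifier with a shallower tree depth and row subsampling; the table and label column are hypothetical:

-- Hypothetical table; the label column is named 'label'.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
    BOOSTER_TYPE='GBTREE',
    MAX_TREE_DEPTH=4,
    SUBSAMPLE=0.8,
    MIN_SPLIT_LOSS=0.1,
    INPUT_LABEL_COLS=['label'] ) AS
SELECT
  *
FROM
  `mydataset.mytable`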
ACTIVATION_FN
Syntax
ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }
Description
For DNN model types, specifies the activation function of the neural network.
Arguments
The following options are available:
- 'RELU' — Rectified linear
- 'RELU6' — Rectified linear 6
- 'CRELU' — Concatenated ReLU
- 'ELU' — Exponential linear
- 'SELU' — Scaled exponential linear
- 'SIGMOID' — Sigmoid activation
- 'TANH' — Tanh activation
BATCH_SIZE
Syntax
BATCH_SIZE = int64_value
Description
For DNN model types, specifies the mini batch size of samples that are fed to the neural network.
Arguments
The default value is the smaller value of 1024 and the number of samples.
The max value is 8192.
DROPOUT
Syntax
DROPOUT = float64_value
Description
For DNN model types, specifies the dropout rate of units in the neural network.
Arguments
The default value is 0. A valid input must be between 0.0 and 1.0.
HIDDEN_UNITS
Syntax
HIDDEN_UNITS = int_array
Description
For DNN model types, specifies the hidden layers of the neural network.
Arguments
An array of integers that represents the architecture of the hidden layers. If not specified, BigQuery ML applies a single hidden layer that contains no more than 128 units.
Example
HIDDEN_UNITS = [256, 128, 64]
This example represents an architecture of 3 hidden layers with 256, 128, and 64 nodes respectively.
OPTIMIZER
Syntax
OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' }
Description
For DNN model types, specifies the optimizer for training the model.
Arguments
The following options are available:
- 'ADAGRAD' — Implements the Adagrad algorithm
- 'ADAM' — Implements the Adam algorithm
- 'FTRL' — Implements the FTRL algorithm
- 'RMSPROP' — Implements the RMSProp algorithm
- 'SGD' — Implements the gradient descent algorithm
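Combining the DNN options above, a small classifier could be sketched as follows; the table and label column names are illustrative:

-- Hypothetical table; the label column is named 'label'.
CREATE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='DNN_CLASSIFIER',
    ACTIVATION_FN='RELU',
    HIDDEN_UNITS=[128, 64],
    DROPOUT=0.1,
    OPTIMIZER='ADAM',
    BATCH_SIZE=512,
    INPUT_LABEL_COLS=['label'] ) AS
SELECT
  *
FROM
  `mydataset.mytable`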
TIME_SERIES_TIMESTAMP_COL
Syntax
TIME_SERIES_TIMESTAMP_COL = string_value
Description
The timestamp column name for time series models.
Arguments
string_value
is a 'STRING'
.
TIME_SERIES_DATA_COL
Syntax
TIME_SERIES_DATA_COL = string_value
Description
The data column name for time series models.
Arguments
string_value
is a 'STRING'
.
TIME_SERIES_ID_COL
Syntax
TIME_SERIES_ID_COL = string_value
Description
The ID column name for time series models. This column is used when the user wants to fit and forecast multiple time series using a single query. Different IDs indicate different time series.
Arguments
string_value
is a 'STRING'
.
HORIZON
Syntax
HORIZON = int64_value
Description
The number of time points to forecast. When forecasting multiple time series at once, this parameter applies to each time series.
Arguments
The value is an INT64. The default value is 1000. The maximum value is 10,000.
AUTO_ARIMA
Syntax
AUTO_ARIMA = { TRUE | FALSE }
Description
Whether the training process should use auto.ARIMA or not. If true, training
automatically finds the best non-seasonal order (for example, the p, d, q
tuple) and decides whether or not to include a linear drift term when d
is 1.
If false, the user must specify non_seasonal_order in the query. When
forecasting multiple time series at the same time, the auto.ARIMA algorithm must
be used for each time series, so this option must not be set to false.
Arguments
The value is a BOOL
. The default value is TRUE
.
AUTO_ARIMA_MAX_ORDER
Syntax
AUTO_ARIMA_MAX_ORDER = int64_value
Description
The maximum value for the sum of non-seasonal p and q. It controls the parameter
search space in the auto.ARIMA algorithm. Currently, the allowed values are
(2, 3, 4, 5). As a reference, for each value there are (6, 10, 15, 21) candidate
models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal
d is determined to be 1, the number of candidate models to evaluate doubles, as
there is an additional drift term to consider for all the existing candidate
models. This option is disabled when AUTO_ARIMA
is set to false.
Arguments
The value is an INT64
. The default value is 5. The minimum value is 2 and the
maximum value is 5.
NON_SEASONAL_ORDER
Syntax
NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)
Description
The tuple of non-seasonal p, d, q for the ARIMA model. There are no default values and you must specify all of them. You must explicitly set auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2]. When forecasting multiple time series at once, one specific non-seasonal order is very unlikely to fit all the time series, so the auto.ARIMA algorithm is used to find the best non-seasonal order for each time series. Therefore, this option is disabled in this case.
Arguments
(int64_value, int64_value, int64_value)
is a tuple of
three 'INT64'
.
DATA_FREQUENCY
Syntax
DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }
Description
The data frequency of the input time series. The finest supported granularity is
'HOURLY'
. When forecasting multiple time series at once, this argument cannot
be changed from the default value 'AUTO_FREQUENCY'
.
Arguments
Accepts the following values:
- 'AUTO_FREQUENCY': the training process automatically infers the data frequency, which can be one of the values listed below.
- 'HOURLY': hourly time series
- 'DAILY': daily time series
- 'WEEKLY': weekly time series
- 'MONTHLY': monthly time series
- 'QUARTERLY': quarterly time series
- 'YEARLY': yearly time series
The default value is 'AUTO_FREQUENCY'
.
INCLUDE_DRIFT
Syntax
INCLUDE_DRIFT = { TRUE | FALSE }
Description
Whether the ARIMA model should include a linear drift term. The drift term is applicable when non-seasonal d is 1.
When auto_arima is set to false, this argument defaults to false. It can be set to true only when non-seasonal d is 1; otherwise, an invalid query error is returned.
When auto_arima is set to true, the training process automatically decides whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.
Arguments
The value is a BOOL. The default value is FALSE when auto_arima is disabled.
HOLIDAY_REGION
Syntax
HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... }
Description
The geographical region based on which the holiday effects are applied in modeling. By default, holiday effects modeling is disabled. To turn it on, specify the holiday region using this option.
Arguments
Accepts the following values:
Top level: global
- 'GLOBAL'

Second level: continental regions
- 'NA': North America
- 'JAPAC': Japan and Asia Pacific
- 'EMEA': Europe, the Middle East and Africa
- 'LAC': Latin America and the Caribbean
Third level: countries/regions
- 'AE': United Arab Emirates
- 'AR': Argentina
- 'AT': Austria
- 'AU': Australia
- 'BE': Belgium
- 'BR': Brazil
- 'CA': Canada
- 'CH': Switzerland
- 'CL': Chile
- 'CN': China
- 'CO': Colombia
- 'CZ': Czechia
- 'DE': Germany
- 'DK': Denmark
- 'DZ': Algeria
- 'EC': Ecuador
- 'EE': Estonia
- 'EG': Egypt
- 'ES': Spain
- 'FI': Finland
- 'FR': France
- 'GB': United Kingdom
- 'GR': Greece
- 'HK': Hong Kong
- 'HU': Hungary
- 'ID': Indonesia
- 'IE': Ireland
- 'IL': Israel
- 'IN': India
- 'IR': Iran
- 'IT': Italy
- 'JP': Japan
- 'KR': South Korea
- 'LV': Latvia
- 'MA': Morocco
- 'MX': Mexico
- 'MY': Malaysia
- 'NG': Nigeria
- 'NL': Netherlands
- 'NO': Norway
- 'NZ': New Zealand
- 'PE': Peru
- 'PH': Philippines
- 'PK': Pakistan
- 'PL': Poland
- 'PT': Portugal
- 'RO': Romania
- 'RS': Serbia
- 'RU': Russia
- 'SA': Saudi Arabia
- 'SE': Sweden
- 'SG': Singapore
- 'SI': Slovenia
- 'SK': Slovakia
- 'TH': Thailand
- 'TR': Turkey
- 'TW': Taiwan
- 'UA': Ukraine
- 'US': United States
- 'VE': Venezuela
- 'VN': Vietnam
- 'ZA': South Africa
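As an end-to-end illustration of the time series options, a single-series ARIMA model could be sketched as follows; the table and column names are hypothetical:

-- daily_sales, date, and sales are hypothetical names; forecasts 30 days ahead with US holiday effects.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='date',
    TIME_SERIES_DATA_COL='sales',
    DATA_FREQUENCY='DAILY',
    HORIZON=30,
    HOLIDAY_REGION='US' ) AS
SELECT
  date,
  sales
FROM
  `mydataset.daily_sales`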
query_statement
The AS query_statement
clause specifies the standard SQL query that is used to
generate the training data. See the
Standard SQL Query Syntax
page for the supported SQL syntax of the query_statement
clause.
All columns referenced by the query_statement
are used as inputs to the model
except for the columns included in input_label_cols
and data_split_col
.
Supported inputs
The CREATE MODEL
statement supports the following data types for input label
and data split columns.
Supported data types for input label columns
BigQuery ML supports different Standard SQL data types depending on the
model type. Supported data types for input_label_cols
include:
| Model type | Supported label types |
|---|---|
| linear_reg | INT64, NUMERIC, BIGNUMERIC (Preview), FLOAT64 |
| logistic_reg | Any groupable data type |
| kmeans | Any groupable data type |
Supported data types for data split columns
BigQuery ML supports different Standard SQL data types depending on
the data split method. Supported data types for data_split_col
include:
| Data split method | Supported column types |
|---|---|
| custom | BOOL |
| seq | INT64, NUMERIC, BIGNUMERIC (Preview), FLOAT64, TIMESTAMP |
Limitations
CREATE MODEL
statements must comply with the following rules:
- Only one CREATE statement is allowed.
- When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
- The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
- Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.
- For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
- For logistic regression models, the label columns can contain up to 50 unique values; that is, the number of classes is less than or equal to 50.
Input variable transformations
BigQuery ML transforms input variables or features as follows:
| Input data type | Transformation method | Details |
|---|---|---|
| INT64, NUMERIC, BIGNUMERIC (Preview), FLOAT64 | Standardization | For all numerical columns, BigQuery ML standardizes and centers the column at zero before passing it into training, except for Boosted Tree models. When creating a k-means model, the STANDARDIZE_FEATURES option specifies whether to standardize numerical features. |
| BOOL, STRING, BYTES, DATE, DATETIME, TIME | One-hot encoded | For all non-numerical, non-array columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation, except for Boosted Tree models. This transformation generates a separate feature for each unique value in the column. For Boosted Tree models, a label encoding transformation is applied to convert each unique value into a numerical value. |
| ARRAY | Multi-hot encoded | For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY. |
| TIMESTAMP | Timestamp transformation | When BigQuery ML encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix time in seconds component, BigQuery ML uses standardization; for all other components, it uses one-hot encoding. You can use the ML.WEIGHTS function to see the transformation of a TIMESTAMP column into multiple feature columns. |
| STRUCT | Struct expansion | When BigQuery ML encounters a STRUCT column, it expands the fields inside the STRUCT to single columns. It requires all fields to be named. Nested STRUCTs are not allowed. The column names after expansion are in the format of {struct_name}_{field_name}. |
Imputation
In statistics, imputation is used to replace missing data with substituted
values. When you train a model in BigQuery ML, NULL
values are treated as
missing data. When you predict outcomes in BigQuery ML, missing values can
occur when BigQuery ML encounters a NULL
value or a previously unseen value.
BigQuery ML handles missing data based on whether the column is numeric,
one-hot encoded, or a timestamp.
Numerical columns
In both training and prediction, NULL
values in numeric columns are replaced
with the mean value as calculated by the feature column in the original input
data.
One-hot/Multi-hot encoded columns
In both training and prediction, NULL
values in the encoded columns are
mapped to an additional category that is added to the data. Previously unseen
data is assigned a weight of 0 during prediction.
Timestamp columns
TIMESTAMP
columns use a mixture of imputation methods from both standardized
and one-hot encoded columns. For the generated unix time column, BigQuery ML
replaces values with the mean unix time across the original columns. For other
generated values, BigQuery ML assigns them to the respective NULL
category
for each extracted feature.
STRUCT columns
In both training and prediction, each field of the STRUCT is imputed according to its type.
CREATE MODEL
examples
The CREATE MODEL
statement creates a model with the specified options. If
the model name exists in the dataset, the following error is returned:
Already Exists: [PROJECT_ID]:[DATASET].[MODEL]
To replace an existing model, use the CREATE OR REPLACE MODEL
statement
instead.
The following examples create models named mymodel
in mydataset
in your
default project.
Training a linear regression model
The following example creates and trains a linear regression model. The initial learn rate for line search is set to .15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=.15,
    L1_REG=1,
    MAX_ITERATIONS=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10
Training a linear regression model with a sequential data split
In this example, you create a linear regression model with a sequential data
split. The split fraction is .3 and split uses the timestamp
column as the
basis for the split.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=.15,
    L1_REG=1,
    MAX_ITERATIONS=5,
    DATA_SPLIT_METHOD='SEQ',
    DATA_SPLIT_EVAL_FRACTION=0.3,
    DATA_SPLIT_COL='timestamp' ) AS
SELECT
  column1,
  column2,
  column3,
  timestamp,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10
Training a linear regression model with a custom data split
In this example, you have manually split your data into a training table and an
evaluation table. The training table is named training_table
. The evaluation
table is named evaluation_table
.
This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    DATA_SPLIT_METHOD='CUSTOM',
    DATA_SPLIT_COL='SPLIT_COL' ) AS
SELECT
  *,
  false AS split_col
FROM
  `mydataset.training_table`
UNION ALL
SELECT
  *,
  true AS split_col
FROM
  `mydataset.evaluation_table`
In this example, all the columns in the training table and in the evaluation
table are either features or the label. The query uses SELECT *
and UNION ALL
to append all of the data in the split_col
column to the existing data.
Training a multiclass logistic regression model with automatically calculated weights
In this example, you create a multiclass logistic regression model using the
auto_class_weights
option.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    AUTO_CLASS_WEIGHTS=TRUE ) AS
SELECT
  *
FROM
  `mydataset.mytable`
Training a multiclass logistic regression model with specified weights
In this example, you create a multiclass logistic regression model using the
class_weights
option. The label columns are label1
, label2
, and label3
.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)] ) AS
SELECT
  *
FROM
  `mydataset.mytable`
Training a logistic regression model with specified weights
In this example, you create a logistic regression model using the
class_weights
option.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('0', 0.9), ('1', 0.1)] ) AS
SELECT
  *
FROM
  `mydataset.mytable`
Training a k-means model
This example creates a k-means model with four clusters using the default
distance_type
value of 'EUCLIDEAN'
.
CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=4 ) AS
SELECT
  *
FROM
  `mydataset.mytable`
Training a k-means model with random clusters
This example creates a k-means model with three clusters using the random cluster initialization method.
CREATE MODEL
`mydataset.mymodel`
OPTIONS
( MODEL_TYPE='KMEANS',
NUM_CLUSTERS=3,
KMEANS_INIT_METHOD='RANDOM') AS
SELECT
*
FROM
`mydataset.mytable`
Training a k-means model with custom clusters
This example creates a k-means model with three clusters using the custom cluster
initialization method. init_col
identifies the column of type BOOL
that
contains the values which specify whether a given row is an initial centroid.
This column should only contain three rows with the value TRUE
.
CREATE MODEL
`mydataset.mymodel`
OPTIONS
( MODEL_TYPE='KMEANS',
NUM_CLUSTERS=3,
KMEANS_INIT_METHOD='CUSTOM',
KMEANS_INIT_COL='init_col') AS
SELECT
init_col,
features
FROM
`mydataset.mytable`
Model creation with TRANSFORM, while excluding original columns
The following query trains a model after adding the columns f1
and f2
from the SELECT
statement to form a new column c
; the columns f1
and f2
are omitted from the training data. Model training uses
columns f3
and label_col
as they appear in the data source t
.
CREATE MODEL m
TRANSFORM(f1 + f2 as c, * EXCEPT(f1, f2))
OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT f1, f2, f3, label_col FROM t;