CREATE MODEL
statement
To create a linear regression or logistic regression model in BigQuery, use the
BigQuery ML CREATE MODEL
statement with the LINEAR_REG
or
LOGISTIC_REG
model types.
For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name [OPTIONS(MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' }, INPUT_LABEL_COLS = string_array, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }, L1_REG = float64_value, L2_REG = float64_value, MAX_ITERATIONS = int64_value, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }, LEARN_RATE = float64_value, EARLY_STOP = { TRUE | FALSE }, MIN_REL_PROGRESS = float64_value, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }, DATA_SPLIT_EVAL_FRACTION = float64_value, DATA_SPLIT_COL = string_value, LS_INIT_LEARN_RATE = float64_value, WARM_START = { TRUE | FALSE }, AUTO_CLASS_WEIGHTS = { TRUE | FALSE }, CLASS_WEIGHTS = struct_array, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE }, CALCULATE_P_VALUES = { TRUE | FALSE }, FIT_INTERCEPT = { TRUE | FALSE }, CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING`, 'DUMMY_ENCODING' } )];
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
model_name
is the name of the model you're creating or replacing. The model
name must be unique per dataset: no other model or table can have the same name.
The model name must follow the same naming rules as a BigQuery table. A
model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]` ; for example, `myproject.mydataset.mymodel`.
model_option_list
In the model_option_list
, the model_type
option is required. All others
are optional.
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG'}
Description
Specify the model type. This option is required.
Arguments
'LINEAR_REG'
Linear regression for forecasting; for example, the sales of an
item on a given day. Labels are real-valued: they cannot be +/- infinity or
NaN
.
'LOGISTIC_REG'
Logistic regression for classification; for example, determining whether a
customer will make a purchase. This model can be of two types:
- Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values, one for the positive class and another for the negative class. BigQuery ML treats the higher label value as the positive class, and lower label value as the negative class. This holds for both numeric and string label values.
- Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array
Description
The label column names in the training data.
Arguments
string_array
is an ARRAY
of STRING
s. The linear
regression and logistic regression model types only support
string_array values that contain one element. Defaults to 'label'.
OPTIMIZE_STRATEGY
Syntax
OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }
Description
The strategy to train linear regression models.
Arguments
Accepts the following values:
'AUTO_STRATEGY'
Determine the training strategy as follows:
- If
l1_reg
orwarm_start
is specified,batch_gradient_descent
strategy is used. - If total cardinalities of training features are more than 10,000,
batch_gradient_descent
strategy is used. - If there is over-fitting issue, that is, the number of training examples is
less than 10x, where x is total cardinality,
batch_gradient_descent
strategy is used. NORMAL_EQUATION
strategy is used for all other cases.
'BATCH_GRADIENT_DESCENT'
Train the model using the
batch gradient descent
method, which optimizes the loss function using the gradient function.
'NORMAL_EQUATION'
Directly compute the
least square solution of the linear regression problem with the analytical
formula. Normal equation cannot be used in the following cases:
l1_reg
is specified.warm_start
is specified.- Total cardinality of training features is more than 10,000.
The default value is 'AUTO_STRATEGY'
.
L1_REG
Syntax
L1_REG = float64_value
Description
The amount of L1 regularization applied.
Arguments
float64_value
is a FLOAT64
. The default value is 0.
L2_REG
Syntax
L2_REG = float64_value
Description
The amount of L2 regularization applied.
Arguments
float64_value
is a FLOAT64
. The default value is 0.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations or steps.
Arguments
int64_value
is an INT64
. The default value is 20.
LEARN_RATE_STRATEGY
Syntax
LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }
Description
The strategy for specifying the learning rate during training.
Arguments
'LINE_SEARCH'
Use the line
search method to calculate the learning rate. The line search initial learn
rate is the value specified for LS_INIT_LEARN_RATE
.
Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.
'CONSTANT'
Set learning rate to the value specified for
LEARN_RATE.
The default value is 'LINE_SEARCH'
.
LEARN_RATE
Syntax
LEARN_RATE = float64_value
Description
The learn rate for
gradient descent when LEARN_RATE_STRATEGY
is set to CONSTANT
. If
LEARN_RATE_STRATEGY
is set to 'LINE_SEARCH'
, an error is returned.
Arguments
float64_value
is a FLOAT64
. The default value is 0.1.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Whether training should stop after the first iteration in which the relative
loss improvement is less than the value specified for
MIN_REL_PROGRESS
.
Arguments
The value is a BOOL
. The default value is TRUE
.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training
when EARLY_STOP
is set to true. For example, a value of 0.01 specifies
that each iteration must reduce the loss by 1% for training to continue.
Arguments
float64_value
is a FLOAT64
. The default value is 0.01.
DATA_SPLIT_METHOD
Syntax
DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }
Description
The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping.
When there is a data split, you can find the temporary split results (Training Data, Evaluation Data) on the Model Details page in the BigQuery Console and the model API data_split_result field. These split tables will be saved for 48 hours. If you will need them for longer than 48 hours, copy them out of the anonymous dataset for longer retention.
Arguments
Accepts the following values:
'AUTO_SPLIT'
The automatic split strategy is as follows:
- When there are fewer than 500 rows in the input data, all rows are used as training data.
- When there are between 500 and 50,000 rows in the input data, 20% of the data
is used as evaluation data in a
RANDOM
split. - When there are more than 50,000 rows in the input data, only 10,000 of them
are used as evaluation data in a
RANDOM
split.
'RANDOM'
Split data randomly. A random split is deterministic: different
training runs produce the same split results if the underlying training
data remains the same.
'CUSTOM'
Split data using a customer provided column of type BOOL
. The rows
with a value of TRUE
are used as evaluation data. The rows with a value of
FALSE
are used as training data.
'SEQ'
Split data sequentially using a customer-provided column. The column can
have any orderable data type: NUMERIC
, BIGNUMERIC
, STRING
, or
TIMESTAMP
. All rows with split values smaller than the threshold are used as
training data. The remaining rows including NULLs
are used as evaluation data.
'NO_SPLIT'
Use all data as training data.
The default data split method is 'AUTO_SPLIT'
.
DATA_SPLIT_EVAL_FRACTION
Syntax
DATA_SPLIT_EVAL_FRACTION = float64_value
Description
This option is used with 'RANDOM'
and 'SEQ'
splits. It specifies the
fraction of the data used for evaluation, accurate to two decimal places.
Arguments
float64_value
is a FLOAT64
. The default value is 0.2.
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value
Description
Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.
When the value of
DATA_SPLIT_METHOD
is'CUSTOM'
, the corresponding column should be of typeBOOL
. The rows withTRUE
orNULL
values are used as evaluation data. Rows withFALSE
values are used as training data.When the value of
DATA_SPLIT_METHOD
is'SEQ'
, the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified forDATA_SPLIT_EVAL_FRACTION
. The first rows are used as training data.
Arguments
string_value
is a STRING
.
LS_INIT_LEARN_RATE
Syntax
LS_INIT_LEARN_RATE = float64_value
Description
Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH'
uses.
This option can only be used if LINE_SEARCH
is
specified.
If the model LEARN_RATE
appears to be doubling every
iteration as indicated by ML.TRAINING_INFO
, try setting
LS_INIT_LEARN_RATE
to the last doubled learn rate. The
optimal initial learn rate is different for every model. A good initial
learn rate for one model might not be a good initial learn rate for another.
Arguments
float64_value
is a FLOAT64
. The default value is 0.1.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. The
TRAINING_RUN
number or the TIMESTAMP
columns can be
used to distinguish the warm start run from the original run.
The value of the MODEL_TYPE
and LABELS
options and the training data schema
must remain constant in a warm start.
Arguments
Accepts a BOOL
. The default value is FALSE
.
AUTO_CLASS_WEIGHTS
Syntax
AUTO_CLASS_WEIGHTS = { TRUE | FALSE }
Description
Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class. Use with logistic regression.
By default, the training data used to create a logistic regression model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.
To balance every class, set AUTO_CLASS_WEIGHTS
to TRUE
. Balance is
accomplished using the following formula:
total_input_rows / (input_rows_for_class_n * number_of_unique_classes)
Arguments
Accepts a BOOL
. The default value is FALSE
.
CLASS_WEIGHTS
Syntax
CLASS_WEIGHTS = struct_array
Description
The weights to use for each class label. This option cannot be specified if
AUTO_CLASS_WEIGHTS
is TRUE
.
Arguments
struct_array
is an ARRAY
of STRUCT
s; each STRUCT
contains a
STRING
that contains the class label and a FLOAT64
containing the weight for
that class label. A weight must be present for every class label. The weights
are not required to add up to one. For example:
CLASS_WEIGHTS = [STRUCT('example_label', .2)]
ENABLE_GLOBAL_EXPLAIN
Syntax
ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE }
Description
Whether to compute global explanations using explainable AI to evaluate global feature importance to the model.
Arguments
The value is a BOOL
. The default value is FALSE
.
CALCULATE_P_VALUES
Syntax
CALCULATE_P_VALUES = { TRUE | FALSE }
Description
Specifies whether to compute p-values and standard errors during training.
Arguments
The value is a BOOL
. The default value is FALSE
.
FIT_INTERCEPT
Syntax
FIT_INTERCEPT = { TRUE | FALSE }
Description
Specifies whether to fit an intercept to the model during training.
Arguments
The value is a BOOL
. The default value is TRUE
.
CATEGORY_ENCODING_METHOD
Syntax
CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING', 'DUMMY_ENCODING' }
Description
Specifies which encoding method to use on non-numeric features. For more information about supported encoding methods, read BigQuery ML auto preprocessing.
Arguments
The value is a string
. The default value is 'ONE_HOT_ENCODING'
.
query_statement
The AS query_statement
clause specifies the standard SQL query that is used to
generate the training data. See the
Standard SQL Query Syntax
page for the supported SQL syntax of the query_statement
clause.
All columns referenced by the query_statement
are used as inputs to the model
except for the columns included in input_label_cols
and data_split_col
.
Limitations
CREATE MODEL
statements must comply with the following rules:
- For linear regression models, the
label
column must be real-valued (the column values cannot be +/- infinity or NaN). - For logistic regression models, the label columns can contain up to 50 unique values; that is, the number of classes is less than or equal to 50.
CREATE MODEL
examples
The following examples create models named mymodel
in mydataset
in your
default project.
Training a linear regression model
The following example creates and trains a linear regression model. The learn rate is set to 0.15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LINEAR_REG', LS_INIT_LEARN_RATE=0.15, L1_REG=1, MAX_ITERATIONS=5 ) AS SELECT column1, column2, column3, label FROM `mydataset.mytable` WHERE column4 < 10
Training a linear regression model with a sequential data split
In this example, you create a linear regression model with a sequential data
split. The split fraction is 0.3 and split uses the timestamp
column as the
basis for the split.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LINEAR_REG', LS_INIT_LEARN_RATE=0.15, L1_REG=1, MAX_ITERATIONS=5, DATA_SPLIT_METHOD='SEQ', DATA_SPLIT_EVAL_FRACTION=0.3, DATA_SPLIT_COL='timestamp' ) AS SELECT column1, column2, column3, timestamp, label FROM `mydataset.mytable` WHERE column4 < 10
Training a linear regression model with a custom data split
In this example, you have manually split your data into a training table and an
evaluation table. The training table is named training_table
. The evaluation
table is named evaluation_table
.
This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LINEAR_REG', DATA_SPLIT_METHOD='CUSTOM', DATA_SPLIT_COL='SPLIT_COL' ) AS SELECT *, false AS split_col FROM `mydataset.training_table` UNION ALL SELECT *, true AS split_col FROM `mydataset.evaluation_table`
In this example, all the columns in the training table and in the evaluation
table are either features or the label. The query uses SELECT *
and UNION ALL
to append all of the data in the split_col
column to the existing data.
Training a multiclass logistic regression model with automatically calculated weights
In this example, you create a multiclass logistic regression model using the
auto_class_weights
option.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LOGISTIC_REG', AUTO_CLASS_WEIGHTS=TRUE ) AS SELECT * FROM `mydataset.mytable`
Training a multiclass logistic regression model with specified weights
In this example, you create a multiclass logistic regression model using the
class_weights
option. The label columns are label1
, label2
, and label3
.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LOGISTIC_REG', CLASS_WEIGHTS=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS SELECT * FROM `mydataset.mytable`
Training a logistic regression model with specified weights
In this example, you create a logistic regression model using the
class_weights
option.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='LOGISTIC_REG', CLASS_WEIGHTS=[('0', 0.9), ('1', 0.1)]) AS SELECT * FROM `mydataset.mytable`
Model creation with TRANSFORM, while excluding original columns
The following query trains a model after adding the columns f1
and f2
from the SELECT
statement to form a new column c
; the columns f1
and f2
are omitted from the training data. Model training uses
columns f3
and label_col
as they appear in the data source t
.
CREATE MODEL m
TRANSFORM(f1 + f2 as c, * EXCEPT(f1, f2))
OPTIONS(model_type=’linear_reg’, input_label_cols=['label_col'])
AS SELECT f1, f2, f3, label_col FROM t;