The CREATE MODEL statement for generalized linear models

CREATE MODEL statement

To create a linear regression or logistic regression model in BigQuery, use the BigQuery ML CREATE MODEL statement with the LINEAR_REG or LOGISTIC_REG model types.

For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' },
    INPUT_LABEL_COLS = string_array,
    OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' },
    L1_REG = float64_value,
    L2_REG = float64_value,
    MAX_ITERATIONS = int64_value,
    LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' },
    LEARN_RATE = float64_value,
    EARLY_STOP = { TRUE | FALSE },
    MIN_REL_PROGRESS = float64_value,
    DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
    DATA_SPLIT_EVAL_FRACTION = float64_value,
    DATA_SPLIT_COL = string_value,
    LS_INIT_LEARN_RATE = float64_value,
    WARM_START = { TRUE | FALSE },
    AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
    CLASS_WEIGHTS = struct_array,
    ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
    CALCULATE_P_VALUES = { TRUE | FALSE },
    CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING`, 'DUMMY_ENCODING' }
)];

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]` ; for example, `myproject.mydataset.mymodel`.

model_option_list

In the model_option_list, the model_type option is required. All others are optional.

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG'}

Description

Specify the model type. This option is required.

Arguments

'LINEAR_REG' Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued: they cannot be +/- infinity or NaN.

'LOGISTIC_REG' Logistic regression for classification; for example, determining whether a customer will make a purchase. This model can be of two types:

  • Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values, one for the positive class and another for the negative class. BigQuery ML treats the higher label value as the positive class, and lower label value as the negative class. This holds for both numeric and string label values.
  • Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The label column names in the training data.

Arguments

string_array is an ARRAY of STRINGs. The linear regression and logistic regression model types only support string_array values that contain one element. Defaults to 'label'.

OPTIMIZE_STRATEGY

Syntax

OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }

Description

The strategy to train linear regression models.

Arguments

Accepts the following values:

'AUTO_STRATEGY' Determine the training strategy as follows:

  • If l1_reg or warm_start is specified, batch_gradient_descent strategy is used.
  • If total cardinalities of training features are more than 10,000, batch_gradient_descent strategy is used.
  • If there is over-fitting issue, that is, the number of training examples is less than 10x, where x is total cardinality, batch_gradient_descent strategy is used.
  • NORMAL_EQUATION strategy is used for all other cases.

'BATCH_GRADIENT_DESCENT' Train the model using the batch gradient descent method, which optimizes the loss function using the gradient function.

'NORMAL_EQUATION' Directly compute the least square solution of the linear regression problem with the analytical formula. Normal equation cannot be used in the following cases:

  • l1_reg is specified.
  • warm_start is specified.
  • Total cardinality of training features is more than 10,000.

The default value is 'AUTO_STRATEGY'.

L1_REG

Syntax

L1_REG = float64_value

Description

The amount of L1 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

L2_REG

Syntax

L2_REG = float64_value

Description

The amount of L2 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

MAX_ITERATIONS

Syntax

MAX_ITERATIONS = int64_value

Description

The maximum number of training iterations or steps.

Arguments

int64_value is an INT64. The default value is 20.

LEARN_RATE_STRATEGY

Syntax

LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }

Description

The strategy for specifying the learning rate during training.

Arguments

'LINE_SEARCH' Use the line search method to calculate the learning rate. The line search initial learn rate is the value specified for LS_INIT_LEARN_RATE .

Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.

'CONSTANT' Set learning rate to the value specified for LEARN_RATE.

The default value is 'LINE_SEARCH'.

LEARN_RATE

Syntax

LEARN_RATE = float64_value

Description

The learn rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. If LEARN_RATE_STRATEGY is set to 'LINE_SEARCH', an error is returned.

Arguments

float64_value is a FLOAT64. The default value is 0.1.

EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }

Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.

DATA_SPLIT_METHOD

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }

Description

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping.

When there is a data split, you can find the temporary split results (Training Data, Evaluation Data) on the Model Details page in the BigQuery Console and the model API data_split_result field. These split tables will be saved for 48  hours. If you will need them for longer than 48 hours, copy them out of the anonymous dataset for longer retention.

Arguments

Accepts the following values:

'AUTO_SPLIT' The automatic split strategy is as follows:

  • When there are fewer than 500 rows in the input data, all rows are used as training data.
  • When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
  • When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a RANDOM split.

'RANDOM' Split data randomly. A random split is deterministic: different training runs produce the same split results if the underlying training data remains the same.

'CUSTOM' Split data using a customer provided column of type BOOL. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.

'SEQ' Split data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, BIGNUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows including NULLs are used as evaluation data.

'NO_SPLIT' Use all data as training data.

The default data split method is 'AUTO_SPLIT'.

DATA_SPLIT_EVAL_FRACTION

Syntax

DATA_SPLIT_EVAL_FRACTION = float64_value

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.

Arguments

float64_value is a FLOAT64. The default value is 0.2.

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

  • When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.

  • When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.

Arguments

string_value is a STRING.

LS_INIT_LEARN_RATE

Syntax

LS_INIT_LEARN_RATE = float64_value

Description

Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH' uses. This option can only be used if LINE_SEARCH is specified.

If the model LEARN_RATE appears to be doubling every iteration as indicated by ML.TRAINING_INFO, try setting LS_INIT_LEARN_RATE to the last doubled learn rate. The optimal initial learn rate is different for every model. A good initial learn rate for one model might not be a good initial learn rate for another.

Arguments

float64_value is a FLOAT64. The default value is 0.1.

WARM_START

Syntax

WARM_START = { TRUE | FALSE }

Description

Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The value of the MODEL_TYPE and LABELS options and the training data schema must remain constant in a warm start.

Arguments

Accepts a BOOL. The default value is FALSE.

AUTO_CLASS_WEIGHTS

Syntax

AUTO_CLASS_WEIGHTS = { TRUE | FALSE }

Description

Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class. Use with logistic regression.

By default, the training data used to create a logistic regression model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.

To balance every class, set AUTO_CLASS_WEIGHTS to TRUE. Balance is accomplished using the following formula:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)

Arguments

Accepts a BOOL. The default value is FALSE.

CLASS_WEIGHTS

Syntax

CLASS_WEIGHTS = struct_array

Description

The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is specified.

Arguments

struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING that contains the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]

ENABLE_GLOBAL_EXPLAIN

Syntax

ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE }

Description

Whether to compute global explanations using explainable AI to evaluate global feature importance to the model.

Arguments

The value is a BOOL. The default value is FALSE.

CALCULATE_P_VALUES

Syntax

CALCULATE_P_VALUES = { TRUE | FALSE }

Description

Specifies whether to compute p-values and standard errors during training.

Arguments

The value is a BOOL. The default value is FALSE.

CATEGORY_ENCODING_METHOD

Syntax

CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING', 'DUMMY_ENCODING' }

Description

Specifies which encoding method to use on non-numeric features. For more information about supported encoding methods, read BigQuery ML auto preprocessing.

Arguments

The value is a string. The default value is 'ONE_HOT_ENCODING'.

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

Limitations

CREATE MODEL statements must comply with the following rules:

  • For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
  • For logistic regression models, the label columns can contain up to 50 unique values; that is, the number of classes is less than or equal to 50.

CREATE MODEL examples

The following examples create models named mymodel in mydataset in your default project.

Training a linear regression model

The following example creates and trains a linear regression model. The learn rate is set to 0.15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=0.15,
    L1_REG=1,
    MAX_ITERATIONS=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a sequential data split

In this example, you create a linear regression model with a sequential data split. The split fraction is 0.3 and split uses the timestamp column as the basis for the split.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=0.15,
    L1_REG=1,
    MAX_ITERATIONS=5,
    DATA_SPLIT_METHOD='SEQ',
    DATA_SPLIT_EVAL_FRACTION=0.3,
    DATA_SPLIT_COL='timestamp' ) AS
SELECT
  column1,
  column2,
  column3,
  timestamp,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a custom data split

In this example, you have manually split your data into a training table and an evaluation table. The training table is named training_table. The evaluation table is named evaluation_table.

This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    DATA_SPLIT_METHOD='CUSTOM',
    DATA_SPLIT_COL='SPLIT_COL' ) AS
SELECT
  *,
  false AS split_col
FROM
  `mydataset.training_table`
UNION ALL
SELECT
  *,
  true AS split_col
FROM
  `mydataset.evaluation_table`

In this example, all the columns in the training table and in the evaluation table are either features or the label. The query uses SELECT * and UNION ALL to append all of the data in the split_col column to the existing data.

Training a multiclass logistic regression model with automatically calculated weights

In this example, you create a multiclass logistic regression model using the auto_class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    AUTO_CLASS_WEIGHTS=TRUE ) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a multiclass logistic regression model with specified weights

In this example, you create a multiclass logistic regression model using the class_weights option. The label columns are label1, label2, and label3.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a logistic regression model with specified weights

In this example, you create a logistic regression model using the class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('0', 0.9), ('1', 0.1)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Model creation with TRANSFORM, while excluding original columns

The following query trains a model after adding the columns f1 and f2 from the SELECT statement to form a new column c; the columns f1 and f2 are omitted from the training data. Model training uses columns f3 and label_col as they appear in the data source t.

CREATE MODEL m
  TRANSFORM(f1 + f2 as c, * EXCEPT(f1, f2))
  OPTIONS(model_type=’linear_reg’, input_label_cols=['label_col'])
AS SELECT f1, f2, f3, label_col FROM t;