The CREATE MODEL Statement

Models in BigQuery ML

A model in BigQuery ML represents what an ML system has learned from the training data.

The following types of models are supported by BigQuery ML:

  • Linear regression for forecasting — for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
  • Binary logistic regression for classification — for example, determining whether a customer will make a purchase. Labels must only have two possible values.
  • Multiclass logistic regression for classification — These models can be used to predict multiple possible values such as whether an input is "low-value", "medium-value", or "high-value". Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.

In BigQuery ML, a model can be used with data from multiple BigQuery datasets for training and for prediction.

When you create a model, categorical variables (of type BOOL, STRING, BYTES, DATE, DATETIME, or TIME) are one-hot encoded by default during training and prediction.

TIMESTAMP is not currently one-hot encoded by default. Use the CAST function to cast TIMESTAMP columns to STRING so that BigQuery ML treats the column as categorical.

For one-hot encoded features, each column value is an array of structs in the following format: [<String_Value, Weight>]. The length of the array is the cardinality (number of values) of that feature. A feature is an input variable used to make a prediction.

Numerical variables (of type NUMERIC, FLOAT64 or INT64) are standardized by default. In addition, standardization is automatically applied during prediction.

CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

CREATE MODEL limitations

BigQuery ML CREATE MODEL statements are subject to the following limitations:

  • When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
  • When you use a CREATE MODEL statement, the label column cannot contain NULL values. If the label column contains NULL values, the query fails.
  • Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(model_option_list)]
[AS query_statement]

Where:

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} is one of the following statements:

  • CREATE MODEL — creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error. Unless you specify split options, the training dataset is automatically split into training and evaluation data.
  • CREATE MODEL IF NOT EXISTS — creates and trains a new model only if the model does not currently exist in the specified dataset.
  • CREATE OR REPLACE MODEL — creates and trains a model and replaces an existing model with the same name in the specified dataset.

CREATE MODEL statements must comply with the following rules:

  • Only one CREATE statement is allowed.
  • For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
  • For logistic regression models, the label column must contain only two distinct values.
  • For multiclass logistic regression models, the label columns can contain up to 10 unique values.
  • In the model_option_list, the model_type option is required. All others are optional.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks); for example, `myproject.mydataset.mymodel`.

model_option_list

The model_option_list allows you to specify additional model options. You can include multiple options using a comma-separated list.

Specify a model option list in the following format:

NAME=VALUE, ...

NAME and VALUE must be one of the following combinations:

NAME VALUE Details
model_type

linear_reg

logistic_reg

linear_reg creates a linear regression model.

logistic_reg creates a logistic regression model or a multiclass logistic regression model. When you create a multiclass logistic regression model, specify training data that contains more than two unique labels.

The model_type option is required. All others are optional.

input_label_cols STRING

The label column name(s) in the training data. input_label_cols accepts an array of strings, but only one array element is supported for linear_reg and logistic_reg models.

If input_label_cols is unspecified, the column named "label" in the training data is used. If neither exists, the query fails.

For linear regression models, the label column must be real-valued (the column values must be real numbers). For logistic regression models, the label column must contain only two distinct values.

l1_reg FLOAT64

The amount of L1 regularization applied. The default value is 0.

l2_reg FLOAT64

The amount of L2 regularization applied. The default value is 0.

max_iterations INT64

The maximum number of training iterations (steps). The default value is 20.

learn_rate_strategy line_search
constant

Choose the strategy for specifying the learning rate during training. The default value is line_search.

Line search slows down training and increases the number of bytes processed, but the trade off is that it generally converges even with a larger initial learning rate specified.

If the model learn_rate appears to be doubling every iteration (as indicatd by ml.training_info), try setting ls_init_learn_rate to the last doubled learn rate. The optimal initial learn rate is different for every model. A good initial learn rate for one model might not be a good initial learn rate for another.

learn_rate FLOAT64

The learn rate for gradient descent when learn_rate_strategy is set to constant. If learn_rate_strategy is set to line_search, an error is returned. The default value is 0.1.

early_stop BOOL

Indicates training should stop after the first iteration in which the relative loss improvement is less than min_rel_progress. The default value is true.

min_rel_progress FLOAT64

The minimum relative loss improvement necessary to continue training when early_stop is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue. The default value is 0.01.

data_split_method

auto_split

random

custom

seq

no_split

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping. The default value is auto_split.

random splits data randomly. A random split is deterministic — different training runs produce the same split results so long as the underlying training data hasn't changed.

custom splits data using a customer provided BOOL (boolean) column. The rows with a boolean value of "true" are used as evaluation data. The rows with a value of "false" are used as training data.

seq splits data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, STRING or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows (including NULLs) are used as evaluation data.

no_split uses all data as training data.

If data_split_method is unspecified or explicitly set to auto_split, the automatic split strategy is as follows:

  1. When there are less than 500 rows in the input data, all rows are used as training data.
  2. When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a random split.
  3. When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a random split.
data_split_eval_fraction FLOAT64

This option is used with random and seq splits. data_split_eval_fraction is the fraction of the data used for evaluation, accurate to two decimal places. The default value is 0.2.

data_split_col STRING

This option identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

When data_split_method is custom, the corresponding column should be of type BOOL. The rows with true or NULL values are used as evaluation data. Rows with false values are used as training data.

When data_split_method is seq, the last data_split_fraction rows (from smallest to largest) in the corresponding column are used as evaluation data. The first rows are used as training data.

ls_init_learn_rate DOUBLE

Sets the initial learning rate that learn_rate_strategy='line_search' uses. This option can only be used if line_search is specified.

warm_start BOOL

This option is used to retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run. The default value is false.

In a warm start run, the iteration numbers are reset to start from zero. The training_run number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The model_type and labels options and the training data schema cannot change in a warm start.

auto_class_weights BOOL

By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired. Class weights can be used to balance the class labels and can be used for logistic and multiclass logistic regressions.

If set to true, the weights for each class are calculated in inverse proportion to the frequency of that class. To balance every class, use the following formula: TOTAL_INPUT_ROWS / (INPUT_ROWS_FOR_CLASS_N * NUMBER_OF_UNIQUE_CLASSES).

class_weights Array[STRUCT(STRING, DOUBLE)]

This option cannot be specified if auto_class_weights is specified. The value must be a list of weights to use for each class label. A weight must be provided for every class label. The weights are not required to add up to one.

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

CREATE MODEL examples

The CREATE MODEL statement creates a model with the specified options. If the model name exists in the dataset, the following error is returned:

Already Exists: [PROJECT_ID]:[DATASET].[MODEL]

To replace an existing model, use the CREATE OR REPLACE MODEL statement instead.

The following examples create models named mymodel in mydataset in your default project.

Training a linear regression model

The following example creates and trains a linear regression model. The learn rate is set to .15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    ls_init_learn_rate=.15,
    l1_reg=1,
    max_iterations=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a sequential data split

In this example, you create a linear regression model with a sequential data split. The split fraction is .3 and split uses the timestamp column as the basis for the split.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    ls_init_learn_rate=.15,
    l1_reg=1,
    max_iterations=5,
    data_split_method='seq',
    data_split_eval_fraction=0.3,
    data_split_col='timestamp' ) AS
SELECT
  column1,
  column2,
  column3,
  timestamp,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a custom data split

In this example, you have manually split your data into a training table and an evaluation table. The training table is named training_table. The evaluation table is named evaluation_table.

This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    data_split_method='custom',
    data_split_col='split_col' ) AS
SELECT
  *,
  false AS split_col
FROM
  `mydataset.training_table`
UNION ALL
SELECT
  *,
  true AS split_col
FROM
  `mydataset.evaluation_table`

In this example, all the columns in the training table and in the evaluation table are either features or the label. The query uses SELECT * and UNION ALL to append all of the data in the split_col column to the existing data.

Training a multiclass logistic regression model with automatically calculated weights

In this example, you create a multiclass logistic regression model using the auto_class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    auto_class_weights=true ) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a multiclass logistic regression model with specified weights

In this example, you create a multiclass logistic regression model using the class_weights option. The label columns are label1, label2, and label3.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    class_weights=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a logistic regression model with specified weights

In this example, you create a logistic regression model using the class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    class_weights=[('0', 0.9), ('1', 0.1)]) AS
SELECT
  *
FROM
  `mydataset.mytable`
Was this page helpful? Let us know how we did:

Send feedback about...

Need help? Visit our support page.