The CREATE MODEL statement

CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[TRANSFORM (select_list)]
[OPTIONS(model_option_list)]
[AS query_statement]

model_option_list:
    MODEL_TYPE = { 'LINEAR_REG' |
                   'LOGISTIC_REG' |
                   'KMEANS' |
                   'MATRIX_FACTORIZATION' |
                   'PCA' |
                   'AUTOENCODER' |
                   'AUTOML_CLASSIFIER' |
                   'AUTOML_REGRESSOR' |
                   'BOOSTED_TREE_CLASSIFIER' |
                   'BOOSTED_TREE_REGRESSOR' |
                   'DNN_CLASSIFIER' |
                   'DNN_REGRESSOR' |
                   'DNN_LINEAR_COMBINED_CLASSIFIER' |
                   'DNN_LINEAR_COMBINED_REGRESSOR' |
                   'ARIMA_PLUS' |
                   'TENSORFLOW'}
    [, INPUT_LABEL_COLS = string_array ]
    [, MAX_ITERATIONS = int64_value ]
    [, EARLY_STOP = { TRUE | FALSE } ]
    [, MIN_REL_PROGRESS = float64_value ]
    [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
    [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
    [, DATA_SPLIT_COL = string_value ]
    [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
    [, L1_REG = float64_value ]
    [, L2_REG = float64_value ]
    [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
    [, LEARN_RATE = float64_value ]
    [, LS_INIT_LEARN_RATE = float64_value ]
    [, WARM_START = { TRUE | FALSE } ]
    [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
    [, CLASS_WEIGHTS = struct_array ]
    [, NUM_CLUSTERS = int64_value ]
    [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]
    [, KMEANS_INIT_COL = string_value ]
    [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
    [, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
    [, MODEL_PATH = string_value ]
    [, BUDGET_HOURS = float64_value ]
    [, FEEDBACK_TYPE = {'EXPLICIT' | 'IMPLICIT'} ]
    [, NUM_FACTORS = int64_value ]
    [, USER_COL = string_value ]
    [, ITEM_COL = string_value ]
    [, RATING_COL = string_value ]
    [, WALS_ALPHA = float64_value ]
    [, BOOSTER_TYPE = { 'gbtree' | 'dart'} ]
    [, NUM_PARALLEL_TREE = int64_value ]
    [, DART_NORMALIZE_TYPE = { 'tree' | 'forest'} ]
    [, TREE_METHOD = { 'auto' | 'exact' | 'approx' | 'hist'} ]
    [, MIN_TREE_CHILD_WEIGHT = float64_value ]
    [, COLSAMPLE_BYTREE = float64_value ]
    [, COLSAMPLE_BYLEVEL = float64_value ]
    [, COLSAMPLE_BYNODE = float64_value ]
    [, MIN_SPLIT_LOSS = float64_value ]
    [, MAX_TREE_DEPTH = int64_value ]
    [, SUBSAMPLE = float64_value ])
    [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } ]
    [, BATCH_SIZE = int64_value ]
    [, DROPOUT = float64_value ]
    [, HIDDEN_UNITS = int_array ]
    [, OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } ]
    [, TIME_SERIES_TIMESTAMP_COL = string_value ]
    [, TIME_SERIES_DATA_COL = string_value ]
    [, TIME_SERIES_ID_COL = { string_value | string_array } ]
    [, HORIZON = int64_value ]
    [, AUTO_ARIMA = { TRUE | FALSE } ]
    [, AUTO_ARIMA_MAX_ORDER = int64_value ]
    [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
    [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | ... } ]
    [, INCLUDE_DRIFT = { TRUE | FALSE } ]
    [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
    [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
    [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
    [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]
    [, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } ]
    [, INTEGRATED_GRADIENTS_NUM_STEPS = int64_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]` ; for example, `myproject.mydataset.mymodel`.

TRANSFORM

TRANSFORM lets you specify all preprocessing during model creation and have it automatically applied during prediction and evaluation.

For example, you can create the following model:

CREATE OR REPLACE MODEL m
  TRANSFORM(ML.FEATURE_CROSS(STRUCT(f1, f2)) as cross_f,
            ML.QUANTILE_BUCKETIZE(f3) OVER() as buckets,
            label_col)
  OPTIONS(model_type=’linear_reg’, input_label_cols=['label_col'])
AS SELECT * FROM t

During prediction, you don't need to preprocess the input again, and the same transformations are automatically restored:

SELECT * FROM ML.PREDICT(MODEL m, (SELECT f1, f2, f3 FROM table))

When the TRANSFORM clause is present, only output columns from the TRANSFORM clause are used in training. Any results from query_statement that don't appear in the TRANSFORM clause are ignored.

The input columns of the TRANSFORM clause are the result of query_statement. So, the final input used in training is the set of columns generated by the following query:

SELECT (select_list) FROM (query_statement);

Input columns of the TRANSFORM clause can be of any SIMPLE type or ARRAY of SIMPLE type. SIMPLE types are non-STRUCT and non-ARRAY data types.

In prediction (ML.PREDICT), users only need to pass in the original columns from the query_statement that are used inside the TRANSFORM clause. The columns dropped in TRANSFORM don't need to be provided during prediction. TRANSFORM is automatically applied to the input data during prediction, including the statistics used in ML analytic functions (for example, ML.QUANTILE_BUCKETIZE).

select_list

You can pass columns from query_statement through to model training without transformation by either using ** EXCEPT(), or by listing the column names directly.

Not all columns from query_statement are required to appear in the TRANSFORM clause, so you can drop columns appearing in query_statement by omitting them from the TRANSFORM clause.

You can transform inputs from query_statement by using expressions in select_list. select_list is similar to a normal SELECT statement. select_list supports the following syntax:

  • *
  • * EXCEPT()
  • * REPLACE()
  • expression
  • expression.*

The following cannot appear inside select_list:

  • Aggregation functions.
  • Non-ML Analytic functions. You can find all ML analytic functions in preprocessing functions
  • UDFs.
  • Subqueries.
  • Anonymous columns. For example, “a + b as c” is allowed, while “a + b” is not.

The output columns of select_list can be of any BigQuery ML supported data type.

If present, the following columns must appear in select_list without transformation:

  • label
  • data_split_col
  • kmeans_init_col

If these columns are returned by query_statement, you must reference them in select_list by column name outside of any expression, or by using *. Using an alias with these columns is disallowed.

model_option_list

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'PCA' |
'MATRIX_FACTORIZATION' | 'AUTOENCODER' | 'AUTOML_REGRESSOR' |
'AUTOML_CLASSIFIER' | 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' |
'DNN_CLASSIFIER' | 'DNN_REGRESSOR' | 'DNN_LINEAR_COMBINED_CLASSIFIER' |
'DNN_LINEAR_COMBINED_REGRESSOR' | 'ARIMA_PLUS' | 'TENSORFLOW'}

Description

Specify the model type. This argument is required.

Arguments

The argument is in the model type column.

Model category Model type Description Model specific CREATE MODEL statement
Regression 'LINEAR_REG' Linear regression for real-valued label prediction; for example, the sales of an item on a given day. CREATE MODEL statement for generalized linear models
'BOOSTED_TREE_REGRESSOR' Create a Boosted Tree Regressor model using the XGBoost library. CREATE MODEL statement for boosted tree models
'DNN_REGRESSOR' Create a Deep Neural Network Regressor model. CREATE MODEL statement for DNN models
'DNN_LINEAR_COMBINED_REGRESSOR' Create a Wide-and-Deep Regressor model. CREATE MODEL statement for Wide-and-Deep models
'AUTOML_REGRESSOR' Create a regression model using AutoML Tables. CREATE MODEL statement for AutoML Tables models
Classification 'LOGISTIC_REG' Logistic regression for binary-class or multi-class classification; for example, determining whether a customer will make a purchase. CREATE MODEL statement for generalized linear models
'BOOSTED_TREE_CLASSIFIER' Create a Boosted Tree Classifier model using the XGBoost library. CREATE MODEL statement for boosted tree models
'DNN_CLASSIFIER' Create a Deep Neural Network Classifier model. CREATE MODEL statement for DNN models
'DNN_LINEAR_COMBINED_CLASSIFIER' Create a Wide-and-Deep Classifier model. CREATE MODEL statement for Wide-and-Deep models
'AUTOML_CLASSIFIER' Create a classification model using AutoML Tables. CREATE MODEL statement for AutoML Tables models
Clustering 'KMEANS' K-means clustering for data segmentation; for example, identifying customer segments. CREATE MODEL statement for K-means models
Collaborative Filtering 'MATRIX_FACTORIZATION' Matrix factorization for recommendation systems. For example, given a set of users, items, and some ratings for a subset of the items, creates a model to predict a user's rating for items they have not rated. CREATE MODEL statement for matrix factorization models
Dimensionality Reduction 'PCA' Principal component analysis for dimensionality reduction. CREATE MODEL statement for PCA models
'AUTOENCODER' Create an Autoencoder model for anomaly detection, dimensionality reduction, and embedding purposes. CREATE MODEL statement for Autoencoder model
Time series forecasting 'ARIMA_PLUS' (previously 'ARIMA') Univariate time-series forecasting with many modeling components under the hood such as ARIMA model for the trend, STL and ETS for seasonality, holiday effects, and so on. CREATE MODEL statement for time series models
Importing models 'TENSORFLOW' Create a model by importing a TensorFlow model into BigQuery ML. CREATE MODEL statement for TensorFlow models

Other model options

The table below provides a comprehensive list of model options, with a brief description and their applicable model types. You can find detailed description in the model specific CREATE MODEL statement by clicking the model type in the "Applied model types" column.

When the applied model types are supervised learning models, unless "regressor" or "classifier" is explicitly listed, it means that model options apply to both the regressor and the classifier. For example, the "Boosted tree" means that model option applies to both Boosted tree regressor and Boosted tree classifier, while the "Boosted tree classifier" only applies to the classifier.

Name Description Applied model types
INPUT_LABEL_COLS The label column names in the training data. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep,
AutoML Tables
MAX_ITERATIONS The maximum number of training iterations or steps. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep,
Kmeans,
Matrix factorization,
Autoencoder
EARLY_STOP Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for `MIN_REL_PROGRESS`. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep,
Kmeans,
Matrix factorization,
Autoencoder
MIN_REL_PROGRESS The minimum relative loss improvement that is necessary to continue training when `EARLY_STOP` is set to true. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep,
Kmeans,
Matrix factorization,
Autoencoder
DATA_SPLIT_METHOD The method to split input data into training and evaluation sets. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep
Matrix factorization
DATA_SPLIT_EVAL_FRACTION Specifies the fraction of the data used for evaluation, accurate to two decimal places. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep
Matrix factorization
DATA_SPLIT_COL Identifies the column used to split the data. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep
Matrix factorization
OPTIMIZE_STRATEGY The strategy to train linear regression models. Linear regression
L1_REG The amount of L1 regularization applied. Linear & logistic regression,
Boosted trees
L2_REG The amount of L2 regularization applied. Linear & logistic regression,
Boosted trees,
Matrix factorization
LEARN_RATE_STRATEGY The strategy for specifying the learning rate during training. Linear & logistic regression
LEARN_RATE The learn rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. Linear & logistic regression
LS_INIT_LEARN_RATE Sets the initial learning rate that LEARN_RATE_STRATEGY=LINE_SEARCH uses. Linear & logistic regression
WARM_START Retrain a model with new training data, new model options, or both. Linear & logistic regression,
DNN,
Wide & deep,
Kmeans,
Autoencoder
AUTO_CLASS_WEIGHTS Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class. Logistic regression,
Boosted tree classifier,
DNN classifier,
Wide & deep classifier
CLASS_WEIGHTS The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is specified. Logistic regression,
Boosted tree classifier,
DNN classifier,
Wide & deep classifier
NUM_CLUSTERS The number of clusters to identify in the input data. Kmeans
KMEANS_INIT_METHOD The method of initializing the clusters. Kmeans
KMEANS_INIT_COL Identifies the column used to initialize the centroids. Kmeans
DISTANCE_TYPE The type of metric to compute the distance between two points. Kmeans
STANDARDIZE_FEATURES Whether to standardize numerical features. Kmeans
BUDGET_HOURS Sets the training budget hours. AutoML Tables
MODEL_PATH Specifies the location of the TensorFlow model to import. Imported tensorflow model
FEEDBACK_TYPE Specifies feedback type for matrix factorization models which changes the algorithm that is used during training. Matrix factorization
NUM_FACTORS Specifies the number of latent factors. Matrix factorization
USER_COL The user column name. Matrix factorization
RATING_COL The rating column name. Matrix factorization
WALS_ALPHA A hyperparameter for matrix factorization models with IMPLICIT feedback. Matrix factorization
BOOSTER_TYPE For boosted tree models, specify the booster type to use, with default value GBTREE. Boosted trees
NUM_PARALLEL_TREE Number of parallel trees constructed during each iteration. Boosted trees
DART_NORMALIZE_TYPE Type of normalization algorithm for DART booster. Boosted trees
TREE_METHOD Type of tree construction algorithm. Boosted trees
MIN_TREE_CHILD_WEIGHT Minimum sum of instance weight needed in a child for further partitioning. Boosted trees
COLSAMPLE_BYTREE Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. Boosted trees
COLSAMPLE_BYLEVEL Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Boosted trees
COLSAMPLE_BYNODE Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Boosted trees
MIN_SPLIT_LOSS Minimum loss reduction required to make a further partition on a leaf node of the tree. Boosted trees
MAX_TREE_DEPTH Maximum depth of a tree. Boosted trees
SUBSAMPLE Subsample ratio of the training instances. Boosted trees
ACTIVATION_FN Specifies the activation function of the neural network. DNN,
Wide & deep,
Autoencoder
BATCH_SIZE Specifies the mini batch size of samples that are fed to the neural network. DNN,
Wide & deep,
Autoencoder
DROPOUT Specifies the dropout rate of units in the neural network. DNN,
Wide & deep,
Autoencoder
HIDDEN_UNITS Specifies the hidden layers of the neural network. DNN,
Wide & deep,
Autoencoder
OPTIMIZER Specifies the optimizer for training the model. DNN,
Wide & deep,
Autoencoder
TIME_SERIES_TIMESTAMP_COL The timestamp column name for time series models. ARIMA_PLUS
TIME_SERIES_DATA_COL The data column name for time series models. ARIMA_PLUS
TIME_SERIES_ID_COL The ID column names for time-series models. ARIMA_PLUS
HORIZON The number of time points to forecast. When forecasting multiple time series at once, this parameter applies to each time series. ARIMA_PLUS
AUTO_ARIMA Whether the training process should use auto.ARIMA or not. ARIMA_PLUS
AUTO_ARIMA_MAX_ORDER The maximum value for the sum of non-sesonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. ARIMA_PLUS
NON_SEASONAL_ORDER The tuple of non-seasonal p, d, and q for the ARIMA_PLUS model. ARIMA_PLUS
DATA_FREQUENCY The data frequency of the input time series. ARIMA_PLUS
INCLUDE_DRIFT Should the ARIMA_PLUS model include a linear drift term or not. ARIMA_PLUS
HOLIDAY_REGION The geographical region based on which the holiday effect is applied in modeling. ARIMA_PLUS
CLEAN_SPIKES_AND_DIPS Whether the spikes and dips should be cleaned. ARIMA_PLUS
ADJUST_STEP_CHANGES Whether the step changes should be adjusted. ARIMA_PLUS
DECOMPOSE_TIME_SERIES Whether the separate components of both the history and the forecast parts of the time series (such as seasonal components) should be saved. ARIMA_PLUS
ENABLE_GLOBAL_EXPLAIN Specifies whether to compute global explanations using explainable AI to evaluate global feature importance to the model. Linear & logistic regression,
Boosted trees,
DNN,
Wide & deep
INTEGRATED_GRADIENTS_NUM_STEPS Specifies the number of steps to sample between the example being explained and its baseline for approximating the integral in integrated gradients attribution methods. DNN,
Wide & deep

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

Supported inputs

The CREATE MODEL statement supports the following data types for input label and data split columns.

Supported data types for input label columns

BigQuery ML supports different Standard SQL data types depending on the model type. Supported data types for input_label_cols include:

Model type Supported label types
regression models INT64
NUMERIC
BIGNUMERIC
FLOAT64
classification models Any groupable data type

Supported data types for data split columns

BigQuery ML supports different Standard SQL data types depending on the data split method. Supported data types for data_split_col include:

Data split method Supported column types
custom BOOL
seq INT64
NUMERIC
BIGNUMERIC
FLOAT64
TIMESTAMP

Limitations

CREATE MODEL statements must comply with the following rules:

  • Only one CREATE statement is allowed.
  • When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
  • The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
  • Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.