The CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[TRANSFORM (select_list)]
[OPTIONS(model_option_list)]
[AS query_statement]

model_option_list:
  MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' | 'MATRIX_FACTORIZATION' | 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' | 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' | 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' | 'ARIMA' }
  [, INPUT_LABEL_COLS = string_array ]
  [, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
  [, L1_REG = float64_value ]
  [, L2_REG = float64_value ]
  [, MAX_ITERATIONS = int64_value ]
  [, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
  [, LEARN_RATE = float64_value ]
  [, EARLY_STOP = { TRUE | FALSE } ]
  [, MIN_REL_PROGRESS = float64_value ]
  [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
  [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
  [, DATA_SPLIT_COL = string_value ]
  [, LS_INIT_LEARN_RATE = float64_value ]
  [, WARM_START = { TRUE | FALSE } ]
  [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
  [, CLASS_WEIGHTS = struct_array ]
  [, NUM_CLUSTERS = int64_value ]
  [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ]
  [, KMEANS_INIT_COL = string_value ]
  [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
  [, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
  [, MODEL_PATH = string_value ]
  [, BUDGET_HOURS = float64_value ]
  [, FEEDBACK_TYPE = {'EXPLICIT' | 'IMPLICIT'} ]
  [, NUM_FACTORS = int64_value ]
  [, USER_COL = string_value ]
  [, ITEM_COL = string_value ]
  [, RATING_COL = string_value ]
  [, WALS_ALPHA = float64_value ]
  [, BOOSTER_TYPE = { 'gbtree' | 'dart'} ]
  [, NUM_PARALLEL_TREE = int64_value ]
  [, DART_NORMALIZE_TYPE = { 'tree' | 'forest'} ]
  [, TREE_METHOD = { 'auto' | 'exact' | 'approx' | 'hist'} ]
  [, MIN_TREE_CHILD_WEIGHT = float64_value ]
  [, COLSAMPLE_BYTREE = float64_value ]
  [, COLSAMPLE_BYLEVEL = float64_value ]
  [, COLSAMPLE_BYNODE = float64_value ]
  [, MIN_SPLIT_LOSS = float64_value ]
  [, MAX_TREE_DEPTH = int64_value ]
  [, SUBSAMPLE = float64_value ]
  [, ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' } ]
  [, BATCH_SIZE = int64_value ]
  [, DROPOUT = float64_value ]
  [, HIDDEN_UNITS = int_array ]
  [, OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' } ]
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, TIME_SERIES_ID_COL = string_value ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]

Not all options are applicable for every model type. For TensorFlow models, see the CREATE MODEL statement for TensorFlow models. For AutoML Tables models, see the CREATE MODEL statement for AutoML Tables models.

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`; for example, `myproject.mydataset.mymodel`.

TRANSFORM

TRANSFORM lets you specify all preprocessing during model creation and have it automatically applied during prediction and evaluation.

For example, you can create the following model:

CREATE OR REPLACE MODEL m
  TRANSFORM(ML.FEATURE_CROSS(STRUCT(f1, f2)) as cross_f,
            ML.QUANTILE_BUCKETIZE(f3) OVER() as buckets,
            label_col)
  OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT * FROM t

During prediction, you don't need to preprocess the input again, and the same transformations are automatically restored:

SELECT * FROM ML.PREDICT(MODEL m, (SELECT f1, f2, f3 FROM table))

When the TRANSFORM clause is present, only output columns from the TRANSFORM clause are used in training. Any results from query_statement that don't appear in the TRANSFORM clause are ignored.

The input columns of the TRANSFORM clause are the result of query_statement. So, the final input used in training is the set of columns generated by the following query:

SELECT (select_list) FROM (query_statement);

Input columns of the TRANSFORM clause can be of any SIMPLE type or ARRAY of SIMPLE type. SIMPLE types are non-STRUCT and non-ARRAY data types.

In prediction (ML.PREDICT), users only need to pass in the original columns from the query_statement that are used inside the TRANSFORM clause. The columns dropped in TRANSFORM don't need to be provided during prediction. TRANSFORM is automatically applied to the input data during prediction, including the statistics used in ML analytic functions (for example, ML.QUANTILE_BUCKETIZE).

select_list

You can pass columns from query_statement through to model training without transformation by either using * EXCEPT(), or by listing the column names directly.

Not all columns from query_statement are required to appear in the TRANSFORM clause, so you can drop columns appearing in query_statement by omitting them from the TRANSFORM clause.

You can transform inputs from query_statement by using expressions in select_list. select_list is similar to a normal SELECT statement. select_list supports the following syntax:

  • *
  • * EXCEPT()
  • * REPLACE()
  • expression
  • expression.*

The following cannot appear inside select_list:

  • Aggregation functions.
  • Non-ML analytic functions. You can find all ML analytic functions in the preprocessing functions documentation.
  • UDFs.
  • Subqueries.
  • Anonymous columns. For example, "a + b as c" is allowed, while "a + b" is not.

The output columns of select_list can be of any BigQuery ML supported data type.

If present, the following columns must appear in select_list without transformation:

  • label
  • data_split_col
  • kmeans_init_col

If these columns are returned by query_statement, you must reference them in select_list by column name outside of any expression, or by using *. Using an alias with these columns is disallowed.

model_option_list

In the model_option_list, the model_type option is required. All others are optional.

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' | 'MATRIX_FACTORIZATION' |
'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' | 'BOOSTED_TREE_CLASSIFIER' |
'DNN_CLASSIFIER' | 'DNN_REGRESSOR' | 'ARIMA' }

Description

Specify the model type. This option is required.

Arguments

'LINEAR_REG' Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued: they cannot be +/- infinity or NaN.

'LOGISTIC_REG' Logistic regression for classification; for example, determining whether a customer will make a purchase. This model can be of two types:

  • Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
  • Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.

'KMEANS' K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.

'TENSORFLOW' (beta) Create a model by importing a TensorFlow model into BigQuery ML. See the CREATE MODEL statement for TensorFlow models for more information.

'MATRIX_FACTORIZATION' (beta) Matrix factorization for recommendation systems. For example, given a set of users, items, and some preferences or ratings for a subset of the items, creates a model to predict a user's preference for items they have not rated or interacted with before. For more information, see the CREATE MODEL statement for matrix factorization models.

'AUTOML_REGRESSOR' (beta) Create a regression model using AutoML Tables. See the CREATE MODEL statement for AutoML Tables models for more information.

'AUTOML_CLASSIFIER' (beta) Create a classification model using AutoML Tables. See the CREATE MODEL statement for AutoML Tables models for more information.

'BOOSTED_TREE_CLASSIFIER' (beta) Create a Boosted Tree Classifier model using the XGBoost library. See the CREATE MODEL statement for boosted tree models for more information.

'BOOSTED_TREE_REGRESSOR' (beta) Create a Boosted Tree Regressor model using the XGBoost library. See the CREATE MODEL statement for boosted tree models for more information.

'DNN_CLASSIFIER' (beta) Create a Deep Neural Network Classifier model. See the CREATE MODEL statement for DNN models for more information.

'DNN_REGRESSOR' (beta) Create a Deep Neural Network Regressor model. See the CREATE MODEL statement for DNN models for more information.

'ARIMA' (beta) Univariate ARIMA-based time series model for time series forecasting. For example, given one or multiple time series, creates a model or a set of time series models at once to forecast future data points. For more information, see the CREATE MODEL statement for time series models.

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The label column name(s) in the training data.

Arguments

string_array is an ARRAY of STRINGs. The linear regression and logistic regression model types only support string_array values that contain one element.
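For example, the following minimal sketch (the table `mydataset.mytable` and the column my_label are hypothetical names used only for illustration) designates a single column as the label:

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    INPUT_LABEL_COLS=['my_label'] ) AS  -- my_label is a hypothetical label column
SELECT
  *
FROM
  `mydataset.mytable`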

OPTIMIZE_STRATEGY

Syntax

OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }

Description

The strategy to train linear regression models.

Arguments

Accepts the following values:

'AUTO_STRATEGY' Determine the training strategy as follows:

  • If l1_reg or warm_start is specified, batch_gradient_descent strategy is used.
  • If the total cardinality of training features is more than 10,000, the batch_gradient_descent strategy is used.
  • If there is a potential overfitting issue, that is, the number of training examples is less than 10x, where x is the total cardinality, the batch_gradient_descent strategy is used.
  • NORMAL_EQUATION strategy is used for all other cases.

'BATCH_GRADIENT_DESCENT' Train the model using the batch gradient descent method, which optimizes the loss function using the gradient function.

'NORMAL_EQUATION' Directly compute the least square solution of the linear regression problem with the analytical formula. Normal equation cannot be used in the following cases:

  • l1_reg is specified.
  • warm_start is specified.
  • Total cardinality of training features is more than 10,000.

The default value is 'AUTO_STRATEGY'.

L1_REG

Syntax

L1_REG = float64_value

Description

The amount of L1 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

L2_REG

Syntax

L2_REG = float64_value

Description

The amount of L2 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

MAX_ITERATIONS

Syntax

MAX_ITERATIONS = int64_value

Description

The maximum number of training iterations or steps.

Arguments

int64_value is an INT64. The default value is 20.

LEARN_RATE_STRATEGY

Syntax

LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }

Description

The strategy for specifying the learning rate during training.

Arguments

'LINE_SEARCH' Use the line search method to calculate the learning rate. The line search initial learn rate is the value specified for LS_INIT_LEARN_RATE.

Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.

'CONSTANT' Set learning rate to the value specified for LEARN_RATE.

The default value is 'LINE_SEARCH'.

LEARN_RATE

Syntax

LEARN_RATE = float64_value

Description

The learn rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. If LEARN_RATE_STRATEGY is set to 'LINE_SEARCH', an error is returned.

Arguments

float64_value is a FLOAT64. The default value is 0.1.
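As a minimal sketch (with hypothetical table and column names), the following query trains a linear regression model with a constant learning rate of 0.05; LEARN_RATE_STRATEGY must be set to 'CONSTANT' for LEARN_RATE to apply:

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LEARN_RATE_STRATEGY='CONSTANT',
    LEARN_RATE=0.05 ) AS
SELECT
  column1,
  label
FROM
  `mydataset.mytable`  -- hypothetical training table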

EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }

Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.
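For example, the following sketch (hypothetical table and column names) stops training as soon as an iteration improves the loss by less than 0.5%, even if MAX_ITERATIONS has not been reached:

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    EARLY_STOP=TRUE,
    MIN_REL_PROGRESS=0.005,
    MAX_ITERATIONS=50 ) AS
SELECT
  column1,
  label
FROM
  `mydataset.mytable`  -- hypothetical training table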

DATA_SPLIT_METHOD

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }

Description

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping.

Arguments

Accepts the following values:

'AUTO_SPLIT' The automatic split strategy is as follows:

  • When there are fewer than 500 rows in the input data, all rows are used as training data.
  • When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
  • When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a RANDOM split.

'RANDOM' Split data randomly. A random split is deterministic: different training runs produce the same split results if the underlying training data remains the same.

'CUSTOM' Split data using a customer provided column of type BOOL. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.

'SEQ' Split data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows including NULLs are used as evaluation data.

'NO_SPLIT' Use all data as training data.

The default value is 'AUTO_SPLIT' for linear regression and logistic regression models.

The default value is 'NO_SPLIT' for matrix factorization models. If you decide to use a different split method, please ensure all users and all items have sufficient rows in the training set to build a good model.

This option is not supported by TensorFlow or k-means models.

DATA_SPLIT_EVAL_FRACTION

Syntax

DATA_SPLIT_EVAL_FRACTION = float64_value

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.

Arguments

float64_value is a FLOAT64. The default value is 0.2.

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

  • When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.

  • When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.

For information on supported input types, see Supported input types for DATA_SPLIT_COL.

For more information about using DATA_SPLIT_COL with AutoML model types, see the CREATE MODEL statement for AutoML Tables models.

Arguments

string_value is a STRING.

LS_INIT_LEARN_RATE

Syntax

LS_INIT_LEARN_RATE = float64_value

Description

Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH' uses. This option can only be used if LINE_SEARCH is specified.

If the model LEARN_RATE appears to be doubling every iteration as indicated by ML.TRAINING_INFO, try setting LS_INIT_LEARN_RATE to the last doubled learn rate. The optimal initial learn rate is different for every model. A good initial learn rate for one model might not be a good initial learn rate for another.

Arguments

float64_value is a FLOAT64.

WARM_START

Syntax

WARM_START = { TRUE | FALSE }

Description

Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The value of the MODEL_TYPE and LABELS options and the training data schema must remain constant in a warm start. The warm_start option is only supported for retraining LINEAR_REG, LOGISTIC_REG, and KMEANS models.

Arguments

Accepts a BOOL. The default value is FALSE.
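As a sketch, assuming `mydataset.mymodel` already exists as a LOGISTIC_REG model and `mydataset.new_training_data` (a hypothetical table) has the same schema as the original training data, the following query retrains it with a warm start:

CREATE OR REPLACE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    WARM_START=TRUE ) AS  -- continues from the existing model's weights
SELECT
  *
FROM
  `mydataset.new_training_data`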

AUTO_CLASS_WEIGHTS

Syntax

AUTO_CLASS_WEIGHTS = { TRUE | FALSE }

Description

Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.

Use this option with the BOOSTED_TREE_CLASSIFIER, DNN_CLASSIFIER, and binary or multiclass LOGISTIC_REG model types.

By default, the training data used to create a multiclass logistic regression, boosted tree classifier and dnn classifier model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired.

When AUTO_CLASS_WEIGHTS is TRUE, the weight for each class is calculated using the following formula:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)
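For example, with 1,000 input rows split across two classes, 900 labeled 'a' and 100 labeled 'b', class 'a' receives a weight of 1000 / (900 * 2) ≈ 0.56 and class 'b' receives a weight of 1000 / (100 * 2) = 5, so errors on the rare class count more heavily during training.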

Arguments

Accepts a BOOL. The default value is FALSE.

CLASS_WEIGHTS

Syntax

CLASS_WEIGHTS = struct_array

Description

The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is specified.

Arguments

struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING that contains the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]

NUM_CLUSTERS

Syntax

NUM_CLUSTERS = int64_value

Description

For a k-means model, the number of clusters to identify in the input data.

Arguments

int64_value is an INT64. Allowed values are 2-100. The default value is log10(n), where n is the number of training examples.

KMEANS_INIT_METHOD

Syntax

KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }

Description

For a k-means model, the method of initializing the clusters.

To use the same centroids in repeated CREATE MODEL queries, specify the option 'CUSTOM'.

Arguments

'RANDOM': Initializes the centroids by randomly selecting NUM_CLUSTERS data points from the input data.

'KMEANS++': Initializes NUM_CLUSTERS centroids by using the KMEANS++ algorithm. 'KMEANS++' trains a better model than 'RANDOM' cluster initialization.

'CUSTOM': Initializes the centroids using a provided column of type BOOL. This column is specified by the option 'KMEANS_INIT_COL'. BigQuery ML uses the rows with a value of TRUE as the initial centroids. When this option is present and the values in 'KMEANS_INIT_COL' are constant, repeated CREATE MODEL queries use the same centroids.

The default value is 'RANDOM'.

KMEANS_INIT_COL

Syntax

KMEANS_INIT_COL = string_value

Description

For a k-means model, identifies the column used to initialize the centroids. If this column contains a value of TRUE for a given row, then BigQuery ML uses that row as an initial centroid.

This option can only be specified when 'KMEANS_INIT_METHOD' has the value 'CUSTOM'. The corresponding column must be of type BOOL. Model option NUM_CLUSTERS must be present in the query and its value must equal the total number of TRUE rows in this column. BigQuery ML cannot use this column as a feature and excludes it from features automatically.

Arguments

string_value is a STRING.

DISTANCE_TYPE

Syntax

DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }

Description

For a k-means model, the type of metric to compute the distance between two points.

Arguments

Accepts the following values:

'EUCLIDEAN' Use the following equation to calculate the distance between point x and y:

$$ \lVert x-y\rVert_{2} $$

'COSINE' Use the following equation to calculate the distance:

$$ \sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}} $$

where \( \lVert x\rVert_{2} \) represents the L2 norm for x.

The default value is 'EUCLIDEAN'.

STANDARDIZE_FEATURES

Syntax

STANDARDIZE_FEATURES = { TRUE | FALSE }

Description

For a k-means model, whether to standardize numerical features.

Arguments

Accepts a BOOL. The default value is TRUE.

MODEL_PATH

Syntax

MODEL_PATH = string_value

Description

For TensorFlow model types, specifies the location of the TensorFlow model to import.

string_value is the location of a Google Cloud Storage bucket that contains the model to import.

See the CREATE MODEL statement for TensorFlow models for more information.

Example

MODEL_PATH = 'gs://bucket/path/to/saved_model/*'

FEEDBACK_TYPE

Syntax

FEEDBACK_TYPE = { 'EXPLICIT' | 'IMPLICIT' }

Description

Specifies the feedback type for matrix factorization models, which changes the algorithm used during training.

For more information, see matrix factorization models.

Arguments

The default value is 'EXPLICIT'.

NUM_FACTORS

Syntax

NUM_FACTORS = int64_value

Description

Specifies the number of latent factors to use for matrix factorization models.

Arguments

int64_value is an 'INT64'. Allowed values are 2-200. The default value is log2(n), where n is the number of training examples.

USER_COL

Syntax

USER_COL = string_value

Description

The user column name for matrix factorization models.

Arguments

string_value is a 'STRING'. The default value is 'user'.

ITEM_COL

Syntax

ITEM_COL = string_value

Description

The item column name for matrix factorization models.

Arguments

string_value is a 'STRING'. The default value is 'item'.

RATING_COL

Syntax

RATING_COL = string_value

Description

The rating column name for matrix factorization models.

Arguments

string_value is a 'STRING'. The default value is 'rating'.
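Putting the matrix factorization column options together, the following sketch assumes a hypothetical table `mydataset.ratings` with columns user_id, item_id, and rating, and maps those columns explicitly:

CREATE MODEL
  `mydataset.my_mf_model`
OPTIONS
  ( MODEL_TYPE='MATRIX_FACTORIZATION',
    FEEDBACK_TYPE='EXPLICIT',
    USER_COL='user_id',
    ITEM_COL='item_id',
    RATING_COL='rating',
    NUM_FACTORS=16 ) AS
SELECT
  user_id,
  item_id,
  rating
FROM
  `mydataset.ratings`  -- hypothetical ratings table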

WALS_ALPHA

Syntax

WALS_ALPHA = float64_value

Description

A hyperparameter for 'IMPLICIT' matrix factorization model. For more information, see matrix factorization models.

Arguments

float64_value is a 'FLOAT64'. The default value is 40.

BUDGET_HOURS

Syntax

BUDGET_HOURS = float64_value

Description

Sets the training budget for AutoML Tables training, specified in hours. Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used if MODEL_TYPE is AUTOML_REGRESSOR or AUTOML_CLASSIFIER.

After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.

For more information, see the CREATE MODEL statement for AutoML Tables models.

Arguments

float64_value is a FLOAT64.
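For example, the following sketch (hypothetical table and column names) trains an AutoML Tables classifier with a two-hour training budget:

CREATE MODEL
  `mydataset.my_automl_model`
OPTIONS
  ( MODEL_TYPE='AUTOML_CLASSIFIER',
    INPUT_LABEL_COLS=['label_col'],
    BUDGET_HOURS=2.0 ) AS
SELECT
  *
FROM
  `mydataset.mytable`  -- hypothetical training table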

BOOSTER_TYPE

Syntax

BOOSTER_TYPE =  { 'GBTREE' | 'DART'}

Description

For boosted tree models, specify the booster type to use, with default value GBTREE.

Arguments

The following values are available:

  • 'GBTREE': standard gradient boosted trees (the default)
  • 'DART': gradient boosted trees with dropout

NUM_PARALLEL_TREE

Syntax

NUM_PARALLEL_TREE = int64_value

Description

Number of parallel trees constructed during each iteration. Default value is 1. To train a boosted random forest, set this value larger than 1.

DART_NORMALIZE_TYPE

Syntax

DART_NORMALIZE_TYPE = { 'TREE' | 'FOREST'}

Description

Type of normalization algorithm for DART booster. Default value is TREE.

TREE_METHOD

Syntax

TREE_METHOD = { 'AUTO' | 'EXACT' | 'APPROX' | 'HIST'}

Description

Type of tree construction algorithm. Default value is AUTO.

MIN_TREE_CHILD_WEIGHT

Syntax

MIN_TREE_CHILD_WEIGHT = float64_value

Description

Minimum sum of instance weight needed in a child for further partitioning. If the tree partition step results in a leaf node with the sum of instance weight less than min_tree_child_weight, then the building process will give up further partitioning. The larger the value of min_tree_child_weight, the more conservative the algorithm will be. The value should be greater than or equal to 0, with default value 1.

COLSAMPLE_BYTREE

Syntax

COLSAMPLE_BYTREE = float64_value

Description

Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. The value should be between 0 and 1, with default value 1.

COLSAMPLE_BYLEVEL

Syntax

COLSAMPLE_BYLEVEL = float64_value

Description

Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree. The value should be between 0 and 1, with default value 1.

COLSAMPLE_BYNODE

Syntax

COLSAMPLE_BYNODE = float64_value

Description

Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. The value should be between 0 and 1, with default value 1.

MIN_SPLIT_LOSS

Syntax

MIN_SPLIT_LOSS = float64_value

Description

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value of min_split_loss, the more conservative the algorithm will be. Default value is 0.

MAX_TREE_DEPTH

Syntax

MAX_TREE_DEPTH = int64_value

Description

Maximum depth of a tree. Default value is 6.

SUBSAMPLE

Syntax

SUBSAMPLE = float64_value

Description

Subsample ratio of the training instances. Setting this value to 0.5 means that training randomly samples half of the training data prior to growing trees, which prevents overfitting. Subsampling will occur once in every boosting iteration. This is independent of the training-test data split used in the training options (80/20 random by default). The test data is not used in any iteration irrespective of subsample; subsample is only applied to the training data. The value should be between 0 and 1, with default value 1.0 (use all of the training data in each iteration).
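Combining several of these options, the following sketch (hypothetical table and column names) trains a boosted tree classifier with shallower trees plus row and column subsampling:

CREATE MODEL
  `mydataset.my_boosted_tree_model`
OPTIONS
  ( MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
    BOOSTER_TYPE='GBTREE',
    MAX_TREE_DEPTH=4,
    SUBSAMPLE=0.8,
    COLSAMPLE_BYTREE=0.8,
    INPUT_LABEL_COLS=['label_col'] ) AS
SELECT
  *
FROM
  `mydataset.mytable`  -- hypothetical training table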

ACTIVATION_FN

Syntax

ACTIVATION_FN =  { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }

Description

For DNN model types, specifies the activation function of the neural network.

Arguments

The following values are available: 'RELU', 'RELU6', 'CRELU', 'ELU', 'SELU', 'SIGMOID', and 'TANH'.

BATCH_SIZE

Syntax

BATCH_SIZE = int64_value

Description

For DNN model types, specifies the mini batch size of samples that are fed to the neural network.

Arguments

The default value is the smaller of 1024 and the number of samples.

The max value is 8192.

DROPOUT

Syntax

DROPOUT = float64_value

Description

For DNN model types, specifies the dropout rate of units in the neural network.

Arguments

The default value is 0. A valid input must be between 0.0 and 1.0.

HIDDEN_UNITS

Syntax

HIDDEN_UNITS = int_array

Description

For DNN model types, specifies the hidden layers of the neural network.

Arguments

An array of integers that represents the architecture of the hidden layers. If not specified, BigQuery ML applies a single hidden layer that contains no more than 128 units.

Example

HIDDEN_UNITS = [256, 128, 64]

This example represents an architecture of 3 hidden layers with 256, 128, and 64 nodes respectively.

OPTIMIZER

Syntax

OPTIMIZER =  { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' }

Description

For DNN model types, specifies the optimizer for training the model.

Arguments

The following values are available: 'ADAGRAD', 'ADAM', 'FTRL', 'RMSPROP', and 'SGD'.
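Combining the DNN options above, the following sketch (hypothetical table and column names) trains a DNN classifier with two hidden layers, dropout, and the ADAM optimizer:

CREATE MODEL
  `mydataset.my_dnn_model`
OPTIONS
  ( MODEL_TYPE='DNN_CLASSIFIER',
    HIDDEN_UNITS=[128, 64],
    ACTIVATION_FN='RELU',
    DROPOUT=0.2,
    BATCH_SIZE=256,
    OPTIMIZER='ADAM',
    INPUT_LABEL_COLS=['label_col'] ) AS
SELECT
  *
FROM
  `mydataset.mytable`  -- hypothetical training table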

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL = string_value

Description

The timestamp column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL = string_value

Description

The data column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_ID_COL

Syntax

TIME_SERIES_ID_COL = string_value

Description

The ID column name for time series models. This column is used when the user wants to fit and forecast multiple time series using a single query. Different IDs indicate different time series.

Arguments

string_value is a 'STRING'.

HORIZON

Syntax

HORIZON = int64_value

Description

The number of time points to forecast. When forecasting multiple time series at once, this parameter applies to each time series.

Arguments

The value is an INT64. The default value is 1,000. The maximum value is 10,000.
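For example, the following sketch (assuming a hypothetical table `mydataset.sales` with columns date_col and sales_col) fits an ARIMA model and forecasts 30 future time points:

CREATE MODEL
  `mydataset.my_arima_model`
OPTIONS
  ( MODEL_TYPE='ARIMA',
    TIME_SERIES_TIMESTAMP_COL='date_col',
    TIME_SERIES_DATA_COL='sales_col',
    HORIZON=30 ) AS
SELECT
  date_col,
  sales_col
FROM
  `mydataset.sales`  -- hypothetical time series table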

AUTO_ARIMA

Syntax

AUTO_ARIMA = { TRUE | FALSE }

Description

Whether the training process should use auto.ARIMA or not. If true, training automatically finds the best non-seasonal order (for example, the p, d, q tuple) and decides whether or not to include a linear drift term when d is 1. If false, the user must specify non_seasonal_order in the query. When forecasting multiple time series at the same time, the auto.ARIMA algorithm must be used for each time series, so this option must not be set to false.

Arguments

The value is a BOOL. The default value is TRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER = int64_value

Description

The maximum value for the sum of non-seasonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. Currently, the allowed values are (2, 3, 4, 5). As a reference, for each value there are (6, 10, 15, 21) candidate models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal d is determined to be 1, the number of candidate models to evaluate doubles, because there is an additional drift term to consider for all the existing candidate models. This option is disabled when AUTO_ARIMA is set to false.

Arguments

The value is an INT64. The default value is 5. The minimum value is 2, and the maximum value is 5.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)

Description

The tuple of non-seasonal p, d, q for the ARIMA model. There are no default values and you must specify all of them. You must explicitly set auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2]. When forecasting multiple time series at once, one specific non-seasonal order is very unlikely to fit all the time series, and the auto.ARIMA algorithm is used to find the best non-seasonal order for each time series. Therefore, this option is disabled in this case.

Arguments

(int64_value, int64_value, int64_value) is a tuple of three 'INT64'.

DATA_FREQUENCY

Syntax

DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }

Description

The data frequency of the input time series. The finest supported granularity is 'HOURLY'. When forecasting multiple time series at once, this argument cannot be changed from the default value 'AUTO_FREQUENCY'.

Arguments

Accepts the following values:

'AUTO_FREQUENCY': the training process automatically infers the data frequency, which can be one of the values listed below.

'HOURLY': hourly time series

'DAILY': daily time series

'WEEKLY': weekly time series

'MONTHLY': monthly time series

'QUARTERLY': quarterly time series

'YEARLY': yearly time series

The default value is 'AUTO_FREQUENCY'.

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT = { TRUE | FALSE }

Description

Whether the ARIMA model should include a linear drift term. The drift term is applicable when non-seasonal d is 1.

  • When auto_arima is set to false, this argument defaults to false. It can be set to true only when non-seasonal d is 1; otherwise, an invalid query error is returned.

  • When auto_arima is set to true, the training process automatically decides whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.

Arguments

The value is a BOOL. The default value is FALSE when auto_arima is disabled.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... }

Description

The geographical region based on which the holiday effects are applied in modeling. By default, holiday effects modeling is disabled. To turn it on, specify the holiday region using this option.

Arguments

Accepts the following values:

Top level: global

  • 'GLOBAL'

Second level: continental regions

  • 'NA': North America
  • 'JAPAC': Japan and Asia Pacific
  • 'EMEA': Europe, the Middle East and Africa
  • 'LAC': Latin America and the Caribbean

Third level: countries/regions

  • 'AE': United Arab Emirates
  • 'AR': Argentina
  • 'AT': Austria
  • 'AU': Australia
  • 'BE': Belgium
  • 'BR': Brazil
  • 'CA': Canada
  • 'CH': Switzerland
  • 'CL': Chile
  • 'CN': China
  • 'CO': Colombia
  • 'CZ': Czechia
  • 'DE': Germany
  • 'DK': Denmark
  • 'DZ': Algeria
  • 'EC': Ecuador
  • 'EE': Estonia
  • 'EG': Egypt
  • 'ES': Spain
  • 'FI': Finland
  • 'FR': France
  • 'GB': United Kingdom
  • 'GR': Greece
  • 'HK': Hong Kong
  • 'HU': Hungary
  • 'ID': Indonesia
  • 'IE': Ireland
  • 'IL': Israel
  • 'IN': India
  • 'IR': Iran
  • 'IT': Italy
  • 'JP': Japan
  • 'KR': South Korea
  • 'LV': Latvia
  • 'MA': Morocco
  • 'MX': Mexico
  • 'MY': Malaysia
  • 'NG': Nigeria
  • 'NL': Netherlands
  • 'NO': Norway
  • 'NZ': New Zealand
  • 'PE': Peru
  • 'PH': Philippines
  • 'PK': Pakistan
  • 'PL': Poland
  • 'PT': Portugal
  • 'RO': Romania
  • 'RS': Serbia
  • 'RU': Russia
  • 'SA': Saudi Arabia
  • 'SE': Sweden
  • 'SG': Singapore
  • 'SI': Slovenia
  • 'SK': Slovakia
  • 'TH': Thailand
  • 'TR': Turkey
  • 'TW': Taiwan
  • 'UA': Ukraine
  • 'US': United States
  • 'VE': Venezuela
  • 'VN': Vietnam
  • 'ZA': South Africa

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

Supported inputs

The CREATE MODEL statement supports the following data types for input label and data split columns.

Supported data types for input label columns

BigQuery ML supports different Standard SQL data types depending on the model type. Supported data types for input_label_cols include:

The supported label types for each model type are:

  • linear_reg: INT64, NUMERIC, FLOAT64
  • logistic_reg: any groupable data type
  • kmeans: any groupable data type

Supported data types for data split columns

BigQuery ML supports different Standard SQL data types depending on the data split method. Supported data types for data_split_col include:

The supported column types for each data split method are:

  • custom: BOOL
  • seq: INT64, NUMERIC, FLOAT64, TIMESTAMP

Limitations

CREATE MODEL statements must comply with the following rules:

  • Only one CREATE statement is allowed.
  • When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
  • The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
  • Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.
  • For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
  • For logistic regression models, the label columns can contain up to 50 unique values; that is, the number of classes is less than or equal to 50.

Input variable transformations

BigQuery ML transforms input variables or features as follows:

  • INT64, NUMERIC, FLOAT64 (Standardization): For all numerical columns, BigQuery ML standardizes and centers the column at zero before passing it into training, except for Boosted Tree models. When creating a k-means model, the STANDARDIZE_FEATURES option specifies whether to standardize numerical features.

  • BOOL, STRING, BYTES, DATE, DATETIME, TIME (One-hot encoding): For all non-numerical, non-ARRAY columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation, except for Boosted Tree models. This transformation generates a separate feature for each unique value in the column. For Boosted Tree models, a label encoding transformation is applied instead, converting each unique value into a numerical value.

  • ARRAY (Multi-hot encoding): For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY.

  • TIMESTAMP (Timestamp transformation): When BigQuery ML encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix time in seconds component, BigQuery ML uses standardization; for all other components, it uses one-hot encoding. You can use the ML.WEIGHTS function to see the transformation of a TIMESTAMP column into multiple feature columns.

  • STRUCT (Struct expansion): When BigQuery ML encounters a STRUCT column, it expands the fields inside the STRUCT into single columns. All fields must be named. Nested STRUCTs are not allowed. The column names after expansion are in the format {struct_name}_{field_name}.

Imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.

Numerical columns

In both training and prediction, NULL values in numeric columns are replaced with the mean value as calculated by the feature column in the original input data.
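For example, if a numeric feature column contains the values 2, 4, and NULL in the training data, the NULL value is imputed with the column mean of 3.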

One-hot/Multi-hot encoded columns

In both training and prediction, NULL values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction.

Timestamp columns

TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces missing values with the mean Unix time across the original column. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature.

STRUCT columns

In both training and prediction, each field of the STRUCT is imputed according to its type.

CREATE MODEL examples

The CREATE MODEL statement creates a model with the specified options. If the model name exists in the dataset, the following error is returned:

Already Exists: [PROJECT_ID]:[DATASET].[MODEL]

To replace an existing model, use the CREATE OR REPLACE MODEL statement instead.

The following examples create models named mymodel in mydataset in your default project.

Training a linear regression model

The following example creates and trains a linear regression model. The initial learn rate for line search is set to .15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=.15,
    L1_REG=1,
    MAX_ITERATIONS=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a sequential data split

In this example, you create a linear regression model with a sequential data split. The split fraction is .3, and the split uses the timestamp column as the basis for the split.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    LS_INIT_LEARN_RATE=.15,
    L1_REG=1,
    MAX_ITERATIONS=5,
    DATA_SPLIT_METHOD='SEQ',
    DATA_SPLIT_EVAL_FRACTION=0.3,
    DATA_SPLIT_COL='timestamp' ) AS
SELECT
  column1,
  column2,
  column3,
  timestamp,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a custom data split

In this example, you have manually split your data into a training table and an evaluation table. The training table is named training_table. The evaluation table is named evaluation_table.

This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LINEAR_REG',
    DATA_SPLIT_METHOD='CUSTOM',
    DATA_SPLIT_COL='SPLIT_COL' ) AS
SELECT
  *,
  false AS split_col
FROM
  `mydataset.training_table`
UNION ALL
SELECT
  *,
  true AS split_col
FROM
  `mydataset.evaluation_table`

In this example, all the columns in the training table and in the evaluation table are either features or the label. The query uses SELECT * and UNION ALL to combine the two tables and adds a split_col column that marks the evaluation rows.

Training a multiclass logistic regression model with automatically calculated weights

In this example, you create a multiclass logistic regression model using the auto_class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    AUTO_CLASS_WEIGHTS=TRUE ) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a multiclass logistic regression model with specified weights

In this example, you create a multiclass logistic regression model using the class_weights option. The class labels are label1, label2, and label3.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a logistic regression model with specified weights

In this example, you create a logistic regression model using the class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='LOGISTIC_REG',
    CLASS_WEIGHTS=[('0', 0.9), ('1', 0.1)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a k-means model

This example creates a k-means model with four clusters using the default DISTANCE_TYPE value of 'EUCLIDEAN'.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=4 ) AS
SELECT
  *
FROM `mydataset.mytable`

Training a k-means model with random clusters

This example creates a k-means model with three clusters using the random cluster initialization method.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=3,
    KMEANS_INIT_METHOD='RANDOM') AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a k-means model with custom clusters

This example creates a k-means model with three clusters using the custom cluster initialization method. init_col identifies the column of type BOOL that contains the values which specify whether a given row is an initial centroid. This column should only contain three rows with the value TRUE.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=3,
    KMEANS_INIT_METHOD='CUSTOM',
    KMEANS_INIT_COL='init_col') AS
SELECT
  init_col,
  features
FROM
  `mydataset.mytable`

Model creation with TRANSFORM, while excluding original columns

The following query trains a model after adding the columns f1 and f2 from the SELECT statement to form a new column c; the columns f1 and f2 are omitted from the training data. Model training uses columns f3 and label_col as they appear in the data source t.

CREATE MODEL m
  TRANSFORM(f1 + f2 as c, * EXCEPT(f1, f2))
  OPTIONS(model_type='linear_reg', input_label_cols=['label_col'])
AS SELECT f1, f2, f3, label_col FROM t;