The CREATE MODEL Statement

Models in BigQuery ML

A model in BigQuery ML represents what an ML system has learned from the training data.

The following types of models are supported by BigQuery ML:

  • Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
  • Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
  • Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.
  • K-means clustering for data segmentation (beta); for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.

In BigQuery ML, a model can be used with data from multiple BigQuery datasets for training and for prediction.

CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

CREATE MODEL limitations

BigQuery ML CREATE MODEL statements are subject to the following limitations:

  • When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
  • When you use a CREATE MODEL statement, the label column cannot contain NULL values. If the label column contains NULL values, the query fails.
  • Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(model_option_list)]
[AS query_statement]

Where:

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} is one of the following statements:

  • CREATE MODEL — creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error. Unless you specify split options, the training dataset is automatically split into training and evaluation data.
  • CREATE MODEL IF NOT EXISTS — creates and trains a new model only if the model does not currently exist in the specified dataset.
  • CREATE OR REPLACE MODEL — creates and trains a model and replaces an existing model with the same name in the specified dataset.

CREATE MODEL statements must comply with the following rules:

  • Only one CREATE statement is allowed.
  • For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
  • For logistic regression models, the label column must contain only two distinct values.
  • For multiclass logistic regression models, the label columns can contain up to 50 unique values.
  • In the model_option_list, the model_type option is required. All others are optional.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks); for example, `myproject.mydataset.mymodel`.

model_option_list

The model_option_list allows you to specify additional model options. You can include multiple options using a comma-separated list. Options are case-insensitive.

Specify a model option list in the following format:

NAME=VALUE, ...

NAME and VALUE must be one of the following combinations:

NAME VALUE Details
model_type

linear_reg

logistic_reg

kmeans

linear_reg creates a linear regression model.

logistic_reg creates a logistic regression model or a multiclass logistic regression model. When you create a multiclass logistic regression model, specify training data that contains more than two unique labels.

kmeans creates a k-means clustering model. This unsupervised model does not require labels.

The model_type option is required. All others are optional.

input_label_cols STRING

The label column name(s) in the training data. input_label_cols accepts an array of strings, but only one array element is supported for linear_reg and logistic_reg models.

If input_label_cols is unspecified, the column named "label" in the training data is used. If neither exists, the query fails.

For linear regression models, the label column must be real-valued (the column values must be real numbers). For logistic regression models, the label column must contain only two distinct values.

For more information on supported input label types, see Supported inputs for input_label_cols.

optimize_strategy

auto_strategy

batch_gradient_descent

normal_equation

The strategy to train linear regression models. The default value is auto_strategy.

If batch_gradient_descent is used, training is done using batch gradient descent method, which optimizes the loss function using the gradient function. This is a widely used first-order iterative optimization method.

If normal_equation is used, training algorithm directly computes the least square solution of the linear regression problem with the analytical formula. Normal equation is not applied if:

  1. l1_reg is specified.
  2. warm_start is specified.
  3. Total cardinality of training features is more than 10000.

If optimize_strategy is unspecified or explicitly set to auto_strategy, the optimize strategy is determined as follows:

  1. If l1_reg or warm_start is specified, batch_gradient_descent strategy is used.
  2. If total cardinalities of training features are more than 10000, batch_gradient_descent strategy is used
  3. If there is over-fitting issue, i,e., num of training examples is less than 10x of total cardinality, batch_gradient_descent strategy is used
  4. normal_equation strategy is used for all other cases.
l1_reg FLOAT64

The amount of L1 regularization applied. The default value is 0.

l2_reg FLOAT64

The amount of L2 regularization applied. The default value is 0.

max_iterations INT64

The maximum number of training iterations (steps). The default value is 20.

learn_rate_strategy line_search
constant

Choose the strategy for specifying the learning rate during training. The default value is line_search.

Line search slows down training and increases the number of bytes processed, but the trade off is that it generally converges even with a larger initial learning rate specified.

If the model learn_rate appears to be doubling every iteration (as indicatd by ml.training_info), try setting ls_init_learn_rate to the last doubled learn rate. The optimal initial learn rate is different for every model. A good initial learn rate for one model might not be a good initial learn rate for another.

learn_rate FLOAT64

The learn rate for gradient descent when learn_rate_strategy is set to constant. If learn_rate_strategy is set to line_search, an error is returned. The default value is 0.1.

early_stop BOOL

Indicates training should stop after the first iteration in which the relative loss improvement is less than min_rel_progress. The default value is true.

min_rel_progress FLOAT64

The minimum relative loss improvement necessary to continue training when early_stop is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue. The default value is 0.01.

data_split_method

auto_split

random

custom

seq

no_split

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping. The default value is auto_split.

random splits data randomly. A random split is deterministic — different training runs produce the same split results so long as the underlying training data hasn't changed.

custom splits data using a customer provided BOOL (boolean) column. The rows with a boolean value of "true" are used as evaluation data. The rows with a value of "false" are used as training data.

seq splits data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, STRING or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows (including NULLs) are used as evaluation data.

no_split uses all data as training data.

If data_split_method is unspecified or explicitly set to auto_split, the automatic split strategy is as follows:

  1. When there are less than 500 rows in the input data, all rows are used as training data.
  2. When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a random split.
  3. When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a random split.
data_split_eval_fraction FLOAT64

This option is used with random and seq splits. data_split_eval_fraction is the fraction of the data used for evaluation, accurate to two decimal places. The default value is 0.2.

data_split_col STRING

This option identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

When data_split_method is custom, the corresponding column should be of type BOOL. The rows with true or NULL values are used as evaluation data. Rows with false values are used as training data.

When data_split_method is seq, the last data_split_fraction rows (from smallest to largest) in the corresponding column are used as evaluation data. The first rows are used as training data.

For information on supported input types, see Supported input types for data_split_col.

ls_init_learn_rate DOUBLE

Sets the initial learning rate that learn_rate_strategy='line_search' uses. This option can only be used if line_search is specified.

warm_start BOOL

This option is used to retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run. The default value is false.

In a warm start run, the iteration numbers are reset to start from zero. The training_run number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The model_type and labels options and the training data schema cannot change in a warm start.

auto_class_weights BOOL

By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired. Class weights can be used to balance the class labels and can be used for logistic and multiclass logistic regressions.

If set to true, the weights for each class are calculated in inverse proportion to the frequency of that class. To balance every class, use the following formula: TOTAL_INPUT_ROWS / (INPUT_ROWS_FOR_CLASS_N * NUMBER_OF_UNIQUE_CLASSES).

class_weights Array[STRUCT(STRING, DOUBLE)]

This option cannot be specified if auto_class_weights is specified. The value must be a list of weights to use for each class label. A weight must be provided for every class label. The weights are not required to add up to one.

num_clusters INT64

For a kmeans model, the number of clusters to identify in the input data. Default value is log10(n), where n is the number of training examples. Allowed values are 2-100.

distance_type STRING

For a kmeans model, the type of metric to compute the distance between two points. Default value is EUCLIDEAN. Allowed values are EUCLIDEAN and COSINE.

For the EUCLIDEAN distance type, the following equation calculates the distance between point x and y:

$$ \lVert x-y\rVert_{2} $$

For the COSINE distance type, the following equation calculates the distance:

$$ \sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}} $$
where$$ \lVert x\rVert_{2} $$ represents the L2 norm for x.

standardize_features BOOL

For a kmeans model, whether to standardize numerical features. Default value is TRUE.

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

Supported inputs

The CREATE MODEL statement supports the following input variables or features for the input_label_cols and data_split_cols options.

Supported inputs for input_label_cols

BigQuery ML supports different Standard SQL data types depending on the model type. Supported data types for input_label_cols include:

Model type Supported label types
linear_reg INT64
NUMERIC
FLOAT64
logistic_reg Any groupable data type
kmeans Any groupable data type

Supported inputs for data_split_col

BigQuery ML supports different Standard SQL data types depending on the data split method. Supported data types for data_split_col include:

Data split method Supported column types
custom BOOL
seq INT64
NUMERIC
FLOAT64
TIMESTAMP

Input variable transformations

BigQuery ML transforms input variables or features as follows:

Input data type Transformation method Details
INT64
NUMERIC
FLOAT64
Standardization For all numeric columns, BigQuery ML standardizes and centers the column at zero before passing it into training.
BOOL
STRING
BYTES
DATE
DATETIME
TIME
One-hot encoded For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
ARRAY
STRUCT
Unsupported
TIMESTAMP Timestamp transformation When BigQuery ML encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix time in seconds component, BigQuery ML uses standardization; for all other components, it uses one-hot encoding.

You can use the ML.WEIGHTS function to see the transformation of a TIMESTAMP column into multiple feature columns.

For one-hot encoded features, each column value is an ARRAY of STRUCTs in the following format: [<String_Value, Weight>]. The length of the ARRAY is the cardinality, or number of unique values, of that feature.

Imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.

Numeric columns

In both training and prediction, NULL values in numeric columns are replaced with the mean value as calculated by the feature column in the original input data.

One-hot encoded columns

In both training and prediction, NULL values in one-hot encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction.

Timestamp columns

TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated unix time column, BigQuery ML replaces values with the mean unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature.

CREATE MODEL examples

The CREATE MODEL statement creates a model with the specified options. If the model name exists in the dataset, the following error is returned:

Already Exists: [PROJECT_ID]:[DATASET].[MODEL]

To replace an existing model, use the CREATE OR REPLACE MODEL statement instead.

The following examples create models named mymodel in mydataset in your default project.

Training a linear regression model

The following example creates and trains a linear regression model. The learn rate is set to .15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    ls_init_learn_rate=.15,
    l1_reg=1,
    max_iterations=5 ) AS
SELECT
  column1,
  column2,
  column3,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a sequential data split

In this example, you create a linear regression model with a sequential data split. The split fraction is .3 and split uses the timestamp column as the basis for the split.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    ls_init_learn_rate=.15,
    l1_reg=1,
    max_iterations=5,
    data_split_method='seq',
    data_split_eval_fraction=0.3,
    data_split_col='timestamp' ) AS
SELECT
  column1,
  column2,
  column3,
  timestamp,
  label
FROM
  `mydataset.mytable`
WHERE
  column4 < 10

Training a linear regression model with a custom data split

In this example, you have manually split your data into a training table and an evaluation table. The training table is named training_table. The evaluation table is named evaluation_table.

This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='linear_reg',
    data_split_method='custom',
    data_split_col='split_col' ) AS
SELECT
  *,
  false AS split_col
FROM
  `mydataset.training_table`
UNION ALL
SELECT
  *,
  true AS split_col
FROM
  `mydataset.evaluation_table`

In this example, all the columns in the training table and in the evaluation table are either features or the label. The query uses SELECT * and UNION ALL to append all of the data in the split_col column to the existing data.

Training a multiclass logistic regression model with automatically calculated weights

In this example, you create a multiclass logistic regression model using the auto_class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    auto_class_weights=true ) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a multiclass logistic regression model with specified weights

In this example, you create a multiclass logistic regression model using the class_weights option. The label columns are label1, label2, and label3.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    class_weights=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a logistic regression model with specified weights

In this example, you create a logistic regression model using the class_weights option.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='logistic_reg',
    class_weights=[('0', 0.9), ('1', 0.1)]) AS
SELECT
  *
FROM
  `mydataset.mytable`

Training a k-means model

This example creates a k-means model with four clusters using the default distance_type value of euclidean_distance.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( model_type='kmeans',
    num_clusters=4 ) AS
SELECT
  *
FROM `mydataset.mytable`
¿Te sirvió esta página? Envíanos tu opinión:

Enviar comentarios sobre…

¿Necesitas ayuda? Visita nuestra página de asistencia.