# The CREATE MODEL statement

## CREATE MODEL statement

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

## CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(model_option_list)]
[AS query_statement]

model_option_list:
MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' | 'TENSORFLOW' }
[, INPUT_LABEL_COLS = string_array ]
[, OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' } ]
[, L1_REG = float64_value ]
[, L2_REG = float64_value ]
[, MAX_ITERATIONS = int64_value ]
[, LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' } ]
[, LEARN_RATE = float64_value ]
[, EARLY_STOP = { TRUE | FALSE } ]
[, MIN_REL_PROGRESS = float64_value ]
[, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
[, DATA_SPLIT_EVAL_FRACTION = float64_value ]
[, DATA_SPLIT_COL = string_value ]
[, LS_INIT_LEARN_RATE = float64_value ]
[, WARM_START = { TRUE | FALSE } ]
[, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
[, CLASS_WEIGHTS = struct_array ]
[, NUM_CLUSTERS = int64_value ]
[, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ]
[, STANDARDIZE_FEATURES = { TRUE | FALSE } ]
[, MODEL_PATH = string_value ]


### CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

### CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not currently exist in the specified dataset.

### CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

### model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

• Contain up to 1,024 characters
• Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: [PROJECT_ID].[DATASET].[MODEL] ; for example, myproject.mydataset.mymodel.

### model_option_list

In the model_option_list, the model_type option is required. All others are optional.

CREATE MODEL supports the following options:

#### MODEL_TYPE

Syntax

MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' | 'KMEANS' }


Description

Specify the model type. This option is required.

Arguments

'LINEAR_REG' Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued: they cannot be +/- infinity or NaN.

'LOGISTIC_REG' Logistic regression for classification; for example, determining whether a customer will make a purchase. This model can be of two types:

• Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
• Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross entropy loss function.

'KMEANS' K-means clustering for data segmentation (beta); for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.

'TENSORFLOW' (beta) Create a model by importing a TensorFlow model into BigQuery ML. See the CREATE MODEL statement for TensorFlow models for more information.

#### INPUT_LABEL_COLS

Syntax

 INPUT_LABEL_COLS = string_array 

Description

The label column name(s) in the training data.

Arguments

string_array is an ARRAY of STRINGs. The linear regression and logistic regression model types only support string_array values containing one element.

#### OPTIMIZE_STRATEGY

Syntax

OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' }


Description

The strategy to train linear regression models.

Arguments

Accepts the following values:

'AUTO_STRATEGY' Determine the training strategy as follows:

• If l1_reg or warm_start is specified, batch_gradient_descent strategy is used.
• If total cardinalities of training features are more than 10000, batch_gradient_descent strategy is used.
• If there is over-fitting issue, that is, the number of training examples is less than 10x, where x is total cardinality, batch_gradient_descent strategy is used.
• NORMAL_EQUATION strategy is used for all other cases.

'BATCH_GRADIENT_DESCENT' Train the model using the batch gradient descent method, which optimizes the loss function using the gradient function.

'NORMAL_EQUATION' Directly compute the least square solution of the linear regression problem with the analytical formula. Normal equation cannot be used in the following cases:

• l1_reg is specified.
• warm_start is specified.
• Total cardinality of training features is more than 10,000.

The default value is 'AUTO_STRATEGY'.

#### L1_REG

Syntax

 L1_REG = float64_value 

Description

The amount of L1 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

#### L2_REG

Syntax

 L2_REG = float64_value 

Description

The amount of L2 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

#### MAX_ITERATIONS

Syntax

 MAX_ITERATIONS = int64_value 

Description

The maximum number of training iterations or steps.

Arguments

int64_value is an INT64. The default value is 20.

#### LEARN_RATE_STRATEGY

Syntax

LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' }


Description

The strategy for specifying the learning rate during training.

Arguments

'LINE_SEARCH' Use the line search method to calculate the learning rate. The line search initial learn rate is the value specified for LS_INIT_LEARN_RATE .

Line search slows down training and increases the number of bytes processed, but it generally converges even with a larger initial specified learning rate.

'CONSTANT' Set learning rate to the value specified for LEARN_RATE.

The default value is 'LINE_SEARCH'.

#### LEARN_RATE

Syntax

 LEARN_RATE = float64_value 

Description

The learn rate for gradient descent when LEARN_RATE_STRATEGY is set to CONSTANT. If LEARN_RATE_STRATEGY is set to 'LINE_SEARCH', an error is returned.

Arguments

float64_value is a FLOAT64. The default value is 0.1.

#### EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }


Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

#### MIN_REL_PROGRESS

Syntax

 MIN_REL_PROGRESS = float64_value 

Description

The minimum relative loss improvement necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.

#### DATA_SPLIT_METHOD

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }


Description

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting via early stopping.

Arguments

Accepts the following values:

'AUTO_SPLIT' The automatic split strategy is as follows:

• When there are fewer than 500 rows in the input data, all rows are used as training data.
• When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
• When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a RANDOM split.

'RANDOM' Split data randomly. A random split is deterministic: different training runs produce the same split results if the underlying training data remains the same.

'CUSTOM' Split data using a customer provided column of type BOOL. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.

'SEQ' Split data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows including NULLs are used as evaluation data.

'NO_SPLIT' Use all data as training data.

The default value is 'AUTO_SPLIT'.

#### DATA_SPLIT_EVAL_FRACTION

Syntax

 DATA_SPLIT_EVAL_FRACTION = float64_value 

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.

Arguments

float64_value is a FLOAT64. The default value is 0.2.

#### DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

• When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.

• When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.

For information on supported input types, see Supported input types for DATA_SPLIT_COL.

Arguments

string_value is a STRING.

#### LS_INIT_LEARN_RATE

Syntax

LS_INIT_LEARN_RATE = float64_value

Description

Sets the initial learning rate that LEARN_RATE_STRATEGY='LINE_SEARCH' uses. This option can only be used if LINE_SEARCH is specified.

If the model LEARN_RATE appears to be doubling every iteration as indicated by ML.TRAINING_INFO, try setting LS_INIT_LEARN_RATE to the last doubled learn rate. The optimal initial learn rate is different for every model. A good initial learn rate for one model might not be a good initial learn rate for another.

Arguments

float64_value is a FLOAT64.

#### WARM_START

Syntax

WARM_START = { TRUE | FALSE }


Description

Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The value of the MODEL_TYPE and LABELS options and the training data schema must remain constant in a warm start.

Arguments

Accepts a BOOL. The default value is FALSE.

#### AUTO_CLASS_WEIGHTS

Syntax

AUTO_CLASS_WEIGHTS = { TRUE | FALSE }


Description

Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.

Use with logistic and multiclass logistic regressions.

By default, the training data used to create a multiclass logistic regression model is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily, which may not be desired.

To balance every class, use the following formula:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)


Arguments

Accepts a BOOL. The default value is FALSE.

#### CLASS_WEIGHTS

Syntax

CLASS_WEIGHTS = struct_array

Description

The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is specified.

Arguments

struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING containing the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]


#### NUM_CLUSTERS

Syntax

NUM_CLUSTERS = int64_value

Description

For a k-means model, the number of clusters to identify in the input data.

Arguments

int64_value is an INT64. Allowed values are 2-100. The default value is log10(n), where n is the number of training examples.

#### DISTANCE_TYPE

Syntax

DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }


Description

For a k-means model, the type of metric to compute the distance between two points.

Arguments

Accepts the following values:

'EUCLIDEAN' Use the following equation to calculate the distance between point x and y:

$$\lVert x-y\rVert_{2}$$

'COSINE' Use the following equation calculates the distance:

$$\sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}}$$

where $$\lVert x\rVert_{2}$$ represents the L2 norm for x.

The default value is 'EUCLIDEAN'.

#### STANDARDIZE_FEATURES

Syntax

STANDARDIZE_FEATURES = { TRUE | FALSE }


Description

For a kmeans model, whether to standardize numerical features.

Arguments

Accepts a BOOL. The default value is TRUE.

#### MODEL_PATH

Syntax

MODEL_PATH = string_value


Description

For TensorFlow model types, specifies the location of the TensorFlow model to import.

string_value is the location of a Google Cloud Storage bucket that contains the model to import.

Example

MODEL_PATH = 'gs:////bucket/path/to/saved_model/*'


### query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model except for the columns included in input_label_cols and data_split_col.

## Supported inputs

The CREATE MODEL statement supports the following data types for input label and data split columns.

### Supported data types for input label columns

BigQuery ML supports different Standard SQL data types depending on the model type. Supported data types for input_label_cols include:

Model type Supported label types
linear_reg INT64
NUMERIC
FLOAT64
logistic_reg Any groupable data type
kmeans Any groupable data type

### Supported data types for data split columns

BigQuery ML supports different Standard SQL data types depending on the data split method. Supported data types for data_split_col include:

Data split method Supported column types
custom BOOL
seq INT64
NUMERIC
FLOAT64
TIMESTAMP

### Limitations

CREATE MODEL statements must comply with the following rules:

• Only one CREATE statement is allowed.
• When you use a CREATE MODEL statement, the size of the model must be 90 MB or less or the query fails. Generally, if all categorical variables are short strings, a total feature cardinality (model dimension) of 5-10 million is supported. The dimensionality is dependent on the cardinality and length of the string variables.
• The label column cannot contain NULL values. If the label column contains NULL values, the query fails.
• CREATE MODEL does not currently support the use of KMS encryption keys. If your project uses a default KMS key, CREATE MODEL will return an error.
• Currently, the CREATE MODEL IF NOT EXISTS clause always updates the last modified timestamp of a model.
• For linear regression models, the label column must be real-valued (the column values cannot be +/- infinity or NaN).
• For logistic regression models, the label columns can contain up to 50 unique values; that is, the number of classes is less than or equal to 50.

### Input variable transformations

BigQuery ML transforms input variables or features as follows:

Input data type Transformation method Details
INT64
NUMERIC
FLOAT64
Standardization For all numeric columns, BigQuery ML standardizes and centers the column at zero before passing it into training. When creating a k-means model, the STANDARDIZE_FEATURES option specifies whether to standardize numerical features.
BOOL
STRING
BYTES
DATE
DATETIME
TIME
One-hot encoded For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
ARRAY
STRUCT
Unsupported
TIMESTAMP Timestamp transformation When BigQuery ML encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix time in seconds component, BigQuery ML uses standardization; for all other components, it uses one-hot encoding.

You can use the ML.WEIGHTS function to see the transformation of a TIMESTAMP column into multiple feature columns.

## Imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.

### Numeric columns

In both training and prediction, NULL values in numeric columns are replaced with the mean value as calculated by the feature column in the original input data.

### One-hot encoded columns

In both training and prediction, NULL values in one-hot encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction.

### Timestamp columns

TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated unix time column, BigQuery ML replaces values with the mean unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature.

## CREATE MODEL examples

The CREATE MODEL statement creates a model with the specified options. If the model name exists in the dataset, the following error is returned:

Already Exists: [PROJECT_ID]:[DATASET].[MODEL]

To replace an existing model, use the CREATE OR REPLACE MODEL statement instead.

The following examples create models named mymodel in mydataset in your default project.

### Training a linear regression model

The following example creates and trains a linear regression model. The learn rate is set to .15, the L1 regularization is set to 1, and the maximum number of training iterations is set to 5.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='linear_reg',
ls_init_learn_rate=.15,
l1_reg=1,
max_iterations=5 ) AS
SELECT
column1,
column2,
column3,
label
FROM
mydataset.mytable
WHERE
column4 < 10


### Training a linear regression model with a sequential data split

In this example, you create a linear regression model with a sequential data split. The split fraction is .3 and split uses the timestamp column as the basis for the split.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='linear_reg',
ls_init_learn_rate=.15,
l1_reg=1,
max_iterations=5,
data_split_method='seq',
data_split_eval_fraction=0.3,
data_split_col='timestamp' ) AS
SELECT
column1,
column2,
column3,
timestamp,
label
FROM
mydataset.mytable
WHERE
column4 < 10


### Training a linear regression model with a custom data split

In this example, you have manually split your data into a training table and an evaluation table. The training table is named training_table. The evaluation table is named evaluation_table.

This command creates a linear regression model using a custom split method and trains the model by joining the data from the evaluation and training tables.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='linear_reg',
data_split_method='custom',
data_split_col='split_col' ) AS
SELECT
*,
false AS split_col
FROM
mydataset.training_table
UNION ALL
SELECT
*,
true AS split_col
FROM
mydataset.evaluation_table


In this example, all the columns in the training table and in the evaluation table are either features or the label. The query uses SELECT * and UNION ALL to append all of the data in the split_col column to the existing data.

### Training a multiclass logistic regression model with automatically calculated weights

In this example, you create a multiclass logistic regression model using the auto_class_weights option.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='logistic_reg',
auto_class_weights=true ) AS
SELECT
*
FROM
mydataset.mytable


### Training a multiclass logistic regression model with specified weights

In this example, you create a multiclass logistic regression model using the class_weights option. The label columns are label1, label2, and label3.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='logistic_reg',
class_weights=[('label1', 0.5), ('label2', 0.3), ('label3', 0.2)]) AS
SELECT
*
FROM
mydataset.mytable


### Training a logistic regression model with specified weights

In this example, you create a logistic regression model using the class_weights option.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='logistic_reg',
class_weights=[('0', 0.9), ('1', 0.1)]) AS
SELECT
*
FROM
mydataset.mytable


### Training a k-means model

This example creates a k-means model with four clusters using the default distance_type value of euclidean_distance.

CREATE MODEL
mydataset.mymodel
OPTIONS
( model_type='kmeans',
num_clusters=4 ) AS
SELECT
*
FROM mydataset.mytable

Оцените, насколько информация на этой странице была вам полезна:

### Оставить отзыв о...

Текущей странице
BigQuery ML Documentation