The CREATE MODEL statement to train AutoML Tables models

AutoML Tables enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. See AutoML Tables documentation for more details about the model type.

CREATE MODEL statement for AutoML Tables models

To train an AutoML Tables model inside of BigQuery, use the BigQuery ML CREATE MODEL statement with one of the AutoML Tables model types.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' },
        BUDGET_HOURS = float64_value,
        OPTIMIZATION_OBJECTIVE = string_value,
        INPUT_LABEL_COLS = string_array,
        DATA_SPLIT_COL = string_value)
[AS query_statement];

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. For AutoML Tables models, BigQuery ML trains and compresses the AutoML Tables model and converts it to a BigQuery ML model. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new model and replaces any existing model with the same name in the specified dataset.

Model options

BigQuery ML currently uses the default values for AutoML Tables training options, including automatic data splitting and the default optimization function.

model_name

model_name is the name of the BigQuery ML model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

model_option_list

In the model_option_list, the model_type option is required. All others are optional.

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }

Description

Specifies the model type. This option is required. For a regression model, the type should be 'AUTOML_REGRESSOR' and the type of the label column should be numeric. For a classification model, the type should be 'AUTOML_CLASSIFIER' and the label column can be either a string or numeric.

BUDGET_HOURS

Syntax

BUDGET_HOURS = float64_value

Description

Sets the training budget for AutoML Tables training, specified in hours. Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used if MODEL_TYPE is AUTOML_REGRESSOR or AUTOML_CLASSIFIER.

After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.

Arguments

float64_value is a FLOAT64.

OPTIMIZATION_OBJECTIVE

Syntax

OPTIMIZATION_OBJECTIVE = string_value | struct_value

Description

Sets the optimization objective function to use for AutoML Tables training.

  • For regression, the allowed values are: MINIMIZE_RMSE(default), MINIMIZE_MAE, and MINIMIZE_RMSLE.
  • For binary classification, the allowed values are: MAXIMIZE_AU_ROC(default), MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC, MAXIMIZE_PRECISION_AT_RECALL and MAXIMIZE_RECALL_AT_PRECISION.
  • For multiclass classification, the only allowed value is: MINIMIZE_LOG_LOSS.

For more details on the optimization objective functions, see the AutoML Tables documentation.

Arguments

OPTIMIZATION_OBJECTIVE is a polymorphic option that can be specified as a string value or a struct value.

  • string_value is one of: MAXIMIZE_AU_ROC, MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC, MINIMIZE_RMSE, MINIMIZE_MAE, or MINIMIZE_RMSLE.

    For example:

    OPTIMIZATION_OBJECTIVE = 'MAXIMIZE_AU_ROC'
    
  • struct_value is a STRUCT. The STRUCT contains a STRING and a FLOAT64, where the STRING represents the optimization objective function and the FLOAT64 represents the user input recall or precision value. The STRING is either MAXIMIZE_PRECISION_AT_RECALL or MAXIMIZE_RECALL_AT_PRECISION. The FLOAT64 value is the fixed recall value when the STRING is MAXIMIZE_PRECISION_AT_RECALL, or the fixed precision value when the STRING is MAXIMIZE_RECALL_AT_PRECISION. For either STRING, the value of the FLOAT64 must be in the range of [0,1].

    For example:

    OPTIMIZATION_OBJECTIVE = STRUCT('MAXIMIZE_PRECISION_AT_RECALL', 0.3)
    

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The label column name in the training data. Defaults to 'label'.

Arguments

string_array is an ARRAY of STRINGs. The AutoML Tables model types only support string_array values that contain one element.

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

The split column name in the training data. Defaults to automatic splitting.

Arguments

string_value is one of the columns in the training data and should be either a timestamp or string column. This column will be passed directly to AutoML Tables. In a string column, you specify rows to be used for training, validation, and testing using the TRAIN, VALIDATE, TEST, and UNASSIGNED values. For more information about how to use these values, see the AutoML Tables data split column. Timestamps will be treated as an AutoML Tables Time Column.

Supported inputs

The CREATE MODEL statement supports the following data types for input label and data split columns.

Supported data types for input label columns

BigQuery ML supports different GoogleSQL data types depending on the model type. Supported data types for input_label_cols include:

Model type Supported label types
automl_regressor INT64
NUMERIC
BIGNUMERIC
FLOAT64
automl_classifier Any groupable data type

Supported data types for other columns

For columns other than the label column, any groupable data type is supported. The BigQuery column type will be used to determine the feature column type in AutoML Tables.

BigQuery type AutoML type
INT64
NUMERIC
BIGNUMERIC
FLOAT64
NUMERIC or TIMESTAMP if AutoML Tables determines that it is a unix timestamp
BOOL CATEGORICAL
STRING
BYTES
Either CATEGORICAL or TEXT, auto-selected by AutoML Tables.
TIMESTAMP
DATETIME
TIME
DATE
Either TIMESTAMP, CATEGORICAL, or TEXT, auto-selected by AutoML Tables.

To force a numeric column to be treated as categorical, CAST it to a BigQuery string. Arrays of supported types are allowed and will remain arrays during AutoML Tables training.

Limitations

CREATE MODEL statements for AutoML Tables models must comply with the following rules:

  • The input data to AutoML Tables must be between 1000 and 200,000,000 rows, and less than 100 GB.
  • Global region CMEK keys and multi-region CMEK keys, for example eu or us, are not supported.
  • The models are not visible in the AutoML Tables UI, and not available for batch or online predictions in AutoML Tables.
  • The default maximum number of concurrent training jobs is 5. Raising the Vertex AI quota does not modify this quota. If you receive an error Too many AutoML training queries have been issued within a short period of time, you can submit a request to raise the maximum number of concurrent training jobs. To request an increase, contact bqml-feedback@google.com with your project ID and the details of your request.

CREAT