The CREATE MODEL statement to train AutoML Tables models
AutoML Tables enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. See AutoML Tables documentation for more details about the model type.
CREATE MODEL
statement for AutoML Tables models
To train an AutoML Tables model inside of BigQuery, use the
BigQuery ML CREATE MODEL
statement with one of the
AutoML Tables model types.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }, BUDGET_HOURS = float64_value, OPTIMIZATION_OBJECTIVE = string_value, INPUT_LABEL_COLS = string_array, DATA_SPLIT_COL = string_value) [AS query_statement];
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. For AutoML
Tables models, BigQuery ML trains and compresses the
AutoML Tables model and converts it to a BigQuery ML
model. If the model name exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new model and replaces any existing model with the same name in the specified dataset.
Model options
BigQuery ML currently uses the default values for AutoML Tables training options, including automatic data splitting and the default optimization function.
model_name
model_name
is the name of the BigQuery ML model you're creating or
replacing. The model name must be unique per dataset: no other model or table
can have the same name. The model name must follow the same naming rules as a
BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Letters of either case, numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
model_option_list
In the model_option_list
, the model_type
option is required. All others
are optional.
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }
Description
Specifies the model type. This option is required. For a regression model, the type should be 'AUTOML_REGRESSOR' and the type of the label column should be numeric. For a classification model, the type should be 'AUTOML_CLASSIFIER' and the label column can be either a string or numeric.
BUDGET_HOURS
Syntax
BUDGET_HOURS = float64_value
Description
Sets the training budget for AutoML Tables training, specified in hours.
Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used
if MODEL_TYPE
is AUTOML_REGRESSOR
or
AUTOML_CLASSIFIER
.
After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.
Arguments
float64_value is a FLOAT64
.
OPTIMIZATION_OBJECTIVE
Syntax
OPTIMIZATION_OBJECTIVE = string_value | struct_value
Description
Sets the optimization objective function to use for AutoML Tables training.
- For regression, the allowed values are:
MINIMIZE_RMSE
(default),MINIMIZE_MAE
, andMINIMIZE_RMSLE
. - For binary classification, the allowed values are:
MAXIMIZE_AU_ROC
(default),MINIMIZE_LOG_LOSS
,MAXIMIZE_AU_PRC
,MAXIMIZE_PRECISION_AT_RECALL
andMAXIMIZE_RECALL_AT_PRECISION
. - For multiclass classification, the only allowed value is:
MINIMIZE_LOG_LOSS
.
For more details on the optimization objective functions, see the AutoML Tables documentation.
Arguments
OPTIMIZATION_OBJECTIVE
is a polymorphic option that can be specified as a
string value or a struct value.
string_value is one of:
MAXIMIZE_AU_ROC
,MINIMIZE_LOG_LOSS
,MAXIMIZE_AU_PRC
,MINIMIZE_RMSE
,MINIMIZE_MAE
, orMINIMIZE_RMSLE
.For example:
OPTIMIZATION_OBJECTIVE = 'MAXIMIZE_AU_ROC'
struct_value is a
STRUCT
. TheSTRUCT
contains aSTRING
and aFLOAT64
, where theSTRING
represents the optimization objective function and theFLOAT64
represents the user input recall or precision value. TheSTRING
is eitherMAXIMIZE_PRECISION_AT_RECALL
orMAXIMIZE_RECALL_AT_PRECISION
. TheFLOAT64
value is the fixed recall value when theSTRING
isMAXIMIZE_PRECISION_AT_RECALL
, or the fixed precision value when theSTRING
isMAXIMIZE_RECALL_AT_PRECISION
. For eitherSTRING
, the value of theFLOAT64
must be in the range of [0,1].For example:
OPTIMIZATION_OBJECTIVE = STRUCT('MAXIMIZE_PRECISION_AT_RECALL', 0.3)
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array
Description
The label column name in the training data. Defaults to 'label'.
Arguments
string_array is an ARRAY
of STRING
s. The
AutoML Tables model types only support string_array
values that contain one element.
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value
Description
The split column name in the training data. Defaults to automatic splitting.
Arguments
string_value is one of the columns in the training data and should be
either a timestamp or string column. This column will be passed directly to
AutoML Tables. In a string column, you specify rows to be used
for training, validation, and testing using the TRAIN
, VALIDATE
, TEST
, and
UNASSIGNED
values. For more information about how to use these values, see
the AutoML Tables data split
column. Timestamps will be treated as an
AutoML Tables Time Column.
Supported inputs
The CREATE MODEL
statement supports the following data types for input label
and data split columns.
Supported data types for input label columns
BigQuery ML supports different GoogleSQL data types depending on the
model type. Supported data types for input_label_cols
include:
Model type |
Supported label types |
---|---|
automl_regressor |
INT64 NUMERIC BIGNUMERIC FLOAT64 |
automl_classifier |
Any groupable data type |
Supported data types for other columns
For columns other than the label column, any groupable data type is supported. The BigQuery column type will be used to determine the feature column type in AutoML Tables.
BigQuery type |
AutoML type |
---|---|
INT64 NUMERIC BIGNUMERIC FLOAT64 |
NUMERIC
or TIMESTAMP
if AutoML Tables determines that it is a unix timestamp |
BOOL |
CATEGORICAL |
STRING BYTES |
Either CATEGORICAL
or TEXT ,
auto-selected by AutoML Tables. |
TIMESTAMP DATETIME TIME DATE |
Either TIMESTAMP ,
CATEGORICAL ,
or TEXT ,
auto-selected by AutoML Tables. |
To force a numeric column to be treated as categorical, CAST it to a BigQuery string. Arrays of supported types are allowed and will remain arrays during AutoML Tables training.
Limitations
CREATE MODEL
statements for AutoML Tables models must comply with the
following rules:
- The input data to AutoML Tables must be between 1000 and 200,000,000 rows, and less than 100 GB.
Global
region CMEK keys and multi-region CMEK keys, for exampleeu
orus
, are not supported.- The models are not visible in the AutoML Tables UI, and not available for batch or online predictions in AutoML Tables.
- The default maximum number of concurrent training jobs
is 5. Raising the Vertex AI quota does not modify this quota. If you receive an error
Too many AutoML training queries have been issued within a short period of time
, you can submit a request to raise the maximum number of concurrent training jobs. To request an increase, contact bqml-feedback@google.com with your project ID and the details of your request.