The CREATE MODEL statement for training AutoML Tables models

AutoML Tables enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. See AutoML Tables documentation for more details about the model type.

CREATE MODEL statement for AutoML Tables models

To train an AutoML Tables model inside of BigQuery, use the BigQuery ML CREATE MODEL statement with one of the AutoML Tables model types.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }
         , BUDGET_HOURS = float64_value)
        [, INPUT_LABEL_COLS = string_array ]
        [, DATA_SPLIT_COL = string ]
[AS query_statement];

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. For AutoML Tables models, BigQuery ML trains and compresses the AutoML Tables model and converts it to a BigQuery ML model. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new model and replaces any existing model with the same name in the specified dataset.

Model Options

BigQuery ML currently uses the default values for AutoML Tables training options, including automatic data splitting and the default optimization function.

model_name

model_name is the name of the BigQuery ML model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }

Description

Specifies the model type. This option is required. For a regression model, the type should be 'AUTOML_REGRESSOR' and the type of the label column should be numeric. For a classification model, the type should be 'AUTOML_CLASSIFIER' and the label column can be either a string or numeric.

BUDGET_HOURS

Syntax

BUDGET_HOURS = float64_value

Description

Sets the training budget for AutoML Tables training, specified in hours. Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used if MODEL_TYPE is AUTOML_REGRESSOR or AUTOML_CLASSIFIER.

After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.

Arguments

float64_value is a FLOAT64.

OPTIMIZATION_OBJECTIVE

Syntax

OPTIMIZATION_OBJECTIVE = string_value

Description

Sets the optimization objective function to use for AutoML Tables training. For regression the allowed values are: MINIMIZE_RMSE (default), MINIMIZE_MAE, and MINIMIZE_RMSLE. For binary classification the allowed values are MAXIMIZE_AU_ROC (default), MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC. For Multiclass classification, the only allowed value is: MINIMIZE_LOG_LOSS.

For more details on the optimization objective functions, see the AutoML Tables documentation.

Arguments

string_value is one of: MAXIMIZE_AU_ROC, MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC, MINIMIZE_RMSE, MINIMIZE_MAE, or MINIMIZE_RMSLE.

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The label column name in the training data. Defaults to 'label'.

Arguments

string_array is an ARRAY of STRINGs. The AutoML Tables model types only support string_array values that contain one element.

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

The split column name in the training data. Defaults to automatic splitting.

Arguments

string_value is one of the columns in the training data and should be either a timestamp or string column. This column will be passed directly to AutoML Tables. In a string column, you specify rows to be used for training, validation, and testing using the TRAIN, VALIDATE, TEST, and UNASSIGNED values. For more information about how to use these values, see the AutoML Tables data split column. Timestamps will be treated as an AutoML Tables Time Column.

Supported inputs

The CREATE MODEL statement supports the following data types for input label and data split columns.

Supported data types for input label columns

BigQuery ML supports different Standard SQL data types depending on the model type. Supported data types for input_label_cols include:

Model type Supported label types
automl_regressor INT64
NUMERIC
FLOAT64
automl_classifier Any groupable data type

Supported data types for other columns

For columns other than the label column, any groupable data type is supported. The BigQuery column type will be used to determine the feature column type in AutoML Tables.

BigQuery type AutoML type
INT64
NUMERIC
FLOAT64
NUMERIC or TIMESTAMP if AutoML Tables determines that it is a unix timestamp
BOOL CATEGORICAL
STRING
BYTES
Either CATEGORICAL or TEXT, auto-selected by AutoML Tables.
TIMESTAMP
DATETIME
TIME
DATE
Either TIMESTAMP, CATEGORICAL, or TEXT, auto-selected by AutoML Tables.

To force a numeric column to be treated as categorical, CAST it to a BigQuery string. Arrays of supported types are allowed and will remain arrays during AutoML Tables training.

Limitations

CREATE MODEL statements for AutoML Tables models must comply with the following rules:

  • The input data to AutoML Tables must be between 1000 and 100 million rows, and less than 100 GB.
  • AXT and CMEK are currently not supported.
  • The models are not visible in the AutoML Tables UI, and not available for batch or online predictions in AutoML Tables.

CREATE MODEL examples

The following example creates models named mymodel in mydataset in your default project. It uses the public 'nyc-tlc.yellow.trips' taxi trip data available in BigQuery. The job takes approximately 3 hours to complete, including training, model compression, temporary data movement (to AutoML), and setup tasks.

Create the model

CREATE OR REPLACE MODEL project_id.mydataset.mymodel
       OPTIONS(model_type='AUTOML_REGRESSOR',
               input_label_cols=['fare_amount'],
               budget_hours=1.0)
AS SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude,
  pickup_latitude,
  dropoff_longitude,
  dropoff_latitude,
  passenger_count
FROM `nyc-tlc.yellow.trips`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
AND
  trip_distance > 0
  AND fare_amount >= 2.5 AND fare_amount <= 100.0
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0

Run predictions

SELECT * FROM ML.PREDICT(MODEL project_id.mydataset.mymodel, (
    SELECT * FROM `nyc-tlc.yellow.trips` LIMIT 100))

Supported regions

Training AutoML Tables models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see the Locations page.