AutoML Tables enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. See AutoML Tables documentation for more details about the model type.
CREATE MODEL
statement for AutoML Tables models
To train an AutoML Tables model inside of BigQuery, use the
BigQuery ML CREATE MODEL
statement with one of the AutoML Tables
model types.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' } , BUDGET_HOURS = float64_value) [, INPUT_LABEL_COLS = string_array ] [, DATA_SPLIT_COL = string ] [AS query_statement];
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. For AutoML
Tables models, BigQuery ML trains and compresses the AutoML Tables
model and converts it to a BigQuery ML model. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new model and replaces any existing model with the same name in the specified dataset.
Model Options
BigQuery ML currently uses the default values for AutoML Tables training options, including automatic data splitting and the default optimization function.
model_name
model_name
is the name of the BigQuery ML model you're creating or
replacing. The model name must be unique per dataset: no other model or table
can have the same name. The model name must follow the same naming rules as a
BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Letters of either case, numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'AUTOML_REGRESSOR' | 'AUTOML_CLASSIFIER' }
Description
Specifies the model type. This option is required. For a regression model, the type should be 'AUTOML_REGRESSOR' and the type of the label column should be numeric. For a classification model, the type should be 'AUTOML_CLASSIFIER' and the label column can be either a string or numeric.
BUDGET_HOURS
Syntax
BUDGET_HOURS = float64_value
Description
Sets the training budget for AutoML Tables training, specified in hours.
Defaults to 1.0 and must be between 1.0 and 72.0. This option can only be used
if MODEL_TYPE
is AUTOML_REGRESSOR
or
AUTOML_CLASSIFIER
.
After training AutoML Tables models, BigQuery ML compresses the model to ensure it is small enough to import, which can take up to 50% of the training time. The time to compress the model is not included in the training budget time.
Arguments
float64_value is a FLOAT64
.
OPTIMIZATION_OBJECTIVE
Syntax
OPTIMIZATION_OBJECTIVE = string_value
Description
Sets the optimization objective function to use for AutoML Tables training. For regression the allowed values are: MINIMIZE_RMSE (default), MINIMIZE_MAE, and MINIMIZE_RMSLE. For binary classification the allowed values are MAXIMIZE_AU_ROC (default), MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC. For Multiclass classification, the only allowed value is: MINIMIZE_LOG_LOSS.
For more details on the optimization objective functions, see the AutoML Tables documentation.
Arguments
string_value is one of: MAXIMIZE_AU_ROC, MINIMIZE_LOG_LOSS, MAXIMIZE_AU_PRC, MINIMIZE_RMSE, MINIMIZE_MAE, or MINIMIZE_RMSLE.
INPUT_LABEL_COLS
Syntax
INPUT_LABEL_COLS = string_array
Description
The label column name in the training data. Defaults to 'label'.
Arguments
string_array is an ARRAY
of STRING
s. The AutoML Tables model
types only support string_array values that contain one element.
DATA_SPLIT_COL
Syntax
DATA_SPLIT_COL = string_value
Description
The split column name in the training data. Defaults to automatic splitting.
Arguments
string_value is one of the columns in the training data and should be
either a timestamp or string column. This column will be passed directly to
AutoML Tables. In a string column, you specify rows to be used for training,
validation, and testing using the TRAIN
, VALIDATE
, TEST
, and UNASSIGNED
values. For more information about how to use these values, see
the AutoML Tables data split column.
Timestamps will be treated as an AutoML Tables Time Column.
Supported inputs
The CREATE MODEL
statement supports the following data types for input label
and data split columns.
Supported data types for input label columns
BigQuery ML supports different Standard SQL data types depending on the
model type. Supported data types for input_label_cols
include:
Model type |
Supported label types |
---|---|
automl_regressor |
INT64 NUMERIC BIGNUMERIC (Preview) FLOAT64 |
automl_classifier |
Any groupable data type |
Supported data types for other columns
For columns other than the label column, any groupable data type is supported. The BigQuery column type will be used to determine the feature column type in AutoML Tables.
BigQuery type |
AutoML type |
---|---|
INT64 NUMERIC BIGNUMERIC (Preview) FLOAT64 |
NUMERIC
or TIMESTAMP
if AutoML Tables determines that it is a unix timestamp |
BOOL |
CATEGORICAL |
STRING BYTES |
Either CATEGORICAL
or TEXT ,
auto-selected by AutoML Tables. |
TIMESTAMP DATETIME TIME DATE |
Either TIMESTAMP ,
CATEGORICAL ,
or TEXT ,
auto-selected by AutoML Tables. |
To force a numeric column to be treated as categorical, CAST it to a BigQuery string. Arrays of supported types are allowed and will remain arrays during AutoML Tables training.
Limitations
CREATE MODEL
statements for AutoML Tables models must comply with the
following rules:
- The input data to AutoML Tables must be between 1000 and 100 million rows, and less than 100 GB.
- AXT and CMEK are currently not supported.
- The models are not visible in the AutoML Tables UI, and not available for batch or online predictions in AutoML Tables.
CREATE MODEL
examples
The following example creates models named mymodel
in mydataset
in your
default project. It uses the public 'nyc-tlc.yellow.trips' taxi trip data
available in BigQuery. The job takes approximately 3 hours to complete,
including training, model compression, temporary data movement (to AutoML), and
setup tasks.
Create the model
CREATE OR REPLACE MODEL project_id.mydataset.mymodel
OPTIONS(model_type='AUTOML_REGRESSOR',
input_label_cols=['fare_amount'],
budget_hours=1.0)
AS SELECT
(tolls_amount + fare_amount) AS fare_amount,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
passenger_count
FROM `nyc-tlc.yellow.trips`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
AND
trip_distance > 0
AND fare_amount >= 2.5 AND fare_amount <= 100.0
AND pickup_longitude > -78
AND pickup_longitude < -70
AND dropoff_longitude > -78
AND dropoff_longitude < -70
AND pickup_latitude > 37
AND pickup_latitude < 45
AND dropoff_latitude > 37
AND dropoff_latitude < 45
AND passenger_count > 0
Run predictions
SELECT * FROM ML.PREDICT(MODEL project_id.mydataset.mymodel, (
SELECT * FROM `nyc-tlc.yellow.trips` LIMIT 100))
Supported regions
Training AutoML Tables models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see the Locations page.