The CREATE MODEL statement for transform-only models
This document describes the CREATE MODEL
statement for creating
transform-only models in BigQuery. Transform-only models use the
TRANSFORM
clause
to apply preprocessing functions
to input data and return the preprocessed data. Transform-only models decouple
data preprocessing from model training, making it easier for you to capture and
reuse a set of data preprocessing rules.
You can use a transform-only model in conjunction with the
ML.TRANSFORM
function
to provide preprocessed data to other models:
- You can use it in the query statement when creating another model, in order to use the transformed data as the training data for that model.
- You can use it in the
query statement
argument of theML.PREDICT
function, in order to provide data for prediction that is processed in the way the target model expects.
For batch feature transformations, it is better to use transform-only models because it lets you process large amounts of data in a short time. For online feature transformations, it is better to use Vertex AI Feature Store because it provides responses with low latency.
You can also use a transform-only model with the
ML.FEATURE_INFO
function
in order to return information about feature transformations in the model.
For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name TRANSFORM (select_list) OPTIONS(MODEL_TYPE = 'TRANSFORM_ONLY') AS query_statement
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
select_list
You can pass columns from query_statement
through to model training without
transformation by either using *
, * EXCEPT()
, or by listing
the column names directly.
Not all columns from query_statement
are required to appear in the TRANSFORM
clause, so you can drop columns appearing in query_statement
by omitting
them from the TRANSFORM
clause.
You can transform inputs from query_statement
by using expressions in
select_list
. select_list
is similar to a normal SELECT
statement.
select_list
supports the following syntax:
*
* EXCEPT()
* REPLACE()
expression
expression.*
The following cannot appear inside select_list
:
- Aggregation functions.
- Non-BigQuery ML analytic functions. For more information about supported functions, see Manual feature preprocessing.
- UDFs.
- Subqueries.
- Anonymous columns. For example,
a + b as c
is allowed, whilea + b
isn't.
The output columns of select_list
can be of any BigQuery
supported data type.
If present, the following columns must appear in select_list
without
transformation:
label
data_split_col
kmeans_init_col
instance_weight_col
If these columns are returned by query_statement
, you must reference them in
select_list
by column name outside of any expression, or by using *
. You
can't use aliases with these columns.
MODEL_TYPE
Syntax
MODEL_TYPE = 'TRANSFORM_ONLY'
Description
Specify the model type. This option is required.
query_statement
The GoogleSQL query that contains the data to preprocess. The statistics that are calculated when transforming this data are applied to the input data of any functions that you use the model with.
Examples
The following examples show how to create and use transform-only models.
Example 1
The following example creates a model named transform_model
in mydataset
in your
default project. The model transforms several columns from the
BigQuery public table bigquery-public-data.ml_datasets.penguins
:
CREATE MODEL `mydataset.transform_model` TRANSFORM( species, island, ML.MAX_ABS_SCALER(culmen_length_mm) OVER () AS culmen_length_mm, ML.MAX_ABS_SCALER(culmen_depth_mm) OVER () AS culmen_depth_mm, ML.MAX_ABS_SCALER(flipper_length_mm) OVER () AS flipper_length_mm, sex, body_mass_g) OPTIONS ( model_type = 'transform_only') AS ( SELECT * FROM `bigquery-public-data.ml_datasets.penguins` );
Example 2
The following example creates a model named mymodel
in mydataset
in your
default project. The model is trained on data that is preprocessed by using
a transform-only model:
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE = 'LINEAR_REG', MAX_ITERATIONS = 5, INPUT_LABEL_COLS = ['body_mass_g'] ) AS SELECT * FROM ML.TRANSFORM( MODEL `mydataset.transform_model`, TABLE `bigquery-public-data.ml_datasets.penguins`) WHERE body_mass_g IS NOT NULL;