Manual feature preprocessing

You can use the TRANSFORM clause of the CREATE MODEL statement in combination with manual preprocessing functions to define custom data preprocessing. You can also use these manual preprocessing functions outside of the TRANSFORM clause.

If you want to decouple data preprocessing from model training, you can use the TRANSFORM clause to create a transform-only model, which only performs data transformations.

You can use the ML.TRANSFORM function to make feature preprocessing more transparent. This function returns the preprocessed data from a model's TRANSFORM clause, so that you can see the actual data that goes into model training and the actual data that goes into model serving at prediction time.
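
As a rough mental model only (this is a pure-Python analogy, not BigQuery ML code), a model created with a TRANSFORM clause can be pictured as carrying its preprocessing step with it. The class ModelWithTransform, its methods, the column names, and the weights below are all hypothetical. The preprocess method plays the role that ML.TRANSFORM plays in the description above: it returns the preprocessed rows without running the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class ModelWithTransform:
    """Hypothetical stand-in for a model that stores its preprocessing step."""

    transform: Callable[[Row], Row]  # analogy for the TRANSFORM clause
    weights: Dict[str, float]        # toy linear model over transformed columns

    def preprocess(self, rows: List[Row]) -> List[Row]:
        # Analogy for ML.TRANSFORM: expose exactly what the model will see.
        return [self.transform(row) for row in rows]

    def predict(self, rows: List[Row]) -> List[float]:
        # Prediction always goes through the same stored preprocessing step.
        return [
            sum(self.weights[name] * value for name, value in row.items())
            for row in self.preprocess(rows)
        ]


def scale_weight(row: Row) -> Row:
    # Hypothetical preprocessing: rescale a raw column into a model feature.
    return {"weight_scaled": row["weight_kg"] / 100.0}


model = ModelWithTransform(transform=scale_weight, weights={"weight_scaled": 2.0})
rows = [{"weight_kg": 50.0}, {"weight_kg": 80.0}]

print(model.preprocess(rows))  # inspect the preprocessed features
print(model.predict(rows))     # predictions computed from those same features
```

Because the caller passes raw rows to both methods, the preprocessing cannot drift between inspection and prediction, which is the property the TRANSFORM clause is meant to provide.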

For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

Types of preprocessing functions

There are several types of manual preprocessing functions:

  • Scalar functions operate on a single row. For example, ML.BUCKETIZE.
  • Table-valued functions operate on all rows and output a table. For example, ML.FEATURES_AT_TIME.
  • Analytic functions operate on all rows, and output the result for each row based on the statistics collected across all rows. For example, ML.QUANTILE_BUCKETIZE.

    You must always use an empty OVER() clause with ML analytic functions.

    When you use ML analytic functions inside the TRANSFORM clause during training, the statistics computed during training are automatically applied to the input at prediction time.
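
The difference between scalar and analytic preprocessing, and the reuse of training statistics at prediction time, can be sketched in plain Python. This is an illustration of the idea, not BigQuery ML's implementation: the helper names bucketize and fit_quantile_splits are hypothetical, the quantile rule is a simple rank-based approximation, and the handling of values that fall exactly on a split point is an assumption.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucketize(value: float, split_points: Sequence[float]) -> str:
    """Scalar-style: the result depends only on this one value.

    split_points must be sorted in ascending order.
    """
    return f"bin_{bisect_right(split_points, value) + 1}"


def fit_quantile_splits(values: Sequence[float], num_buckets: int) -> List[float]:
    """Analytic-style, step 1: collect statistics across all rows.

    Assumption: split points are picked by rank from the sorted values.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // num_buckets] for i in range(1, num_buckets)]


# Training: the split points are derived from every training row.
training_values = [1, 2, 3, 4, 5, 6, 7, 8]
splits = fit_quantile_splits(training_values, num_buckets=4)
training_bins = [bucketize(v, splits) for v in training_values]

# Prediction: the split points learned in training are reused unchanged,
# even for values outside the range seen during training.
prediction_values = [0, 4.5, 100]
prediction_bins = [bucketize(v, splits) for v in prediction_values]

print(splits)
print(training_bins)
print(prediction_bins)
```

The key design point is that fit_quantile_splits runs once, on the training rows only. Recomputing the split points from the prediction input would assign the same raw value to different buckets in training and in serving.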

The following sections describe the available preprocessing functions.

General functions

Use the following function on string or numerical expressions to clean up data:

Numerical functions

Use the following functions on numerical expressions to regularize data:

Categorical functions

Use the following functions to categorize data:

Text functions

Use the following functions on text string expressions:

Image functions

Use the following functions on image data:

Known limitations