Automatic feature preprocessing

BigQuery ML performs automatic preprocessing during training by using the CREATE MODEL statement. This consists of missing value imputation and feature transformations.

For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.

For information about the supported model types of each SQL statement and function, and all of the supported SQL statements and functions for each model type, read End-to-end user journey for each model.

Missing data imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.

| Column type | Imputation method |
|---|---|
| Numeric | In both training and prediction, `NULL` values in numeric columns are replaced with the mean value of the feature column, as calculated from the original input data. |
| One-hot/multi-hot encoded | In both training and prediction, `NULL` values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction. |
| Timestamp | `TIMESTAMP` columns use a mixture of the imputation methods for standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces values with the mean Unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective `NULL` category for each extracted feature. |
| `STRUCT` | In both training and prediction, each field of the `STRUCT` is imputed according to its type. |
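The numeric imputation rule above can be sketched in Python. This is an illustrative sketch, not BigQuery ML's internal implementation; the function names are hypothetical. The key point it shows is that the mean is computed once from the training data and reused as-is at prediction time:

```python
# Illustrative sketch (not BigQuery ML's implementation): numeric NULLs
# (None here) are replaced with the training-set mean of the column.

def fit_numeric_imputer(values):
    """Return the mean of the non-NULL training values."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def impute_numeric(values, mean):
    """Replace NULLs with the stored training mean (used in both
    training and prediction)."""
    return [mean if v is None else v for v in values]

train = [10.0, None, 20.0, 30.0]
mean = fit_numeric_imputer(train)       # mean of 10, 20, 30 -> 20.0
imputed = impute_numeric(train, mean)   # [10.0, 20.0, 20.0, 30.0]
```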

Feature transformations

By default, BigQuery ML transforms input features as follows:

| Input data type | Transformation method | Details |
|---|---|---|
| `INT64`, `NUMERIC`, `BIGNUMERIC`, `FLOAT64` | Standardization | For all numerical columns, BigQuery ML standardizes and centers the column at zero before passing it into training, for all models except boosted tree models. When creating a k-means model, the `STANDARDIZE_FEATURES` option specifies whether to standardize numerical features. |
| `BOOL`, `STRING`, `BYTES`, `DATE`, `DATETIME`, `TIME` | One-hot encoding | For all non-numerical, non-array columns other than `TIMESTAMP`, BigQuery ML performs a one-hot encoding transformation for all models except boosted tree models. This transformation generates a separate feature for each unique value in the column. For boosted tree models, a label encoding transformation is applied instead, converting each unique value into a numerical value. |
| `ARRAY` | Multi-hot encoding | For all non-numerical `ARRAY` columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the `ARRAY`. |
| `TIMESTAMP` | Timestamp transformation | When a linear or logistic regression model encounters a `TIMESTAMP` column, it extracts a set of components from the `TIMESTAMP` and performs a mix of standardization and one-hot encoding on them: standardization for the Unix time in seconds component, and one-hot encoding for all other components. For more information, see the timestamp feature transformation table below. |
| `STRUCT` | Struct expansion | When BigQuery ML encounters a `STRUCT` column, it expands the fields inside the `STRUCT` into individual columns. All fields of the `STRUCT` must be named, and nested `STRUCT`s are not allowed. The column names after expansion are in the format `{struct_name}_{field_name}`. |
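The standardization applied to numerical columns can be sketched as follows. This is an illustrative Python sketch, not BigQuery ML's implementation; the function names are hypothetical. It uses the training-set mean and population standard deviation to center each column at zero:

```python
# Illustrative sketch (not BigQuery ML's implementation) of standardizing
# a numerical column: subtract the training mean and divide by the
# training standard deviation, so the result is centered at zero.
import math

def fit_standardizer(values):
    """Compute mean and population standard deviation from training data."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def standardize(values, mean, std):
    """Apply the stored training statistics (constant columns map to 0)."""
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

mean, std = fit_standardizer([1.0, 2.0, 3.0])
z = standardize([1.0, 2.0, 3.0], mean, std)   # mean of z is 0
```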

TIMESTAMP feature transformation

The following table shows the components extracted from TIMESTAMPs and the corresponding transformation method.

| TIMESTAMP component | processed_input result | Transformation method |
|---|---|---|
| Unix time in seconds | `[COLUMN_NAME]` | Standardization |
| Day of month | `_TS_DOM_[COLUMN_NAME]` | One-hot encoding |
| Day of week | `_TS_DOW_[COLUMN_NAME]` | One-hot encoding |
| Month of year | `_TS_MOY_[COLUMN_NAME]` | One-hot encoding |
| Hour of day | `_TS_HOD_[COLUMN_NAME]` | One-hot encoding |
| Minute of hour | `_TS_MOH_[COLUMN_NAME]` | One-hot encoding |
| Week of year (weeks begin on Sunday) | `_TS_WOY_[COLUMN_NAME]` | One-hot encoding |
| Year | `_TS_YEAR_[COLUMN_NAME]` | One-hot encoding |
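The component extraction can be sketched in Python. This is an illustrative sketch, not BigQuery ML's implementation; the function name is hypothetical, and the dictionary keys follow the processed_input naming patterns above:

```python
# Illustrative sketch (not BigQuery ML's implementation) of the components
# extracted from a TIMESTAMP column before standardization/one-hot encoding.
from datetime import datetime, timezone

def timestamp_features(ts: datetime, column_name: str) -> dict:
    return {
        column_name: ts.timestamp(),                       # Unix seconds: standardized
        f"_TS_DOM_{column_name}": ts.day,                  # day of month: one-hot encoded
        f"_TS_DOW_{column_name}": int(ts.strftime("%w")),  # day of week (Sunday = 0): one-hot encoded
        f"_TS_MOY_{column_name}": ts.month,                # month of year: one-hot encoded
        f"_TS_HOD_{column_name}": ts.hour,                 # hour of day: one-hot encoded
        f"_TS_MOH_{column_name}": ts.minute,               # minute of hour: one-hot encoded
        f"_TS_WOY_{column_name}": int(ts.strftime("%U")),  # week of year, weeks begin on Sunday: one-hot encoded
        f"_TS_YEAR_{column_name}": ts.year,                # year: one-hot encoded
    }

feats = timestamp_features(datetime(2024, 7, 4, 15, 30, tzinfo=timezone.utc), "created_at")
```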

Category feature encoding

For features that are one-hot encoded by default, you can specify a different encoding method by using the CATEGORY_ENCODING_METHOD model option. For generalized linear models (GLMs), CATEGORY_ENCODING_METHOD can be set to the following values:
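As a sketch of how this option is set in a CREATE MODEL statement (the dataset, table, and column names here are hypothetical placeholders):

```sql
-- Hypothetical example: mydataset, fruit_table, and the column names
-- are placeholders, not real objects.
CREATE OR REPLACE MODEL `mydataset.fruit_model`
OPTIONS (
  MODEL_TYPE = 'LINEAR_REG',
  CATEGORY_ENCODING_METHOD = 'DUMMY_ENCODING'
) AS
SELECT fruit, weight AS label
FROM `mydataset.fruit_table`;
```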

One-hot encoding

One-hot encoding maps each category of a feature to its own binary feature (known as a dummy variable), where 0 represents the absence of the category and 1 represents its presence. This creates N new feature columns, where N is the number of unique categories for the feature across the training table.

For example, suppose the training table has a feature column called fruit with the categories Apple, Banana, and Cranberry, like the following:

| Row | fruit |
|---|---|
| 1 | Apple |
| 2 | Banana |
| 3 | Cranberry |

then CATEGORY_ENCODING_METHOD='ONE_HOT_ENCODING' transforms the table to the following internal representation:

| Row | fruit_Apple | fruit_Banana | fruit_Cranberry |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |

Dummy encoding

Dummy encoding is similar to one-hot encoding, in that a categorical feature is transformed into a set of dummy variables. However, dummy encoding uses N-1 dummy variables instead of N dummy variables to represent N categories for a feature.

For example, if you set CATEGORY_ENCODING_METHOD to 'DUMMY_ENCODING' for the same fruit feature column in the example above, then the table is transformed to the following internal representation:

| Row | fruit_Apple | fruit_Banana |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |

In BigQuery ML, the category with the most occurrences in the training dataset is dropped. When multiple categories have the most occurrences, a random category within that set is dropped.

The final set of weights from ML.WEIGHTS still includes the dropped category, but its weight is always 0.0. For ML.ADVANCED_WEIGHTS, the standard error and p-value for the dropped variable are NaN.

If warm_start is used on a model that was initially trained with 'DUMMY_ENCODING', the same dummy variable that was dropped in the first training run is dropped again. Models cannot change encoding methods between training runs.