BigQuery ML performs automatic preprocessing during training by using the `CREATE MODEL` statement. This preprocessing consists of missing value imputation and feature transformations.
For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.
For information about the supported model types of each SQL statement and function, and all of the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
Missing data imputation
In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, `NULL` values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a `NULL` value or a previously unseen value.
BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.
Column type | Imputation method |
---|---|
Numerical | In both training and prediction, `NULL` values in numeric columns are replaced with the mean value as calculated by the feature column in the original input data. |
One-hot/Multi-hot encoded | In both training and prediction, `NULL` values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction. |
Timestamp | `TIMESTAMP` columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated unix time column, BigQuery ML replaces values with the mean unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective `NULL` category for each extracted feature. |
STRUCT | In both training and prediction, each field of the STRUCT is imputed according to its type. |
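The imputation rules in the table above can be sketched in a few lines. This is an illustrative approximation, not BigQuery ML's internal implementation, and the `__NULL__` category name is a placeholder of our own choosing:

```python
# Sketch of the imputation rules: numeric NULLs are replaced with the mean
# computed over the original training column; categorical NULLs are mapped
# to an additional category added to the data.
def impute_numeric(values, train_mean):
    """Replace None with the training-column mean."""
    return [train_mean if v is None else v for v in values]

def impute_categorical(values):
    """Map None to a dedicated NULL category (placeholder name)."""
    return ["__NULL__" if v is None else v for v in values]

train = [10.0, None, 30.0]
mean = sum(v for v in train if v is not None) / sum(v is not None for v in train)
print(impute_numeric(train, mean))      # [10.0, 20.0, 30.0]
print(impute_categorical(["a", None]))  # ['a', '__NULL__']
```

At prediction time, the same training-set mean and the same extra category are reused, which is why previously unseen values can be handled without retraining.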
Feature transformations
By default, BigQuery ML transforms input features as follows:
Input data type | Transformation method | Details |
---|---|---|
`INT64`, `NUMERIC`, `BIGNUMERIC`, `FLOAT64` | Standardization | For all numerical columns, BigQuery ML standardizes and centers the column at zero before passing it into training, for all models except Boosted Tree and Random Forest models. When creating a k-means model, the `STANDARDIZE_FEATURES` option specifies whether to standardize numerical features. |
`BOOL`, `STRING`, `BYTES`, `DATE`, `DATETIME`, `TIME` | One-hot encoding | For all non-numerical, non-array columns other than `TIMESTAMP`, BigQuery ML performs a one-hot encoding transformation for all models except Boosted Tree and Random Forest models. This transformation generates a separate feature for each unique value in the column. A label encoding transformation is applied when training Boosted Tree and Random Forest models, converting each unique value into a numerical value. |
`ARRAY` | Multi-hot encoding | For all non-numerical `ARRAY` columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the `ARRAY`. |
`TIMESTAMP` | Timestamp transformation | When a linear or logistic regression model encounters a `TIMESTAMP` column, it extracts a set of components from the `TIMESTAMP` and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix time in seconds component, BigQuery ML uses standardization. For all other components, it uses one-hot encoding. For more information, see the timestamp feature transformation table below. |
`STRUCT` | Struct expansion | When BigQuery ML encounters a `STRUCT` column, it expands the fields inside the `STRUCT` to create a column for each field. All fields of the `STRUCT` must be named. Nested `STRUCT`s are not allowed. The column names after expansion are in the format `{struct_name}_{field_name}`. |
`ARRAY` of `STRUCT`s | No transformation | |
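The standardization applied to numerical columns can be sketched as follows. This is a minimal illustration of zero-centering and unit-scaling computed from training data, not BigQuery ML's actual implementation; the use of the population standard deviation here is our assumption:

```python
# Illustrative sketch of numeric standardization: subtract the training-column
# mean and divide by its standard deviation, centering the column at zero.
import statistics

def standardize(column):
    mean = statistics.mean(column)
    stdev = statistics.pstdev(column)  # population standard deviation (assumed)
    return [(x - mean) / stdev for x in column]

print(standardize([1.0, 2.0, 3.0]))  # [-1.2247..., 0.0, 1.2247...]
```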
TIMESTAMP feature transformation
The following table shows the components extracted from `TIMESTAMP` columns and the corresponding transformation methods.
TIMESTAMP component | processed_input result | Transformation method |
---|---|---|
Unix time in seconds | `[COLUMN_NAME]` | Standardization |
Day of month | `_TS_DOM_[COLUMN_NAME]` | One-hot encoding |
Day of week | `_TS_DOW_[COLUMN_NAME]` | One-hot encoding |
Month of year | `_TS_MOY_[COLUMN_NAME]` | One-hot encoding |
Hour of day | `_TS_HOD_[COLUMN_NAME]` | One-hot encoding |
Minute of hour | `_TS_MOH_[COLUMN_NAME]` | One-hot encoding |
Week of year (weeks begin on Sunday) | `_TS_WOY_[COLUMN_NAME]` | One-hot encoding |
Year | `_TS_YEAR_[COLUMN_NAME]` | One-hot encoding |
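The component extraction above can be sketched with Python's `datetime`. This assumes UTC timestamps, and the exact numbering conventions (for example, whether day of week starts at 0 or 1) are our assumptions, not BigQuery ML's documented behavior:

```python
# Sketch of the TIMESTAMP components listed in the table above.
# Unix time in seconds is standardized; all other components are one-hot encoded.
from datetime import datetime, timezone

def timestamp_components(ts: datetime) -> dict:
    return {
        "unix_seconds": int(ts.timestamp()),     # standardized
        "day_of_month": ts.day,                  # one-hot encoded
        "day_of_week": int(ts.strftime("%w")),   # 0 = Sunday (assumed numbering)
        "month_of_year": ts.month,
        "hour_of_day": ts.hour,
        "minute_of_hour": ts.minute,
        "week_of_year": int(ts.strftime("%U")),  # weeks begin on Sunday
        "year": ts.year,
    }

parts = timestamp_components(datetime(2024, 3, 15, 12, 30, tzinfo=timezone.utc))
print(parts["day_of_week"], parts["week_of_year"])  # 5 10 (a Friday in week 10)
```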
Category feature encoding
For features that are one-hot encoded, a different default encoding method can be specified by using the `CATEGORY_ENCODING_METHOD` model option. For generalized linear models (GLMs), `CATEGORY_ENCODING_METHOD` can be set to the following:
One-hot encoding
One-hot encoding maps each category of a feature to its own binary feature, where 0 represents the absence of the feature and 1 represents its presence (known as a dummy variable). This creates N new feature columns, where N is the number of unique categories for the feature across the training table.
For example, suppose a training table has a feature column called `fruit` with the categories `Apple`, `Banana`, and `Cranberry`, as in the following table:
Row | fruit |
---|---|
1 | Apple |
2 | Banana |
3 | Cranberry |
Then `CATEGORY_ENCODING_METHOD='ONE_HOT_ENCODING'` transforms the table to the following internal representation:
Row | fruit_Apple | fruit_Banana | fruit_Cranberry |
---|---|---|---|
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
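The transformation shown in the tables above can be sketched as follows. This is an illustrative approximation; the `{column}_{category}` naming mimics the example, but the internal representation BigQuery ML actually uses is not exposed:

```python
# Minimal one-hot encoding sketch matching the fruit example above:
# one binary column per unique category, 1 marking the row's category.
def one_hot_encode(column):
    categories = sorted(set(column))
    rows = [[1 if value == c else 0 for c in categories] for value in column]
    return rows, categories

rows, cats = one_hot_encode(["Apple", "Banana", "Cranberry"])
print(cats)  # ['Apple', 'Banana', 'Cranberry']
print(rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```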
Dummy encoding
Dummy encoding is similar to one-hot encoding, where a categorical feature is transformed into a set of dummy variables. Dummy encoding uses N-1 dummy variables instead of N dummy variables to represent N categories for a feature.
For example, setting `CATEGORY_ENCODING_METHOD` to `'DUMMY_ENCODING'` for the same `fruit` feature column in the example above transforms the table to the following internal representation:
Row | fruit_Apple | fruit_Banana |
---|---|---|
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 0 |
In BigQuery ML, the category with the most occurrences in the training dataset is dropped. When multiple categories have the most occurrences, a random category within that set is dropped.
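A sketch of this dropping rule, under the assumptions above: N-1 indicator columns, with the most frequent category dropped. The tie-breaking here is arbitrary (BigQuery ML picks a random category among ties), and this is illustrative only:

```python
# Dummy-encoding sketch: encode N categories with N-1 binary columns,
# dropping the category with the most occurrences in the training data.
from collections import Counter

def dummy_encode(column):
    counts = Counter(column)
    dropped = counts.most_common(1)[0][0]  # most frequent category is dropped
    kept = sorted(c for c in counts if c != dropped)
    rows = [[1 if value == c else 0 for c in kept] for value in column]
    return rows, kept, dropped

rows, kept, dropped = dummy_encode(["Apple", "Banana", "Cranberry", "Cranberry"])
print(dropped)  # Cranberry
print(rows)     # [[1, 0], [0, 1], [0, 0], [0, 0]]
```

The dropped category is represented by all-zero rows, which is why its weight in `ML.WEIGHTS` is reported as 0.0.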
The final set of weights from `ML.WEIGHTS` still includes the dropped category, but its weight is always 0.0. For `ML.ADVANCED_WEIGHTS`, the standard error and p-value for the dropped variable are `NaN`.
If `warm_start` is used on a model that was initially trained with `'DUMMY_ENCODING'`, the same dummy variable that was dropped in the first training run is dropped again. Models cannot change encoding methods between training runs.