Automatic feature preprocessing
BigQuery ML performs automatic preprocessing during training by using the
CREATE MODEL statement.
Automatic preprocessing consists of missing value imputation and feature transformations.
For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.
For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
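Because preprocessing is automatic, you can train directly on raw columns of mixed types. The following is a minimal sketch; the dataset, table, and column names are hypothetical:

```sql
-- Hypothetical dataset, table, and column names. Automatic preprocessing
-- standardizes the numeric column, one-hot encodes the string column,
-- and imputes missing values, with no explicit preprocessing clauses.
CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['label']
) AS
SELECT
  age,    -- numeric: standardized; NULLs replaced with the column mean
  color,  -- string: one-hot encoded; NULLs mapped to their own category
  label
FROM `mydataset.sample_table`;
```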
Missing data imputation
In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data differently based on the type of data in the column.
| Column type | Imputation method |
| --- | --- |
| Numeric | In both training and prediction, NULL values in numeric columns are replaced with the mean value of the given column, as calculated over the feature column in the original input data. |
| One-hot/Multi-hot encoded | In both training and prediction, NULL values in encoded columns are mapped to an additional category that is added to the data. |
| STRUCT | In both training and prediction, each field of the STRUCT is imputed according to its type. |
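For example, a NULL feature value at prediction time doesn't cause an error; it is imputed as described above. A sketch using a hypothetical model and an inline input row:

```sql
-- Hypothetical model and columns. The NULL numeric value is replaced with
-- the column mean from training; a NULL string would be mapped to the
-- separate NULL category.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.sample_model`,
  (SELECT CAST(NULL AS INT64) AS age, 'red' AS color)
);
```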
By default, BigQuery ML transforms input features as follows:
| Input data type | Transformation method | Details |
| --- | --- | --- |
| INT64, NUMERIC, BIGNUMERIC, FLOAT64 | Standardization | For most models, BigQuery ML standardizes and centers numerical columns at zero before passing them into training. The exceptions are boosted tree and random forest models, for which no standardization occurs, and k-means models, where the STANDARDIZE_FEATURES option controls whether numeric features are standardized. |
| BOOL, STRING, BYTES, DATE, DATETIME, TIME | One-hot encoded | For all non-numerical, non-array columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column. |
| ARRAY | Multi-hot encoded | For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY. |
| TIMESTAMP | Timestamp transformation | When a linear or logistic regression model encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the unix time in seconds component, standardization is used; for all other components, one-hot encoding is used. For more information, see the timestamp feature transformation table below. |
| STRUCT | Struct expansion | When BigQuery ML encounters a STRUCT column, it expands the fields inside the STRUCT to create a single column for each field. It requires all fields of the STRUCT to be named. Nested STRUCTs are not allowed. |
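For k-means specifically, the STANDARDIZE_FEATURES option mentioned above is set at training time in CREATE MODEL. A sketch with hypothetical names:

```sql
-- Hypothetical dataset, table, and column names.
-- standardize_features controls numeric standardization for k-means.
CREATE OR REPLACE MODEL `mydataset.sample_kmeans`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4,
  standardize_features = TRUE
) AS
SELECT feature1, feature2
FROM `mydataset.sample_table`;
```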
TIMESTAMP feature transformation
The following table shows the components extracted from TIMESTAMP columns and the corresponding transformation method.
| TIMESTAMP component | Transformation method |
| --- | --- |
| Unix time in seconds | Standardization |
| Day of month | One-hot encoding |
| Day of week | One-hot encoding |
| Month of year | One-hot encoding |
| Hour of day | One-hot encoding |
| Minute of hour | One-hot encoding |
| Week of year (weeks begin on Sunday) | One-hot encoding |
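These components correspond to values you could compute yourself with GoogleSQL timestamp functions. The following sketch assumes a hypothetical ts column; note that EXTRACT(WEEK ...) in GoogleSQL also begins weeks on Sunday:

```sql
-- Hypothetical table and column names.
SELECT
  UNIX_SECONDS(ts)           AS unix_time_seconds,  -- standardized
  EXTRACT(DAY FROM ts)       AS day_of_month,       -- one-hot encoded
  EXTRACT(DAYOFWEEK FROM ts) AS day_of_week,        -- one-hot encoded
  EXTRACT(MONTH FROM ts)     AS month_of_year,      -- one-hot encoded
  EXTRACT(HOUR FROM ts)      AS hour_of_day,        -- one-hot encoded
  EXTRACT(MINUTE FROM ts)    AS minute_of_hour,     -- one-hot encoded
  EXTRACT(WEEK FROM ts)      AS week_of_year        -- one-hot encoded
FROM `mydataset.sample_table`;
```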
Category feature encoding
For features that are one-hot encoded, you can specify a different default encoding method by using the CATEGORY_ENCODING_METHOD model option. For generalized linear models (GLMs), you can set CATEGORY_ENCODING_METHOD to one of the following values: 'ONE_HOT_ENCODING' or 'DUMMY_ENCODING'.
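For example, you can opt a GLM into dummy encoding by setting the option at training time (dataset, table, and column names are hypothetical):

```sql
-- Hypothetical names; category_encoding_method overrides the default
-- one-hot encoding for this model's categorical features.
CREATE OR REPLACE MODEL `mydataset.glm_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['label'],
  category_encoding_method = 'DUMMY_ENCODING'
) AS
SELECT fruit, label
FROM `mydataset.sample_table`;
```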
One-hot encoding maps each category that a feature has to its own binary feature, where 0 represents the absence of the feature and 1 represents the presence (known as a dummy variable). This mapping creates N new feature columns, where N is the number of unique categories for the feature across the training table.
For example, suppose your training table has a feature column called fruit whose categories include Cranberry. In this case, the 'ONE_HOT_ENCODING' option transforms the table to an internal representation that contains one binary feature column for each category.
Dummy encoding is similar to one-hot encoding, where a categorical feature is transformed into a set of placeholder variables. Dummy encoding uses N-1 placeholder variables instead of N placeholder variables to represent the N categories for a feature.
For example, if you set CATEGORY_ENCODING_METHOD to 'DUMMY_ENCODING' for the fruit feature column from the preceding one-hot encoding example, then the table is transformed to an internal representation with one fewer placeholder column. The category with the most occurrences in the training dataset is dropped. When multiple categories have the most occurrences, a random category within that set is dropped.
The final set of weights returned from ML.WEIGHTS still includes the dropped category, but its weight is always 0.0, and the standard error and p-value for the dropped variable are NaN. If warm_start is used on a model that was initially trained with 'DUMMY_ENCODING', the same placeholder variable that was dropped in the first training run is dropped again. Models cannot change encoding methods between training runs.
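To inspect the per-category weights, including the dropped category's zero weight, you can unnest the category_weights array returned by ML.WEIGHTS. A sketch with a hypothetical model and feature name:

```sql
-- Hypothetical model and feature names. For a dummy-encoded feature,
-- the dropped category appears with weight 0.0.
SELECT
  processed_input,
  category.category,
  category.weight
FROM ML.WEIGHTS(MODEL `mydataset.glm_model`),
  UNNEST(category_weights) AS category
WHERE processed_input = 'fruit';
```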