Automatic feature preprocessing
BigQuery ML performs automatic preprocessing during training by using the
CREATE MODEL statement. Automatic preprocessing consists of missing value
imputation and feature transformations. For information about feature
preprocessing support in BigQuery ML, see Feature preprocessing overview.
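For example, the following is a minimal sketch of training with automatic preprocessing; the mydataset.sales table, the model name, and the column names are hypothetical. No explicit preprocessing steps are required:

```sql
-- Hypothetical table and model names; BigQuery ML imputes missing values
-- and transforms each feature based on its type automatically.
CREATE OR REPLACE MODEL `mydataset.sales_model`
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['total_sales']
) AS
SELECT
  store_id,     -- STRING: one-hot encoded
  visit_ts,     -- TIMESTAMP: component extraction (see below)
  num_items,    -- INT64: standardized
  total_sales   -- label column
FROM `mydataset.sales`;
```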
Missing data imputation
In statistics, imputation is used to replace missing data with substituted
values. When you train a model in BigQuery ML, NULL values are
treated as missing data. When you predict outcomes in BigQuery ML,
missing values can occur when BigQuery ML encounters a NULL
value or a previously unseen value. BigQuery ML handles missing
data differently based on the type of data in the column.
| Column type | Imputation method |
|---|---|
| Numeric | In both training and prediction, NULL values in numeric columns are replaced with the mean value of the given column, as calculated from the original input data. |
| One-hot/Multi-hot encoded | In both training and prediction, NULL values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction. |
| TIMESTAMP | TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces values with the mean Unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature. |
| STRUCT | In both training and prediction, each field of the STRUCT is imputed according to its type. |
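As an illustration, the following sketch trains a model on a hypothetical mydataset.customers table without any explicit NULL handling; numeric NULLs are replaced with the column mean and string NULLs are mapped to an extra category, as described in the preceding table:

```sql
-- Hypothetical names; rows with NULL age or city values can be passed
-- to training as-is.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT
  age,      -- numeric: NULLs imputed with the column mean
  city,     -- STRING: NULLs mapped to an additional category
  churned
FROM `mydataset.customers`;
```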
Feature transformations
By default, BigQuery ML transforms input features as follows:
| Input data type | Transformation method | Details |
|---|---|---|
| INT64, NUMERIC, BIGNUMERIC, FLOAT64 | Standardization | For most models, BigQuery ML standardizes and centers numerical columns at zero before passing them into training. The exceptions are boosted tree and random forest models, for which no standardization occurs, and k-means models, where the STANDARDIZE_FEATURES option controls whether numerical features are standardized. |
| BOOL, STRING, BYTES, DATE, DATETIME, TIME | One-hot encoding | For all non-numerical, non-array columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation for all models other than boosted tree and random forest models. This transformation generates a separate feature for each unique value in the column. For boosted tree and random forest models, a label encoding transformation is applied instead, converting each unique value into a numerical value. |
| ARRAY | Multi-hot encoding | For all non-numerical ARRAY columns, BigQuery ML performs a multi-hot encoding transformation. This transformation generates a separate feature for each unique element in the ARRAY. |
| TIMESTAMP | Timestamp transformation | When a linear or logistic regression model encounters a TIMESTAMP column, it extracts a set of components from the TIMESTAMP and performs a mix of standardization and one-hot encoding on the extracted components. For the Unix epoch time in seconds component, BigQuery ML uses standardization. For all other components, it uses one-hot encoding. For more information, see the following timestamp feature transformation table. |
| STRUCT | Struct expansion | When BigQuery ML encounters a STRUCT column, it expands the fields inside the STRUCT to create a single column for each field. All fields of the STRUCT must be named. Nested STRUCTs are not allowed. The column names after expansion are in the format {struct_name}_{field_name}. |
| ARRAY of STRUCT | No transformation | |
| ARRAY of NUMERIC | No transformation | |
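As an example of the k-means exception noted in the table, the STANDARDIZE_FEATURES option lets you control standardization explicitly. A sketch, with hypothetical table and model names:

```sql
-- STANDARDIZE_FEATURES defaults to TRUE for k-means; set it to FALSE
-- to cluster on the raw, unscaled numerical values.
CREATE OR REPLACE MODEL `mydataset.customer_segments`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 4,
  standardize_features = TRUE
) AS
SELECT annual_spend, visit_count
FROM `mydataset.customers`;
```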
TIMESTAMP feature transformation
The following table shows the components extracted from TIMESTAMP columns and
the corresponding transformation method.
| TIMESTAMP component | processed_input result | Transformation method | 
|---|---|---|
| Unix epoch time in seconds | [COLUMN_NAME] | Standardization | 
| Day of month | _TS_DOM_[COLUMN_NAME] | One-hot encoding | 
| Day of week | _TS_DOW_[COLUMN_NAME] | One-hot encoding | 
| Month of year | _TS_MOY_[COLUMN_NAME] | One-hot encoding | 
| Hour of day | _TS_HOD_[COLUMN_NAME] | One-hot encoding | 
| Minute of hour | _TS_MOH_[COLUMN_NAME] | One-hot encoding | 
| Week of year (weeks begin on Sunday) | _TS_WOY_[COLUMN_NAME] | One-hot encoding | 
| Year | _TS_YEAR_[COLUMN_NAME] | One-hot encoding | 
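To check which components were extracted from a TIMESTAMP column, you can list the processed_input values of a trained linear model with ML.WEIGHTS. A sketch, reusing the hypothetical model from the earlier example:

```sql
-- For a model trained on a TIMESTAMP column named visit_ts, the results
-- include entries such as _TS_DOW_visit_ts and _TS_HOD_visit_ts.
SELECT processed_input
FROM ML.WEIGHTS(MODEL `mydataset.sales_model`);
```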
Category feature encoding
For features that are one-hot encoded, you can specify a different default
encoding method by using the model option CATEGORY_ENCODING_METHOD. For
generalized linear models (GLMs), you can set CATEGORY_ENCODING_METHOD
to one of the following values:
One-hot encoding
One-hot encoding maps each category that a feature has to its own binary
feature, where 0 represents the absence of the feature and 1 represents the
presence (known as a dummy variable). This mapping creates N new feature
columns, where N is the number of unique categories for the feature across
the training table.
For example, suppose your training table has a feature column that's called
fruit with the categories Apple, Banana, and Cranberry, such as the
following:
| Row | fruit | 
|---|---|
| 1 | Apple | 
| 2 | Banana | 
| 3 | Cranberry | 
In this case, the CATEGORY_ENCODING_METHOD='ONE_HOT_ENCODING' option
transforms the table to the following internal representation:
| Row | fruit_Apple | fruit_Banana | fruit_Cranberry | 
|---|---|---|---|
| 1 | 1 | 0 | 0 | 
| 2 | 0 | 1 | 0 | 
| 3 | 0 | 0 | 1 | 
One-hot encoding is supported by linear regression, logistic regression, and boosted tree models.
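For example, the following sketch requests one-hot encoding for a boosted tree model, which would otherwise default to label encoding; the table and model names are hypothetical.

```sql
CREATE OR REPLACE MODEL `mydataset.fruit_classifier`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  category_encoding_method = 'ONE_HOT_ENCODING',
  input_label_cols = ['label']
) AS
SELECT
  fruit,  -- STRING: expanded into fruit_Apple, fruit_Banana, fruit_Cranberry
  label
FROM `mydataset.fruit_table`;
```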
Dummy encoding
Dummy encoding is similar to one-hot encoding: a categorical feature is
transformed into a set of placeholder variables. The difference is that dummy
encoding uses N-1 placeholder variables instead of N to represent the N
categories of a feature.
For example, if you set CATEGORY_ENCODING_METHOD to 'DUMMY_ENCODING' for
the same fruit feature column shown in the preceding one-hot encoding example,
then the table is transformed to the following internal representation:
| Row | fruit_Apple | fruit_Banana | 
|---|---|---|
| 1 | 1 | 0 | 
| 2 | 0 | 1 | 
| 3 | 0 | 0 | 
The category with the most occurrences in the training dataset is dropped. When multiple categories tie for the most occurrences, a random category within that set is dropped.
The final set of weights from
ML.WEIGHTS
still includes the dropped category, but its weight is always 0.0. For
ML.ADVANCED_WEIGHTS,
the standard error and p-value for the dropped variable are NaN.
If warm_start is used on a model that was initially trained with
'DUMMY_ENCODING', the same placeholder variable that was dropped in the first
training run is dropped again. Models can't change encoding methods between
training runs.
Dummy encoding is supported by linear and logistic regression models.
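The following sketch applies dummy encoding to a logistic regression model and then inspects the weights; the dropped category appears in the ML.WEIGHTS output with a weight of 0.0. All table and model names are hypothetical.

```sql
CREATE OR REPLACE MODEL `mydataset.fruit_glm`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  category_encoding_method = 'DUMMY_ENCODING',
  input_label_cols = ['label']
) AS
SELECT fruit, label
FROM `mydataset.fruit_table`;

-- The dropped category is still listed, with a weight of 0.0.
SELECT processed_input, category_weights
FROM ML.WEIGHTS(MODEL `mydataset.fruit_glm`);
```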
Label encoding
Label encoding transforms the value of a categorical feature to an INT64 value
in [0, <number of categories>].
For example, if you had a book dataset like the following:
| Title | Genre | 
|---|---|
| Book 1 | Fantasy | 
| Book 2 | Cooking | 
| Book 3 | History | 
| Book 4 | Cooking | 
The label encoded values might look similar to the following:
| Title | Genre (text) | Genre (numeric) | 
|---|---|---|
| Book 1 | Fantasy | 1 | 
| Book 2 | Cooking | 2 | 
| Book 3 | History | 3 | 
| Book 4 | Cooking | 2 | 
The encoding vocabulary is sorted alphabetically. NULL values and categories
that aren't in the vocabulary are encoded to 0.
Label encoding is supported by boosted tree models.
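For example, a sketch that label encodes the Genre column while training a boosted tree model; the dataset and model names are hypothetical.

```sql
-- Each genre is mapped to an integer; 0 is reserved for NULL values
-- and categories not seen during training.
CREATE OR REPLACE MODEL `mydataset.book_sales_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_REGRESSOR',
  category_encoding_method = 'LABEL_ENCODING',
  input_label_cols = ['sales']
) AS
SELECT genre, sales
FROM `mydataset.books`;
```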
Target encoding
Target encoding replaces the categorical feature value with the probability of the target for classification models, or with the expected value of the target for regression models.
Features that have been target encoded might look similar to the following example:
Classification model:

| Original value | Target encoded value |
|---|---|
| (category_1, target_1) | 0.5 |
| (category_1, target_2) | 0.5 |
| (category_2, target_1) | 0.0 |

Regression model:

| Original value | Target encoded value |
|---|---|
| (category_1, 2) | 2.5 |
| (category_1, 3) | 2.5 |
| (category_2, 1) | 1.5 |
| (category_2, 2) | 1.5 |
Target encoding is supported by boosted tree models.
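For example, a sketch that applies target encoding while training a boosted tree classifier; the table and model names are hypothetical.

```sql
-- Each category value is replaced with a statistic of the label within
-- that category, as shown in the preceding tables.
CREATE OR REPLACE MODEL `mydataset.encoded_classifier`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  category_encoding_method = 'TARGET_ENCODING',
  input_label_cols = ['label']
) AS
SELECT category, label
FROM `mydataset.examples`;
```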
What's next
For more information about supported SQL statements and functions for models that support automatic feature preprocessing, see the following documents: