For information about feature preprocessing support in BigQuery ML, see Feature preprocessing overview.
To learn which model types each SQL statement and function supports, and which SQL statements and functions each model type supports, see End-to-end user journey for each model.
Missing data imputation
In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, `NULL` values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a `NULL` value or a previously unseen value.

BigQuery ML handles missing data based on whether the column is numeric, one-hot encoded, or a timestamp.
| Column type | Imputation method |
| --- | --- |
| Numerical | In both training and prediction, `NULL` values in numeric columns are replaced with the mean value of the feature column, calculated from the original input data. |
| One-hot/Multi-hot encoded | In both training and prediction, `NULL` values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction. |
| Timestamp | `TIMESTAMP` columns use a mixture of the imputation methods for standardized and one-hot encoded columns. For the generated unix time column, BigQuery ML replaces missing values with the mean unix time across the original columns. For the other generated values, BigQuery ML assigns them to the respective `NULL` category for each extracted feature. |
| `STRUCT` | In both training and prediction, each field of the `STRUCT` is imputed according to its type. |
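
The numeric and one-hot rows of the table above can be illustrated with a small Python sketch. This is not BigQuery ML's implementation: the function names and the `NULL_CATEGORY` sentinel are hypothetical, and representing a previously unseen value as an all-zero vector is one assumed reading of "assigned a weight of 0".

```python
from statistics import fmean

# Hypothetical sentinel standing in for the extra category added for NULLs.
NULL_CATEGORY = "__NULL__"


def fit_numeric_mean(values):
    """Mean of the observed (non-NULL) training values."""
    return fmean(v for v in values if v is not None)


def impute_numeric(values, mean):
    """Replace NULLs with the mean learned from the training data."""
    return [mean if v is None else v for v in values]


def fit_vocabulary(values):
    """Categories seen in training, plus one additional NULL category."""
    return sorted({v for v in values if v is not None}) + [NULL_CATEGORY]


def one_hot(value, vocabulary):
    """Encode one value; an unseen value matches nothing, so every slot is 0."""
    key = NULL_CATEGORY if value is None else value
    return [1 if key == category else 0 for category in vocabulary]


train_numeric = [1.0, None, 3.0]
mean = fit_numeric_mean(train_numeric)
imputed = impute_numeric(train_numeric, mean)

vocabulary = fit_vocabulary(["red", None, "blue"])
null_row = one_hot(None, vocabulary)
unseen_row = one_hot("green", vocabulary)
```

The same mean and vocabulary are reused at prediction time, which is what lets `NULL` values be handled consistently in both phases.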
By default, BigQuery ML transforms input features as follows:
| Input data type | Transformation method | Details |
| --- | --- | --- |
| | Standardization | For all numerical columns, BigQuery ML standardizes and centers the column at zero before passing it into training for all models with the exception of Boosted Tree models. When creating a k-means model, the `STANDARDIZE_FEATURES` option specifies whether to standardize numerical features. |
| | One-hot encoded | For all non-numerical non-array columns other than |
| | Multi-hot encoded | For all non-numerical |
| | Timestamp transformation | When a linear or logistic regression model encounters a |
| | Struct expansion | When BigQuery ML encounters a |

For more information, see the timestamp feature transformation table below.
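
As a rough illustration of the standardization row, the following Python sketch centers a numeric column at zero and scales it to unit variance. It is not BigQuery ML's implementation: the function names are hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions made for this example.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and standard deviation from the training column."""
    # Assumption: population standard deviation; the source does not say which is used.
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Center at zero and scale to unit variance."""
    if std == 0:
        # Assumption: a constant column carries no signal, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]
mean, std = fit_standardizer(column)
scaled = standardize(column, mean, std)
```

After this step the column has mean zero, which matches the "centers the column at zero" wording in the table.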
TIMESTAMP feature transformation
The following table shows the components extracted from `TIMESTAMP` values and the corresponding transformation method.
| `TIMESTAMP` component | Transformation method |
| --- | --- |
| Unix time in seconds | |
| Day of month | |
| Day of week | |
| Month of year | |
| Hour of day | |
| Minute of hour | |
| Week of year (weeks begin on Sunday) | |
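
To make the listed components concrete, here is a minimal Python sketch that derives each of them from a single timezone-aware timestamp. It only illustrates the decomposition and is not how BigQuery ML computes these features: the function name and dictionary keys are hypothetical, and the numbering conventions (UTC, day of week counted from Sunday = 1, week of year taken from `strftime("%U")`) are assumptions.

```python
from datetime import datetime, timezone


def extract_timestamp_components(ts: datetime) -> dict:
    """Split a timezone-aware timestamp into the components listed above."""
    ts = ts.astimezone(timezone.utc)  # Assumption: components are taken in UTC.
    return {
        "unix_time_seconds": int(ts.timestamp()),
        "day_of_month": ts.day,
        # Assumption: Sunday = 1 through Saturday = 7.
        "day_of_week": ts.isoweekday() % 7 + 1,
        "month_of_year": ts.month,
        "hour_of_day": ts.hour,
        "minute_of_hour": ts.minute,
        # %U counts weeks starting on Sunday; days before the first Sunday are week 0.
        "week_of_year": int(ts.strftime("%U")),
    }


components = extract_timestamp_components(
    datetime(2024, 3, 10, 14, 30, tzinfo=timezone.utc)  # a Sunday
)
```

A single `TIMESTAMP` column therefore becomes several features, each of which can then receive its own transformation.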