When training on BigQuery ML, there are two kinds of data preprocessing:
- Automatic preprocessing. BigQuery ML performs automatic
preprocessing during training. For more information, see the
CREATE MODEL
statement. - Manual preprocessing. BigQuery ML provides the
TRANSFORM
clause for you to define custom preprocessing using the preprocessing functions on this page. You can also use these functions outside theTRANSFORM
clause.
BigQuery ML includes automatic preprocessing in the model export but does not include manual preprocessing.
There are two types of preprocessing functions, scalar and analytic:
- Scalar functions operate on a single row (for example,
ML.BUCKETIZE
). - Analytic functions operate on all rows (for example,
ML.QUANTILE_BUCKETIZE
) and output the result for each row based on the statistics collected across all rows.
ML analytic functions are different from normal analytic functions in that
you always use them with an empty OVER()
clause.
When ML analytic functions are used inside the
TRANSFORM
clause during training, the same statistics are automatically applied to
the input in prediction.
ML.BUCKETIZE
ML.BUCKETIZE(numerical_expression, array_split_points[, exclude_boundaries])
Description
Returns a STRING as the bucket name after numerical_expression
is split
into buckets by array_split_points
.
The returned bucket names are in the format of bin_<bucket_index>
, and
bucket_index
starts from 1. The range of each bucket is [lower_bound,
upper_bound).
Input
numerical_expression
: Numerical expression to bucketize.array_split_points
: Sorted numerical array with points to splitnumerical_expression
with.exclude_boundaries
: Optional BOOL. IfTRUE
, the two boundaries are removed fromarray_split_points
. The default value isFALSE
.
Return data type
STRING
Example
SELECT
ML.BUCKETIZE(2.5, [1, 2, 3]) AS bucket,
ML.BUCKETIZE(2.5, [1, 2, 3], TRUE) AS bucket_without_boundaries;
+---------+----------------------------+
| bucket | bucket_without_boundaries |
+---------+----------------------------+
| bin_3 | bin_2 |
+---------+----------------------------+
ML.POLYNOMIAL_EXPAND
ML.POLYNOMIAL_EXPAND(struct_numerical_features[, degree])
Description
Returns a STRUCT with all polynomial combinations of the numerical input features with degree no larger than the passed-in degree including the original features. The field names of the output STRUCT are concatenations of the original feature names.
Input
struct_numerical_features
: Numerical features wrapped in a STRUCT to be expanded. The maximum number of input features in the STRUCT is 10. Unnamed features are not allowed instruct_numerical_features
. Duplicates are not allowed.degree
: Optional INT64. The highest degree of all combinations in the range of [1, 4]. The default value is 2.
Return Data Type
STRUCT of DOUBLE
Example
SELECT
ML.POLYNOMIAL_EXPAND(STRUCT(2 AS f1, 3 AS f2)) AS output;
+-------------------------------------------------------------------+
| output |
+-------------------------------------------------------------------+
| {"f1":"2.0","f1_f1":"4.0","f1_f2":"6.0","f2":"3.0","f2_f2":"9.0"} |
+-------------------------------------------------------------------+
ML.FEATURE_CROSS
ML.FEATURE_CROSS(struct_categorical_features[, degree])
Description
Returns a STRUCT with all combinations of crossed categorical features with degree no larger than the passed in degree, except for 1-degree items (the original features) and self-crossing items. The field names of the output STRUCT are concatenations of the original feature names.
Input
struct_categorical_features
: Categorical features wrapped in a STRUCT to be crossed. The maximum number of input features in the STRUCT is 10. Unnamed features aren't allowed instruct_numerical_features
. Duplicates aren't allowed.degree
: Optional INT64. The highest degree of all combinations in the range of [2, 4]. The default value is 2.
Return data type
STRUCT of STRING
Example
SELECT
ML.FEATURE_CROSS(STRUCT('a' AS f1, 'b' AS f2, 'c' AS f3)) AS output;
+---------------------------------------------+
| output |
+---------------------------------------------+
| {"f1_f2":"a_b","f1_f3":"a_c","f2_f3":"b_c"} |
+---------------------------------------------+
ML.NGRAMS
ML.NGRAMS(array_input, range[, separator])
Description
Returns an ARRAY of STRING as n-grams by merging adjacent tokens in the
array_input
.
Input
array_input
: ARRAY of STRING. The strings are the tokens to be merged.range
: ARRAY of 2 INT64 elements or a single Int64. These two sorted INT64 elements in the ARRAY input are the range of n-gram sizes to return. A single Int64 x is equivalent to the range of [x, x].separator
: Optional STRING. The separator to connect two adjacent tokens in the output. The default value is whitespace.
Return data type
ARRAY of STRING
Example
SELECT
ML.NGRAMS(['a', 'b', 'c'], [2,3], '#') AS output;
+-----------------------+
| output |
+-----------------------+
| ["a#b","a#b#c","b#c"] |
+-----------------------+
ML.QUANTILE_BUCKETIZE
ML.QUANTILE_BUCKETIZE(numerical_expression, num_buckets) OVER()
Description
Bucketizes a continuous numerical feature into a STRING with the bucket name as the value based on quantiles.
The returned bucket names are in the format of bin_<bucket_index>
, and
bucket_index
starts from 1.
When used in a
TRANSFORM
clause, the same quantiles are automatically used in prediction.
Input
numerical_expression
: Numerical expression to bucketize.num_buckets
: INT64. The number of buckets to splitnumerical_expression
into.
Return data type
STRING
Example
SELECT
f, ML.QUANTILE_BUCKETIZE(f, 3) OVER() AS bucket
FROM
UNNEST([1,2,3,4,5]) AS f;
+---+--------+
| f | bucket |
+---+--------+
| 3 | bin_2 |
| 5 | bin_3 |
| 2 | bin_2 |
| 1 | bin_1 |
| 4 | bin_3 |
+---+--------+
ML.MIN_MAX_SCALER
ML.MIN_MAX_SCALER(numerical_expression) OVER()
Description
Scales the numerical_expression
to [0, 1] capped with MIN and MAX
across all rows.
When used in
TRANSFORM
clause, the same MIN and MAX are automatically used in prediction. If the
prediction data is outside the [MIN, MAX] range, it is capped to either
0 or 1.
Input
numerical_expression
: Numerical expression to scale.
Return data type
DOUBLE
Example
SELECT
f, ML.MIN_MAX_SCALER(f) OVER() AS output
FROM
UNNEST([1,2,3,4,5]) AS f;
+---+--------+
| f | output |
+---+--------+
| 4 | 0.75 |
| 2 | 0.25 |
| 1 | 0.0 |
| 3 | 0.5 |
| 5 | 1.0 |
+---+--------+
ML.STANDARD_SCALER
ML.STANDARD_SCALER(numerical_expression) OVER()
Description
Standardizes numerical_expression
.
When this is used in a TRANSFORM clause,
the STDDEV
and MEAN
calculated to standardize the expression are
automatically used in prediction.
Input
numerical_expression
: Numerical expression to scale.
Return data type
DOUBLE
Example
SELECT
f, ML.STANDARD_SCALER(f) OVER() AS output
FROM
UNNEST([1,2,3,4,5]) AS f;
+---+---------------------+
| f | output |
+---+---------------------+
| 1 | -1.2649110640673518 |
| 5 | 1.2649110640673518 |
| 2 | -0.6324555320336759 |
| 4 | 0.6324555320336759 |
| 3 | 0.0 |
+---+---------------------+