Preprocessing functions

When training on BigQuery ML, there are two kinds of data preprocessing:

  • Automatic preprocessing. BigQuery ML performs automatic preprocessing during training. For more information, see the CREATE MODEL statement.
  • Manual preprocessing. BigQuery ML provides the TRANSFORM clause for you to define custom preprocessing using the preprocessing functions on this page. You can also use these functions outside the TRANSFORM clause.

BigQuery ML includes automatic preprocessing in the model export but does not include manual preprocessing.

There are two types of preprocessing functions, scalar and analytic:

  • Scalar functions operate on a single row (for example, ML.BUCKETIZE).
  • Analytic functions operate on all rows (for example, ML.QUANTILE_BUCKETIZE) and output the result for each row based on the statistics collected across all rows.

ML analytic functions are different from normal analytic functions in that you always use them with an empty OVER() clause.

When ML analytic functions are used inside the TRANSFORM clause during training, the same statistics are automatically applied to the input in prediction.

ML.BUCKETIZE

ML.BUCKETIZE(numerical_expression, array_split_points[, exclude_boundaries])

Description

Returns a STRING as the bucket name after numerical_expression is split into buckets by array_split_points.

The returned bucket names are in the format of bin_<bucket_index>, and bucket_index starts from 1. The range of each bucket is [lower_bound, upper_bound).

Input

  • numerical_expression: Numerical expression to bucketize.
  • array_split_points: Sorted numerical array with points to split numerical_expression with.
  • exclude_boundaries: Optional BOOL. If TRUE, the two boundaries are removed from array_split_points. The default value is FALSE.

Return data type

STRING

Example

SELECT
  ML.BUCKETIZE(2.5, [1, 2, 3]) AS bucket,
  ML.BUCKETIZE(2.5, [1, 2, 3], TRUE) AS bucket_without_boundaries;

+---------+----------------------------+
| bucket  | bucket_without_boundaries  |
+---------+----------------------------+
| bin_3   | bin_2                      |
+---------+----------------------------+

ML.POLYNOMIAL_EXPAND

ML.POLYNOMIAL_EXPAND(struct_numerical_features[, degree])

Description

Returns a STRUCT with all polynomial combinations of the numerical input features with degree no larger than the passed-in degree including the original features. The field names of the output STRUCT are concatenations of the original feature names.

Input

  • struct_numerical_features: Numerical features wrapped in a STRUCT to be expanded. The maximum number of input features in the STRUCT is 10. Unnamed features are not allowed in struct_numerical_features. Duplicates are not allowed.
  • degree: Optional INT64. The highest degree of all combinations in the range of [1, 4]. The default value is 2.

Return Data Type

STRUCT of DOUBLE

Example

SELECT
  ML.POLYNOMIAL_EXPAND(STRUCT(2 AS f1, 3 AS f2)) AS output;

+-------------------------------------------------------------------+
|                              output                               |
+-------------------------------------------------------------------+
| {"f1":"2.0","f1_f1":"4.0","f1_f2":"6.0","f2":"3.0","f2_f2":"9.0"} |
+-------------------------------------------------------------------+

ML.FEATURE_CROSS

ML.FEATURE_CROSS(struct_categorical_features[, degree])

Description

Returns a STRUCT with all combinations of crossed categorical features with degree no larger than the passed in degree, except for 1-degree items (the original features) and self-crossing items. The field names of the output STRUCT are concatenations of the original feature names.

Input

  • struct_categorical_features: Categorical features wrapped in a STRUCT to be crossed. The maximum number of input features in the STRUCT is 10. Unnamed features aren't allowed in struct_numerical_features. Duplicates aren't allowed.
  • degree: Optional INT64. The highest degree of all combinations in the range of [2, 4]. The default value is 2.

Return data type

STRUCT of STRING

Example

SELECT
  ML.FEATURE_CROSS(STRUCT('a' AS f1, 'b' AS f2, 'c' AS f3)) AS output;

+---------------------------------------------+
|                   output                    |
+---------------------------------------------+
| {"f1_f2":"a_b","f1_f3":"a_c","f2_f3":"b_c"} |
+---------------------------------------------+

ML.NGRAMS

ML.NGRAMS(array_input, range[, separator])

Description

Returns an ARRAY of STRING as n-grams by merging adjacent tokens in the array_input.

Input

  • array_input: ARRAY of STRING. The strings are the tokens to be merged.
  • range: ARRAY of 2 INT64 elements or a single Int64. These two sorted INT64 elements in the ARRAY input are the range of n-gram sizes to return. A single Int64 x is equivalent to the range of [x, x].
  • separator: Optional STRING. The separator to connect two adjacent tokens in the output. The default value is whitespace .

Return data type

ARRAY of STRING

Example

SELECT
  ML.NGRAMS(['a', 'b', 'c'], [2,3], '#') AS output;

+-----------------------+
|        output         |
+-----------------------+
| ["a#b","a#b#c","b#c"] |
+-----------------------+

ML.QUANTILE_BUCKETIZE

ML.QUANTILE_BUCKETIZE(numerical_expression, num_buckets) OVER()

Description

Bucketizes a continuous numerical feature into a STRING with the bucket name as the value based on quantiles.

The returned bucket names are in the format of bin_<bucket_index>, and bucket_index starts from 1.

When used in a TRANSFORM clause, the same quantiles are automatically used in prediction.

Input

  • numerical_expression: Numerical expression to bucketize.
  • num_buckets: INT64. The number of buckets to split numerical_expression into.

Return data type

STRING

Example

SELECT
  f, ML.QUANTILE_BUCKETIZE(f, 3) OVER() AS bucket
FROM
  UNNEST([1,2,3,4,5]) AS f;

+---+--------+
| f | bucket |
+---+--------+
| 3 | bin_2  |
| 5 | bin_3  |
| 2 | bin_2  |
| 1 | bin_1  |
| 4 | bin_3  |
+---+--------+

ML.MIN_MAX_SCALER

ML.MIN_MAX_SCALER(numerical_expression) OVER()

Description

Scales the numerical_expression to [0, 1] capped with MIN and MAX across all rows.

When used in TRANSFORM clause, the same MIN and MAX are automatically used in prediction. If the prediction data is outside the [MIN, MAX] range, it is capped to either 0 or 1.

Input

  • numerical_expression: Numerical expression to scale.

Return data type

DOUBLE

Example

SELECT
  f, ML.MIN_MAX_SCALER(f) OVER() AS output
FROM
  UNNEST([1,2,3,4,5]) AS f;

+---+--------+
| f | output |
+---+--------+
| 4 |   0.75 |
| 2 |   0.25 |
| 1 |    0.0 |
| 3 |    0.5 |
| 5 |    1.0 |
+---+--------+

ML.STANDARD_SCALER

ML.STANDARD_SCALER(numerical_expression) OVER()

Description

Standardizes numerical_expression.

When this is used in a TRANSFORM clause, the STDDEV and MEAN calculated to standardize the expression are automatically used in prediction.

Input

  • numerical_expression: Numerical expression to scale.

Return data type

DOUBLE

Example

SELECT
  f, ML.STANDARD_SCALER(f) OVER() AS output
FROM
  UNNEST([1,2,3,4,5]) AS f;

+---+---------------------+
| f |       output        |
+---+---------------------+
| 1 | -1.2649110640673518 |
| 5 |  1.2649110640673518 |
| 2 | -0.6324555320336759 |
| 4 |  0.6324555320336759 |
| 3 |                 0.0 |
+---+---------------------+