The ML.QUANTILE_BUCKETIZE function

This document describes the ML.QUANTILE_BUCKETIZE function, which lets you break a continuous numerical feature into buckets based on quantiles.

When used in the TRANSFORM clause, the same quantiles are automatically used in prediction.

Syntax

ML.QUANTILE_BUCKETIZE(numerical_expression, num_buckets [, output_format]) OVER()

Arguments

ML.QUANTILE_BUCKETIZE takes the following arguments:

  • numerical_expression: the numerical expression to bucketize.
  • num_buckets: an INT64 value that specifies the number of buckets to split numerical_expression into.
  • output_format: a STRING value that specifies the output format of the bucket. Valid output formats are as follows:
    • bucket_names: returns a STRING value in the format bin_<bucket_index>. For example, bin_3. The bucket_index value starts at 1. This is the default bucket format.
    • bucket_ranges: returns a STRING value in the format [lower_bound, upper_bound) in interval notation. For example, (-inf, 2.5), [2.5, 4.6), [4.6, +inf).
    • bucket_ranges_json: returns a JSON-formatted STRING value in the format {"start": "lower_bound", "end": "upper_bound"}. For example, {"start": "-Infinity", "end": "2.5"}, {"start": "2.5", "end": "4.6"}, {"start": "4.6", "end": "Infinity"}. The inclusivity and exclusivity of the lower and upper bound follow the same pattern as the bucket_ranges option.

Output

ML.QUANTILE_BUCKETIZE returns a STRING value that contains the name of the bucket, in the format specified by the output_format argument.

Example

The following example breaks a numerical expression of five elements into three buckets:

SELECT
  f,
  ML.QUANTILE_BUCKETIZE(f, 3) OVER() AS bucket,
  ML.QUANTILE_BUCKETIZE(f, 3, "bucket_ranges") OVER() AS bucket_ranges,
  ML.QUANTILE_BUCKETIZE(f, 3, "bucket_ranges_json") OVER() AS bucket_ranges_json
FROM
  UNNEST([1,2,3,4,5]) AS f
ORDER BY f;

The output looks similar to the following:

+---+--------+---------------+------------------------------------+
| f | bucket | bucket_ranges | bucket_ranges_json                 |
|---|--------|---------------|------------------------------------|
| 1 | bin_1  | (-inf, 2)     | {"start": "-Infinity", "end": "2"} |
| 2 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 3 | bin_2  | [2, 4)        | {"start": "2", "end": "4"}         |
| 4 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
| 5 | bin_3  | [4, +inf)     | {"start": "4", "end": "Infinity"}  |
+---+--------+---------------+------------------------------------+

What's next