The ML.ONE_HOT_ENCODER function
This document describes the ML.ONE_HOT_ENCODER function, which lets you encode a string expression using a one-hot or dummy encoding scheme.
The encoding vocabulary is sorted alphabetically. NULL values and categories that aren't in the vocabulary are encoded with an index value of 0. If you use dummy encoding, the dropped category is encoded with a value of 0.
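As a rough illustration of the difference between the two schemes, the following sketch encodes the same column twice, once with one-hot encoding and once with dummy encoding. The inline data and the column aliases one_hot and dummy are made up for this illustration.

```sql
-- Hypothetical inline data; 'c' is the most frequent category.
SELECT
  f,
  -- One-hot encoding: every category in the vocabulary is retained.
  ML.ONE_HOT_ENCODER(f, 'none', 10, 0) OVER () AS one_hot,
  -- Dummy encoding: the most frequent category is dropped.
  ML.ONE_HOT_ENCODER(f, 'most_frequent', 10, 0) OVER () AS dummy
FROM
  UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
```

Under the rules described above, the rows for 'c' should carry a value of 0 in the dummy column, and the NULL row should receive an index of 0 in both columns.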
When used in the TRANSFORM clause, the vocabulary and dropped category values calculated during training, along with the top k and frequency threshold values that you specified, are automatically used in prediction.
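The following sketch shows one way this might look in practice. The dataset, table, model, and column names (mydataset, training_data, new_data, mymodel, category, label) are all hypothetical, and the model type is chosen only for illustration; confirm that your model type accepts the encoded output before relying on this pattern.

```sql
-- Hypothetical dataset, table, model, and column names throughout.
CREATE OR REPLACE MODEL `mydataset.mymodel`
  TRANSFORM(
    -- The vocabulary and dropped category are calculated here, during training.
    ML.ONE_HOT_ENCODER(category, 'most_frequent', 100, 5) OVER () AS category_encoded,
    label
  )
  OPTIONS (
    -- Model type picked for illustration only (an assumption, not a requirement).
    model_type = 'logistic_reg',
    input_label_cols = ['label']
  )
AS
SELECT category, label
FROM `mydataset.training_data`;

-- At prediction time, pass the raw column. The encoding stored with the
-- model is applied automatically, so the function is not called again.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.mymodel`,
  (SELECT category FROM `mydataset.new_data`)
);
```

Because the encoding lives in the TRANSFORM clause, the prediction query doesn't need to repeat the top k or frequency threshold values.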
Syntax
ML.ONE_HOT_ENCODER(string_expression [, drop] [, top_k] [, frequency_threshold]) OVER()
Arguments
ML.ONE_HOT_ENCODER takes the following arguments:
- string_expression: the STRING expression to encode.
- drop: a STRING value that specifies whether the function drops a category. Valid values are as follows:
  - none: Retain all categories. This is the default value.
  - most_frequent: Drop the most frequent category found in the string expression. Selecting this value causes the function to use dummy encoding.
- top_k: an INT64 value that specifies the number of categories included in the encoding vocabulary. The function selects the top_k most frequent categories in the data and uses those; all other categories are encoded to 0. This value must be less than 1,000,000 to avoid problems due to high dimensionality. The default value is 32,000.
- frequency_threshold: an INT64 value that limits the categories included in the encoding vocabulary based on category frequency. The function uses categories whose frequency is greater than or equal to frequency_threshold; categories below this threshold are encoded to 0. The default value is 5.
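To see how top_k and frequency_threshold each restrict the vocabulary, a query along the following lines can help. The inline data and the aliases top_2 and min_3 are invented for this sketch, and the data is chosen so that no two categories share the same frequency.

```sql
-- Hypothetical inline data: 'a' occurs once, 'b' twice, 'c' three times.
SELECT
  f,
  -- top_k = 2: only the two most frequent categories ('c' and 'b')
  -- enter the vocabulary.
  ML.ONE_HOT_ENCODER(f, 'none', 2, 0) OVER () AS top_2,
  -- frequency_threshold = 3: only categories that occur at least
  -- three times ('c') enter the vocabulary.
  ML.ONE_HOT_ENCODER(f, 'none', 10, 3) OVER () AS min_3
FROM
  UNNEST(['a', 'b', 'b', 'c', 'c', 'c']) AS f
ORDER BY f;
```

Based on the argument descriptions, 'a' should be encoded to 0 in both columns, and 'b' should be encoded to 0 only in the min_3 column. Note that with the default frequency_threshold of 5, none of the categories in a small sample like this one would qualify for the vocabulary.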
Output
ML.ONE_HOT_ENCODER returns an array of struct values, in the form ARRAY<STRUCT<INT64, FLOAT64>>. The first element in the struct provides the index of the encoded string expression, and the second element provides the value of the encoded string expression.
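If you need the index and value as ordinary columns, one possible approach is to unnest the returned array. This sketch assumes the struct fields are named index and value, as the column headers in the example output suggest; the inline data and the aliases encoded, element, encoded_index, and encoded_value are hypothetical.

```sql
WITH encoded AS (
  SELECT
    f,
    ML.ONE_HOT_ENCODER(f, 'none', 10, 0) OVER () AS output
  FROM
    UNNEST(['a', 'b', 'b', 'c']) AS f  -- hypothetical inline data
)
SELECT
  f,
  -- Field names index and value are an assumption based on the
  -- output.index and output.value headers in the example output.
  element.index AS encoded_index,
  element.value AS encoded_value
FROM
  encoded,
  UNNEST(output) AS element
ORDER BY f;
```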
Example
The following example performs dummy encoding on a set of string expressions. It limits the encoding vocabulary to the ten most frequent categories in the data and sets the frequency threshold to 0, so no category is excluded for occurring too rarely.
SELECT
  f,
  ML.ONE_HOT_ENCODER(f, 'most_frequent', 10, 0) OVER () AS output
FROM
  UNNEST([NULL, 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd']) AS f
ORDER BY f;
The output looks similar to the following:
+------+-----------------------------+
| f    | output.index | output.value |
+------+--------------+--------------+
| NULL | 0            | 1.0          |
| a    | 1            | 1.0          |
| b    | 2            | 1.0          |
| b    | 2            | 1.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| c    | 3            | 0.0          |
| d    | 4            | 1.0          |
| d    | 4            | 1.0          |
+------+-----------------------------+
What's next
- For information about feature preprocessing, see Feature preprocessing overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.