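The ML.BAG_OF_WORDS function
Use the ML.BAG_OF_WORDS function to compute a representation of tokenized documents as a bag (multiset) of words, disregarding word order and grammar. You can use ML.BAG_OF_WORDS within the TRANSFORM clause.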
Syntax
ML.BAG_OF_WORDS( tokenized_document [, top_k] [, frequency_threshold] ) OVER()
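Arguments
ML.BAG_OF_WORDS takes the following arguments:
- tokenized_document: an ARRAY<STRING> value that represents a tokenized document. A tokenized document is a collection of terms (tokens) that are used for text analysis. For more information about tokenization in BigQuery, see TEXT_ANALYZE.
- top_k: optional. An INT64 value that represents the size of the dictionary, excluding the unknown term. The top_k terms that appear in the most documents are added to the dictionary until this limit is reached. For example, if this value is 20, the 20 unique terms that appear in the most documents are added to the dictionary, and no further terms are added.
- frequency_threshold: optional. An INT64 value that represents the minimum number of documents a term must appear in to be added to the dictionary. For example, if this value is 3, a term must appear in at least three tokenized documents to be added to the dictionary.
Terms that meet the top_k and frequency_threshold criteria are added to the dictionary. All other terms are treated as the unknown term. The unknown term is always the first entry in the dictionary and is represented by index 0. The rest of the dictionary is sorted alphabetically.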
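The dictionary rules above can be sketched in plain Python. This is an illustrative emulation, not BigQuery's implementation: the helper name build_dictionary is hypothetical, and the ranking order (document frequency first, ties broken alphabetically) and the exclusion of NULL from the dictionary are assumptions inferred from the examples on this page.

```python
from collections import Counter

def build_dictionary(documents, top_k=None, frequency_threshold=None):
    """Map each kept term to an index >= 1 (hypothetical helper)."""
    # Document frequency: count each term at most once per document.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update({t for t in tokens if t is not None})

    candidates = list(doc_freq)
    if frequency_threshold is not None:
        candidates = [t for t in candidates if doc_freq[t] >= frequency_threshold]

    # Assumed ranking: most documents first, ties broken alphabetically.
    candidates.sort(key=lambda t: (-doc_freq[t], t))
    if top_k is not None:
        candidates = candidates[:top_k]

    # Kept terms are indexed alphabetically from 1; index 0 is reserved
    # for the unknown term.
    return {term: i for i, term in enumerate(sorted(candidates), start=1)}

docs = [["a", "b", "b", "c"], ["a", "c", "c"]]
print(build_dictionary(docs, top_k=2, frequency_threshold=1))  # {'a': 1, 'c': 2}
```

Any term that is absent from the returned mapping would be treated as the unknown term at index 0.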
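Output
ML.BAG_OF_WORDS returns one value for each row in the input. Each value has the following type:
ARRAY<STRUCT<index INT64, value FLOAT64>>
Definitions:
- index: the index of the term in the dictionary. Unknown terms have an index of 0.
- value: the number of times the corresponding term occurs in the document.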
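To make the output shape concrete, the following Python sketch encodes one tokenized document against an existing dictionary. It is an approximation for illustration only: the encode helper and the sample dictionary are hypothetical, and plain Python ints and floats stand in for INT64 and FLOAT64.

```python
from collections import Counter

def encode(tokens, dictionary):
    """Return sparse index/value pairs for one document (hypothetical helper)."""
    counts = Counter()
    for token in tokens:
        # Terms missing from the dictionary (including None, standing in
        # for NULL) are all counted under index 0.
        counts[dictionary.get(token, 0)] += 1
    # Only indexes with a nonzero count appear, in ascending index order.
    return [{"index": i, "value": float(counts[i])} for i in sorted(counts)]

dictionary = {"a": 1, "c": 2}  # hypothetical dictionary: term -> index
print(encode(["a", "c", None], dictionary))
```

Indexes that do not occur in a document are omitted, which is why a row can return fewer entries than the dictionary has terms.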
Quotas
See Cloud AI service functions quotas and limits.
Examples
The following example calls the ML.BAG_OF_WORDS function on an input column f that contains no unknown terms:
WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 32, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+
| id | results                                                                               |
+----+---------------------------------------------------------------------------------------+
| 1  | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"},{"index":"3","value":"1.0"}] |
| 2  | [{"index":"1","value":"1.0"},{"index":"3","value":"1.0"}]                             |
+----+---------------------------------------------------------------------------------------+
Note that index 0 is absent from the results because there are no unknown terms.
The following example calls the ML.BAG_OF_WORDS function on an input column f:
WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', NULL] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 4, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+
| id | results                                                                               |
+----+---------------------------------------------------------------------------------------+
| 1  | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"4.0"}] |
| 2  | [{"index":"0","value":"1.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
+----+---------------------------------------------------------------------------------------+
Note that with frequency_threshold set to 2, no separate values are returned for b and d because each appears in only one document. Their occurrences are counted under index 0 instead, as is the NULL element in the second row.
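The distinction that matters here is between how often a term occurs overall and how many documents contain it. The short Python sketch below computes both for this example's data. It assumes, based on the output above, that frequency_threshold is compared against the document count. The variable names are illustrative.

```python
from collections import Counter

docs = [
    ["a", "b", "b", "b", "c", "c", "c", "c", "d", "d"],
    ["a", "c", None],  # None stands in for SQL NULL
]

# Total occurrences across all documents.
total_count = Counter(t for doc in docs for t in doc if t is not None)
# Number of documents that contain each term at least once.
doc_freq = Counter(t for doc in docs for t in set(doc) if t is not None)

for term in sorted(doc_freq):
    print(term, total_count[term], doc_freq[term])
```

Here b occurs three times in total but in only one document, so under this assumption it fails a threshold of 2 even though its raw count exceeds 2.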
The following example calls the ML.BAG_OF_WORDS function with a lower value for top_k:
WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 2, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;
The output is similar to the following:
+----+---------------------------------------------------------------------------------------+
| id | results                                                                               |
+----+---------------------------------------------------------------------------------------+
| 1  | [{"index":"0","value":"2.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
| 2  | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}]                             |
+----+---------------------------------------------------------------------------------------+
Note that no separate value is returned for b: only the top two terms are kept, and b appears in only one document.
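The following example contains terms with the same frequency. Because ties are resolved in alphabetical order, one of the tied terms (d) is excluded from the result.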
WITH ExampleData AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c', 'd', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c', 'd', 'd', 'd'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 2, 2) OVER() AS result
FROM ExampleData
ORDER BY id;
The results look like the following:
+----+---------------------------------------------------------------------------------------+
| id | result                                                                                |
+----+---------------------------------------------------------------------------------------+
| 1  | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
| 2  | [{"index":"0","value":"3.0"},{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] |
+----+---------------------------------------------------------------------------------------+
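The tie in this example can be traced by hand with a few lines of Python. The ranking key used here (document frequency descending, then alphabetical) is an assumption that happens to reproduce the result above. It is not taken from BigQuery's source.

```python
from collections import Counter

docs = [
    ["a", "b", "b", "c", "d", "d", "d"],
    ["a", "c", "c", "d", "d", "d"],
]
top_k, frequency_threshold = 2, 2

# Count the number of documents that contain each term.
doc_freq = Counter(t for doc in docs for t in set(doc))
eligible = [t for t in doc_freq if doc_freq[t] >= frequency_threshold]

# Assumed ranking: most documents first, ties broken alphabetically.
ranked = sorted(eligible, key=lambda t: (-doc_freq[t], t))
kept, dropped = ranked[:top_k], ranked[top_k:]
print(kept, dropped)  # ['a', 'c'] ['d']
```

The terms a, c, and d each appear in both documents, so with top_k set to 2 the alphabetically last term, d, falls outside the dictionary and its counts go to index 0.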
What's next
- Learn about the BAG_OF_WORDS function outside of machine learning.