The ML.BAG_OF_WORDS function

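Use the ML.BAG_OF_WORDS function to compute a representation of tokenized documents as a bag (multiset) of their words, disregarding word order and grammar. You can use ML.BAG_OF_WORDS within the TRANSFORM clause.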

Syntax

ML.BAG_OF_WORDS(
  tokenized_document
  [, top_k]
  [, frequency_threshold]
)
OVER()

Arguments

ML.BAG_OF_WORDS takes the following arguments:

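  • tokenized_document: An ARRAY<STRING> value that represents a tokenized document. A tokenized document is a collection of terms (tokens) used for text analysis. For more information about tokenization in BigQuery, see TEXT_ANALYZE.
  • top_k: Optional argument. Takes an INT64 value that represents the size of the dictionary, excluding the unknown term. The top_k terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is 20, the 20 unique terms that appear in the most documents are added, and no further terms are added.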
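  • frequency_threshold: Optional argument. Takes an INT64 value that represents the minimum number of documents a term must appear in to be added to the dictionary. For example, if this value is 3, a term must appear in at least three tokenized documents to be added to the dictionary; repeated occurrences within a single document do not count toward the threshold.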

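Terms are added to the dictionary if they satisfy the criteria for both top_k and frequency_threshold; otherwise they are treated as the unknown term. The unknown term is always the first term in the dictionary and is represented as 0. The rest of the dictionary is ordered alphabetically.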
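
The selection rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not BigQuery's implementation: the helper name build_dictionary is hypothetical, and two details are assumptions inferred from the examples on this page, namely that ties in document frequency are resolved alphabetically and that NULL tokens never enter the dictionary.

```python
from collections import Counter


def build_dictionary(docs, top_k, frequency_threshold):
    """Map each kept term to its index; index 0 is reserved for unknown terms."""
    # Document frequency: each document counts a term at most once.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({tok for tok in doc if tok is not None})

    # Drop terms that appear in too few documents.
    eligible = [t for t, n in doc_freq.items() if n >= frequency_threshold]

    # Keep the top_k terms by document frequency. Resolving ties
    # alphabetically is an assumption inferred from the documented examples.
    kept = sorted(eligible, key=lambda t: (-doc_freq[t], t))[:top_k]

    # Indexes start at 1 and follow alphabetical order.
    return {term: i for i, term in enumerate(sorted(kept), start=1)}


docs = [
    ["a", "b", "b", "c", "d", "d", "d"],
    ["a", "c", "c", "d", "d", "d"],
]
print(build_dictionary(docs, top_k=2, frequency_threshold=2))
# prints {'a': 1, 'c': 2}
```

Here a, c, and d each appear in both documents, so with top_k set to 2 only a and c are kept; b fails the threshold and d loses the tie.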

Output

ML.BAG_OF_WORDS returns a value for every row in the input. Each value has the following type:

ARRAY<STRUCT<index INT64, value FLOAT64>>

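Definitions:

  • index: The index of the term that was added to the dictionary. Unknown terms have an index of 0.
  • value: The number of times the corresponding term occurs in the document.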
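
As a rough illustration of this output shape, the following Python sketch turns one tokenized document into sparse index/value pairs. The function name encode and the hard-coded dictionary are hypothetical. The sketch assumes that any token missing from the dictionary, including a NULL (None) token, is counted under index 0, and that indexes with a zero count are omitted, which is consistent with the examples on this page.

```python
from collections import Counter


def encode(doc, dictionary):
    """Return sparse index/value pairs for one tokenized document."""
    counts = Counter()
    for tok in doc:
        # Tokens missing from the dictionary (including None) count as unknown.
        counts[dictionary.get(tok, 0)] += 1
    # Only indexes that actually occur are emitted, in ascending order.
    return [{"index": i, "value": float(n)} for i, n in sorted(counts.items())]


dictionary = {"a": 1, "c": 2}  # hypothetical dictionary; 0 is the unknown term
print(encode(["a", "b", "b", "c"], dictionary))
# prints [{'index': 0, 'value': 2.0}, {'index': 1, 'value': 1.0}, {'index': 2, 'value': 1.0}]
```

A document made up only of dictionary terms, such as ['a', 'c', 'c'], produces no entry for index 0.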

Quotas

See Cloud AI service functions quotas and limits.

Examples

The following example calls the ML.BAG_OF_WORDS function on an input column f, with no unknown terms:

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c'] AS f
)

SELECT id, ML.BAG_OF_WORDS(f, 32, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"},{"index":"3","value":"1.0"}] |
|  2 |                             [{"index":"1","value":"1.0"},{"index":"3","value":"1.0"}] |
+----+---------------------------------------------------------------------------------------+

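Notice that there is no index 0 in the result, because there are no unknown terms.

The following example calls the ML.BAG_OF_WORDS function on an input column f: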

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', NULL] AS f
)

SELECT id, ML.BAG_OF_WORDS(f, 4, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"4.0"}] |
|  2 | [{"index":"0","value":"1.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
+----+---------------------------------------------------------------------------------------+
 

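Notice that b and d do not get their own entries: with frequency_threshold set to 2, they are left out of the dictionary because each appears in only one document. Their occurrences, along with the NULL token, are counted under index 0.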
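
The difference between how often a term occurs and how many documents contain it can be checked with a short Python sketch. It is an illustration only; the variable names are hypothetical, and skipping NULL (None) tokens when counting is an assumption.

```python
from collections import Counter

docs = [
    ["a", "b", "b", "b", "c", "c", "c", "c", "d", "d"],
    ["a", "c", None],
]

# Total occurrences of each term across all documents.
occurrences = Counter(tok for doc in docs for tok in doc if tok is not None)
# Number of documents containing each term (each document counts once).
doc_freq = Counter(tok for doc in docs for tok in set(doc) if tok is not None)

for term in sorted(occurrences):
    print(term, occurrences[term], doc_freq[term])
# prints:
# a 2 2
# b 3 1
# c 5 2
# d 2 1
```

Although b occurs three times, it is found in a single document, so it does not reach a frequency_threshold of 2.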

The following example calls the ML.BAG_OF_WORDS function with a lower value for top_k:

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c'] AS f
)

SELECT id, ML.BAG_OF_WORDS(f, 2, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"2.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
|  2 |                             [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] |
+----+---------------------------------------------------------------------------------------+
 

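Notice that b does not get its own entry: only the top two terms are kept, and b appears in only one document. Its occurrences are counted under index 0.

The following example contains three terms (a, c, and d) that appear in the same number of documents. Because top_k is 2 and ties are resolved in alphabetical order, d is excluded from the dictionary and counted as unknown: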

WITH ExampleData AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c', 'd', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c', 'd', 'd', 'd'] AS f
)

SELECT id, ML.BAG_OF_WORDS(f, 2, 2) OVER() AS result
FROM ExampleData
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                         result                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
|  2 | [{"index":"0","value":"3.0"},{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] |
+----+---------------------------------------------------------------------------------------+

What's next