标准 SQL 中的 HyperLogLog++ 函数

BigQuery 支持使用 HyperLogLog++ 算法的以下近似聚合函数。如需了解近似聚合函数的工作方式,请参阅近似聚合

HLL_COUNT.INIT

HLL_COUNT.INIT(input [, precision])

说明

这个标量函数会接受一个或多个 input 值,并将这些值聚合到一个 HyperLogLog++ 草图。每个草图均使用 BYTES 数据类型表示。随后您可以使用 HLL_COUNT.MERGEHLL_COUNT.MERGE_PARTIAL 合并草图。如果不需要合并,您可以使用 HLL_COUNT.EXTRACT 从草图中提取不同值的最终计数。

input 可以是以下各项之一:

  • INT64
  • STRING
  • BYTES

此函数支持可选参数 precision。此参数用于定义估计值的准确性,定义的准确性越高处理草图或将草图存储在磁盘上所需的内存就越多。下表展示了允许的精度值、每组最大草图大小,以及典型精度的置信区间 (CI):

精度 最大草图大小 (KiB) 65% CI 95% CI 99% CI
10 1 ±1.63% ±3.25% ±6.50%
11 2 ±1.15% ±2.30% ±4.60%
12 4 ±0.81% ±1.63% ±3.25%
13 8 ±0.57% ±1.15% ±1.72%
14 16 ±0.41% ±0.81% ±1.22%
15(默认) 32 ±0.29% ±0.57% ±0.86%
16 64 ±0.20% ±0.41% ±0.61%
17 128 ±0.14% ±0.29% ±0.43%
18 256 ±0.10% ±0.20% ±0.41%
19 512 ±0.07% ±0.14% ±0.29%
20 1024 ±0.05% ±0.10% ±0.20%
21 2048 ±0.04% ±0.07% ±0.14%
22 4096 ±0.03% ±0.05% ±0.10%
23 8192 ±0.02% ±0.04% ±0.07%
24 16384 ±0.01% ±0.03% ±0.05%

如果输入是 NULL,此函数会返回 NULL。

如需了解详情,请参阅 HyperLogLog 实践:一种先进基数估计算法的算法工程

支持的输入类型

BYTES

返回类型

BYTES

示例

SELECT
  HLL_COUNT.INIT(respondent) AS respondents_hll,
  flavor,
  country
FROM UNNEST([
  STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
  (1, "Chocolate", "CH"),
  (2, "Chocolate", "US"),
  (2, "Strawberry", "US")])
GROUP BY flavor, country;

HLL_COUNT.MERGE

HLL_COUNT.MERGE(sketch)

说明

这个聚合函数会计算多组 HyperLogLog++ 草图的并集,并返回其基数。

每个 sketch 都必须具有相同的精度,并按照相同的类型初始化。尝试合并具有不同精度或用于不同类型的草图会引发错误。例如,您不能将使用 INT64 数据初始化的草图与使用 STRING 数据初始化的草图合并。

在合并草图时,此函数会忽略 NULL 值。如果合并了零行,或者只有 NULL 值发生合并,此函数会返回 0

支持的输入类型

BYTES

返回类型

INT64

示例

SELECT HLL_COUNT.MERGE(respondents_hll) AS num_respondents, flavor
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country)
GROUP BY flavor;

HLL_COUNT.MERGE_PARTIAL

HLL_COUNT.MERGE_PARTIAL(sketch)

说明

这个聚合函数会接受一个或多个 HyperLogLog++ sketch 输入,并将它们合并到一个新草图。

如果无输入或所有输入均为 NULL,此函数返回 NULL。

支持的输入类型

BYTES

返回类型

BYTES

示例

SELECT HLL_COUNT.MERGE_PARTIAL(respondents_hll) AS num_respondents, flavor
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country)
GROUP BY flavor;

HLL_COUNT.EXTRACT

HLL_COUNT.EXTRACT(sketch)

说明

这个标量函数会提取一个 HyperLogLog++ 草图的基数估计值。

如果 sketch 是 NULL,此函数返回的基数估计值为 0

支持的输入类型

BYTES

返回类型

INT64

示例

SELECT
  flavor,
  country,
  HLL_COUNT.EXTRACT(respondents_hll) AS num_respondents
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country);

+------------+---------+-----------------+
| flavor     | country | num_respondents |
+------------+---------+-----------------+
| Vanilla    | CH      | 1               |
| Chocolate  | CH      | 1               |
| Chocolate  | US      | 1               |
| Strawberry | US      | 1               |
+------------+---------+-----------------+
此页内容是否有用?请给出您的反馈和评价:

发送以下问题的反馈:

此网页
需要帮助?请访问我们的支持页面