使用旧版 SQL 查询嵌套和重复字段

本文档详细介绍了如何使用旧版 SQL 查询语法来查询嵌套数据和重复数据。BigQuery 的首选查询语法是 GoogleSQL。如需了解如何在 GoogleSQL 中处理嵌套数据和重复数据,请参阅 GoogleSQL 迁移指南

BigQuery 支持加载导出 JSON 和 Avro 文件形式的嵌套数据和重复数据。对于许多旧版 SQL 查询,BigQuery 可自动展平数据。例如,许多 SELECT 语句可以在保持数据结构的同时检索嵌套字段或重复字段,WHERE 子句可以在保持数据结构的同时过滤数据。相反,ORDER BYGROUP BY 子句则隐式展平查询数据。对于未隐式展平数据的情况,例如查询旧版 SQL 中的多个重复字段,可以使用 FLATTENWITHIN SQL 函数查询数据。

FLATTEN

在查询嵌套数据时,BigQuery 会自动为您展平表数据。 例如,我们来看看个人数据的示例架构:

   Last modified                 Schema                 Total Rows   Total Bytes   Expiration
 ----------------- ----------------------------------- ------------ ------------- ------------
  27 Sep 10:01:06   |- kind: string                     4            794
                    |- fullName: string (required)
                    |- age: integer
                    |- gender: string
                    +- phoneNumber: record
                    |  |- areaCode: integer
                    |  |- number: integer
                    +- children: record (repeated)
                    |  |- name: string
                    |  |- gender: string
                    |  |- age: integer
                    +- citiesLived: record (repeated)
                    |  |- place: string
                    |  +- yearsLived: integer (repeated)

请注意,其中包含多个重复字段和嵌套字段。如果针对个人表运行如下所示的旧版 SQL 查询:

SELECT
  fullName AS name,
  age,
  gender,
  citiesLived.place,
  citiesLived.yearsLived
FROM [dataset.tableId]

BigQuery 会以展平的输出形式返回数据:

+---------------+-----+--------+-------------------+------------------------+
|     name      | age | gender | citiesLived_place | citiesLived_yearsLived |
+---------------+-----+--------+-------------------+------------------------+
| John Doe      |  22 | Male   | Seattle           |                   1995 |
| John Doe      |  22 | Male   | Stockholm         |                   2005 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1989 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1993 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1998 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   2002 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1990 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1993 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1998 |
| Mike Jones    |  35 | Male   | Washington DC     |                   2008 |
| Mike Jones    |  35 | Male   | Portland          |                   1993 |
| Mike Jones    |  35 | Male   | Portland          |                   1998 |
| Mike Jones    |  35 | Male   | Portland          |                   2003 |
| Mike Jones    |  35 | Male   | Portland          |                   2005 |
| Mike Jones    |  35 | Male   | Austin            |                   1973 |
| Mike Jones    |  35 | Male   | Austin            |                   1998 |
| Mike Jones    |  35 | Male   | Austin            |                   2001 |
| Mike Jones    |  35 | Male   | Austin            |                   2005 |
| Anna Karenina |  45 | Female | Stockholm         |                   1992 |
| Anna Karenina |  45 | Female | Stockholm         |                   1998 |
| Anna Karenina |  45 | Female | Stockholm         |                   2000 |
| Anna Karenina |  45 | Female | Stockholm         |                   2010 |
| Anna Karenina |  45 | Female | Moscow            |                   1998 |
| Anna Karenina |  45 | Female | Moscow            |                   2001 |
| Anna Karenina |  45 | Female | Moscow            |                   2005 |
| Anna Karenina |  45 | Female | Austin            |                   1995 |
| Anna Karenina |  45 | Female | Austin            |                   1999 |
+---------------+-----+--------+-------------------+------------------------+

在此示例中,citiesLived.place 现在是 citiesLived_placecitiesLived.yearsLived 现在是 citiesLived_yearsLived

虽然 BigQuery 可以自动展平嵌套字段,但在处理多个重复字段时您仍可能需要明确调用 FLATTEN。例如,如果尝试运行如下所示的旧版 SQL 查询:

SELECT fullName, age
FROM [dataset.tableId]
WHERE
  (citiesLived.yearsLived > 1995 ) AND
  (children.age > 3)

BigQuery 会返回类似于以下内容的错误:

Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived

如需在多个重复字段中查询,则需要展平以下字段之一:

SELECT
  fullName,
  age,
  gender,
  citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
  (citiesLived.yearsLived > 1995) AND
  (children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place

返回结果:

+------------+-----+--------+-------------------+
|  fullName  | age | gender | citiesLived_place |
+------------+-----+--------+-------------------+
| John Doe   |  22 | Male   | Stockholm         |
| Mike Jones |  35 | Male   | Los Angeles       |
| Mike Jones |  35 | Male   | Washington DC     |
| Mike Jones |  35 | Male   | Portland          |
| Mike Jones |  35 | Male   | Austin            |
+------------+-----+--------+-------------------+

WITHIN 子句

WITHIN 关键字专用于聚合函数,以便在记录和嵌套字段内的子字段和重复字段中进行聚合。在指定 WITHIN 关键字时,需要指定要聚合的范围:

  • WITHIN RECORD:对记录所含重复值中的数据进行聚合。
  • WITHIN node_name:对指定节点所含重复值中的数据进行聚合,其中的节点是聚合函数中字段的父节点。

假设您要查找上个示例中每个人的子女数。为此,您可以计算每个记录包含的 children.name 数:

SELECT
  fullName,
  COUNT(children.name) WITHIN RECORD AS numberOfChildren
FROM [dataset.tableId];

您将得到以下结果:

+---------------+------------------+
|   fullName    | numberOfChildren |
+---------------+------------------+
| John Doe      |                2 |
| Jane Austen   |                2 |
| Mike Jones    |                3 |
| Anna Karenina |                0 |
+---------------+------------------+

如需进行比较,请尝试列出所有子女的姓名:

SELECT fullName, children.name
FROM [dataset.tableId]
+---------------+---------------+
|   fullName    | children_name |
+---------------+---------------+
| John Doe      | Jane          |
| John Doe      | John          |
| Jane Austen   | Josh          |
| Jane Austen   | Jim           |
| Mike Jones    | Earl          |
| Mike Jones    | Sam           |
| Mike Jones    | Kit           |
| Anna Karenina | None          |
+---------------+---------------+

这与我们的 WITHIN RECORD 查询结果匹配;John Doe 确实有两个名叫 Jane 和 John 的子女,Jane Austen 有两个名叫 Josh 和 Jim 的子女,Mike Jones 有三个名叫 Earl、Sam 和 Kit 的子女,而 Anna Karenina 没有子女。

现在,假设您想要查找某人在不同地方居住过的次数。您可以使用 WITHIN 子句在某一特定节点中聚合:

SELECT
  fullName,
  COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived,
  citiesLived.place,
  COUNT(citiesLived.yearsLived) WITHIN citiesLived AS numberOfTimesInEachCity,
FROM [dataset.tableId];
+---------------+---------------------+-------------------+-------------------------+
|   fullName    | numberOfPlacesLived | citiesLived_place | numberOfTimesInEachCity |
+---------------+---------------------+-------------------+-------------------------+
| John Doe      |                   2 | Seattle           |                       1 |
| John Doe      |                   2 | Stockholm         |                       1 |
| Mike Jones    |                   4 | Los Angeles       |                       4 |
| Mike Jones    |                   4 | Washington DC     |                       4 |
| Mike Jones    |                   4 | Portland          |                       4 |
| Mike Jones    |                   4 | Austin            |                       4 |
| Anna Karenina |                   3 | Stockholm         |                       4 |
| Anna Karenina |                   3 | Moscow            |                       3 |
| Anna Karenina |                   3 | Austin            |                       2 |
+---------------+---------------------+-------------------+-------------------------+

此查询执行以下操作:

  • citiesLived.place 执行 WITHIN RECORD,计算每人居住过的地点数
  • citiesLived.yearsLived 执行 WITHIN,计算每个人在每个城市居住过的次数(仅在 citiesLived 中统计)。

对嵌套和重复字段使用限定范围的聚合是 BigQuery 最强大的功能之一,通常可以避免在查询中出现成本高昂的联接。