使用旧版 SQL 查询嵌套和重复字段
本文档详细介绍了如何使用旧版 SQL 查询语法来查询嵌套数据和重复数据。BigQuery 的首选查询语法是 GoogleSQL。如需了解如何在 GoogleSQL 中处理嵌套数据和重复数据,请参阅 GoogleSQL 迁移指南。
BigQuery 支持加载和导出 JSON 和 Avro 文件形式的嵌套数据和重复数据。对于许多旧版 SQL 查询,BigQuery 可自动展平数据。例如,许多 SELECT
语句可以在保持数据结构的同时检索嵌套字段或重复字段,WHERE
子句可以在保持数据结构的同时过滤数据。相反,ORDER BY
和 GROUP BY
子句则隐式展平查询数据。对于未隐式展平数据的情况,例如查询旧版 SQL 中的多个重复字段,可以使用 FLATTEN
和 WITHIN
SQL 函数查询数据。
FLATTEN
在查询嵌套数据时,BigQuery 会自动为您展平表数据。 例如,我们来看看个人数据的示例架构:
Last modified Schema Total Rows Total Bytes Expiration ----------------- ----------------------------------- ------------ ------------- ------------ 27 Sep 10:01:06 |- kind: string 4 794 |- fullName: string (required) |- age: integer |- gender: string +- phoneNumber: record | |- areaCode: integer | |- number: integer +- children: record (repeated) | |- name: string | |- gender: string | |- age: integer +- citiesLived: record (repeated) | |- place: string | +- yearsLived: integer (repeated)
请注意,其中包含多个重复字段和嵌套字段。如果针对个人表运行如下所示的旧版 SQL 查询:
SELECT fullName AS name, age, gender, citiesLived.place, citiesLived.yearsLived FROM [dataset.tableId]
BigQuery 会以展平的输出形式返回数据:
+---------------+-----+--------+-------------------+------------------------+ | name | age | gender | citiesLived_place | citiesLived_yearsLived | +---------------+-----+--------+-------------------+------------------------+ | John Doe | 22 | Male | Seattle | 1995 | | John Doe | 22 | Male | Stockholm | 2005 | | Mike Jones | 35 | Male | Los Angeles | 1989 | | Mike Jones | 35 | Male | Los Angeles | 1993 | | Mike Jones | 35 | Male | Los Angeles | 1998 | | Mike Jones | 35 | Male | Los Angeles | 2002 | | Mike Jones | 35 | Male | Washington DC | 1990 | | Mike Jones | 35 | Male | Washington DC | 1993 | | Mike Jones | 35 | Male | Washington DC | 1998 | | Mike Jones | 35 | Male | Washington DC | 2008 | | Mike Jones | 35 | Male | Portland | 1993 | | Mike Jones | 35 | Male | Portland | 1998 | | Mike Jones | 35 | Male | Portland | 2003 | | Mike Jones | 35 | Male | Portland | 2005 | | Mike Jones | 35 | Male | Austin | 1973 | | Mike Jones | 35 | Male | Austin | 1998 | | Mike Jones | 35 | Male | Austin | 2001 | | Mike Jones | 35 | Male | Austin | 2005 | | Anna Karenina | 45 | Female | Stockholm | 1992 | | Anna Karenina | 45 | Female | Stockholm | 1998 | | Anna Karenina | 45 | Female | Stockholm | 2000 | | Anna Karenina | 45 | Female | Stockholm | 2010 | | Anna Karenina | 45 | Female | Moscow | 1998 | | Anna Karenina | 45 | Female | Moscow | 2001 | | Anna Karenina | 45 | Female | Moscow | 2005 | | Anna Karenina | 45 | Female | Austin | 1995 | | Anna Karenina | 45 | Female | Austin | 1999 | +---------------+-----+--------+-------------------+------------------------+
在此示例中,citiesLived.place
现在是 citiesLived_place
,citiesLived.yearsLived
现在是 citiesLived_yearsLived
。
虽然 BigQuery 可以自动展平嵌套字段,但在处理多个重复字段时您仍可能需要明确调用 FLATTEN
。例如,如果尝试运行如下所示的旧版 SQL 查询:
SELECT fullName, age FROM [dataset.tableId] WHERE (citiesLived.yearsLived > 1995 ) AND (children.age > 3)
BigQuery 会返回类似于以下内容的错误:
Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived
如需在多个重复字段中查询,则需要展平以下字段之一:
SELECT fullName, age, gender, citiesLived.place FROM (FLATTEN([dataset.tableId], children)) WHERE (citiesLived.yearsLived > 1995) AND (children.age > 3) GROUP BY fullName, age, gender, citiesLived.place
返回结果:
+------------+-----+--------+-------------------+ | fullName | age | gender | citiesLived_place | +------------+-----+--------+-------------------+ | John Doe | 22 | Male | Stockholm | | Mike Jones | 35 | Male | Los Angeles | | Mike Jones | 35 | Male | Washington DC | | Mike Jones | 35 | Male | Portland | | Mike Jones | 35 | Male | Austin | +------------+-----+--------+-------------------+
WITHIN 子句
WITHIN
关键字专用于聚合函数,以便在记录和嵌套字段内的子字段和重复字段中进行聚合。在指定 WITHIN
关键字时,需要指定要聚合的范围:
WITHIN RECORD
:对记录所含重复值中的数据进行聚合。WITHIN node_name
:对指定节点所含重复值中的数据进行聚合,其中的节点是聚合函数中字段的父节点。
假设您要查找上个示例中每个人的子女数。为此,您可以计算每个记录包含的 children.name 数:
SELECT fullName, COUNT(children.name) WITHIN RECORD AS numberOfChildren FROM [dataset.tableId];
您将得到以下结果:
+---------------+------------------+ | fullName | numberOfChildren | +---------------+------------------+ | John Doe | 2 | | Jane Austen | 2 | | Mike Jones | 3 | | Anna Karenina | 0 | +---------------+------------------+
如需进行比较,请尝试列出所有子女的姓名:
SELECT fullName, children.name FROM [dataset.tableId]
+---------------+---------------+ | fullName | children_name | +---------------+---------------+ | John Doe | Jane | | John Doe | John | | Jane Austen | Josh | | Jane Austen | Jim | | Mike Jones | Earl | | Mike Jones | Sam | | Mike Jones | Kit | | Anna Karenina | None | +---------------+---------------+
这与我们的 WITHIN RECORD
查询结果匹配;John Doe 确实有两个名叫 Jane 和 John 的子女,Jane Austen 有两个名叫 Josh 和 Jim 的子女,Mike Jones 有三个名叫 Earl、Sam 和 Kit 的子女,而 Anna Karenina 没有子女。
现在,假设您想要查找某人在不同地方居住过的次数。您可以使用 WITHIN
子句在某一特定节点中聚合:
SELECT fullName, COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived, citiesLived.place, COUNT(citiesLived.yearsLived) WITHIN citiesLived AS numberOfTimesInEachCity, FROM [dataset.tableId];
+---------------+---------------------+-------------------+-------------------------+ | fullName | numberOfPlacesLived | citiesLived_place | numberOfTimesInEachCity | +---------------+---------------------+-------------------+-------------------------+ | John Doe | 2 | Seattle | 1 | | John Doe | 2 | Stockholm | 1 | | Mike Jones | 4 | Los Angeles | 4 | | Mike Jones | 4 | Washington DC | 4 | | Mike Jones | 4 | Portland | 4 | | Mike Jones | 4 | Austin | 4 | | Anna Karenina | 3 | Stockholm | 4 | | Anna Karenina | 3 | Moscow | 3 | | Anna Karenina | 3 | Austin | 2 | +---------------+---------------------+-------------------+-------------------------+
此查询执行以下操作:
- 对
citiesLived.place
执行WITHIN RECORD
,计算每人居住过的地点数 - 对
citiesLived.yearsLived
执行WITHIN
,计算每个人在每个城市居住过的次数(仅在citiesLived
中统计)。
对嵌套和重复字段使用限定范围的聚合是 BigQuery 最强大的功能之一,通常可以避免在查询中出现成本高昂的联接。