使用管道语法分析数据
本教程介绍了如何使用管道语法编写查询来分析数据。
管道语法是 GoogleSQL 的扩展,支持线性查询结构,旨在让查询更易于读取、编写和维护。
管道语法由管道符号 |>
、管道运算符名称和任何参数组成。如需了解详情,请参阅以下资源:
在本教程中,您将使用公开提供的 bigquery-public-data.austin_bikeshare.bikeshare_trips
表(其中包含有关自行车行程的数据)使用管道语法构建复杂查询。
目标
- 通过开始具有
FROM
子句的查询来查看表数据。 - 使用
EXTEND
管道运算符添加列。 - 使用
AGGREGATE
管道运算符每日和每周汇总数据。 - 使用
CROSS JOIN
管道运算符汇总滑动窗口中的数据。 - 使用
WHERE
管道运算符过滤数据。 - 在执行多级汇总时,将管道语法的线性查询结构与标准语法的嵌套查询结构进行比较。
准备工作
要开始使用 BigQuery 公共数据集,您必须先创建或选择一个项目。我们提供每月免费处理 1 TB 数据,因此您无需启用结算功能即可开始查询公共数据集。如果您打算处理的数据量超出免费层级范围,则还必须启用结算功能。
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- 新项目会自动启用 BigQuery。如需在现有项目中启用 BigQuery,请
Enable the BigQuery API.
- 将
AGGREGATE
管道运算符与COUNT
函数结合使用,可查找行程总数和使用的自行车数。 使用
GROUP BY
子句按日期对结果进行分组。如需向前复制数据,请使用
GENERATE_ARRAY
函数和交叉联接:管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` |> EXTEND CAST(start_time AS DATE) AS date |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
标准语法
SELECT *, CAST(start_time AS DATE) AS date FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
GENERATE_ARRAY
函数会创建一个包含0
到6
共七个元素的数组。CROSS JOIN UNNEST
运算会为每行创建七个副本,其中包含一个新的diff_days
列,该列包含每个行从0
到6
的某个数组元素值。您可以使用diff_days
值来调整原始日期,以将时间范围向前滑动相应天数(最多为原始日期后七天)。如需查看行程的计算活跃日期,请将
EXTEND
管道运算符与DATE_ADD
函数结合使用,以创建一个名为active_date
的列,其中包含开始日期以及diff_days
列中的值:管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` |> EXTEND CAST(start_time AS DATE) AS date |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date;
标准语法
SELECT *, DATE_ADD(date, INTERVAL diff_days DAY) AS active_date FROM ( SELECT *, CAST(start_time AS DATE) AS date FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days)
例如,如果行程在
2024-05-20
开始,则在2024-05-26
之前的每一天也都被视为活跃。最后,汇总行程 ID 和自行车 ID,并按
active_date
分组:管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` |> EXTEND CAST(start_time AS DATE) AS date |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date |> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes, COUNT(trip_id) AS active_7d_trips GROUP BY active_date DESC;
标准语法
SELECT DATE_ADD(date, INTERVAL diff_days DAY) AS active_date, COUNT(DISTINCT bike_id) AS active_7d_bikes, COUNT(trip_id) AS active_7d_trips FROM ( SELECT *, CAST(start_time AS DATE) AS date FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days) GROUP BY active_date ORDER BY active_date DESC;
结果类似于以下内容:
+-------------+-----------------+-----------------+ | active_date | active_7d_bikes | active_7d_trips | +-------------+-----------------+-----------------+ | 2024-07-06 | 90 | 331 | | 2024-07-05 | 142 | 726 | | 2024-07-04 | 186 | 1163 | | ... | ... | ... | +-------------+-----------------+-----------------+
- 再添加一个
EXTEND
管道运算符,该运算符使用带有OVER
子句的窗口函数来计算表中的最大日期。 - 使用
WHERE
管道运算符过滤掉已超过最大日期的生成行。
如需详细了解运行查询的不同方式,请参阅运行查询。
查看表数据
如需从 bikeshare_trips
表中检索所有数据,请运行以下查询:
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
标准语法
SELECT *
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
在管道语法中,查询可以以 FROM
子句开头,而无需 SELECT
子句来返回表结果。
结果类似于以下内容:
+----------+-----------------+---------+-----------+-------------------------+-----+ | trip_id | subscriber_type | bike_id | bike_type | start_time | ... | +----------+-----------------+---------+-----------+-------------------------+-----+ | 28875008 | Pay-as-you-ride | 18181 | electric | 2023-02-12 12:46:32 UTC | ... | | 28735401 | Explorer | 214 | classic | 2023-01-13 12:01:45 UTC | ... | | 29381980 | Local365 | 21803 | electric | 2023-04-20 08:43:46 UTC | ... | | ... | ... | ... | ... | ... | ... | +----------+-----------------+---------+-----------+-------------------------+-----+
添加列
在 bikeshare_trips
表中,start_time
列是一个时间戳,但您可能需要添加一个仅显示行程日期的列。如需在管道语法中添加列,请使用 EXTEND
管道运算符:
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
|> EXTEND CAST(start_time AS DATE) AS date;
标准语法
SELECT *, CAST(start_time AS DATE) AS date
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
结果类似于以下内容:
+----------+-----------------+---------+-----------+-------------------------+------------+-----+ | trip_id | subscriber_type | bike_id | bike_type | start_time | date | ... | +----------+-----------------+---------+-----------+-------------------------+------------+-----+ | 28875008 | Pay-as-you-ride | 18181 | electric | 2023-02-12 12:46:32 UTC | 2023-02-12 | ... | | 28735401 | Explorer | 214 | classic | 2023-01-13 12:01:45 UTC | 2023-01-13 | ... | | 29381980 | Local365 | 21803 | electric | 2023-04-20 08:43:46 UTC | 2023-04-20 | ... | | ... | ... | ... | ... | ... | ... | ... | +----------+-----------------+---------+-----------+-------------------------+------------+-----+
汇总每日数据
您可以按日期分组,以查找每天的行程总数和使用的自行车数量。
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
|> EXTEND CAST(start_time AS DATE) AS date
|> AGGREGATE
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes
GROUP BY date;
标准语法
SELECT
CAST(start_time AS DATE) AS date,
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY date;
结果类似于以下内容:
+------------+-------+----------------+
| date | trips | distinct_bikes |
+------------+-------+----------------+
| 2023-04-20 | 841 | 197 |
| 2023-01-27 | 763 | 148 |
| 2023-06-12 | 562 | 202 |
| ... | ... | ... |
+------------+-------+----------------+
对结果排序
如需按 date
列以降序对结果进行排序,请向 GROUP BY
子句添加 DESC
后缀:
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
|> EXTEND CAST(start_time AS DATE) AS date
|> AGGREGATE
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes
GROUP BY date DESC;
标准语法
SELECT
CAST(start_time AS DATE) AS date,
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY date
ORDER BY date DESC;
结果类似于以下内容:
+------------+-------+----------------+
| date | trips | distinct_bikes |
+------------+-------+----------------+
| 2024-06-30 | 331 | 90 |
| 2024-06-29 | 395 | 123 |
| 2024-06-28 | 437 | 137 |
| ... | ... | ... |
+------------+-------+----------------+
在管道语法中,您可以向 GROUP BY
子句直接添加排序后缀,而无需使用 ORDER BY
管道运算符。
向 GROUP BY
子句添加后缀是管道语法支持的多个可选 AGGREGATE
的简写排序功能之一。在标准语法中,这是不可能的,您必须使用 ORDER BY
子句进行排序。
汇总每周数据
现在,您已经拥有每天使用的自行车数量的数据,可以根据查询来查找每七天使用的不同自行车的数量。
如需更新表中的行以显示周而不是天,请在 GROUP BY
子句中使用 DATE_TRUNC
函数,并将粒度设置为 WEEK
:
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
|> EXTEND CAST(start_time AS DATE) AS date
|> AGGREGATE
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes,
GROUP BY DATE_TRUNC(date, WEEK) AS date DESC;
标准语法
SELECT
DATE_TRUNC(CAST(start_time AS DATE), WEEK) AS date,
COUNT(*) AS trips,
COUNT(DISTINCT bike_id) AS distinct_bikes,
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY date
ORDER BY date DESC;
结果类似于以下内容:
+------------+-------+----------------+
| date | trips | distinct_bikes |
+------------+-------+----------------+
| 2024-06-30 | 331 | 90 |
| 2024-06-23 | 3206 | 213 |
| 2024-06-16 | 3441 | 212 |
| ... | ... | ... |
+------------+-------+----------------+
在滑动窗口中汇总
上一部分中的结果显示了开始日期和结束日期之间的固定窗口内的行程,例如 2024-06-23
到 2024-06-29
。您可能希望在滑动窗口中查看行程,在七天的时间范围内,随着每一天过去,时间范围会向前移动。换言之,对于任何给定日期,您可能都想了解接下来一周的行程数和使用的自行车数。
如需对数据应用滑动窗口,请先将每趟行程从其开始日期向前复制额外的六天活跃天数。然后,使用 DATE_ADD
函数计算活跃日期。最后,汇总每个活跃日期的行程和自行车 ID。
过滤未来日期
在上面的查询中,日期会延伸到未来,最长可延伸到数据中的最后日期之后六天。如需过滤超出数据结束日期的日期,请在查询中设置最大日期:
管道语法
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
|> EXTEND CAST(start_time AS DATE) AS date
|> EXTEND MAX(date) OVER () AS max_date
|> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
|> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
|> WHERE active_date <= max_date
|> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes,
COUNT(trip_id) AS active_7d_trips
GROUP BY active_date DESC;
标准语法
SELECT
DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
COUNT(DISTINCT bike_id) AS active_7d_bikes,
COUNT(trip_id) AS active_7d_trips
FROM(
SELECT *
FROM (
SELECT *,
DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
MAX(date) OVER () AS max_date
FROM(
SELECT *, CAST(start_time AS DATE) AS date,
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days))
WHERE active_date <= max_date)
GROUP BY active_date
ORDER BY active_date DESC;
结果类似于以下内容:
+-------------+-----------------+-----------------+
| active_date | active_7d_bikes | active_7d_trips |
+-------------+-----------------+-----------------+
| 2024-06-30 | 212 | 3031 |
| 2024-06-29 | 213 | 3206 |
| 2024-06-28 | 219 | 3476 |
| ... | ... | ... |
+-------------+-----------------+-----------------+