使用管道语法分析数据

本教程介绍了如何使用管道语法编写查询来分析数据。

管道语法是 GoogleSQL 的扩展,支持线性查询结构,旨在让查询更易于读取、编写和维护。 管道语法由管道符号 |>管道运算符名称和任何参数组成。如需了解详情,请参阅以下资源:

在本教程中,您将使用公开提供的 bigquery-public-data.austin_bikeshare.bikeshare_trips(其中包含有关自行车行程的数据)使用管道语法构建复杂查询。

目标

准备工作

要开始使用 BigQuery 公共数据集,您必须先创建或选择一个项目。我们提供每月免费处理 1 TB 数据,因此您无需启用结算功能即可开始查询公共数据集。如果您打算处理的数据量超出免费层级范围,则还必须启用结算功能。

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  5. Make sure that billing is enabled for your Google Cloud project.

  6. 新项目会自动启用 BigQuery。如需在现有项目中启用 BigQuery,请

    Enable the BigQuery API.

    Enable the API

  7. 如需详细了解运行查询的不同方式,请参阅运行查询

    查看表数据

    如需从 bikeshare_trips 表中检索所有数据,请运行以下查询:

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    标准语法

    SELECT *
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    在管道语法中,查询可以以 FROM 子句开头,而无需 SELECT 子句来返回表结果。

    结果类似于以下内容:

    +----------+-----------------+---------+-----------+-------------------------+-----+
    | trip_id  | subscriber_type | bike_id | bike_type | start_time              | ... |
    +----------+-----------------+---------+-----------+-------------------------+-----+
    | 28875008 | Pay-as-you-ride | 18181   | electric  | 2023-02-12 12:46:32 UTC | ... |
    | 28735401 | Explorer        | 214     | classic   | 2023-01-13 12:01:45 UTC | ... |
    | 29381980 | Local365        | 21803   | electric  | 2023-04-20 08:43:46 UTC | ... |
    | ...      | ...             | ...     | ...       | ...                     | ... |
    +----------+-----------------+---------+-----------+-------------------------+-----+
    

    添加列

    bikeshare_trips 表中,start_time 列是一个时间戳,但您可能需要添加一个仅显示行程日期的列。如需在管道语法中添加列,请使用 EXTEND 管道运算符

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date;
    

    标准语法

    SELECT *, CAST(start_time AS DATE) AS date
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    结果类似于以下内容:

    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    | trip_id  | subscriber_type | bike_id | bike_type | start_time              | date       | ... |
    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    | 28875008 | Pay-as-you-ride | 18181   | electric  | 2023-02-12 12:46:32 UTC | 2023-02-12 | ... |
    | 28735401 | Explorer        | 214     | classic   | 2023-01-13 12:01:45 UTC | 2023-01-13 | ... |
    | 29381980 | Local365        | 21803   | electric  | 2023-04-20 08:43:46 UTC | 2023-04-20 | ... |
    | ...      | ...             | ...     | ...       | ...                     | ...        | ... |
    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    

    汇总每日数据

    您可以按日期分组,以查找每天的行程总数和使用的自行车数量。

    • AGGREGATE 管道运算符COUNT 函数结合使用,可查找行程总数和使用的自行车数。
    • 使用 GROUP BY 子句按日期对结果进行分组。

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
         COUNT(*) AS trips,
         COUNT(DISTINCT bike_id) AS distinct_bikes
       GROUP BY date;
    

    标准语法

    SELECT
      CAST(start_time AS DATE) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date;
    

    结果类似于以下内容:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2023-04-20 | 841   | 197            |
    | 2023-01-27 | 763   | 148            |
    | 2023-06-12 | 562   | 202            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    对结果排序

    如需按 date 列以降序对结果进行排序,请向 GROUP BY 子句添加 DESC 后缀:

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
         COUNT(*) AS trips,
         COUNT(DISTINCT bike_id) AS distinct_bikes
       GROUP BY date DESC;
    

    标准语法

    SELECT
      CAST(start_time AS DATE) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date
    ORDER BY date DESC;
    

    结果类似于以下内容:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2024-06-30 | 331   | 90             |
    | 2024-06-29 | 395   | 123            |
    | 2024-06-28 | 437   | 137            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    在管道语法中,您可以向 GROUP BY 子句直接添加排序后缀,而无需使用 ORDER BY 管道运算符。 向 GROUP BY 子句添加后缀是管道语法支持的多个可选 AGGREGATE 的简写排序功能之一。在标准语法中,这是不可能的,您必须使用 ORDER BY 子句进行排序。

    汇总每周数据

    现在,您已经拥有每天使用的自行车数量的数据,可以根据查询来查找每七天使用的不同自行车的数量。

    如需更新表中的行以显示周而不是天,请在 GROUP BY 子句中使用 DATE_TRUNC 函数,并将粒度设置为 WEEK

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
        COUNT(*) AS trips,
        COUNT(DISTINCT bike_id) AS distinct_bikes,
    GROUP BY DATE_TRUNC(date, WEEK) AS date DESC;
    

    标准语法

    SELECT
      DATE_TRUNC(CAST(start_time AS DATE), WEEK) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes,
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date
    ORDER BY date DESC;
    

    结果类似于以下内容:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2024-06-30 | 331   | 90             |
    | 2024-06-23 | 3206  | 213            |
    | 2024-06-16 | 3441  | 212            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    在滑动窗口中汇总

    上一部分中的结果显示了开始日期和结束日期之间的固定窗口内的行程,例如 2024-06-232024-06-29。您可能希望在滑动窗口中查看行程,在七天的时间范围内,随着每一天过去,时间范围会向前移动。换言之,对于任何给定日期,您可能都想了解接下来一周的行程数和使用的自行车数。

    如需对数据应用滑动窗口,请先将每趟行程从其开始日期向前复制额外的六天活跃天数。然后,使用 DATE_ADD 函数计算活跃日期。最后,汇总每个活跃日期的行程和自行车 ID。

    1. 如需向前复制数据,请使用 GENERATE_ARRAY 函数和交叉联接:

      管道语法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
      

      标准语法

      SELECT *, CAST(start_time AS DATE) AS date
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
      

      GENERATE_ARRAY 函数会创建一个包含 06 共七个元素的数组。CROSS JOIN UNNEST 运算会为每行创建七个副本,其中包含一个新的 diff_days 列,该列包含每个行从 06 的某个数组元素值。您可以使用 diff_days 值来调整原始日期,以将时间范围向前滑动相应天数(最多为原始日期后七天)。

    2. 如需查看行程的计算活跃日期,请将 EXTEND 管道运算符与 DATE_ADD 函数结合使用,以创建一个名为 active_date 的列,其中包含开始日期以及 diff_days 列中的值:

      管道语法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
      |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date;
      

      标准语法

      SELECT *, DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
      FROM (
        SELECT *, CAST(start_time AS DATE) AS date
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days)
      

      例如,如果行程在 2024-05-20 开始,则在 2024-05-26 之前的每一天也都被视为活跃。

    3. 最后,汇总行程 ID 和自行车 ID,并按 active_date 分组:

      管道语法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
      |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
      |> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes,
                  COUNT(trip_id) AS active_7d_trips
      GROUP BY active_date DESC;
      

      标准语法

      SELECT
        DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
        COUNT(DISTINCT bike_id) AS active_7d_bikes,
        COUNT(trip_id) AS active_7d_trips
      FROM (
        SELECT *, CAST(start_time AS DATE) AS date
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days)
      GROUP BY active_date
      ORDER BY active_date DESC;
      

      结果类似于以下内容:

      +-------------+-----------------+-----------------+
      | active_date | active_7d_bikes | active_7d_trips |
      +-------------+-----------------+-----------------+
      | 2024-07-06  | 90              | 331             |
      | 2024-07-05  | 142             | 726             |
      | 2024-07-04  | 186             | 1163            |
      | ...         | ...             | ...             |
      +-------------+-----------------+-----------------+
      

    过滤未来日期

    在上面的查询中,日期会延伸到未来,最长可延伸到数据中的最后日期之后六天。如需过滤超出数据结束日期的日期,请在查询中设置最大日期:

    1. 再添加一个 EXTEND 管道运算符,该运算符使用带有 OVER 子句的窗口函数来计算表中的最大日期。
    2. 使用 WHERE 管道运算符过滤掉已超过最大日期的生成行。

    管道语法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> EXTEND MAX(date) OVER () AS max_date
    |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
    |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
    |> WHERE active_date <= max_date
    |> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes,
                 COUNT(trip_id) AS active_7d_trips
       GROUP BY active_date DESC;
    

    标准语法

    SELECT
      DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
      COUNT(DISTINCT bike_id) AS active_7d_bikes,
      COUNT(trip_id) AS active_7d_trips
    FROM(
      SELECT *
      FROM (
        SELECT *,
          DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
          MAX(date) OVER () AS max_date
        FROM(
          SELECT *, CAST(start_time AS DATE) AS date,
          FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
          CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days))
      WHERE active_date <= max_date)
    GROUP BY active_date
    ORDER BY active_date DESC;
    

    结果类似于以下内容:

    +-------------+-----------------+-----------------+
    | active_date | active_7d_bikes | active_7d_trips |
    +-------------+-----------------+-----------------+
    | 2024-06-30  | 212             | 3031            |
    | 2024-06-29  | 213             | 3206            |
    | 2024-06-28  | 219             | 3476            |
    | ...         | ...             | ...             |
    +-------------+-----------------+-----------------+
    

    后续步骤