使用 pipe 語法分析資料

本教學課程說明如何使用管道語法編寫查詢,以分析資料。

管道語法是 GoogleSQL 的擴充功能,支援線性查詢結構,可讓您更輕鬆地讀取、撰寫及維護查詢。管道語法包含管道符號 |>管道運算子名稱和任何引數。詳情請參閱下列資源:

在本教學課程中,您將使用公開的bigquery-public-data.austin_bikeshare.bikeshare_trips資料表,以管道語法建構複雜查詢。這個資料表包含自行車行程的相關資料。

目標

事前準備

您必須先建立或選取專案,才能開始使用 BigQuery 公開資料集。我們免費為您提供每月 1 TB 的資料處理量,讓您無需啟用計費功能就能開始查詢公開資料集。如果您想要進行的運用會超出免費方案的範圍,則必須啟用計費功能。

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  5. Verify that billing is enabled for your Google Cloud project.

  6. 新專案會自動啟用 BigQuery。如要在現有的專案中啟用 BigQuery,請

    Enable the BigQuery API.

    Enable the API

  7. 如要進一步瞭解執行查詢的不同方式,請參閱執行查詢

    查看資料表資料

    如要從 bikeshare_trips 資料表擷取所有資料,請執行下列查詢:

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    標準語法

    SELECT *
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    在管道語法中,查詢可以從 FROM 子句開始,不必使用 SELECT 子句即可傳回表格結果。

    結果大致如下:

    +----------+-----------------+---------+-----------+-------------------------+-----+
    | trip_id  | subscriber_type | bike_id | bike_type | start_time              | ... |
    +----------+-----------------+---------+-----------+-------------------------+-----+
    | 28875008 | Pay-as-you-ride | 18181   | electric  | 2023-02-12 12:46:32 UTC | ... |
    | 28735401 | Explorer        | 214     | classic   | 2023-01-13 12:01:45 UTC | ... |
    | 29381980 | Local365        | 21803   | electric  | 2023-04-20 08:43:46 UTC | ... |
    | ...      | ...             | ...     | ...       | ...                     | ... |
    +----------+-----------------+---------+-----------+-------------------------+-----+
    

    新增欄

    bikeshare_trips 資料表中,start_time 資料欄是時間戳記,但您可能只想新增顯示行程日期的資料欄。如要以管道語法新增資料欄,請使用 EXTEND 管道運算子

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date;
    

    標準語法

    SELECT *, CAST(start_time AS DATE) AS date
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;
    

    結果大致如下:

    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    | trip_id  | subscriber_type | bike_id | bike_type | start_time              | date       | ... |
    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    | 28875008 | Pay-as-you-ride | 18181   | electric  | 2023-02-12 12:46:32 UTC | 2023-02-12 | ... |
    | 28735401 | Explorer        | 214     | classic   | 2023-01-13 12:01:45 UTC | 2023-01-13 | ... |
    | 29381980 | Local365        | 21803   | electric  | 2023-04-20 08:43:46 UTC | 2023-04-20 | ... |
    | ...      | ...             | ...     | ...       | ...                     | ...        | ... |
    +----------+-----------------+---------+-----------+-------------------------+------------+-----+
    

    匯總每日資料

    您可以依日期分組,找出每天的行程總數和使用的單車數。

    • 使用 COUNT 函式搭配AGGREGATE 管道運算子,找出總共的行程次數和使用的單車數量。
    • 使用 GROUP BY 子句,依日期將結果分組。

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
         COUNT(*) AS trips,
         COUNT(DISTINCT bike_id) AS distinct_bikes
       GROUP BY date;
    

    標準語法

    SELECT
      CAST(start_time AS DATE) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date;
    

    結果大致如下:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2023-04-20 | 841   | 197            |
    | 2023-01-27 | 763   | 148            |
    | 2023-06-12 | 562   | 202            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    訂單結果

    如要依 date 欄遞減排序結果,請在 GROUP BY 子句中加入 DESC 後置字串:

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
         COUNT(*) AS trips,
         COUNT(DISTINCT bike_id) AS distinct_bikes
       GROUP BY date DESC;
    

    標準語法

    SELECT
      CAST(start_time AS DATE) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date
    ORDER BY date DESC;
    

    結果大致如下:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2024-06-30 | 331   | 90             |
    | 2024-06-29 | 395   | 123            |
    | 2024-06-28 | 437   | 137            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    在管道語法中,您可以直接將排序後置字元新增至 GROUP BY 子句,而不必使用 ORDER BY 管道運算子。在 GROUP BY 子句中加入後置字元,是 AGGREGATE 管道語法支援的其中一項選用簡短排序功能。在標準語法中,這是不可能的,您必須使用 ORDER BY 子句進行排序。

    匯總每週資料

    現在您已取得每天使用的單車數量資料,可以根據查詢結果,找出每個七天時間範圍內使用的不重複單車數量。

    如要更新資料表中的資料列,改為顯示週而非天數,請在 GROUP BY 子句中使用 DATE_TRUNC 函式,並將精細度設為 WEEK

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> AGGREGATE
        COUNT(*) AS trips,
        COUNT(DISTINCT bike_id) AS distinct_bikes,
    GROUP BY DATE_TRUNC(date, WEEK) AS date DESC;
    

    標準語法

    SELECT
      DATE_TRUNC(CAST(start_time AS DATE), WEEK) AS date,
      COUNT(*) AS trips,
      COUNT(DISTINCT bike_id) AS distinct_bikes,
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    GROUP BY date
    ORDER BY date DESC;
    

    結果大致如下:

    +------------+-------+----------------+
    | date       | trips | distinct_bikes |
    +------------+-------+----------------+
    | 2024-06-30 | 331   | 90             |
    | 2024-06-23 | 3206  | 213            |
    | 2024-06-16 | 3441  | 212            |
    | ...        | ...   | ...            |
    +------------+-------+----------------+
    

    匯總滑動視窗

    前一節的結果顯示開始和結束日期之間的固定時間範圍內的行程,例如 2024-06-232024-06-29。您可能想查看滑動時間範圍內的行程,也就是七天內每天都會更新的行程。換句話說,您可能想知道特定日期後一週的行程次數和自行車使用次數。

    如要將滑動視窗套用至資料,請先將每趟行程從開始日期起,往前複製六個額外的有效天。接著,使用 DATE_ADD 函式計算有效天數的日期。最後,匯總每個有效日期的行程和單車 ID。

    1. 如要將資料向前複製,請使用 GENERATE_ARRAY 函式和交叉聯結:

      管道語法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
      

      標準語法

      SELECT *, CAST(start_time AS DATE) AS date
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days;
      

      GENERATE_ARRAY 函式會建立含有七個元素的陣列,分別是 06CROSS JOIN UNNEST 運算會建立每個資料列的七個副本,並新增 diff_days 資料欄,其中包含 06 的陣列元素值。您可以將 diff_days 值做為原始日期的調整值,將時間範圍往前推移該天數,最多可推移至原始日期後七天。

    2. 如要查看行程的計算活動日期,請使用 EXTEND 管道運算子和 DATE_ADD 函式,建立名為 active_date 的資料欄,其中包含開始日期加上 diff_days 資料欄中的值:

      管道語法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
      |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date;
      

      標準語法

      SELECT *, DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
      FROM (
        SELECT *, CAST(start_time AS DATE) AS date
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days)
      

      舉例來說,如果旅程從 2024-05-20 開始,則在 2024-05-26 之前每天都算有效。

    3. 最後,匯總行程 ID 和單車 ID,並依 active_date 分組:

      管道語法

      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
      |> EXTEND CAST(start_time AS DATE) AS date
      |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
      |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
      |> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes,
                  COUNT(trip_id) AS active_7d_trips
      GROUP BY active_date DESC;
      

      標準語法

      SELECT
        DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
        COUNT(DISTINCT bike_id) AS active_7d_bikes,
        COUNT(trip_id) AS active_7d_trips
      FROM (
        SELECT *, CAST(start_time AS DATE) AS date
        FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
        CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days)
      GROUP BY active_date
      ORDER BY active_date DESC;
      

      結果大致如下:

      +-------------+-----------------+-----------------+
      | active_date | active_7d_bikes | active_7d_trips |
      +-------------+-----------------+-----------------+
      | 2024-07-06  | 90              | 331             |
      | 2024-07-05  | 142             | 726             |
      | 2024-07-04  | 186             | 1163            |
      | ...         | ...             | ...             |
      +-------------+-----------------+-----------------+
      

    篩選未來日期

    在上述查詢中,日期會延伸至未來,最多可超出資料中的最後一個日期六天。如要篩除超出資料結束日期的日期,請在查詢中設定最晚日期:

    1. 新增另一個 EXTEND 管道運算子,使用含 OVER 子句的視窗函式,計算資料表中的最大日期。
    2. 使用 WHERE 管道運算子,篩除超過最晚日期的產生資料列。

    管道語法

    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    |> EXTEND CAST(start_time AS DATE) AS date
    |> EXTEND MAX(date) OVER () AS max_date
    |> CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days
    |> EXTEND DATE_ADD(date, INTERVAL diff_days DAY) AS active_date
    |> WHERE active_date <= max_date
    |> AGGREGATE COUNT(DISTINCT bike_id) AS active_7d_bikes,
                 COUNT(trip_id) AS active_7d_trips
       GROUP BY active_date DESC;
    

    標準語法

    SELECT
      DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
      COUNT(DISTINCT bike_id) AS active_7d_bikes,
      COUNT(trip_id) AS active_7d_trips
    FROM(
      SELECT *
      FROM (
        SELECT *,
          DATE_ADD(date, INTERVAL diff_days DAY) AS active_date,
          MAX(date) OVER () AS max_date
        FROM(
          SELECT *, CAST(start_time AS DATE) AS date,
          FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
          CROSS JOIN UNNEST(GENERATE_ARRAY(0, 6)) AS diff_days))
      WHERE active_date <= max_date)
    GROUP BY active_date
    ORDER BY active_date DESC;
    

    結果大致如下:

    +-------------+-----------------+-----------------+
    | active_date | active_7d_bikes | active_7d_trips |
    +-------------+-----------------+-----------------+
    | 2024-06-30  | 212             | 3031            |
    | 2024-06-29  | 213             | 3206            |
    | 2024-06-28  | 219             | 3476            |
    | ...         | ...             | ...             |
    +-------------+-----------------+-----------------+
    

    後續步驟