同時執行多項 BigQuery 工作


BigQuery 託管多個公開資料集,供一般大眾查詢。在本教學課程中,您將建立工作流程,平行執行多個 BigQuery 查詢工作,並與依序執行工作相比,瞭解效能提升情形。

目標

教學課程內容:

  1. Wikipedia 公開資料集執行查詢,找出特定月份瀏覽次數最多的標題。
  2. 部署及執行工作流程,依序執行多個 BigQuery 查詢工作。
  3. 部署及執行工作流程,使用平行疊代執行 BigQuery 工作,並平行執行一般 for 迴圈。

您可以在 Google Cloud 控制台中執行下列指令,也可以在終端機或 Cloud Shell 中使用 Google Cloud CLI 執行。

費用

在本文件中,您會使用 Google Cloud的下列計費元件:

如要根據預測用量估算費用,請使用 Pricing Calculator

初次使用 Google Cloud 的使用者可能符合免費試用資格。

事前準備

貴機構定義的安全性限制,可能會導致您無法完成下列步驟。如需疑難排解資訊,請參閱「在受限的 Google Cloud 環境中開發應用程式」。

控制台

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Workflows API.

    Enable the API

  5. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

  6. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  7. Make sure that billing is enabled for your Google Cloud project.

  8. Enable the Workflows API.

    Enable the API

  9. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

gcloud

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Workflows API.

    Enable the API

  5. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: roles/bigquery.jobUser, roles/logging.logWriter.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

  6. Install the Google Cloud CLI.

  7. 如果您使用外部識別資訊提供者 (IdP),請先 使用聯合身分登入 gcloud CLI

  8. 如要初始化 gcloud CLI,請執行下列指令:

    gcloud init
  9. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  10. Make sure that billing is enabled for your Google Cloud project.

  11. Enable the Workflows API.

    Enable the API

  12. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: roles/bigquery.jobUser, roles/logging.logWriter.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

  13. Install the Google Cloud CLI.

  14. 如果您使用外部識別資訊提供者 (IdP),請先 使用聯合身分登入 gcloud CLI

  15. 如要初始化 gcloud CLI,請執行下列指令:

    gcloud init

執行 BigQuery 查詢工作

在 BigQuery 中,您可以執行互動式 (隨選) 查詢工作。詳情請參閱「執行互動式和批次查詢工作」。

控制台

  1. 前往 Google Cloud 控制台的「BigQuery」頁面。

    前往 BigQuery

  2. 在「Query editor」(查詢編輯器) 文字區域中輸入下列 BigQuery SQL 查詢:

    SELECT TITLE, SUM(views)
    FROM `bigquery-samples.wikipedia_pageviews.201207h`
    GROUP BY TITLE
    ORDER BY SUM(views) DESC
    LIMIT 100
    
  3. 按一下「執行」

bq

在終端機中輸入下列 bq query 指令,使用標準 SQL 語法執行互動式查詢:

bq query \
--use_legacy_sql=false \
'SELECT
  TITLE, SUM(views)
FROM
  `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY
  TITLE
ORDER BY
  SUM(views) DESC
LIMIT 100'

這會執行查詢,傳回特定月份中觀看次數最多的前 100 個 Wikipedia 標題,並將輸出內容寫入臨時資料表。

請注意查詢的執行時間長度。

部署會依序執行多項查詢的工作流程

工作流程定義是由一系列步驟組成,這些步驟會使用 Workflows 語法描述。建立工作流程後,請部署工作流程,以便執行。部署步驟也會驗證來源檔案是否可執行。

下列工作流程定義了五個資料表清單,可使用 Workflows BigQuery 連接器對這些資料表執行查詢。系統會依序執行查詢,並將每個資料表中最常觀看的影視內容儲存至結果對應表。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 點選「建立」

  3. 輸入新工作流程的名稱,例如 workflow-serial-bqjobs

  4. 選擇適當的區域,例如 us-central1

  5. 選取先前建立的服務帳戶

    您應該已將「BigQuery」>「BigQuery 工作使用者」和「記錄」>「記錄寫入者」IAM 角色授予服務帳戶。

  6. 點選「下一步」

  7. 在工作流程編輯器中,輸入下列工作流程定義:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. 按一下 [Deploy] (部署)

gcloud

  1. 開啟終端機,然後為工作流程建立原始碼檔案:

    touch workflow-serial-bqjobs.yaml
  2. 將下列工作流程複製到原始碼檔案:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. 輸入下列指令來部署工作流程:

    gcloud workflows deploy workflow-serial-bqjobs \
       --source=workflow-serial-bqjobs.yaml \
       --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com 替換為您先前建立的服務帳戶電子郵件地址。

    您應該已將 roles/bigquery.jobUserroles/logging.logWriter IAM 角色授予服務帳戶。

執行工作流程,並依序執行多個查詢

執行工作流程時,系統會執行與該工作流程相關聯的目前工作流程定義。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 在「Workflows」(工作流程) 頁面中,選取 workflow-serial-bqjobs 工作流程,前往詳細資料頁面。

  3. 在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 「Execute」(執行)

  4. 再次點按「執行」

  5. 在「輸出」窗格中查看工作流程結果。

gcloud

  1. 開啟終端機。

  2. 執行工作流程:

     gcloud workflows run workflow-serial-bqjob

工作流程執行時間應約為一分鐘,或前次執行時間的五倍。結果會包含每個資料表,如下所示:

{
  "201201h": {
    "title": "Special:Search",
    "views": "14591339"
  },
  "201202h": {
    "title": "Special:Search",
    "views": "132765420"
  },
  "201203h": {
    "title": "Special:Search",
    "views": "123316818"
  },
  "201204h": {
    "title": "Special:Search",
    "views": "116830614"
  },
  "201205h": {
    "title": "Special:Search",
    "views": "131357063"
  }
}

部署及執行可平行執行多項查詢的工作流程

您可以進行幾項變更,平行執行查詢,不必依序執行五項查詢:

 - runQueries:
    parallel:
        shared: [results]
        for:
            value: table
            in: ${tables}
  • parallel 步驟可讓 for 迴圈的每次疊代平行執行。
  • results 變數會宣告為 shared,因此分支可寫入該變數,且每個分支的結果都可以附加至該變數。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 點選「建立」

  3. 輸入新工作流程的名稱,例如 workflow-parallel-bqjobs

  4. 選擇適當的區域,例如 us-central1

  5. 選取先前建立的服務帳戶

  6. 點選「下一步」

  7. 在工作流程編輯器中,輸入下列工作流程定義:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                    }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. 按一下 [Deploy] (部署)

  9. 在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 「Execute」(執行)

  10. 再次點按「執行」

  11. 在「輸出」窗格中查看工作流程結果。

gcloud

  1. 開啟終端機,然後為工作流程建立原始碼檔案:

    touch workflow-parallel-bqjobs.yaml
  2. 將下列工作流程複製到原始碼檔案:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                    }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. 輸入下列指令來部署工作流程:

    gcloud workflows deploy workflow-parallell-bqjobs \
       --source=workflow-parallel-bqjobs.yaml \
       --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com 替換為您先前建立的服務帳戶電子郵件地址。

  4. 執行工作流程:

     gcloud workflows run workflow-parallel-bqjobs

結果會與先前的輸出內容類似,但工作流程執行時間應約為 20 秒或更短!

清除所用資源

如果您是為了這個教學課程建立新專案,請刪除專案。如果您使用現有專案,並想保留專案,但不要在本教學課程中新增的變更,請刪除為本教學課程建立的資源

刪除專案

如要避免付費,最簡單的方法就是刪除您為了本教學課程所建立的專案。

如要刪除專案:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

刪除教學課程資源

刪除在本教學課程中建立的工作流程:

gcloud workflows delete WORKFLOW_NAME

後續步驟