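Run multiple BigQuery jobs in parallel

BigQuery hosts a number of public datasets that anyone can query. In this tutorial, you create a workflow that runs multiple BigQuery query jobs at the same time, and you see how this improves performance compared with running the jobs one after another.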

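Objectives

In this tutorial, you will:

Run a query against a Wikipedia public dataset to determine the most viewed titles in a specific month.
Deploy and run a workflow that runs multiple BigQuery query jobs serially, one after the other.
Deploy and run a workflow that runs the BigQuery jobs using parallel iteration, in which the iterations of an ordinary for loop run in parallel.

You can run the following commands in the Google Cloud console, or by using the Google Cloud CLI in either your terminal or Cloud Shell.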

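Costs

This document uses billable components of Google Cloud, including BigQuery and Workflows.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.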

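Before you begin

Security constraints defined by your organization might prevent you from completing the following steps. For troubleshooting information, see Develop applications in a constrained Google Cloud environment.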

Console

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Workflows API.

Create a service account:

In the Google Cloud console, go to the Create service account page.
Select your project.
In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

In the Service account description field, enter a description. For example, Service account for quickstart.
Click Create and continue.
Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

To grant a role, find the Select a role list, then select the role.

To grant additional roles, click Add another role and add each additional role.

Note: The Role field affects which resources the service account can access in your project. You can revoke these roles or grant additional roles later.
Click Continue.
Click Done to finish creating the service account.

gcloud

If you don't already have a Google Account, sign up for a new account.

Install the Google Cloud CLI.

To initialize the gcloud CLI, run the following command:

```
gcloud init
```

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Make sure that billing is enabled for your Google Cloud project.

Enable the Workflows API:

```
gcloud services enable workflows.googleapis.com
```

Set up authentication:

Create the service account:
```
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME
```
Replace SERVICE_ACCOUNT_NAME with a name for the service account.
Grant roles to the service account. Run the following command once for each of the following IAM roles: roles/bigquery.jobUser, roles/logging.logWriter:
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" --role=ROLE
```
Replace the following:
- SERVICE_ACCOUNT_NAME: the name of the service account
- PROJECT_ID: the project ID where you created the service account
- ROLE: the role to grant
Note: The --role flag affects which resources the service account can access in your project. You can revoke these roles or grant additional roles later.
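
Because the command above has to be run once per role, it can help to generate both invocations up front and review them before running anything. The following Python sketch only builds and prints the two commands; the helper name build_grant_commands is hypothetical, and PROJECT_ID and SERVICE_ACCOUNT_NAME are the same placeholders used above.

```python
# Roles named in this tutorial; the other values are placeholders to replace.
ROLES = ["roles/bigquery.jobUser", "roles/logging.logWriter"]


def build_grant_commands(project_id, service_account_name):
    """Return one add-iam-policy-binding command (as an argv list) per role."""
    member = (
        f"serviceAccount:{service_account_name}"
        f"@{project_id}.iam.gserviceaccount.com"
    )
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in ROLES
    ]


commands = build_grant_commands("PROJECT_ID", "SERVICE_ACCOUNT_NAME")
for argv in commands:
    # Printed for review only; nothing is executed here.
    print(" ".join(argv))
```

To run the commands instead of printing them, each argv list could be passed to subprocess.run(argv, check=True), assuming the gcloud CLI is installed and initialized.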

Run a BigQuery query job

In BigQuery, you can run an interactive (on-demand) query job. For more information, see Running interactive and batch query jobs.

Console

In the Google Cloud console, go to the BigQuery page.

In the Query editor text area, enter the following BigQuery SQL query:

```
SELECT TITLE, SUM(views)
FROM `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY TITLE
ORDER BY SUM(views) DESC
LIMIT 100
```

Click Run.

bq

In your terminal, enter the following bq query command to run an interactive query using standard SQL syntax:

```
bq query \
--use_legacy_sql=false \
'SELECT
  TITLE, SUM(views)
FROM
  `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY
  TITLE
ORDER BY
  SUM(views) DESC
LIMIT 100'
```

This query returns the 100 Wikipedia titles with the most views in a specific month and writes the output to a temporary table.

Note how long the query takes to run.

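Deploy a workflow that runs multiple queries serially

A workflow definition is made up of a series of steps described using the Workflows syntax. After creating a workflow, you deploy it to make it available for execution. The deploy step also validates that the source file can be executed.

The following workflow defines a list of five tables to query by using the Workflows BigQuery connector. The queries run serially, and the most viewed title from each table is saved to a results map.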

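Console

In the Google Cloud console, go to the Workflows page.

Click Create.
Enter a name for the new workflow, such as workflow-serial-bqjobs.
Choose an appropriate region; for example, us-central1.
Select the service account that you created previously.

The service account should already have been granted both the BigQuery > BigQuery Job User and Logging > Logs Writer IAM roles.
Click Next.

In the workflow editor, enter the following definition for your workflow: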

main:
    steps:
    - init:
        assign:
            - results : {} # result from each iteration keyed by table name
            - tables:
                - 201201h
                - 201202h
                - 201203h
                - 201204h
                - 201205h
    - runQueries:
        for:
            value: table
            in: ${tables}
            steps:
            - logTable:
                call: sys.log
                args:
                    text: ${"Running query for table " + table}
            - runQuery:
                call: googleapis.bigquery.v2.jobs.query
                args:
                    projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                    body:
                        useLegacySql: false
                        useQueryCache: false
                        timeoutMs: 30000
                        # Find top 100 titles with most views on Wikipedia
                        query: ${
                            "SELECT TITLE, SUM(views)
                            FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                            WHERE LENGTH(TITLE) > 10
                            GROUP BY TITLE
                            ORDER BY SUM(VIEWS) DESC
                            LIMIT 100"
                            }
                result: queryResult
            - returnResult:
                assign:
                    # Return the top title from each table
                    - results[table]: {}
                    - results[table].title: ${queryResult.rows[0].f[0].v}
                    - results[table].views: ${queryResult.rows[0].f[1].v}
    - returnResults:
        return: ${results}

Click Deploy.

gcloud

Open a terminal and create a source code file for your workflow:
```
touch workflow-serial-bqjobs.yaml
```

Copy the following workflow to your source code file:

main:
    steps:
    - init:
        assign:
            - results : {} # result from each iteration keyed by table name
            - tables:
                - 201201h
                - 201202h
                - 201203h
                - 201204h
                - 201205h
    - runQueries:
        for:
            value: table
            in: ${tables}
            steps:
            - logTable:
                call: sys.log
                args:
                    text: ${"Running query for table " + table}
            - runQuery:
                call: googleapis.bigquery.v2.jobs.query
                args:
                    projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                    body:
                        useLegacySql: false
                        useQueryCache: false
                        timeoutMs: 30000
                        # Find top 100 titles with most views on Wikipedia
                        query: ${
                            "SELECT TITLE, SUM(views)
                            FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                            WHERE LENGTH(TITLE) > 10
                            GROUP BY TITLE
                            ORDER BY SUM(VIEWS) DESC
                            LIMIT 100"
                            }
                result: queryResult
            - returnResult:
                assign:
                    # Return the top title from each table
                    - results[table]: {}
                    - results[table].title: ${queryResult.rows[0].f[0].v}
                    - results[table].views: ${queryResult.rows[0].f[1].v}
    - returnResults:
        return: ${results}

Enter the following command to deploy the workflow:
```
gcloud workflows deploy workflow-serial-bqjobs \
   --source=workflow-serial-bqjobs.yaml \
   --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
```
Replace MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com with the email address of the service account that you created previously.

The service account should already have been granted both the roles/bigquery.jobUser and roles/logging.logWriter IAM roles.
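
To make the data flow of the serial workflow easier to follow, here is a rough Python analogue of its loop. It is only a sketch: fake_query is a hypothetical stand-in that returns made-up rows in the same rows[].f[].v shape that the workflow reads from queryResult, and no BigQuery call is made.

```python
def fake_query(table):
    """Hypothetical stand-in for the BigQuery connector call; returns made-up rows."""
    return {
        "rows": [
            {"f": [{"v": "Title_A_" + table}, {"v": "300"}]},
            {"f": [{"v": "Title_B_" + table}, {"v": "200"}]},
        ]
    }


def run_serial(tables, query_fn):
    """Query each table in turn and keep only the top row, keyed by table name."""
    results = {}
    for table in tables:
        response = query_fn(table)
        # The query orders rows by views (descending), so row 0 is the top title.
        top_row = response["rows"][0]
        results[table] = {
            "title": top_row["f"][0]["v"],  # first selected column: TITLE
            "views": top_row["f"][1]["v"],  # second selected column: SUM(views)
        }
    return results


tables = ["201201h", "201202h", "201203h", "201204h", "201205h"]
results = run_serial(tables, fake_query)
print(results["201201h"])
```

Each iteration waits for the previous query to finish, which is why the total run time grows with the number of tables.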

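Execute the workflow and run multiple queries serially

Executing a workflow runs the current workflow definition associated with the workflow.

Console

In the Google Cloud console, go to the Workflows page.

On the Workflows page, select the workflow-serial-bqjobs workflow to go to its details page.
On the Workflow details page, click Execute.
Click Execute again.
View the results of the workflow in the Output pane.

gcloud

Open a terminal.

Execute the workflow:

```
gcloud workflows run workflow-serial-bqjobs
```

The workflow execution should take about a minute, or roughly five times as long as the single query that you ran earlier. The result includes an entry for each table and looks similar to the following: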

{
  "201201h": {
    "title": "Special:Search",
    "views": "14591339"
  },
  "201202h": {
    "title": "Special:Search",
    "views": "132765420"
  },
  "201203h": {
    "title": "Special:Search",
    "views": "123316818"
  },
  "201204h": {
    "title": "Special:Search",
    "views": "116830614"
  },
  "201205h": {
    "title": "Special:Search",
    "views": "131357063"
  }
}
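
Note that the views values in this output are JSON strings, not numbers. If you post-process the result, convert them before comparing. The following Python sketch does this for a trimmed copy of the sample output above; the variable names are illustrative.

```python
import json

# Trimmed copy of the sample output above (two of the five tables).
output = """
{
  "201201h": {"title": "Special:Search", "views": "14591339"},
  "201202h": {"title": "Special:Search", "views": "132765420"}
}
"""

results = json.loads(output)

# Convert the string counts to integers so that they compare numerically.
views_by_table = {table: int(entry["views"]) for table, entry in results.items()}

# Table whose top title has the highest view count.
busiest = max(views_by_table, key=views_by_table.get)
print(busiest, views_by_table[busiest])
```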

Deploy and execute a workflow that runs multiple queries in parallel

Instead of running the five queries serially, you can run them in parallel by making a few changes:

 - runQueries:
    parallel:
        shared: [results]
        for:
            value: table
            in: ${tables}

A parallel step allows each iteration of the for loop to run in parallel.
The results variable is declared as shared, which makes it writable by the branches so that each branch can add its result to it.
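
As a loose analogy only, the same fan-out pattern can be sketched in Python with a thread pool and a dictionary that every worker writes to. This does not show how Workflows implements parallel steps; fake_query is a hypothetical stand-in for the BigQuery call, and the lock is a Python-side precaution for the shared dictionary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_query(table):
    """Hypothetical stand-in for a BigQuery query; returns a made-up top title."""
    return {"title": "Top_title_" + table, "views": "100"}


def run_parallel(tables, query_fn):
    results = {}             # plays the role of the shared `results` variable
    lock = threading.Lock()  # guards writes coming from several threads

    def branch(table):
        # One branch per table, like one iteration of the parallel for loop.
        entry = query_fn(table)
        with lock:
            results[table] = entry

    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        # list() forces iteration so that an exception in any branch surfaces here.
        list(pool.map(branch, tables))
    return results


tables = ["201201h", "201202h", "201203h", "201204h", "201205h"]
results = run_parallel(tables, fake_query)
print(sorted(results))
```

Because the branches can finish in any order, the results are keyed by table name rather than appended to a list.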

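Console

In the Google Cloud console, go to the Workflows page.

Click Create.
Enter a name for the new workflow, such as workflow-parallel-bqjobs.
Choose an appropriate region; for example, us-central1.
Select the service account that you created previously.
Click Next.

In the workflow editor, enter the following definition for your workflow: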

main:
    steps:
    - init:
        assign:
            - results : {} # result from each iteration keyed by table name
            - tables:
                - 201201h
                - 201202h
                - 201203h
                - 201204h
                - 201205h
    - runQueries:
        parallel:
            shared: [results]
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
    - returnResults:
        return: ${results}

Click Deploy.
On the Workflow details page, click Execute.
Click Execute again.
View the results of the workflow in the Output pane.

gcloud

Open a terminal and create a source code file for your workflow:
```
touch workflow-parallel-bqjobs.yaml
```

Copy the following workflow to your source code file:

main:
    steps:
    - init:
        assign:
            - results : {} # result from each iteration keyed by table name
            - tables:
                - 201201h
                - 201202h
                - 201203h
                - 201204h
                - 201205h
    - runQueries:
        parallel:
            shared: [results]
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
    - returnResults:
        return: ${results}

Enter the following command to deploy the workflow:
```
gcloud workflows deploy workflow-parallel-bqjobs \
   --source=workflow-parallel-bqjobs.yaml \
   --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
```
Replace MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com with the email address of the service account that you created previously.

Execute the workflow:

```
gcloud workflows run workflow-parallel-bqjobs
```

The result is similar to the previous output, but the workflow execution should complete in about twenty seconds or less.
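
The difference between the two run times can be reasoned about with a simple, idealized model: serial time is the sum of the individual query times, while parallel time is bounded by the slowest query plus any fixed overhead. The Python sketch below assumes five queries of about 12 seconds each. That figure is an illustrative assumption chosen to match the "about a minute" serial run, not a measurement.

```python
def serial_time(durations):
    """Serial execution: each query starts after the previous one finishes."""
    return sum(durations)


def parallel_time(durations, overhead=0.0):
    """Idealized parallel execution: slowest branch plus a fixed overhead."""
    return max(durations) + overhead


# Assumption: five queries of about 12 seconds each (illustrative, not measured).
durations = [12.0] * 5
serial = serial_time(durations)
parallel = parallel_time(durations)
speedup = serial / parallel
print(f"serial={serial:.0f}s parallel={parallel:.0f}s speedup={speedup:.1f}x")
```

In practice the parallel run is unlikely to reach this ideal. The overhead parameter is one way to account for startup and scheduling costs when comparing the model with the run times that you observe.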

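Clean up

If you created a new project for this tutorial, delete the project. If you used an existing project and want to keep it without the changes added in this tutorial, delete the resources created for the tutorial.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

To delete the project:

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.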

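Delete tutorial resources

Delete the workflows created in this tutorial:

```
gcloud workflows delete WORKFLOW_NAME
```

Replace WORKFLOW_NAME with the name of each workflow that you deployed, for example workflow-serial-bqjobs or workflow-parallel-bqjobs.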

What's next

To learn more about parallel steps, see Executing parallel steps.
To learn more about Workflows connectors, see Understand connectors.
To learn more about Workflows, see the Workflows overview.
