Create a streaming pipeline using a Dataflow template

This document shows how to create a streaming pipeline by using a Google-provided Dataflow template. Specifically, this quickstart uses the Pub/Sub to BigQuery template as an example.

The Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic and writes them to a BigQuery table.


Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Pub/Sub, and Resource Manager APIs.

Enable the APIs

Create a Cloud Storage bucket:

In the Google Cloud console, go to the Cloud Storage Buckets page.
Go to Buckets
Click Create.
On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
1. For Name your bucket, enter a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
2. In the Choose where to store your data section, do the following:
  1. Select a Location type.
  2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
    - If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
  3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
    Set up cross-bucket replication
    
    In the Bucket menu, select a bucket.
    
    In the Replication settings section, click Configure to configure settings for the replication job.
    
    The Configure cross-bucket replication pane appears.
    
    To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
    
    To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
    
    Click Done.
3. In the Choose how to store your data section, do the following:
  1. In the Set a default class section, select the following: Standard.
  2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
    Note: You cannot enable hierarchical namespace in existing buckets.
4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
  Note: You cannot change the Prevent public access setting if this setting is enforced at an organization policy.
5. In the Choose how to protect object data section, do the following:
  - Select any of the options under Data protection that you want to set for your bucket.
    - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
    - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
    - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
      - To enable Object Retention Lock, click the Enable object retention checkbox.
      - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
  - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
Click Create.

Copy the following values. You need them in later sections.
- Your Cloud Storage bucket name.
- Your Google Cloud project ID.
  
  To find this ID, see Identifying projects.
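
Because the bucket name is reused later as part of a gs:// path, a quick local sanity check can catch typos before a job is launched. The following is a minimal sketch, not an official validator: it covers only part of the Cloud Storage naming rules (length, allowed characters, and the reserved "goog" prefix and "google" substring), and it assumes names without dots. The function names are hypothetical.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, dashes and
# underscores, starting and ending with a letter or digit.
# Assumption: dotted (domain-style) bucket names are out of scope here.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes this partial set of naming checks."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    # Names that begin with "goog" or contain "google" are not allowed.
    if name.startswith("goog") or "google" in name:
        return False
    return True


def temp_location(bucket_name: str) -> str:
    """Build the gs://BUCKET_NAME/temp/ path used as the temporary location."""
    if not looks_like_valid_bucket_name(bucket_name):
        raise ValueError(f"suspicious bucket name: {bucket_name!r}")
    return f"gs://{bucket_name}/temp/"


if __name__ == "__main__":
    print(temp_location("my-dataflow-bucket-123"))
```

A name that passes this check can still be rejected by Cloud Storage, for example because it is already taken, so treat the result as a hint only.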

To complete the steps in this quickstart, your user account must have the Dataflow Admin role and the Service Account User role. The Compute Engine default service account must have the Dataflow Worker, Storage Object Admin, Pub/Sub Editor, BigQuery Data Editor, and Viewer roles. To add the required roles in the Google Cloud console:
1. Go to the IAM page and select your project.
  Go to IAM
2. In the row containing your user account, click the Edit principal icon. Click Add another role, and then add the Dataflow Admin and Service Account User roles.
3. Click Save.
4. In the row containing the Compute Engine default service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com), click the Edit principal icon.
5. Click Add another role, and then add the Dataflow Worker, Storage Object Admin, Pub/Sub Editor, BigQuery Data Editor, and Viewer roles.
6. Click Save.
  
  For more information about granting roles, see Grant an IAM role by using the console.

By default, each new project starts with a default network. If the default network for your project is disabled or was deleted, your project must contain a network for which your user account has the Compute Network User role (roles/compute.networkUser).
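
The role grants in this section can also be written down as data, which makes them easier to review or script. The sketch below maps the role names above to their IAM role IDs and prints the equivalent `gcloud projects add-iam-policy-binding` commands without running them. The project ID, project number, and email address are placeholders, and the function name is hypothetical.

```python
# Roles for the user account that runs the quickstart.
USER_ROLES = [
    "roles/dataflow.admin",          # Dataflow Admin
    "roles/iam.serviceAccountUser",  # Service Account User
]

# Roles for the Compute Engine default service account.
WORKER_SA_ROLES = [
    "roles/dataflow.worker",      # Dataflow Worker
    "roles/storage.objectAdmin",  # Storage Object Admin
    "roles/pubsub.editor",        # Pub/Sub Editor
    "roles/bigquery.dataEditor",  # BigQuery Data Editor
    "roles/viewer",               # Viewer
]


def grant_commands(project_id: str, project_number: str, user_email: str) -> list[str]:
    """Return one gcloud command string per role binding (nothing is executed)."""
    service_account = f"{project_number}-compute@developer.gserviceaccount.com"
    bindings = [(f"user:{user_email}", role) for role in USER_ROLES]
    bindings += [(f"serviceAccount:{service_account}", role) for role in WORKER_SA_ROLES]
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for member, role in bindings
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own before using the output.
    for command in grant_commands("my-project", "123456789012", "alice@example.com"):
        print(command)
```

Printing the commands rather than executing them keeps the sketch free of side effects, so the list can be reviewed before anything is granted.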

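Create a BigQuery dataset and table

Using the Google Cloud console, create a BigQuery dataset and table with a schema that matches the messages in your Pub/Sub topic.

In this example, the dataset is named taxirides and the table is named realtime. To create this dataset and table, follow these steps:

Go to the BigQuery page.
Go to BigQuery
In the Explorer panel, next to the project where you want to create the dataset, click View actions, and then click Create dataset.
Note: The default experience is the Preview Google Cloud console. If you clicked Hide preview features to switch to the Google Cloud console, select your project in the Resources section of the navigation panel instead.
In the Create dataset panel, follow these steps:

For Dataset ID, enter taxirides. Dataset IDs are unique within each Google Cloud project.
For Location type, select Multi-region, and then select US (multiple regions in United States). Public datasets are stored in the US multi-region location. For simplicity, place your dataset in the same location.
Leave the other default settings in place, and then click Create dataset.

In the Explorer panel, expand your project.
Next to your taxirides dataset, click View actions, and then click Create table.
Note: The default experience is the Preview Google Cloud console. If you clicked Hide preview features to switch to the Google Cloud console, select the taxirides dataset that you created in the Resources section of the navigation panel instead.
In the Create table panel, follow these steps:

In the Source section, for Create table from, select Empty table.
In the Destination section, for Table, enter realtime.

In the Schema section, click Edit as text, and then paste the following schema definition into the box: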

ride_id:string,point_idx:integer,latitude:float,longitude:float,timestamp:timestamp,
meter_reading:float,meter_increment:float,ride_status:string,passenger_count:integer

In the Partition and cluster settings section, for Partitioning, select the timestamp field.

Leave the other default settings in place, and then click Create table.
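
The schema text pasted in the Schema section is a compact list of name:type pairs. As an illustration of how that text corresponds to a field-by-field schema, the following sketch parses it into a list of field definitions in the JSON layout that BigQuery schema files use. The function name is hypothetical, and every field is given the NULLABLE mode, which is what BigQuery applies when no mode is specified.

```python
import json

# The same schema definition used in this quickstart, joined into one string.
SCHEMA_TEXT = (
    "ride_id:string,point_idx:integer,latitude:float,longitude:float,timestamp:timestamp,"
    "meter_reading:float,meter_increment:float,ride_status:string,passenger_count:integer"
)


def parse_schema(text: str) -> list[dict]:
    """Turn 'name:type,name:type' text into a list of field definitions."""
    fields = []
    for item in text.replace("\n", "").split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma at a line break
        name, _, field_type = item.partition(":")
        fields.append(
            {"name": name.strip(), "type": field_type.strip().upper(), "mode": "NULLABLE"}
        )
    return fields


if __name__ == "__main__":
    print(json.dumps(parse_schema(SCHEMA_TEXT), indent=2))
```

The parsed list makes it easy to confirm that the table has nine columns and that timestamp is the only TIMESTAMP column, which is the field chosen for partitioning.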

Run the pipeline

Run a streaming pipeline by using the Google-provided Pub/Sub to BigQuery template. The pipeline gets incoming data from the input topic.

Go to the Dataflow Jobs page.
Go to Jobs
Click Create job from template.
For the Job name of your Dataflow job, enter taxi-data.
For Dataflow template, select the Pub/Sub to BigQuery template.
For BigQuery output table, enter the following:
```
PROJECT_ID:taxirides.realtime
```
Replace PROJECT_ID with the project ID of the project where you created your BigQuery dataset.
In the Optional source parameters section, for Input Pub/Sub topic, click Enter topic manually.

In the dialog, for Topic name, enter the following, and then click Save:

projects/pubsub-public-data/topics/taxirides-realtime

This publicly available Pub/Sub topic is based on the NYC Taxi & Limousine Commission's open dataset. The following is a sample message from this topic, in JSON format:

{
  "ride_id": "19c41fc4-e362-4be5-9d06-435a7dc9ba8e",
  "point_idx": 217,
  "latitude": 40.75399,
  "longitude": -73.96302,
  "timestamp": "2021-03-08T02:29:09.66644-05:00",
  "meter_reading": 6.293821,
  "meter_increment": 0.029003782,
  "ride_status": "enroute",
  "passenger_count": 1
}

For Temp location, enter the following:
```
gs://BUCKET_NAME/temp/
```
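Replace BUCKET_NAME with the name of your Cloud Storage bucket. The temp folder stores temporary files, such as the staged pipeline job.
If your project doesn't have a default network, enter a Network and a Subnetwork. For more information, see Specify a network and subnetwork.
Note: Unless you specify a network with the network option, the Dataflow runner runs jobs in the default Virtual Private Cloud network. If your project doesn't have a default network and you don't specify one, an error occurs. A project might lack a default network if the network was deleted or if an organization policy constraint prevented its creation.
Click Run job.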
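
The values entered in the console form above can be collected in one place. As a rough sketch, the following code assembles them into a request body shaped like the one the Dataflow templates launch API accepts. Nothing is sent anywhere. The parameter names `inputTopic` and `outputTableSpec` and the template path are assumptions recalled from the template's reference documentation, so confirm them there before relying on this. The function name is hypothetical.

```python
import json

# Assumption: Cloud Storage path of the Google-provided template.
TEMPLATE_GCS_PATH = "gs://dataflow-templates/latest/PubSub_to_BigQuery"

# Public topic used in this quickstart.
INPUT_TOPIC = "projects/pubsub-public-data/topics/taxirides-realtime"


def build_launch_request(project_id: str, bucket_name: str, job_name: str = "taxi-data") -> dict:
    """Assemble the job name, template parameters, and temp location."""
    return {
        "jobName": job_name,
        "parameters": {
            # Assumed parameter names; see the lead-in above.
            "inputTopic": INPUT_TOPIC,
            "outputTableSpec": f"{project_id}:taxirides.realtime",
        },
        "environment": {
            "tempLocation": f"gs://{bucket_name}/temp/",
        },
    }


if __name__ == "__main__":
    # Placeholder project ID and bucket name.
    body = build_launch_request("my-project", "my-dataflow-bucket-123")
    print("template:", TEMPLATE_GCS_PATH)
    print(json.dumps(body, indent=2))
```

Note that the output table uses the PROJECT_ID:DATASET.TABLE form with a colon after the project ID, whereas the query in the next section refers to the same table with dots only.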

View your results

To view the data written to your realtime table, follow these steps:

Go to the BigQuery page.

Go to BigQuery
Click Compose a new query. A new Editor tab opens. In the editor, enter the following query:
```
SELECT * FROM `PROJECT_ID.taxirides.realtime`
WHERE `timestamp` > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
LIMIT 1000
```
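Replace PROJECT_ID with the project ID of the project where you created your BigQuery dataset. It can take up to five minutes for data to start appearing in your table.
Click Run.

The query returns rows that were added to your table in the past 24 hours. You can also run queries using standard SQL.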
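
When the query above is reused across projects, substituting PROJECT_ID by hand is easy to get wrong. The following sketch fills in the project ID and rejects values that don't look like a project ID. The pattern is a simplification (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen) and doesn't cover domain-scoped project IDs. The function name is hypothetical.

```python
import re

# The query from this section, with the project ID left as a placeholder.
QUERY_TEMPLATE = (
    "SELECT * FROM `{project_id}.taxirides.realtime`\n"
    "WHERE `timestamp` > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)\n"
    "LIMIT 1000"
)

# Simplified project ID pattern; see the assumptions in the lead-in.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def render_query(project_id: str) -> str:
    """Return the query text with the project ID substituted."""
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"unexpected project ID format: {project_id!r}")
    return QUERY_TEMPLATE.format(project_id=project_id)


if __name__ == "__main__":
    print(render_query("my-project-123"))
```

Validating the ID before formatting also avoids pasting stray characters, such as a backtick, into the table reference.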

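Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the project

The easiest way to eliminate billing is to delete the Google Cloud project that you created for the quickstart.

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.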

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

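Delete the individual resources

If you want to keep the Google Cloud project that you used in this quickstart, delete the individual resources:

Go to the Dataflow Jobs page.
Go to Jobs
Select your streaming job from the job list.
In the navigation, click Stop.
In the Stop job dialog, either cancel or drain your pipeline, and then click Stop job.
Go to the BigQuery page.
Go to BigQuery
In the Explorer panel, expand your project.
Next to the dataset that you want to delete, click View actions, and then click Open.
In the details panel, click Delete dataset, and then follow the instructions.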
In the Google Cloud console, go to the Cloud Storage Buckets page.
Go to Buckets
Click the checkbox for the bucket that you want to delete.
To delete the bucket, click Delete, and then follow the instructions.
