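Google-provided Dataflow batch templates

Google provides open source Dataflow templates.

These Dataflow templates help you solve large-scale data tasks such as data import, export, backup, restore, and bulk API operations, without requiring a dedicated development environment. The templates are built on Apache Beam and use Dataflow to transform the data.

For general information about templates, see Dataflow templates. For a list of the templates that Google provides, see the overview of Google-provided templates.

This guide describes the following batch templates.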

BigQuery to Cloud Storage TFRecords

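The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the split percentages for training, testing, and validation. By default, the training set receives 1 (100%) and the testing and validation sets each receive 0 (0%). When you set the dataset split, the training, testing, and validation values must add up to 1 (100%), for example 0.6 + 0.2 + 0.2. Dataflow automatically determines the optimal number of shards for each output dataset.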
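
The split rule described above can be sketched in plain Java without Beam. This is a minimal illustration, not the template's implementation: the class and method names (`SplitSketch`, `partitionFor`) are hypothetical, and the small tolerance used when checking the sum is an assumption of this sketch (the template source later on this page compares the sum to 1 exactly).

```java
import java.util.Random;

/** Hypothetical helper illustrating the three-way split; not part of the template. */
final class SplitSketch {

  /** Returns 0 (training), 1 (testing), or 2 (validation) for a draw in [0, 1). */
  static int partitionFor(double draw, double train, double test, double validation) {
    // Assumption: allow a small tolerance, because decimal fractions are inexact in binary.
    if (Math.abs(train + test + validation - 1.0) > 1e-6) {
      throw new IllegalArgumentException("Split fractions must add up to 1");
    }
    if (draw < train) {
      return 0;
    } else if (draw < train + test) {
      return 1;
    }
    return 2;
  }

  public static void main(String[] args) {
    Random rand = new Random(100); // fixed seed so repeated runs give the same counts
    int[] counts = new int[3];
    for (int i = 0; i < 10_000; i++) {
      counts[partitionFor(rand.nextDouble(), 0.6, 0.2, 0.2)]++;
    }
    System.out.printf("train=%d test=%d validation=%d%n", counts[0], counts[1], counts[2]);
  }
}
```

Because each row is assigned by an independent random draw, the resulting set sizes only approximate the requested fractions.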

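Requirements for this pipeline:

The BigQuery dataset and table must exist.
The output Cloud Storage bucket must exist before the pipeline runs. The training, testing, and validation subdirectories do not need to be created in advance; they are generated automatically.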

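Template parameters

Parameter	Description
`readQuery`	A BigQuery SQL query that extracts data from the source. Example: `select * from dataset1.sample_table`
`outputDirectory`	The top-level Cloud Storage path prefix under which the training, testing, and validation TFRecord files are written. Example: `gs://mybucket/output`. Subdirectories for the training, testing, and validation TFRecord files are generated automatically under `outputDirectory`. Example: `gs://mybucket/output/train`
`trainingPercentage`	(Optional) The fraction of the query data allocated to the training TFRecord files. The default value is 1 (100%).
`testingPercentage`	(Optional) The fraction of the query data allocated to the testing TFRecord files. The default value is 0 (0%).
`validationPercentage`	(Optional) The fraction of the query data allocated to the validation TFRecord files. The default value is 0 (0%).
`outputSuffix`	(Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is `.tfrecord`.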
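
As a rough illustration of how the subdirectories are derived from `outputDirectory`, the sketch below mirrors the `concatURI` helper and the `train/`, `test/`, and `val/` constants that appear in the template source later on this page. The class name `OutputPathSketch` is hypothetical.

```java
/** Hypothetical helper showing how the output subdirectories are derived. */
final class OutputPathSketch {

  /** Joins a Cloud Storage directory and a folder name with exactly one slash between them. */
  static String subdirectory(String outputDirectory, String folder) {
    return outputDirectory.endsWith("/")
        ? outputDirectory + folder
        : outputDirectory + "/" + folder;
  }

  public static void main(String[] args) {
    // Folder names taken from the constants in the template source.
    for (String folder : new String[] {"train/", "test/", "val/"}) {
      System.out.println(subdirectory("gs://mybucket/output", folder));
    }
  }
}
```

With or without a trailing slash on `outputDirectory`, the resulting paths are the same.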

Running the BigQuery to Cloud Storage TFRecords template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the BigQuery to TFRecords template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region REGION_NAME \
    --parameters \
readQuery=READ_QUERY,\
outputDirectory=OUTPUT_DIRECTORY,\
trainingPercentage=TRAINING_PERCENTAGE,\
testingPercentage=TESTING_PERCENTAGE,\
validationPercentage=VALIDATION_PERCENTAGE,\
outputSuffix=OUTPUT_FILENAME_SUFFIX

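Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
READ_QUERY: the BigQuery query to run
OUTPUT_DIRECTORY: the Cloud Storage path prefix for the output datasets
TRAINING_PERCENTAGE: the decimal fraction of the data to assign to the training dataset
TESTING_PERCENTAGE: the decimal fraction of the data to assign to the testing dataset
VALIDATION_PERCENTAGE: the decimal fraction of the data to assign to the validation dataset
OUTPUT_FILENAME_SUFFIX: the file suffix for the output TensorFlow Record files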

API

To run the template using the REST API, send an HTTP POST request. For details about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
   "jobName": "JOB_NAME",
   "parameters": {
       "readQuery":"READ_QUERY",
       "outputDirectory":"OUTPUT_DIRECTORY",
       "trainingPercentage":"TRAINING_PERCENTAGE",
       "testingPercentage":"TESTING_PERCENTAGE",
       "validationPercentage":"VALIDATION_PERCENTAGE",
       "outputSuffix":"OUTPUT_FILENAME_SUFFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

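Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
READ_QUERY: the BigQuery query to run
OUTPUT_DIRECTORY: the Cloud Storage path prefix for the output datasets
TRAINING_PERCENTAGE: the decimal fraction of the data to assign to the training dataset
TESTING_PERCENTAGE: the decimal fraction of the data to assign to the testing dataset
VALIDATION_PERCENTAGE: the decimal fraction of the data to assign to the validation dataset
OUTPUT_FILENAME_SUFFIX: the file suffix for the output TensorFlow Record files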
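
The POST request above could be assembled from Java with the standard `java.net.http` API (Java 11 or later), roughly as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, only the two parameters not marked optional are sent, the values are assumed to need no JSON escaping, and the access token is assumed to have been obtained separately (for example with `gcloud auth print-access-token`). The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper; the class and method names are illustrative only. */
final class TfRecordLaunchSketch {

  /** Builds the templates:launch URL shown above for the given placeholder values. */
  static URI launchUri(String projectId, String location, String version) {
    return URI.create(
        String.format(
            "https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/templates:launch"
                + "?gcsPath=gs://dataflow-templates/%s/Cloud_BigQuery_to_GCS_TensorFlow_Records",
            projectId, location, version));
  }

  /** Builds a request body that sets only readQuery and outputDirectory. */
  static String launchBody(String jobName, String readQuery, String outputDirectory) {
    // Assumption: the values contain no characters that need JSON escaping.
    return String.format(
        "{\"jobName\":\"%s\",\"parameters\":{\"readQuery\":\"%s\",\"outputDirectory\":\"%s\"}}",
        jobName, readQuery, outputDirectory);
  }

  /** Assembles the POST request without sending it. */
  static HttpRequest buildRequest(String accessToken, URI uri, String body) {
    return HttpRequest.newBuilder(uri)
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer " + accessToken)
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    URI uri = launchUri("my-project", "us-central1", "latest");
    String body =
        launchBody("my-job", "select * from dataset1.sample_table", "gs://mybucket/output");
    HttpRequest request = buildRequest("ACCESS_TOKEN", uri, body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

Omitting the optional parameters leaves the split at its defaults, so all rows go to the training set.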

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BigQueryToTFRecord.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters.BigQueryReadOptions;
import com.google.protobuf.ByteString;
import java.util.Iterator;
import java.util.Random;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.annotations.VisibleForTesting;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

/**
 * Dataflow template which reads BigQuery data and writes it to GCS as a set of TFRecords. The
 * source is a SQL query.
 */
@Template(
    name = "Cloud_BigQuery_to_GCS_TensorFlow_Records",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to TensorFlow Records",
    description =
        "A pipeline that reads rows from BigQuery and writes them as TFRecords in Cloud Storage. (NOTE: Nested BigQuery columns are currently not supported and should be unnested within the SQL query.)",
    optionsClass = Options.class,
    optionsOrder = {BigQueryReadOptions.class, Options.class},
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToTFRecord {

  private static final String TRAIN = "train/";
  private static final String TEST = "test/";
  private static final String VAL = "val/";

  /**
   * The {@link BigQueryToTFRecord#buildFeatureFromIterator(Class, Object, Feature.Builder)} method
   * handles {@link GenericData.Array} values that are passed into the {@link
   * BigQueryToTFRecord#buildFeature} method, adding each element to the TensorFlow feature being
   * built.
   */
  private static void buildFeatureFromIterator(
      Class<?> fieldType, Object field, Feature.Builder feature) {
    ByteString byteString;
    GenericData.Array f = (GenericData.Array) field;
    if (fieldType == Long.class) {
      Iterator<Long> longIterator = f.iterator();
      while (longIterator.hasNext()) {
        Long longValue = longIterator.next();
        feature.getInt64ListBuilder().addValue(longValue);
      }
    } else if (fieldType == double.class) {
      Iterator<Double> doubleIterator = f.iterator();
      while (doubleIterator.hasNext()) {
        double doubleValue = doubleIterator.next();
        feature.getFloatListBuilder().addValue((float) doubleValue);
      }
    } else if (fieldType == String.class) {
      Iterator<Utf8> stringIterator = f.iterator();
      while (stringIterator.hasNext()) {
        String stringValue = stringIterator.next().toString();
        byteString = ByteString.copyFromUtf8(stringValue);
        feature.getBytesListBuilder().addValue(byteString);
      }
    } else if (fieldType == boolean.class) {
      Iterator<Boolean> booleanIterator = f.iterator();
      while (booleanIterator.hasNext()) {
        Boolean boolValue = booleanIterator.next();
        int boolAsInt = boolValue ? 1 : 0;
        feature.getInt64ListBuilder().addValue(boolAsInt);
      }
    }
  }

  /**
   * The {@link BigQueryToTFRecord#buildFeature} method takes in an individual field and type
   * corresponding to a column value from a SchemaAndRecord Object returned from a BigQueryIO.read()
   * step. The method builds a TensorFlow Feature based on the type of the object, for example
   * STRING, TIME, or INTEGER.
   */
  private static Feature buildFeature(Object field, String type) {
    Feature.Builder feature = Feature.newBuilder();
    ByteString byteString;

    switch (type) {
      case "STRING":
      case "TIME":
      case "DATE":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(String.class, field, feature);
        } else {
          byteString = ByteString.copyFromUtf8(field.toString());
          feature.getBytesListBuilder().addValue(byteString);
        }
        break;
      case "BYTES":
        byteString = ByteString.copyFrom((byte[]) field);
        feature.getBytesListBuilder().addValue(byteString);
        break;
      case "INTEGER":
      case "INT64":
      case "TIMESTAMP":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(Long.class, field, feature);
        } else {
          feature.getInt64ListBuilder().addValue((long) field);
        }
        break;
      case "FLOAT":
      case "FLOAT64":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(double.class, field, feature);
        } else {
          feature.getFloatListBuilder().addValue((float) (double) field);
        }
        break;
      case "BOOLEAN":
      case "BOOL":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(boolean.class, field, feature);
        } else {
          int boolAsInt = (boolean) field ? 1 : 0;
          feature.getInt64ListBuilder().addValue(boolAsInt);
        }
        break;
      default:
        throw new RuntimeException("Unsupported type: " + type);
    }
    return feature.build();
  }

  /**
   * The {@link BigQueryToTFRecord#record2Example(SchemaAndRecord)} method takes in a
   * SchemaAndRecord Object returned from a BigQueryIO.read() step and builds a TensorFlow Example
   * from the record.
   */
  @VisibleForTesting
  protected static byte[] record2Example(SchemaAndRecord schemaAndRecord) {
    Example.Builder example = Example.newBuilder();
    Features.Builder features = example.getFeaturesBuilder();
    GenericRecord record = schemaAndRecord.getRecord();
    for (TableFieldSchema field : schemaAndRecord.getTableSchema().getFields()) {
      Object fieldValue = record.get(field.getName());
      if (fieldValue != null) {
        Feature feature = buildFeature(fieldValue, field.getType());
        features.putFeature(field.getName(), feature);
      }
    }
    return example.build().toByteArray();
  }

  /**
   * The {@link BigQueryToTFRecord#concatURI} method takes in a Cloud Storage URI and a
   * subdirectory name and safely concatenates them. The resulting String is used as a sink for
   * TFRecords.
   */
  private static String concatURI(String dir, String folder) {
    if (dir.endsWith("/")) {
      return dir + folder;
    } else {
      return dir + "/" + folder;
    }
  }

  /**
   * The {@link BigQueryToTFRecord#applyTrainTestValSplit} method transforms the PCollection by
   * randomly partitioning it into PCollections for each dataset.
   */
  static PCollectionList<byte[]> applyTrainTestValSplit(
      PCollection<byte[]> input,
      ValueProvider<Float> trainingPercentage,
      ValueProvider<Float> testingPercentage,
      ValueProvider<Float> validationPercentage,
      Random rand) {
    return input.apply(
        Partition.of(
            3,
            (Partition.PartitionFn<byte[]>)
                (number, numPartitions) -> {
                  Float train = trainingPercentage.get();
                  Float test = testingPercentage.get();
                  Float validation = validationPercentage.get();
                  Double d = rand.nextDouble();
                  if (train + test + validation != 1) {
                    throw new RuntimeException(
                        String.format(
                            "Train %.2f, Test %.2f, Validation"
                                + " %.2f percentages must add up to 100 percent",
                            train, test, validation));
                  }
                  if (d < train) {
                    return 0;
                  } else if (d >= train && d < train + test) {
                    return 1;
                  } else {
                    return 2;
                  }
                }));
  }

  /** Run the pipeline. */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {
    Random rand = new Random(100); // set random seed
    Pipeline pipeline = Pipeline.create(options);

    PCollection<byte[]> bigQueryToExamples =
        pipeline
            .apply(
                "RecordToExample",
                BigQueryIO.read(BigQueryToTFRecord::record2Example)
                    .fromQuery(options.getReadQuery())
                    .withCoder(ByteArrayCoder.of())
                    .withTemplateCompatibility()
                    .withoutValidation()
                    .usingStandardSql()
                    .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
                // Enable BigQuery Storage API
                )
            .apply("ReshuffleResults", Reshuffle.viaRandomKey());

    PCollectionList<byte[]> partitionedExamples =
        applyTrainTestValSplit(
            bigQueryToExamples,
            options.getTrainingPercentage(),
            options.getTestingPercentage(),
            options.getValidationPercentage(),
            rand);

    partitionedExamples
        .get(0)
        .apply(
            "WriteTFTrainingRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, TRAIN)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    partitionedExamples
        .get(1)
        .apply(
            "WriteTFTestingRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, TEST)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    partitionedExamples
        .get(2)
        .apply(
            "WriteTFValidationRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, VAL)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    return pipeline.run();
  }

  /** Define command line arguments. */
  public interface Options extends BigQueryReadOptions {

    @TemplateParameter.GcsWriteFolder(
        order = 1,
        description = "Output Cloud Storage directory.",
        helpText = "Cloud Storage directory to store output TFRecord files.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 2,
        optional = true,
        regexes = {"^[A-Za-z_0-9.]*"},
        description = "The output suffix for TFRecord files",
        helpText = "File suffix to append to TFRecord files. Defaults to .tfrecord")
    @Default.String(".tfrecord")
    ValueProvider<String> getOutputSuffix();

    void setOutputSuffix(ValueProvider<String> outputSuffix);

    @TemplateParameter.Text(
        order = 3,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the training set ",
        helpText = "Defaults to 1 or 100%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(1)
    ValueProvider<Float> getTrainingPercentage();

    void setTrainingPercentage(ValueProvider<Float> trainingPercentage);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the testing set ",
        helpText = "Defaults to 0 or 0%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(0)
    ValueProvider<Float> getTestingPercentage();

    void setTestingPercentage(ValueProvider<Float> testingPercentage);

    @TemplateParameter.Text(
        order = 5,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the validation set ",
        helpText = "Defaults to 0 or 0%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(0)
    ValueProvider<Float> getValidationPercentage();

    void setValidationPercentage(ValueProvider<Float> validationPercentage);
  }
}
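
The type handling in `buildFeature` above can be summarized as a mapping from BigQuery column types to the three TensorFlow feature lists. The sketch below restates that mapping in plain Java without the TensorFlow or Beam dependencies; the `FeatureKindSketch` class and `FeatureKind` enum are hypothetical names introduced only for illustration.

```java
/** Hypothetical summary of the column-type mapping used by buildFeature. */
final class FeatureKindSketch {

  /** The three list types that a TensorFlow Feature can hold. */
  enum FeatureKind {
    BYTES_LIST,
    INT64_LIST,
    FLOAT_LIST
  }

  /** Returns the feature list that a BigQuery column of the given type is written to. */
  static FeatureKind kindFor(String bigQueryType) {
    switch (bigQueryType) {
      case "STRING":
      case "TIME":
      case "DATE":
      case "BYTES":
        return FeatureKind.BYTES_LIST;
      case "INTEGER":
      case "INT64":
      case "TIMESTAMP":
      case "BOOLEAN": // booleans are stored as 0 or 1
      case "BOOL":
        return FeatureKind.INT64_LIST;
      case "FLOAT":
      case "FLOAT64":
        return FeatureKind.FLOAT_LIST; // doubles are narrowed to float
      default:
        throw new IllegalArgumentException("Unsupported type: " + bigQueryType);
    }
  }

  public static void main(String[] args) {
    for (String type : new String[] {"STRING", "TIMESTAMP", "FLOAT64", "BOOL"}) {
      System.out.println(type + " -> " + kindFor(type));
    }
  }
}
```

Any other column type, including nested records, falls through to the default branch, which is why the template description asks for nested columns to be unnested in the SQL query.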

BigQuery export to Parquet (via Storage API)

The BigQuery export to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template uses the BigQuery Storage API to export the data.

Requirements for this pipeline:

The input BigQuery table must exist before you run the pipeline.
The output Cloud Storage bucket must exist before you run the pipeline.

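Template parameters

Parameter	Description
`tableRef`	The location of the BigQuery input table. Example: `<my-project>:<my-dataset>.<my-table>`
`bucket`	The Cloud Storage folder in which to write the Parquet files. Example: `gs://mybucket/exports`
`numShards`	(Optional) The number of output file shards. The default value is 1.
`fields`	(Optional) A comma-separated list of fields to select from the input BigQuery table.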
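
To make the two string formats concrete, the following sketch parses a `tableRef` of the form `<project>:<dataset>.<table>` and a comma-separated `fields` value in plain Java. It is an illustration only: the class and method names are hypothetical, the template itself relies on Beam's `BigQueryHelpers.parseTableSpec`, and the sketch assumes a simple project ID with no domain prefix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the tableRef and fields parameter formats. */
final class ParquetParamsSketch {

  /** Splits "project:dataset.table" into {project, dataset, table}. */
  static String[] parseTableRef(String tableRef) {
    int colon = tableRef.indexOf(':');
    int dot = tableRef.indexOf('.', colon + 1);
    if (colon <= 0 || dot <= colon + 1 || dot == tableRef.length() - 1) {
      throw new IllegalArgumentException("Expected <project>:<dataset>.<table>: " + tableRef);
    }
    return new String[] {
      tableRef.substring(0, colon),
      tableRef.substring(colon + 1, dot),
      tableRef.substring(dot + 1)
    };
  }

  /** Splits a comma-separated field list, trimming whitespace and skipping empty entries. */
  static List<String> parseFields(String fields) {
    List<String> result = new ArrayList<>();
    if (fields == null) {
      return result; // no field selection: export every column
    }
    for (String field : fields.split(",")) {
      String trimmed = field.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String[] parts = parseTableRef("my-project:my-dataset.my-table");
    System.out.println(String.join(" / ", parts));
    System.out.println(parseFields("field1, field2"));
  }
}
```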

Running the BigQuery to Cloud Storage Parquet template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the BigQuery export to Parquet (via Storage API) template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet \
    --region=REGION_NAME \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS

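Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGQUERY_TABLE: your BigQuery table name
OUTPUT_DIRECTORY: the Cloud Storage folder for the output files
NUM_SHARDS: the desired number of output file shards
FIELDS: the comma-separated list of fields to select from the input BigQuery table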

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "tableRef": "BIGQUERY_TABLE",
          "bucket": "OUTPUT_DIRECTORY",
          "numShards": "NUM_SHARDS",
          "fields": "FIELDS"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet"
   }
}

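Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGQUERY_TABLE: your BigQuery table name
OUTPUT_DIRECTORY: the Cloud Storage folder for the output files
NUM_SHARDS: the desired number of output file shards
FIELDS: the comma-separated list of fields to select from the input BigQuery table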
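
For callers that build the flexTemplates:launch request body in code, a minimal Java sketch might look like the following. The class and method names are hypothetical, and the values are assumed to contain no characters that need JSON escaping; a JSON library would be the safer choice for arbitrary input.

```java
/** Hypothetical builder for the flexTemplates:launch request body. */
final class ParquetLaunchBodySketch {

  /** Fills the placeholders of the request body shown above. */
  static String launchBody(
      String jobName,
      String version,
      String tableRef,
      String bucket,
      int numShards,
      String fields) {
    // Assumption: the values contain no characters that need JSON escaping.
    return String.format(
        "{\"launch_parameter\":{"
            + "\"jobName\":\"%s\","
            + "\"parameters\":{"
            + "\"tableRef\":\"%s\",\"bucket\":\"%s\",\"numShards\":\"%d\",\"fields\":\"%s\"},"
            + "\"containerSpecGcsPath\":"
            + "\"gs://dataflow-templates/%s/flex/BigQuery_to_Parquet\""
            + "}}",
        jobName, tableRef, bucket, numShards, fields, version);
  }

  public static void main(String[] args) {
    System.out.println(
        launchBody(
            "my-job",
            "latest",
            "my-project:my-dataset.my-table",
            "gs://mybucket/exports",
            5,
            "field1,field2"));
  }
}
```

Note that parameter values are passed as strings, including `numShards`, and that JSON does not permit a trailing comma after the last member of an object.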

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.api.gax.rpc.InvalidArgumentException;
import com.google.api.services.bigquery.model.TableReference;
import com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient;
import com.google.cloud.bigquery.storage.v1beta1.ReadOptions.TableReadOptions;
import com.google.cloud.bigquery.storage.v1beta1.Storage.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1beta1.Storage.ReadSession;
import com.google.cloud.bigquery.storage.v1beta1.TableReferenceProto;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.BigQueryToParquet.BigQueryToParquetOptions;
import com.google.common.base.Splitter;
import com.google.common.base.Strings;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link BigQueryToParquet} pipeline exports data from a BigQuery table to Parquet file(s) in a
 * Google Cloud Storage bucket.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>BigQuery Table exists.
 *   <li>Google Cloud Storage bucket exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 * TABLE=${PROJECT}:my-dataset.my-table
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="bigquery-to-parquet-`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json"     \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${BUCKET_NAME}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${BUCKET_NAME}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *           "tableRef":"'$TABLE'",
 *           "bucket":"'$BUCKET_NAME/results'",
 *           "numShards":"5",
 *           "fields":"field1,field2"
 *        }
 *       }
 *      '
 * </pre>
 */
@Template(
    name = "BigQuery_to_Parquet",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery export to Parquet (via Storage API)",
    description =
        "A pipeline to export a BigQuery table into Parquet files using the BigQuery Storage API.",
    optionsClass = BigQueryToParquetOptions.class,
    flexContainerName = "bigquery-to-parquet",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToParquet {

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(BigQueryToParquet.class);

  /** File suffix for file to be written. */
  private static final String FILE_SUFFIX = ".parquet";

  /** Factory to create BigQueryStorageClients. */
  static class BigQueryStorageClientFactory {

    /**
     * Creates BigQueryStorage client for use in extracting table schema.
     *
     * @return BigQueryStorageClient
     */
    static BigQueryStorageClient create() {
      try {
        return BigQueryStorageClient.create();
      } catch (IOException e) {
        LOG.error("Error connecting to BigQueryStorage API: " + e.getMessage());
        throw new RuntimeException(e);
      }
    }
  }

  /** Factory to create ReadSessions. */
  static class ReadSessionFactory {

    /**
     * Creates ReadSession for schema extraction.
     *
     * @param client BigQueryStorage client used to create ReadSession.
     * @param tableString String that represents table to export from.
     * @param tableReadOptions TableReadOptions that specify any fields in the table to filter on.
     * @return session ReadSession object that contains the schema for the export.
     */
    static ReadSession create(
        BigQueryStorageClient client, String tableString, TableReadOptions tableReadOptions) {
      TableReference tableReference = BigQueryHelpers.parseTableSpec(tableString);
      String parentProjectId = "projects/" + tableReference.getProjectId();

      TableReferenceProto.TableReference storageTableRef =
          TableReferenceProto.TableReference.newBuilder()
              .setProjectId(tableReference.getProjectId())
              .setDatasetId(tableReference.getDatasetId())
              .setTableId(tableReference.getTableId())
              .build();

      CreateReadSessionRequest.Builder builder =
          CreateReadSessionRequest.newBuilder()
              .setParent(parentProjectId)
              .setReadOptions(tableReadOptions)
              .setTableReference(storageTableRef);
      try {
        return client.createReadSession(builder.build());
      } catch (InvalidArgumentException iae) {
        LOG.error("Error creating ReadSession: " + iae.getMessage());
        throw new RuntimeException(iae);
      }
    }
  }

  /**
   * The {@link BigQueryToParquetOptions} class provides the custom execution options passed by the
   * executor at the command-line.
   */
  public interface BigQueryToParquetOptions extends PipelineOptions {
    @TemplateParameter.BigQueryTable(
        order = 1,
        description = "BigQuery table to export",
        helpText = "BigQuery table location to export in the format <project>:<dataset>.<table>.",
        example = "your-project:your-dataset.your-table-name")
    @Required
    String getTableRef();

    void setTableRef(String tableRef);

    @TemplateParameter.GcsWriteFile(
        order = 2,
        description = "Output Cloud Storage file(s)",
        helpText = "Path and filename prefix for writing output files.",
        example = "gs://your-bucket/export/")
    @Required
    String getBucket();

    void setBucket(String bucket);

    @TemplateParameter.Integer(
        order = 3,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of shards"
                + " means higher throughput for writing to Cloud Storage, but potentially higher"
                + " data aggregation cost across shards when processing output Cloud Storage"
                + " files.")
    @Default.Integer(0)
    Integer getNumShards();

    void setNumShards(Integer numShards);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "List of field names",
        helpText = "Comma separated list of fields to select from the table.")
    String getFields();

    void setFields(String fields);

    @TemplateParameter.Text(
        order = 5,
        optional = true,
        description = "Row restrictions/filter.",
        helpText =
            "Read only rows which match the specified filter, which must be a SQL expression"
                + " compatible with Google standard SQL"
                + " (https://cloud.google.com/bigquery/docs/reference/standard-sql). If no value is"
                + " specified, then all rows are returned.")
    String getRowRestriction();

    void setRowRestriction(String restriction);
  }

  /**
   * The {@link BigQueryToParquet#getTableSchema(ReadSession)} method gets the Avro schema for the
   * table from the {@link ReadSession} object.
   *
   * @param session ReadSession that contains schema for table, filtered by fields if any.
   * @return avroSchema Avro schema for table. If fields are provided then schema will only contain
   *     those fields.
   */
  private static Schema getTableSchema(ReadSession session) {
    Schema avroSchema;

    avroSchema = new Schema.Parser().parse(session.getAvroSchema().getSchema());
    LOG.info("Schema for export is: " + avroSchema.toString());

    return avroSchema;
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    BigQueryToParquetOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryToParquetOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(BigQueryToParquetOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    TableReadOptions.Builder builder = TableReadOptions.newBuilder();

    /* Add fields to filter export on, if any. */
    if (options.getFields() != null) {
      builder.addAllSelectedFields(Arrays.asList(options.getFields().split(",\\s*")));
    }

    TableReadOptions tableReadOptions = builder.build();
    BigQueryStorageClient client = BigQueryStorageClientFactory.create();
    ReadSession session =
        ReadSessionFactory.create(client, options.getTableRef(), tableReadOptions);

    // Extract schema from ReadSession
    Schema schema = getTableSchema(session);
    client.close();

    TypedRead<GenericRecord> readFromBQ =
        BigQueryIO.read(SchemaAndRecord::getRecord)
            .from(options.getTableRef())
            .withTemplateCompatibility()
            .withMethod(Method.DIRECT_READ)
            .withCoder(AvroCoder.of(schema));

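    if (options.getFields() != null) {
      // Trim whitespace and skip empty entries so that "a, b" selects fields "a" and "b".
      List<String> selectedFields =
          Splitter.on(",").trimResults().omitEmptyStrings().splitToList(options.getFields());
      readFromBQ =
          selectedFields.isEmpty() ? readFromBQ : readFromBQ.withSelectedFields(selectedFields);
    }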

    // Add row restrictions/filter if any.
    if (!Strings.isNullOrEmpty(options.getRowRestriction())) {
      readFromBQ = readFromBQ.withRowRestriction(options.getRowRestriction());
    }

    /*
     * Steps: 1) Read records from BigQuery via BigQueryIO.
     *        2) Write records to Google Cloud Storage in Parquet format.
     */
    pipeline
        /*
         * Step 1: Read records via BigQueryIO using supplied schema as a PCollection of
         *         {@link GenericRecord}.
         */
        .apply("ReadFromBigQuery", readFromBQ)
        /*
         * Step 2: Write records to Google Cloud Storage as one or more Parquet files
         *         via {@link ParquetIO}.
         */
        .apply(
            "WriteToParquet",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to(options.getBucket())
                .withNumShards(options.getNumShards())
                .withSuffix(FILE_SUFFIX));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}
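
The `fields` option is a comma-separated list of field names. The following is a minimal sketch, not part of the template, of one way to normalize such a list before handing it to a reader. The class name `FieldListParser` and its `parse` method are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the template) that normalizes a comma-separated field list. */
class FieldListParser {

  /** Splits on commas, trims surrounding whitespace, and drops empty entries. */
  static List<String> parse(String fields) {
    List<String> result = new ArrayList<>();
    if (fields == null) {
      return result;
    }
    for (String part : fields.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Whitespace around the commas does not end up in the field names.
    System.out.println(parse("name, age ,city")); // prints [name, age, city]
  }
}
```

Trimming each entry keeps a value such as `name, age` from producing a field called ` age`, which would not match any column.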

BigQuery to Elasticsearch

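The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. The template can read the entire table or read specific records by using a query.

Requirements for this pipeline:

The source BigQuery table must exist.
An Elasticsearch host must exist on a Google Cloud instance or on Elastic Cloud running Elasticsearch version 7.0 or later, and it must be accessible from the Dataflow worker machines.

Template parameters

Parameter	Description
`connectionUrl`	The Elasticsearch URL in the format `https://hostname:[port]`. If you use Elastic Cloud, specify the CloudID.
`apiKey`	The Base64-encoded API key used for authentication.
`index`	The Elasticsearch index that requests are issued to. Example: `my-index`
`inputTableSpec`	(Optional) The BigQuery table to read from for insertion into Elasticsearch. Either a table or a query must be specified. Example: `projectId:datasetId.tablename`
`query`	(Optional) The SQL query that pulls data from BigQuery. Either a table or a query must be specified.
`useLegacySql`	(Optional) Set to true to use legacy SQL (applies only when a query is supplied). Default: `false`
`batchSize`	(Optional) The batch size in number of documents. Default: `1000`
`batchSizeBytes`	(Optional) The batch size in number of bytes. Default: `5242880` (5 MB).
`maxRetryAttempts`	(Optional) The maximum number of retry attempts. Must be greater than 0. Default: `no retries`
`maxRetryDuration`	(Optional) The maximum retry duration in milliseconds. Must be greater than 0. Default: `no retries`
`propertyAsIndex`	(Optional) A property in the document being indexed whose value specifies the `_index` metadata to include with the document in the bulk request (takes precedence over an `_index` UDF). Default: none
`propertyAsId`	(Optional) A property in the document being indexed whose value specifies the `_id` metadata to include with the document in the bulk request (takes precedence over an `_id` UDF). Default: none
`javaScriptIndexFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_index` metadata to include with the document in the bulk request. Default: none
`javaScriptIndexFnName`	(Optional) The name of the UDF JavaScript function that specifies the `_index` metadata to include with the document in the bulk request. Default: none
`javaScriptIdFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_id` metadata to include with the document in the bulk request. Default: none
`javaScriptIdFnName`	(Optional) The name of the UDF JavaScript function that specifies the `_id` metadata to include with the document in the bulk request. Default: none
`javaScriptTypeFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_type` metadata to include with the document in the bulk request. Default: none
`javaScriptTypeFnName`	(Optional) The name of the UDF JavaScript function that specifies the `_type` metadata to include with the document in the bulk request. Default: none
`javaScriptIsDeleteFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that determines whether the document is deleted rather than inserted or updated. The function must return the string value `"true"` or `"false"`. Default: none
`javaScriptIsDeleteFnName`	(Optional) The name of the UDF JavaScript function that determines whether the document is deleted rather than inserted or updated. The function must return the string value `"true"` or `"false"`. Default: none
`usePartialUpdate`	(Optional) Whether to use partial updates (update rather than create or index, allowing partial documents) with Elasticsearch requests. Default: `false`
`bulkInsertMethod`	(Optional) Whether to use `INDEX` (index, allows upserts) or `CREATE` (create, errors on a duplicate `_id`) with Elasticsearch bulk requests. Default: `CREATE`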
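
The `apiKey` parameter expects a Base64-encoded value. As a rough sketch, assuming the common Elasticsearch convention in which the encoded key is the Base64 form of `id:api_key` in UTF-8, the value could be produced as follows. The class name `ApiKeyEncoder` and the sample credentials are hypothetical; if Elasticsearch or Kibana already gave you an encoded key, use that value directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that builds a Base64-encoded API key from its two parts. */
class ApiKeyEncoder {

  /** Joins the key ID and secret with a colon (an assumed convention) and Base64-encodes them. */
  static String encode(String id, String apiKey) {
    String raw = id + ":" + apiKey;
    return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    // Placeholder credentials; substitute the ID and secret of your own API key.
    System.out.println(encode("my-id", "my-secret"));
  }
}
```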

Running the BigQuery to Elasticsearch template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX

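Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
INPUT_TABLE_SPEC: the BigQuery table name
CONNECTION_URL: the Elasticsearch URL
APIKEY: the Base64-encoded API key used for authentication
INDEX: the Elasticsearch index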
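
The VERSION placeholder decides which folder of the `gs://dataflow-templates` bucket the template is loaded from. The following sketch shows how the path used in the command above is assembled for a pinned version and for `latest`. The helper `TemplatePath` is hypothetical and is not part of any Dataflow tooling.

```java
/** Hypothetical helper (not part of the Dataflow tooling) for building template paths. */
class TemplatePath {
  private static final String BUCKET = "gs://dataflow-templates";

  /** Returns the Cloud Storage path of a Flex Template inside the given version folder. */
  static String flexTemplate(String version, String templateName) {
    if (version == null || version.isEmpty()) {
      throw new IllegalArgumentException("version must be 'latest' or a dated version name");
    }
    return BUCKET + "/" + version + "/flex/" + templateName;
  }

  public static void main(String[] args) {
    // Pinned, dated version: the choice recommended for production workflows.
    System.out.println(flexTemplate("2021-09-20-00_RC00", "BigQuery_to_Elasticsearch"));
    // Floating version: convenient for experiments, but it can pick up breaking changes.
    System.out.println(flexTemplate("latest", "BigQuery_to_Elasticsearch"));
  }
}
```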

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch"
   }
}

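Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
INPUT_TABLE_SPEC: the BigQuery table name
CONNECTION_URL: the Elasticsearch URL
APIKEY: the Base64-encoded API key used for authentication
INDEX: the Elasticsearch index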
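
The request above can also be assembled programmatically. The following sketch only builds the URL and the JSON body as strings and does not send anything. The class `FlexLaunchRequest` and its methods are hypothetical, and the escaping handles only quotes and backslashes, so a real client would more likely use a JSON library together with an authenticated HTTP client.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical builder for the flexTemplates:launch URL and request body shown above. */
class FlexLaunchRequest {

  /** Builds the launch endpoint for a project and regional endpoint. */
  static String url(String projectId, String location) {
    return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location + "/flexTemplates:launch";
  }

  /** Wraps a value in double quotes, escaping backslashes and quotes only. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Builds the JSON body with the same field names as the request above. */
  static String body(String jobName, String containerSpecGcsPath, Map<String, String> parameters) {
    StringJoiner params = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : parameters.entrySet()) {
      params.add(quote(e.getKey()) + ":" + quote(e.getValue()));
    }
    return "{\"launch_parameter\":{"
        + "\"jobName\":" + quote(jobName) + ","
        + "\"parameters\":" + params + ","
        + "\"containerSpecGcsPath\":" + quote(containerSpecGcsPath)
        + "}}";
  }

  public static void main(String[] args) {
    // Placeholder values; LinkedHashMap keeps the parameters in insertion order.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("inputTableSpec", "my-project:my_dataset.my_table");
    parameters.put("connectionUrl", "https://hostname:9200");
    parameters.put("apiKey", "APIKEY");
    parameters.put("index", "my-index");

    System.out.println(url("my-project", "us-central1"));
    System.out.println(
        body(
            "my-job",
            "gs://dataflow-templates/latest/flex/BigQuery_to_Elasticsearch",
            parameters));
  }
}
```

Note that the last member of each JSON object has no trailing comma; strict JSON parsers reject one.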

Template source code

Java


/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.BigQueryToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.ReadBigQuery;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.TableRowToJsonFn;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.TransformTextViaJavascript;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

/**
 * The {@link BigQueryToElasticsearch} pipeline exports data from a BigQuery table to Elasticsearch.
 *
 * <p>Please refer to <b><a href=
 * "https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/googlecloud-to-elasticsearch/docs/BigQueryToElasticsearch/README.md">
 * README.md</a></b> for further information.
 */
@Template(
    name = "BigQuery_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to Elasticsearch",
    description =
        "A pipeline which sends BigQuery records into an Elasticsearch instance as json documents.",
    optionsClass = BigQueryToElasticsearchOptions.class,
    flexContainerName = "bigquery-to-elasticsearch",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToElasticsearch {
  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    BigQueryToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(BigQueryToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(BigQueryToElasticsearchOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);
    /*
     * Steps: 1) Read records from BigQuery via BigQueryIO.
     *        2) Create json string from Table Row.
     *        3) Apply Javascript UDF functions (if specified).
     *        4) Write records to Elasticsearch.
     *
     *
     * Step #1: Read from BigQuery. If a query is provided then it is used to get the TableRows.
     */
    pipeline
        .apply(
            "ReadFromBigQuery",
            ReadBigQuery.newBuilder()
                .setOptions(options.as(BigQueryToElasticsearchOptions.class))
                .build())

        /*
         * Step #2: Convert table rows to JSON documents.
         */
        .apply("TableRowsToJsonDocument", ParDo.of(new TableRowToJsonFn()))

        /*
         * Step #3: Apply UDF functions (if specified)
         */
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())

        /*
         * Step #4: Write converted records to Elasticsearch
         */
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setOptions(options.as(BigQueryToElasticsearchOptions.class))
                .build());

    return pipeline.run();
  }
}

BigQuery to MongoDB

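The BigQuery to MongoDB template is a batch pipeline that reads rows from BigQuery and writes them to MongoDB as documents. Currently, each row is stored as a document.

Requirements for this pipeline:

The source BigQuery table must exist.
The target MongoDB instance must be accessible from the Dataflow worker machines.

Template parameters

Parameter	Description
`mongoDbUri`	The MongoDB connection URI in the format `mongodb+srv://:@`.
`database`	The database in MongoDB that stores the collection. Example: `my-db`.
`collection`	The name of the collection in the MongoDB database. Example: `my-collection`.
`inputTableSpec`	The BigQuery table to read from. Example: `bigquery-project:dataset.input_table`.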

Running the BigQuery to MongoDB template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to MongoDB template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

  gcloud beta dataflow flex-template run JOB_NAME \
      --project=PROJECT_ID \
      --region=REGION_NAME \
      --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB \
      --parameters \
  inputTableSpec=INPUT_TABLE_SPEC,\
  mongoDbUri=MONGO_DB_URI,\
  database=DATABASE,\
  collection=COLLECTION

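Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
INPUT_TABLE_SPEC: the source BigQuery table name
MONGO_DB_URI: the MongoDB URI
DATABASE: the MongoDB database
COLLECTION: the MongoDB collection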

API

  POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
  {
     "launch_parameter": {
        "jobName": "JOB_NAME",
        "parameters": {
            "inputTableSpec": "INPUT_TABLE_SPEC",
            "mongoDbUri": "MONGO_DB_URI",
            "database": "DATABASE",
            "collection": "COLLECTION"
        },
        "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB"
     }
  }

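Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
INPUT_TABLE_SPEC: the source BigQuery table name
MONGO_DB_URI: the MongoDB URI
DATABASE: the MongoDB database
COLLECTION: the MongoDB collection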

Template source code

Java


/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.mongodb.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.mongodb.options.BigQueryToMongoDbOptions.BigQueryReadOptions;
import com.google.cloud.teleport.v2.mongodb.options.BigQueryToMongoDbOptions.MongoDbOptions;
import com.google.cloud.teleport.v2.mongodb.templates.BigQueryToMongoDb.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.bson.Document;

/**
 * The {@link BigQueryToMongoDb} pipeline is a batch pipeline which reads data from BigQuery and
 * outputs the resulting records to MongoDB.
 */
@Template(
    name = "BigQuery_to_MongoDB",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to MongoDB",
    description =
        "A batch pipeline which reads data rows from BigQuery and writes them to MongoDB as"
            + " documents.",
    optionsClass = Options.class,
    flexContainerName = "bigquery-to-mongodb",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToMongoDb {
  /**
   * Options supported by {@link BigQueryToMongoDb}
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options extends PipelineOptions, MongoDbOptions, BigQueryReadOptions {}

  private static class ParseAsDocumentsFn extends DoFn<String, Document> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  public static boolean run(Options options) {
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(BigQueryIO.readTableRows().withoutValidation().from(options.getInputTableSpec()))
        .apply(
            "bigQueryDataset",
            ParDo.of(
                new DoFn<TableRow, Document>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document doc = new Document();
                    TableRow row = c.element();
                    row.forEach(
                        (key, value) -> {
                          // Compare by value; != on strings compares object references.
                          if (!"_id".equals(key)) {
                            doc.append(key, value);
                          }
                        });
                    c.output(doc);
                  }
                }))
        .apply(
            MongoDbIO.write()
                .withUri(options.getMongoDbUri())
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection()));
    pipeline.run();
    return true;
  }
}
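
The transform in the source above copies every BigQuery column into the MongoDB document except a column named `_id`. The following is a minimal sketch of that copy step using plain `java.util.Map` values as stand-ins for `TableRow` and `Document`. The class `RowToDocument` is hypothetical. The key is checked with `equals`, because `!=` on Java strings compares object references rather than contents.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the row-to-document copy step, using plain maps. */
class RowToDocument {

  /** Copies all entries except the one whose key equals "_id". */
  static Map<String, Object> toDocument(Map<String, Object> row) {
    Map<String, Object> doc = new LinkedHashMap<>();
    row.forEach(
        (key, value) -> {
          if (!"_id".equals(key)) {
            doc.put(key, value);
          }
        });
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    // new String(...) creates a distinct object, so a reference comparison would not match it.
    row.put(new String("_id"), 42);
    row.put("name", "Ada");
    row.put("age", 36);
    System.out.println(toDocument(row)); // prints {name=Ada, age=36}
  }
}
```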

Bigtable to Cloud Storage Avro

The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`outputDirectory`	The Cloud Storage path where data is written. Example: `gs://mybucket/somefolder`
`filenamePrefix`	The prefix of the Avro filename. Example: `output-`

Running the Bigtable to Cloud Storage Avro file template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX

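Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Avro filename, for example output-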

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

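Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might include breaking changes. To prevent such changes from affecting production workflows, use the templates stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Avro filename, for example output-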

Template source code

Java


/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.google.bigtable.v2.Cell;
import com.google.bigtable.v2.Column;
import com.google.bigtable.v2.Family;
import com.google.bigtable.v2.Row;
import com.google.cloud.teleport.bigtable.BigtableToAvro.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.protobuf.ByteOutput;
import com.google.protobuf.ByteString;
import com.google.protobuf.UnsafeByteOperations;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.SimpleFunction;

/**
 * Dataflow pipeline that exports data from a Cloud Bigtable table to Avro files in GCS. Currently,
 * filtering on Cloud Bigtable table is not supported.
 */
@Template(
    name = "Cloud_Bigtable_to_GCS_Avro",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Bigtable to Avro Files in Cloud Storage",
    description =
        "A pipeline which reads in Cloud Bigtable table and writes it to Cloud Storage in Avro format.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BigtableToAvro {

  /** Options for the export pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to read")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path/")
    ValueProvider<String> getOutputDirectory();

    @SuppressWarnings("unused")
    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 5,
        description = "Avro file prefix",
        helpText = "The prefix of the Avro file name. For example, \"table1-\"")
    ValueProvider<String> getFilenamePrefix();

    @SuppressWarnings("unused")
    void setFilenamePrefix(ValueProvider<String> filenamePrefix);
  }

  /**
   * Runs a pipeline to export data from a Cloud Bigtable table to Avro files in GCS.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Read read =
        BigtableIO.read()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    // Do not validate input fields if it is running as a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() != null) {
      read = read.withoutValidation();
    }

    ValueProvider<String> filePathPrefix =
        DualInputNestedValueProvider.of(
            options.getOutputDirectory(),
            options.getFilenamePrefix(),
            new SerializableFunction<TranslatorInput<String, String>, String>() {
              @Override
              public String apply(TranslatorInput<String, String> input) {
                return new StringBuilder(input.getX()).append(input.getY()).toString();
              }
            });

    pipeline
        .apply("Read from Bigtable", read)
        .apply("Transform to Avro", MapElements.via(new BigtableToAvroFn()))
        .apply(
            "Write to Avro in GCS",
            AvroIO.write(BigtableRow.class).to(filePathPrefix).withSuffix(".avro"));

    return pipeline.run();
  }

  /** Translates Bigtable {@link Row} to Avro {@link BigtableRow}. */
  static class BigtableToAvroFn extends SimpleFunction<Row, BigtableRow> {
    @Override
    public BigtableRow apply(Row row) {
      ByteBuffer key = ByteBuffer.wrap(toByteArray(row.getKey()));
      List<BigtableCell> cells = new ArrayList<>();
      for (Family family : row.getFamiliesList()) {
        String familyName = family.getName();
        for (Column column : family.getColumnsList()) {
          ByteBuffer qualifier = ByteBuffer.wrap(toByteArray(column.getQualifier()));
          for (Cell cell : column.getCellsList()) {
            long timestamp = cell.getTimestampMicros();
            ByteBuffer value = ByteBuffer.wrap(toByteArray(cell.getValue()));
            cells.add(new BigtableCell(familyName, qualifier, timestamp, value));
          }
        }
      }
      return new BigtableRow(key, cells);
    }
  }

  /**
   * Extracts the byte array from the given {@link ByteString} without copy.
   *
   * @param byteString A {@link ByteString} from which to extract the array.
   * @return an array of byte.
   */
  protected static byte[] toByteArray(final ByteString byteString) {
    try {
      ZeroCopyByteOutput byteOutput = new ZeroCopyByteOutput();
      UnsafeByteOperations.unsafeWriteTo(byteString, byteOutput);
      return byteOutput.bytes;
    } catch (IOException e) {
      return byteString.toByteArray();
    }
  }

  private static final class ZeroCopyByteOutput extends ByteOutput {
    private byte[] bytes;

    @Override
    public void writeLazy(byte[] value, int offset, int length) {
      if (offset != 0 || length != value.length) {
        throw new UnsupportedOperationException();
      }
      bytes = value;
    }

    @Override
    public void write(byte value) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void write(byte[] value, int offset, int length) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void write(ByteBuffer value) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void writeLazy(ByteBuffer value) {
      throw new UnsupportedOperationException();
    }
  }
}
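
In the source above, the output file path prefix is formed by appending `filenamePrefix` directly to `outputDirectory`, with no separator added between them. The following sketch shows the effect of a missing trailing slash. The class `OutputPrefix` and the `withTrailingSlash` helper are hypothetical and are not part of the template.

```java
/** Hypothetical illustration of how the output directory and filename prefix combine. */
class OutputPrefix {

  /** Mirrors the plain concatenation used in the template source. */
  static String filePathPrefix(String outputDirectory, String filenamePrefix) {
    return outputDirectory + filenamePrefix;
  }

  /** Appends a slash only when the directory does not already end with one. */
  static String withTrailingSlash(String outputDirectory) {
    return outputDirectory.endsWith("/") ? outputDirectory : outputDirectory + "/";
  }

  public static void main(String[] args) {
    // Without a trailing slash, the prefix is glued onto the folder name.
    System.out.println(filePathPrefix("gs://mybucket/somefolder", "output-"));
    // prints gs://mybucket/somefolderoutput-

    // With a trailing slash, files land inside the folder.
    System.out.println(filePathPrefix(withTrailingSlash("gs://mybucket/somefolder"), "output-"));
    // prints gs://mybucket/somefolder/output-
  }
}
```

This is why the parameter help text says the output directory must end with a slash.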

Bigtable to Cloud Storage Parquet

The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`outputDirectory`	The Cloud Storage path where data is written. Example: `gs://mybucket/somefolder`
`filenamePrefix`	The prefix of the Parquet filename. Example: `output-`
`numShards`	The number of output file shards. Example: `2`

Running the Bigtable to Cloud Storage Parquet template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX,\
numShards=NUM_SHARDS

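Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Parquet file name, for example output-
NUM_SHARDS: the number of Parquet files to output, for example 1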
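
The VERSION value above determines which Cloud Storage path the job is launched from. As a minimal sketch, the following Java helper assembles that path and tells a pinned, dated version apart from latest. The class and method names are hypothetical, and the pattern for dated version names is an assumption inferred from the single example 2021-09-20-00_RC00.

```java
import java.util.regex.Pattern;

final class TemplatePathSketch {

  // Assumed shape of a dated version name, inferred only from the example 2021-09-20-00_RC00.
  private static final Pattern DATED_VERSION =
      Pattern.compile("\\d{4}-\\d{2}-\\d{2}-\\d{2}_RC\\d{2}");

  /** Returns true when the version names a dated folder rather than "latest". */
  static boolean isPinned(String version) {
    return DATED_VERSION.matcher(version).matches();
  }

  /** Builds the value for --gcs-location (gcloud) or gcsPath (API). */
  static String gcsLocation(String version, String templateName) {
    if (!version.equals("latest") && !isPinned(version)) {
      throw new IllegalArgumentException("Unexpected template version: " + version);
    }
    return "gs://dataflow-templates/" + version + "/" + templateName;
  }

  public static void main(String[] args) {
    String version = "2021-09-20-00_RC00";
    System.out.println(gcsLocation(version, "Cloud_Bigtable_to_GCS_Parquet"));
    if (!isPinned(version)) {
      System.out.println("Warning: 'latest' may pick up breaking changes in production.");
    }
  }
}
```

A production launcher could refuse to start when isPinned returns false, which follows the note about breaking changes in the latest templates.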

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
       "numShards": "NUM_SHARDS"
   },
   "environment": { "zone": "us-central1-f" }
}

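Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Parquet file name, for example output-
NUM_SHARDS: the number of Parquet files to output, for example 1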
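
The launch request above can also be assembled in code. The sketch below builds, but does not send, the POST request with java.net.http (Java 11 and later). The class name, helper methods, and sample values are hypothetical. The access token is assumed to come from something like `gcloud auth print-access-token`, parameter values are assumed to need no JSON escaping, and the environment block is left out for brevity.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class LaunchRequestSketch {

  /** Builds the templates:launch URL, including the gcsPath query parameter. */
  static URI launchUri(String projectId, String location, String version, String template) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version + "/" + template);
  }

  /** Serializes the job name and a flat map of string parameters (no escaping is applied). */
  static String launchBody(String jobName, Map<String, String> parameters) {
    String params = parameters.entrySet().stream()
        .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
        .collect(Collectors.joining(", "));
    return "{\"jobName\": \"" + jobName + "\", \"parameters\": {" + params + "}}";
  }

  /** Wraps the URL and body in an authenticated POST request without sending it. */
  static HttpRequest launchRequest(URI uri, String body, String accessToken) {
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    // Sample values only; replace them as described in the placeholder list.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("bigtableProjectId", "my-project");
    parameters.put("bigtableInstanceId", "my-instance");
    parameters.put("bigtableTableId", "my-table");
    parameters.put("outputDirectory", "gs://mybucket/somefolder");
    parameters.put("filenamePrefix", "output-");
    parameters.put("numShards", "1");

    URI uri = launchUri("my-project", "us-central1", "latest", "Cloud_Bigtable_to_GCS_Parquet");
    String body = launchBody("bigtable-to-parquet", parameters);
    HttpRequest request = launchRequest(uri, body, "ACCESS_TOKEN");

    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

To actually launch the job, the request would be passed to HttpClient.send with a real token. A JSON library would be the safer choice once parameter values can contain quotes or backslashes.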

Template source code

Java


/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import static com.google.cloud.teleport.bigtable.BigtableToAvro.toByteArray;

import com.google.bigtable.v2.Cell;
import com.google.bigtable.v2.Column;
import com.google.bigtable.v2.Family;
import com.google.bigtable.v2.Row;
import com.google.cloud.teleport.bigtable.BigtableToParquet.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

/**
 * Dataflow pipeline that exports data from a Cloud Bigtable table to Parquet files in GCS.
 * Currently, filtering on Cloud Bigtable table is not supported.
 */
@Template(
    name = "Cloud_Bigtable_to_GCS_Parquet",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Bigtable to Parquet Files on Cloud Storage",
    description =
        "A pipeline which reads in Cloud Bigtable table and writes it to Cloud Storage in Parquet format.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BigtableToParquet {

  /** Options for the export pipeline. */
  public interface Options extends PipelineOptions {

    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to export")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDirectory();

    @SuppressWarnings("unused")
    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 5,
        description = "Parquet file prefix",
        helpText = "The prefix of the Parquet file name. For example, \"table1-\"")
    @Default.String("output")
    ValueProvider<String> getFilenamePrefix();

    @SuppressWarnings("unused")
    void setFilenamePrefix(ValueProvider<String> filenamePrefix);

    @TemplateParameter.Integer(
        order = 6,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of "
                + "shards means higher throughput for writing to Cloud Storage, but potentially higher "
                + "data aggregation cost across shards when processing output Cloud Storage files. "
                + "Default value is decided by the runner.")
    @Default.Integer(0)
    ValueProvider<Integer> getNumShards();

    @SuppressWarnings("unused")
    void setNumShards(ValueProvider<Integer> numShards);
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  /**
   * Runs a pipeline to export data from a Cloud Bigtable table to Parquet file(s) in GCS.
   *
   * @param options arguments to the pipeline
   */
  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));
    BigtableIO.Read read =
        BigtableIO.read()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    // Do not validate input fields if it is running as a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() != null) {
      read = read.withoutValidation();
    }

    /**
     * Steps: 1) Read records from Bigtable. 2) Convert a Bigtable Row to a GenericRecord. 3) Write
     * GenericRecord(s) to GCS in parquet format.
     */
    pipeline
        .apply("Read from Bigtable", read)
        .apply("Transform to Parquet", MapElements.via(new BigtableToParquetFn()))
        .setCoder(AvroCoder.of(GenericRecord.class, BigtableRow.getClassSchema()))
        .apply(
            "Write to Parquet in GCS",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(BigtableRow.getClassSchema()))
                .to(options.getOutputDirectory())
                .withPrefix(options.getFilenamePrefix())
                .withSuffix(".parquet")
                .withNumShards(options.getNumShards()));

    return pipeline.run();
  }

  /**
   * Translates a {@link PCollection} of Bigtable {@link Row} to a {@link PCollection} of {@link
   * GenericRecord}.
   */
  static class BigtableToParquetFn extends SimpleFunction<Row, GenericRecord> {
    @Override
    public GenericRecord apply(Row row) {
      ByteBuffer key = ByteBuffer.wrap(toByteArray(row.getKey()));
      List<BigtableCell> cells = new ArrayList<>();
      for (Family family : row.getFamiliesList()) {
        String familyName = family.getName();
        for (Column column : family.getColumnsList()) {
          ByteBuffer qualifier = ByteBuffer.wrap(toByteArray(column.getQualifier()));
          for (Cell cell : column.getCellsList()) {
            long timestamp = cell.getTimestampMicros();
            ByteBuffer value = ByteBuffer.wrap(toByteArray(cell.getValue()));
            cells.add(new BigtableCell(familyName, qualifier, timestamp, value));
          }
        }
      }
      return new GenericRecordBuilder(BigtableRow.getClassSchema())
          .set("key", key)
          .set("cells", cells)
          .build();
    }
  }
}
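
The nested loops in BigtableToParquetFn flatten a row's families, columns, and cells into a single list of cells. The following standalone sketch mirrors that traversal with plain Java collections so it can be run without Beam or the Bigtable client. The FlatCell class and the nested-map representation of a row are stand-ins invented for illustration; they are not the template's BigtableCell or the Bigtable Row type, and values are strings rather than byte buffers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FlattenRowSketch {

  /** Stand-in for one flattened cell (hypothetical; not the template's BigtableCell). */
  static final class FlatCell {
    final String family;
    final String qualifier;
    final long timestampMicros;
    final String value;

    FlatCell(String family, String qualifier, long timestampMicros, String value) {
      this.family = family;
      this.qualifier = qualifier;
      this.timestampMicros = timestampMicros;
      this.value = value;
    }

    @Override
    public String toString() {
      return family + ":" + qualifier + "@" + timestampMicros + "=" + value;
    }
  }

  /** Walks family -> column qualifier -> (timestamp -> value) and emits one FlatCell per cell. */
  static List<FlatCell> flatten(Map<String, Map<String, Map<Long, String>>> families) {
    List<FlatCell> cells = new ArrayList<>();
    for (Map.Entry<String, Map<String, Map<Long, String>>> family : families.entrySet()) {
      for (Map.Entry<String, Map<Long, String>> column : family.getValue().entrySet()) {
        for (Map.Entry<Long, String> cell : column.getValue().entrySet()) {
          cells.add(
              new FlatCell(family.getKey(), column.getKey(), cell.getKey(), cell.getValue()));
        }
      }
    }
    return cells;
  }

  /** Builds a small sample row: one family, two columns, three cells in total. */
  static Map<String, Map<String, Map<Long, String>>> sampleRow() {
    Map<Long, String> nameVersions = new LinkedHashMap<>();
    nameVersions.put(2000L, "alice-v2");
    nameVersions.put(1000L, "alice-v1");
    Map<Long, String> ageVersions = new LinkedHashMap<>();
    ageVersions.put(1000L, "30");

    Map<String, Map<Long, String>> columns = new LinkedHashMap<>();
    columns.put("name", nameVersions);
    columns.put("age", ageVersions);

    Map<String, Map<String, Map<Long, String>>> families = new LinkedHashMap<>();
    families.put("cf1", columns);
    return families;
  }

  public static void main(String[] args) {
    for (FlatCell cell : flatten(sampleRow())) {
      System.out.println(cell);
    }
  }
}
```

Each output record therefore carries the row key plus every version of every cell, so a row with many cell versions yields a correspondingly long cells list.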

Bigtable to Cloud Storage SequenceFile

The Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Bigtable to Cloud Storage.

Requirements for this pipeline:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProject`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`bigtableAppProfileId`	The ID of the Bigtable application profile to use for the export. If you don't specify an app profile, Bigtable uses the instance's default app profile.
`destinationPath`	The Cloud Storage path where data is written. For example, `gs://mybucket/somefolder`
`filenamePrefix`	The prefix of the SequenceFile file name. For example, `output-`

Running the Bigtable to Cloud Storage SequenceFile template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Cloud Bigtable to SequenceFile Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
destinationPath=DESTINATION_PATH,\
filenamePrefix=FILENAME_PREFIX

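Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the export
DESTINATION_PATH: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the SequenceFile file name, for example output-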
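
Because Bigtable falls back to the instance's default app profile when none is specified, the bigtableAppProfileId entry can be dropped from the --parameters value. The sketch below joins the parameters into that comma-separated value and skips blank entries. The class and method names are hypothetical, omitting the key is assumed to be equivalent to not specifying a profile, and values are assumed to contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class ParametersFlagSketch {

  /** Collects the SequenceFile template parameters in the order used by the gcloud example. */
  static Map<String, String> sequenceFileParameters(
      String project,
      String instanceId,
      String tableId,
      String appProfileId,
      String destinationPath,
      String filenamePrefix) {
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("bigtableProject", project);
    parameters.put("bigtableInstanceId", instanceId);
    parameters.put("bigtableTableId", tableId);
    parameters.put("bigtableAppProfileId", appProfileId);
    parameters.put("destinationPath", destinationPath);
    parameters.put("filenamePrefix", filenamePrefix);
    return parameters;
  }

  /** Joins key=value pairs with commas, skipping null or empty values. */
  static String parametersFlag(Map<String, String> parameters) {
    return parameters.entrySet().stream()
        .filter(e -> e.getValue() != null && !e.getValue().isEmpty())
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    // An empty app profile ID is left out, so the default app profile would apply.
    Map<String, String> parameters =
        sequenceFileParameters(
            "my-project", "my-instance", "my-table", "", "gs://mybucket/somefolder", "output-");
    System.out.println("--parameters " + parametersFlag(parameters));
  }
}
```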

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "destinationPath": "DESTINATION_PATH",
       "filenamePrefix": "FILENAME_PREFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

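Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the export
DESTINATION_PATH: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the SequenceFile file name, for example output-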

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/cloud-bigtable-client repository on GitHub.

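Datastore to Cloud Storage Text [Deprecated]

This template is deprecated and will be removed in Q1 2022. Migrate to the Firestore to Cloud Storage Text template.

The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function that processes each entity as a JSON string. If you don't provide such a function, every line in the output files is a JSON-serialized entity.

Requirements for this pipeline:

Datastore must be set up in the project before you run the pipeline.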

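Template parameters

Parameter	Description
`datastoreReadGqlQuery`	A GQL query that specifies which entities to retrieve. For example, `SELECT * FROM MyKind`
`datastoreReadProjectId`	The Google Cloud project ID of the Datastore instance that you want to read data from.
`datastoreReadNamespace`	The namespace of the requested entities. To use the default namespace, leave this parameter blank.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`textWritePrefix`	The Cloud Storage path prefix that specifies where data is written. For example, `gs://mybucket/somefolder/`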

Running the Datastore to Cloud Storage Text template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Datastore to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\
datastoreReadProjectId=DATASTORE_PROJECT_ID,\
datastoreReadNamespace=DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

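Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
DATASTORE_PROJECT_ID: the ID of the Cloud project where the Datastore instance exists
DATASTORE_KIND: the kind of your Datastore entities
DATASTORE_NAMESPACE: the namespace of your Datastore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js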

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND",
       "datastoreReadProjectId": "DATASTORE_PROJECT_ID",
       "datastoreReadNamespace": "DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

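Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
DATASTORE_PROJECT_ID: the ID of the Cloud project where the Datastore instance exists
DATASTORE_KIND: the kind of your Datastore entities
DATASTORE_NAMESPACE: the namespace of your Datastore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js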

Template source code

Java


/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToText.DatastoreToTextOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemWriteOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/**
 * Dataflow template which copies Datastore Entities to a Text sink. Text is encoded using JSON
 * encoded entity in the v1/Entity rest format:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@Template(
    name = "Datastore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Datastore to Text Files on Cloud Storage [Deprecated]",
    description =
        "Batch pipeline. Reads Datastore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"firestoreReadNamespace", "firestoreReadGqlQuery", "firestoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Firestore (Datastore mode) to Text Files on Cloud Storage",
    description =
        "Batch pipeline. Reads Firestore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"datastoreReadNamespace", "datastoreReadGqlQuery", "datastoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToText {

  public static ValueProvider<String> selectProvidedInput(
      ValueProvider<String> datastoreInput, ValueProvider<String> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToTextOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          FilesystemWriteOptions {}

  /**
   * Runs a pipeline which reads in Entities from Datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and writes the JSON to TextIO sink.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToTextOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DatastoreToTextOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(TextIO.write().to(options.getTextWritePrefix()).withSuffix(".json"));

    pipeline.run();
  }
}

Firestore to Cloud Storage Text

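The Firestore to Cloud Storage Text template is a batch pipeline that reads Firestore entities and writes them to Cloud Storage as text files. You can provide a function that processes each entity as a JSON string. If you don't provide such a function, every line in the output files is a JSON-serialized entity.

Requirements for this pipeline:

Firestore must be set up in the project before you run the pipeline.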

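Template parameters

Parameter	Description
`firestoreReadGqlQuery`	A GQL query that specifies which entities to retrieve. For example, `SELECT * FROM MyKind`
`firestoreReadProjectId`	The Google Cloud project ID of the Firestore instance that you want to read data from.
`firestoreReadNamespace`	The namespace of the requested entities. To use the default namespace, leave this parameter blank.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`textWritePrefix`	The Cloud Storage path prefix that specifies where data is written. For example, `gs://mybucket/somefolder/`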

Running the Firestore to Cloud Storage Text template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Firestore to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM FIRESTORE_KIND",\
firestoreReadProjectId=FIRESTORE_PROJECT_ID,\
firestoreReadNamespace=FIRESTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

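Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
FIRESTORE_PROJECT_ID: the ID of the Cloud project where the Firestore instance exists
FIRESTORE_KIND: the kind of your Firestore entities
FIRESTORE_NAMESPACE: the namespace of your Firestore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js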

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "firestoreReadGqlQuery": "SELECT * FROM FIRESTORE_KIND",
       "firestoreReadProjectId": "FIRESTORE_PROJECT_ID",
       "firestoreReadNamespace": "FIRESTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

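Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use templates stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
FIRESTORE_PROJECT_ID: the ID of the Cloud project where the Firestore instance exists
FIRESTORE_KIND: the kind of your Firestore entities
FIRESTORE_NAMESPACE: the namespace of your Firestore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js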
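
The two JavaScript UDF parameters are optional, so a launcher has to decide when to send them. As a rough sketch, the following Java helper builds the parameter map for this template and adds the UDF keys only when both values are given. The class and method names are hypothetical. Rejecting a UDF path without a function name (or the reverse) is this sketch's own assumption, not documented template behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FirestoreToTextParametersSketch {

  private static boolean isSet(String value) {
    return value != null && !value.isEmpty();
  }

  /** Builds the template parameters; the UDF keys are added only when both values are set. */
  static Map<String, String> parameters(
      String kind,
      String projectId,
      String namespace,
      String udfGcsPath,
      String udfFunctionName,
      String textWritePrefix) {
    // Assumption of this sketch: a UDF file and its function name are only useful together.
    if (isSet(udfGcsPath) != isSet(udfFunctionName)) {
      throw new IllegalArgumentException("Set both UDF parameters or neither");
    }
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("firestoreReadGqlQuery", "SELECT * FROM " + kind);
    parameters.put("firestoreReadProjectId", projectId);
    // An empty namespace selects the default namespace, as described in the parameter table.
    parameters.put("firestoreReadNamespace", namespace);
    if (isSet(udfGcsPath)) {
      parameters.put("javascriptTextTransformGcsPath", udfGcsPath);
      parameters.put("javascriptTextTransformFunctionName", udfFunctionName);
    }
    parameters.put("textWritePrefix", textWritePrefix);
    return parameters;
  }

  public static void main(String[] args) {
    // Without a UDF, each output line would be a JSON-serialized entity.
    System.out.println(parameters("MyKind", "my-project", "", null, null, "gs://mybucket/output/"));
    // With a UDF, each entity's JSON string is passed through myTransform first.
    System.out.println(
        parameters(
            "MyKind",
            "my-project",
            "",
            "gs://my-bucket/my-udfs/my_file.js",
            "myTransform",
            "gs://mybucket/output/"));
  }
}
```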

Template source code

Java


This template is built from the same DatastoreToText class listed under the Datastore to Cloud Storage Text template; that class registers both the Datastore_to_GCS_Text and Firestore_to_GCS_Text templates.

Cloud Spanner to Cloud Storage Avro

The Cloud Spanner to Avro Files on Cloud Storage template is a batch pipeline that exports an entire Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket that you select. The folder contains:

A spanner-export.json file.
A TableName-manifest.json file for each table in the database that you exported.
One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set:

Albums-manifest.json
Albums.avro-00000-of-00002
Albums.avro-00001-of-00002
Singers-manifest.json
Singers.avro-00000-of-00003
Singers.avro-00001-of-00003
Singers.avro-00002-of-00003
spanner-export.json
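
The naming scheme in the listing above can be sketched in a few lines of Java. This is only an illustration of the file-name pattern: the class `SpannerExportLayout` and its method are hypothetical, and the shard counts are inputs here because the export job, not the caller, decides how many shards each table gets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that reproduces the object names shown in the listing above. */
final class SpannerExportLayout {

  /** Returns the manifest name followed by one Avro shard name per shard for a table. */
  static List<String> filesForTable(String table, int shardCount) {
    List<String> names = new ArrayList<>();
    names.add(table + "-manifest.json");
    for (int shard = 0; shard < shardCount; shard++) {
      // Shard index and shard total are zero-padded to five digits.
      names.add(String.format(Locale.ROOT, "%s.avro-%05d-of-%05d", table, shard, shardCount));
    }
    return names;
  }

  public static void main(String[] args) {
    List<String> all = new ArrayList<>();
    all.addAll(filesForTable("Albums", 2));
    all.addAll(filesForTable("Singers", 3));
    all.add("spanner-export.json");
    all.forEach(System.out::println);
  }
}
```

Running `main` prints the same eight names as the example, which can be handy when checking that an export folder is complete.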

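Requirements for this pipeline:

The Cloud Spanner database must exist.
The output Cloud Storage bucket must exist.
In addition to the IAM roles required to run Dataflow jobs, you also need the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database that you want to export.
`databaseId`	The database ID of the Cloud Spanner database that you want to export.
`outputDir`	The Cloud Storage path that you want to export the Avro files to. The export job creates a new directory under this path that contains the exported files.
`snapshotTime`	(Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in RFC 3339 UTC "Zulu" format. For example: `1990-12-31T23:59:60Z`. The timestamp must be in the past, and maximum timestamp staleness applies.
`tableNames`	(Optional) A comma-separated list of tables that specifies the subset of the Cloud Spanner database to export. The list must include all related tables (parent tables and tables referenced by foreign keys). If related tables are not listed explicitly, the shouldExportRelatedTables flag must be set for the export to succeed.
`shouldExportRelatedTables`	(Optional) A flag used together with the tableNames parameter to include all related tables that the export requires.
`spannerProjectId`	(Optional) The Google Cloud project ID of the Cloud Spanner database that you want to read data from.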
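
Several of these parameters carry a `regexes` constraint in the template source code shown later in this section. As a rough sketch, the same patterns can be used to check values before launching a job. The patterns are copied from that source; the class name `ExportParameterCheck` is hypothetical, and treating each pattern as a full-string match is an assumption about how the constraint is applied.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-flight check that reuses the patterns from the template's parameter metadata. */
final class ExportParameterCheck {
  static final Pattern INSTANCE_ID = Pattern.compile("[a-z][a-z0-9\\-]*[a-z0-9]");
  static final Pattern DATABASE_ID = Pattern.compile("[a-z][a-z0-9_\\-]*[a-z0-9]");
  static final Pattern TABLE_NAMES = Pattern.compile("^[a-zA-Z0-9_]+(,[a-zA-Z0-9_]+)*$");
  static final Pattern SNAPSHOT_TIME =
      Pattern.compile(
          "^([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):(([0-9]{2})(\\.[0-9]+)?)Z$");

  /** Returns true when the whole value matches the pattern (assumed full-match semantics). */
  static boolean isValid(Pattern pattern, String value) {
    return pattern.matcher(value).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValid(INSTANCE_ID, "test-instance"));
    System.out.println(isValid(TABLE_NAMES, "Singers,Albums"));
    System.out.println(isValid(TABLE_NAMES, "Singers, Albums")); // space after the comma
    System.out.println(isValid(SNAPSHOT_TIME, "1990-12-31T23:59:59Z"));
  }
}
```

Note that the `tableNames` pattern does not allow whitespace around the commas, and the `snapshotTime` pattern only checks the shape of the string, not whether the instant is in the past.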

Running the Cloud Spanner to Avro Files on Cloud Storage template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
For the job to appear on the Spanner Instances page of the Google Cloud console, the job name must use the following format:
```
cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME
```
Replace the following:
- SPANNER_INSTANCE_ID: the ID of your Spanner instance
- SPANNER_DATABASE_NAME: the name of your Spanner database
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Spanner to Avro Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.
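
If job names are generated by a script, the console-friendly format described in the steps above can be built mechanically. The following is a minimal sketch; the class and method names are hypothetical, and the example IDs are placeholders.

```java
/** Hypothetical helper for the job-name format that the Spanner console page looks for. */
final class SpannerExportJobName {
  private static final String PREFIX = "cloud-spanner-export-";

  /** Builds cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME. */
  static String forDatabase(String instanceId, String databaseName) {
    return PREFIX + instanceId + "-" + databaseName;
  }

  /** Returns true when the job name is exactly the expected name for this instance and database. */
  static boolean matches(String jobName, String instanceId, String databaseName) {
    return forDatabase(instanceId, databaseName).equals(jobName);
  }

  public static void main(String[] args) {
    System.out.println(forDatabase("test-instance", "example-db"));
  }
}
```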

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
outputDir=GCS_DIRECTORY

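Replace the following:

JOB_NAME: a unique job name of your choice
For the job to appear in the Cloud Spanner section of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
GCS_STAGING_LOCATION: the path for writing temporary files, for example gs://mybucket/temp
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to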
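
When the command is assembled by a program rather than typed by hand, the placeholders above map onto an argument list fairly directly. The sketch below only builds the arguments (it does not run gcloud). The class `GcloudRunCommand` is hypothetical, and the simple comma join assumes that no parameter value itself contains a comma.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical builder for the gcloud invocation shown above. */
final class GcloudRunCommand {

  /** Resolves VERSION ("latest" or a dated version name) to the template path in the bucket. */
  static String templatePath(String version, String templateName) {
    return "gs://dataflow-templates/" + version + "/" + templateName;
  }

  /** Joins template parameters as key=value pairs separated by commas. */
  static String joinParameters(Map<String, String> parameters) {
    return parameters.entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(","));
  }

  static List<String> build(
      String jobName,
      String version,
      String region,
      String stagingLocation,
      Map<String, String> parameters) {
    return List.of(
        "gcloud", "dataflow", "jobs", "run", jobName,
        "--gcs-location", templatePath(version, "Cloud_Spanner_to_GCS_Avro"),
        "--region", region,
        "--staging-location", stagingLocation,
        "--parameters", joinParameters(parameters));
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("instanceId", "test-instance");
    parameters.put("databaseId", "example-db");
    parameters.put("outputDir", "gs://mybucket/export");
    // Pins a dated version, as the note above suggests for production use.
    System.out.println(
        String.join(
            " ",
            build(
                "cloud-spanner-export-test-instance-example-db",
                "2021-09-20-00_RC00",
                "us-central1",
                "gs://mybucket/temp",
                parameters)));
  }
}
```

The resulting list could be handed to `ProcessBuilder`, which avoids shell quoting issues because each argument is passed separately.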

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "outputDir": "GCS_DIRECTORY"
   }
}

Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
For the job to appear in the Cloud Spanner section of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
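
As an illustration of how the request above might be assembled from Java 11 or later, the sketch below builds the launch URL, the JSON body, and an `HttpRequest` without sending it. The class `LaunchRequestSketch` is hypothetical. Passing an OAuth 2.0 access token as a bearer `Authorization` header is an assumption about how the caller authenticates, and the body is built with plain string formatting, so values are assumed not to need JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch that assembles the templates:launch request shown above. */
final class LaunchRequestSketch {

  static URI launchUri(String projectId, String location, String version) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
            + "/Cloud_Spanner_to_GCS_Avro");
  }

  /** Builds the request body; values are inserted as-is (no JSON escaping). */
  static String body(String jobName, String instanceId, String databaseId, String outputDir) {
    return String.format(
        "{\"jobName\":\"%s\",\"parameters\":{\"instanceId\":\"%s\",\"databaseId\":\"%s\",\"outputDir\":\"%s\"}}",
        jobName, instanceId, databaseId, outputDir);
  }

  /** Assumes the caller already holds an access token; obtaining one is out of scope here. */
  static HttpRequest build(String accessToken, URI uri, String body) {
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    URI uri = launchUri("my-project", "us-central1", "latest");
    String json = body("my-export-job", "test-instance", "example-db", "gs://mybucket/export");
    HttpRequest request = build("ACCESS_TOKEN", uri, json);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(json);
  }
}
```

Sending the request would be a matter of passing it to `java.net.http.HttpClient#send`; that step is left out so the sketch stays free of network access.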

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.spanner;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.spanner.ExportPipeline.ExportPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Dataflow template that exports a Cloud Spanner database to Avro files in GCS. */
@Template(
    name = "Cloud_Spanner_to_GCS_Avro",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Spanner to Avro Files on Cloud Storage",
    description =
        "A pipeline to export a Cloud Spanner database to a set of Avro files in Cloud Storage.",
    optionsClass = ExportPipelineOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class ExportPipeline {

  /** Options for Export pipeline. */
  public interface ExportPipelineOptions extends PipelineOptions {
    @TemplateParameter.Text(
        order = 1,
        regexes = {"[a-z][a-z0-9\\-]*[a-z0-9]"},
        description = "Cloud Spanner instance id",
        helpText = "The instance id of the Cloud Spanner database that you want to export.")
    ValueProvider<String> getInstanceId();

    void setInstanceId(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9_\\-]*[a-z0-9]"},
        description = "Cloud Spanner database id",
        helpText = "The database id of the Cloud Spanner database that you want to export.")
    ValueProvider<String> getDatabaseId();

    void setDatabaseId(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 3,
        description = "Cloud Storage output directory",
        helpText =
            "The Cloud Storage path where the Avro files should be exported to. A new directory will be created under this path that contains the export.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDir();

    void setOutputDir(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        optional = true,
        description = "Cloud Storage temp directory for storing Avro files",
        helpText =
            "The Cloud Storage path where the temporary Avro files can be created. Ex: gs://your-bucket/your-path")
    ValueProvider<String> getAvroTempDirectory();

    void setAvroTempDirectory(ValueProvider<String> value);

    @TemplateCreationParameter(value = "")
    @Description("Test dataflow job identifier for Beam Direct Runner")
    @Default.String(value = "")
    ValueProvider<String> getTestJobId();

    void setTestJobId(ValueProvider<String> jobId);

    @TemplateParameter.Text(
        order = 6,
        optional = true,
        description = "Cloud Spanner Endpoint to call",
        helpText = "The Cloud Spanner endpoint to call in the template. Only used for testing.",
        example = "https://batch-spanner.googleapis.com")
    @Default.String("https://batch-spanner.googleapis.com")
    ValueProvider<String> getSpannerHost();

    void setSpannerHost(ValueProvider<String> value);

    @TemplateCreationParameter(value = "false")
    @Description("If true, wait for job finish")
    @Default.Boolean(true)
    boolean getWaitUntilFinish();

    void setWaitUntilFinish(boolean value);

    @TemplateParameter.Text(
        order = 7,
        optional = true,
        regexes = {
          "^([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):(([0-9]{2})(\\.[0-9]+)?)Z$"
        },
        description = "Snapshot time",
        helpText =
            "Specifies the snapshot time as RFC 3339 format in UTC time without the timezone offset(always ends in 'Z'). Timestamp must be in the past and Maximum timestamp staleness applies. See https://cloud.google.com/spanner/docs/timestamp-bounds#maximum_timestamp_staleness",
        example = "1990-12-31T23:59:59Z")
    @Default.String(value = "")
    ValueProvider<String> getSnapshotTime();

    void setSnapshotTime(ValueProvider<String> value);

    @TemplateParameter.ProjectId(
        order = 8,
        optional = true,
        description = "Cloud Spanner Project Id",
        helpText = "The project id of the Cloud Spanner instance.")
    ValueProvider<String> getSpannerProjectId();

    void setSpannerProjectId(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 9,
        optional = true,
        description = "Export Timestamps as Timestamp-micros type",
        helpText =
            "If true, Timestamps are exported as timestamp-micros type. Timestamps are exported as ISO8601 strings at nanosecond precision by default.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getShouldExportTimestampAsLogicalType();

    void setShouldExportTimestampAsLogicalType(ValueProvider<Boolean> value);

    @TemplateParameter.Text(
        order = 10,
        optional = true,
        regexes = {"^[a-zA-Z0-9_]+(,[a-zA-Z0-9_]+)*$"},
        description = "Cloud Spanner table name(s).",
        helpText =
            "If provided, only this comma separated list of tables are exported. Ancestor tables and tables that are referenced via foreign keys are required. If not explicitly listed, the `shouldExportRelatedTables` flag must be set for a successful export.")
    @Default.String(value = "")
    ValueProvider<String> getTableNames();

    void setTableNames(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 11,
        optional = true,
        description = "Export necessary Related Spanner tables.",
        helpText =
            "Used in conjunction with `tableNames`. If true, add related tables necessary for the export, such as interleaved parent tables and foreign keys tables.  If `tableNames` is specified but doesn't include related tables, this option must be set to true for a successful export.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getShouldExportRelatedTables();

    void setShouldExportRelatedTables(ValueProvider<Boolean> value);

    @TemplateParameter.Enum(
        order = 12,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  /**
   * Runs a pipeline to export a Cloud Spanner database to Avro files.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {

    ExportPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ExportPipelineOptions.class);

    Pipeline p = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            // Temporary fix explicitly setting SpannerConfig.projectId to the default project
            // if spannerProjectId is not provided as a parameter. Required as of Beam 2.38,
            // which no longer accepts null label values on metrics, and SpannerIO#setup() has
            // a bug resulting in the label value being set to the original parameter value,
            // with no fallback to the default project.
            // TODO: remove NestedValueProvider when this is fixed in Beam.
            .withProjectId(
                NestedValueProvider.of(
                    options.getSpannerProjectId(),
                    (SerializableFunction<String, String>)
                        input -> input != null ? input : SpannerOptions.getDefaultProjectId()))
            .withHost(options.getSpannerHost())
            .withInstanceId(options.getInstanceId())
            .withDatabaseId(options.getDatabaseId())
            .withRpcPriority(options.getSpannerPriority());
    p.begin()
        .apply(
            "Run Export",
            new ExportTransform(
                spannerConfig,
                options.getOutputDir(),
                options.getTestJobId(),
                options.getSnapshotTime(),
                options.getTableNames(),
                options.getShouldExportRelatedTables(),
                options.getShouldExportTimestampAsLogicalType(),
                options.getAvroTempDirectory()));
    PipelineResult result = p.run();
    if (options.getWaitUntilFinish()
        &&
        /* Only if template location is null, there is a dataflow job to wait for. Else it's
         * template generation which doesn't start a dataflow job.
         */
        options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }
}

Cloud Spanner to Cloud Storage Text

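The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table and writes it to Cloud Storage as CSV text files.

Requirements for this pipeline:

The input Spanner table must exist before you run the pipeline.

Template parameters

Parameter	Description
`spannerProjectId`	The Google Cloud project ID of the Cloud Spanner database that you want to read data from.
`spannerDatabaseId`	The database ID of the requested table.
`spannerInstanceId`	The instance ID of the requested table.
`spannerTable`	The table to read data from.
`textWritePrefix`	The directory where the output text files are written. Add a trailing /. For example: `gs://mybucket/somefolder/`
`spannerSnapshotTime`	(Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in RFC 3339 UTC "Zulu" format. For example: `1990-12-31T23:59:60Z`. The timestamp must be in the past, and maximum timestamp staleness applies.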
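
Two of the constraints in this table are easy to get wrong: the trailing slash on `textWritePrefix` and the requirement that `spannerSnapshotTime` be a past instant in Zulu form. A minimal pre-flight check might look like the following. The class name is hypothetical, requiring a `gs://` prefix is an assumption based on the example value, and the maximum timestamp staleness limit is not checked because it depends on the database.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical pre-flight checks for two Spanner to Text parameters. */
final class SpannerToTextParameterCheck {

  /** Checks for a Cloud Storage path that ends with a slash. */
  static boolean isDirectoryPrefix(String textWritePrefix) {
    return textWritePrefix.startsWith("gs://") && textWritePrefix.endsWith("/");
  }

  /** Checks that the value is a parseable UTC "Zulu" timestamp strictly before the given instant. */
  static boolean isPastZuluTimestamp(String snapshotTime, Instant now) {
    if (!snapshotTime.endsWith("Z")) {
      return false; // numeric offsets such as +09:00 are rejected
    }
    try {
      return Instant.parse(snapshotTime).isBefore(now);
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isDirectoryPrefix("gs://mybucket/somefolder/"));
    System.out.println(isPastZuluTimestamp("2021-09-20T00:00:00Z", Instant.now()));
  }
}
```

Passing the reference instant as an argument, instead of calling `Instant.now()` inside the method, keeps the check deterministic in tests.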

Running the Cloud Spanner to Cloud Storage Text template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/

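Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database that you want to read data from
DATABASE_ID: the Spanner database ID
BUCKET_NAME: the name of your Cloud Storage bucket
INSTANCE_ID: the Spanner instance ID
TABLE_ID: the Spanner table ID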

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "SPANNER_PROJECT_ID",
       "spannerDatabaseId": "DATABASE_ID",
       "spannerInstanceId": "INSTANCE_ID",
       "spannerTable": "TABLE_ID",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

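Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database that you want to read data from
DATABASE_ID: the Spanner database ID
BUCKET_NAME: the name of your Cloud Storage bucket
INSTANCE_ID: the Spanner instance ID
TABLE_ID: the Spanner table ID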

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.util.ValueProviderUtils.eitherOrValueProvider;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.SpannerToText.SpannerToTextOptions;
import com.google.cloud.teleport.templates.common.SpannerConverters;
import com.google.cloud.teleport.templates.common.SpannerConverters.CreateTransactionFnWithTimestamp;
import com.google.cloud.teleport.templates.common.SpannerConverters.SpannerReadOptions;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemWriteOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.spanner.LocalSpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.ReadOperation;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.Transaction;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Dataflow template which copies a Spanner table to a Text sink. It exports a Spanner table using
 * <a href="https://cloud.google.com/spanner/docs/reads#read_data_in_parallel">Batch API</a>, which
 * creates multiple workers in parallel for better performance. The result is written to a CSV file
 * in Google Cloud Storage. The table schema file is saved in json format along with the exported
 * table.
 *
 * <p>Schema file sample: { "id":"INT64", "name":"STRING(MAX)" }
 *
 * <p>A sample run:
 *
 * <pre>
 * mvn compile exec:java \
 *   -Dexec.mainClass=com.google.cloud.teleport.templates.SpannerToText \
 *   -Dexec.args="--runner=DataflowRunner \
 *                --spannerProjectId=projectId \
 *                --gcpTempLocation=gs://gsTmpLocation \
 *                --spannerInstanceId=instanceId \
 *                --spannerDatabaseId=databaseId \
 *                --spannerTable=table_name \
 *                --spannerSnapshotTime=snapshot_time \
 *                --textWritePrefix=gcsOutputPath"
 * </pre>
 */
@Template(
    name = "Spanner_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Spanner to Text Files on Cloud Storage",
    description =
        "A pipeline which reads in Cloud Spanner table and writes it to Cloud Storage as CSV text files.",
    optionsClass = SpannerToTextOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class SpannerToText {

  private static final Logger LOG = LoggerFactory.getLogger(SpannerToText.class);

  /** Custom PipelineOptions. */
  public interface SpannerToTextOptions
      extends PipelineOptions, SpannerReadOptions, FilesystemWriteOptions {

    @TemplateParameter.GcsWriteFolder(
        order = 1,
        optional = true,
        description = "Cloud Storage temp directory for storing CSV files",
        helpText = "The Cloud Storage path where the temporary CSV files can be stored.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getCsvTempDirectory();

    @SuppressWarnings("unused")
    void setCsvTempDirectory(ValueProvider<String> value);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  /**
   * Runs a pipeline which reads in Records from Spanner, and writes the CSV to TextIO sink.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    LOG.info("Starting pipeline setup");
    PipelineOptionsFactory.register(SpannerToTextOptions.class);
    SpannerToTextOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(SpannerToTextOptions.class);

    FileSystems.setDefaultPipelineOptions(options);
    Pipeline pipeline = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            .withHost(options.getSpannerHost())
            .withProjectId(options.getSpannerProjectId())
            .withInstanceId(options.getSpannerInstanceId())
            .withDatabaseId(options.getSpannerDatabaseId())
            .withRpcPriority(options.getSpannerPriority());

    PTransform<PBegin, PCollection<ReadOperation>> spannerExport =
        SpannerConverters.ExportTransformFactory.create(
            options.getSpannerTable(),
            spannerConfig,
            options.getTextWritePrefix(),
            options.getSpannerSnapshotTime());

    /* CreateTransaction and CreateTransactionFn classes in LocalSpannerIO
     * only take a timestamp object for exact staleness which works when
     * parameters are provided during template compile time. They do not work with
     * a Timestamp valueProvider which can take parameters at runtime. Hence a new
     * ParDo class CreateTransactionFnWithTimestamp had to be created for this
     * purpose.
     */
    PCollectionView<Transaction> tx =
        pipeline
            .apply("Setup for Transaction", Create.of(1))
            .apply(
                "Create transaction",
                ParDo.of(
                    new CreateTransactionFnWithTimestamp(
                        spannerConfig, options.getSpannerSnapshotTime())))
            .apply("As PCollectionView", View.asSingleton());

    PCollection<String> csv =
        pipeline
            .apply("Create export", spannerExport)
            // We need to use LocalSpannerIO.readAll() instead of LocalSpannerIO.read()
            // because ValueProvider parameters such as table name required for
            // LocalSpannerIO.read() can be read only inside DoFn but LocalSpannerIO.read() is of
            // type PTransform<PBegin, Struct>, which prevents prepending it with DoFn that reads
            // these parameters at the pipeline execution time.
            .apply(
                "Read all records",
                LocalSpannerIO.readAll().withTransaction(tx).withSpannerConfig(spannerConfig))
            .apply(
                "Struct To Csv",
                MapElements.into(TypeDescriptors.strings())
                    .via(struct -> (new SpannerConverters.StructCsvPrinter()).print(struct)));

    ValueProvider<ResourceId> tempDirectoryResource =
        ValueProvider.NestedValueProvider.of(
            eitherOrValueProvider(options.getCsvTempDirectory(), options.getTextWritePrefix()),
            (SerializableFunction<String, ResourceId>) s -> FileSystems.matchNewResource(s, true));

    csv.apply(
        "Write to storage",
        TextIO.write()
            .to(options.getTextWritePrefix())
            .withSuffix(".csv")
            .withTempDirectory(tempDirectoryResource));

    pipeline.run();
    LOG.info("Completed pipeline setup");
  }
}

Cloud Storage Avro to Bigtable

The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist and have the same column families as were exported in the Avro files.
The input Avro files must exist in a Cloud Storage bucket before you run the pipeline.
Bigtable expects a specific schema from the input Avro files.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`inputFilePattern`	The Cloud Storage path pattern where the data is located. For example: `gs://mybucket/somefolder/prefix*`
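
To get a feel for which objects a pattern such as `gs://mybucket/somefolder/prefix*` selects, the wildcard can be approximated with a regular expression. This sketch is not the matching code that the template uses. It assumes that `*` matches any run of characters within a single path segment (that is, it does not cross a `/`), and it supports no other wildcard syntax. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical approximation of single-segment '*' matching for inputFilePattern. */
final class FilePatternSketch {

  /** Converts a pattern containing '*' into a regex; all other characters are literal. */
  static Pattern toRegex(String filePattern) {
    String regex =
        Arrays.stream(filePattern.split("\\*", -1))
            .map(Pattern::quote)
            .collect(Collectors.joining("[^/]*")); // assumed: '*' stops at '/'
    return Pattern.compile(regex);
  }

  static boolean matches(String filePattern, String objectPath) {
    return toRegex(filePattern).matcher(objectPath).matches();
  }

  public static void main(String[] args) {
    String pattern = "gs://mybucket/somefolder/prefix*";
    System.out.println(matches(pattern, "gs://mybucket/somefolder/prefix-00000-of-00002.avro"));
    System.out.println(matches(pattern, "gs://mybucket/somefolder/prefix/nested/file.avro"));
  }
}
```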

Running the Cloud Storage Avro to Bigtable template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

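Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*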

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

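Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*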

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.SetCell;
import com.google.cloud.teleport.bigtable.AvroToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.common.base.MoreObjects;
import com.google.common.collect.ImmutableList;
import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Dataflow pipeline that imports data from Avro files in GCS to a Cloud Bigtable table. The Cloud
 * Bigtable table must be created before running the pipeline and must have a compatible table
 * schema. For example, if {@link BigtableCell} from the Avro files has a 'family' of "f1", the
 * Bigtable table should have a column family of "f1".
 */
@Template(
    name = "GCS_Avro_to_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Avro Files on Cloud Storage to Cloud Bigtable",
    description =
        "A pipeline which reads data from Avro files in Cloud Storage and writes it to Cloud Bigtable table.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public final class AvroToBigtable {
  private static final Logger LOG = LoggerFactory.getLogger(AvroToBigtable.class);

  /** Maximum number of mutations allowed per row by Cloud bigtable. */
  private static final int MAX_MUTATIONS_PER_ROW = 100000;

  private static final Boolean DEFAULT_SPLIT_LARGE_ROWS = false;

  /** Options for the import pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 4,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to write")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsReadFile(
        order = 5,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.avro")
    ValueProvider<String> getInputFilePattern();

    @SuppressWarnings("unused")
    void setInputFilePattern(ValueProvider<String> inputFilePattern);

    @TemplateParameter.Boolean(
        order = 6,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to import Avro files in GCS to a Cloud Bigtable table.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Write write =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    pipeline
        .apply("Read from Avro", AvroIO.read(BigtableRow.class).from(options.getInputFilePattern()))
        .apply(
            "Transform to Bigtable",
            ParDo.of(
                AvroToBigtableFn.createWithSplitLargeRows(
                    options.getSplitLargeRows(), MAX_MUTATIONS_PER_ROW)))
        .apply("Write to Bigtable", write);

    return pipeline.run();
  }

  /**
   * Translates {@link BigtableRow} to {@link Mutation}s along with a row key. The mutations are
   * {@link SetCell}s that set the value for specified cells with family name, column qualifier and
   * timestamp.
   */
  static class AvroToBigtableFn extends DoFn<BigtableRow, KV<ByteString, Iterable<Mutation>>> {
    private final ValueProvider<Boolean> splitLargeRowsFlag;
    private Boolean splitLargeRows;
    private final int maxMutationsPerRow;

    public static AvroToBigtableFn create() {
      return new AvroToBigtableFn(StaticValueProvider.of(false), MAX_MUTATIONS_PER_ROW);
    }

    public static AvroToBigtableFn createWithSplitLargeRows(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      return new AvroToBigtableFn(splitLargeRowsFlag, maxMutationsPerRequest);
    }

    private AvroToBigtableFn(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      this.splitLargeRowsFlag = splitLargeRowsFlag;
      this.maxMutationsPerRow = maxMutationsPerRequest;
    }

    @Setup
    public void setup() {
      if (splitLargeRowsFlag != null) {
        splitLargeRows = splitLargeRowsFlag.get();
      }
      splitLargeRows = MoreObjects.firstNonNull(splitLargeRows, DEFAULT_SPLIT_LARGE_ROWS);
      LOG.info("splitLargeRows set to: " + splitLargeRows);
    }

    @ProcessElement
    public void processElement(
        @Element BigtableRow row, OutputReceiver<KV<ByteString, Iterable<Mutation>>> out) {
      ByteString key = toByteString(row.getKey());
      // BulkMutation doesn't split rows. Currently, if a single row contains more than 100,000
      // mutations, the service will fail the request.
      ImmutableList.Builder<Mutation> mutations = ImmutableList.builder();
      int cellsProcessed = 0;
      for (BigtableCell cell : row.getCells()) {
        SetCell setCell =
            SetCell.newBuilder()
                .setFamilyName(cell.getFamily().toString())
                .setColumnQualifier(toByteString(cell.getQualifier()))
                .setTimestampMicros(cell.getTimestamp())
                .setValue(toByteString(cell.getValue()))
                .build();

        mutations.add(Mutation.newBuilder().setSetCell(setCell).build());
        cellsProcessed++;

        if (this.splitLargeRows && cellsProcessed % maxMutationsPerRow == 0) {
          // Send a MutateRow request when we have accumulated max mutations per row.
          out.output(KV.of(key, mutations.build()));
          mutations = ImmutableList.builder();
        }
      }

      // Flush any remaining mutations.
      ImmutableList<Mutation> remainingMutations = mutations.build();
      if (!remainingMutations.isEmpty()) {
        out.output(KV.of(key, remainingMutations));
      }
    }
  }

  /** Copies the content in {@code byteBuffer} into a {@link ByteString}. */
  protected static ByteString toByteString(ByteBuffer byteBuffer) {
    return ByteString.copyFrom(byteBuffer.array());
  }
}
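
The batching rule in `processElement` above can be illustrated without Beam or Bigtable dependencies. The following is a minimal sketch under that simplification: `RowSplitSketch` and `split` are hypothetical names, and plain lists stand in for the `Mutation` batches that the real `DoFn` emits.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the chunking rule used when splitLargeRows is enabled. */
final class RowSplitSketch {

  /**
   * Groups the cells of one row into batches. When splitLargeRows is true, a batch is
   * emitted every maxMutationsPerRow cells; any remainder is flushed at the end.
   * When it is false, all cells end up in a single batch.
   */
  static <T> List<List<T>> split(List<T> cells, boolean splitLargeRows, int maxMutationsPerRow) {
    List<List<T>> batches = new ArrayList<>();
    List<T> current = new ArrayList<>();
    int cellsProcessed = 0;
    for (T cell : cells) {
      current.add(cell);
      cellsProcessed++;
      if (splitLargeRows && cellsProcessed % maxMutationsPerRow == 0) {
        batches.add(current);
        current = new ArrayList<>();
      }
    }
    // Flush any remaining cells, mirroring the final out.output call in the template.
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Integer> cells = new ArrayList<>();
    for (int i = 0; i < 250; i++) {
      cells.add(i);
    }
    // Prints 100, 100, 50 on separate lines.
    for (List<Integer> batch : split(cells, true, 100)) {
      System.out.println(batch.size());
    }
  }
}
```

Each batch corresponds to a separate write for the same row key, which is why the template's help text warns that updates to a split row are not atomic.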

Cloud Storage Avro to Cloud Spanner

The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files that were exported from Cloud Spanner and stored in Cloud Storage, and imports them into a Cloud Spanner database.

Requirements for this pipeline:

The target Cloud Spanner database must exist and must be empty.
You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
The input Cloud Storage path must exist, and it must contain a spanner-export.json file that holds a JSON description of the files to import.

Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database.
`databaseId`	The database ID of the Cloud Spanner database.
`inputDir`	The Cloud Storage path that the Avro files are imported from.
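
The template source for this pipeline declares validation regexes for `instanceId` (`^[a-z0-9\-]+$`) and `databaseId` (`^[a-z_0-9\-]+$`). The following is a minimal sketch of checking values locally before launching a job. `SpannerParamCheck` is a hypothetical helper, and treating the regexes as full-string matches is an assumption about how they are applied.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-flight check that mirrors the regexes declared in ImportPipeline.Options. */
final class SpannerParamCheck {
  // Patterns copied from the regexes attributes in the template source.
  private static final Pattern INSTANCE_ID = Pattern.compile("^[a-z0-9\\-]+$");
  private static final Pattern DATABASE_ID = Pattern.compile("^[a-z_0-9\\-]+$");

  /** Returns true when the value contains only lowercase letters, digits, and hyphens. */
  static boolean isValidInstanceId(String value) {
    return INSTANCE_ID.matcher(value).matches();
  }

  /** Returns true when the value contains only lowercase letters, digits, underscores, and hyphens. */
  static boolean isValidDatabaseId(String value) {
    return DATABASE_ID.matcher(value).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValidInstanceId("my-instance"));
    System.out.println(isValidInstanceId("my_instance"));
    System.out.println(isValidDatabaseId("my_db-1"));
  }
}
```

A passing check only means the value has the expected shape; it does not confirm that the instance or database exists.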

Running the Cloud Storage Avro to Cloud Spanner template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
For the job to appear on the Spanner Instances page of the Google Cloud console, the job name must match the following format:
```
cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME
```
Replace the following:
- SPANNER_INSTANCE_ID: The ID of your Spanner instance
- SPANNER_DATABASE_NAME: The name of your Spanner database
Optional: For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Avro Files on Cloud Storage to Cloud Spanner template.
In the provided parameter fields, enter your parameter values.
Click [Run job].
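
The job name format described in the steps above can be generated rather than typed by hand. The following is a minimal sketch: `SpannerImportJobName` is a hypothetical helper, and it only concatenates the documented prefix with the two identifiers without validating them.

```java
/** Hypothetical helper that builds a job name in the cloud-spanner-import-... format. */
final class SpannerImportJobName {

  /** Returns cloud-spanner-import-INSTANCE_ID-DATABASE_NAME for the given identifiers. */
  static String of(String spannerInstanceId, String spannerDatabaseName) {
    return "cloud-spanner-import-" + spannerInstanceId + "-" + spannerDatabaseName;
  }

  public static void main(String[] args) {
    // Prints cloud-spanner-import-my-instance-my-db
    System.out.println(of("my-instance", "my-db"));
  }
}
```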

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
inputDir=GCS_DIRECTORY

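Replace the following:

JOB_NAME: A unique job name of your choice
VERSION: The version of the template that you want to use
You can use the following values:
- latest: Uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: Uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates can contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: The regional endpoint where you want to deploy your Dataflow job, for example us-central1
GCS_STAGING_LOCATION: The Cloud Storage path for writing temporary files, for example gs://mybucket/temp
INSTANCE_ID: The ID of the Spanner instance that contains the database
DATABASE_ID: The ID of the Spanner database to import into
GCS_DIRECTORY: The Cloud Storage path that the Avro files are imported from, for example gs://mybucket/somefolder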

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "inputDir": "GCS_DIRECTORY"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

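Replace the following:

PROJECT_ID: The ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: A unique job name of your choice
VERSION: The version of the template that you want to use
You can use the following values:
- latest: Uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: Uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates can contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: The regional endpoint where you want to deploy your Dataflow job, for example us-central1
INSTANCE_ID: The ID of the Spanner instance that contains the database
DATABASE_ID: The ID of the Spanner database to import into
GCS_DIRECTORY: The Cloud Storage path that the Avro files are imported from, for example gs://mybucket/somefolder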

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.spanner;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.spanner.ImportPipeline.Options;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Avro to Cloud Spanner Import pipeline. */
@Template(
    name = "GCS_Avro_to_Cloud_Spanner",
    category = TemplateCategory.BATCH,
    displayName = "Avro Files on Cloud Storage to Cloud Spanner",
    description =
        "A pipeline to import a Cloud Spanner database from a set of Avro files in Cloud Storage.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class ImportPipeline {

  /** Options for {@link ImportPipeline}. */
  public interface Options extends PipelineOptions {

    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[a-z0-9\\-]+$"},
        description = "Cloud Spanner instance id",
        helpText = "The instance id of the Cloud Spanner database that you want to import to.")
    ValueProvider<String> getInstanceId();

    void setInstanceId(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"^[a-z_0-9\\-]+$"},
        description = "Cloud Spanner database id",
        helpText =
            "The database id of the Cloud Spanner database that you want to import into (must already exist).")
    ValueProvider<String> getDatabaseId();

    void setDatabaseId(ValueProvider<String> value);

    @TemplateParameter.GcsReadFolder(
        order = 3,
        description = "Cloud storage input directory",
        helpText = "The Cloud Storage path where the Avro files should be imported from.")
    ValueProvider<String> getInputDir();

    void setInputDir(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "Cloud Spanner Endpoint to call",
        helpText = "The Cloud Spanner endpoint to call in the template. Only used for testing.",
        example = "https://batch-spanner.googleapis.com")
    @Default.String("https://batch-spanner.googleapis.com")
    ValueProvider<String> getSpannerHost();

    void setSpannerHost(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 5,
        optional = true,
        description = "Wait for Indexes",
        helpText =
            "By default the import pipeline is not blocked on index creation, and it "
                + "may complete with indexes still being created in the background. In testing, it may "
                + "be useful to set this option to true so that the pipeline waits until indexes are "
                + "finished.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getWaitForIndexes();

    void setWaitForIndexes(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 6,
        optional = true,
        description = "Wait for Foreign Keys",
        helpText =
            "By default the import pipeline is not blocked on foreign key creation, and it may complete"
                + " with foreign keys still being created in the background. In testing, it may be"
                + " useful to set this option to true so that the pipeline waits until foreign keys"
                + " are finished.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getWaitForForeignKeys();

    void setWaitForForeignKeys(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 7,
        optional = true,
        description = "Wait for Change Streams",
        helpText =
            "By default the import pipeline is blocked on change stream creation. If false, it may"
                + " complete with change streams still being created in the background.")
    @Default.Boolean(true)
    ValueProvider<Boolean> getWaitForChangeStreams();

    void setWaitForChangeStreams(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 8,
        optional = true,
        description = "Create Indexes early",
        helpText =
            "Flag to turn off early index creation if there are many indexes. Indexes and Foreign keys are created after dataload. If there are more than "
                + "40 DDL statements to be executed after dataload, it is preferable to create the "
                + "indexes before dataload. This is the flag to turn the feature off.")
    @Default.Boolean(true)
    ValueProvider<Boolean> getEarlyIndexCreateFlag();

    void setEarlyIndexCreateFlag(ValueProvider<Boolean> value);

    @TemplateCreationParameter(value = "false")
    @Description("If true, wait for job finish")
    @Default.Boolean(true)
    boolean getWaitUntilFinish();

    @TemplateParameter.ProjectId(
        order = 9,
        optional = true,
        description = "Cloud Spanner Project Id",
        helpText = "The project id of the Cloud Spanner instance.")
    ValueProvider<String> getSpannerProjectId();

    void setSpannerProjectId(ValueProvider<String> value);

    void setWaitUntilFinish(boolean value);

    @TemplateParameter.Text(
        order = 10,
        optional = true,
        regexes = {"[0-9]+"},
        description = "DDL Creation timeout in minutes",
        helpText = "DDL Creation timeout in minutes.")
    @Default.Integer(30)
    ValueProvider<Integer> getDDLCreationTimeoutInMinutes();

    void setDDLCreationTimeoutInMinutes(ValueProvider<Integer> value);

    @TemplateParameter.Enum(
        order = 11,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    Pipeline p = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            // Temporary fix explicitly setting SpannerConfig.projectId to the default project
            // if spannerProjectId is not provided as a parameter. Required as of Beam 2.38,
            // which no longer accepts null label values on metrics, and SpannerIO#setup() has
            // a bug resulting in the label value being set to the original parameter value,
            // with no fallback to the default project.
            // TODO: remove NestedValueProvider when this is fixed in Beam.
            .withProjectId(
                NestedValueProvider.of(
                    options.getSpannerProjectId(),
                    (SerializableFunction<String, String>)
                        input -> input != null ? input : SpannerOptions.getDefaultProjectId()))
            .withHost(options.getSpannerHost())
            .withInstanceId(options.getInstanceId())
            .withDatabaseId(options.getDatabaseId())
            .withRpcPriority(options.getSpannerPriority());

    p.apply(
        new ImportTransform(
            spannerConfig,
            options.getInputDir(),
            options.getWaitForIndexes(),
            options.getWaitForForeignKeys(),
            options.getWaitForChangeStreams(),
            options.getEarlyIndexCreateFlag(),
            options.getDDLCreationTimeoutInMinutes()));

    PipelineResult result = p.run();

    if (options.getWaitUntilFinish()
        &&
        /* Only if template location is null, there is a dataflow job to wait for. Else it's
         * template generation which doesn't start a dataflow job.
         */
        options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }
}

Cloud Storage Parquet to Bigtable

The Cloud Storage Parquet to Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist and have the same column families as were exported in the Parquet files.
The input Parquet files must exist in a Cloud Storage bucket before you run the pipeline.
Bigtable expects a specific schema from the input Parquet files.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`inputFilePattern`	The Cloud Storage path pattern where the data is located, for example `gs://mybucket/somefolder/prefix*`.

Running the Cloud Storage Parquet file to Bigtable template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
Optional: For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Parquet Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

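Replace the following:

JOB_NAME: A unique job name of your choice
VERSION: The version of the template that you want to use
You can use the following values:
- latest: Uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: Uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates can contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: The regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: The ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: The ID of the Bigtable instance that contains the table
TABLE_ID: The ID of the Bigtable table to import into
INPUT_FILE_PATTERN: The Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*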
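
When the gcloud command above is launched from a program, the argument list can be built as separate tokens instead of one shell string. The following is a minimal sketch: `GcloudRunArgs` is a hypothetical helper, it only assembles the tokens for the GCS_Parquet_to_Cloud_Bigtable template and does not run gcloud, and it assumes that parameter values contain no commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that assembles the tokens of a gcloud dataflow jobs run command. */
final class GcloudRunArgs {

  /** Builds the argument list; parameters are joined as key=value pairs separated by commas. */
  static List<String> build(
      String jobName, String version, String region, Map<String, String> parameters) {
    StringJoiner joined = new StringJoiner(",");
    for (Map.Entry<String, String> e : parameters.entrySet()) {
      joined.add(e.getKey() + "=" + e.getValue());
    }
    return new ArrayList<>(
        Arrays.asList(
            "gcloud", "dataflow", "jobs", "run", jobName,
            "--gcs-location",
            "gs://dataflow-templates/" + version + "/GCS_Parquet_to_Cloud_Bigtable",
            "--region", region,
            "--parameters", joined.toString()));
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("bigtableProjectId", "my-project");
    parameters.put("bigtableInstanceId", "my-instance");
    parameters.put("bigtableTableId", "my-table");
    parameters.put("inputFilePattern", "gs://mybucket/somefolder/prefix*");
    System.out.println(String.join(" ", build("my-job", "latest", "us-central1", parameters)));
  }
}
```

Passing the resulting list to `java.lang.ProcessBuilder` avoids shell quoting issues, such as the shell expanding the `*` in the file pattern.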

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

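Replace the following:

PROJECT_ID: The ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: A unique job name of your choice
VERSION: The version of the template that you want to use
You can use the following values:
- latest: Uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: Uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates can contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: The regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: The ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: The ID of the Bigtable instance that contains the table
TABLE_ID: The ID of the Bigtable table to import into
INPUT_FILE_PATTERN: The Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*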

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import static com.google.cloud.teleport.bigtable.AvroToBigtable.toByteString;

import com.google.bigtable.v2.Mutation;
import com.google.cloud.teleport.bigtable.ParquetToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link ParquetToBigtable} pipeline imports data from Parquet files in GCS to a Cloud Bigtable
 * table. The Cloud Bigtable table must be created before running the pipeline and must have a
 * compatible table schema. For example, if {@link BigtableCell} from the Parquet files has a
 * 'family' of "f1", the Bigtable table should have a column family of "f1".
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>Bigtable instance.
 *   <li>Bigtable table with compatible table schema.
 *   <li>Google Cloud Storage input bucket and parquet file(s) exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 *
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_FOLDER=gs://${PROJECT_ID}/dataflow/pipelines/parquet-to-bigtable
 * BIGTABLE_INSTANCE_ID=BIGTABLE INSTANCE ID HERE
 * BIGTABLE_TABLE_ID=BIGTABLE TABLE ID HERE
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.bigtable.ParquetToBigtable \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=parquet-to-bigtable-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "bigtableProjectId=${PROJECT_ID},\
 * bigtableInstanceId=${BIGTABLE_INSTANCE_ID},\
 * bigtableTableId=${BIGTABLE_TABLE_ID},\
 * inputFilePattern=${PIPELINE_FOLDER}/path/to/file/filename-*.parquet"
 * </pre>
 */
@Template(
    name = "GCS_Parquet_to_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Parquet Files on Cloud Storage to Cloud Bigtable",
    description =
        "A pipeline which reads data from Parquet files in Cloud Storage and writes it to Cloud Bigtable table.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class ParquetToBigtable {
  private static final Logger LOG = LoggerFactory.getLogger(ParquetToBigtable.class);

  /** Maximum number of mutations allowed per row by Cloud bigtable. */
  private static final int MAX_MUTATIONS_PER_ROW = 100000;

  private static final Boolean DEFAULT_SPLIT_LARGE_ROWS = false;

  /** Options for the import pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to write")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsReadFile(
        order = 4,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.parquet")
    ValueProvider<String> getInputFilePattern();

    @SuppressWarnings("unused")
    void setInputFilePattern(ValueProvider<String> inputFilePattern);

    @TemplateParameter.Boolean(
        order = 5,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to import Parquet files in GCS to a Cloud Bigtable table.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Write write =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    /**
     * Steps: 1) Read records from Parquet File. 2) Convert a GenericRecord to a
     * KV<ByteString,Iterable<Mutation>>. 3) Write KV to Bigtable's table.
     */
    pipeline
        .apply(
            "Read from Parquet",
            ParquetIO.read(BigtableRow.getClassSchema()).from(options.getInputFilePattern()))
        .apply(
            "Transform to Bigtable",
            ParDo.of(
                ParquetToBigtableFn.createWithSplitLargeRows(
                    options.getSplitLargeRows(), MAX_MUTATIONS_PER_ROW)))
        .apply("Write to Bigtable", write);

    return pipeline.run();
  }

  static class ParquetToBigtableFn extends DoFn<GenericRecord, KV<ByteString, Iterable<Mutation>>> {

    private final ValueProvider<Boolean> splitLargeRowsFlag;
    private Boolean splitLargeRows;
    private final int maxMutationsPerRow;

    public static ParquetToBigtableFn create() {
      return new ParquetToBigtableFn(StaticValueProvider.of(false), MAX_MUTATIONS_PER_ROW);
    }

    public static ParquetToBigtableFn createWithSplitLargeRows(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      return new ParquetToBigtableFn(splitLargeRowsFlag, maxMutationsPerRequest);
    }

    @Setup
    public void setup() {
      if (splitLargeRowsFlag != null) {
        splitLargeRows = splitLargeRowsFlag.get();
      }
      splitLargeRows = MoreObjects.firstNonNull(splitLargeRows, DEFAULT_SPLIT_LARGE_ROWS);
      LOG.info("splitLargeRows set to: " + splitLargeRows);
    }

    private ParquetToBigtableFn(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      this.splitLargeRowsFlag = splitLargeRowsFlag;
      this.maxMutationsPerRow = maxMutationsPerRequest;
    }

    @ProcessElement
    public void processElement(ProcessContext ctx) {
      Class runner = ctx.getPipelineOptions().getRunner();
      ByteString key = toByteString((ByteBuffer) ctx.element().get(0));

      // BulkMutation doesn't split rows. Currently, if a single row contains more than 100,000
      // mutations, the service will fail the request.
      ImmutableList.Builder<Mutation> mutations = ImmutableList.builder();
      List<Object> cells = (List) ctx.element().get(1);
      int cellsProcessed = 0;
      for (Object element : cells) {
        Mutation.SetCell setCell = null;
        if (runner.isAssignableFrom(DirectRunner.class)) {
          setCell =
              Mutation.SetCell.newBuilder()
                  .setFamilyName(((GenericData.Record) element).get(0).toString())
                  .setColumnQualifier(
                      toByteString((ByteBuffer) ((GenericData.Record) element).get(1)))
                  .setTimestampMicros((Long) ((GenericData.Record) element).get(2))
                  .setValue(toByteString((ByteBuffer) ((GenericData.Record) element).get(3)))
                  .build();
        } else {
          BigtableCell bigtableCell = (BigtableCell) element;
          setCell =
              Mutation.SetCell.newBuilder()
                  .setFamilyName(bigtableCell.getFamily().toString())
                  .setColumnQualifier(toByteString(bigtableCell.getQualifier()))
                  .setTimestampMicros(bigtableCell.getTimestamp())
                  .setValue(toByteString(bigtableCell.getValue()))
                  .build();
        }
        mutations.add(Mutation.newBuilder().setSetCell(setCell).build());
        cellsProcessed++;

        if (this.splitLargeRows && cellsProcessed % maxMutationsPerRow == 0) {
          // Send a MutateRow request when we have accumulated max mutations per row.
          ctx.output(KV.of(key, mutations.build()));
          mutations = ImmutableList.builder();
        }
      }

      // Flush any remaining mutations.
      ImmutableList remainingMutations = mutations.build();
      if (!remainingMutations.isEmpty()) {
        ctx.output(KV.of(key, remainingMutations));
      }
    }
  }
}

Cloud Storage SequenceFile to Bigtable

The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist.
The input SequenceFiles must exist in a Cloud Storage bucket before you run the pipeline.
The input SequenceFiles must have been exported from Bigtable or HBase.

Template parameters

Parameter	Description
`bigtableProject`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`bigtableAppProfileId`	The ID of the Bigtable application profile to use for the import. If you don't specify an app profile, Bigtable uses the instance's default app profile.
`sourcePattern`	The Cloud Storage path pattern where the data is located. For example, `gs://mybucket/somefolder/prefix*`.

Running the Cloud Storage SequenceFile to Bigtable template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
sourcePattern=SOURCE_PATTERN

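Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the import
SOURCE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*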

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "sourcePattern": "SOURCE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

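Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the import
SOURCE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*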
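
The launch request above can also be assembled programmatically, which avoids hand-editing the JSON body. The following is a minimal sketch, not an official client: the `buildLaunchRequest` helper and all sample values are hypothetical, it only builds the URL and body shown above, and it leaves out the optional `environment` block as well as authentication and the HTTP call itself.

```javascript
// Hypothetical helper: assembles the URL and JSON body of the
// templates:launch request. It does not send the request.
function buildLaunchRequest(projectId, location, version, jobName, parameters) {
  var gcsPath = 'gs://dataflow-templates/' + version +
      '/GCS_SequenceFile_to_Cloud_Bigtable';
  var url = 'https://dataflow.googleapis.com/v1b3/projects/' +
      encodeURIComponent(projectId) + '/locations/' +
      encodeURIComponent(location) + '/templates:launch?gcsPath=' + gcsPath;
  return {
    url: url,
    // JSON.stringify produces strictly valid JSON for the request body.
    body: JSON.stringify({ jobName: jobName, parameters: parameters })
  };
}

// Sample values below are placeholders for illustration only.
var request = buildLaunchRequest('my-project', 'us-central1', 'latest', 'import-job', {
  bigtableProject: 'my-project',
  bigtableInstanceId: 'my-instance',
  bigtableTableId: 'my-table',
  bigtableAppProfileId: 'my-app-profile',
  sourcePattern: 'gs://mybucket/somefolder/prefix*'
});
console.log(request.url);
console.log(request.body);
```

For production use, passing a dated version name instead of `latest` follows the note on breaking changes above.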

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/cloud-bigtable-client repository on GitHub.

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery pipeline is a batch pipeline that reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF) that you provide, and appends the result to a BigQuery table.

Note: To overwrite the data in the BigQuery table instead of appending to it, change WriteDisposition in the template source code from WRITE_APPEND to WRITE_TRUNCATE.

Requirements for this pipeline:

Create a JSON file that describes your BigQuery schema.
Ensure that there is a top-level JSON array titled BigQuery Schema and that its entries follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

The Cloud Storage Text to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

The following JSON shows an example BigQuery schema:
```
{
  "BigQuery Schema": [
    {
      "name": "location",
      "type": "STRING"
    },
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "STRING"
    },
    {
      "name": "color",
      "type": "STRING"
    },
    {
      "name": "coffee",
      "type": "STRING"
    }
  ]
}
```
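Create a JavaScript (.js) file that contains the UDF that supplies the logic to transform the lines of text. The function must return a JSON string.
For example, the following function splits each line of a CSV file, assigns the values to named fields, and returns the result as a JSON string.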
```
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}
```
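
Before you upload the UDF to Cloud Storage, it can help to run it locally against a few sample lines. The following is a minimal sketch, assuming Node.js is available on your machine. The `runLocally` helper and the sample line are hypothetical and not part of the template; the `transform` function is repeated from the example above so that the snippet runs on its own.

```javascript
// The UDF from the example above, repeated so that this file is self-contained.
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];

  return JSON.stringify(obj);
}

// Hypothetical helper: applies the UDF to each sample line and parses the
// returned string, so a result that is not valid JSON fails here rather
// than in the pipeline.
function runLocally(lines) {
  var rows = [];
  for (var i = 0; i < lines.length; i++) {
    rows.push(JSON.parse(transform(lines[i])));
  }
  return rows;
}

// Hypothetical sample line with the five columns used in the schema example.
var rows = runLocally(['Tokyo,Aiko,34,blue,latte']);
console.log(rows[0].name); // prints "Aiko"
```

Note that `line.split(',')` does not handle quoted fields that contain commas, so input of that kind needs a more careful parser.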

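Template parameters

Parameter	Description
`javascriptTextTransformFunctionName`	The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`JSONPath`	The `gs://` path to the JSON file, stored in Cloud Storage, that defines your BigQuery schema. For example, `gs://path/to/my/schema.json`.
`javascriptTextTransformGcsPath`	The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`inputFilePattern`	The `gs://` path to the text in Cloud Storage that you want to process. For example, `gs://path/to/my/text/data.txt`.
`outputTable`	The name of the BigQuery table to create to store your processed data. If you reuse an existing BigQuery table, the data is appended to the destination table. For example, `my-project-name:my-dataset.my-table`.
`bigQueryLoadingTemporaryDirectory`	The temporary directory for the BigQuery loading process. For example, `gs://my-bucket/my-files/temp_dir`.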
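
The schema file passed in `JSONPath` can be checked against the requirements listed earlier before you upload it. The following is a minimal sketch under stated assumptions: `findSchemaProblems` is a hypothetical helper, not part of the template, and it checks only the rules described in this section (a top-level `BigQuery Schema` array, a string `name` and `type` per entry, and no STRUCT/RECORD fields).

```javascript
// Hypothetical helper: returns a list of problems found in a parsed schema
// file. An empty list means the checks described in this section passed.
function findSchemaProblems(schema) {
  var problems = [];
  var fields = schema ? schema['BigQuery Schema'] : undefined;
  if (!Array.isArray(fields)) {
    return ['Missing top-level "BigQuery Schema" array'];
  }
  for (var i = 0; i < fields.length; i++) {
    var field = fields[i];
    if (!field || typeof field.name !== 'string' || typeof field.type !== 'string') {
      problems.push('Entry ' + i + ' needs string "name" and "type" values');
    } else if (field.type === 'RECORD' || field.type === 'STRUCT') {
      problems.push('Field "' + field.name + '" is a STRUCT (Record) field');
    }
  }
  return problems;
}

// Sample schema for illustration, in the same shape as the example above.
var schema = JSON.parse(
  '{"BigQuery Schema": [' +
  '{"name": "location", "type": "STRING"},' +
  '{"name": "age", "type": "STRING"}]}'
);
console.log(findSchemaProblems(schema).length); // prints 0
```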

Running the Cloud Storage Text to BigQuery template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Text Files on Cloud Storage to BigQuery (Batch) template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery \
    --region REGION_NAME \
    --parameters \
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

Replace the following:

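JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file that contains the schema definition
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA: the Cloud Storage path to your text dataset
BIGQUERY_TABLE: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory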

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

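Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file that contains the schema definition
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA: the Cloud Storage path to your text dataset
BIGQUERY_TABLE: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory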

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.TextIOToBigQuery.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.json.JSONArray;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Templated pipeline to read text from TextIO, apply a javascript UDF to it, and write it to
 * BigQuery.
 */
@Template(
    name = "GCS_Text_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "Text Files on Cloud Storage to BigQuery",
    description =
        "Batch pipeline. Reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF), and outputs the result to BigQuery.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class TextIOToBigQuery {

  /** Options supported by {@link TextIOToBigQuery}. */
  public interface Options extends DataflowPipelineOptions, JavascriptTextTransformerOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage Input File(s)",
        helpText = "Path of the file pattern glob to read from.",
        example = "gs://your-bucket/path/*.csv")
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsReadFile(
        order = 2,
        description = "Cloud Storage location of your BigQuery schema file, described as a JSON",
        helpText =
            "JSON file with BigQuery Schema description. JSON Example: {\n"
                + "\t\"BigQuery Schema\": [\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"location\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"name\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"age\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"color\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"coffee\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t}\n"
                + "\t]\n"
                + "}")
    ValueProvider<String> getJSONPath();

    void setJSONPath(ValueProvider<String> value);

    @TemplateParameter.BigQueryTable(
        order = 3,
        description = "BigQuery output table",
        helpText =
            "BigQuery table location to write the output to. The table's schema must match the "
                + "input objects.")
    ValueProvider<String> getOutputTable();

    void setOutputTable(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 6,
        description = "Temporary directory for BigQuery loading process",
        helpText = "Temporary directory for BigQuery loading process",
        example = "gs://your-bucket/your-files/temp_dir")
    @Validation.Required
    ValueProvider<String> getBigQueryLoadingTemporaryDirectory();

    void setBigQueryLoadingTemporaryDirectory(ValueProvider<String> directory);
  }

  private static final Logger LOG = LoggerFactory.getLogger(TextIOToBigQuery.class);

  private static final String BIGQUERY_SCHEMA = "BigQuery Schema";
  private static final String NAME = "name";
  private static final String TYPE = "type";
  private static final String MODE = "mode";
  private static final String RECORD_TYPE = "RECORD";
  private static final String FIELDS_ENTRY = "fields";

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("Read from source", TextIO.read().from(options.getInputFilePattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(BigQueryConverters.jsonToTableRow())
        .apply(
            "Insert into Bigquery",
            BigQueryIO.writeTableRows()
                .withSchema(
                    NestedValueProvider.of(
                        options.getJSONPath(),
                        jsonPath -> {
                          TableSchema tableSchema = new TableSchema();
                          List<TableFieldSchema> fields = new ArrayList<>();
                          SchemaParser schemaParser = new SchemaParser();

                          try {
                            JSONObject jsonSchema = schemaParser.parseSchema(jsonPath);
                            JSONArray bqSchemaJsonArray = jsonSchema.getJSONArray(BIGQUERY_SCHEMA);

                            for (int i = 0; i < bqSchemaJsonArray.length(); i++) {
                              JSONObject inputField = bqSchemaJsonArray.getJSONObject(i);
                              fields.add(convertToTableFieldSchema(inputField));
                            }
                            tableSchema.setFields(fields);

                          } catch (Exception e) {
                            throw new RuntimeException("Error parsing schema " + jsonPath, e);
                          }
                          return tableSchema;
                        }))
                .to(options.getOutputTable())
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory()));

    pipeline.run();
  }

  /**
   * Convert a JSONObject from the Schema JSON to a TableFieldSchema. In case of RECORD, it handles
   * the conversion recursively.
   *
   * @param inputField Input field to convert.
   * @return TableFieldSchema instance to populate the schema.
   */
  private static TableFieldSchema convertToTableFieldSchema(JSONObject inputField) {
    TableFieldSchema field =
        new TableFieldSchema()
            .setName(inputField.getString(NAME))
            .setType(inputField.getString(TYPE));

    if (inputField.has(MODE)) {
      field.setMode(inputField.getString(MODE));
    }

    if (inputField.getString(TYPE) != null && inputField.getString(TYPE).equals(RECORD_TYPE)) {
      List<TableFieldSchema> nestedFields = new ArrayList<>();
      JSONArray fieldsArr = inputField.getJSONArray(FIELDS_ENTRY);
      for (int i = 0; i < fieldsArr.length(); i++) {
        JSONObject nestedJSON = fieldsArr.getJSONObject(i);
        nestedFields.add(convertToTableFieldSchema(nestedJSON));
      }
      field.setFields(nestedFields);
    }

    return field;
  }
}

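Cloud Storage Text to Datastore [Deprecated]

This template is deprecated and will be removed in Q1 2022. Please migrate to the Cloud Storage Text to Firestore template.

The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore. Each line in the input text files must be a JSON-encoded entity in the format described at https://cloud.google.com/datastore/docs/reference/rest/v1/Entity.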
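
If the input lines are not already JSON-encoded entities, the optional JavaScript UDF can convert them. The following is a minimal sketch under stated assumptions: the `toEntity` function name, the `Person` kind, and the `id,name,age` CSV layout are hypothetical, and the entity shape follows the Entity REST reference linked in the template's source code (https://cloud.google.com/datastore/docs/reference/rest/v1/Entity). The sketch omits `partitionId` from the key; whether your setup needs it is something to verify.

```javascript
// Hypothetical UDF: converts a CSV line of the form "id,name,age" into one
// JSON-encoded Datastore entity.
function toEntity(line) {
  var values = line.split(',');
  var entity = {
    key: {
      // The kind "Person" is an assumption made for this example.
      path: [{ kind: 'Person', name: values[0] }]
    },
    properties: {
      name: { stringValue: values[1] },
      // The REST representation encodes 64-bit integers as strings.
      age: { integerValue: values[2] }
    }
  };
  return JSON.stringify(entity);
}

// Hypothetical sample line for a local check.
console.log(toEntity('p1,Aiko,34'));
```

To use a function like this, you would pass its name in `javascriptTextTransformFunctionName` and the Cloud Storage URI of its file in `javascriptTextTransformGcsPath`.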

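Requirements for this pipeline:

Datastore must be enabled in the destination project.

Template parameters

Parameter	Description
`textReadPattern`	A Cloud Storage path pattern that specifies the location of your text data files. For example, `gs://mybucket/somepath/*.json`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`datastoreWriteProjectId`	The ID of the Google Cloud project where you want to write the Datastore entities.
`datastoreHintNumWorkers`	(Optional) A hint for the expected number of workers in the Datastore ramp-up throttling step. The default is `500`.
`errorWritePath`	The error log output file to use for write failures that occur during processing. For example, `gs://bucket-name/errors.txt`.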

Running the Cloud Storage Text to Datastore template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Text Files on Cloud Storage to Datastore template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Datastore \
    --region REGION_NAME \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

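Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example gs://my-bucket/my-udfs/my_file.js
PROJECT_ID: the ID of the Google Cloud project where you want to write the Datastore entities
ERROR_FILE_WRITE_PATH: the path on Cloud Storage where you want the error file to be written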

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "datastoreWriteProjectId": "PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

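Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: the path on Cloud Storage where you want the error file to be written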

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToDatastore.TextToDatastoreOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemReadOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.TupleTag;

/**
 * Dataflow template which reads from a Text Source and writes JSON encoded Entities into Datastore.
 * The Json is expected to be in the format of:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@MultiTemplate({
  @Template(
      name = "GCS_Text_to_Datastore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Datastore [Deprecated]",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "firestoreWriteProjectId",
        "firestoreWriteEntityKind",
        "firestoreWriteNamespace",
        "firestoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support"),
  @Template(
      name = "GCS_Text_to_Firestore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Firestore (Datastore mode)",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Firestore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "datastoreWriteProjectId",
        "datastoreWriteEntityKind",
        "datastoreWriteNamespace",
        "datastoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support")
})
public class TextToDatastore {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** TextToDatastore Pipeline Options. */
  public interface TextToDatastoreOptions
      extends PipelineOptions,
          FilesystemReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreWriteOptions,
          ErrorWriteOptions {}

  /**
   * Runs a pipeline which reads from a Text Source, passes the Text to a Javascript UDF, and
   * writes the JSON encoded Entities to Datastore or Firestore.
   *
   * <p>If your Text Source does not contain JSON encoded Entities, then you'll need to supply a
   * Javascript UDF which transforms your data to be JSON encoded Entities.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    TextToDatastoreOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TextToDatastoreOptions.class);

    TupleTag<String> errorTag = new TupleTag<String>("errors") {};

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(TextIO.read().from(options.getTextReadPattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            WriteJsonEntities.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreWriteProjectId(), options.getFirestoreWriteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .setErrorTag(errorTag)
                .build())
        .apply(
            LogErrors.newBuilder()
                .setErrorWritePath(options.getErrorWritePath())
                .setErrorTag(errorTag)
                .build());

    pipeline.run();
  }
}

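Cloud Storage Text to Firestore

The Cloud Storage Text to Firestore template is a batch pipeline that reads text files stored in Cloud Storage and writes JSON-encoded entities to Firestore. Each line in the input text files must be in the Entity JSON format (https://cloud.google.com/datastore/docs/reference/rest/v1/Entity), as noted in the template source code.

Pipeline requirements:

Firestore must be enabled in the destination project.

Template parameters

Parameter	Description
`textReadPattern`	A Cloud Storage path pattern that specifies the location of your text data files. Example: `gs://mybucket/somepath/*.json`
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. Example: `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function is `myTransform(inJson) { /*...do stuff...*/ }`, the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`firestoreWriteProjectId`	The ID of the Google Cloud project where the Firestore entities are written.
`firestoreHintNumWorkers`	(Optional) A hint for the expected number of workers in the Firestore ramp-up throttling step. The default is `500`.
`errorWritePath`	The error log output file to use for write failures that occur during processing. Example: `gs://bucket-name/errors.txt`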

Running the Cloud Storage Text to Firestore template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Firestore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Firestore \
    --region REGION_NAME \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
firestoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

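Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
PROJECT_ID: the ID of the Google Cloud project where the Firestore entities are written
PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function is myTransform(inJson) { /*...do stuff...*/ }, the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use. Example: gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage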

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Firestore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "firestoreWriteProjectId": "PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

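Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function is myTransform(inJson) { /*...do stuff...*/ }, the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use. Example: gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage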
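
The REST request above can also be assembled programmatically. The following is a minimal sketch, assuming Java 11 or later; the class name, helper methods, sample values, and the `ACCESS_TOKEN` placeholder are hypothetical and are not part of the template. It only builds the request and does not send it, and obtaining an OAuth 2.0 access token is out of scope here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class LaunchTemplateRequest {

  /** Builds the templates:launch URL in the same shape as the API example. */
  static URI launchUri(String projectId, String location, String version, String templateName) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version + "/" + templateName);
  }

  /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Serializes the job name and the template parameters into a JSON request body. */
  static String launchBody(String jobName, Map<String, String> parameters) {
    String params = parameters.entrySet().stream()
        .map(e -> quote(e.getKey()) + ": " + quote(e.getValue()))
        .collect(Collectors.joining(", "));
    return "{\"jobName\": " + quote(jobName) + ", \"parameters\": {" + params + "}}";
  }

  /** Builds, but does not send, the POST request. The token value is a placeholder. */
  static HttpRequest buildRequest(URI uri, String body, String accessToken) {
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    // Hypothetical sample values; replace them with your own.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("textReadPattern", "gs://my-bucket/input/*.json");
    parameters.put("firestoreWriteProjectId", "my-project");
    parameters.put("errorWritePath", "gs://my-bucket/errors.txt");

    URI uri = launchUri("my-project", "us-central1", "latest", "GCS_Text_to_Firestore");
    String body = launchBody("my-firestore-import", parameters);
    HttpRequest request = buildRequest(uri, body, "ACCESS_TOKEN");
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

The sketch omits the optional `environment` field; add it to the body if you need to pin a zone or machine type.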

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToDatastore.TextToDatastoreOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemReadOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.TupleTag;

/**
 * Dataflow template which reads from a Text Source and writes JSON encoded Entities into Datastore.
 * The Json is expected to be in the format of:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@MultiTemplate({
  @Template(
      name = "GCS_Text_to_Datastore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Datastore [Deprecated]",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "firestoreWriteProjectId",
        "firestoreWriteEntityKind",
        "firestoreWriteNamespace",
        "firestoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support"),
  @Template(
      name = "GCS_Text_to_Firestore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Firestore (Datastore mode)",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Firestore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "datastoreWriteProjectId",
        "datastoreWriteEntityKind",
        "datastoreWriteNamespace",
        "datastoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support")
})
public class TextToDatastore {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** TextToDatastore Pipeline Options. */
  public interface TextToDatastoreOptions
      extends PipelineOptions,
          FilesystemReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreWriteOptions,
          ErrorWriteOptions {}

  /**
   * Runs a pipeline which reads from a Text Source, passes the Text to a Javascript UDF, writes the
   * JSON encoded Entities to a TextIO sink.
   *
   * <p>If your Text Source does not contain JSON encoded Entities, then you'll need to supply a
   * Javascript UDF which transforms your data to be JSON encoded Entities.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    TextToDatastoreOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TextToDatastoreOptions.class);

    TupleTag<String> errorTag = new TupleTag<String>("errors") {};

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(TextIO.read().from(options.getTextReadPattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            WriteJsonEntities.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreWriteProjectId(), options.getFirestoreWriteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .setErrorTag(errorTag)
                .build())
        .apply(
            LogErrors.newBuilder()
                .setErrorWritePath(options.getErrorWritePath())
                .setErrorTag(errorTag)
                .build());

    pipeline.run();
  }
}

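Cloud Storage Text to Pub/Sub (Batch)

This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. You can use the template to publish records from a newline-delimited file that contains JSON records, or from a CSV file, to a Pub/Sub topic for real-time processing. You can also use the template to replay data to Pub/Sub.

This template does not set a timestamp on individual records. During execution, the event time equals the publish time. If your pipeline relies on an accurate event time for processing, do not use this pipeline.

Pipeline requirements:

The files to read must be in newline-delimited JSON or CSV format. Records that span multiple lines in the source files can cause problems downstream, because each line in a file is published to Pub/Sub as a separate message.
The Pub/Sub topic must exist before you run the pipeline.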
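
To make the first requirement concrete, here is a small sketch of what line-by-line publishing means for a record that spans several lines. It is plain Java that imitates the splitting behavior only; the class and method names are hypothetical, and it does not use Beam or Pub/Sub.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LinePerMessage {

  /** Returns one message payload per line of the file contents. */
  static List<String> toMessages(String fileContents) {
    // \R matches any line break sequence (\n, \r\n, and so on).
    return new ArrayList<>(Arrays.asList(fileContents.split("\\R")));
  }

  public static void main(String[] args) {
    // Newline-delimited JSON: every line is a complete record.
    String newlineDelimited = "{\"id\": 1}\n{\"id\": 2}\n";
    // Pretty-printed JSON: a single record spread over three lines.
    String prettyPrinted = "{\n  \"id\": 1\n}\n";

    System.out.println(toMessages(newlineDelimited));
    System.out.println(toMessages(prettyPrinted));
  }
}
```

In the second case, each fragment (`{`, the `"id"` line, and `}`) would become its own message, and none of them is valid JSON on its own. This is why subscribers can fail when the source files contain multi-line records.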

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. Example: `gs://bucket-name/files/*.json`
`outputTopic`	The Pub/Sub topic to write to. The name must be in the format `projects/<project-id>/topics/<topic-name>`.

Running the Cloud Storage Text to Pub/Sub (Batch) template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Pub/Sub (Batch) template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/files/*.json,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME

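Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
TOPIC_NAME: your Pub/Sub topic name
BUCKET_NAME: the name of your Cloud Storage bucket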

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/files/*.json",
       "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

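Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
TOPIC_NAME: your Pub/Sub topic name
BUCKET_NAME: the name of your Cloud Storage bucket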

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.TextToPubsub.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;

/**
 * The {@code TextToPubsub} pipeline publishes records to Cloud Pub/Sub from a set of files. The
 * pipeline reads each file row-by-row and publishes each record as a string message. At the moment,
 * publishing messages with attributes is unsupported.
 *
 * <p>Example Usage:
 *
 * <pre>
 * {@code mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToPubsub \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
 * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://path/to/demo_file.csv \
 * --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}"
 * }
 * </pre>
 */
@Template(
    name = "GCS_Text_to_Cloud_PubSub",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Storage Text File to Pub/Sub (Batch)",
    description =
        "Batch pipeline. Reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class TextToPubsub {

  /** The custom options supported by the pipeline. Inherits standard configuration options. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage Input File(s)",
        helpText = "Path of the file pattern glob to read from.",
        example = "gs://your-bucket/path/*.txt")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Output Pub/Sub topic",
        helpText =
            "The name of the topic to which data should published, in the format of 'projects/your-project-id/topics/your-topic-name'",
        example = "projects/your-project-id/topics/your-topic-name")
    @Required
    ValueProvider<String> getOutputTopic();

    void setOutputTopic(ValueProvider<String> value);
  }

  /**
   * Main entry-point for the pipeline. Reads in the command-line arguments, parses them, and
   * executes the pipeline.
   *
   * @param args Arguments passed in from the command-line.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Executes the pipeline with the provided execution parameters.
   *
   * @param options The execution parameters.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub
     */
    pipeline
        .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
        .apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

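Cloud Storage Text to Cloud Spanner

The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them into a Cloud Spanner database.

Pipeline requirements:

The target Cloud Spanner database and table must exist.
You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
The input Cloud Storage path that contains the CSV files must exist.
You must create an import manifest file that contains a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.

The manifest file, encoded in ASCII or UTF-8, must match the following format.

Manifest format and example

The manifest file format corresponds to the following message type, shown here in protocol buffer format: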

message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the
      // destination table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;

  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}

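The following example shows a manifest file for importing tables named Albums and Singers into a GoogleSQL-dialect database. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies: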

{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}

Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.

Data must match one of the following types:

GoogleSQL

    BOOL
    INT64
    FLOAT64
    NUMERIC
    STRING
    DATE
    TIMESTAMP
    BYTES
    JSON

PostgreSQL

    boolean
    bigint
    double precision
    numeric
    character varying, text
    date
    timestamp with time zone
    bytea
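
As an illustration, a manifest author could check the `type_name` values against the two lists above before launching a job. The sketch below is hypothetical helper code, not part of the template. It assumes that type names are compared case-insensitively and that any dialect other than `POSTGRESQL` is treated as GoogleSQL, mirroring the manifest's default dialect; the template itself may validate types differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ManifestTypeCheck {

  // Types listed for GoogleSQL-dialect databases.
  static final Set<String> GOOGLE_SQL_TYPES = new HashSet<>(Arrays.asList(
      "BOOL", "INT64", "FLOAT64", "NUMERIC", "STRING", "DATE", "TIMESTAMP", "BYTES", "JSON"));

  // Types listed for PostgreSQL-dialect databases.
  static final Set<String> POSTGRESQL_TYPES = new HashSet<>(Arrays.asList(
      "boolean", "bigint", "double precision", "numeric", "character varying", "text",
      "date", "timestamp with time zone", "bytea"));

  /** Returns true if typeName appears in the list for the given dialect. */
  static boolean isSupported(String dialect, String typeName) {
    if ("POSTGRESQL".equals(dialect)) {
      return POSTGRESQL_TYPES.contains(typeName.toLowerCase(Locale.ROOT));
    }
    // Assumption: anything else falls back to GOOGLE_STANDARD_SQL.
    return GOOGLE_SQL_TYPES.contains(typeName.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(isSupported("GOOGLE_STANDARD_SQL", "INT64"));
    System.out.println(isSupported("POSTGRESQL", "INT64"));
  }
}
```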

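Note: If the import manifest file does not specify column names and data types for the target table, the columns in the CSV files must be in the same order as the columns in the target database. To check the order of the columns in a table, run the following query, replacing TABLE_NAME with the name of the table:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME =
      'TABLE_NAME' ORDER BY ORDINAL_POSITION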

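Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database.
`databaseId`	The database ID of the Cloud Spanner database.
`importManifest`	The path in Cloud Storage to the import manifest file.
`columnDelimiter`	The column delimiter that the source file uses. The default value is `,`.
`fieldQualifier`	The character that must surround any value in the source file that contains the `columnDelimiter`. The default value is `"`.
`trailingDelimiter`	Specifies whether the lines in the source files have trailing delimiters (that is, whether the `columnDelimiter` character appears at the end of each line, after the last column value). The default value is `true`.
`escape`	The escape character that the source file uses. By default, this parameter is not set and the template does not use an escape character.
`nullString`	The string that represents a `NULL` value. By default, this parameter is not set and the template does not use a null string.
`dateFormat`	The format used to parse date columns. By default, the pipeline parses date columns as `yyyy-M-d[' 00:00:00']`, for example, 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify it with a `java.time.format.DateTimeFormatter` pattern.
`timestampFormat`	The format used to parse timestamp columns. If the timestamp is a long integer, it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the `java.time.format.DateTimeFormatter.ISO_INSTANT` format. For other cases, specify your own pattern string, for example, `MMM dd yyyy HH:mm:ss.SSSVV` for timestamps in the form `"Jan 21 1998 01:02:03.456+08:00"`.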
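
The interplay of `columnDelimiter`, `fieldQualifier`, and `trailingDelimiter` can be pictured with the following sketch. It is a simplified, hypothetical splitter written for illustration; it is not the template's parser, and it ignores the `escape` and `nullString` parameters as well as qualifier characters escaped inside a qualified value.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLineSplitter {

  /** Splits one line into column values using the given delimiter and qualifier. */
  static List<String> splitLine(
      String line, char columnDelimiter, char fieldQualifier, boolean trailingDelimiter) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean insideQualifier = false;

    for (char c : line.toCharArray()) {
      if (c == fieldQualifier) {
        // Entering or leaving a qualified value; the qualifier itself is dropped.
        insideQualifier = !insideQualifier;
      } else if (c == columnDelimiter && !insideQualifier) {
        // An unqualified delimiter ends the current column value.
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }

    // With trailingDelimiter=true, the delimiter after the last value does not
    // start another (empty) column.
    boolean endsWithDelimiter = line.endsWith(String.valueOf(columnDelimiter));
    if (!(trailingDelimiter && endsWithDelimiter && current.length() == 0)) {
      fields.add(current.toString());
    }
    return fields;
  }

  public static void main(String[] args) {
    // The qualified value keeps its embedded comma.
    System.out.println(splitLine("1,\"Smith, John\",NYC,", ',', '"', true));
    // The same line read with trailingDelimiter=false yields an extra empty column.
    System.out.println(splitLine("1,\"Smith, John\",NYC,", ',', '"', false));
  }
}
```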

If you need to use customized date or timestamp formats, make sure they are valid java.time.format.DateTimeFormatter patterns. The following table shows examples of customized formats for date and timestamp columns:

Type	Input value	Format	Remark
`DATE`	2011-3-31		By default, the template can parse this format. You don't need to specify the `dateFormat` parameter.
`DATE`	2011-3-31 00:00:00		By default, the template can parse this format. You don't need to specify the format. If you like, you can use `yyyy-M-d' 00:00:00'`.
`DATE`	01 Apr, 18	dd MMM, yy
`DATE`	Wednesday, April 3, 2019 AD	EEEE, LLLL d, yyyy G
`TIMESTAMP`	2019-01-02T11:22:33Z 2019-01-02T11:22:33.123Z 2019-01-02T11:22:33.12356789Z		The default format `ISO_INSTANT` can parse this type of timestamp. You don't need to specify the `timestampFormat` parameter.
`TIMESTAMP`	1568402363		By default, the template can parse this type of timestamp and treats it as Unix epoch time.
`TIMESTAMP`	Tue, 3 Jun 2008 11:05:30 GMT	EEE, d MMM yyyy HH:mm:ss VV
`TIMESTAMP`	2018/12/31 110530.123PST	yyyy/MM/dd HHmmss.SSSz
`TIMESTAMP`	2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z	yyyy-MM-dd'T'HH:mm:ss[.SSS]VV	If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp, and you don't need to provide your own format parameter. You can use `yyyy-MM-dd'T'HH:mm:ss[.SSS]VV` to handle both cases. You cannot use `yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'`, because the suffix "Z" must be parsed as a time-zone ID, not as a character literal. Internally, the timestamp column is converted to a `java.time.Instant`. Therefore, it must be specified in UTC or have time zone information associated with it. A local date-time, such as 2019-01-02 11:22:33, cannot be parsed as a valid `java.time.Instant`.
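
Before launching a job, you can try a candidate pattern locally against sample values. The following sketch uses only `java.time` from the JDK. The class and method names are hypothetical, and the use of `Locale.ENGLISH` for month names is an assumption; the template's own parsing code may differ in its details.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

final class FormatPatternCheck {

  /** Parses a DATE value with a candidate dateFormat pattern. */
  static LocalDate parseDate(String value, String pattern) {
    return LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
  }

  /** Parses a TIMESTAMP value; the pattern must capture a zone or an offset. */
  static Instant parseTimestamp(String value, String pattern) {
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
    return ZonedDateTime.parse(value, formatter).toInstant();
  }

  public static void main(String[] args) {
    // Default date pattern, with the optional " 00:00:00" suffix present.
    System.out.println(parseDate("2019-1-1 00:00:00", "yyyy-M-d[' 00:00:00']"));
    // Custom date pattern.
    System.out.println(parseDate("01 Apr, 18", "dd MMM, yy"));
    // VV reads the trailing "Z" as a zone ID, so an Instant can be derived.
    System.out.println(parseTimestamp("2019-01-02T11:22:33Z", "yyyy-MM-dd'T'HH:mm:ss[.SSS]VV"));
    // VV also accepts an explicit offset.
    System.out.println(
        parseTimestamp("Jan 21 1998 01:02:03.456+08:00", "MMM dd yyyy HH:mm:ss.SSSVV"));

    try {
      // A literal 'Z' matches the text but carries no zone information.
      parseTimestamp("2019-01-02T11:22:33Z", "yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'");
    } catch (DateTimeParseException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

The last call fails for the reason given in the table: with `'Z'` as a literal, the parsed result is only a local date-time, which cannot be converted to an instant.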

Running the Text Files on Cloud Storage to Cloud Spanner template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner \
    --region REGION_NAME \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST

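Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file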

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

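Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file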

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

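Cloud Storage to Elasticsearch

The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data to Elasticsearch as JSON documents.

Pipeline requirements:

The Cloud Storage bucket must exist.
An Elasticsearch host must exist on a Google Cloud instance or on Elastic Cloud, and it must be accessible from Dataflow.
A BigQuery table for error output must exist.

Template parameters

Parameter	Description
`inputFileSpec`	The Cloud Storage file pattern to search for CSV files. Example: `gs://mybucket/test-*.csv`
`connectionUrl`	The Elasticsearch URL in the format `https://hostname:[port]`. If you use Elastic Cloud, specify the CloudID.
`apiKey`	The Base64-encoded API key used for authentication.
`index`	The Elasticsearch index to which requests are issued, such as `my-index`.
`deadletterTable`	The BigQuery dead-letter table that receives records that failed to be inserted. Example: `<your-project>:<your-dataset>.<your-table-name>`
`containsHeaders`	(Optional) A Boolean that indicates whether the CSV includes headers. Default: `true`.
`delimiter`	(Optional) The delimiter that the CSV file uses. Example: `,`
`csvFormat`	(Optional) The CSV format, according to the Apache Commons CSV format. Default: `Default`
`jsonSchemaPath`	(Optional) The path to the JSON schema. Default: `null`
`largeNumFiles`	(Optional) Set to true if the number of files is in the tens of thousands. Default: `false`
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. Example: `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function is `myTransform(inJson) { /*...do stuff...*/ }`, the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`batchSize`	(Optional) The batch size in number of documents. Default: `1000`
`batchSizeBytes`	(Optional) The batch size in number of bytes. Default: `5242880` (5 MB).
`maxRetryAttempts`	(Optional) The maximum number of retry attempts. Must be greater than 0. Default: no retries
`maxRetryDuration`	(Optional) The maximum retry duration in milliseconds. Must be greater than 0. Default: no retries
`csvFileEncoding`	(Optional) The encoding of the CSV files.
`propertyAsIndex`	(Optional) A property in the document being indexed whose value specifies the `_index` metadata to include with the document in the bulk request. Takes precedence over an `_index` UDF. Default: none
`propertyAsId`	(Optional) A property in the document being indexed whose value specifies the `_id` metadata to include with the document in the bulk request. Takes precedence over an `_id` UDF. Default: none
`javaScriptIndexFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_index` metadata to include with the document in the bulk request. Default: none
`javaScriptIndexFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_index` metadata to include with the document in the bulk request. Default: none
`javaScriptIdFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_id` metadata to include with the document in the bulk request. Default: none
`javaScriptIdFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_id` metadata to include with the document in the bulk request. Default: none
`javaScriptTypeFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_type` metadata to include with the document in the bulk request. Default: none
`javaScriptTypeFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_type` metadata to include with the document in the bulk request. Default: none
`javaScriptIsDeleteFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that determines whether the document is deleted rather than inserted or updated. The function must return the string value `"true"` or `"false"`. Default: none
`javaScriptIsDeleteFnName`	(Optional) The name of the JavaScript UDF function that determines whether the document is deleted rather than inserted or updated. The function must return the string value `"true"` or `"false"`. Default: none
`usePartialUpdate`	(Optional) Whether to use partial updates (update rather than create or index, which allows partial documents) with Elasticsearch requests. Default: `false`
`bulkInsertMethod`	(Optional) Whether to use `INDEX` (index, which allows upserts) or `CREATE` (create, which returns an error on a duplicate _id) with Elasticsearch bulk requests. Default: `CREATE`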
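
The difference between the two `bulkInsertMethod` values can be pictured with an in-memory stand-in for an index. The sketch below is purely illustrative: the class, enum, and method names are hypothetical, and it does not call Elasticsearch.

```java
import java.util.HashMap;
import java.util.Map;

final class BulkInsertMethodSketch {

  enum Method { INDEX, CREATE }

  /**
   * Applies one document to a map that stands in for an index, keyed by _id.
   * Returns false if the write is rejected.
   */
  static boolean apply(Map<String, String> index, Method method, String id, String document) {
    if (method == Method.CREATE && index.containsKey(id)) {
      // CREATE: a duplicate _id is an error and the stored document is unchanged.
      return false;
    }
    // INDEX (or CREATE with a new _id): insert, or overwrite the existing document.
    index.put(id, document);
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> index = new HashMap<>();
    System.out.println(apply(index, Method.CREATE, "1", "{\"v\": 1}"));
    System.out.println(apply(index, Method.CREATE, "1", "{\"v\": 2}"));
    System.out.println(apply(index, Method.INDEX, "1", "{\"v\": 3}"));
    System.out.println(index);
  }
}
```

With the default `CREATE`, re-running a job over the same input and the same `_id` values would produce duplicate-ID errors rather than overwriting documents; `INDEX` is the option that behaves like an upsert.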

Running the Cloud Storage to Elasticsearch template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Storage to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/GCS_to_Elasticsearch \
    --parameters \
inputFileSpec=INPUT_FILE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX,\
deadletterTable=DEADLETTER_TABLE

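Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern
CONNECTION_URL: your Elasticsearch URL
APIKEY: your Base64-encoded API key for authentication
INDEX: your Elasticsearch index
DEADLETTER_TABLE: your BigQuery table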

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileSpec": "INPUT_FILE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX",
          "deadletterTable": "DEADLETTER_TABLE"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/GCS_to_Elasticsearch"
   }
}

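Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template might include breaking changes. To keep such changes from affecting your production workflows, use templates from the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job. Example: us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern
CONNECTION_URL: your Elasticsearch URL
APIKEY: your Base64-encoded API key for authentication
INDEX: your Elasticsearch index
DEADLETTER_TABLE: your BigQuery table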

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.GCSToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.CsvConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteStringMessageErrors;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link GCSToElasticsearch} pipeline exports data from one or more CSV files in Cloud Storage
 * to Elasticsearch.
 *
 * <p>Please refer to <b><a href=
 * "https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/googlecloud-to-elasticsearch/docs/GCSToElasticsearch/README.md">
 * README.md</a></b> for further information.
 */
@Template(
    name = "GCS_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Storage to Elasticsearch",
    description =
        "A pipeline to ingest csv files from Cloud Storage and writes each line into Elasticsearch"
            + " as a json document.",
    optionsClass = GCSToElasticsearchOptions.class,
    flexContainerName = "gcs-to-elasticsearch",
    contactInformation = "https://cloud.google.com/support")
public class GCSToElasticsearch {

  /** The tag for the headers of the CSV if required. */
  static final TupleTag<String> CSV_HEADERS = new TupleTag<String>() {};

  /** The tag for the lines of the CSV. */
  static final TupleTag<String> CSV_LINES = new TupleTag<String>() {};

  /** The tag for the dead-letter output of the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output for the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(GCSToElasticsearch.class);

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    GCSToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(GCSToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  private static PipelineResult run(GCSToElasticsearchOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    // Throw error if containsHeaders is true and a schema or Udf is also set.
    if (options.getContainsHeaders()) {
      checkArgument(
          options.getJavascriptTextTransformGcsPath() == null
              && options.getJsonSchemaPath() == null,
          "Cannot parse file containing headers with UDF or Json schema.");
    }

    // Throw error if only one retry configuration parameter is set.
    checkArgument(
        (options.getMaxRetryAttempts() == null && options.getMaxRetryDuration() == null)
            || (options.getMaxRetryAttempts() != null && options.getMaxRetryDuration() != null),
        "To specify retry configuration both max attempts and max duration must be set.");

    /*
     * Steps: 1) Read records from CSV(s) via {@link CsvConverters.ReadCsv}.
     *        2) Convert lines to JSON strings via {@link CsvConverters.LineToFailsafeJson}.
     *        3a) Write JSON strings as documents to Elasticsearch via {@link ElasticsearchIO}.
     *        3b) Write elements that failed processing to {@link org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO}.
     */
    PCollectionTuple convertedCsvLines =
        pipeline
            /*
             * Step 1: Read CSV file(s) from Cloud Storage using {@link CsvConverters.ReadCsv}.
             */
            .apply(
                "ReadCsv",
                CsvConverters.ReadCsv.newBuilder()
                    .setCsvFormat(options.getCsvFormat())
                    .setDelimiter(options.getDelimiter())
                    .setHasHeaders(options.getContainsHeaders())
                    .setInputFileSpec(options.getInputFileSpec())
                    .setHeaderTag(CSV_HEADERS)
                    .setLineTag(CSV_LINES)
                    .setFileEncoding(options.getCsvFileEncoding())
                    .build())
            /*
             * Step 2: Convert lines to Elasticsearch document.
             */
            .apply(
                "ConvertLine",
                CsvConverters.LineToFailsafeJson.newBuilder()
                    .setDelimiter(options.getDelimiter())
                    .setUdfFileSystemPath(options.getJavascriptTextTransformGcsPath())
                    .setUdfFunctionName(options.getJavascriptTextTransformFunctionName())
                    .setJsonSchemaPath(options.getJsonSchemaPath())
                    .setHeaderTag(CSV_HEADERS)
                    .setLineTag(CSV_LINES)
                    .setUdfOutputTag(PROCESSING_OUT)
                    .setUdfDeadletterTag(PROCESSING_DEADLETTER_OUT)
                    .build());
    /*
     * Step 3a: Write elements that were successfully processed to Elasticsearch using {@link WriteToElasticsearch}.
     */
    convertedCsvLines
        .get(PROCESSING_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setOptions(options.as(GCSToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link BigQueryIO}.
     */
    convertedCsvLines
        .get(PROCESSING_DEADLETTER_OUT)
        .apply(
            "AddTimestamps",
            WithTimestamps.of((FailsafeElement<String, String> failures) -> new Instant()))
        .apply(
            "WriteFailedElementsToBigQuery",
            WriteStringMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    return pipeline.run();
  }
}

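Java Database Connectivity (JDBC) to BigQuery

The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. The pipeline uses JDBC to connect to the relational database, so you can use it to copy data from any relational database that has an available JDBC driver. For an extra layer of protection, you can pass in the username, password, and connection string parameters as Base64-encoded strings encrypted with a Cloud KMS key. For details on encrypting these parameters, see the Cloud KMS API encryption endpoint.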
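
The preparation step for encrypted parameters can be sketched as follows. This is a minimal illustration, not the template's own code: it Base64-encodes a plaintext value and builds the JSON body with the `plaintext` field that the sample curl command in the template's description sends to the Cloud KMS encrypt endpoint. The class name `KmsPayloadSketch` and its methods are hypothetical, and sending the request and authenticating are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: prepares a value for the Cloud KMS encrypt endpoint. */
final class KmsPayloadSketch {

  /** Base64-encodes a UTF-8 string, as expected by the "plaintext" field. */
  static String toBase64(String value) {
    return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
  }

  /**
   * Builds the JSON request body for the encrypt call. Standard Base64 output
   * contains only letters, digits, '+', '/' and '=', so no JSON escaping is needed.
   */
  static String encryptRequestBody(String value) {
    return "{\"plaintext\":\"" + toBase64(value) + "\"}";
  }

  public static void main(String[] args) {
    // Example value taken from the connectionURL parameter description.
    System.out.println(encryptRequestBody("jdbc:mysql://some-host:3306/sampledb"));
  }
}
```

The encrypted value returned by the endpoint is what you would then pass as `connectionURL`, `username`, or `password`, together with `KMSEncryptionKey`.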

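Requirements for this pipeline:

The JDBC driver for the relational database must be available.
The BigQuery table must exist before you run the pipeline.
The BigQuery table must have a compatible schema.
The relational database must be accessible from the subnet where Dataflow runs.

Template parameters

Parameter	Description
`driverJars`	The comma-separated list of driver JAR files. For example: `gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar`
`driverClassName`	The JDBC driver class name. For example: `com.mysql.jdbc.Driver`
`connectionURL`	The JDBC connection URL string. For example: `jdbc:mysql://some-host:3306/sampledb`. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`query`	The query to run on the source to extract the data. For example: `select * from sampledb.sample_table`
`outputTable`	The BigQuery output table location, in the format `<my-project>:<my-dataset>.<my-table>`.
`bigQueryLoadingTemporaryDirectory`	The temporary directory for the BigQuery loading process. For example: `gs://<my-bucket>/my-files/temp_dir`
`connectionProperties`	(Optional) The properties string to use for the JDBC connection. The string must use the format `[propertyName=property;]*`. For example: `unicode=true;characterEncoding=UTF-8`
`username`	(Optional) The username to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`password`	(Optional) The password to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`KMSEncryptionKey`	(Optional) The Cloud KMS encryption key used to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.
`disabledAlgorithms`	(Optional) Comma-separated algorithms to disable. If this value is set to `none`, no algorithm is disabled. Use with care, because the algorithms that are disabled by default are known to have vulnerabilities or performance issues. For example: `SSLv3, RC4`
`extraFilesToStage`	Comma-separated Cloud Storage paths or Secret Manager secrets for files to stage in the worker. These files are saved in the `/extra_files` directory of each worker. For example: `gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id>`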
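
To make the `connectionProperties` format `[propertyName=property;]*` concrete, here is a small sketch that splits such a string into name/value pairs. It is only an illustration for checking a value before you launch a job; the class `ConnectionPropertiesSketch` is hypothetical, and it does not claim to reproduce how the template or a JDBC driver parses the string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical validator for strings of the form [propertyName=property;]*. */
final class ConnectionPropertiesSketch {

  /** Splits the string on ';' and each entry on the first '=', keeping insertion order. */
  static Map<String, String> parse(String properties) {
    Map<String, String> result = new LinkedHashMap<>();
    if (properties == null || properties.trim().isEmpty()) {
      return result;
    }
    for (String entry : properties.split(";")) {
      if (entry.trim().isEmpty()) {
        continue; // tolerate empty entries such as a doubled ';'
      }
      int eq = entry.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Expected propertyName=property but got: " + entry);
      }
      result.put(entry.substring(0, eq).trim(), entry.substring(eq + 1).trim());
    }
    return result;
  }

  public static void main(String[] args) {
    // Example value taken from the parameter table.
    System.out.println(parse("unicode=true;characterEncoding=UTF-8"));
  }
}
```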

Running the JDBC to BigQuery template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the JDBC to BigQuery template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Jdbc_to_BigQuery \
    --region REGION_NAME \
    --parameters \
driverJars=DRIVER_PATHS,\
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
query=SOURCE_SQL_QUERY,\
outputTable=PROJECT_ID:DATASET.TABLE_NAME,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\
connectionProperties=CONNECTION_PROPERTIES,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

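Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
SOURCE_SQL_QUERY: the SQL query to run on the source database
DATASET: your BigQuery dataset. Replace TABLE_NAME with your BigQuery table name.
PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temporary directory
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key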

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "driverJars": "DRIVER_PATHS",
       "driverClassName": "DRIVER_CLASS_NAME",
       "connectionURL": "JDBC_CONNECTION_URL",
       "query": "SOURCE_SQL_QUERY",
       "outputTable": "PROJECT_ID:DATASET.TABLE_NAME",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS",
       "connectionProperties": "CONNECTION_PROPERTIES",
       "username": "CONNECTION_USERNAME",
       "password": "CONNECTION_PASSWORD",
       "KMSEncryptionKey":"KMS_ENCRYPTION_KEY"
   },
   "environment": { "zone": "us-central1-f" }
}

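Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
SOURCE_SQL_QUERY: the SQL query to run on the source database
DATASET: your BigQuery dataset. Replace TABLE_NAME with your BigQuery table name.
PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temporary directory
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key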
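
As a rough sketch of how the placeholders combine, the following snippet assembles the `templates:launch` URL shown in the request above from a project ID, location, template version, and template name. The class `LaunchUrlSketch` and its method names are hypothetical; the snippet only concatenates strings and does not send the request or handle authentication.

```java
/** Hypothetical helper that fills in the placeholders of the launch URL. */
final class LaunchUrlSketch {

  /** Builds the Cloud Storage path of a template, e.g. for "latest" or a dated version. */
  static String templatePath(String version, String templateName) {
    return "gs://dataflow-templates/" + version + "/" + templateName;
  }

  /** Builds the templates:launch URL in the same shape as the documented request. */
  static String launchUrl(
      String projectId, String location, String version, String templateName) {
    return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location
        + "/templates:launch?gcsPath=" + templatePath(version, templateName);
  }

  public static void main(String[] args) {
    // "my-project" is a made-up project ID used only for illustration.
    System.out.println(launchUrl("my-project", "us-central1", "latest", "Jdbc_to_BigQuery"));
  }
}
```

Passing a dated version name such as `2021-09-20-00_RC00` instead of `latest` pins the job to one template release, in line with the note about production use.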

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.io.DynamicJdbcIO;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.common.JdbcConverters;
import com.google.cloud.teleport.util.KMSEncryptedNestedValueProvider;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * A template that copies data from a relational database using JDBC to an existing BigQuery table.
 */
@Template(
    name = "Jdbc_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "JDBC to BigQuery",
    description =
        "A pipeline that reads from a JDBC source and writes to a BigQuery table. JDBC connection string, user name and password can be passed in directly as plaintext or encrypted using the Google Cloud KMS API.  If the parameter KMSEncryptionKey is specified, connectionURL, username, and password should be all in encrypted format. A sample curl command for the KMS API encrypt endpoint: curl -s -X POST \"https://cloudkms.googleapis.com/v1/projects/your-project/locations/your-path/keyRings/your-keyring/cryptoKeys/your-key:encrypt\"  -d \"{\\\"plaintext\\\":\\\"PasteBase64EncodedString\\\"}\" -H \"Authorization: Bearer $(gcloud auth application-default print-access-token)\" -H \"Content-Type: application/json\"",
    optionsClass = JdbcConverters.JdbcToBigQueryOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class JdbcToBigQuery {

  private static final Logger LOG = LoggerFactory.getLogger(JdbcToBigQuery.class);

  private static ValueProvider<String> maybeDecrypt(
      ValueProvider<String> unencryptedValue, ValueProvider<String> kmsKey) {
    return new KMSEncryptedNestedValueProvider(unencryptedValue, kmsKey);
  }

  /**
   * Main entry point for executing the pipeline. This will run the pipeline asynchronously. If
   * blocking execution is required, use the {@link
   * JdbcToBigQuery#run(JdbcConverters.JdbcToBigQueryOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    JdbcConverters.JdbcToBigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(JdbcConverters.JdbcToBigQueryOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(JdbcConverters.JdbcToBigQueryOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps: 1) Read records via JDBC and convert to TableRow via RowMapper
     *        2) Append TableRow to BigQuery via BigQueryIO
     */
    pipeline
        /*
         * Step 1: Read records via JDBC and convert to TableRow
         *         via {@link org.apache.beam.sdk.io.jdbc.JdbcIO.RowMapper}
         */
        .apply(
            "Read from JdbcIO",
            DynamicJdbcIO.<TableRow>read()
                .withDataSourceConfiguration(
                    DynamicJdbcIO.DynamicDataSourceConfiguration.create(
                            options.getDriverClassName(),
                            maybeDecrypt(options.getConnectionURL(), options.getKMSEncryptionKey()))
                        .withUsername(
                            maybeDecrypt(options.getUsername(), options.getKMSEncryptionKey()))
                        .withPassword(
                            maybeDecrypt(options.getPassword(), options.getKMSEncryptionKey()))
                        .withDriverJars(options.getDriverJars())
                        .withConnectionProperties(options.getConnectionProperties()))
                .withQuery(options.getQuery())
                .withCoder(TableRowJsonCoder.of())
                .withRowMapper(JdbcConverters.getResultSetToTableRow(options.getUseColumnAlias())))
        /*
         * Step 2: Append TableRow to an existing BigQuery table
         */
        .apply(
            "Write to BigQuery",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}

Java Database Connectivity (JDBC) to Pub/Sub

The Java Database Connectivity (JDBC) to Pub/Sub template is a batch pipeline that ingests data from a JDBC source and writes the resulting records to an existing Pub/Sub topic as JSON strings.

Requirements for this pipeline:

The JDBC source must exist before you run the pipeline.
The Pub/Sub output topic must exist before you run the pipeline.

Template parameters

Parameter	Description
`driverClassName`	The JDBC driver class name. For example: `com.mysql.jdbc.Driver`
`connectionUrl`	The JDBC connection URL string. For example: `jdbc:mysql://some-host:3306/sampledb`. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`driverJars`	The comma-separated Cloud Storage paths of the JDBC drivers. For example: `gs://your-bucket/driver_jar1.jar,gs://your-bucket/driver_jar2.jar`
`username`	(Optional) The username to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`password`	(Optional) The password to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`connectionProperties`	(Optional) The properties string to use for the JDBC connection. The string must use the format `[propertyName=property;]*`. For example: `unicode=true;characterEncoding=UTF-8`
`query`	The query to run on the source to extract the data. For example: `select * from sampledb.sample_table`
`outputTopic`	The Pub/Sub topic to publish to, in the format `projects/<project>/topics/<topic>`.
`KMSEncryptionKey`	(Optional) The Cloud KMS encryption key used to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.
`extraFilesToStage`	Comma-separated Cloud Storage paths or Secret Manager secrets for files to stage in the worker. These files are saved in the `/extra_files` directory of each worker. For example: `gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id>`

Running the Java Database Connectivity (JDBC) to Pub/Sub template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the JDBC to Pub/Sub template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location gs://dataflow-templates/VERSION/flex/Jdbc_to_PubSub \
    --region REGION_NAME \
    --parameters \
driverClassName=DRIVER_CLASS_NAME,\
connectionUrl=JDBC_CONNECTION_URL,\
driverJars=DRIVER_PATHS,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
connectionProperties=CONNECTION_PROPERTIES,\
query=SOURCE_SQL_QUERY,\
outputTopic=OUTPUT_TOPIC,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

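Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
SOURCE_SQL_QUERY: the SQL query to run on the source database
OUTPUT_TOPIC: the Pub/Sub topic to publish to
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key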

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "driverClassName": "DRIVER_CLASS_NAME",
          "connectionUrl": "JDBC_CONNECTION_URL",
          "driverJars": "DRIVER_PATHS",
          "username": "CONNECTION_USERNAME",
          "password": "CONNECTION_PASSWORD",
          "connectionProperties": "CONNECTION_PROPERTIES",
          "query": "SOURCE_SQL_QUERY",
          "outputTopic": "OUTPUT_TOPIC",
          "KMSEncryptionKey": "KMS_ENCRYPTION_KEY"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Jdbc_to_PubSub",
      "environment": { "zone": "us-central1-f" }
   }
}

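Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates might contain breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
SOURCE_SQL_QUERY: the SQL query to run on the source database
OUTPUT_TOPIC: the Pub/Sub topic to publish to
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key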

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.utils.KMSUtils.maybeDecrypt;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.io.DynamicJdbcIO;
import com.google.cloud.teleport.v2.options.JdbcToPubsubOptions;
import java.sql.Clob;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link JdbcToPubsub} batch pipeline reads data from JDBC and publishes to Google Cloud
 * PubSub. <br>
 */
@Template(
    name = "Jdbc_to_PubSub",
    category = TemplateCategory.BATCH,
    displayName = "JDBC to Pub/Sub",
    description =
        "A batch pipeline which ingests data from JDBC source and writes to a pre-existing Pub/Sub"
            + " topic as a JSON string. JDBC connection string, user name and password can be"
            + " passed in directly as plaintext or encrypted using the Google Cloud KMS API.  If"
            + " the parameter KMSEncryptionKey is specified, connectionUrl, username, and password"
            + " should be all in encrypted format. A sample curl command for the KMS API encrypt"
            + " endpoint: curl -s -X POST"
            + " \"https://cloudkms.googleapis.com/v1/projects/your-project/locations/your-path/keyRings/your-keyring/cryptoKeys/your-key:encrypt\""
            + "  -d \"{\\\"plaintext\\\":\\\"PasteBase64EncodedString\\\"}\"  -H \"Authorization:"
            + " Bearer $(gcloud auth application-default print-access-token)\"  -H \"Content-Type:"
            + " application/json\"",
    optionsClass = JdbcToPubsubOptions.class,
    flexContainerName = "jdbc-to-pubsub",
    contactInformation = "https://cloud.google.com/support")
public class JdbcToPubsub {

  /* Logger for class.*/
  private static final Logger LOG = LoggerFactory.getLogger(JdbcToPubsub.class);

  /**
   * {@link JdbcIO.RowMapper} implementation to convert Jdbc ResultSet rows to UTF-8 encoded JSONs.
   */
  public static class ResultSetToJSONString implements JdbcIO.RowMapper<String> {

    @Override
    public String mapRow(ResultSet resultSet) throws Exception {
      ResultSetMetaData metaData = resultSet.getMetaData();
      JSONObject json = new JSONObject();

      for (int i = 1; i <= metaData.getColumnCount(); i++) {
        Object value = resultSet.getObject(i);

        // JSONObject.put() does not support null values. The exception is JSONObject.NULL
        if (value == null) {
          json.put(metaData.getColumnLabel(i), JSONObject.NULL);
          continue;
        }

        switch (metaData.getColumnTypeName(i).toLowerCase()) {
          case "clob":
            Clob clobObject = resultSet.getClob(i);
            if (clobObject.length() > Integer.MAX_VALUE) {
              LOG.warn(
                  "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
                  clobObject.length(),
                  metaData.getColumnLabel(i));
            }
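            // Clamp before casting: a bare (int) cast of a length above Integer.MAX_VALUE overflows.
            int clobLength = (int) Math.min(clobObject.length(), Integer.MAX_VALUE);
            json.put(metaData.getColumnLabel(i), clobObject.getSubString(1, clobLength));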
            break;
          default:
            json.put(metaData.getColumnLabel(i), value);
        }
      }
      return json.toString();
    }
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    JdbcToPubsubOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(JdbcToPubsubOptions.class);

    run(options);
  }

  /**
   * Runs a pipeline which reads message from JDBC and writes to Pub/Sub.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(JdbcToPubsubOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    LOG.info("Starting Jdbc-To-PubSub Pipeline.");

    /*
     * Steps:
     *  1) Read data from a Jdbc Table
     *  2) Write to Pub/Sub topic
     */
    DynamicJdbcIO.DynamicDataSourceConfiguration dataSourceConfiguration =
        DynamicJdbcIO.DynamicDataSourceConfiguration.create(
                options.getDriverClassName(),
                maybeDecrypt(options.getConnectionUrl(), options.getKMSEncryptionKey()))
            .withDriverJars(options.getDriverJars());
    if (options.getUsername() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withUsername(
              maybeDecrypt(options.getUsername(), options.getKMSEncryptionKey()));
    }
    if (options.getPassword() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withPassword(
              maybeDecrypt(options.getPassword(), options.getKMSEncryptionKey()));
    }
    if (options.getConnectionProperties() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withConnectionProperties(options.getConnectionProperties());
    }

    PCollection<String> jdbcData =
        pipeline.apply(
            "readFromJdbc",
            DynamicJdbcIO.<String>read()
                .withDataSourceConfiguration(dataSourceConfiguration)
                .withQuery(options.getQuery())
                .withCoder(StringUtf8Coder.of())
                .withRowMapper(new ResultSetToJSONString()));

    jdbcData.apply("writeSuccessMessages", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

Apache Cassandra to Cloud Bigtable

The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. The template requires minimal configuration and replicates the Cassandra table structure as closely as possible in Cloud Bigtable.

The Apache Cassandra to Cloud Bigtable template is useful for the following:

Migrating an Apache Cassandra database when only a short downtime is acceptable.
Periodically replicating Cassandra tables to Cloud Bigtable for global serving.

Requirements for this pipeline:

The target Bigtable table must exist before you run the pipeline.
A network connection must exist between the Dataflow workers and the Apache Cassandra nodes.

Type conversion

The Apache Cassandra to Cloud Bigtable template automatically converts Apache Cassandra data types to Cloud Bigtable data types.

Most primitives are represented the same way in Cloud Bigtable and Apache Cassandra. However, the following primitives are represented differently:

Date and Timestamp are converted to DateTime objects.
UUID is converted to String.
Varint is converted to BigDecimal.
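
Two of these conversions can be illustrated with plain JDK types. The sketch below assumes that a Cassandra `uuid` value arrives as a `java.util.UUID` and a `varint` value as a `java.math.BigInteger`; that input mapping is an assumption for illustration, and the class `TypeConversionSketch` is hypothetical rather than part of the template.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.UUID;

/** Hypothetical illustration of the UUID and Varint conversions listed above. */
final class TypeConversionSketch {

  /** UUID -> String: the canonical 36-character textual form. */
  static String uuidToString(UUID id) {
    return id.toString();
  }

  /** Varint -> BigDecimal: an arbitrary-precision integer becomes a decimal with scale 0. */
  static BigDecimal varintToBigDecimal(BigInteger varint) {
    return new BigDecimal(varint);
  }

  public static void main(String[] args) {
    System.out.println(uuidToString(UUID.fromString("123e4567-e89b-12d3-a456-426614174000")));
    System.out.println(varintToBigDecimal(new BigInteger("12345678901234567890")));
  }
}
```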

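Apache Cassandra also natively supports more complex types such as Tuple, List, Set, and Map. This pipeline does not support tuples, because Apache Beam has no corresponding type.

For example, in Apache Cassandra you can have a column of type List named "mylist" that stores values like those in the following table:

row	mylist
1	`(a,b,c)`

The pipeline expands the list column into three separate columns (known in Cloud Bigtable as column qualifiers). The column name is "mylist", and the pipeline appends the index of each list item, as in "mylist[0]".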

row	mylist[0]	mylist[1]	mylist[2]
1	a	b	c
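
The list expansion shown in the table can be sketched in a few lines. This is an illustration of the naming scheme only, assuming string values; the class `ListExpansionSketch` is hypothetical and is not the template's implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of expanding a list column into indexed column qualifiers. */
final class ListExpansionSketch {

  /** Maps each list item to a qualifier of the form column[index]. */
  static Map<String, String> expandList(String column, List<String> values) {
    Map<String, String> cells = new LinkedHashMap<>();
    for (int i = 0; i < values.size(); i++) {
      cells.put(column + "[" + i + "]", values.get(i));
    }
    return cells;
  }

  public static void main(String[] args) {
    // Prints {mylist[0]=a, mylist[1]=b, mylist[2]=c}
    System.out.println(expandList("mylist", Arrays.asList("a", "b", "c")));
  }
}
```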

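The pipeline handles sets the same way as lists. Maps are expanded in a similar way, with an additional suffix that denotes whether the cell holds a key or a value. For example, consider a column of type Map named "mymap":

row	mymap
1	`{"first_key":"first_value","another_key":"different_value"}`

After the transformation, the table appears as follows: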

row	mymap[0].key	mymap[0].value	mymap[1].key	mymap[1].value
1	first_key	first_value	another_key	different_value
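
The key/value suffix scheme in the table above can be sketched as follows. As with the list case, this only illustrates the resulting qualifier names for string keys and values; the class `MapExpansionSketch` is hypothetical, and the entry order used by the actual pipeline is not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of expanding a map column into .key and .value qualifiers. */
final class MapExpansionSketch {

  /** Emits column[i].key and column[i].value for the i-th entry, in iteration order. */
  static Map<String, String> expandMap(String column, Map<String, String> entries) {
    Map<String, String> cells = new LinkedHashMap<>();
    int index = 0;
    for (Map.Entry<String, String> entry : entries.entrySet()) {
      cells.put(column + "[" + index + "].key", entry.getKey());
      cells.put(column + "[" + index + "].value", entry.getValue());
      index++;
    }
    return cells;
  }

  public static void main(String[] args) {
    Map<String, String> mymap = new LinkedHashMap<>();
    mymap.put("first_key", "first_value");
    mymap.put("another_key", "different_value");
    System.out.println(expandMap("mymap", mymap));
  }
}
```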

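Primary key conversion

In Apache Cassandra, a primary key is defined using data definition language. The primary key is either simple, composite, or compound with clustering columns. Cloud Bigtable supports manual row key construction, with row keys ordered lexicographically as byte arrays. The pipeline automatically collects information about the key type and constructs a row key based on best practices for building row keys from multiple values.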
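
One common way to build a row key from multiple values is to join them with a separator, which is what the `rowKeySeparator` parameter (default `#`) suggests. The sketch below shows that idea under the assumption that each key component has already been converted to a string; the class `RowKeySketch` is hypothetical, and the exact component order and encoding used by the pipeline are not shown in this document.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of composing a row key from several primary key values. */
final class RowKeySketch {

  /** Joins the key components in order, e.g. ["user42", "2021-09-20"] -> "user42#2021-09-20". */
  static String buildRowKey(List<String> primaryKeyValues, String separator) {
    return String.join(separator, primaryKeyValues);
  }

  public static void main(String[] args) {
    // The component values are made up for illustration.
    System.out.println(buildRowKey(Arrays.asList("user42", "2021-09-20"), "#"));
  }
}
```

Because row keys sort lexicographically, placing the component you most often filter on first keeps related rows adjacent.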

Template parameters

Parameter	Description
`cassandraHosts`	The hosts of the Apache Cassandra nodes, as a comma-separated list.
`cassandraPort`	(Optional) The TCP port used to reach Apache Cassandra on the nodes (defaults to `9042`).
`cassandraKeyspace`	The Apache Cassandra keyspace where the table is located.
`cassandraTable`	The Apache Cassandra table to copy.
`bigtableProjectId`	The Google Cloud project ID of the Bigtable instance to which the Apache Cassandra table is copied.
`bigtableInstanceId`	The ID of the Cloud Bigtable instance to which the Apache Cassandra table is copied.
`bigtableTableId`	The name of the Bigtable table to which the Apache Cassandra table is copied.
`defaultColumnFamily`	(Optional) The name of the Bigtable table's column family (defaults to `default`).
`rowKeySeparator`	(Optional) The separator used to build the row key (defaults to `#`).

Running the Apache Cassandra to Cloud Bigtable template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Cassandra to Cloud Bigtable template.
In the parameter fields that appear, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=BIGTABLE_INSTANCE_ID,\
bigtableTableId=BIGTABLE_TABLE_ID,\
cassandraHosts=CASSANDRA_HOSTS,\
cassandraKeyspace=CASSANDRA_KEYSPACE,\
cassandraTable=CASSANDRA_TABLE

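Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the project where Cloud Bigtable is located
BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
BIGTABLE_TABLE_ID: the name of the Cloud Bigtable table
CASSANDRA_HOSTS: the Apache Cassandra host list. If there are multiple hosts, follow the instructions for escaping commas
CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated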

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "BIGTABLE_INSTANCE_ID",
       "bigtableTableId": "BIGTABLE_TABLE_ID",
       "cassandraHosts": "CASSANDRA_HOSTS",
       "cassandraKeyspace": "CASSANDRA_KEYSPACE",
       "cassandraTable": "CASSANDRA_TABLE"
   },
   "environment": { "zone": "us-central1-f" }
}

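Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the project where Cloud Bigtable is located
BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
BIGTABLE_TABLE_ID: the name of the Cloud Bigtable table
CASSANDRA_HOSTS: the Apache Cassandra host list. If there are multiple hosts, follow the instructions for escaping commas
CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated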
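
For callers that issue the REST request from Java rather than from a command-line tool, the following sketch builds the same POST request with the JDK's `java.net.http` API (Java 11 or later). It only constructs the request and does not send it. The class name, method name, and sample values are hypothetical, and the sketch assumes the caller already holds an OAuth access token (for example, one printed by `gcloud auth print-access-token`).

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative sketch: builds, but does not send, the templates:launch request. */
public class LaunchRequestSketch {

  public static HttpRequest buildLaunchRequest(
      String projectId, String location, String version, String jsonBody, String accessToken) {
    // Same URL shape as the POST line in the API example.
    String url =
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
            + "/Cassandra_To_Cloud_Bigtable";
    return HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer " + accessToken) // token is supplied by the caller
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
        .build();
  }

  public static void main(String[] args) {
    // Hypothetical sample values; replace them as described in the list above.
    String body =
        "{\"jobName\": \"cassandra-copy\", \"parameters\": {"
            + "\"bigtableProjectId\": \"my-project\", "
            + "\"bigtableInstanceId\": \"my-instance\", "
            + "\"bigtableTableId\": \"my-table\", "
            + "\"cassandraHosts\": \"10.0.0.1\", "
            + "\"cassandraKeyspace\": \"my_keyspace\", "
            + "\"cassandraTable\": \"my_table\"}}";
    HttpRequest request =
        buildLaunchRequest("my-project", "us-central1", "latest", body, "ACCESS_TOKEN");
    System.out.println(request.method() + " " + request.uri());
  }
}
```

To send the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and inspect the returned status code and body.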

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.datastax.driver.core.Session;
import com.google.cloud.teleport.bigtable.CassandraToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.io.cassandra.Mapper;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;

/**
 * This Dataflow Template performs a one off copy of one table from Apache Cassandra to Cloud
 * Bigtable. It is designed to require minimal configuration and aims to replicate the table
 * structure in Cassandra as closely as possible in Cloud Bigtable. To run the pipeline go to
 * "Create a job from Template", enter the required configuration and press "Run job"
 *
 * <p>The minimum required configuration required to run the pipeline is:
 *
 * <ul>
 *   <li><b>cassandraHosts:</b> The hosts of the Cassandra nodes in a comma separated value list.
 *   <li><b>cassandraPort:</b> The tcp port where Cassandra can be reached on the nodes.
 *   <li><b>cassandraKeyspace:</b> The Cassandra keyspace where the table is located.
 *   <li><b>cassandraTable:</b> The Cassandra table to be copied.
 *   <li><b>bigtableProjectId:</b> The Project ID of the Bigtable instance where the Cassandra table
 *       should be copied.
 *   <li><b>bigtableInstanceId:</b> The Bigtable Instance ID where the Cassandra table should be
 *       copied.
 *   <li><b>bigtableTableId:</b> The name of the Bigtable table where the Cassandra table should be
 *       copied.
 * </ul>
 */
@Template(
    name = "Cassandra_To_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Cassandra to Cloud Bigtable",
    description = "A pipeline to import a Apache Cassandra table into Cloud Bigtable.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
final class CassandraToBigtable {

  public interface Options extends PipelineOptions {

    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[a-zA-Z0-9\\.\\-,]*$"},
        description = "Cassandra Hosts",
        helpText = "Comma separated value list of hostnames or ips of the Cassandra nodes.")
    ValueProvider<String> getCassandraHosts();

    @SuppressWarnings("unused")
    void setCassandraHosts(ValueProvider<String> hosts);

    @TemplateParameter.Text(
        order = 2,
        optional = true,
        regexes = {
          "^([0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])$"
        },
        description = "Cassandra Port",
        helpText = "The port where cassandra can be reached. Defaults to 9042.")
    @Default.Integer(9042)
    ValueProvider<Integer> getCassandraPort();

    @SuppressWarnings("unused")
    void setCassandraPort(ValueProvider<Integer> port);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"^[a-zA-Z0-9][a-zA-Z0-9_]{0,47}$"},
        description = "Cassandra Keyspace",
        helpText = "Cassandra Keyspace where the table to be migrated can be located.")
    ValueProvider<String> getCassandraKeyspace();

    @SuppressWarnings("unused")
    void setCassandraKeyspace(ValueProvider<String> keyspace);

    @TemplateParameter.Text(
        order = 4,
        regexes = {"^[a-zA-Z][a-zA-Z0-9_]*$"},
        description = "Cassandra Table",
        helpText = "The name of the Cassandra table to Migrate")
    ValueProvider<String> getCassandraTable();

    @SuppressWarnings("unused")
    void setCassandraTable(ValueProvider<String> cassandraTable);

    @TemplateParameter.ProjectId(
        order = 5,
        description = "Bigtable Project ID",
        helpText = "The Project ID where the target Bigtable Instance is running.")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 6,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Target Bigtable Instance",
        helpText = "The target Bigtable Instance where you want to write the data.")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> bigtableInstanceId);

    @TemplateParameter.Text(
        order = 7,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Target Bigtable Table",
        helpText = "The target Bigtable table where you want to write the data.")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> bigtableTableId);

    @TemplateParameter.Text(
        order = 8,
        optional = true,
        regexes = {"[-_.a-zA-Z0-9]+"},
        description = "The Default Bigtable Column Family",
        helpText =
            "This specifies the default column family to write data into. If no columnFamilyMapping is specified all Columns will be written into this column family. Default value is \"default\"")
    @Default.String("default")
    ValueProvider<String> getDefaultColumnFamily();

    @SuppressWarnings("unused")
    void setDefaultColumnFamily(ValueProvider<String> defaultColumnFamily);

    @TemplateParameter.Text(
        order = 9,
        optional = true,
        description = "The Row Key Separator",
        helpText =
            "All primary key fields will be appended to form your Bigtable Row Key. The rowKeySeparator allows you to specify a character separator. Default separator is '#'.")
    @Default.String("#")
    ValueProvider<String> getRowKeySeparator();

    @SuppressWarnings("unused")
    void setRowKeySeparator(ValueProvider<String> rowKeySeparator);

    @TemplateParameter.Boolean(
        order = 10,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to copy one Cassandra table to Cloud Bigtable.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    // Split the Cassandra Hosts value provider into a list value provider.
    ValueProvider.NestedValueProvider<List<String>, String> hosts =
        ValueProvider.NestedValueProvider.of(
            options.getCassandraHosts(),
            (SerializableFunction<String, List<String>>) value -> Arrays.asList(value.split(",")));

    Pipeline p = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    // Create a factory method to inject the CassandraRowMapperFn to allow custom type mapping.
    SerializableFunction<Session, Mapper> cassandraObjectMapperFactory =
        new CassandraRowMapperFactory(options.getCassandraTable(), options.getCassandraKeyspace());

    CassandraIO.Read<Row> source =
        CassandraIO.<Row>read()
            .withHosts(hosts)
            .withPort(options.getCassandraPort())
            .withKeyspace(options.getCassandraKeyspace())
            .withTable(options.getCassandraTable())
            .withMapperFactoryFn(cassandraObjectMapperFactory)
            .withEntity(Row.class)
            .withCoder(SerializableCoder.of(Row.class));

    BigtableIO.Write sink =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    p.apply("Read from Cassandra", source)
        .apply(
            "Convert Row",
            ParDo.of(
                BeamRowToBigtableFn.createWithSplitLargeRows(
                    options.getRowKeySeparator(),
                    options.getDefaultColumnFamily(),
                    options.getSplitLargeRows(),
                    BeamRowToBigtableFn.MAX_MUTATION_PER_REQUEST)))
        .apply("Write to Bigtable", sink);
    p.run();
  }
}

MongoDB to BigQuery

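The MongoDB to BigQuery template is a batch pipeline that reads documents from MongoDB and writes them to BigQuery as specified by the userOption parameter.

Requirements for this pipeline:

The target BigQuery dataset must exist.
The source MongoDB instance must be accessible from the Dataflow worker machines.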

Template parameters

Parameter	Description
`mongoDbUri`	The MongoDB connection URI, in the format `mongodb+srv://:@`.
`database`	The database in MongoDB to read the collection from. For example: `my-db`.
`collection`	The name of the collection inside the MongoDB database. For example: `my-collection`.
`outputTableSpec`	The BigQuery table to write to. For example: `bigquery-project:dataset.output_table`.
`userOption`	`FLATTEN` or `NONE`. `FLATTEN` flattens the documents to the first level. `NONE` stores the whole document as a JSON string.
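
The difference between the two `userOption` values can be sketched in plain Java, with a `Map` standing in for a MongoDB document. This is an illustration under stated assumptions, not the template's code: the class name, the `source_data` column name, and the choice to serialize nested values as JSON strings under `FLATTEN` are all hypothetical, and the tiny JSON writer skips string escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of FLATTEN vs. NONE; a Map stands in for a MongoDB document. */
public class UserOptionSketch {

  /** Converts a document to a row according to the given userOption. */
  public static Map<String, Object> toRow(Map<String, Object> document, String userOption) {
    Map<String, Object> row = new LinkedHashMap<>();
    if ("FLATTEN".equals(userOption)) {
      // Each first-level field becomes its own column.
      // Assumption: nested documents are kept as JSON strings.
      for (Map.Entry<String, Object> field : document.entrySet()) {
        Object value = field.getValue();
        row.put(field.getKey(), value instanceof Map ? toJson(value) : value);
      }
    } else {
      // NONE: the whole document is stored as one JSON string.
      // "source_data" is a hypothetical column name.
      row.put("source_data", toJson(document));
    }
    return row;
  }

  /** Minimal JSON writer for maps, numbers, booleans, and strings (no escaping). */
  static String toJson(Object value) {
    if (value == null) {
      return "null";
    }
    if (value instanceof Map) {
      StringBuilder json = new StringBuilder("{");
      boolean first = true;
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        if (!first) {
          json.append(",");
        }
        json.append('"').append(entry.getKey()).append("\":").append(toJson(entry.getValue()));
        first = false;
      }
      return json.append("}").toString();
    }
    if (value instanceof Number || value instanceof Boolean) {
      return value.toString();
    }
    return "\"" + value + "\"";
  }

  public static void main(String[] args) {
    Map<String, Object> address = new LinkedHashMap<>();
    address.put("city", "Tokyo");
    Map<String, Object> document = new LinkedHashMap<>();
    document.put("name", "alice");
    document.put("address", address);

    System.out.println(toRow(document, "FLATTEN"));
    System.out.println(toRow(document, "NONE"));
  }
}
```

In this sketch `FLATTEN` yields one column per top-level field, so the BigQuery schema depends on the documents, while `NONE` yields a fixed single-column layout regardless of document shape.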

Running the MongoDB to BigQuery template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the MongoDB to BigQuery template.
In the parameter fields that appear, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery \
    --parameters \
outputTableSpec=OUTPUT_TABLE_SPEC,\
mongoDbUri=MONGO_DB_URI,\
database=DATABASE,\
collection=COLLECTION,\
userOption=USER_OPTION

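Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
OUTPUT_TABLE_SPEC: the name of your target BigQuery table
MONGO_DB_URI: your MongoDB URI
DATABASE: your MongoDB database
COLLECTION: your MongoDB collection
USER_OPTION: FLATTEN or NONE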

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "outputTableSpec": "OUTPUT_TABLE_SPEC",
          "mongoDbUri": "MONGO_DB_URI",
          "database": "DATABASE",
          "collection": "COLLECTION",
          "userOption": "USER_OPTION"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery"
   }
}

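Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template might contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
OUTPUT_TABLE_SPEC: the name of your target BigQuery table
MONGO_DB_URI: your MongoDB URI
DATABASE: your MongoDB database
COLLECTION: your MongoDB collection
USER_OPTION: FLATTEN or NONE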

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.mongodb.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.BigQueryWriteOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.JavascriptDocumentTransformerOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.MongoDbOptions;
import com.google.cloud.teleport.v2.mongodb.templates.MongoDbToBigQuery.Options;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiBatchOptions;
import com.google.cloud.teleport.v2.transforms.JavascriptDocumentTransformer.TransformDocumentViaJavascript;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import java.io.IOException;
import javax.script.ScriptException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.bson.Document;

/**
 * The {@link MongoDbToBigQuery} pipeline is a batch pipeline which ingests data from MongoDB and
 * outputs the resulting records to BigQuery.
 */
@Template(
    name = "MongoDB_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "MongoDB to BigQuery",
    description =
        "A batch pipeline which reads data documents from MongoDB and writes them to BigQuery.",
    optionsClass = Options.class,
    flexContainerName = "mongodb-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public class MongoDbToBigQuery {
  /**
   * Options supported by {@link MongoDbToBigQuery}
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions,
          MongoDbOptions,
          BigQueryWriteOptions,
          BigQueryStorageApiBatchOptions,
          JavascriptDocumentTransformerOptions {}

  private static class ParseAsDocumentsFn extends DoFn<String, Document> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  public static void main(String[] args)
      throws ScriptException, IOException, NoSuchMethodException {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    BigQueryIOUtils.validateBQStorageApiOptionsBatch(options);

    run(options);
  }

  public static boolean run(Options options)
      throws ScriptException, IOException, NoSuchMethodException {
    Pipeline pipeline = Pipeline.create(options);
    String userOption = options.getUserOption();

    TableSchema bigquerySchema;

    if (options.getJavascriptDocumentTransformFunctionName() != null
        && options.getJavascriptDocumentTransformGcsPath() != null) {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchemaForUDF(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getJavascriptDocumentTransformGcsPath(),
              options.getJavascriptDocumentTransformFunctionName(),
              options.getUserOption());
    } else {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchema(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getUserOption());
    }

    pipeline
        .apply(
            "Read Documents",
            MongoDbIO.read()
                .withUri(options.getMongoDbUri())
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection()))
        .apply(
            "UDF",
            TransformDocumentViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptDocumentTransformGcsPath())
                .setFunctionName(options.getJavascriptDocumentTransformFunctionName())
                .build())
        .apply(
            "Transform to TableRow",
            ParDo.of(
                new DoFn<Document, TableRow>() {

                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document document = c.element();
                    TableRow row = MongoDbUtils.getTableSchema(document, userOption);
                    c.output(row);
                  }
                }))
        .apply(
            "Write to Bigquery",
            BigQueryIO.writeTableRows()
                .to(options.getOutputTableSpec())
                .withSchema(bigquerySchema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    pipeline.run();
    return true;
  }
}
