Google-provided utility templates

Google provides a set of open source Dataflow templates. For general information about templates, see Dataflow templates. For a list of all Google-provided templates, see Get started with Google-provided templates.

This guide documents the utility templates.

File Format Conversion (Avro, Parquet, CSV)

The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.

The following format conversions are supported:

CSV to Avro
CSV to Parquet
Avro to Parquet
Parquet to Avro
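
The list of supported conversions can be checked before a job is submitted. The following is a minimal sketch, not part of the template: the class `SupportedConversions` and its method are hypothetical names, and the lookup table simply restates the four pairs listed above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of the template) that restates the supported conversions. */
final class SupportedConversions {

  // Input format -> output formats, copied from the list in this guide.
  private static final Map<String, Set<String>> SUPPORTED =
      Map.of(
          "csv", Set.of("avro", "parquet"),
          "avro", Set.of("parquet"),
          "parquet", Set.of("avro"));

  /** Returns true when the (input, output) pair is one of the listed conversions. */
  static boolean isSupported(String inputFormat, String outputFormat) {
    String in = inputFormat.toLowerCase(Locale.ROOT);
    String out = outputFormat.toLowerCase(Locale.ROOT);
    return SUPPORTED.getOrDefault(in, Set.of()).contains(out);
  }

  public static void main(String[] args) {
    System.out.println(isSupported("csv", "avro")); // prints "true"
    System.out.println(isSupported("avro", "csv")); // prints "false"
  }
}
```

A check like this only mirrors the documentation; the template performs its own validation when the job starts.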

Requirements for this pipeline:

The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter	Description
`inputFileFormat`	The input file format. Must be one of `[csv, avro, parquet]`.
`outputFileFormat`	The output file format. Must be one of `[avro, parquet]`.
`inputFileSpec`	The Cloud Storage path pattern for the input files. For example, `gs://bucket-name/path/*.csv`
`outputBucket`	The Cloud Storage folder to write output files to. This path must end with a slash. For example, `gs://bucket-name/output/`
`schema`	The Cloud Storage path to the Avro schema file. For example, `gs://bucket-name/schema/my-schema.avsc`
`containsHeaders`	(Optional) Whether the input CSV files contain a header record (true/false). The default value is `false`. Only required when reading CSV files.
`csvFormat`	(Optional) The CSV format specification to use for parsing records. The default value is `Default`. For details, see Apache Commons CSV Format.
`delimiter`	(Optional) The field delimiter used by the input CSV files.
`outputFilePrefix`	(Optional) The output file prefix. The default value is `output`.
`numShards`	(Optional) The number of output file shards.

Running the File Format Conversion template
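
The constraints in the table above (allowed format values, the trailing slash on `outputBucket`) are easy to get wrong when parameters are assembled by hand. The sketch below checks them locally. It is an illustration only: `ConversionParameterCheck` is a hypothetical name, and treating every parameter not marked optional as required is an assumption drawn from the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical pre-flight check for the parameters described in the table. */
final class ConversionParameterCheck {

  private static final Set<String> INPUT_FORMATS = Set.of("csv", "avro", "parquet");
  private static final Set<String> OUTPUT_FORMATS = Set.of("avro", "parquet");

  // Assumption: parameters not marked "(Optional)" in the table are required.
  private static final List<String> REQUIRED =
      List.of("inputFileFormat", "outputFileFormat", "inputFileSpec", "outputBucket", "schema");

  /** Returns a list of human-readable problems; an empty list means no problem was found. */
  static List<String> problems(Map<String, String> params) {
    List<String> problems = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!params.containsKey(name)) {
        problems.add("missing " + name);
      }
    }
    String in = params.get("inputFileFormat");
    if (in != null && !INPUT_FORMATS.contains(in)) {
      problems.add("inputFileFormat must be one of csv, avro, parquet");
    }
    String out = params.get("outputFileFormat");
    if (out != null && !OUTPUT_FORMATS.contains(out)) {
      problems.add("outputFileFormat must be one of avro, parquet");
    }
    String bucket = params.get("outputBucket");
    if (bucket != null && !bucket.endsWith("/")) {
      problems.add("outputBucket must end with a slash");
    }
    return problems;
  }

  public static void main(String[] args) {
    Map<String, String> params =
        Map.of(
            "inputFileFormat", "csv",
            "outputFileFormat", "avro",
            "inputFileSpec", "gs://bucket-name/path/*.csv",
            "outputBucket", "gs://bucket-name/output",
            "schema", "gs://bucket-name/schema/my-schema.avsc");
    System.out.println(problems(params)); // prints "[outputBucket must end with a slash]"
  }
}
```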

Console

Go to the Dataflow [Create job from template] page.

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Convert file formats template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/File_Format_Conversion \
    --parameters \
inputFileFormat=INPUT_FORMAT,\
outputFileFormat=OUTPUT_FORMAT,\
inputFileSpec=INPUT_FILES,\
schema=SCHEMA,\
outputBucket=OUTPUT_FOLDER

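Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
INPUT_FORMAT: the input file format. Must be one of [csv, avro, parquet].
OUTPUT_FORMAT: the output file format. Must be one of [avro, parquet].
INPUT_FILES: the path pattern for the input files
OUTPUT_FOLDER: the Cloud Storage folder for the output files
SCHEMA: the path to the Avro schema file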
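
When the command is launched from a program rather than typed by hand, the placeholders above map directly onto an argument list. The following sketch assembles that list without executing it. `FlexTemplateCommand` is a hypothetical helper; the flags and the template path are the ones shown in the command above, and parameter values that themselves contain commas are not handled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that builds the gcloud argument list shown in this guide. */
final class FlexTemplateCommand {

  static List<String> build(
      String jobName,
      String projectId,
      String region,
      String version,
      Map<String, String> parameters) {
    // --parameters takes a single comma-separated list of key=value pairs.
    String joined =
        parameters.entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(","));

    List<String> args =
        new ArrayList<>(List.of("gcloud", "beta", "dataflow", "flex-template", "run", jobName));
    args.add("--project=" + projectId);
    args.add("--region=" + region);
    args.add(
        "--template-file-gcs-location=gs://dataflow-templates/"
            + version
            + "/flex/File_Format_Conversion");
    args.add("--parameters");
    args.add(joined);
    return args;
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>(); // keeps insertion order
    parameters.put("inputFileFormat", "csv");
    parameters.put("outputFileFormat", "avro");
    parameters.put("inputFileSpec", "gs://bucket-name/path/*.csv");
    parameters.put("schema", "gs://bucket-name/schema/my-schema.avsc");
    parameters.put("outputBucket", "gs://bucket-name/output/");
    System.out.println(String.join(" ", build("my-job", "my-project", "us-central1", "latest", parameters)));
  }
}
```

The resulting list could be handed to `ProcessBuilder`, which avoids shell quoting problems with the `*` in the input pattern.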

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileFormat": "INPUT_FORMAT",
          "outputFileFormat": "OUTPUT_FORMAT",
          "inputFileSpec": "INPUT_FILES",
          "schema": "SCHEMA",
          "outputBucket": "OUTPUT_FOLDER"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/File_Format_Conversion"
   }
}

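Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
INPUT_FORMAT: the input file format. Must be one of [csv, avro, parquet].
OUTPUT_FORMAT: the output file format. Must be one of [avro, parquet].
INPUT_FILES: the path pattern for the input files
OUTPUT_FOLDER: the Cloud Storage folder for the output files
SCHEMA: the path to the Avro schema file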
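
The same request can be prepared from Java with the JDK's built-in HTTP client. The sketch below only constructs the request; it does not send it. `FlexTemplateLaunchRequest` is a hypothetical helper, the access token is assumed to be obtained elsewhere (for example with `gcloud auth print-access-token`), and the JSON is concatenated by hand on the assumption that no value contains quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that prepares the flexTemplates:launch request shown above. */
final class FlexTemplateLaunchRequest {

  static String launchUrl(String projectId, String location) {
    return "https://dataflow.googleapis.com/v1b3/projects/"
        + projectId
        + "/locations/"
        + location
        + "/flexTemplates:launch";
  }

  // Assumption: values contain no characters that need JSON escaping.
  static String body(String jobName, String version, Map<String, String> parameters) {
    String params =
        parameters.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
            .collect(Collectors.joining(", "));
    return "{\"launch_parameter\": {"
        + "\"jobName\": \"" + jobName + "\", "
        + "\"parameters\": {" + params + "}, "
        + "\"containerSpecGcsPath\": \"gs://dataflow-templates/"
        + version
        + "/flex/File_Format_Conversion\""
        + "}}";
  }

  static HttpRequest build(String projectId, String location, String accessToken, String body) {
    return HttpRequest.newBuilder(URI.create(launchUrl(projectId, location)))
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer " + accessToken)
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    String body = body("my-job", "latest", Map.of("inputFileFormat", "csv"));
    HttpRequest request = build("my-project", "us-central1", "ACCESS_TOKEN", body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

Sending the request would be a call to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; in real use a JSON library is a safer way to build the body.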

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.FileFormatConversion.FileFormatConversionOptions;
import com.google.cloud.teleport.v2.transforms.AvroConverters.AvroOptions;
import com.google.cloud.teleport.v2.transforms.CsvConverters.CsvPipelineOptions;
import com.google.cloud.teleport.v2.transforms.ParquetConverters.ParquetOptions;
import java.util.EnumMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link FileFormatConversion} pipeline takes in an input file, converts it to a desired format
 * and saves it to Cloud Storage. Supported file transformations are:
 *
 * <ul>
 *   <li>Csv to Avro
 *   <li>Csv to Parquet
 *   <li>Avro to Parquet
 *   <li>Parquet to Avro
 * </ul>
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>Input file exists in Google Cloud Storage.
 *   <li>Google Cloud Storage output bucket exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Set vars for execution
 * export INPUT_FILE_FORMAT=Csv
 * export OUTPUT_FILE_FORMAT=Avro
 * export AVRO_SCHEMA_PATH=gs://path/to/avro/schema
 * export HEADERS=false
 * export DELIMITER=","
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="csv-to-avro-`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json"     \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${BUCKET_NAME}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${BUCKET_NAME}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *            "inputFileFormat":"'$INPUT_FILE_FORMAT'",
 *            "outputFileFormat":"'$OUTPUT_FILE_FORMAT'",
 *            "inputFileSpec":"'$BUCKET_NAME/path/to/input-file'",
 *            "outputBucket":"'$BUCKET_NAME/path/to/output-location/'",
 *            "containsHeaders":"'$HEADERS'",
 *            "schema":"'$AVRO_SCHEMA_PATH'",
 *            "outputFilePrefix":"output-file",
 *            "numShards":"3",
 *            "delimiter":"'$DELIMITER'"
 *         }
 *       }
 *      '
 * </pre>
 */
@Template(
    name = "File_Format_Conversion",
    category = TemplateCategory.UTILITIES,
    displayName = "Convert file formats between Avro, Parquet & CSV",
    description = "A pipeline to convert file formats between Avro, Parquet & csv.",
    optionsClass = FileFormatConversionOptions.class,
    optionalOptions = {"deadletterTable"},
    flexContainerName = "file-format-conversion",
    contactInformation = "https://cloud.google.com/support")
public class FileFormatConversion {

  /** Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(FileFormatConversion.class);

  private static EnumMap<ValidFileFormats, String> validFileFormats =
      new EnumMap<ValidFileFormats, String>(ValidFileFormats.class);

  /**
   * The {@link FileFormatConversionOptions} provides the custom execution options passed by the
   * executor at the command-line.
   */
  public interface FileFormatConversionOptions
      extends PipelineOptions, CsvPipelineOptions, AvroOptions, ParquetOptions {
    @TemplateParameter.Enum(
        order = 1,
        enumOptions = {"avro", "csv", "parquet"},
        description = "File format of the input files.",
        helpText = "File format of the input files. Needs to be either avro, parquet or csv.")
    @Required
    String getInputFileFormat();

    void setInputFileFormat(String inputFileFormat);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"avro", "parquet"},
        description = "File format of the output files.",
        helpText = "File format of the output files. Needs to be either avro or parquet.")
    @Required
    String getOutputFileFormat();

    void setOutputFileFormat(String outputFileFormat);
  }

  /** The {@link ValidFileFormats} enum contains all valid file formats. */
  public enum ValidFileFormats {
    CSV,
    AVRO,
    PARQUET
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    FileFormatConversionOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(FileFormatConversionOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   * @throws RuntimeException thrown if incorrect file formats are passed.
   */
  public static PipelineResult run(FileFormatConversionOptions options) {
    String inputFileFormat = options.getInputFileFormat().toUpperCase();
    String outputFileFormat = options.getOutputFileFormat().toUpperCase();

    validFileFormats.put(ValidFileFormats.CSV, "CSV");
    validFileFormats.put(ValidFileFormats.AVRO, "AVRO");
    validFileFormats.put(ValidFileFormats.PARQUET, "PARQUET");

    if (!validFileFormats.containsValue(inputFileFormat)) {
      throw new IllegalArgumentException("Invalid input file format.");
    }
    if (!validFileFormats.containsValue(outputFileFormat)) {
      throw new IllegalArgumentException("Invalid output file format.");
    }
    if (inputFileFormat.equals(outputFileFormat)) {
      throw new IllegalArgumentException("Input and output file format cannot be the same.");
    }

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply(
        inputFileFormat + " to " + outputFileFormat,
        FileFormatConversionFactory.FileFormat.newBuilder()
            .setOptions(options)
            .setInputFileFormat(inputFileFormat)
            .setOutputFileFormat(outputFileFormat)
            .build());

    return pipeline.run();
  }
}

Bulk Compress Cloud Storage Files

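The Bulk Compress Cloud Storage Files template is a batch pipeline that compresses files on Cloud Storage to a specified location. This template can be useful when you need to compress large batches of files as part of a periodic archival process. The supported compression modes are BZIP2, DEFLATE, and GZIP. Files written to the destination location keep the original filename, with the compression mode extension appended. The appended extension is one of .bzip2, .deflate, or .gz.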
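
The naming rule can be illustrated with a few lines of Java. This is a sketch under stated assumptions: `CompressedNames` is a hypothetical class, and the extension strings are copied from the description above. In the template itself the suffix comes from the compression implementation, so the exact string may differ.

```java
import java.util.Map;

/** Hypothetical illustration of the output naming rule described above. */
final class CompressedNames {

  // Extensions as listed in this guide; the running template may use a different suffix.
  private static final Map<String, String> EXTENSIONS =
      Map.of("BZIP2", ".bzip2", "DEFLATE", ".deflate", "GZIP", ".gz");

  /** Appends the extension for the given compression mode to the original filename. */
  static String outputName(String originalName, String mode) {
    String extension = EXTENSIONS.get(mode);
    if (extension == null) {
      throw new IllegalArgumentException("Unsupported compression mode: " + mode);
    }
    return originalName + extension;
  }

  public static void main(String[] args) {
    System.out.println(outputName("demo.txt", "GZIP")); // prints "demo.txt.gz"
  }
}
```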

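Any errors that occur during the compression process are written to the failure file in CSV format (filename, error message). If no failures occur while the pipeline runs, the failure file is still created but contains no error records.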
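
After a job finishes, the failure file can be read back to find the files that need another attempt. The sketch below parses its lines into (filename, error) pairs. `FailureFileReader` is a hypothetical class; it assumes file names contain no commas, splits at the first comma so that commas inside the error message survive, and skips a `Filename,Error` header line if one is present.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the failure file's "filename,error message" lines. */
final class FailureFileReader {

  /** Returns one {filename, error} pair per failure record; header and blank lines are skipped. */
  static List<String[]> parse(List<String> lines) {
    List<String[]> failures = new ArrayList<>();
    for (String line : lines) {
      if (line.isBlank() || line.equals("Filename,Error")) {
        continue;
      }
      // Assumption: the filename has no comma, so the first comma ends it.
      int comma = line.indexOf(',');
      if (comma < 0) {
        continue; // not a well-formed record
      }
      failures.add(new String[] {line.substring(0, comma), line.substring(comma + 1)});
    }
    return failures;
  }

  public static void main(String[] args) {
    List<String> lines =
        List.of("Filename,Error", "gs://bucket-name/uncompressed/a.txt,Permission denied");
    for (String[] failure : parse(lines)) {
      System.out.println(failure[0] + " failed: " + failure[1]);
    }
    // prints "gs://bucket-name/uncompressed/a.txt failed: Permission denied"
  }
}
```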

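Requirements for this pipeline:

The compression must be in one of the following formats: BZIP2, DEFLATE, GZIP.
The output directory must exist before running the pipeline.

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. For example, `gs://bucket-name/uncompressed/*.txt`
`outputDirectory`	The output location to write to. For example, `gs://bucket-name/compressed/`
`outputFailureFile`	The error log output file used to record failures that occur during compression. For example, `gs://bucket-name/compressed/failed.csv`. If there are no failures, the file is still created but is empty. The file contents are in CSV format (filename, error), with one line for each file that fails compression.
`compression`	The compression algorithm used to compress the matched files. Must be one of `BZIP2`, `DEFLATE`, `GZIP`.

Running the Bulk Compress Cloud Storage Files template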

Console

Go to the Dataflow [Create job from template] page.

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Bulk Compress Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/uncompressed/*.txt,\
outputDirectory=gs://BUCKET_NAME/compressed,\
outputFailureFile=gs://BUCKET_NAME/failed/failure.csv,\
compression=COMPRESSION

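Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
BUCKET_NAME: the name of your Cloud Storage bucket
COMPRESSION: your choice of compression algorithm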

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/uncompressed/*.txt",
       "outputDirectory": "gs://BUCKET_NAME/compressed",
       "outputFailureFile": "gs://BUCKET_NAME/failed/failure.csv",
       "compression": "COMPRESSION"
   },
   "environment": { "zone": "us-central1-f" }
}

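Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
BUCKET_NAME: the name of your Cloud Storage bucket
COMPRESSION: your choice of compression algorithm

Template source code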

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BulkCompressor.Options;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link BulkCompressor} is a batch pipeline that compresses files matched by an input file
 * pattern and outputs them to a specified file location. This pipeline can be useful when you need
 * to compress large batches of files as part of a periodic archival process. The supported
 * compression modes are: <code>BZIP2</code>, <code>DEFLATE</code>, <code>GZIP</code>, <code>ZIP
 * </code>. Files output to the destination location will follow a naming schema of original
 * filename appended with the compression mode extension. The extensions appended will be one of:
 * <code>.bzip2</code>, <code>.deflate</code>, <code>.gz</code>, <code>.zip</code> as determined by
 * the compression type.
 *
 * <p>Any errors which occur during the compression process will be output to the failure file in
 * CSV format of filename, error message. If no failures occur during execution, the error file will
 * still be created but will contain no error records.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The compression must be in one of the following formats: <code>BZIP2</code>, <code>DEFLATE
 *       </code>, <code>GZIP</code>, <code>ZIP</code>.
 *   <li>The output directory must exist prior to pipeline execution.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_FOLDER=gs://${PROJECT_ID}/dataflow/pipelines/bulk-compressor
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.BulkCompressor \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=bulk-compressor-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputFilePattern=${PIPELINE_FOLDER}/test/uncompressed/*,\
 * outputDirectory=${PIPELINE_FOLDER}/test/compressed,\
 * outputFailureFile=${PIPELINE_FOLDER}/test/failure/failed-${JOB_NAME}.csv,\
 * compression=GZIP"
 * </pre>
 */
@Template(
    name = "Bulk_Compress_GCS_Files",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Compress Files on Cloud Storage",
    description = "Batch pipeline. Compresses files on Cloud Storage to a specified location.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BulkCompressor {

  /** The logger to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(BulkCompressor.class);

  /** The tag used to identify the main output of the {@link Compressor}. */
  private static final TupleTag<String> COMPRESSOR_MAIN_OUT = new TupleTag<String>() {};

  /** The tag used to identify the dead-letter output of the {@link Compressor}. */
  private static final TupleTag<KV<String, String>> DEADLETTER_TAG =
      new TupleTag<KV<String, String>>() {};

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.txt")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 2,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFile(
        order = 3,
        description = "Output failure file",
        helpText =
            "The error log output file to use for write failures that occur during compression. The contents will be one line for "
                + "each file which failed compression. Note that this parameter will "
                + "allow the pipeline to continue processing in the event of a failure.",
        example = "gs://your-bucket/compressed/failed.csv")
    @Required
    ValueProvider<String> getOutputFailureFile();

    void setOutputFailureFile(ValueProvider<String> value);

    @TemplateParameter.Enum(
        order = 4,
        enumOptions = {"BZIP2", "DEFLATE", "GZIP"},
        description = "Compression",
        helpText =
            "The compression algorithm used to compress the matched files. Valid algorithms: BZIP2, DEFLATE, GZIP")
    @Required
    ValueProvider<Compression> getCompression();

    void setCompression(ValueProvider<Compression> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * BulkCompressor#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *   1) Find all files matching the input pattern
     *   2) Compress the files found and output them to the output directory
     *   3) Write any errors to the failure output file
     */
    PCollectionTuple compressOut =
        pipeline
            .apply("Match File(s)", FileIO.match().filepattern(options.getInputFilePattern()))
            .apply(
                "Compress File(s)",
                ParDo.of(new Compressor(options.getOutputDirectory(), options.getCompression()))
                    .withOutputTags(COMPRESSOR_MAIN_OUT, TupleTagList.of(DEADLETTER_TAG)));

    compressOut
        .get(DEADLETTER_TAG)
        .apply(
            "Format Errors",
            MapElements.into(TypeDescriptors.strings())
                .via(kv -> String.format("%s,%s", kv.getKey(), kv.getValue())))
        .apply(
            "Write Error File",
            TextIO.write()
                .to(options.getOutputFailureFile())
                .withHeader("Filename,Error")
                .withoutSharding());

    return pipeline.run();
  }

  /**
   * The {@link Compressor} accepts {@link MatchResult.Metadata} from the FileSystems API and
   * compresses each file to an output location. Any compression failures which occur during
   * execution will be output to a separate output for further processing.
   */
  @SuppressWarnings("serial")
  public static class Compressor extends DoFn<MatchResult.Metadata, String> {

    private final ValueProvider<String> destinationLocation;
    private final ValueProvider<Compression> compressionValue;

    Compressor(ValueProvider<String> destinationLocation, ValueProvider<Compression> compression) {
      this.destinationLocation = destinationLocation;
      this.compressionValue = compression;
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
      ResourceId inputFile = context.element().resourceId();
      Compression compression = compressionValue.get();

      // Add the compression extension to the output filename. Example: demo.txt -> demo.txt.gz
      String outputFilename = inputFile.getFilename() + compression.getSuggestedSuffix();

      // Resolve the necessary resources to perform the transfer
      ResourceId outputDir = FileSystems.matchNewResource(destinationLocation.get(), true);
      ResourceId outputFile =
          outputDir.resolve(outputFilename, StandardResolveOptions.RESOLVE_FILE);
      ResourceId tempFile =
          outputDir.resolve("temp-" + outputFilename, StandardResolveOptions.RESOLVE_FILE);

      // Perform the copy of the compressed channel to the destination.
      try (ReadableByteChannel readerChannel = FileSystems.open(inputFile)) {
        try (WritableByteChannel writerChannel =
            compression.writeCompressed(FileSystems.create(tempFile, MimeTypes.BINARY))) {

          // Execute the copy to the temporary file
          ByteStreams.copy(readerChannel, writerChannel);
        }

        // Rename the temporary file to the output file
        FileSystems.rename(ImmutableList.of(tempFile), ImmutableList.of(outputFile));

        // Output the path to the compressed file
        context.output(outputFile.toString());
      } catch (IOException e) {
        LOG.error("Error occurred during compression of {}", inputFile.toString(), e);
        context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), e.getMessage()));
      }
    }
  }
}

Bulk Decompress Cloud Storage Files

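The Bulk Decompress Cloud Storage Files template is a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you want to use compressed data to minimize network bandwidth costs during a migration, but want to maximize analytical processing speed by operating on uncompressed data after the migration. The pipeline automatically handles multiple compression modes during a single run and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).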
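
Selecting a mode from the file extension amounts to a lookup on the text after the last dot. The following sketch shows the idea only: `DecompressionModes` is a hypothetical class, the extension table is copied from the description above, and the template's own detection logic may differ in detail.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of choosing a decompression mode from a file extension. */
final class DecompressionModes {

  // Extensions as listed in this guide.
  private static final Map<String, String> BY_EXTENSION =
      Map.of(".bzip2", "BZIP2", ".deflate", "DEFLATE", ".gz", "GZIP", ".zip", "ZIP");

  /** Returns the mode for the file's last extension, or empty if the extension is not listed. */
  static Optional<String> modeFor(String filename) {
    int dot = filename.lastIndexOf('.');
    if (dot < 0) {
      return Optional.empty();
    }
    String extension = filename.substring(dot).toLowerCase(Locale.ROOT);
    return Optional.ofNullable(BY_EXTENSION.get(extension));
  }

  public static void main(String[] args) {
    System.out.println(modeFor("gs://bucket-name/compressed/demo.txt.gz")); // prints "Optional[GZIP]"
    System.out.println(modeFor("gs://bucket-name/compressed/notes.txt")); // prints "Optional.empty"
  }
}
```

Because the mode is chosen per file, a single input pattern such as `gs://bucket-name/compressed/*` can mix several compression formats in one run.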

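Requirements for this pipeline:

The files to decompress must be in one of the following formats: Bzip2, Deflate, Gzip, Zip.
The output directory must exist before running the pipeline.

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. For example, `gs://bucket-name/compressed/*.gz`
`outputDirectory`	The output location to write to. For example, `gs://bucket-name/decompressed`
`outputFailureFile`	The error log output file used to record failures that occur during decompression. For example, `gs://bucket-name/decompressed/failed.csv`. If there are no failures, the file is still created but is empty. The file contents are in CSV format (filename, error), with one line for each file that fails decompression.

Running the Bulk Decompress Cloud Storage Files template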

Console

Go to the Dataflow [Create job from template] page.

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Bulk Decompress Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/compressed/*.gz,\
outputDirectory=gs://BUCKET_NAME/decompressed,\
outputFailureFile=OUTPUT_FAILURE_FILE_PATH

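Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
BUCKET_NAME: the name of your Cloud Storage bucket
OUTPUT_FAILURE_FILE_PATH: your choice of path to the file that contains the failure information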

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/compressed/*.gz",
       "outputDirectory": "gs://BUCKET_NAME/decompressed",
       "outputFailureFile": "OUTPUT_FAILURE_FILE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

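Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name (for example, 2021-09-20-00_RC00): uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest version of a template can contain incompatible changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production.
BUCKET_NAME: the name of your Cloud Storage bucket
OUTPUT_FAILURE_FILE_PATH: your choice of path to the file that contains the failure information

Template source code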

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BulkDecompressor.Options;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import com.google.common.io.Files;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.annotation.Nullable;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.QuoteMode;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * This pipeline decompresses file(s) from Google Cloud Storage and re-uploads them to a destination
 * location.
 *
 * <p><b>Parameters</b>
 *
 * <p>The {@code --inputFilePattern} parameter specifies a file glob to process. Files found can be
 * expressed in the following formats:
 *
 * <pre>
 * --inputFilePattern=gs://bucket-name/compressed-dir/*
 * --inputFilePattern=gs://bucket-name/compressed-dir/demo*.gz
 * </pre>
 *
 * <p>The {@code --outputDirectory} parameter can be expressed in the following formats:
 *
 * <pre>
 * --outputDirectory=gs://bucket-name
 * --outputDirectory=gs://bucket-name/decompressed-dir
 * </pre>
 *
 * <p>The {@code --outputFailureFile} parameter indicates the file to write the names of the files
 * which failed decompression and their associated error messages. This file can then be used for
 * subsequent processing by another process outside of Dataflow (e.g. send an email with the
 * failures, etc.). If there are no failures, the file will still be created but will be empty. The
 * failure file structure contains both the file that caused the error and the error message in CSV
 * format. The file will contain one header row and two columns (Filename, Error). The filename
 * output to the failureFile will be the full path of the file for ease of debugging.
 *
 * <pre>
 * --outputFailureFile=gs://bucket-name/decompressed-dir/failed.csv
 * </pre>
 *
 * <p>Example Output File:
 *
 * <pre>
 * Filename,Error
 * gs://docs-demo/compressedFile.gz, File is malformed or not compressed in BZIP2 format.
 * </pre>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.BulkDecompressor \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
 * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://${PROJECT_ID}/compressed-dir/*.gz \
 * --outputDirectory=gs://${PROJECT_ID}/decompressed-dir \
 * --outputFailureFile=gs://${PROJECT_ID}/decompressed-dir/failed.csv"
 * </pre>
 */
@Template(
    name = "Bulk_Decompress_GCS_Files",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Decompress Files on Cloud Storage",
    description =
        "A pipeline which decompresses files on Cloud Storage to a specified location. Supported formats: Bzip2, deflate, and gzip.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BulkDecompressor {

  /** The logger to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(BulkDecompressor.class);

  /**
   * A list of the {@link Compression} values excluding {@link Compression#AUTO} and {@link
   * Compression#UNCOMPRESSED}.
   */
  @VisibleForTesting
  static final Set<Compression> SUPPORTED_COMPRESSIONS =
      Stream.of(Compression.values())
          .filter(value -> value != Compression.AUTO && value != Compression.UNCOMPRESSED)
          .collect(Collectors.toSet());

  /** The error msg given when the pipeline matches a file but cannot determine the compression. */
  @VisibleForTesting
  static final String UNCOMPRESSED_ERROR_MSG =
      "Skipping file %s because it did not match any compression mode (%s)";

  @VisibleForTesting
  static final String MALFORMED_ERROR_MSG =
      "The file resource %s is malformed or not in %s compressed format.";

  /** The tag used to identify the main output of the {@link Decompress} DoFn. */
  @VisibleForTesting
  static final TupleTag<String> DECOMPRESS_MAIN_OUT_TAG = new TupleTag<String>() {};

  /** The tag used to identify the dead-letter sideOutput of the {@link Decompress} DoFn. */
  @VisibleForTesting
  static final TupleTag<KV<String, String>> DEADLETTER_TAG = new TupleTag<KV<String, String>>() {};

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.gz")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 2,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/decompressed/")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFile(
        order = 3,
        description = "The output file for failures during the decompression process",
        helpText =
            "The output file to write failures to during the decompression process. If there are no failures, the file will still be created but will be empty. The contents will be one line for each file which failed decompression in CSV format (Filename, Error). Note that this parameter will allow the pipeline to continue processing in the event of a failure.",
        example = "gs://your-bucket/decompressed/failed.csv")
    @Required
    ValueProvider<String> getOutputFailureFile();

    void setOutputFailureFile(ValueProvider<String> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * BulkDecompressor#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    /*
     * Steps:
     *   1) Find all files matching the input pattern
     *   2) Decompress the files found and output them to the output directory
     *   3) Write any errors to the failure output file
     */

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Run the pipeline over the work items.
    PCollectionTuple decompressOut =
        pipeline
            .apply("MatchFile(s)", FileIO.match().filepattern(options.getInputFilePattern()))
            .apply(
                "DecompressFile(s)",
                ParDo.of(new Decompress(options.getOutputDirectory()))
                    .withOutputTags(DECOMPRESS_MAIN_OUT_TAG, TupleTagList.of(DEADLETTER_TAG)));

    decompressOut
        .get(DEADLETTER_TAG)
        .apply(
            "FormatErrors",
            MapElements.into(TypeDescriptors.strings())
                .via(
                    kv -> {
                      StringWriter stringWriter = new StringWriter();
                      try {
                        CSVPrinter printer =
                            new CSVPrinter(
                                stringWriter,
                                CSVFormat.DEFAULT
                                    .withEscape('\\')
                                    .withQuoteMode(QuoteMode.NONE)
                                    .withRecordSeparator('\n'));
                        printer.printRecord(kv.getKey(), kv.getValue());
                      } catch (IOException e) {
                        throw new RuntimeException(e);
                      }

                      return stringWriter.toString();
                    }))

        // We don't expect error files to be large so we'll create a single
        // file for ease of reprocessing by processes outside of Dataflow.
        .apply(
            "WriteErrorFile",
            TextIO.write()
                .to(options.getOutputFailureFile())
                .withHeader("Filename,Error")
                .withoutSharding());

    return pipeline.run();
  }

  /**
   * Performs the decompression of an object on Google Cloud Storage and uploads the decompressed
   * object back to a specified destination location.
   */
  @SuppressWarnings("serial")
  public static class Decompress extends DoFn<MatchResult.Metadata, String> {

    private final ValueProvider<String> destinationLocation;

    Decompress(ValueProvider<String> destinationLocation) {
      this.destinationLocation = destinationLocation;
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
      ResourceId inputFile = context.element().resourceId();

      // Output a record to the failure file if the file doesn't match a known compression.
      if (!Compression.AUTO.isCompressed(inputFile.toString())) {
        String errorMsg =
            String.format(UNCOMPRESSED_ERROR_MSG, inputFile.toString(), SUPPORTED_COMPRESSIONS);

        context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), errorMsg));
      } else {
        try {
          ResourceId outputFile = decompress(inputFile);
          context.output(outputFile.toString());
        } catch (IOException e) {
          LOG.error(e.getMessage());
          context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), e.getMessage()));
        }
      }
    }

    /**
     * Decompresses the inputFile using the compression detected from its name and returns the
     * location of the decompressed file. The output is first written to a temp file in the
     * output directory, named with the input's extension followed by "-temp-" and the output
     * filename (for example, gz-temp-demo.txt), and is then renamed. If decompression fails, an
     * {@link IOException} is thrown and the caller writes the filename and the associated error
     * to the dead-letter output.
     *
     * @param inputFile The inputFile to decompress.
     * @return A {@link ResourceId} which points to the resulting file from the decompression.
     */
    private ResourceId decompress(ResourceId inputFile) throws IOException {
      // Remove the compressed extension from the file. Example: demo.txt.gz -> demo.txt
      String outputFilename = Files.getNameWithoutExtension(inputFile.toString());

      // Resolve the necessary resources to perform the transfer.
      ResourceId outputDir = FileSystems.matchNewResource(destinationLocation.get(), true);
      ResourceId outputFile =
          outputDir.resolve(outputFilename, StandardResolveOptions.RESOLVE_FILE);
      ResourceId tempFile =
          outputDir.resolve(
              Files.getFileExtension(inputFile.toString()) + "-temp-" + outputFilename,
              StandardResolveOptions.RESOLVE_FILE);

      // Resolve the compression
      Compression compression = Compression.detect(inputFile.toString());

      // Perform the copy of the decompressed channel into the destination.
      try (ReadableByteChannel readerChannel =
          compression.readDecompressed(FileSystems.open(inputFile))) {
        try (WritableByteChannel writerChannel = FileSystems.create(tempFile, MimeTypes.TEXT)) {
          ByteStreams.copy(readerChannel, writerChannel);
        }

        // Rename the temp file to the output file.
        FileSystems.rename(
            ImmutableList.of(tempFile),
            ImmutableList.of(outputFile),
            MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
      } catch (IOException e) {
        String msg = e.getMessage();

        LOG.error("Error occurred during decompression of {}", inputFile.toString(), e);
        throw new IOException(sanitizeDecompressionErrorMsg(msg, inputFile, compression));
      }

      return outputFile;
    }

    /**
     * The error messages coming from the compression library are not consistent across compression
     * modes. Here we'll attempt to unify the messages to inform the user more clearly when we've
     * encountered a file which is not compressed or malformed. Note that GZIP and ZIP compression
     * modes will not throw an exception when a decompression is attempted on a file which is not
     * compressed.
     *
     * @param errorMsg The error message thrown during decompression.
     * @param inputFile The input file which failed decompression.
     * @param compression The compression mode used during decompression.
     * @return The sanitized error message. If the error was not from a malformed file, the same
     *     error message passed will be returned (if not null) or an empty string will be returned
     *     (if null).
     */
    private String sanitizeDecompressionErrorMsg(
        @Nullable String errorMsg, ResourceId inputFile, Compression compression) {
      if (errorMsg != null
          && (errorMsg.contains("not in the BZip2 format")
              || errorMsg.contains("incorrect header check"))) {
        errorMsg = String.format(MALFORMED_ERROR_MSG, inputFile.toString(), compression);
      }

      return errorMsg == null ? "" : errorMsg;
    }
  }
}
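
To make the decompress step easier to follow without a Beam runtime, here is a minimal sketch that mirrors its three ideas with the JDK only: strip the compressed extension, build the temp-file name, and copy the decompressed bytes. It is not the template's code. `GunzipSketch` and its methods are hypothetical, `java.util.zip` stands in for Beam's `Compression` and `FileSystems`, and the name helpers expect a bare file name rather than a full gs:// path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical stand-in for the decompress step; uses java.util.zip instead of Beam. */
final class GunzipSketch {

  /** Drops the last extension of a bare file name: demo.txt.gz -> demo.txt. */
  static String stripLastExtension(String fileName) {
    int dot = fileName.lastIndexOf('.');
    return dot == -1 ? fileName : fileName.substring(0, dot);
  }

  /** Builds a temp name in the shape EXTENSION-temp-NAME: gz-temp-demo.txt. */
  static String tempName(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String extension = dot == -1 ? "" : fileName.substring(dot + 1);
    return extension + "-temp-" + stripLastExtension(fileName);
  }

  /** Compresses bytes with gzip; only needed to produce input for the demo. */
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Copies the decompressed stream into memory, chunk by chunk. */
  static byte[] gunzip(byte[] compressed) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] chunk = new byte[8192];
      int read;
      while ((read = in.read(chunk)) != -1) {
        out.write(chunk, 0, read);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] compressed = gzip("hello".getBytes(StandardCharsets.UTF_8));
    String restored = new String(gunzip(compressed), StandardCharsets.UTF_8);
    System.out.println(tempName("demo.txt.gz") + " -> "
        + stripLastExtension("demo.txt.gz") + ": " + restored);
  }
}
```

Writing to a temp name first and renaming afterwards, as the template does, keeps a partially written file from appearing under the final output name.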

Datastore Bulk Delete (Deprecated)

This template is deprecated and will be removed in Q1 2022. Migrate to the Firestore Bulk Delete template.

The Datastore Bulk Delete template is a pipeline that reads entities from Datastore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Datastore entities to a JavaScript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

Datastore must be set up in the project before you run the template.
If you read from and delete from separate Datastore instances, the Dataflow worker service account must have permission to read from one instance and delete from the other.

Template parameters

Parameter	Description
`datastoreReadGqlQuery`	GQL query that specifies which entities to match for deletion. Using a keys-only query may improve performance. For example: "SELECT __key__ FROM MyKind".
`datastoreReadProjectId`	Project ID of the Datastore instance from which to read the entities that match the GQL query.
`datastoreDeleteProjectId`	Project ID of the Datastore instance from which to delete matching entities. This can be the same as `datastoreReadProjectId` if you want to read and delete within the same Datastore instance.
`datastoreReadNamespace`	(Optional) Namespace of the requested entities. Set to "" for the default namespace.
`datastoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Datastore ramp-up throttling step. The default is `500`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) to use. For example: `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function is `myTransform(inJson) { /*...do stuff...*/ }`, the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples. If this function returns undefined or null for a given Datastore entity, that entity is not deleted.
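
The filtering rule for the UDF can be modeled in a few lines. The real UDF is a JavaScript function stored in Cloud Storage; the sketch below only imitates its contract in plain Java, assuming each entity arrives as a JSON string and that a null return means "do not delete". `UdfFilterSketch`, `ARCHIVED_ONLY`, and the `archived` field are hypothetical names invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Hypothetical model of the UDF contract; the real UDF is written in JavaScript. */
final class UdfFilterSketch {

  // Stand-in for a UDF: pass through entities marked archived, return null for the rest.
  // A real UDF would parse the JSON instead of matching on a substring.
  static final UnaryOperator<String> ARCHIVED_ONLY =
      json -> json.contains("\"archived\":true") ? json : null;

  /** Applies the UDF to every entity and keeps only the non-null results for deletion. */
  static List<String> entitiesToDelete(List<String> jsonEntities, UnaryOperator<String> udf) {
    return jsonEntities.stream()
        .map(udf)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> entities = Arrays.asList(
        "{\"id\":1,\"archived\":true}",
        "{\"id\":2,\"archived\":false}");
    // Only the first entity survives the filter and would be deleted.
    System.out.println(entitiesToDelete(entities, ARCHIVED_ONLY));
  }
}
```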

Running the Datastore Bulk Delete template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Delete Entities in Datastore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="GQL_QUERY",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID

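Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy the Dataflow job, for example us-central1
VERSION: the version of the template to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates may include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
GQL_QUERY: the query used to match the entities to delete
DATASTORE_READ_AND_DELETE_PROJECT_ID: the project ID of your Datastore instance. This example reads from and deletes from the same Datastore instance.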

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "GQL_QUERY",
       "datastoreReadProjectId": "DATASTORE_READ_AND_DELETE_PROJECT_ID",
       "datastoreDeleteProjectId": "DATASTORE_READ_AND_DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}

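Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy the Dataflow job, for example us-central1
VERSION: the version of the template to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates may include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
GQL_QUERY: the query used to match the entities to delete
DATASTORE_READ_AND_DELETE_PROJECT_ID: the project ID of your Datastore instance. This example reads from and deletes from the same Datastore instance.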
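
For readers who assemble the launch request in code, the sketch below fills the placeholders into the URL and the JSON body for the Datastore template. It only builds strings and sends nothing. `LaunchRequestSketch` and its methods are hypothetical, the `environment` block is left out for brevity, and the escaping covers only backslashes and double quotes, so a production client should use a proper JSON library.

```java
/** Hypothetical request builder; it assembles strings and does not call the API. */
final class LaunchRequestSketch {

  /** Fills PROJECT_ID, LOCATION, and VERSION into the templates:launch URL. */
  static String launchUrl(String projectId, String location, String version) {
    return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location
        + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
        + "/Datastore_to_Datastore_Delete";
  }

  /** Escapes backslashes and double quotes so a value can sit inside a JSON string. */
  static String jsonEscape(String value) {
    return value.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /** Builds the request body, reading from and deleting in the same project. */
  static String launchBody(String jobName, String gqlQuery, String datastoreProjectId) {
    return "{\n"
        + "  \"jobName\": \"" + jsonEscape(jobName) + "\",\n"
        + "  \"parameters\": {\n"
        + "    \"datastoreReadGqlQuery\": \"" + jsonEscape(gqlQuery) + "\",\n"
        + "    \"datastoreReadProjectId\": \"" + jsonEscape(datastoreProjectId) + "\",\n"
        + "    \"datastoreDeleteProjectId\": \"" + jsonEscape(datastoreProjectId) + "\"\n"
        + "  }\n"
        + "}";
  }

  public static void main(String[] args) {
    // "my-project" and "delete-old-entities" are made-up example values.
    System.out.println(launchUrl("my-project", "us-central1", "latest"));
    System.out.println(
        launchBody("delete-old-entities", "SELECT __key__ FROM MyKind", "my-project"));
  }
}
```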

Template source code

Java


/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToDatastoreDelete.DatastoreToDatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteEntityJson;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/** Dataflow template which deletes pulled Datastore Entities. */
@Template(
    name = "Datastore_to_Datastore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Datastore [Deprecated]",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Datastore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "firestoreReadGqlQuery",
      "firestoreReadProjectId",
      "firestoreReadNamespace",
      "firestoreDeleteProjectId",
      "firestoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_Firestore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Firestore (Datastore mode)",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Firestore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "datastoreReadGqlQuery",
      "datastoreReadProjectId",
      "datastoreReadNamespace",
      "datastoreDeleteProjectId",
      "datastoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToDatastoreDelete {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToDatastoreDeleteOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreDeleteOptions {}

  /**
   * Runs a pipeline which reads in Entities from datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and deletes all the Entities.
   *
   * <p>If the UDF returns value of undefined or null for a given Entity, then that Entity will not
   * be deleted.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToDatastoreDeleteOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DatastoreToDatastoreDeleteOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            DatastoreDeleteEntityJson.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreDeleteProjectId(),
                        options.getFirestoreDeleteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .build());

    pipeline.run();
  }
}
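
The class above backs both the Datastore and the Firestore template, and `selectProvidedInput` is where the two parameter families meet. The source of `FirestoreNestedValueProvider` is not shown here, so the following is only a guess at the idea, not its implementation: return whichever of the two values was supplied. `ProvidedInputSketch` is hypothetical, `Supplier` stands in for Beam's `ValueProvider`, and preferring the Firestore value when both are set is an arbitrary assumption of this sketch.

```java
import java.util.function.Supplier;

/** Hypothetical model of "use whichever parameter was supplied"; not the real class. */
final class ProvidedInputSketch {

  /**
   * Returns a lazy value that resolves to the Firestore input when it is non-null and
   * falls back to the Datastore input otherwise. The precedence is an assumption.
   */
  static <T> Supplier<T> selectProvidedInput(
      Supplier<T> datastoreInput, Supplier<T> firestoreInput) {
    return () -> {
      T firestoreValue = firestoreInput.get();
      return firestoreValue != null ? firestoreValue : datastoreInput.get();
    };
  }

  public static void main(String[] args) {
    // Launched as the Datastore template: only the datastore* parameter is set.
    Supplier<String> projectId =
        ProvidedInputSketch.<String>selectProvidedInput(() -> "ds-project", () -> null);
    System.out.println(projectId.get());
  }
}
```

Resolving the value lazily matters for templates, because parameter values are only known when the job is launched, not when the pipeline graph is built.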

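Firestore Bulk Delete

The Firestore Bulk Delete template is a pipeline that reads entities from Firestore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Firestore entities to a JavaScript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

Firestore must be set up in the project before you run the template.
If you read from and delete from separate Firestore instances, the Dataflow worker service account must have permission to read from one instance and delete from the other.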

Template parameters

Parameter	Description
`firestoreReadGqlQuery`	GQL query that specifies which entities to match for deletion. Using a keys-only query may improve performance. For example: "SELECT __key__ FROM MyKind".
`firestoreReadProjectId`	Project ID of the Firestore instance from which to read the entities that match the GQL query.
`firestoreDeleteProjectId`	Project ID of the Firestore instance from which to delete matching entities. This can be the same as `firestoreReadProjectId` if you want to read and delete within the same Firestore instance.
`firestoreReadNamespace`	(Optional) Namespace of the requested entities. Set to "" for the default namespace.
`firestoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Firestore ramp-up throttling step. The default is `500`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) to use. For example: `gs://my-bucket/my-udfs/my_file.js`
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function is `myTransform(inJson) { /*...do stuff...*/ }`, the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples. If this function returns undefined or null for a given Firestore entity, that entity is not deleted.

Running the Firestore Bulk Delete template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
(Optional) For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Delete Entities in Firestore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_Firestore_Delete \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="GQL_QUERY",\
firestoreReadProjectId=FIRESTORE_READ_AND_DELETE_PROJECT_ID,\
firestoreDeleteProjectId=FIRESTORE_READ_AND_DELETE_PROJECT_ID

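Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy the Dataflow job, for example us-central1
VERSION: the version of the template to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates may include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
GQL_QUERY: the query used to match the entities to delete
FIRESTORE_READ_AND_DELETE_PROJECT_ID: the project ID of your Firestore instance. This example reads from and deletes from the same Firestore instance.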

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_Firestore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "firestoreReadGqlQuery": "GQL_QUERY",
       "firestoreReadProjectId": "FIRESTORE_READ_AND_DELETE_PROJECT_ID",
       "firestoreDeleteProjectId": "FIRESTORE_READ_AND_DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}

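Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy the Dataflow job, for example us-central1
VERSION: the version of the template to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- A version name, such as 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the corresponding dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest templates may include breaking changes. To keep such changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
GQL_QUERY: the query used to match the entities to delete
FIRESTORE_READ_AND_DELETE_PROJECT_ID: the project ID of your Firestore instance. This example reads from and deletes from the same Firestore instance.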

Template source code

Java


/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToDatastoreDelete.DatastoreToDatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteEntityJson;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/** Dataflow template which deletes pulled Datastore Entities. */
@Template(
    name = "Datastore_to_Datastore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Datastore [Deprecated]",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Datastore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "firestoreReadGqlQuery",
      "firestoreReadProjectId",
      "firestoreReadNamespace",
      "firestoreDeleteProjectId",
      "firestoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_Firestore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Firestore (Datastore mode)",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Firestore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "datastoreReadGqlQuery",
      "datastoreReadProjectId",
      "datastoreReadNamespace",
      "datastoreDeleteProjectId",
      "datastoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToDatastoreDelete {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToDatastoreDeleteOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreDeleteOptions {}

  /**
   * Runs a pipeline which reads in Entities from Datastore, passes in the JSON encoded Entities to
   * a JavaScript UDF, and deletes all the Entities.
   *
   * <p>If the UDF returns undefined or null for a given Entity, then that Entity will not be
   * deleted.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToDatastoreDeleteOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DatastoreToDatastoreDeleteOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            DatastoreDeleteEntityJson.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreDeleteProjectId(),
                        options.getFirestoreDeleteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .build());

    pipeline.run();
  }
}

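Streaming Data Generator to Pub/Sub, BigQuery, and Cloud Storage

The Streaming Data Generator template generates either an unlimited or a fixed number of synthetic records or messages at a specified rate, based on a user-provided schema. Supported destinations are Pub/Sub topics, BigQuery tables, and Cloud Storage buckets.

Possible use cases include the following:

Simulate large-scale, real-time event publishing to a Pub/Sub topic to measure and determine the number and size of consumers needed to process the published events.
Generate synthetic data to a BigQuery table or a Cloud Storage bucket to evaluate performance benchmarks or to serve as a proof of concept.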
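
When planning such a simulation, the generator's own worker pool also has to keep up with the requested rate. The comments in the template's Java source suggest, as a general rule, about 2,500 QPS per core in the worker pool. The following is a minimal sketch of that arithmetic. The class and method names are hypothetical, and the run-time estimate is an approximation that ignores job startup and autoscaling.

```java
import java.util.OptionalLong;

/** Hypothetical helper for rough capacity planning; not part of the template. */
final class GeneratorSizing {

  // Rule of thumb quoted in the template's source comments; treat it as a starting point.
  static final long QPS_PER_CORE = 2_500L;

  /** Returns the number of worker cores suggested by the rule of thumb (rounded up). */
  static long coresNeeded(long qps) {
    if (qps <= 0) {
      throw new IllegalArgumentException("qps must be positive");
    }
    return (qps + QPS_PER_CORE - 1) / QPS_PER_CORE; // ceiling division
  }

  /**
   * Returns the approximate number of seconds needed to emit messagesLimit messages at the given
   * rate, or an empty value when messagesLimit is 0 (documented as "no limit").
   */
  static OptionalLong approximateRunSeconds(long qps, long messagesLimit) {
    if (qps <= 0 || messagesLimit < 0) {
      throw new IllegalArgumentException("qps must be positive and messagesLimit non-negative");
    }
    if (messagesLimit == 0) {
      return OptionalLong.empty();
    }
    return OptionalLong.of((messagesLimit + qps - 1) / qps); // ceiling division
  }

  public static void main(String[] args) {
    System.out.println("Cores for 100000 QPS: " + coresNeeded(100_000L));
    System.out.println("Seconds for 1000 messages at 100 QPS: " + approximateRunSeconds(100L, 1_000L));
  }
}
```

The result of `coresNeeded` can inform the `maxNumWorkers` setting once it is divided by the number of cores per worker machine, which depends on the machine type you choose.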

Supported sinks and encoding formats

The following table describes which sinks and encoding formats this template supports:

	JSON	Avro	Parquet
Pub/Sub	Yes	Yes	No
BigQuery	Yes	No	No
Cloud Storage	Yes	Yes	Yes
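
The table can also be expressed as a small lookup, which is handy for checking a `sinkType` and `outputType` combination before launching a job. This is an illustrative sketch only. The class, enum, and method names are hypothetical and are not taken from the template.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical lookup of the sink and encoding combinations listed in the table. */
final class SinkEncodingSupport {

  enum Sink { PUBSUB, BIGQUERY, GCS }

  enum Encoding { JSON, AVRO, PARQUET }

  private static final Map<Sink, Set<Encoding>> SUPPORTED = new EnumMap<>(Sink.class);

  static {
    // One entry per table row; each set holds the supported encodings for that sink.
    SUPPORTED.put(Sink.PUBSUB, EnumSet.of(Encoding.JSON, Encoding.AVRO));
    SUPPORTED.put(Sink.BIGQUERY, EnumSet.of(Encoding.JSON));
    SUPPORTED.put(Sink.GCS, EnumSet.of(Encoding.JSON, Encoding.AVRO, Encoding.PARQUET));
  }

  /** Returns true when the table lists the combination as supported. */
  static boolean isSupported(Sink sink, Encoding encoding) {
    return SUPPORTED.get(sink).contains(encoding);
  }

  public static void main(String[] args) {
    for (Sink sink : Sink.values()) {
      for (Encoding encoding : Encoding.values()) {
        System.out.println(sink + " + " + encoding + " -> " + isSupported(sink, encoding));
      }
    }
  }
}
```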

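Pipeline requirements:

Create a schema file that contains a JSON template for the generated data. This template uses the JSON Data Generator library, so you can provide various faker functions for each field in the schema. For more information, see the json-data-generator documentation.

For example: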
```
{
  "id": {{integer(0,1000)}},
  "name": "{{uuid()}}",
  "isInStock": {{bool()}}
}
```
Upload the schema file to a Cloud Storage bucket.
The output target must exist before you run the pipeline. Depending on the sink type, the target must be a Pub/Sub topic, a BigQuery table, or a Cloud Storage bucket.
If the output encoding is Avro or Parquet, create an Avro schema file and store it in a Cloud Storage location.
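
To make the role of the `{{...}}` placeholders in the schema file concrete, the following sketch expands the three functions used in the example (`integer`, `uuid`, and `bool`) with plain Java. It is not the json-data-generator implementation and supports nothing beyond these three functions. The class name is hypothetical, and treating the `integer` bounds as inclusive is an assumption.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified stand-in that shows how template placeholders become values. */
final class TemplateExpansionSketch {

  // Matches {{name(args)}} where args contains no ')' character.
  private static final Pattern PLACEHOLDER =
      Pattern.compile("\\{\\{(\\w+)\\(([^)]*)\\)\\}\\}");

  /** Replaces every supported placeholder in the template with a freshly generated value. */
  static String expand(String template) {
    Matcher matcher = PLACEHOLDER.matcher(template);
    StringBuffer out = new StringBuffer();
    while (matcher.find()) {
      String value = evaluate(matcher.group(1), matcher.group(2));
      matcher.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    matcher.appendTail(out);
    return out.toString();
  }

  private static String evaluate(String function, String args) {
    ThreadLocalRandom random = ThreadLocalRandom.current();
    switch (function) {
      case "integer": {
        String[] bounds = args.split(",");
        long min = Long.parseLong(bounds[0].trim());
        long max = Long.parseLong(bounds[1].trim());
        // Assumption: both bounds are inclusive.
        return Long.toString(random.nextLong(min, max + 1));
      }
      case "uuid":
        return UUID.randomUUID().toString();
      case "bool":
        return Boolean.toString(random.nextBoolean());
      default:
        throw new IllegalArgumentException("Unsupported function in this sketch: " + function);
    }
  }

  public static void main(String[] args) {
    String template =
        "{\n  \"id\": {{integer(0,1000)}},\n  \"name\": \"{{uuid()}}\",\n  \"isInStock\": {{bool()}}\n}";
    System.out.println(expand(template));
  }
}
```

Each call to `expand` produces a different record, which is why the placeholders for string values sit inside quotes in the schema file while numeric and boolean placeholders do not.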

Template parameters

Parameter	Description
`schemaLocation`	Location of the schema file. For example: `gs://mybucket/filename.json`.
`qps`	Number of messages to publish per second. For example: `100`.
`sinkType`	(Optional) Output sink type. Possible values are `PUBSUB`, `BIGQUERY`, and `GCS`. The default is `PUBSUB`.
`outputType`	(Optional) Output encoding type. Possible values are `JSON`, `AVRO`, and `PARQUET`. The default is `JSON`.
`avroSchemaLocation`	(Optional) Location of the Avro schema file. Required when `outputType` is `AVRO` or `PARQUET`. For example: `gs://mybucket/filename.avsc`.
`topic`	(Optional) Name of the Pub/Sub topic to which the pipeline publishes data. Required when `sinkType` is `PUBSUB`. For example: `projects/my-project-ID/topics/my-topic-ID`.
`outputTableSpec`	(Optional) Name of the output BigQuery table. Required when `sinkType` is `BIGQUERY`. For example: `my-project-ID:my_dataset_name.my-table-name`.
`writeDisposition`	(Optional) BigQuery write disposition. Possible values are `WRITE_APPEND`, `WRITE_EMPTY`, and `WRITE_TRUNCATE`. The default is `WRITE_APPEND`.
`outputDeadletterTable`	(Optional) Name of the output BigQuery table that holds failed records. If not specified, the pipeline creates a table named `{output_table_name}_error_records` during execution. For example: `my-project-ID:my_dataset_name.my-table-name`.
`outputDirectory`	(Optional) Path of the output Cloud Storage location. Required when `sinkType` is `GCS`. For example: `gs://mybucket/pathprefix/`.
`outputFilenamePrefix`	(Optional) Filename prefix of the output files written to Cloud Storage. The default is `output-`.
`windowDuration`	(Optional) Window interval at which output is written to Cloud Storage. The default is `1m` (that is, 1 minute).
`numShards`	(Optional) Maximum number of output shards. Required when `sinkType` is `GCS`, and must be set to 1 or higher.
`messagesLimit`	(Optional) Maximum number of output messages. The default is `0`, which indicates no limit.
`autoscalingAlgorithm`	(Optional) Algorithm used for autoscaling the workers. Possible values are `THROUGHPUT_BASED` to enable autoscaling or `NONE` to disable it.
`maxNumWorkers`	(Optional) Maximum number of worker machines. For example: `10`.
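
Several parameters are optional in general but become required for a particular sink or encoding. The following sketch collects those conditional rules from the table into one pre-launch check. It is a hypothetical helper, not part of the template. It follows the table only and checks for presence, not for valid values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-launch check derived from the parameter table. */
final class GeneratorParameterCheck {

  /** Returns the names of parameters that are missing for the chosen sink and encoding. */
  static List<String> missingParameters(Map<String, String> params) {
    List<String> missing = new ArrayList<>();
    require(params, "schemaLocation", missing);
    require(params, "qps", missing);

    String sinkType = params.getOrDefault("sinkType", "PUBSUB"); // documented default
    String outputType = params.getOrDefault("outputType", "JSON"); // documented default

    switch (sinkType) {
      case "PUBSUB":
        require(params, "topic", missing);
        break;
      case "BIGQUERY":
        require(params, "outputTableSpec", missing);
        break;
      case "GCS":
        require(params, "outputDirectory", missing);
        require(params, "numShards", missing);
        break;
      default:
        throw new IllegalArgumentException("Unknown sinkType: " + sinkType);
    }

    if (outputType.equals("AVRO") || outputType.equals("PARQUET")) {
      require(params, "avroSchemaLocation", missing);
    }
    return missing;
  }

  private static void require(Map<String, String> params, String name, List<String> missing) {
    String value = params.get(name);
    if (value == null || value.isEmpty()) {
      missing.add(name);
    }
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("schemaLocation", "gs://mybucket/filename.json");
    params.put("qps", "100");
    params.put("sinkType", "GCS");
    System.out.println("Missing: " + missingParameters(params));
  }
}
```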

Running the Streaming Data Generator template

Console

Go to the Dataflow [Create job from template] page.

Go to [Create job from template]

In the [Job name] field, enter a unique job name.
(Optional) For [Regional endpoint], select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the [Dataflow template] drop-down menu, select the Streaming Data Generator template.
In the provided parameter fields, enter your parameter values.
Click [Run job].

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Streaming_Data_Generator \
    --parameters \
schemaLocation=SCHEMA_LOCATION,\
qps=QPS,\
topic=PUBSUB_TOPIC

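Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the correspondingly dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. So that such changes do not affect production workflows, use the templates stored in the most recent dated parent folder in production.
SCHEMA_LOCATION: the path to the schema file in Cloud Storage, for example gs://mybucket/filename.json
QPS: the number of messages to publish per second
PUBSUB_TOPIC: the output Pub/Sub topic, for example projects/my-project-ID/topics/my-topic-ID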
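
If you launch the template from a script or a test harness, the placeholder substitution can be done programmatically. The following sketch assembles the same gcloud invocation as an argument list that could be handed to `ProcessBuilder`. The class and method names are hypothetical. It uses only the flags shown in the command above and assumes that parameter values contain no commas, because commas separate entries in the `--parameters` value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical builder for the gcloud command shown above. */
final class GcloudCommandSketch {

  /** Returns the template spec path for a version such as "latest" or a dated version name. */
  static String templatePath(String version) {
    return "gs://dataflow-templates/" + version + "/flex/Streaming_Data_Generator";
  }

  /** Builds the argument list; parameters are joined as key=value pairs separated by commas. */
  static List<String> buildCommand(
      String jobName,
      String projectId,
      String regionName,
      String version,
      Map<String, String> parameters) {
    String joined =
        parameters.entrySet().stream()
            .map(entry -> entry.getKey() + "=" + entry.getValue())
            .collect(Collectors.joining(","));

    List<String> command = new ArrayList<>();
    command.add("gcloud");
    command.add("beta");
    command.add("dataflow");
    command.add("flex-template");
    command.add("run");
    command.add(jobName);
    command.add("--project=" + projectId);
    command.add("--region=" + regionName);
    command.add("--template-file-gcs-location=" + templatePath(version));
    command.add("--parameters");
    command.add(joined);
    return command;
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>(); // keeps insertion order
    parameters.put("schemaLocation", "gs://mybucket/filename.json");
    parameters.put("qps", "100");
    parameters.put("topic", "projects/my-project-ID/topics/my-topic-ID");
    System.out.println(
        String.join(" ", buildCommand("my-job", "my-project-ID", "us-central1", "latest", parameters)));
  }
}
```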

API

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "schemaLocation": "SCHEMA_LOCATION",
          "qps": "QPS",
          "topic": "PUBSUB_TOPIC"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Streaming_Data_Generator"
   }
}

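Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest: uses the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/).
- The version name, for example 2021-09-20-00_RC00: uses a specific version of the template, which is stored in the correspondingly dated parent folder in the bucket gs://dataflow-templates/.
Note: The latest template can contain breaking changes. So that such changes do not affect production workflows, use the templates stored in the most recent dated parent folder in production.
SCHEMA_LOCATION: the path to the schema file in Cloud Storage, for example gs://mybucket/filename.json
QPS: the number of messages to publish per second
PUBSUB_TOPIC: the output Pub/Sub topic, for example projects/my-project-ID/topics/my-topic-ID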
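
The request URL and body can also be assembled in code. The following sketch builds both as strings with the field names shown in the request above. The class and method names are hypothetical. Sending the request needs an authenticated HTTP client, which is omitted here, and the hand-rolled quoting only handles backslashes and double quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical builder for the flexTemplates:launch request shown above. */
final class FlexTemplateLaunchRequestSketch {

  /** Returns the launch endpoint for the given project and location. */
  static String launchUrl(String projectId, String location) {
    return "https://dataflow.googleapis.com/v1b3/projects/"
        + projectId
        + "/locations/"
        + location
        + "/flexTemplates:launch";
  }

  /** Returns the JSON request body for the given job name, template version, and parameters. */
  static String launchBody(String jobName, String version, Map<String, String> parameters) {
    String params =
        parameters.entrySet().stream()
            .map(entry -> "      " + quote(entry.getKey()) + ": " + quote(entry.getValue()))
            .collect(Collectors.joining(",\n"));
    String templatePath = "gs://dataflow-templates/" + version + "/flex/Streaming_Data_Generator";
    return "{\n"
        + "  \"launch_parameter\": {\n"
        + "    \"jobName\": " + quote(jobName) + ",\n"
        + "    \"parameters\": {\n"
        + params + "\n"
        + "    },\n"
        + "    \"containerSpecGcsPath\": " + quote(templatePath) + "\n"
        + "  }\n"
        + "}";
  }

  // Minimal JSON string quoting: escapes backslashes and double quotes only.
  private static String quote(String value) {
    return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>(); // keeps insertion order
    parameters.put("schemaLocation", "gs://mybucket/filename.json");
    parameters.put("qps", "100");
    parameters.put("topic", "projects/my-project-ID/topics/my-topic-ID");
    System.out.println("POST " + launchUrl("my-project-ID", "us-central1"));
    System.out.println(launchBody("my-job", "latest", parameters));
  }
}
```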

Template source code

Java

/*
 * Copyright (C) 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;
import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkNotNull;

import com.github.vincentrussell.json.datagenerator.JsonDataGenerator;
import com.github.vincentrussell.json.datagenerator.JsonDataGeneratorException;
import com.github.vincentrussell.json.datagenerator.impl.JsonDataGeneratorImpl;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.StreamingDataGenerator.StreamingDataGeneratorOptions;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToBigQuery;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToGcs;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToJdbc;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToPubSub;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToSpanner;
import com.google.cloud.teleport.v2.utils.DurationUtils;
import com.google.cloud.teleport.v2.utils.GCSUtils;
import com.google.cloud.teleport.v2.utils.MetadataValidator;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.annotation.Nonnull;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.annotations.VisibleForTesting;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link StreamingDataGenerator} is a streaming pipeline which generates messages at a
 * specified rate to either Pub/Sub topic or BigQuery/GCS. The messages are generated according to a
 * schema template which instructs the pipeline how to populate the messages with fake data
 * compliant to constraints.
 *
 * <p>The number of workers executing the pipeline must be large enough to support the supplied QPS.
 * Use a general rule of 2,500 QPS per core in the worker pool.
 *
 * <p>See <a href="https://github.com/vincentrussell/json-data-generator">json-data-generator</a>
 * for instructions on how to construct the schema file.
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 * SCHEMA_LOCATION=gs://{bucket}/{path}/{to}/game-event-schema.json
 * PUBSUB_TOPIC=projects/{project-id}/topics/{topic-id}
 * QPS=2500
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create a template spec containing the details of image location and metadata in GCS
 *   as specified in README.md file
 *
 * # Execute template:
 * JOB_NAME={job-name}
 * PROJECT={project-id}
 * TEMPLATE_SPEC_GCSPATH=gs://path/to/template-spec
 * SCHEMA_LOCATION=gs://path/to/schema.json
 * PUBSUB_TOPIC=projects/$PROJECT/topics/{topic-name}
 * QPS=1
 *
 * gcloud beta dataflow flex-template run $JOB_NAME \
 *         --project=$PROJECT --region=us-central1 \
 *         --template-file-gcs-location=$TEMPLATE_SPEC_GCSPATH \
 *         --parameters autoscalingAlgorithm="THROUGHPUT_BASED",schemaLocation=$SCHEMA_LOCATION,topic=$PUBSUB_TOPIC,qps=$QPS,maxNumWorkers=3
 *
 * </pre>
 */
@Template(
    name = "Streaming_Data_Generator",
    category = TemplateCategory.UTILITIES,
    displayName = "Streaming Data Generator",
    description =
        "A pipeline to publish messages at specified QPS. This template can be used to benchmark"
            + " performance of streaming pipelines.",
    optionsClass = StreamingDataGeneratorOptions.class,
    flexContainerName = "streaming-data-generator",
    contactInformation = "https://cloud.google.com/support")
public class StreamingDataGenerator {

  private static final Logger logger = LoggerFactory.getLogger(StreamingDataGenerator.class);

  /**
   * The {@link StreamingDataGeneratorOptions} class provides the custom execution options passed by
   * the executor at the command-line.
   */
  public interface StreamingDataGeneratorOptions extends PipelineOptions {
    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[1-9][0-9]*$"},
        description = "Required output rate",
        helpText = "Indicates rate of messages per second to be published to Pub/Sub")
    @Required
    Long getQps();

    void setQps(Long value);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"GAME_EVENT", "LOG_ENTRY"},
        optional = true,
        description = "Schema template to generate fake data",
        helpText =
            "Pre-existing schema template to use. The value must be one of: [GAME_EVENT, LOG_ENTRY]")
    SchemaTemplate getSchemaTemplate();

    void setSchemaTemplate(SchemaTemplate value);

    @TemplateParameter.GcsReadFile(
        order = 3,
        optional = true,
        description = "Location of Schema file to generate fake data",
        helpText = "Cloud Storage path of schema location.",
        example = "gs://<bucket-name>/prefix")
    String getSchemaLocation();

    void setSchemaLocation(String value);

    @TemplateParameter.PubsubTopic(
        order = 4,
        optional = true,
        description = "Output Pub/Sub topic",
        helpText = "The name of the topic to which the pipeline should publish data.",
        example = "projects/<project-id>/topics/<topic-name>")
    String getTopic();

    void setTopic(String value);

    @TemplateParameter.Long(
        order = 5,
        optional = true,
        description = "Maximum number of output Messages",
        helpText =
            "Indicates maximum number of output messages to be generated. 0 means unlimited.")
    @Default.Long(0L)
    Long getMessagesLimit();

    void setMessagesLimit(Long value);

    @TemplateParameter.Enum(
        order = 6,
        enumOptions = {"AVRO", "JSON", "PARQUET"},
        optional = true,
        description = "Output Encoding Type",
        helpText = "The message Output type. Default is JSON.")
    @Default.Enum("JSON")
    OutputType getOutputType();

    void setOutputType(OutputType value);

    @TemplateParameter.GcsReadFile(
        order = 7,
        optional = true,
        description = "Location of Avro Schema file",
        helpText =
            "Cloud Storage path of Avro schema location. Mandatory when output type is AVRO or"
                + " PARQUET.",
        example = "gs://your-bucket/your-path/schema.avsc")
    String getAvroSchemaLocation();

    void setAvroSchemaLocation(String value);

    @TemplateParameter.Enum(
        order = 8,
        enumOptions = {"BIGQUERY", "GCS", "PUBSUB", "JDBC", "SPANNER"},
        optional = true,
        description = "Output Sink Type",
        helpText = "The message Sink type. Default is PUBSUB")
    @Default.Enum("PUBSUB")
    SinkType getSinkType();

    void setSinkType(SinkType value);

    @TemplateParameter.BigQueryTable(
        order = 9,
        optional = true,
        description = "Output BigQuery table",
        helpText = "Output BigQuery table. Mandatory when sinkType is BIGQUERY",
        example = "<project>:<dataset>.<table_name>")
    String getOutputTableSpec();

    void setOutputTableSpec(String value);

    @TemplateParameter.Enum(
        order = 10,
        enumOptions = {"WRITE_APPEND", "WRITE_EMPTY", "WRITE_TRUNCATE"},
        optional = true,
        description = "Write Disposition to use for BigQuery",
        helpText =
            "BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE.")
    @Default.String("WRITE_APPEND")
    String getWriteDisposition();

    void setWriteDisposition(String writeDisposition);

    @TemplateParameter.BigQueryTable(
        order = 11,
        optional = true,
        description = "The dead-letter table name to output failed messages to BigQuery",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. If it doesn't exist, it will"
                + " be created during pipeline execution.",
        example = "your-project-id:your-dataset.your-table-name")
    String getOutputDeadletterTable();

    void setOutputDeadletterTable(String outputDeadletterTable);

    @TemplateParameter.Duration(
        order = 12,
        optional = true,
        description = "Window duration",
        helpText =
            "The window duration/size in which data will be written to Cloud Storage. Allowed"
                + " formats are: Ns (for seconds, example: 5s), Nm (for minutes, example: 12m), Nh"
                + " (for hours, example: 2h).",
        example = "1m")
    @Default.String("1m")
    String getWindowDuration();

    void setWindowDuration(String windowDuration);

    @TemplateParameter.GcsWriteFolder(
        order = 13,
        optional = true,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime"
                + " formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path/")
    String getOutputDirectory();

    void setOutputDirectory(String outputDirectory);

    @TemplateParameter.Text(
        order = 14,
        optional = true,
        description = "Output filename prefix of the files to write",
        helpText = "The prefix to place on each windowed file.",
        example = "output-")
    @Default.String("output-")
    String getOutputFilenamePrefix();

    void setOutputFilenamePrefix(String outputFilenamePrefix);

    @TemplateParameter.Integer(
        order = 15,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of shards"
                + " means higher throughput for writing to Cloud Storage, but potentially higher"
                + " data aggregation cost across shards when processing output Cloud Storage files."
                + " Default value is decided by the runner.")
    @Default.Integer(0)
    Integer getNumShards();

    void setNumShards(Integer numShards);

    @TemplateParameter.Text(
        order = 16,
        optional = true,
        regexes = {"^.+$"},
        description = "JDBC driver class name.",
        helpText = "JDBC driver class name to use.",
        example = "com.mysql.jdbc.Driver")
    String getDriverClassName();

    void setDriverClassName(String driverClassName);

    @TemplateParameter.Text(
        order = 17,
        optional = true,
        regexes = {
          "(^jdbc:[a-zA-Z0-9/:@.?_+!*=&-;]+$)|(^([A-Za-z0-9+/]{4}){1,}([A-Za-z0-9+/]{0,3})={0,3})"
        },
        description = "JDBC connection URL string.",
        helpText = "Url connection string to connect to the JDBC source.",
        example = "jdbc:mysql://some-host:3306/sampledb")
    String getConnectionUrl();

    void setConnectionUrl(String connectionUrl);

    @TemplateParameter.Text(
        order = 18,
        optional = true,
        regexes = {"^.+$"},
        description = "JDBC connection username.",
        helpText = "User name to be used for the JDBC connection.")
    String getUsername();

    void setUsername(String username);

    @TemplateParameter.Password(
        order = 19,
        optional = true,
        description = "JDBC connection password.",
        helpText = "Password to be used for the JDBC connection.")
    String getPassword();

    void setPassword(String password);

    @TemplateParameter.Text(
        order = 20,
        optional = true,
        regexes = {"^[a-zA-Z0-9_;!*&=@#-:\\/]+$"},
        description = "JDBC connection property string.",
        helpText =
            "Properties string to use for the JDBC connection. Format of the string must be"
                + " [propertyName=property;]*.",
        example = "unicode=true;characterEncoding=UTF-8")
    String getConnectionProperties();

    void setConnectionProperties(String connectionProperties);

    @TemplateParameter.Text(
        order = 21,
        optional = true,
        regexes = {"^.+$"},
        description = "Statement which will be executed against the database.",
        helpText =
            "SQL statement which will be executed to write to the database. The statement must"
                + " specify the column names of the table in any order. Only the values of the"
                + " specified column names will be read from the json and added to the statement.",
        example = "INSERT INTO tableName (column1, column2) VALUES (?,?)")
    String getStatement();

    void setStatement(String statement);

    @TemplateParameter.Text(
        order = 22,
        optional = true,
        regexes = {"^.+$"},
        description = "GCP Project Id of where the Spanner table lives.",
        helpText = "GCP Project Id of where the Spanner table lives.")
    String getProjectId();

    void setProjectId(String projectId);

    @TemplateParameter.Text(
        order = 23,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner instance name.",
        helpText = "Cloud Spanner instance name.")
    String getSpannerInstanceName();

    void setSpannerInstanceName(String spannerInstanceName);

    @TemplateParameter.Text(
        order = 24,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner database name.",
        helpText = "Cloud Spanner database name.")
    String getSpannerDatabaseName();

    void setSpannerDatabaseName(String spannerDBName);

    @TemplateParameter.Text(
        order = 25,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner table name.",
        helpText = "Cloud Spanner table name.")
    String getSpannerTableName();

    void setSpannerTableName(String spannerTableName);
  }

  /** Allowed list of existing schema templates. */
  public enum SchemaTemplate {
    GAME_EVENT(
        "{\n"
            + "  \"eventId\": \"{{uuid()}}\",\n"
            + "  \"eventTimestamp\": {{timestamp()}},\n"
            + "  \"ipv4\": \"{{ipv4()}}\",\n"
            + "  \"ipv6\": \"{{ipv6()}}\",\n"
            + "  \"country\": \"{{country()}}\",\n"
            + "  \"username\": \"{{username()}}\",\n"
            + "  \"quest\": \"{{random(\"A Break In the Ice\", \"Ghosts of Perdition\", \"Survive"
            + " the Low Road\")}}\",\n"
            + "  \"score\": {{integer(100, 10000)}},\n"
            + "  \"completed\": {{bool()}}\n"
            + "}"),
    LOG_ENTRY(
        "{\n"
            + "  \"logName\": \"{{alpha(10,20)}}\",\n"
            + "  \"resource\": {\n"
            + "    \"type\": \"{{alpha(5,10)}}\"\n"
            + "  },\n"
            + "  \"timestamp\": {{timestamp()}},\n"
            + "  \"receiveTimestamp\": {{timestamp()}},\n"
            + "  \"severity\": \"{{random(\"DEFAULT\", \"DEBUG\", \"INFO\", \"NOTICE\","
            + " \"WARNING\", \"ERROR\", \"CRITICAL\", \"ERROR\")}}\",\n"
            + "  \"insertId\": \"{{uuid()}}\",\n"
            + "  \"trace\": \"{{uuid()}}\",\n"
            + "  \"spanId\": \"{{uuid()}}\",\n"
            + "  \"jsonPayload\": {\n"
            + "    \"bytes_sent\": {{integer(1000,20000)}},\n"
            + "    \"connection\": {\n"
            + "      \"dest_ip\": \"{{ipv4()}}\",\n"
            + "      \"dest_port\": {{integer(0,65000)}},\n"
            + "      \"protocol\": {{integer(0,6)}},\n"
            + "      \"src_ip\": \"{{ipv4()}}\",\n"
            + "      \"src_port\": {{integer(0,65000)}}\n"
            + "    },\n"
            + "    \"dest_instance\": {\n"
            + "      \"project_id\": \"{{concat(\"PROJECT\", integer(0,3))}}\",\n"
            + "      \"region\": \"{{country()}}\",\n"
            + "      \"vm_name\": \"{{username()}}\",\n"
            + "      \"zone\": \"{{state()}}\"\n"
            + "    },\n"
            + "    \"end_time\": {{timestamp()}},\n"
            + "    \"packets_sent\": {{integer(100,400)}},\n"
            + "    \"reporter\": \"{{random(\"SRC\", \"DEST\")}}\",\n"
            + "    \"rtt_msec\": {{integer(0,20)}},\n"
            + "    \"start_time\": {{timestamp()}}\n"
            + "  }\n"
            + "}");

    private final String schema;

    SchemaTemplate(String schema) {
      this.schema = schema;
    }

    public String getSchema() {
      return schema;
    }
  }

  /** Allowed list of message encoding types. */
  public enum OutputType {
    JSON(".json"),
    AVRO(".avro"),
    PARQUET(".parquet");

    private final String fileExtension;

    /** Sets file extension associated with output type. */
    OutputType(String fileExtension) {
      this.fileExtension = fileExtension;
    }

    /** Returns file extension associated with output type. */
    public String getFileExtension() {
      return fileExtension;
    }
  }

  /** Allowed list of sink types. */
  public enum SinkType {
    PUBSUB,
    BIGQUERY,
    GCS,
    JDBC,
    SPANNER
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * StreamingDataGenerator#run(StreamingDataGeneratorOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args command-line args passed by the executor.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    StreamingDataGeneratorOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StreamingDataGeneratorOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options the execution options.
   * @return the pipeline result.
   */
  public static PipelineResult run(@Nonnull StreamingDataGeneratorOptions options) {
    checkNotNull(options, "options argument to run method cannot be null.");
    MetadataValidator.validate(options);

    // FileSystems does not set the default configuration in workers till Pipeline.run
    // Explicitly registering standard file systems.
    FileSystems.setDefaultPipelineOptions(options);
    String schema = getSchema(options.getSchemaTemplate(), options.getSchemaLocation());

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Trigger at the supplied QPS
     *  2) Generate messages containing fake data
     *  3) Write messages to appropriate Sink
     */
    PCollection<byte[]> generatedMessages =
        pipeline
            .apply("Trigger", createTrigger(options))
            .apply("Generate Fake Messages", ParDo.of(new MessageGeneratorFn(schema)));

    if (options.getSinkType().equals(SinkType.GCS)) {
      generatedMessages =
          generatedMessages.apply(
              options.getWindowDuration() + " Window",
              Window.into(
                  FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))));
    }

    generatedMessages.apply(
        "Write To " + options.getSinkType().name(), createSink(options, schema));

    return pipeline.run();
  }

  /**
   * Creates either a bounded or an unbounded source based on the messagesLimit pipeline option.
   *
   * @param options the pipeline options.
   */
  private static GenerateSequence createTrigger(@Nonnull StreamingDataGeneratorOptions options) {
    checkNotNull(options, "options argument to createTrigger method cannot be null.");
    GenerateSequence generateSequence =
        GenerateSequence.from(0L)
            .withRate(options.getQps(), /* periodLength = */ Duration.standardSeconds(1L));

    return options.getMessagesLimit() > 0
        ? generateSequence.to(options.getMessagesLimit())
        : generateSequence;
  }

  /**
   * The {@link MessageGeneratorFn} class generates fake messages based on the supplied schema.
   *
   * <p>See <a href="https://github.com/vincentrussell/json-data-generator">json-data-generator</a>
   * for instructions on how to construct the schema file.
   */
  @VisibleForTesting
  static class MessageGeneratorFn extends DoFn<Long, byte[]> {

    // Not initialized inline or constructor because {@link JsonDataGenerator} is not serializable.
    private transient JsonDataGenerator dataGenerator;
    private final String schema;

    MessageGeneratorFn(String schema) {
      this.schema = schema;
    }

    @Setup
    public void setup() {
      dataGenerator = new JsonDataGeneratorImpl();
    }

    @ProcessElement
    public void processElement(
        @Element Long element,
        @Timestamp Instant timestamp,
        OutputReceiver<byte[]> receiver,
        ProcessContext context)
        throws IOException, JsonDataGeneratorException {

      byte[] payload;

      // Generate the fake JSON according to the schema.
      try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        dataGenerator.generateTestDataJson(schema, byteArrayOutputStream);
        payload = byteArrayOutputStream.toByteArray();
      }

      receiver.output(payload);
    }
  }

  /**
   * Creates appropriate sink based on sinkType pipeline option.
   *
   * @param options the pipeline options.
   */
  @VisibleForTesting
  static PTransform<PCollection<byte[]>, PDone> createSink(
      @Nonnull StreamingDataGeneratorOptions options, @Nonnull String schema) {
    checkNotNull(options, "options argument to createSink method cannot be null.");
    checkNotNull(schema, "schema argument to createSink method cannot be null.");

    switch (options.getSinkType()) {
      case PUBSUB:
        checkArgument(
            options.getTopic() != null,
            String.format(
                "Missing required value --topic for %s sink type", options.getSinkType().name()));
        return StreamingDataGeneratorWriteToPubSub.Writer.builder(options, schema).build();
      case BIGQUERY:
        checkArgument(
            options.getOutputTableSpec() != null,
            String.format(
                "Missing required value --outputTableSpec in format"
                    + " <project>:<dataset>.<table_name> for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToBigQuery.builder(options).build();
      case GCS:
        checkArgument(
            options.getOutputDirectory() != null,
            String.format(
                "Missing required value --outputDirectory in format gs:// for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToGcs.builder(options).build();
      case JDBC:
        checkArgument(
            options.getDriverClassName() != null,
            String.format(
                "Missing required value --driverClassName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getConnectionUrl() != null,
            String.format(
                "Missing required value --connectionUrl for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getStatement() != null,
            String.format(
                "Missing required value --statement for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToJdbc.builder(options).build();
      case SPANNER:
        checkArgument(
            options.getProjectId() != null,
            String.format(
                "Missing required value --projectId for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerInstanceName() != null,
            String.format(
                "Missing required value --spannerInstanceName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerDatabaseName() != null,
            String.format(
                "Missing required value --spannerDatabaseName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerTableName() != null,
            String.format(
                "Missing required value --spannerTableName for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToSpanner.builder(options).build();
      default:
        throw new IllegalArgumentException("Unsupported Sink.");
    }
  }

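  /**
   * Resolves the message schema. When {@code schemaLocation} is set, the schema is read from that
   * Cloud Storage file; otherwise the schema of the given {@code schemaTemplate} is used.
   */
  private static String getSchema(SchemaTemplate schemaTemplate, String schemaLocation) {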
    checkArgument(
        schemaTemplate != null || schemaLocation != null,
        "Either schemaTemplate or schemaLocation argument of MessageGeneratorFn class must be"
            + " provided.");
    if (schemaLocation != null) {
      return GCSUtils.getGcsFileAsString(schemaLocation);
    } else {
      return schemaTemplate.getSchema();
    }
  }
}
