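Google-provided utility templates

Google provides a set of open source Dataflow templates. For general information about templates, see Dataflow templates. For a list of all Google-provided templates, see Get started with Google-provided templates.

This guide documents the utility templates.

File Format Conversion (Avro, Parquet, CSV)

The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.

The following format conversions are supported:

CSV to Avro
CSV to Parquet
Avro to Parquet
Parquet to Avro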
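
The four conversions listed above can be expressed as a small check on the input and output format names. The following is a minimal sketch, not the template's own code: the class `ConversionCheck` and its method are hypothetical names introduced for illustration, and the rule (a valid input format, a valid output format, and the two must differ) is inferred from the list above and the parameter constraints on this page.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the template. */
final class ConversionCheck {
  // Formats taken from this page: any of the three can be read, only two can be written.
  private static final Set<String> INPUT_FORMATS = Set.of("CSV", "AVRO", "PARQUET");
  private static final Set<String> OUTPUT_FORMATS = Set.of("AVRO", "PARQUET");

  /** Returns true if the pair is one of the four supported conversions. */
  static boolean isSupported(String inputFormat, String outputFormat) {
    String in = inputFormat.toUpperCase(Locale.ROOT);
    String out = outputFormat.toUpperCase(Locale.ROOT);
    return INPUT_FORMATS.contains(in) && OUTPUT_FORMATS.contains(out) && !in.equals(out);
  }

  public static void main(String[] args) {
    System.out.println(isSupported("csv", "parquet")); // prints true
    System.out.println(isSupported("avro", "avro"));   // prints false
    System.out.println(isSupported("parquet", "csv")); // prints false
  }
}
```

Running such a check before submitting a job is one way to catch an unsupported pair early, rather than waiting for the pipeline to reject it.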

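Requirements for this pipeline:

The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`inputFileFormat`	The input file format. Must be one of `[csv, avro, parquet]`.
`outputFileFormat`	The output file format. Must be one of `[avro, parquet]`.
`inputFileSpec`	The Cloud Storage path pattern for the input files. For example, `gs://bucket-name/path/*.csv`.
`outputBucket`	The Cloud Storage folder to write the output files to. This path must end with a slash. For example, `gs://bucket-name/output/`.
`schema`	The Cloud Storage path to the Avro schema file. For example, `gs://bucket-name/schema/my-schema.avsc`.
`containsHeaders`	(Optional) Whether the input CSV files contain a header record (true/false). The default value is `false`. Only required when reading CSV files.
`csvFormat`	(Optional) The CSV format specification to use for parsing records. The default value is `Default`. For more details, see Apache Commons CSV Format.
`delimiter`	(Optional) The field delimiter used by the input CSV files.
`outputFilePrefix`	(Optional) The output file prefix. The default value is `output`.
`numShards`	(Optional) The number of output file shards.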

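Running the File Format Conversion template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Convert file formats template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: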

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/File_Format_Conversion \
    --parameters \
inputFileFormat=INPUT_FORMAT,\
outputFileFormat=OUTPUT_FORMAT,\
inputFileSpec=INPUT_FILES,\
schema=SCHEMA,\
outputBucket=OUTPUT_FOLDER

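Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
INPUT_FORMAT: the file format of the input files; must be one of [csv, avro, parquet]
OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
INPUT_FILES: the path pattern for the input files
OUTPUT_FOLDER: the Cloud Storage folder for the output files
SCHEMA: the path to the Avro schema file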
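
When the command is generated by a script rather than typed by hand, the template path and the `--parameters` value can be assembled from the placeholders above. The following is a minimal sketch under stated assumptions: `TemplateArgs` and its methods are hypothetical names, the bucket paths in `main` are example values, and values that themselves contain commas (for example a `delimiter`) would need gcloud's escaping rules, which this sketch does not handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of gcloud or the template. */
final class TemplateArgs {

  /** Builds the Cloud Storage path of the template spec for a version such as "latest". */
  static String templatePath(String version) {
    return "gs://dataflow-templates/" + version + "/flex/File_Format_Conversion";
  }

  /** Joins parameters as comma-separated key=value pairs, in insertion order. */
  static String joinParameters(Map<String, String> parameters) {
    return parameters.entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    // Example values only; replace them with your own bucket and schema paths.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("inputFileFormat", "csv");
    parameters.put("outputFileFormat", "avro");
    parameters.put("inputFileSpec", "gs://my-bucket/input/*.csv");
    parameters.put("schema", "gs://my-bucket/schema/my-schema.avsc");
    parameters.put("outputBucket", "gs://my-bucket/output/");

    System.out.println("--template-file-gcs-location=" + templatePath("latest"));
    System.out.println("--parameters " + joinParameters(parameters));
  }
}
```

A `LinkedHashMap` is used so that the parameters appear in the same order in which they were added, which keeps the generated command easy to compare with the one shown above.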

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileFormat": "INPUT_FORMAT",
          "outputFileFormat": "OUTPUT_FORMAT",
          "inputFileSpec": "INPUT_FILES",
          "schema": "SCHEMA",
          "outputBucket": "OUTPUT_FOLDER"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/File_Format_Conversion"
   }
}

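Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
INPUT_FORMAT: the file format of the input files; must be one of [csv, avro, parquet]
OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
INPUT_FILES: the path pattern for the input files
OUTPUT_FOLDER: the Cloud Storage folder for the output files
SCHEMA: the path to the Avro schema file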
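
The REST call above can also be assembled from Java with the JDK's built-in `java.net.http` classes. The following is a minimal sketch, not an official client: `FlexLaunchRequest` and its methods are hypothetical names, the JSON is built by string concatenation on the assumption that no value needs JSON escaping, and the request is only constructed, not sent. A real call would also need an `Authorization: Bearer` header carrying an access token, which is omitted here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of any Google client library. */
final class FlexLaunchRequest {

  /** Builds the flexTemplates:launch URL for a project and location. */
  static String launchUrl(String projectId, String location) {
    return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location + "/flexTemplates:launch";
  }

  /** Builds the JSON body. Assumes no value contains characters that need JSON escaping. */
  static String body(String jobName, String version, Map<String, String> parameters) {
    String params = parameters.entrySet().stream()
        .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
        .collect(Collectors.joining(", "));
    return "{\"launch_parameter\": {"
        + "\"jobName\": \"" + jobName + "\", "
        + "\"parameters\": {" + params + "}, "
        + "\"containerSpecGcsPath\": \"gs://dataflow-templates/" + version
        + "/flex/File_Format_Conversion\"}}";
  }

  /** Assembles the POST request without sending it; the Authorization header is left out. */
  static HttpRequest request(String projectId, String location, String body) {
    return HttpRequest.newBuilder(URI.create(launchUrl(projectId, location)))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    // Example values only; replace them with your own project, bucket, and schema paths.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("inputFileFormat", "csv");
    parameters.put("outputFileFormat", "avro");
    parameters.put("inputFileSpec", "gs://my-bucket/input/*.csv");
    parameters.put("schema", "gs://my-bucket/schema/my-schema.avsc");
    parameters.put("outputBucket", "gs://my-bucket/output/");

    String body = body("csv-to-avro", "latest", parameters);
    HttpRequest request = request("my-project", "us-central1", body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

For anything beyond a quick experiment, a JSON library is a safer way to build the body, since it handles quoting and escaping of arbitrary values.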

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.FileFormatConversion.FileFormatConversionOptions;
import com.google.cloud.teleport.v2.transforms.AvroConverters.AvroOptions;
import com.google.cloud.teleport.v2.transforms.CsvConverters.CsvPipelineOptions;
import com.google.cloud.teleport.v2.transforms.ParquetConverters.ParquetOptions;
import java.util.EnumMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link FileFormatConversion} pipeline takes in an input file, converts it to a desired format
 * and saves it to Cloud Storage. Supported file transformations are:
 *
 * <ul>
 *   <li>Csv to Avro
 *   <li>Csv to Parquet
 *   <li>Avro to Parquet
 *   <li>Parquet to Avro
 * </ul>
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>Input file exists in Google Cloud Storage.
 *   <li>Google Cloud Storage output bucket exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Set vars for execution
 * export INPUT_FILE_FORMAT=Csv
 * export OUTPUT_FILE_FORMAT=Avro
 * export AVRO_SCHEMA_PATH=gs://path/to/avro/schema
 * export HEADERS=false
 * export DELIMITER=","
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="csv-to-avro-`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json"     \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${BUCKET_NAME}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${BUCKET_NAME}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *            "inputFileFormat":"'$INPUT_FILE_FORMAT'",
 *            "outputFileFormat":"'$OUTPUT_FILE_FORMAT'",
 *            "inputFileSpec":"'$BUCKET_NAME/path/to/input-file'",
 *            "outputBucket":"'$BUCKET_NAME/path/to/output-location/'",
 *            "containsHeaders":"'$HEADERS'",
 *            "schema":"'$AVRO_SCHEMA_PATH'",
 *            "outputFilePrefix":"output-file",
 *            "numShards":"3",
 *            "delimiter":"'$DELIMITER'"
 *         }
 *       }
 *      '
 * </pre>
 */
@Template(
    name = "File_Format_Conversion",
    category = TemplateCategory.UTILITIES,
    displayName = "Convert file formats between Avro, Parquet & CSV",
    description = "A pipeline to convert file formats between Avro, Parquet & csv.",
    optionsClass = FileFormatConversionOptions.class,
    optionalOptions = {"deadletterTable"},
    flexContainerName = "file-format-conversion",
    contactInformation = "https://cloud.google.com/support")
public class FileFormatConversion {

  /** Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(FileFormatConversion.class);

  private static EnumMap<ValidFileFormats, String> validFileFormats =
      new EnumMap<ValidFileFormats, String>(ValidFileFormats.class);

  /**
   * The {@link FileFormatConversionOptions} provides the custom execution options passed by the
   * executor at the command-line.
   */
  public interface FileFormatConversionOptions
      extends PipelineOptions, CsvPipelineOptions, AvroOptions, ParquetOptions {
    @TemplateParameter.Enum(
        order = 1,
        enumOptions = {"avro", "csv", "parquet"},
        description = "File format of the input files.",
        helpText = "File format of the input files. Needs to be either avro, parquet or csv.")
    @Required
    String getInputFileFormat();

    void setInputFileFormat(String inputFileFormat);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"avro", "parquet"},
        description = "File format of the output files.",
        helpText = "File format of the output files. Needs to be either avro or parquet.")
    @Required
    String getOutputFileFormat();

    void setOutputFileFormat(String outputFileFormat);
  }

  /** The {@link ValidFileFormats} enum contains all valid file formats. */
  public enum ValidFileFormats {
    CSV,
    AVRO,
    PARQUET
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    FileFormatConversionOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(FileFormatConversionOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   * @throws RuntimeException thrown if incorrect file formats are passed.
   */
  public static PipelineResult run(FileFormatConversionOptions options) {
    String inputFileFormat = options.getInputFileFormat().toUpperCase();
    String outputFileFormat = options.getOutputFileFormat().toUpperCase();

    validFileFormats.put(ValidFileFormats.CSV, "CSV");
    validFileFormats.put(ValidFileFormats.AVRO, "AVRO");
    validFileFormats.put(ValidFileFormats.PARQUET, "PARQUET");

    if (!validFileFormats.containsValue(inputFileFormat)) {
      throw new IllegalArgumentException("Invalid input file format.");
    }
    if (!validFileFormats.containsValue(outputFileFormat)) {
      throw new IllegalArgumentException("Invalid output file format.");
    }
    if (inputFileFormat.equals(outputFileFormat)) {
      throw new IllegalArgumentException("Input and output file format cannot be the same.");
    }

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply(
        inputFileFormat + " to " + outputFileFormat,
        FileFormatConversionFactory.FileFormat.newBuilder()
            .setOptions(options)
            .setInputFileFormat(inputFileFormat)
            .setOutputFileFormat(outputFileFormat)
            .build());

    return pipeline.run();
  }
}

Bulk Compress Cloud Storage Files

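The Bulk Compress Cloud Storage Files template is a batch pipeline that compresses files on Cloud Storage to a specified location. This template can be useful when you need to compress large batches of files as part of a periodic archival process. The supported compression modes are BZIP2, DEFLATE, and GZIP. Files output to the destination location follow a naming schema of the original filename appended with the compression mode extension. The appended extension is one of the following: .bzip2, .deflate, .gz.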
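
The naming scheme described above amounts to appending a mode-specific extension to the original filename. The following is a minimal sketch: `CompressedNames` is a hypothetical name, and the extension table is copied from the description above. The template itself derives the suffix from Apache Beam's `Compression#getSuggestedSuffix` (see the source code further down), so the actual suffix may not match this table in every case.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the template. */
final class CompressedNames {
  // Extensions as listed in the template description on this page.
  private static final Map<String, String> EXTENSIONS =
      Map.of("BZIP2", ".bzip2", "DEFLATE", ".deflate", "GZIP", ".gz");

  /** Returns the original filename with the extension for the given compression mode. */
  static String outputFilename(String originalFilename, String compression) {
    String extension = EXTENSIONS.get(compression);
    if (extension == null) {
      throw new IllegalArgumentException("Unsupported compression: " + compression);
    }
    return originalFilename + extension;
  }

  public static void main(String[] args) {
    System.out.println(outputFilename("demo.txt", "GZIP"));    // prints demo.txt.gz
    System.out.println(outputFilename("demo.txt", "DEFLATE")); // prints demo.txt.deflate
  }
}
```

Because the original extension is kept, the uncompressed name can be recovered later by removing the final suffix.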

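Any errors that occur during the compression process are output to the failure file in CSV format (filename, error message). If no failures occur while the pipeline runs, the error file is still created but contains no error records.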
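
The layout of the failure file can be sketched as a header followed by one `filename,error` line per failed file. In the following minimal sketch, `FailureFile` is a hypothetical name and the file path and error message in `main` are made-up examples. The header text and the `"%s,%s"` line format mirror the template source shown later on this page. No CSV quoting is applied, so an error message that contains a comma would produce extra columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the template. */
final class FailureFile {

  /** Returns the header followed by one "filename,error" line per failure. */
  static List<String> lines(Map<String, String> failures) {
    List<String> lines = new ArrayList<>();
    lines.add("Filename,Error");
    failures.forEach((file, error) -> lines.add(String.format("%s,%s", file, error)));
    return lines;
  }

  public static void main(String[] args) {
    // Made-up failure for demonstration purposes.
    Map<String, String> failures = new LinkedHashMap<>();
    failures.put("gs://my-bucket/uncompressed/a.txt", "Read timed out");
    lines(failures).forEach(System.out::println);
  }
}
```

With an empty map, this sketch returns only the header line, which corresponds to a file that contains no error records.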

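Requirements for this pipeline:

The compression must be in one of the following formats: BZIP2, DEFLATE, GZIP.
The output directory must exist before you run the pipeline.

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. For example, `gs://bucket-name/uncompressed/*.txt`.
`outputDirectory`	The output location to write to. For example, `gs://bucket-name/compressed/`.
`outputFailureFile`	The error log output file to use for write failures that occur during the compression process. For example, `gs://bucket-name/compressed/failed.csv`. If there are no failures, the file is still created but is empty. The file contents are in CSV format (Filename, Error), with one line for each file that fails compression.
`compression`	The compression algorithm used to compress the matched files. Must be one of: `BZIP2`, `DEFLATE`, `GZIP`.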

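Running the Bulk Compress Cloud Storage Files template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Compress Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: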

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/uncompressed/*.txt,\
outputDirectory=gs://BUCKET_NAME/compressed,\
outputFailureFile=gs://BUCKET_NAME/failed/failure.csv,\
compression=COMPRESSION

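Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
BUCKET_NAME: the name of your Cloud Storage bucket
COMPRESSION: your choice of compression algorithm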

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Bulk_Compress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/uncompressed/*.txt",
       "outputDirectory": "gs://BUCKET_NAME/compressed",
       "outputFailureFile": "gs://BUCKET_NAME/failed/failure.csv",
       "compression": "COMPRESSION"
   },
   "environment": { "zone": "us-central1-f" }
}

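Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
BUCKET_NAME: the name of your Cloud Storage bucket
COMPRESSION: your choice of compression algorithm

Template source code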

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BulkCompressor.Options;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link BulkCompressor} is a batch pipeline that compresses files matched by an input file
 * pattern and outputs them to a specified file location. This pipeline can be useful when you need
 * to compress large batches of files as part of a periodic archival process. The supported
 * compression modes are: <code>BZIP2</code>, <code>DEFLATE</code>, <code>GZIP</code>, <code>ZIP
 * </code>. Files output to the destination location will follow a naming schema of original
 * filename appended with the compression mode extension. The extensions appended will be one of:
 * <code>.bzip2</code>, <code>.deflate</code>, <code>.gz</code>, <code>.zip</code> as determined by
 * the compression type.
 *
 * <p>Any errors which occur during the compression process will be output to the failure file in
 * CSV format of filename, error message. If no failures occur during execution, the error file will
 * still be created but will contain no error records.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The compression must be in one of the following formats: <code>BZIP2</code>, <code>DEFLATE
 *       </code>, <code>GZIP</code>, <code>ZIP</code>.
 *   <li>The output directory must exist prior to pipeline execution.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_FOLDER=gs://${PROJECT_ID}/dataflow/pipelines/bulk-compressor
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.BulkCompressor \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=bulk-compressor-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputFilePattern=${PIPELINE_FOLDER}/test/uncompressed/*,\
 * outputDirectory=${PIPELINE_FOLDER}/test/compressed,\
 * outputFailureFile=${PIPELINE_FOLDER}/test/failure/failed-${JOB_NAME}.csv,\
 * compression=GZIP"
 * </pre>
 */
@Template(
    name = "Bulk_Compress_GCS_Files",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Compress Files on Cloud Storage",
    description = "Batch pipeline. Compresses files on Cloud Storage to a specified location.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BulkCompressor {

  /** The logger to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(BulkCompressor.class);

  /** The tag used to identify the main output of the {@link Compressor}. */
  private static final TupleTag<String> COMPRESSOR_MAIN_OUT = new TupleTag<String>() {};

  /** The tag used to identify the dead-letter output of the {@link Compressor}. */
  private static final TupleTag<KV<String, String>> DEADLETTER_TAG =
      new TupleTag<KV<String, String>>() {};

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.txt")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 2,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFile(
        order = 3,
        description = "Output failure file",
        helpText =
            "The error log output file to use for write failures that occur during compression. The contents will be one line for "
                + "each file which failed compression. Note that this parameter will "
                + "allow the pipeline to continue processing in the event of a failure.",
        example = "gs://your-bucket/compressed/failed.csv")
    @Required
    ValueProvider<String> getOutputFailureFile();

    void setOutputFailureFile(ValueProvider<String> value);

    @TemplateParameter.Enum(
        order = 4,
        enumOptions = {"BZIP2", "DEFLATE", "GZIP"},
        description = "Compression",
        helpText =
            "The compression algorithm used to compress the matched files. Valid algorithms: BZIP2, DEFLATE, GZIP")
    @Required
    ValueProvider<Compression> getCompression();

    void setCompression(ValueProvider<Compression> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * BulkCompressor#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *   1) Find all files matching the input pattern
     *   2) Compress the files found and output them to the output directory
     *   3) Write any errors to the failure output file
     */
    PCollectionTuple compressOut =
        pipeline
            .apply("Match File(s)", FileIO.match().filepattern(options.getInputFilePattern()))
            .apply(
                "Compress File(s)",
                ParDo.of(new Compressor(options.getOutputDirectory(), options.getCompression()))
                    .withOutputTags(COMPRESSOR_MAIN_OUT, TupleTagList.of(DEADLETTER_TAG)));

    compressOut
        .get(DEADLETTER_TAG)
        .apply(
            "Format Errors",
            MapElements.into(TypeDescriptors.strings())
                .via(kv -> String.format("%s,%s", kv.getKey(), kv.getValue())))
        .apply(
            "Write Error File",
            TextIO.write()
                .to(options.getOutputFailureFile())
                .withHeader("Filename,Error")
                .withoutSharding());

    return pipeline.run();
  }

  /**
   * The {@link Compressor} accepts {@link MatchResult.Metadata} from the FileSystems API and
   * compresses each file to an output location. Any compression failures which occur during
   * execution will be output to a separate output for further processing.
   */
  @SuppressWarnings("serial")
  public static class Compressor extends DoFn<MatchResult.Metadata, String> {

    private final ValueProvider<String> destinationLocation;
    private final ValueProvider<Compression> compressionValue;

    Compressor(ValueProvider<String> destinationLocation, ValueProvider<Compression> compression) {
      this.destinationLocation = destinationLocation;
      this.compressionValue = compression;
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
      ResourceId inputFile = context.element().resourceId();
      Compression compression = compressionValue.get();

      // Add the compression extension to the output filename. Example: demo.txt -> demo.txt.gz
      String outputFilename = inputFile.getFilename() + compression.getSuggestedSuffix();

      // Resolve the necessary resources to perform the transfer
      ResourceId outputDir = FileSystems.matchNewResource(destinationLocation.get(), true);
      ResourceId outputFile =
          outputDir.resolve(outputFilename, StandardResolveOptions.RESOLVE_FILE);
      ResourceId tempFile =
          outputDir.resolve("temp-" + outputFilename, StandardResolveOptions.RESOLVE_FILE);

      // Perform the copy of the compressed channel to the destination.
      try (ReadableByteChannel readerChannel = FileSystems.open(inputFile)) {
        try (WritableByteChannel writerChannel =
            compression.writeCompressed(FileSystems.create(tempFile, MimeTypes.BINARY))) {

          // Execute the copy to the temporary file
          ByteStreams.copy(readerChannel, writerChannel);
        }

        // Rename the temporary file to the output file
        FileSystems.rename(ImmutableList.of(tempFile), ImmutableList.of(outputFile));

        // Output the path to the compressed file
        context.output(outputFile.toString());
      } catch (IOException e) {
        LOG.error("Error occurred during compression of {}", inputFile.toString(), e);
        context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), e.getMessage()));
      }
    }
  }
}

Bulk Decompress Cloud Storage Files

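The Bulk Decompress Cloud Storage Files template is a batch pipeline that decompresses files on Cloud Storage to a specified location. This functionality is useful when you want to use compressed data during a migration to minimize network bandwidth costs, but want to work with uncompressed data after the migration to maximize analytical processing speed. The pipeline automatically handles multiple compression modes during a single run and determines the decompression mode to use based on the file extension (.bzip2, .deflate, .gz, .zip).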
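
Choosing a decompression mode from the file extension can be sketched as a suffix lookup. The following is a minimal illustration, not the template's own detection logic: `DecompressionMode` is a hypothetical name, the extension list is the one given in the paragraph above, and matching is case-insensitive by assumption.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the template. */
final class DecompressionMode {
  // Extensions as listed in the template description on this page.
  private static final Map<String, String> MODES =
      Map.of(".bzip2", "BZIP2", ".deflate", "DEFLATE", ".gz", "GZIP", ".zip", "ZIP");

  /** Returns the mode whose extension the filename ends with, or empty if none matches. */
  static Optional<String> forFilename(String filename) {
    String lower = filename.toLowerCase(Locale.ROOT);
    return MODES.entrySet().stream()
        .filter(entry -> lower.endsWith(entry.getKey()))
        .map(Map.Entry::getValue)
        .findFirst();
  }

  public static void main(String[] args) {
    System.out.println(forFilename("gs://my-bucket/compressed/data.csv.gz")); // prints Optional[GZIP]
    System.out.println(forFilename("notes.txt"));                             // prints Optional.empty
  }
}
```

Returning an `Optional` leaves it to the caller to decide what to do with files that have no recognized extension, for example recording them in the failure file.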

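Requirements for this pipeline:

The files to decompress must be in one of the following formats: Bzip2, Deflate, Gzip, Zip.
The output directory must exist before you run the pipeline.

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. For example, `gs://bucket-name/compressed/*.gz`.
`outputDirectory`	The output location to write to. For example, `gs://bucket-name/decompressed`.
`outputFailureFile`	The error log output file to use for write failures that occur during the decompression process. For example, `gs://bucket-name/decompressed/failed.csv`. If there are no failures, the file is still created but is empty. The file contents are in CSV format (Filename, Error), with one line for each file that fails decompression.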

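Running the Bulk Decompress Cloud Storage Files template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Decompress Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: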

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/compressed/*.gz,\
outputDirectory=gs://BUCKET_NAME/decompressed,\
outputFailureFile=OUTPUT_FAILURE_FILE_PATH

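Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
BUCKET_NAME: the name of your Cloud Storage bucket
OUTPUT_FAILURE_FILE_PATH: your choice of path to the file that contains failure information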

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Bulk_Decompress_GCS_Files
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/compressed/*.gz",
       "outputDirectory": "gs://BUCKET_NAME/decompressed",
       "outputFailureFile": "OUTPUT_FAILURE_FILE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

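Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
BUCKET_NAME: the name of your Cloud Storage bucket
OUTPUT_FAILURE_FILE_PATH: your choice of path to the file that contains failure information

Template source code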

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BulkDecompressor.Options;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import com.google.common.io.Files;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.annotation.Nullable;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.QuoteMode;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * This pipeline decompresses file(s) from Google Cloud Storage and re-uploads them to a destination
 * location.
 *
 * <p><b>Parameters</b>
 *
 * <p>The {@code --inputFilePattern} parameter specifies a file glob to process. Files found can be
 * expressed in the following formats:
 *
 * <pre>
 * --inputFilePattern=gs://bucket-name/compressed-dir/*
 * --inputFilePattern=gs://bucket-name/compressed-dir/demo*.gz
 * </pre>
 *
 * <p>The {@code --outputDirectory} parameter can be expressed in the following formats:
 *
 * <pre>
 * --outputDirectory=gs://bucket-name
 * --outputDirectory=gs://bucket-name/decompressed-dir
 * </pre>
 *
 * <p>The {@code --outputFailureFile} parameter indicates the file to write the names of the files
 * which failed decompression and their associated error messages. This file can then be used for
 * subsequent processing by another process outside of Dataflow (e.g. send an email with the
 * failures, etc.). If there are no failures, the file will still be created but will be empty. The
 * failure file structure contains both the file that caused the error and the error message in CSV
 * format. The file will contain one header row and two columns (Filename, Error). The filename
 * output to the failureFile will be the full path of the file for ease of debugging.
 *
 * <pre>
 * --outputFailureFile=gs://bucket-name/decompressed-dir/failed.csv
 * </pre>
 *
 * <p>Example Output File:
 *
 * <pre>
 * Filename,Error
 * gs://docs-demo/compressedFile.gz, File is malformed or not compressed in BZIP2 format.
 * </pre>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.BulkDecompressor \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
 * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://${PROJECT_ID}/compressed-dir/*.gz \
 * --outputDirectory=gs://${PROJECT_ID}/decompressed-dir \
 * --outputFailureFile=gs://${PROJECT_ID}/decompressed-dir/failed.csv"
 * </pre>
 */
@Template(
    name = "Bulk_Decompress_GCS_Files",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Decompress Files on Cloud Storage",
    description =
        "A pipeline which decompresses files on Cloud Storage to a specified location. Supported formats: Bzip2, deflate, and gzip.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BulkDecompressor {

  /** The logger to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(BulkDecompressor.class);

  /**
   * A list of the {@link Compression} values excluding {@link Compression#AUTO} and {@link
   * Compression#UNCOMPRESSED}.
   */
  @VisibleForTesting
  static final Set<Compression> SUPPORTED_COMPRESSIONS =
      Stream.of(Compression.values())
          .filter(value -> value != Compression.AUTO && value != Compression.UNCOMPRESSED)
          .collect(Collectors.toSet());

  /** The error msg given when the pipeline matches a file but cannot determine the compression. */
  @VisibleForTesting
  static final String UNCOMPRESSED_ERROR_MSG =
      "Skipping file %s because it did not match any compression mode (%s)";

  /** The error msg given when a file is malformed or not in the detected compression format. */
  @VisibleForTesting
  static final String MALFORMED_ERROR_MSG =
      "The file resource %s is malformed or not in %s compressed format.";

  /** The tag used to identify the main output of the {@link Decompress} DoFn. */
  @VisibleForTesting
  static final TupleTag<String> DECOMPRESS_MAIN_OUT_TAG = new TupleTag<String>() {};

  /** The tag used to identify the dead-letter sideOutput of the {@link Decompress} DoFn. */
  @VisibleForTesting
  static final TupleTag<KV<String, String>> DEADLETTER_TAG = new TupleTag<KV<String, String>>() {};

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.gz")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 2,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/decompressed/")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFile(
        order = 3,
        description = "The output file for failures during the decompression process",
        helpText =
            "The output file to write failures to during the decompression process. If there are no failures, the file will still be created but will be empty. The contents will be one line for each file which failed decompression in CSV format (Filename, Error). Note that this parameter will allow the pipeline to continue processing in the event of a failure.",
        example = "gs://your-bucket/decompressed/failed.csv")
    @Required
    ValueProvider<String> getOutputFailureFile();

    void setOutputFailureFile(ValueProvider<String> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * BulkDecompressor#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    /*
     * Steps:
     *   1) Find all files matching the input pattern
     *   2) Decompress the files found and output them to the output directory
     *   3) Write any errors to the failure output file
     */

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Run the pipeline over the work items.
    PCollectionTuple decompressOut =
        pipeline
            .apply("MatchFile(s)", FileIO.match().filepattern(options.getInputFilePattern()))
            .apply(
                "DecompressFile(s)",
                ParDo.of(new Decompress(options.getOutputDirectory()))
                    .withOutputTags(DECOMPRESS_MAIN_OUT_TAG, TupleTagList.of(DEADLETTER_TAG)));

    decompressOut
        .get(DEADLETTER_TAG)
        .apply(
            "FormatErrors",
            MapElements.into(TypeDescriptors.strings())
                .via(
                    kv -> {
                      StringWriter stringWriter = new StringWriter();
                      try {
                        CSVPrinter printer =
                            new CSVPrinter(
                                stringWriter,
                                CSVFormat.DEFAULT
                                    .withEscape('\\')
                                    .withQuoteMode(QuoteMode.NONE)
                                    .withRecordSeparator('\n'));
                        printer.printRecord(kv.getKey(), kv.getValue());
                      } catch (IOException e) {
                        throw new RuntimeException(e);
                      }

                      return stringWriter.toString();
                    }))

        // We don't expect error files to be large so we'll create a single
        // file for ease of reprocessing by processes outside of Dataflow.
        .apply(
            "WriteErrorFile",
            TextIO.write()
                .to(options.getOutputFailureFile())
                .withHeader("Filename,Error")
                .withoutSharding());

    return pipeline.run();
  }

  /**
   * Performs the decompression of an object on Google Cloud Storage and uploads the decompressed
   * object back to a specified destination location.
   */
  @SuppressWarnings("serial")
  public static class Decompress extends DoFn<MatchResult.Metadata, String> {

    private final ValueProvider<String> destinationLocation;

    Decompress(ValueProvider<String> destinationLocation) {
      this.destinationLocation = destinationLocation;
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
      ResourceId inputFile = context.element().resourceId();

      // Output a record to the failure file if the file doesn't match a known compression.
      if (!Compression.AUTO.isCompressed(inputFile.toString())) {
        String errorMsg =
            String.format(UNCOMPRESSED_ERROR_MSG, inputFile.toString(), SUPPORTED_COMPRESSIONS);

        context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), errorMsg));
      } else {
        try {
          ResourceId outputFile = decompress(inputFile);
          context.output(outputFile.toString());
        } catch (IOException e) {
          LOG.error(e.getMessage());
          context.output(DEADLETTER_TAG, KV.of(inputFile.toString(), e.getMessage()));
        }
      }
    }

    /**
     * Decompresses the inputFile using the specified compression and outputs to the main output of
     * the {@link Decompress} DoFn. Files are first written to the output directory as temp files
     * named {@code <extension>-temp-<filename>}. If a file fails decompression, the
     * filename and the associated error will be output to the dead-letter.
     *
     * @param inputFile The inputFile to decompress.
     * @return A {@link ResourceId} which points to the resulting file from the decompression.
     */
    private ResourceId decompress(ResourceId inputFile) throws IOException {
      // Remove the compressed extension from the file. Example: demo.txt.gz -> demo.txt
      String outputFilename = Files.getNameWithoutExtension(inputFile.toString());

      // Resolve the necessary resources to perform the transfer.
      ResourceId outputDir = FileSystems.matchNewResource(destinationLocation.get(), true);
      ResourceId outputFile =
          outputDir.resolve(outputFilename, StandardResolveOptions.RESOLVE_FILE);
      ResourceId tempFile =
          outputDir.resolve(
              Files.getFileExtension(inputFile.toString()) + "-temp-" + outputFilename,
              StandardResolveOptions.RESOLVE_FILE);

      // Resolve the compression
      Compression compression = Compression.detect(inputFile.toString());

      // Perform the copy of the decompressed channel into the destination.
      try (ReadableByteChannel readerChannel =
          compression.readDecompressed(FileSystems.open(inputFile))) {
        try (WritableByteChannel writerChannel = FileSystems.create(tempFile, MimeTypes.TEXT)) {
          ByteStreams.copy(readerChannel, writerChannel);
        }

        // Rename the temp file to the output file.
        FileSystems.rename(
            ImmutableList.of(tempFile),
            ImmutableList.of(outputFile),
            MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
      } catch (IOException e) {
        String msg = e.getMessage();

        LOG.error("Error occurred during decompression of {}", inputFile.toString(), e);
        throw new IOException(sanitizeDecompressionErrorMsg(msg, inputFile, compression));
      }

      return outputFile;
    }

    /**
     * The error messages coming from the compression library are not consistent across compression
     * modes. Here we'll attempt to unify the messages to inform the user more clearly when we've
     * encountered a file which is not compressed or malformed. Note that GZIP and ZIP compression
     * modes will not throw an exception when a decompression is attempted on a file which is not
     * compressed.
     *
     * @param errorMsg The error message thrown during decompression.
     * @param inputFile The input file which failed decompression.
     * @param compression The compression mode used during decompression.
     * @return The sanitized error message. If the error was not from a malformed file, the same
     *     error message passed will be returned (if not null) or an empty string will be returned
     *     (if null).
     */
    private String sanitizeDecompressionErrorMsg(
        @Nullable String errorMsg, ResourceId inputFile, Compression compression) {
      if (errorMsg != null
          && (errorMsg.contains("not in the BZip2 format")
              || errorMsg.contains("incorrect header check"))) {
        errorMsg = String.format(MALFORMED_ERROR_MSG, inputFile.toString(), compression);
      }

      return errorMsg == null ? "" : errorMsg;
    }
  }
}
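
The temp-file-then-rename step in `decompress` can be tried locally without Beam. The following is a minimal sketch, assuming gzip input on the local filesystem. It swaps in `java.util.zip.GZIPInputStream` and `java.nio.file.Files` for Beam's `Compression` and `FileSystems`, and the names `LocalDecompressSketch`, `stripExtension`, and `decompressGzip` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Local-filesystem sketch of the temp-file-then-rename pattern; not the template's code. */
final class LocalDecompressSketch {

  /** Drops the last extension from a file name, e.g. demo.txt.gz -> demo.txt. */
  static String stripExtension(String fileName) {
    int dot = fileName.lastIndexOf('.');
    return dot < 0 ? fileName : fileName.substring(0, dot);
  }

  /** Decompresses a gzip file into outputDir via a temp file, then renames it into place. */
  static Path decompressGzip(Path input, Path outputDir) throws IOException {
    String outputName = stripExtension(input.getFileName().toString());
    Path outputFile = outputDir.resolve(outputName);
    // Mirrors the "<extension>-temp-<filename>" naming the template uses, here for gzip only.
    Path tempFile = outputDir.resolve("gz-temp-" + outputName);

    try (InputStream raw = Files.newInputStream(input);
        InputStream in = new GZIPInputStream(raw);
        OutputStream out = Files.newOutputStream(tempFile)) {
      in.transferTo(out);
    } catch (IOException e) {
      // Clean up the partial temp file before reporting the failure.
      Files.deleteIfExists(tempFile);
      throw e;
    }
    // Rename only after the copy succeeds, so a failed copy never leaves a partial output file.
    return Files.move(tempFile, outputFile, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("decompress-sketch");
    Path compressed = dir.resolve("demo.txt.gz");
    try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(compressed))) {
      out.write("hello".getBytes(StandardCharsets.UTF_8));
    }
    Path result = decompressGzip(compressed, dir);
    System.out.println(result.getFileName() + ": " + Files.readString(result));
  }
}
```

Writing to a temp name first means that a consumer listing the output directory sees either no output file or a complete one.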

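Datastore Bulk Delete [Deprecated]

This template is deprecated and will be removed in Q1 2022. Please migrate to the Firestore Bulk Delete template.

The Datastore Bulk Delete template is a pipeline that reads entities from Datastore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Datastore entities to your JavaScript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

Datastore must be set up in the project before running the template.
If reading from and deleting in separate Datastore instances, the Dataflow worker service account must have permission to read from one instance and delete from the other.

Template parameters

Parameter	Description
`datastoreReadGqlQuery`	GQL query that specifies which entities to match for deletion. Using a keys-only query may improve performance. For example: "SELECT __key__ FROM MyKind".
`datastoreReadProjectId`	Project ID of the Datastore instance from which you want to read the entities (using your GQL query) that are used for matching.
`datastoreDeleteProjectId`	Project ID of the Datastore instance from which to delete the matching entities. This can be the same as `datastoreReadProjectId` if you want to read and delete within the same Datastore instance.
`datastoreReadNamespace`	(Optional) Namespace of the requested entities. The default namespace is set as "".
`datastoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Datastore ramp-up throttling step. Default is `500`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples. If this function returns a value of undefined or null for a given Datastore entity, then that entity is not deleted.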
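
The filtering rule in the last row can be modeled in a few lines. The sketch below is plain Java, not the template's JavaScript runtime. `UdfFilterSketch`, `selectForDeletion`, and the sample `udf` are hypothetical names, and the UDF is represented as a `Function<String, String>` that returns null for entities that should be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Models the documented rule: a null UDF result means the entity is not deleted. */
final class UdfFilterSketch {

  /** Returns the JSON entities that would be passed on to the delete step. */
  static List<String> selectForDeletion(List<String> entityJson, Function<String, String> udf) {
    List<String> toDelete = new ArrayList<>();
    for (String json : entityJson) {
      String result = udf.apply(json);
      if (result != null) { // null (or undefined in JavaScript) spares the entity
        toDelete.add(result);
      }
    }
    return toDelete;
  }

  public static void main(String[] args) {
    // Hypothetical UDF: spare any entity whose JSON mentions "keep".
    Function<String, String> udf = json -> json.contains("keep") ? null : json;
    List<String> entities = List.of("{\"name\":\"old\"}", "{\"name\":\"keep-me\"}");
    System.out.println(selectForDeletion(entities, udf));
  }
}
```

In other words, the UDF acts as an allow-list for deletion: only entities for which it returns a non-null value reach the delete step.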

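Running the Datastore Bulk Delete template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Delete Entities in Datastore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: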

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="GQL_QUERY",\
datastoreReadProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID,\
datastoreDeleteProjectId=DATASTORE_READ_AND_DELETE_PROJECT_ID

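Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
GQL_QUERY: the query used to match entities for deletion
DATASTORE_READ_AND_DELETE_PROJECT_ID: your Datastore instance project ID. This example reads from and deletes in the same Datastore instance.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.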

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_Datastore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "GQL_QUERY",
       "datastoreReadProjectId": "DATASTORE_READ_AND_DELETE_PROJECT_ID",
       "datastoreDeleteProjectId": "DATASTORE_READ_AND_DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}

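Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
GQL_QUERY: the query used to match entities for deletion
DATASTORE_READ_AND_DELETE_PROJECT_ID: your Datastore instance project ID. This example reads from and deletes in the same Datastore instance.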
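
For callers that prefer to assemble the request in code, the following is a minimal sketch that builds (but does not send) the POST request using the JDK's `java.net.http` API. It assumes you already hold an OAuth 2.0 access token, that none of the values need JSON escaping, and it leaves out the optional `environment` block. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds, but does not send, a templates:launch request for the Datastore delete template. */
final class LaunchRequestSketch {

  static URI launchUri(String projectId, String location, String version) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
            + "/Datastore_to_Datastore_Delete");
  }

  /** Assumes the values contain no characters that need JSON escaping. */
  static String launchBody(String jobName, String gqlQuery, String datastoreProjectId) {
    return String.format(
        "{\"jobName\":\"%s\",\"parameters\":{"
            + "\"datastoreReadGqlQuery\":\"%s\","
            + "\"datastoreReadProjectId\":\"%s\","
            + "\"datastoreDeleteProjectId\":\"%s\"}}",
        jobName, gqlQuery, datastoreProjectId, datastoreProjectId);
  }

  /** The access token is supplied by the caller; obtaining it is out of scope here. */
  static HttpRequest launchRequest(URI uri, String body, String accessToken) {
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    URI uri = launchUri("my-project", "us-central1", "latest");
    String body = launchBody("bulk-delete-job", "SELECT __key__ FROM MyKind", "my-project");
    HttpRequest request = launchRequest(uri, body, "ACCESS_TOKEN_PLACEHOLDER");
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

For real use, a JSON library is a safer way to build the body, since GQL queries often contain quotes.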

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToDatastoreDelete.DatastoreToDatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteEntityJson;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/** Dataflow template which deletes pulled Datastore Entities. */
@Template(
    name = "Datastore_to_Datastore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Datastore [Deprecated]",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Datastore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "firestoreReadGqlQuery",
      "firestoreReadProjectId",
      "firestoreReadNamespace",
      "firestoreDeleteProjectId",
      "firestoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_Firestore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Firestore (Datastore mode)",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Firestore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "datastoreReadGqlQuery",
      "datastoreReadProjectId",
      "datastoreReadNamespace",
      "datastoreDeleteProjectId",
      "datastoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToDatastoreDelete {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToDatastoreDeleteOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreDeleteOptions {}

  /**
   * Runs a pipeline which reads in Entities from datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and deletes all the Entities.
   *
   * <p>If the UDF returns value of undefined or null for a given Entity, then that Entity will not
   * be deleted.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToDatastoreDeleteOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DatastoreToDatastoreDeleteOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            DatastoreDeleteEntityJson.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreDeleteProjectId(),
                        options.getFirestoreDeleteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .build());

    pipeline.run();
  }
}

Firestore Bulk Delete

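The Firestore Bulk Delete template is a pipeline that reads entities from Firestore with a given GQL query and then deletes all matching entities in the selected target project. The pipeline can optionally pass the JSON-encoded Firestore entities to your JavaScript UDF, which you can use to filter out entities by returning null values.

Requirements for this pipeline:

Firestore must be set up in the project before running the template.
If reading from and deleting in separate Firestore instances, the Dataflow worker service account must have permission to read from one instance and delete from the other.

Template parameters

Parameter	Description
`firestoreReadGqlQuery`	GQL query that specifies which entities to match for deletion. Using a keys-only query may improve performance. For example: "SELECT __key__ FROM MyKind".
`firestoreReadProjectId`	Project ID of the Firestore instance from which you want to read the entities (using your GQL query) that are used for matching.
`firestoreDeleteProjectId`	Project ID of the Firestore instance from which to delete the matching entities. This can be the same as `firestoreReadProjectId` if you want to read and delete within the same Firestore instance.
`firestoreReadNamespace`	(Optional) Namespace of the requested entities. The default namespace is set as "".
`firestoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Firestore ramp-up throttling step. Default is `500`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples. If this function returns a value of undefined or null for a given Firestore entity, then that entity is not deleted.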

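Running the Firestore Bulk Delete template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Bulk Delete Entities in Firestore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: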

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_Firestore_Delete \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="GQL_QUERY",\
firestoreReadProjectId=FIRESTORE_READ_AND_DELETE_PROJECT_ID,\
firestoreDeleteProjectId=FIRESTORE_READ_AND_DELETE_PROJECT_ID

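Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
GQL_QUERY: the query used to match entities for deletion
FIRESTORE_READ_AND_DELETE_PROJECT_ID: your Firestore instance project ID. This example reads from and deletes in the same Firestore instance.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.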

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_Firestore_Delete
{
   "jobName": "JOB_NAME",
   "parameters": {
       "firestoreReadGqlQuery": "GQL_QUERY",
       "firestoreReadProjectId": "FIRESTORE_READ_AND_DELETE_PROJECT_ID",
       "firestoreDeleteProjectId": "FIRESTORE_READ_AND_DELETE_PROJECT_ID"
   },
   "environment": { "zone": "us-central1-f" }
}

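Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
GQL_QUERY: the query used to match entities for deletion
FIRESTORE_READ_AND_DELETE_PROJECT_ID: your Firestore instance project ID. This example reads from and deletes in the same Firestore instance.

Template source code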

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToDatastoreDelete.DatastoreToDatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteEntityJson;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreDeleteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/** Dataflow template which deletes pulled Datastore Entities. */
@Template(
    name = "Datastore_to_Datastore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Datastore [Deprecated]",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Datastore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "firestoreReadGqlQuery",
      "firestoreReadProjectId",
      "firestoreReadNamespace",
      "firestoreDeleteProjectId",
      "firestoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_Firestore_Delete",
    category = TemplateCategory.UTILITIES,
    displayName = "Bulk Delete Entities in Firestore (Datastore mode)",
    description =
        "A pipeline which reads in Entities (via a GQL query) from Firestore, optionally passes in the JSON encoded Entities to a JavaScript UDF, and then deletes all matching Entities in the selected target project.",
    optionsClass = DatastoreToDatastoreDeleteOptions.class,
    skipOptions = {
      "datastoreReadGqlQuery",
      "datastoreReadProjectId",
      "datastoreReadNamespace",
      "datastoreDeleteProjectId",
      "datastoreHintNumWorkers"
    },
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToDatastoreDelete {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToDatastoreDeleteOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreDeleteOptions {}

  /**
   * Runs a pipeline which reads in Entities from datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and deletes all the Entities.
   *
   * <p>If the UDF returns value of undefined or null for a given Entity, then that Entity will not
   * be deleted.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToDatastoreDeleteOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DatastoreToDatastoreDeleteOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            DatastoreDeleteEntityJson.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreDeleteProjectId(),
                        options.getFirestoreDeleteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .build());

    pipeline.run();
  }
}

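Streaming Data Generator to Pub/Sub, BigQuery, and Cloud Storage

The Streaming Data Generator template generates either an unlimited or a fixed number of synthetic records or messages, based on a user-provided schema, at the specified rate. Compatible destinations include Pub/Sub topics, BigQuery tables, and Cloud Storage buckets.

The following are a few possible use cases:

Simulate large-scale real-time event publishing to a Pub/Sub topic to measure and determine the number and size of consumers required to process the published events.
Generate synthetic data for a BigQuery table or a Cloud Storage bucket to evaluate performance benchmarks or to serve as a proof of concept.

Supported sinks and encoding formats

The following table describes which sinks and encoding formats this template supports:

	JSON	Avro	Parquet
Pub/Sub	Yes	Yes	No
BigQuery	Yes	No	No
Cloud Storage	Yes	Yes	Yes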
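
The support table can be encoded as a small lookup, which is handy for rejecting an unsupported combination before launching a job. This is an illustrative sketch, not part of the template: the class name `SinkEncodingMatrix` is hypothetical, and the enum constants reuse the `sinkType` and `outputType` values listed in the template parameters.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Encodes the sink/encoding support table as a lookup. */
final class SinkEncodingMatrix {

  enum Sink { PUBSUB, BIGQUERY, GCS }

  enum Encoding { JSON, AVRO, PARQUET }

  private static final Map<Sink, Set<Encoding>> SUPPORTED = new EnumMap<>(Sink.class);

  static {
    SUPPORTED.put(Sink.PUBSUB, EnumSet.of(Encoding.JSON, Encoding.AVRO));
    SUPPORTED.put(Sink.BIGQUERY, EnumSet.of(Encoding.JSON));
    SUPPORTED.put(Sink.GCS, EnumSet.allOf(Encoding.class));
  }

  /** Returns true when the table marks the combination as supported. */
  static boolean isSupported(Sink sink, Encoding encoding) {
    return SUPPORTED.get(sink).contains(encoding);
  }

  public static void main(String[] args) {
    // Print the full matrix, one sink/encoding pair per line.
    for (Sink sink : Sink.values()) {
      for (Encoding encoding : Encoding.values()) {
        System.out.printf("%-8s %-7s %s%n", sink, encoding, isSupported(sink, encoding));
      }
    }
  }
}
```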

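Requirements for this pipeline:

Create a schema file that contains a JSON template for the generated data. This template uses the JSON Data Generator library, so you can provide various faker functions for each field in the schema. For more information, see the json-data-generator documentation.

For example: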
```
{
  "id": {{integer(0,1000)}},
  "name": "{{uuid()}}",
  "isInStock": {{bool()}}
}
```
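Upload the schema file to a Cloud Storage bucket.
The output target must exist before the pipeline runs. The target must be a Pub/Sub topic, a BigQuery table, or a Cloud Storage bucket, depending on the sink type.
If the output encoding is Avro or Parquet, create an Avro schema file and store it in a Cloud Storage location.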
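
To get a feel for what one generated message looks like, the placeholders in the example schema can be expanded with a toy renderer. The sketch below is NOT the json-data-generator library and does not reproduce its behavior. It handles only the three functions used in the example, and the inclusive upper bound for `integer` is an assumption. All names are hypothetical.

```java
import java.util.Random;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy renderer for three placeholder functions; it is NOT the json-data-generator library. */
final class SchemaTemplateSketch {

  private static final Pattern PLACEHOLDER =
      Pattern.compile("\\{\\{(integer\\((\\d+),(\\d+)\\)|uuid\\(\\)|bool\\(\\))\\}\\}");

  /** Replaces each supported {{...}} placeholder with a random value. */
  static String render(String template, Random random) {
    Matcher matcher = PLACEHOLDER.matcher(template);
    return matcher.replaceAll(
        match -> {
          String call = match.group(1);
          if (call.startsWith("integer")) {
            int min = Integer.parseInt(match.group(2));
            int max = Integer.parseInt(match.group(3));
            // Assumption: both bounds are inclusive.
            return Integer.toString(min + random.nextInt(max - min + 1));
          }
          if (call.startsWith("uuid")) {
            // UUID-shaped string built from random bits; not a standards-compliant v4 UUID.
            return new UUID(random.nextLong(), random.nextLong()).toString();
          }
          return Boolean.toString(random.nextBoolean());
        });
  }

  public static void main(String[] args) {
    String template =
        "{\n"
            + "  \"id\": {{integer(0,1000)}},\n"
            + "  \"name\": \"{{uuid()}}\",\n"
            + "  \"isInStock\": {{bool()}}\n"
            + "}";
    System.out.println(render(template, new Random()));
  }
}
```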

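Template parameters

Parameter	Description
`schemaLocation`	Location of the schema file. For example: `gs://mybucket/filename.json`.
`qps`	Number of messages to publish per second. For example: `100`.
`sinkType`	(Optional) Output sink type. Possible values are `PUBSUB`, `BIGQUERY`, and `GCS`. Default is `PUBSUB`.
`outputType`	(Optional) Output encoding type. Possible values are `JSON`, `AVRO`, and `PARQUET`. Default is `JSON`.
`avroSchemaLocation`	(Optional) Location of the Avro schema file. Required when `outputType` is `AVRO` or `PARQUET`. For example: `gs://mybucket/filename.avsc`.
`topic`	(Optional) Name of the Pub/Sub topic to which the pipeline should publish data. Required when `sinkType` is `PUBSUB`. For example: `projects/my-project-ID/topics/my-topic-ID`.
`outputTableSpec`	(Optional) Name of the output BigQuery table. Required when `sinkType` is `BIGQUERY`. For example: `my-project-ID:my_dataset_name.my-table-name`.
`writeDisposition`	(Optional) BigQuery write disposition. Possible values are `WRITE_APPEND`, `WRITE_EMPTY`, and `WRITE_TRUNCATE`. Default is `WRITE_APPEND`.
`outputDeadletterTable`	(Optional) Name of the output BigQuery table that holds failed records. If not provided, the pipeline creates a table named `{output_table_name}_error_records` during execution. For example: `my-project-ID:my_dataset_name.my-table-name`.
`outputDirectory`	(Optional) Path of the output Cloud Storage location. Required when `sinkType` is `GCS`. For example: `gs://mybucket/pathprefix/`.
`outputFilenamePrefix`	(Optional) File name prefix of the output files written to Cloud Storage. Default is `output-`.
`windowDuration`	(Optional) Window interval at which output is written to Cloud Storage. Default is `1m` (1 minute).
`numShards`	(Optional) Maximum number of output shards. Required when `sinkType` is `GCS`, and should be set to 1 or higher.
`messagesLimit`	(Optional) Maximum number of output messages. Default is `0`, which means unlimited.
`autoscalingAlgorithm`	(Optional) Algorithm used for autoscaling the workers. Possible values are `THROUGHPUT_BASED` to enable autoscaling or `NONE` to disable it.
`maxNumWorkers`	(Optional) Maximum number of worker machines. For example: `10`.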
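
Several parameters in the table are optional in general but required for a particular sink or encoding. The following sketch collects those conditional rules in one place so that a parameter set can be checked before a job is submitted. It is plain Java under stated assumptions: `GeneratorParameterCheck` and `missingParameters` are hypothetical names, and the rules are taken only from the table, not from the template's own validation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check based on the parameter table; not part of the template. */
final class GeneratorParameterCheck {

  /** Returns the names of parameters the table marks as required but that are absent. */
  static List<String> missingParameters(Map<String, String> params) {
    List<String> missing = new ArrayList<>();
    require(params, "schemaLocation", missing);
    require(params, "qps", missing);

    // sinkType defaults to PUBSUB when it is not set.
    String sinkType = params.getOrDefault("sinkType", "PUBSUB");
    switch (sinkType) {
      case "PUBSUB":
        require(params, "topic", missing);
        break;
      case "BIGQUERY":
        require(params, "outputTableSpec", missing);
        break;
      case "GCS":
        require(params, "outputDirectory", missing);
        require(params, "numShards", missing);
        break;
      default:
        throw new IllegalArgumentException("Unknown sinkType: " + sinkType);
    }

    // outputType defaults to JSON; AVRO and PARQUET need an Avro schema file.
    String outputType = params.getOrDefault("outputType", "JSON");
    if (outputType.equals("AVRO") || outputType.equals("PARQUET")) {
      require(params, "avroSchemaLocation", missing);
    }
    return missing;
  }

  private static void require(Map<String, String> params, String name, List<String> missing) {
    String value = params.get(name);
    if (value == null || value.isEmpty()) {
      missing.add(name);
    }
  }

  public static void main(String[] args) {
    Map<String, String> params =
        Map.of(
            "schemaLocation", "gs://mybucket/filename.json",
            "qps", "100",
            "sinkType", "GCS",
            "outputType", "AVRO");
    // Prints [outputDirectory, numShards, avroSchemaLocation]
    System.out.println(missingParameters(params));
  }
}
```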

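Running the Streaming Data Generator template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Streaming Data Generator template.
In the provided parameter fields, enter your parameter values.
Click Run job.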

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Streaming_Data_Generator \
    --parameters \
schemaLocation=SCHEMA_LOCATION,\
qps=QPS,\
topic=PUBSUB_TOPIC

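Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest templates might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use templates saved in the most recent dated parent folder.
SCHEMA_LOCATION: the path to the schema file in Cloud Storage. For example: gs://mybucket/filename.json.
QPS: the number of messages to publish per second
PUBSUB_TOPIC: the output Pub/Sub topic. For example: projects/my-project-ID/topics/my-topic-ID.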
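
The gcloud command passes template parameters as one comma-separated list of `name=value` pairs. To target a different sink, the same list carries the sink-specific parameters instead of `topic`. The sketch below assembles such a list for a BigQuery sink. The `ParametersFlag` class is a hypothetical helper, the values are the placeholder examples from the parameter table, and values that contain a comma are rejected rather than escaped (gcloud has its own escaping rules for such values, which this sketch does not implement).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that formats the value of the --parameters flag. */
final class ParametersFlag {

  /** Joins the entries as name=value pairs separated by commas, in insertion order. */
  static String join(Map<String, String> params) {
    for (Map.Entry<String, String> entry : params.entrySet()) {
      if (entry.getValue().contains(",")) {
        throw new IllegalArgumentException(
            "Value for " + entry.getKey() + " contains a comma, which this sketch does not escape");
      }
    }
    return params.entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("schemaLocation", "gs://mybucket/filename.json");
    params.put("qps", "100");
    params.put("sinkType", "BIGQUERY");
    params.put("outputTableSpec", "my-project-ID:my_dataset_name.my-table-name");
    // Prints the string to place after --parameters.
    System.out.println(join(params));
  }
}
```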

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.locations.flexTemplates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "schemaLocation": "SCHEMA_LOCATION",
          "qps": "QPS",
          "topic": "PUBSUB_TOPIC"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Streaming_Data_Generator"
   }
}

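Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest templates might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use templates saved in the most recent dated parent folder.
SCHEMA_LOCATION: the path to the schema file in Cloud Storage. For example: gs://mybucket/filename.json.
QPS: the number of messages to publish per second
PUBSUB_TOPIC: the output Pub/Sub topic. For example: projects/my-project-ID/topics/my-topic-ID.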
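
The POST request can also be assembled programmatically. The following sketch uses only the JDK's `java.net.http` package (Java 11 or later) to build, but not send, the request. `LaunchRequestSketch` is a hypothetical helper rather than a Google client library, the inputs are assumed not to need JSON escaping, and the bearer token is assumed to come from somewhere such as `gcloud auth print-access-token`.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for illustration; not part of the template or a Google client library. */
final class LaunchRequestSketch {

  /** Builds the JSON request body. Inputs are assumed not to need JSON escaping. */
  static String body(
      String jobName, String version, String schemaLocation, long qps, String topic) {
    return "{\n"
        + "  \"launch_parameter\": {\n"
        + "    \"jobName\": \"" + jobName + "\",\n"
        + "    \"parameters\": {\n"
        + "      \"schemaLocation\": \"" + schemaLocation + "\",\n"
        + "      \"qps\": \"" + qps + "\",\n"
        + "      \"topic\": \"" + topic + "\"\n"
        + "    },\n"
        + "    \"containerSpecGcsPath\": \"gs://dataflow-templates/" + version
        + "/flex/Streaming_Data_Generator\"\n"
        + "  }\n"
        + "}";
  }

  /** Builds the POST request for the flexTemplates:launch endpoint without sending it. */
  static HttpRequest build(String projectId, String location, String accessToken, String body) {
    URI uri =
        URI.create(
            "https://dataflow.googleapis.com/v1b3/projects/"
                + projectId
                + "/locations/"
                + location
                + "/flexTemplates:launch");
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    String body =
        body(
            "my-job",
            "latest",
            "gs://mybucket/filename.json",
            100,
            "projects/my-project-ID/topics/my-topic-ID");
    // ACCESS_TOKEN is a placeholder; supply a real OAuth 2.0 access token before sending.
    HttpRequest request = build("my-project-ID", "us-central1", "ACCESS_TOKEN", body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

To send the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and inspect the response status and body.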

Template source code

Java

/*
 * Copyright (C) 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;
import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkNotNull;

import com.github.vincentrussell.json.datagenerator.JsonDataGenerator;
import com.github.vincentrussell.json.datagenerator.JsonDataGeneratorException;
import com.github.vincentrussell.json.datagenerator.impl.JsonDataGeneratorImpl;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.StreamingDataGenerator.StreamingDataGeneratorOptions;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToBigQuery;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToGcs;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToJdbc;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToPubSub;
import com.google.cloud.teleport.v2.transforms.StreamingDataGeneratorWriteToSpanner;
import com.google.cloud.teleport.v2.utils.DurationUtils;
import com.google.cloud.teleport.v2.utils.GCSUtils;
import com.google.cloud.teleport.v2.utils.MetadataValidator;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.annotation.Nonnull;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.annotations.VisibleForTesting;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link StreamingDataGenerator} is a streaming pipeline which generates messages at a
 * specified rate to either Pub/Sub topic or BigQuery/GCS. The messages are generated according to a
 * schema template which instructs the pipeline how to populate the messages with fake data
 * compliant to constraints.
 *
 * <p>The number of workers executing the pipeline must be large enough to support the supplied QPS.
 * Use a general rule of 2,500 QPS per core in the worker pool.
 *
 * <p>See <a href="https://github.com/vincentrussell/json-data-generator">json-data-generator</a>
 * for instructions on how to construct the schema file.
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 * SCHEMA_LOCATION=gs://{bucket}/{path}/{to}/game-event-schema.json
 * PUBSUB_TOPIC=projects/{project-id}/topics/{topic-id}
 * QPS=2500
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create a template spec containing the details of image location and metadata in GCS
 *   as specified in README.md file
 *
 * # Execute template:
 * JOB_NAME={job-name}
 * PROJECT={project-id}
 * TEMPLATE_SPEC_GCSPATH=gs://path/to/template-spec
 * SCHEMA_LOCATION=gs://path/to/schema.json
 * PUBSUB_TOPIC=projects/$PROJECT/topics/{topic-name}
 * QPS=1
 *
 * gcloud beta dataflow flex-template run $JOB_NAME \
 *         --project=$PROJECT --region=us-central1 --flex-template  \
 *         --template-file-gcs-location=$TEMPLATE_SPEC_GCSPATH \
 *         --parameters autoscalingAlgorithm="THROUGHPUT_BASED",schemaLocation=$SCHEMA_LOCATION,topic=$PUBSUB_TOPIC,qps=$QPS,maxNumWorkers=3
 *
 * </pre>
 */
@Template(
    name = "Streaming_Data_Generator",
    category = TemplateCategory.UTILITIES,
    displayName = "Streaming Data Generator",
    description =
        "A pipeline to publish messages at specified QPS. This template can be used to benchmark"
            + " performance of streaming pipelines.",
    optionsClass = StreamingDataGeneratorOptions.class,
    flexContainerName = "streaming-data-generator",
    contactInformation = "https://cloud.google.com/support")
public class StreamingDataGenerator {

  private static final Logger logger = LoggerFactory.getLogger(StreamingDataGenerator.class);

  /**
   * The {@link StreamingDataGeneratorOptions} class provides the custom execution options passed by
   * the executor at the command-line.
   */
  public interface StreamingDataGeneratorOptions extends PipelineOptions {
    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[1-9][0-9]*$"},
        description = "Required output rate",
        helpText = "Indicates rate of messages per second to be published to Pub/Sub")
    @Required
    Long getQps();

    void setQps(Long value);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"GAME_EVENT"},
        optional = true,
        description = "Schema template to generate fake data",
        helpText = "Pre-existing schema template to use. The value must be one of: [GAME_EVENT]")
    SchemaTemplate getSchemaTemplate();

    void setSchemaTemplate(SchemaTemplate value);

    @TemplateParameter.GcsReadFile(
        order = 3,
        optional = true,
        description = "Location of Schema file to generate fake data",
        helpText = "Cloud Storage path of schema location.",
        example = "gs://<bucket-name>/prefix")
    String getSchemaLocation();

    void setSchemaLocation(String value);

    @TemplateParameter.PubsubTopic(
        order = 4,
        optional = true,
        description = "Output Pub/Sub topic",
        helpText = "The name of the topic to which the pipeline should publish data.",
        example = "projects/<project-id>/topics/<topic-name>")
    String getTopic();

    void setTopic(String value);

    @TemplateParameter.Long(
        order = 5,
        optional = true,
        description = "Maximum number of output Messages",
        helpText =
            "Indicates maximum number of output messages to be generated. 0 means unlimited.")
    @Default.Long(0L)
    Long getMessagesLimit();

    void setMessagesLimit(Long value);

    @TemplateParameter.Enum(
        order = 6,
        enumOptions = {"AVRO", "JSON", "PARQUET"},
        optional = true,
        description = "Output Encoding Type",
        helpText = "The message Output type. Default is JSON.")
    @Default.Enum("JSON")
    OutputType getOutputType();

    void setOutputType(OutputType value);

    @TemplateParameter.GcsReadFile(
        order = 7,
        optional = true,
        description = "Location of Avro Schema file",
        helpText =
            "Cloud Storage path of Avro schema location. Mandatory when output type is AVRO or"
                + " PARQUET.",
        example = "gs://your-bucket/your-path/schema.avsc")
    String getAvroSchemaLocation();

    void setAvroSchemaLocation(String value);

    @TemplateParameter.Enum(
        order = 8,
        enumOptions = {"BIGQUERY", "GCS", "PUBSUB", "JDBC", "SPANNER"},
        optional = true,
        description = "Output Sink Type",
        helpText = "The message Sink type. Default is PUBSUB")
    @Default.Enum("PUBSUB")
    SinkType getSinkType();

    void setSinkType(SinkType value);

    @TemplateParameter.BigQueryTable(
        order = 9,
        optional = true,
        description = "Output BigQuery table",
        helpText = "Output BigQuery table. Mandatory when sinkType is BIGQUERY",
        example = "<project>:<dataset>.<table_name>")
    String getOutputTableSpec();

    void setOutputTableSpec(String value);

    @TemplateParameter.Enum(
        order = 10,
        enumOptions = {"WRITE_APPEND", "WRITE_EMPTY", "WRITE_TRUNCATE"},
        optional = true,
        description = "Write Disposition to use for BigQuery",
        helpText =
            "BigQuery WriteDisposition. For example, WRITE_APPEND, WRITE_EMPTY or WRITE_TRUNCATE.")
    @Default.String("WRITE_APPEND")
    String getWriteDisposition();

    void setWriteDisposition(String writeDisposition);

    @TemplateParameter.BigQueryTable(
        order = 11,
        optional = true,
        description = "The dead-letter table name to output failed messages to BigQuery",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. If it doesn't exist, it will"
                + " be created during pipeline execution.",
        example = "your-project-id:your-dataset.your-table-name")
    String getOutputDeadletterTable();

    void setOutputDeadletterTable(String outputDeadletterTable);

    @TemplateParameter.Duration(
        order = 12,
        optional = true,
        description = "Window duration",
        helpText =
            "The window duration/size in which data will be written to Cloud Storage. Allowed"
                + " formats are: Ns (for seconds, example: 5s), Nm (for minutes, example: 12m), Nh"
                + " (for hours, example: 2h).",
        example = "1m")
    @Default.String("1m")
    String getWindowDuration();

    void setWindowDuration(String windowDuration);

    @TemplateParameter.GcsWriteFolder(
        order = 13,
        optional = true,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime"
                + " formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path/")
    String getOutputDirectory();

    void setOutputDirectory(String outputDirectory);

    @TemplateParameter.Text(
        order = 14,
        optional = true,
        description = "Output filename prefix of the files to write",
        helpText = "The prefix to place on each windowed file.",
        example = "output-")
    @Default.String("output-")
    String getOutputFilenamePrefix();

    void setOutputFilenamePrefix(String outputFilenamePrefix);

    @TemplateParameter.Integer(
        order = 15,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of shards"
                + " means higher throughput for writing to Cloud Storage, but potentially higher"
                + " data aggregation cost across shards when processing output Cloud Storage files."
                + " Default value is decided by the runner.")
    @Default.Integer(0)
    Integer getNumShards();

    void setNumShards(Integer numShards);

    @TemplateParameter.Text(
        order = 16,
        optional = true,
        regexes = {"^.+$"},
        description = "JDBC driver class name.",
        helpText = "JDBC driver class name to use.",
        example = "com.mysql.jdbc.Driver")
    String getDriverClassName();

    void setDriverClassName(String driverClassName);

    @TemplateParameter.Text(
        order = 17,
        optional = true,
        regexes = {
          "(^jdbc:[a-zA-Z0-9/:@.?_+!*=&-;]+$)|(^([A-Za-z0-9+/]{4}){1,}([A-Za-z0-9+/]{0,3})={0,3})"
        },
        description = "JDBC connection URL string.",
        helpText = "Url connection string to connect to the JDBC source.",
        example = "jdbc:mysql://some-host:3306/sampledb")
    String getConnectionUrl();

    void setConnectionUrl(String connectionUrl);

    @TemplateParameter.Text(
        order = 18,
        optional = true,
        regexes = {"^.+$"},
        description = "JDBC connection username.",
        helpText = "User name to be used for the JDBC connection.")
    String getUsername();

    void setUsername(String username);

    @TemplateParameter.Password(
        order = 19,
        optional = true,
        description = "JDBC connection password.",
        helpText = "Password to be used for the JDBC connection.")
    String getPassword();

    void setPassword(String password);

    @TemplateParameter.Text(
        order = 20,
        optional = true,
        regexes = {"^[a-zA-Z0-9_;!*&=@#-:\\/]+$"},
        description = "JDBC connection property string.",
        helpText =
            "Properties string to use for the JDBC connection. Format of the string must be"
                + " [propertyName=property;]*.",
        example = "unicode=true;characterEncoding=UTF-8")
    String getConnectionProperties();

    void setConnectionProperties(String connectionProperties);

    @TemplateParameter.Text(
        order = 21,
        optional = true,
        regexes = {"^.+$"},
        description = "Statement which will be executed against the database.",
        helpText =
            "SQL statement which will be executed to write to the database. The statement must"
                + " specify the column names of the table in any order. Only the values of the"
                + " specified column names will be read from the json and added to the statement.",
        example = "INSERT INTO tableName (column1, column2) VALUES (?,?)")
    String getStatement();

    void setStatement(String statement);

    @TemplateParameter.Text(
        order = 22,
        optional = true,
        regexes = {"^.+$"},
        description = "GCP Project Id of where the Spanner table lives.",
        helpText = "GCP Project Id of where the Spanner table lives.")
    String getProjectId();

    void setProjectId(String projectId);

    @TemplateParameter.Text(
        order = 23,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner instance name.",
        helpText = "Cloud Spanner instance name.")
    String getSpannerInstanceName();

    void setSpannerInstanceName(String spannerInstanceName);

    @TemplateParameter.Text(
        order = 24,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner database name.",
        helpText = "Cloud Spanner database name.")
    String getSpannerDatabaseName();

    void setSpannerDatabaseName(String spannerDBName);

    @TemplateParameter.Text(
        order = 25,
        optional = true,
        regexes = {"^.+$"},
        description = "Cloud Spanner table name.",
        helpText = "Cloud Spanner table name.")
    String getSpannerTableName();

    void setSpannerTableName(String spannerTableName);
  }

  /** Allowed list of existing schema templates. */
  public enum SchemaTemplate {
    GAME_EVENT(
        "{\n"
            + "  \"eventId\": \"{{uuid()}}\",\n"
            + "  \"eventTimestamp\": {{timestamp()}},\n"
            + "  \"ipv4\": \"{{ipv4()}}\",\n"
            + "  \"ipv6\": \"{{ipv6()}}\",\n"
            + "  \"country\": \"{{country()}}\",\n"
            + "  \"username\": \"{{username()}}\",\n"
            + "  \"quest\": \"{{random(\"A Break In the Ice\", \"Ghosts of Perdition\", \"Survive"
            + " the Low Road\")}}\",\n"
            + "  \"score\": {{integer(100, 10000)}},\n"
            + "  \"completed\": {{bool()}}\n"
            + "}"),
    LOG_ENTRY(
        "{\n"
            + "  \"logName\": \"{{alpha(10,20)}}\",\n"
            + "  \"resource\": {\n"
            + "    \"type\": \"{{alpha(5,10)}}\"\n"
            + "  },\n"
            + "  \"timestamp\": {{timestamp()}},\n"
            + "  \"receiveTimestamp\": {{timestamp()}},\n"
            + "  \"severity\": \"{{random(\"DEFAULT\", \"DEBUG\", \"INFO\", \"NOTICE\","
            + " \"WARNING\", \"ERROR\", \"CRITICAL\", \"ERROR\")}}\",\n"
            + "  \"insertId\": \"{{uuid()}}\",\n"
            + "  \"trace\": \"{{uuid()}}\",\n"
            + "  \"spanId\": \"{{uuid()}}\",\n"
            + "  \"jsonPayload\": {\n"
            + "    \"bytes_sent\": {{integer(1000,20000)}},\n"
            + "    \"connection\": {\n"
            + "      \"dest_ip\": \"{{ipv4()}}\",\n"
            + "      \"dest_port\": {{integer(0,65000)}},\n"
            + "      \"protocol\": {{integer(0,6)}},\n"
            + "      \"src_ip\": \"{{ipv4()}}\",\n"
            + "      \"src_port\": {{integer(0,65000)}}\n"
            + "    },\n"
            + "    \"dest_instance\": {\n"
            + "      \"project_id\": \"{{concat(\"PROJECT\", integer(0,3))}}\",\n"
            + "      \"region\": \"{{country()}}\",\n"
            + "      \"vm_name\": \"{{username()}}\",\n"
            + "      \"zone\": \"{{state()}}\"\n"
            + "    },\n"
            + "    \"end_time\": {{timestamp()}},\n"
            + "    \"packets_sent\": {{integer(100,400)}},\n"
            + "    \"reporter\": \"{{random(\"SRC\", \"DEST\")}}\",\n"
            + "    \"rtt_msec\": {{integer(0,20)}},\n"
            + "    \"start_time\": {{timestamp()}}\n"
            + "  }\n"
            + "}");

    private final String schema;

    SchemaTemplate(String schema) {
      this.schema = schema;
    }

    public String getSchema() {
      return schema;
    }
  }

  /** Allowed list of message encoding types. */
  public enum OutputType {
    JSON(".json"),
    AVRO(".avro"),
    PARQUET(".parquet");

    private final String fileExtension;

    /** Sets file extension associated with output type. */
    OutputType(String fileExtension) {
      this.fileExtension = fileExtension;
    }

    /** Returns file extension associated with output type. */
    public String getFileExtension() {
      return fileExtension;
    }
  }

  /** Allowed list of sink types. */
  public enum SinkType {
    PUBSUB,
    BIGQUERY,
    GCS,
    JDBC,
    SPANNER
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for its execution to finish. If blocking execution is required, use the {@link
   * StreamingDataGenerator#run(StreamingDataGeneratorOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args command-line args passed by the executor.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    StreamingDataGeneratorOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StreamingDataGeneratorOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options the execution options.
   * @return the pipeline result.
   */
  public static PipelineResult run(@Nonnull StreamingDataGeneratorOptions options) {
    checkNotNull(options, "options argument to run method cannot be null.");
    MetadataValidator.validate(options);

    // FileSystems does not set the default configuration in workers till Pipeline.run
    // Explicitly registering standard file systems.
    FileSystems.setDefaultPipelineOptions(options);
    String schema = getSchema(options.getSchemaTemplate(), options.getSchemaLocation());

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Trigger at the supplied QPS
     *  2) Generate messages containing fake data
     *  3) Write messages to appropriate Sink
     */
    PCollection<byte[]> generatedMessages =
        pipeline
            .apply("Trigger", createTrigger(options))
            .apply("Generate Fake Messages", ParDo.of(new MessageGeneratorFn(schema)));

    if (options.getSinkType().equals(SinkType.GCS)) {
      generatedMessages =
          generatedMessages.apply(
              options.getWindowDuration() + " Window",
              Window.into(
                  FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))));
    }

    generatedMessages.apply(
        "Write To " + options.getSinkType().name(), createSink(options, schema));

    return pipeline.run();
  }

  /**
   * Creates either a bounded or an unbounded source based on the messagesLimit pipeline option.
   *
   * @param options the pipeline options.
   */
  private static GenerateSequence createTrigger(@Nonnull StreamingDataGeneratorOptions options) {
    checkNotNull(options, "options argument to createTrigger method cannot be null.");
    GenerateSequence generateSequence =
        GenerateSequence.from(0L)
            .withRate(options.getQps(), /* periodLength = */ Duration.standardSeconds(1L));

    return options.getMessagesLimit() > 0
        ? generateSequence.to(options.getMessagesLimit())
        : generateSequence;
  }

  /**
   * The {@link MessageGeneratorFn} class generates fake messages based on the supplied schema.
   *
   * <p>See <a href="https://github.com/vincentrussell/json-data-generator">json-data-generator</a>
   * for instructions on how to construct the schema file.
   */
  @VisibleForTesting
  static class MessageGeneratorFn extends DoFn<Long, byte[]> {

    // Not initialized inline or constructor because {@link JsonDataGenerator} is not serializable.
    private transient JsonDataGenerator dataGenerator;
    private final String schema;

    MessageGeneratorFn(String schema) {
      this.schema = schema;
    }

    @Setup
    public void setup() {
      dataGenerator = new JsonDataGeneratorImpl();
    }

    @ProcessElement
    public void processElement(
        @Element Long element,
        @Timestamp Instant timestamp,
        OutputReceiver<byte[]> receiver,
        ProcessContext context)
        throws IOException, JsonDataGeneratorException {

      byte[] payload;

      // Generate the fake JSON according to the schema.
      try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        dataGenerator.generateTestDataJson(schema, byteArrayOutputStream);
        payload = byteArrayOutputStream.toByteArray();
      }

      receiver.output(payload);
    }
  }

  /**
   * Creates appropriate sink based on sinkType pipeline option.
   *
   * @param options the pipeline options.
   */
  @VisibleForTesting
  static PTransform<PCollection<byte[]>, PDone> createSink(
      @Nonnull StreamingDataGeneratorOptions options, @Nonnull String schema) {
    checkNotNull(options, "options argument to createSink method cannot be null.");
    checkNotNull(schema, "schema argument to createSink method cannot be null.");

    switch (options.getSinkType()) {
      case PUBSUB:
        checkArgument(
            options.getTopic() != null,
            String.format(
                "Missing required value --topic for %s sink type", options.getSinkType().name()));
        return StreamingDataGeneratorWriteToPubSub.Writer.builder(options, schema).build();
      case BIGQUERY:
        checkArgument(
            options.getOutputTableSpec() != null,
            String.format(
                "Missing required value --outputTableSpec in format"
                    + " <project>:<dataset>.<table_name> for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToBigQuery.builder(options).build();
      case GCS:
        checkArgument(
            options.getOutputDirectory() != null,
            String.format(
                "Missing required value --outputDirectory in format gs:// for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToGcs.builder(options).build();
      case JDBC:
        checkArgument(
            options.getDriverClassName() != null,
            String.format(
                "Missing required value --driverClassName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getConnectionUrl() != null,
            String.format(
                "Missing required value --connectionUrl for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getStatement() != null,
            String.format(
                "Missing required value --statement for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToJdbc.builder(options).build();
      case SPANNER:
        checkArgument(
            options.getProjectId() != null,
            String.format(
                "Missing required value --projectId for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerInstanceName() != null,
            String.format(
                "Missing required value --spannerInstanceName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerDatabaseName() != null,
            String.format(
                "Missing required value --spannerDatabaseName for %s sink type",
                options.getSinkType().name()));
        checkArgument(
            options.getSpannerTableName() != null,
            String.format(
                "Missing required value --spannerTableName for %s sink type",
                options.getSinkType().name()));
        return StreamingDataGeneratorWriteToSpanner.builder(options).build();
      default:
        throw new IllegalArgumentException("Unsupported Sink.");
    }
  }

  private static String getSchema(SchemaTemplate schemaTemplate, String schemaLocation) {
    checkArgument(
        schemaTemplate != null || schemaLocation != null,
        "Either schemaTemplate or schemaLocation argument of MessageGeneratorFn class must be"
            + " provided.");
    if (schemaLocation != null) {
      return GCSUtils.getGcsFileAsString(schemaLocation);
    } else {
      return schemaTemplate.getSchema();
    }
  }
}
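
The class-level Javadoc in the source suggests a general rule of 2,500 QPS per core in the worker pool. As a rough sizing aid, that rule can be turned into arithmetic. The sketch below is plain Java under stated assumptions: `GeneratorSizing` is a hypothetical helper, the cores-per-worker figure depends on the machine type you choose, and the result is only a starting point for a value such as `maxNumWorkers`.

```java
/** Hypothetical sizing helper based on the "2,500 QPS per core" rule in the Javadoc. */
final class GeneratorSizing {

  static final long QPS_PER_CORE = 2_500L;

  /** Returns the number of cores needed for the target QPS, rounded up. */
  static long coresFor(long qps) {
    if (qps <= 0) {
      throw new IllegalArgumentException("qps must be positive");
    }
    return (qps + QPS_PER_CORE - 1) / QPS_PER_CORE;
  }

  /** Returns the number of workers needed, given how many cores each worker has. */
  static long workersFor(long qps, int coresPerWorker) {
    if (coresPerWorker <= 0) {
      throw new IllegalArgumentException("coresPerWorker must be positive");
    }
    long cores = coresFor(qps);
    return (cores + coresPerWorker - 1) / coresPerWorker;
  }

  public static void main(String[] args) {
    System.out.println(coresFor(10_000L)); // 4 cores for 10,000 QPS
    System.out.println(workersFor(10_000L, 2)); // 2 workers if each worker has 2 cores
    System.out.println(workersFor(12_500L, 2)); // 5 cores needed, so 3 two-core workers
  }
}
```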