Google 提供的 Dataflow 流处理模板

Google 提供了一组开源 Dataflow 模板。

这些 Dataflow 模板可帮助您解决大型数据任务，包括数据导入、数据导出、数据备份、数据恢复和批量 API 操作，所有这些均无需使用专用开发环境。这些模板基于 Apache Beam 构建，并使用 Dataflow 转换数据。

如需了解有关模板的一般信息，请参阅 Dataflow 模板。如需查看 Google 提供的所有模板的列表，请参阅开始使用 Google 提供的模板。

本指南介绍了流式模板。

Pub/Sub Subscription to BigQuery

Pub/Sub Subscription to BigQuery 模板是一种流处理流水线，可从 Pub/Sub 订阅读取 JSON 格式的消息并将其写入 BigQuery 表格中。您可以使用该模板作为将 Pub/Sub 数据移动到 BigQuery 的快速解决方案。此模板可从 Pub/Sub 中读取 JSON 格式的消息并将其转换为 BigQuery 元素。

对此流水线的要求：

Pub/Sub 消息的 data 字段必须使用 JSON 格式（如此 JSON 指南中所述）。例如，您可以将 data 字段中值为 {"k1":"v1", "k2":"v2"} 格式的消息插入具有名为 k1 和 k2 的列且数据类型为字符串的 BigQuery 表中。
在运行该流水线之前，输出表必须已存在。表架构必须与输入 JSON 对象相匹配。

模板参数

参数	说明
`inputSubscription`	要读取的 Pub/Sub 输入订阅，格式为 `projects/<project>/subscriptions/<subscription>`。
`outputTableSpec`	BigQuery 输出表位置，格式为 `<my-project>:<my-dataset>.<my-table>`
`outputDeadletterTable`	未能到达输出表的消息的 BigQuery 表，格式为 `<my-project>:<my-dataset>.<my-table>`。如果该表不存在，则系统会在流水线执行期间创建它。如果未指定此参数，则系统会改用 `OUTPUT_TABLE_SPEC_error_records`。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。

运行 Pub/Sub Subscription to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub Subscription to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/PubSub_Subscription_to_BigQuery \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
outputDeadletterTable=PROJECT_ID:DATASET.TABLE_NAME

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
DATASET：您的 BigQuery 数据集
TABLE_NAME：您的 BigQuery 表名称

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/PubSub_Subscription_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputSubscription": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME",
       "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME"
   },
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "TEMP_LOCATION",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
   },
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
DATASET：您的 BigQuery 数据集
TABLE_NAME：您的 BigQuery 表名称

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.PubSubToBigQuery.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data in JSON format
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery. Any errors
 * which occur in the transformation of the data or execution of the UDF will be output to a
 * separate errors table in BigQuery. The errors table will be created if it does not exist prior to
 * execution. Both output and error tables are specified by the user as template parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Pub/Sub topic exists.
 *   <li>The BigQuery output table exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}
 * --useSubscription=${USE_SUBSCRIPTION}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 * </pre>
 */
@Template(
    name = "PubSub_Subscription_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Subscription to BigQuery",
    description =
        "Streaming pipeline. Ingests JSON-encoded messages from a Pub/Sub subscription, transforms"
            + " them using a JavaScript user-defined function (UDF), and writes them to a"
            + " pre-existing BigQuery table as BigQuery elements.",
    optionsClass = Options.class,
    skipOptions = "inputTopic",
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "PubSub_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Topic to BigQuery",
    description =
        "Streaming pipeline. Ingests JSON-encoded messages from a Pub/Sub topic, transforms them"
            + " using a JavaScript user-defined function (UDF), and writes them to a pre-existing"
            + " BigQuery table as BigQuery elements.",
    optionsClass = Options.class,
    skipOptions = "inputSubscription",
    contactInformation = "https://cloud.google.com/support")
public class PubSubToBigQuery {

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);

  /** The tag for the main output for the UDF. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the udf. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions, JavascriptTextTransformerOptions {
    @TemplateParameter.BigQueryTable(
        order = 1,
        description = "BigQuery output table",
        helpText =
            "BigQuery table location to write the output to. The table’s schema must match the "
                + "input JSON objects.")
    ValueProvider<String> getOutputTableSpec();

    void setOutputTableSpec(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Input Pub/Sub topic",
        helpText = "The Pub/Sub topic to read the input from.")
    ValueProvider<String> getInputTopic();

    void setInputTopic(ValueProvider<String> value);

    @TemplateParameter.PubsubSubscription(
        order = 3,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'")
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> value);

    @TemplateCreationParameter(template = "PubSub_to_BigQuery", value = "false")
    @TemplateCreationParameter(template = "PubSub_Subscription_to_BigQuery", value = "true")
    @Description(
        "This determines whether the template reads from a Pub/sub subscription or a topic")
    @Default.Boolean(false)
    Boolean getUseSubscription();

    void setUseSubscription(Boolean value);

    @TemplateParameter.BigQueryTable(
        order = 5,
        optional = true,
        description =
            "Table for messages failed to reach the output table (i.e., Deadletter table)",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. It should be in the format"
                + " of \"your-project-id:your-dataset.your-table-name\". If it doesn't exist, it"
                + " will be created during pipeline execution. If not specified,"
                + " \"{outputTableSpec}_error_records\" is used instead.")
    ValueProvider<String> getOutputDeadletterTable();

    void setOutputDeadletterTable(ValueProvider<String> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for it's execution to finish. If blocking execution is required, use the {@link
   * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    Pipeline pipeline = Pipeline.create(options);

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps:
     *  1) Read messages in from Pub/Sub
     *  2) Transform the PubsubMessages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */

    /*
     * Step #1: Read messages in from Pub/Sub
     * Either from a Subscription or Topic
     */

    PCollection<PubsubMessage> messages = null;
    if (options.getUseSubscription()) {
      messages =
          pipeline.apply(
              "ReadPubSubSubscription",
              PubsubIO.readMessagesWithAttributes()
                  .fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "ReadPubSubTopic",
              PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()));
    }

    PCollectionTuple convertedTableRows =
        messages
            /*
             * Step #2: Transform the PubsubMessages into TableRows
             */
            .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection<FailsafeElement<String, String>> failedInserts =
        writeResult
            .getFailedInsertsWithErr()
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write records that failed table row transformation
     * or conversion out to BigQuery deadletter table.
     */
    PCollectionList.of(
            ImmutableList.of(
                convertedTableRows.get(UDF_DEADLETTER_OUT),
                convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ValueProviderUtils.maybeUseDefaultDeadletterTable(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec(),
                        DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
                .build());

    // 5) Insert records that failed insert into deadletter table
    failedInserts.apply(
        "WriteFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ValueProviderUtils.maybeUseDefaultDeadletterTable(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec(),
                    DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
            .build());

    return pipeline.run();
  }

  /**
   * If deadletterTable is available, it is returned as is, otherwise outputTableSpec +
   * defaultDeadLetterTableSuffix is returned instead.
   */
  private static ValueProvider<String> maybeUseDefaultDeadletterTable(
      ValueProvider<String> deadletterTable,
      ValueProvider<String> outputTableSpec,
      String defaultDeadLetterTableSuffix) {
    return DualInputNestedValueProvider.of(
        deadletterTable,
        outputTableSpec,
        new SerializableFunction<TranslatorInput<String, String>, String>() {
          @Override
          public String apply(TranslatorInput<String, String> input) {
            String userProvidedTable = input.getX();
            String outputTableSpec = input.getY();
            if (userProvidedTable == null) {
              return outputTableSpec + defaultDeadLetterTableSuffix;
            }
            return userProvidedTable;
          }
        });
  }

  /**
   * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transforms incoming
   * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into BigQuery while
   * applying an optional UDF to the input. The executions of the UDF and transformation to {@link
   * TableRow} objects is done in a fail-safe way by wrapping the element with it's original payload
   * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} transform will
   * output a {@link PCollectionTuple} which contains all output and dead-letter {@link
   * PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link PubSubToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} records
   *       successfully processed by the optional UDF.
   *   <li>{@link PubSubToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which failed processing during the UDF execution.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_OUT} - Contains all records successfully converted from
   *       JSON to {@link TableRow} objects.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  static class PubsubMessageToTableRow
      extends PTransform<PCollection<PubsubMessage>, PCollectionTuple> {

    private final Options options;

    PubsubMessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<PubsubMessage> input) {

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.<PubsubMessage>newBuilder()
                      .setSuccessTag(TRANSFORM_OUT)
                      .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMessage} with the
   * {@link FailsafeElement} class so errors can be recovered from and the original message can be
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }
}

Pub/Sub Topic to BigQuery

Pub/Sub Topic to BigQuery 模板是一种流处理流水线，可从 Cloud Pub/Sub 主题读取 JSON 格式的消息并将其写入 BigQuery 表格中。您可以使用该模板作为将 Pub/Sub 数据移动到 BigQuery 的快速解决方案。此模板可从 Pub/Sub 中读取 JSON 格式的消息并将其转换为 BigQuery 元素。

对此流水线的要求：

Pub/Sub 消息的 data 字段必须使用 JSON 格式（如此 JSON 指南中所述）。例如，您可以将 data 字段中值为 {"k1":"v1", "k2":"v2"} 格式的消息插入具有名为 k1 和 k2 的列且数据类型为字符串的 BigQuery 表中。
在运行该流水线之前，输出表必须已存在。表架构必须与输入 JSON 对象相匹配。

模板参数

参数	说明
`inputTopic`	要读取的 Cloud Pub/Sub 输入主题，格式为 `projects/<project>/topics/<topic>`。
`outputTableSpec`	BigQuery 输出表位置，格式为 `<my-project>:<my-dataset>.<my-table>`
`outputDeadletterTable`	无法到达输出表的消息的 BigQuery 表。格式应为 `<my-project>:<my-dataset>.<my-table>`。如果该表不存在，则系统会在流水线执行期间创建它。如果未指定此参数，则系统会改用 `<outputTableSpec>_error_records`。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。

运行 Cloud Pub/Sub Topic to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub Topic to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/PubSub_to_BigQuery \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputTopic=projects/PROJECT_ID/topics/TOPIC_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
outputDeadletterTable=PROJECT_ID:DATASET.TABLE_NAME

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
DATASET：您的 BigQuery 数据集
TABLE_NAME：您的 BigQuery 表名称

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/PubSub_to_BigQuery
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": TEMP_LOCATION,
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "inputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME",
       "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
DATASET：您的 BigQuery 数据集
TABLE_NAME：您的 BigQuery 表名称

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.PubSubToBigQuery.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data in JSON format
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery. Any errors
 * which occur in the transformation of the data or execution of the UDF will be output to a
 * separate errors table in BigQuery. The errors table will be created if it does not exist prior to
 * execution. Both output and error tables are specified by the user as template parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Pub/Sub topic exists.
 *   <li>The BigQuery output table exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}
 * --useSubscription=${USE_SUBSCRIPTION}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * outputTableSpec=${PROJECT_ID}:dataset-id.output-table,\
 * outputDeadletterTable=${PROJECT_ID}:dataset-id.deadletter-table"
 * </pre>
 */
@Template(
    name = "PubSub_Subscription_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Subscription to BigQuery",
    description =
        "Streaming pipeline. Ingests JSON-encoded messages from a Pub/Sub subscription, transforms"
            + " them using a JavaScript user-defined function (UDF), and writes them to a"
            + " pre-existing BigQuery table as BigQuery elements.",
    optionsClass = Options.class,
    skipOptions = "inputTopic",
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "PubSub_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Topic to BigQuery",
    description =
        "Streaming pipeline. Ingests JSON-encoded messages from a Pub/Sub topic, transforms them"
            + " using a JavaScript user-defined function (UDF), and writes them to a pre-existing"
            + " BigQuery table as BigQuery elements.",
    optionsClass = Options.class,
    skipOptions = "inputSubscription",
    contactInformation = "https://cloud.google.com/support")
public class PubSubToBigQuery {

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);

  /** The tag for the main output for the UDF. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the udf. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options extends PipelineOptions, JavascriptTextTransformerOptions {
    @TemplateParameter.BigQueryTable(
        order = 1,
        description = "BigQuery output table",
        helpText =
            "BigQuery table location to write the output to. The table’s schema must match the "
                + "input JSON objects.")
    ValueProvider<String> getOutputTableSpec();

    void setOutputTableSpec(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Input Pub/Sub topic",
        helpText = "The Pub/Sub topic to read the input from.")
    ValueProvider<String> getInputTopic();

    void setInputTopic(ValueProvider<String> value);

    @TemplateParameter.PubsubSubscription(
        order = 3,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'")
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> value);

    @TemplateCreationParameter(template = "PubSub_to_BigQuery", value = "false")
    @TemplateCreationParameter(template = "PubSub_Subscription_to_BigQuery", value = "true")
    @Description(
        "This determines whether the template reads from a Pub/sub subscription or a topic")
    @Default.Boolean(false)
    Boolean getUseSubscription();

    void setUseSubscription(Boolean value);

    @TemplateParameter.BigQueryTable(
        order = 5,
        optional = true,
        description =
            "Table for messages failed to reach the output table (i.e., Deadletter table)",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. It should be in the format"
                + " of \"your-project-id:your-dataset.your-table-name\". If it doesn't exist, it"
                + " will be created during pipeline execution. If not specified,"
                + " \"{outputTableSpec}_error_records\" is used instead.")
    ValueProvider<String> getOutputDeadletterTable();

    void setOutputDeadletterTable(ValueProvider<String> value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for it's execution to finish. If blocking execution is required, use the {@link
   * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    Pipeline pipeline = Pipeline.create(options);

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps:
     *  1) Read messages in from Pub/Sub
     *  2) Transform the PubsubMessages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */

    /*
     * Step #1: Read messages in from Pub/Sub
     * Either from a Subscription or Topic
     */

    PCollection<PubsubMessage> messages = null;
    if (options.getUseSubscription()) {
      messages =
          pipeline.apply(
              "ReadPubSubSubscription",
              PubsubIO.readMessagesWithAttributes()
                  .fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "ReadPubSubTopic",
              PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()));
    }

    PCollectionTuple convertedTableRows =
        messages
            /*
             * Step #2: Transform the PubsubMessages into TableRows
             */
            .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection<FailsafeElement<String, String>> failedInserts =
        writeResult
            .getFailedInsertsWithErr()
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write records that failed table row transformation
     * or conversion out to BigQuery deadletter table.
     */
    PCollectionList.of(
            ImmutableList.of(
                convertedTableRows.get(UDF_DEADLETTER_OUT),
                convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ValueProviderUtils.maybeUseDefaultDeadletterTable(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec(),
                        DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
                .build());

    // 5) Insert records that failed insert into deadletter table
    failedInserts.apply(
        "WriteFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ValueProviderUtils.maybeUseDefaultDeadletterTable(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec(),
                    DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
            .build());

    return pipeline.run();
  }

  /**
   * If deadletterTable is available, it is returned as is, otherwise outputTableSpec +
   * defaultDeadLetterTableSuffix is returned instead.
   */
  private static ValueProvider<String> maybeUseDefaultDeadletterTable(
      ValueProvider<String> deadletterTable,
      ValueProvider<String> outputTableSpec,
      String defaultDeadLetterTableSuffix) {
    return DualInputNestedValueProvider.of(
        deadletterTable,
        outputTableSpec,
        new SerializableFunction<TranslatorInput<String, String>, String>() {
          @Override
          public String apply(TranslatorInput<String, String> input) {
            String userProvidedTable = input.getX();
            String outputTableSpec = input.getY();
            if (userProvidedTable == null) {
              return outputTableSpec + defaultDeadLetterTableSuffix;
            }
            return userProvidedTable;
          }
        });
  }

  /**
   * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transforms incoming
   * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into BigQuery while
   * applying an optional UDF to the input. The executions of the UDF and transformation to {@link
   * TableRow} objects is done in a fail-safe way by wrapping the element with it's original payload
   * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} transform will
   * output a {@link PCollectionTuple} which contains all output and dead-letter {@link
   * PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link PubSubToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} records
   *       successfully processed by the optional UDF.
   *   <li>{@link PubSubToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which failed processing during the UDF execution.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_OUT} - Contains all records successfully converted from
   *       JSON to {@link TableRow} objects.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  static class PubsubMessageToTableRow
      extends PTransform<PCollection<PubsubMessage>, PCollectionTuple> {

    private final Options options;

    PubsubMessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<PubsubMessage> input) {

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.<PubsubMessage>newBuilder()
                      .setSuccessTag(TRANSFORM_OUT)
                      .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMessage} with the
   * {@link FailsafeElement} class so errors can be recovered from and the original message can be
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }
}

Pub/Sub Avro to BigQuery

Pub/Sub Avro to BigQuery 模板是一种流处理流水线，用于将 Pub/Sub 订阅中的 Avro 数据提取到 BigQuery 表中。向 BigQuery 表写入数据时发生的任何错误都会流式传输到 Pub/Sub 未处理的主题。

对此流水线的要求

用作输入来源的 Pub/Sub 订阅必须存在。
Avro 记录的架构文件必须存在于 Cloud Storage 存储空间中。
未处理的 Pub/Sub 主题必须存在。
用作输出目标的 BigQuery 数据集必须已存在。

模板参数

参数	说明
`schemaPath`	Avro 架构文件的 Cloud Storage 位置。例如 `gs://path/to/my/schema.avsc`。
`inputSubscription`	要读取的 Pub/Sub 输入订阅。例如 `projects/<project>/subscriptions/<subscription>`。
`outputTopic`	要用于未处理的记录的 Pub/Sub 主题。例如 `projects/<project-id>/topics/<topic-name>`。
`outputTableSpec`	BigQuery 输出表位置。例如 `<my-project>:<my-dataset>.<my-table>`。根据指定的 createDisposition，系统可能会使用用户提供的 Avro 架构自动创建输出表。
`writeDisposition`	（可选）BigQuery WriteDisposition。例如 `WRITE_APPEND`、`WRITE_EMPTY` 或 `WRITE_TRUNCATE`。默认：`WRITE_APPEND`
`createDisposition`	（可选）BigQuery CreateDisposition。例如 `CREATE_IF_NEEDED`、`CREATE_NEVER`。默认：`CREATE_IF_NEEDED`

运行 Pub/Sub Avro to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub Avro to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/PubSub_Avro_to_BigQuery \
    --parameters \
schemaPath=SCHEMA_PATH,\
inputSubscription=SUBSCRIPTION_NAME,\
outputTableSpec=BIGQUERY_TABLE,\
outputTopic=DEADLETTER_TOPIC

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SCHEMA_PATH：Avro 架构文件的 Cloud Storage 路径（例如 gs://MyBucket/file.avsc）
SUBSCRIPTION_NAME：Pub/Sub 输入订阅名称
BIGQUERY_TABLE：BigQuery 输出表名称
DEADLETTER_TOPIC：要用于未处理的队列的 Pub/Sub 主题

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/PubSub_Avro_to_BigQuery",
      "parameters": {
          "schemaPath": "SCHEMA_PATH",
          "inputSubscription": "SUBSCRIPTION_NAME",
          "outputTableSpec": "BIGQUERY_TABLE",
          "outputTopic": "DEADLETTER_TOPIC"
      }
   }
}

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SCHEMA_PATH：Avro 架构文件的 Cloud Storage 路径（例如 gs://MyBucket/file.avsc）
SUBSCRIPTION_NAME：Pub/Sub 输入订阅名称
BIGQUERY_TABLE：BigQuery 输出表名称
DEADLETTER_TOPIC：要用于未处理的队列的 Pub/Sub 主题

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.BigQueryCommonOptions.WriteOptions;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.ReadSubscriptionOptions;
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.WriteTopicOptions;
import com.google.cloud.teleport.v2.templates.PubsubAvroToBigQuery.PubsubAvroToBigQueryOptions;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.schemas.transforms.Convert;
import org.apache.beam.sdk.values.Row;

/**
 * A Dataflow pipeline to stream <a href="https://avro.apache.org/">Apache Avro</a> records from
 * Pub/Sub into a BigQuery table.
 *
 * <p>Any persistent failures while writing to BigQuery will be written to a Pub/Sub dead-letter
 * topic.
 */
@Template(
    name = "PubSub_Avro_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Avro to BigQuery",
    description =
        "A streaming pipeline which inserts Avro records from a Pub/Sub subscription into a"
            + " BigQuery table.",
    optionsClass = PubsubAvroToBigQueryOptions.class,
    flexContainerName = "pubsub-avro-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public final class PubsubAvroToBigQuery {
  /**
   * Validates input flags and executes the Dataflow pipeline.
   *
   * @param args command line arguments to the pipeline
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    PubsubAvroToBigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(PubsubAvroToBigQueryOptions.class);

    run(options);
  }

  /**
   * Provides custom {@link org.apache.beam.sdk.options.PipelineOptions} required to execute the
   * {@link PubsubAvroToBigQuery} pipeline.
   */
  public interface PubsubAvroToBigQueryOptions
      extends ReadSubscriptionOptions,
          WriteOptions,
          WriteTopicOptions,
          BigQueryStorageApiStreamingOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage path to the Avro schema file",
        helpText = "Cloud Storage path to Avro schema file. For example, gs://MyBucket/file.avsc.")
    @Required
    String getSchemaPath();

    void setSchemaPath(String schemaPath);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options execution parameters to the pipeline
   * @return result of the pipeline execution as a {@link PipelineResult}
   */
  private static PipelineResult run(PubsubAvroToBigQueryOptions options) {
    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    Schema schema = SchemaUtils.getAvroSchema(options.getSchemaPath());

    WriteResult writeResults =
        pipeline
            .apply(
                "Read Avro records",
                PubsubIO.readAvroGenericRecords(schema)
                    .fromSubscription(options.getInputSubscription())
                    .withDeadLetterTopic(options.getOutputTopic()))
            // Workaround for BEAM-12256. Eagerly convert to rows to avoid
            // the RowToGenericRecord function that doesn't handle all data
            // types.
            // TODO: Remove this workaround when a fix for BEAM-12256 is
            // released.
            .apply(Convert.toRows())
            .apply(
                "Write to BigQuery",
                BigQueryConverters.<Row>createWriteTransform(options).useBeamSchema());

    BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResults, options)
        .apply(
            "Create error payload",
            ErrorConverters.BigQueryInsertErrorToPubsubMessage.<GenericRecord>newBuilder()
                .setPayloadCoder(AvroCoder.of(schema))
                .setTranslateFunction(BigQueryConverters.TableRowToGenericRecordFn.of(schema))
                .build())
        .apply("Write failed records", PubsubIO.writeMessages().to(options.getOutputTopic()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}

Pub/Sub Proto to BigQuery

Pub/Sub proto to BigQuery 模板是一种流处理流水线，用于将 Pub/Sub 订阅中的 Avro 数据提取到 BigQuery 表中。向 BigQuery 表写入数据时发生的任何错误都会流式传输到 Pub/Sub 未处理的主题。

可以提供 JavaScript 用户定义函数 (UDF) 来转换数据。可以将在执行 UDF 期间发生的错误发送到单独的 Pub/Sub 主题或与 BigQuery 错误相同的未处理主题。

对此流水线的要求：

用作输入来源的 Pub/Sub 订阅必须存在。
Proto 记录的架构文件必须存在于 Cloud Storage 中。
输出 Pub/Sub 主题必须存在。
用作输出目标的 BigQuery 数据集必须已存在。
如果 BigQuery 表存在，则无论 createDisposition 值如何，该表都必须具有与 proto 数据匹配的架构。

模板参数

参数	说明
`protoSchemaPath`	独立的 proto 架构文件的 Cloud Storage 位置。例如 `gs://path/to/my/file.pb`。您可以使用 `protoc` 命令的 `--descriptor_set_out` 标志生成此文件。`--include_imports` 标志可确保文件是独立的。
`fullMessageName`	完整的 proto 消息名称。例如 `package.name.MessageName`，其中 `package.name` 是为 `package` 语句（而不是 `java_package` 语句）提供的值。
`inputSubscription`	要读取的 Pub/Sub 输入订阅。例如 `projects/<project>/subscriptions/<subscription>`。
`outputTopic`	要用于未处理的记录的 Pub/Sub 主题。例如 `projects/<project-id>/topics/<topic-name>`。
`outputTableSpec`	BigQuery 输出表位置。例如 `my-project:my_dataset.my_table`。根据指定的 createDisposition，系统可能会使用输入架构文件自动创建输出表。
`preserveProtoFieldNames`	（可选）`true` 用于保留 JSON 中的原始 Proto 字段名称。`false` 用于使用更多标准 JSON 名称。例如，`false` 会将 `field_name` 更改为 `fieldName`。（默认：`false`）
`bigQueryTableSchemaPath`	（可选）BigQuery 架构路径到 Cloud Storage 路径。例如 `gs://path/to/my/schema.json`。如果未提供，则根据 Proto 架构推断架构。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`udfOutputTopic`	（可选）存储 UDF 错误的 Pub/Sub 主题。例如 `projects/<project-id>/topics/<topic-name>`。如果未提供，则会将 UDF 错误发送到 `outputTopic` 所在的主题。
`writeDisposition`	（可选）BigQuery `WriteDisposition`。例如 `WRITE_APPEND`、`WRITE_EMPTY` 或 `WRITE_TRUNCATE`。默认值：`WRITE_APPEND`。
`createDisposition`	（可选）BigQuery `CreateDisposition`。例如 `CREATE_IF_NEEDED`、`CREATE_NEVER`。默认值为 `CREATE_IF_NEEDED`。

运行 Pub/Sub Proto to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub Proto to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/PubSub_Proto_to_BigQuery \
    --parameters \
schemaPath=SCHEMA_PATH,\
fullMessageName=PROTO_MESSAGE_NAME,\
inputSubscription=SUBSCRIPTION_NAME,\
outputTableSpec=BIGQUERY_TABLE,\
outputTopic=UNPROCESSED_TOPIC

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SCHEMA_PATH：Proto 架构文件的 Cloud Storage 路径（例如 gs://MyBucket/file.pb）
PROTO_MESSAGE_NAME：Proto 消息名称（例如 package.name.MessageName）
SUBSCRIPTION_NAME：Pub/Sub 输入订阅名称
BIGQUERY_TABLE：BigQuery 输出表名称
UNPROCESSED_TOPIC：要用于未处理的队列的 Pub/Sub 主题

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/PubSub_Proto_to_BigQuery",
      "parameters": {
          "schemaPath": "SCHEMA_PATH",
          "fullMessageName": "PROTO_MESSAGE_NAME",
          "inputSubscription": "SUBSCRIPTION_NAME",
          "outputTableSpec": "BIGQUERY_TABLE",
          "outputTopic": "UNPROCESSED_TOPIC"
      }
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SCHEMA_PATH：Proto 架构文件的 Cloud Storage 路径（例如 gs://MyBucket/file.pb）
PROTO_MESSAGE_NAME：Proto 消息名称（例如 package.name.MessageName）
SUBSCRIPTION_NAME：Pub/Sub 输入订阅名称
BIGQUERY_TABLE：BigQuery 输出表名称
UNPROCESSED_TOPIC：要用于未处理的队列的 Pub/Sub 主题

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static java.nio.charset.StandardCharsets.UTF_8;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.BigQueryCommonOptions.WriteOptions;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.ReadSubscriptionOptions;
import com.google.cloud.teleport.v2.options.PubsubCommonOptions.WriteTopicOptions;
import com.google.cloud.teleport.v2.templates.PubsubProtoToBigQuery.PubSubProtoToBigQueryOptions;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.FailsafeElementTransforms.ConvertFailsafeElementToPubsubMessage;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.utils.GCSUtils;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Strings;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO.Read;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.commons.lang3.ArrayUtils;

/**
 * A template for writing <a href="https://developers.google.com/protocol-buffers">Protobuf</a>
 * records from Pub/Sub to BigQuery.
 *
 * <p>Persistent failures are written to a Pub/Sub unprocessed topic.
 */
@Template(
    name = "PubSub_Proto_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Proto to BigQuery",
    description =
        "A streaming pipeline that reads Protobuf messages from a Pub/Sub subscription and writes"
            + " them to a BigQuery table.",
    optionsClass = PubSubProtoToBigQueryOptions.class,
    flexContainerName = "pubsub-proto-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public final class PubsubProtoToBigQuery {
  private static final TupleTag<FailsafeElement<String, String>> UDF_SUCCESS_TAG = new TupleTag<>();
  private static final TupleTag<FailsafeElement<String, String>> UDF_FAILURE_TAG = new TupleTag<>();

  private static final FailsafeElementCoder<String, String> FAILSAFE_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  public static void main(String[] args) {
    UncaughtExceptionLogger.register();
    run(PipelineOptionsFactory.fromArgs(args).as(PubSubProtoToBigQueryOptions.class));
  }

  /** {@link org.apache.beam.sdk.options.PipelineOptions} for {@link PubsubProtoToBigQuery}. */
  public interface PubSubProtoToBigQueryOptions
      extends ReadSubscriptionOptions,
          WriteOptions,
          WriteTopicOptions,
          JavascriptTextTransformerOptions,
          BigQueryStorageApiStreamingOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage Path to the Proto Schema File",
        helpText =
            "Cloud Storage path to a self-contained descriptor set file. Example:"
                + " gs://MyBucket/schema.pb. `schema.pb` can be generated by adding"
                + " `--descriptor_set_out=schema.pb` to the `protoc` command that compiles the"
                + " protos. The `--include_imports` flag can be used to guarantee that the file is"
                + " self-contained.")
    @Required
    String getProtoSchemaPath();

    void setProtoSchemaPath(String value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"^.+([a-zA-Z][a-zA-Z0-9_]+\\.?)+[a-zA-Z0-9_]$"},
        description = "Full Proto Message Name",
        helpText =
            "The full message name (example: package.name.MessageName). If the message is nested"
                + " inside of another message, then include all messages with the '.' delimiter"
                + " (example: package.name.OuterMessage.InnerMessage). 'package.name' should be"
                + " from the `package` statement, not the `java_package` statement.")
    @Required
    String getFullMessageName();

    void setFullMessageName(String value);

    @TemplateParameter.Text(
        order = 3,
        optional = true,
        description = "Preserve Proto Field Names",
        helpText =
            "Flag to control whether proto field names should be kept or converted to"
                + " lowerCamelCase. If the table already exists, this should be based on what"
                + " matches the table's schema. Otherwise, it will determine the column names of"
                + " the created table. True to preserve proto snake_case. False will convert fields"
                + " to lowerCamelCase. (Default: false)")
    @Default.Boolean(false)
    Boolean getPreserveProtoFieldNames();

    void setPreserveProtoFieldNames(Boolean value);

    @TemplateParameter.GcsReadFile(
        order = 4,
        optional = true,
        description = "BigQuery Table Schema Path",
        helpText =
            "Cloud Storage path to the BigQuery schema JSON file. "
                + "If this is not set, then the schema is inferred "
                + "from the Proto schema.",
        example = "gs://MyBucket/bq_schema.json")
    String getBigQueryTableSchemaPath();

    void setBigQueryTableSchemaPath(String value);

    @TemplateParameter.PubsubTopic(
        order = 5,
        optional = true,
        description = "Pub/Sub output topic for UDF failures",
        helpText =
            "An optional output topic to send UDF failures to. If this option is not set, then"
                + " failures will be written to the same topic as the BigQuery failures.",
        example = "projects/your-project-id/topics/your-topic-name")
    String getUdfOutputTopic();

    void setUdfOutputTopic(String udfOutputTopic);
  }

  /** Runs the pipeline and returns the results. */
  private static PipelineResult run(PubSubProtoToBigQueryOptions options) {
    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);

    Pipeline pipeline = Pipeline.create(options);

    Descriptor descriptor = getDescriptor(options);
    PCollection<String> maybeForUdf =
        pipeline
            .apply("Read From Pubsub", readPubsubMessages(options, descriptor))
            .apply("Dynamic Message to TableRow", new ConvertDynamicProtoMessageToJson(options));

    WriteResult writeResult =
        runUdf(maybeForUdf, options)
            .apply("Write to BigQuery", writeToBigQuery(options, descriptor));
    BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResult, options)
        .apply(
            "Create Error Payload",
            ErrorConverters.BigQueryInsertErrorToPubsubMessage.<String>newBuilder()
                .setPayloadCoder(StringUtf8Coder.of())
                .setTranslateFunction(BigQueryConverters::tableRowToJson)
                .build())
        .apply("Write Failed BQ Records", PubsubIO.writeMessages().to(options.getOutputTopic()));

    return pipeline.run();
  }

  /** Gets the {@link Descriptor} for the message type in the Pub/Sub topic. */
  @VisibleForTesting
  static Descriptor getDescriptor(PubSubProtoToBigQueryOptions options) {
    String schemaPath = options.getProtoSchemaPath();
    String messageName = options.getFullMessageName();
    Descriptor descriptor = SchemaUtils.getProtoDomain(schemaPath).getDescriptor(messageName);

    if (descriptor == null) {
      throw new IllegalArgumentException(
          messageName + " is not a recognized message in " + schemaPath);
    }

    return descriptor;
  }

  /** Returns the {@link PTransform} for reading Pub/Sub messages. */
  private static Read<DynamicMessage> readPubsubMessages(
      PubSubProtoToBigQueryOptions options, Descriptor descriptor) {
    return PubsubIO.readProtoDynamicMessages(descriptor)
        .fromSubscription(options.getInputSubscription())
        .withDeadLetterTopic(options.getOutputTopic());
  }

  /**
   * Writes messages to BigQuery, creating the table if necessary and allowed in {@code options}.
   *
   * <p>The BigQuery schema will be inferred from {@code descriptor} unless a JSON schema path is
   * specified in {@code options}.
   */
  @VisibleForTesting
  static Write<String> writeToBigQuery(
      PubSubProtoToBigQueryOptions options, Descriptor descriptor) {
    Write<String> write =
        BigQueryConverters.<String>createWriteTransform(options)
            .withFormatFunction(BigQueryConverters::convertJsonToTableRow);

    String schemaPath = options.getBigQueryTableSchemaPath();
    if (Strings.isNullOrEmpty(schemaPath)) {
      return write.withSchema(
          SchemaUtils.createBigQuerySchema(descriptor, options.getPreserveProtoFieldNames()));
    } else {
      return write.withJsonSchema(GCSUtils.getGcsFileAsString(schemaPath));
    }
  }

  /** {@link PTransform} that handles converting {@link PubsubMessage} values to JSON. */
  private static class ConvertDynamicProtoMessageToJson
      extends PTransform<PCollection<DynamicMessage>, PCollection<String>> {
    private final boolean preserveProtoName;

    private ConvertDynamicProtoMessageToJson(PubSubProtoToBigQueryOptions options) {
      this.preserveProtoName = options.getPreserveProtoFieldNames();
    }

    @Override
    public PCollection<String> expand(PCollection<DynamicMessage> input) {
      return input.apply(
          "Map to JSON",
          MapElements.into(TypeDescriptors.strings())
              .via(
                  message -> {
                    try {
                      JsonFormat.Printer printer = JsonFormat.printer();
                      return preserveProtoName
                          ? printer.preservingProtoFieldNames().print(message)
                          : printer.print(message);
                    } catch (InvalidProtocolBufferException e) {
                      throw new RuntimeException(e);
                    }
                  }));
    }
  }

  /**
   * Handles running the UDF.
   *
   * <p>If {@code options} is configured so as not to run the UDF, then the UDF will not be called.
   *
   * <p>This may add a branch to the pipeline for outputting failed UDF records to an unprocessed
   * topic.
   *
   * @param jsonCollection {@link PCollection} of JSON strings for use as input to the UDF
   * @param options the options containing info on running the UDF
   * @return the {@link PCollection} of UDF output as JSON or {@code jsonCollection} if UDF not
   *     called
   */
  @VisibleForTesting
  static PCollection<String> runUdf(
      PCollection<String> jsonCollection, PubSubProtoToBigQueryOptions options) {
    // In order to avoid generating a graph that makes it look like a UDF was called when none was
    // intended, simply return the input as "success" output.
    if (Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath())) {
      return jsonCollection;
    }

    // For testing purposes, we need to do this check before creating the PTransform rather than
    // in `expand`. Otherwise, we get a NullPointerException due to the PTransform not returning
    // a value.
    if (Strings.isNullOrEmpty(options.getJavascriptTextTransformFunctionName())) {
      throw new IllegalArgumentException(
          "JavaScript function name cannot be null or empty if file is set");
    }

    PCollectionTuple maybeSuccess = jsonCollection.apply("Run UDF", new RunUdf(options));

    maybeSuccess
        .get(UDF_FAILURE_TAG)
        .setCoder(FAILSAFE_CODER)
        .apply(
            "Get UDF Failures",
            ConvertFailsafeElementToPubsubMessage.<String, String>builder()
                .setOriginalPayloadSerializeFn(s -> ArrayUtils.toObject(s.getBytes(UTF_8)))
                .setErrorMessageAttributeKey("udfErrorMessage")
                .build())
        .apply("Write Failed UDF", writeUdfFailures(options));

    return maybeSuccess
        .get(UDF_SUCCESS_TAG)
        .setCoder(FAILSAFE_CODER)
        .apply(
            "Get UDF Output",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .setCoder(NullableCoder.of(StringUtf8Coder.of()));
  }

  /** {@link PTransform} that calls a UDF and returns both success and failure output. */
  private static class RunUdf extends PTransform<PCollection<String>, PCollectionTuple> {
    private final PubSubProtoToBigQueryOptions options;

    RunUdf(PubSubProtoToBigQueryOptions options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<String> input) {
      return input
          .apply("Prepare Failsafe UDF", makeFailsafe())
          .setCoder(FAILSAFE_CODER)
          .apply(
              "Call UDF",
              FailsafeJavascriptUdf.<String>newBuilder()
                  .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                  .setFunctionName(options.getJavascriptTextTransformFunctionName())
                  .setSuccessTag(UDF_SUCCESS_TAG)
                  .setFailureTag(UDF_FAILURE_TAG)
                  .build());
    }

    private static MapElements<String, FailsafeElement<String, String>> makeFailsafe() {
      return MapElements.into(new TypeDescriptor<FailsafeElement<String, String>>() {})
          .via((String json) -> FailsafeElement.of(json, json));
    }
  }

  /**
   * Returns a {@link PubsubIO.Write} configured to write UDF failures to the appropriate output
   * topic.
   */
  private static PubsubIO.Write<PubsubMessage> writeUdfFailures(
      PubSubProtoToBigQueryOptions options) {
    PubsubIO.Write<PubsubMessage> write = PubsubIO.writeMessages();
    return Strings.isNullOrEmpty(options.getUdfOutputTopic())
        ? write.to(options.getOutputTopic())
        : write.to(options.getUdfOutputTopic());
  }
}

Pub/Sub to Pub/Sub

Pub/Sub to Pub/Sub 模板是一种流处理流水线，可从 Pub/Sub 订阅中读取消息，并将这些消息写入其他 Pub/Sub 主题。该流水线还接受一个可选的消息属性键以及值，该值可用于过滤应写入 Pub/Sub 主题的消息。您可以使用此模板将消息从 Pub/Sub 订阅复制到带有可选消息过滤器的其他 Pub/Sub 主题。

对此流水线的要求：

源 Pub/Sub 订阅必须已存在才能执行流水线。
源 Pub/Sub 订阅必须是拉取订阅。
目的地 Pub/Sub 主题必须已存在才能执行此流水线。

模板参数

参数	说明
`inputSubscription`	要从中读取输入的 Pub/Sub 订阅，例如 `projects/<project-id>/subscriptions/<subscription-name>`。
`outputTopic`	要将输出写入其中的 Cloud Pub/Sub 主题，例如 `projects/<project-id>/topics/<topic-name>`。
`filterKey`	（可选）根据特性键过滤事件。如果未指定 `filterKey`，则不会应用过滤器。
`filterValue`	（可选）提供了 filterKey 时要使用的过滤器特性值。默认情况下，`filterValue` 为 null。

运行 Pub/Sub to Pub/Sub 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Pub/Sub template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_PubSub_to_Cloud_PubSub \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME,\
filterKey=FILTER_KEY,\
filterValue=FILTER_VALUE

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
SUBSCRIPTION_NAME：Pub/Sub 订阅名称
TOPIC_NAME：Pub/Sub 主题名称
FILTER_KEY：用于过滤事件的属性键。如果未指定键，则不会应用过滤器。
FILTER_VALUE：提供事件过滤键时要使用的过滤器属性值。接受有效的 Java 正则表达式字符串作为事件过滤值。如果提供了正则表达式，则应匹配整个表达式以过滤消息。不过滤部分匹配（如子字符串）。默认使用 null 事件过滤值。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_PubSub_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": TEMP_LOCATION,
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "inputSubscription": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME",
       "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME",
       "filterKey": "FILTER_KEY",
       "filterValue": "FILTER_VALUE"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
SUBSCRIPTION_NAME：Pub/Sub 订阅名称
TOPIC_NAME：Pub/Sub 主题名称
FILTER_KEY：用于过滤事件的属性键。如果未指定键，则不会应用过滤器。
FILTER_VALUE：提供事件过滤键时要使用的过滤器属性值。接受有效的 Java 正则表达式字符串作为事件过滤值。如果提供了正则表达式，则应匹配整个表达式以过滤消息。不过滤部分匹配（如子字符串）。默认使用 null 事件过滤值。

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import static com.google.common.base.Preconditions.checkNotNull;
import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;

import com.google.auto.value.AutoValue;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.PubsubToPubsub.Options;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import javax.annotation.Nullable;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** A template that copies messages from one Pubsub subscription to another Pubsub topic. */
@Template(
    name = "Cloud_PubSub_to_Cloud_PubSub",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to Pub/Sub",
    description =
        "Streaming pipeline. Reads from a Pub/Sub subscription and writes to a Pub/Sub topic. ",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class PubsubToPubsub {

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    options.setStreaming(true);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /**
     * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription. 2) Apply any
     * filters if an attribute=value pair is provided. 3) Write each PubSubMessage to output PubSub
     * topic.
     */
    pipeline
        .apply(
            "Read PubSub Events",
            PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
        .apply(
            "Filter Events If Enabled",
            ParDo.of(
                ExtractAndFilterEventsFn.newBuilder()
                    .withFilterKey(options.getFilterKey())
                    .withFilterValue(options.getFilterValue())
                    .build()))
        .apply("Write PubSub Events", PubsubIO.writeMessages().to(options.getOutputTopic()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * Options supported by {@link PubsubToPubsub}.
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options extends PipelineOptions, StreamingOptions {
    @TemplateParameter.PubsubSubscription(
        order = 1,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of 'projects/your-project-id/subscriptions/your-subscription-name'",
        example = "projects/your-project-id/subscriptions/your-subscription-name")
    @Validation.Required
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> inputSubscription);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Output Pub/Sub topic",
        helpText =
            "The name of the topic to which data should published, in the format of 'projects/your-project-id/topics/your-topic-name'",
        example = "projects/your-project-id/topics/your-topic-name")
    @Validation.Required
    ValueProvider<String> getOutputTopic();

    void setOutputTopic(ValueProvider<String> outputTopic);

    @TemplateParameter.Text(
        order = 3,
        optional = true,
        description = "Event filter key",
        helpText =
            "Attribute key by which events are filtered. No filters are applied if no key is specified.")
    ValueProvider<String> getFilterKey();

    void setFilterKey(ValueProvider<String> filterKey);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "Event filter value",
        helpText =
            "Filter attribute value to use if an event filter key is provided. Accepts a valid "
                + "Java Regex string as an event filter value. In case a regex is provided, the complete "
                + "expression should match in order for the message to be filtered. Partial matches (e.g. "
                + "substring) will not be filtered. A null event filter value is used by default.")
    ValueProvider<String> getFilterValue();

    void setFilterValue(ValueProvider<String> filterValue);
  }

  /**
   * DoFn that will determine if events are to be filtered. If filtering is enabled, it will only
   * publish events that pass the filter else, it will publish all input events.
   */
  @AutoValue
  public abstract static class ExtractAndFilterEventsFn extends DoFn<PubsubMessage, PubsubMessage> {

    private static final Logger LOG = LoggerFactory.getLogger(ExtractAndFilterEventsFn.class);

    // Counter tracking the number of incoming Pub/Sub messages.
    private static final Counter INPUT_COUNTER =
        Metrics.counter(ExtractAndFilterEventsFn.class, "inbound-messages");

    // Counter tracking the number of output Pub/Sub messages after the user provided filter
    // is applied.
    private static final Counter OUTPUT_COUNTER =
        Metrics.counter(ExtractAndFilterEventsFn.class, "filtered-outbound-messages");

    private Boolean doFilter;
    private String inputFilterKey;
    private Pattern inputFilterValueRegex;
    private Boolean isNullFilterValue;

    public static Builder newBuilder() {
      return new AutoValue_PubsubToPubsub_ExtractAndFilterEventsFn.Builder();
    }

    @Nullable
    abstract ValueProvider<String> filterKey();

    @Nullable
    abstract ValueProvider<String> filterValue();

    @Setup
    public void setup() {

      if (this.doFilter != null) {
        return; // Filter has been evaluated already
      }

      inputFilterKey = (filterKey() == null ? null : filterKey().get());

      if (inputFilterKey == null) {

        // Disable input message filtering.
        this.doFilter = false;

      } else {

        this.doFilter = true; // Enable filtering.

        String inputFilterValue = (filterValue() == null ? null : filterValue().get());

        if (inputFilterValue == null) {

          LOG.warn(
              "User provided a NULL for filterValue. Only messages with a value of NULL for the"
                  + " filterKey: {} will be filtered forward",
              inputFilterKey);

          // For backward compatibility, we are allowing filtering by null filterValue.
          this.isNullFilterValue = true;
          this.inputFilterValueRegex = null;
        } else {

          this.isNullFilterValue = false;
          try {
            inputFilterValueRegex = getFilterPattern(inputFilterValue);
          } catch (PatternSyntaxException e) {
            LOG.error("Invalid regex pattern for supplied filterValue: {}", inputFilterValue);
            throw new RuntimeException(e);
          }
        }

        LOG.info(
            "Enabling event filter [key: " + inputFilterKey + "][value: " + inputFilterValue + "]");
      }
    }

    @ProcessElement
    public void processElement(ProcessContext context) {

      INPUT_COUNTER.inc();
      if (!this.doFilter) {

        // Filter is not enabled
        writeOutput(context, context.element());
      } else {

        PubsubMessage message = context.element();
        String extractedValue = message.getAttribute(this.inputFilterKey);

        if (this.isNullFilterValue) {

          if (extractedValue == null) {
            // If we are filtering for null and the extracted value is null, we forward
            // the message.
            writeOutput(context, message);
          }

        } else {

          if (extractedValue != null
              && this.inputFilterValueRegex.matcher(extractedValue).matches()) {
            // If the extracted value is not null and it matches the filter,
            // we forward the message.
            writeOutput(context, message);
          }
        }
      }
    }

    /**
     * Write a {@link PubsubMessage} and increment the output counter.
     *
     * @param context {@link ProcessContext} to write {@link PubsubMessage} to.
     * @param message {@link PubsubMessage} output.
     */
    private void writeOutput(ProcessContext context, PubsubMessage message) {
      OUTPUT_COUNTER.inc();
      context.output(message);
    }

    /**
     * Return a {@link Pattern} based on a user provided regex string.
     *
     * @param regex Regex string to compile.
     * @return {@link Pattern}
     * @throws PatternSyntaxException If the string is an invalid regex.
     */
    private Pattern getFilterPattern(String regex) throws PatternSyntaxException {
      checkNotNull(regex, "Filter regex cannot be null.");
      return Pattern.compile(regex);
    }

    /** Builder class for {@link ExtractAndFilterEventsFn}. */
    @AutoValue.Builder
    abstract static class Builder {

      abstract Builder setFilterKey(ValueProvider<String> filterKey);

      abstract Builder setFilterValue(ValueProvider<String> filterValue);

      abstract ExtractAndFilterEventsFn build();

      /**
       * Method to set the filterKey used for filtering messages.
       *
       * @param filterKey Lookup key for the {@link PubsubMessage} attribute map.
       * @return {@link Builder}
       */
      public Builder withFilterKey(ValueProvider<String> filterKey) {
        checkArgument(filterKey != null, "withFilterKey(filterKey) called with null input.");
        return setFilterKey(filterKey);
      }

      /**
       * Method to set the filterValue used for filtering messages.
       *
       * @param filterValue Lookup value for the {@link PubsubMessage} attribute map.
       * @return {@link Builder}
       */
      public Builder withFilterValue(ValueProvider<String> filterValue) {
        checkArgument(filterValue != null, "withFilterValue(filterValue) called with null input.");
        return setFilterValue(filterValue);
      }
    }
  }
}

Pub/Sub to Splunk

Pub/Sub to Splunk 模板是一种流处理流水线，可从 Pub/Sub 订阅中读取消息，并通过 Splunk 的 HTTP Event Collector (HEC) 将消息载荷写入 Splunk。此模板的最常见使用场景是将日志导出到 Splunk。如需查看底层工作流的示例，请参阅使用 Dataflow 将支持生产环境的日志导出部署到 Splunk。

在写入 Splunk 之前，您还可以将 JavaScript 用户定义函数应用于消息载荷。任何未能成功处理的消息都会被转发到 Pub/Sub 未处理主题，以便进一步进行问题排查并重新处理。

要为 HEC 令牌提供额外保护，您还可以传入 Cloud KMS 密钥和使用 Cloud KMS 密钥加密的 base64 编码 HEC 令牌参数。如需详细了解如何对 HEC 令牌参数进行加密，请参阅 Cloud KMS API 加密端点。

对此流水线的要求：

源 Pub/Sub 订阅必须已存在才能运行此流水线。
在运行此流水线之前，Pub/Sub 未处理的主题必须已存在。
Splunk HEC 端点必须可从 Dataflow 工作器的网络访问。
Splunk HEC 令牌必须已生成并且可用。

模板参数

参数	说明
`inputSubscription`	要从中读取输入的 Pub/Sub 订阅，例如 `projects/<project-id>/subscriptions/<subscription-name>`。
`token`	（可选）Splunk HEC 身份验证令牌。如果 `tokenSource` 设置为 PlaINTEXT 或 KMS，则必须提供。
`url`	Splunk HEC 网址，必须可从运行流水线的 VPC 路由。例如，https://splunk-hec-host:8088。
`outputDeadletterTopic`	用于转发无法递送的消息的 Pub/Sub 主题，例如 `projects/<project-id>/topics/<topic-name>`。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`batchCount`	（可选）向 Splunk 发送多个事件的批次大小。默认值为 1（无批处理）。
`parallelism`	（可选）最大并行请求数。默认值为 1（无并行）。
`disableCertificateValidation`	（可选）停用 SSL 证书验证。默认为 false（已启用验证）。如果为 true，则不验证证书（所有证书均受信任），并且忽略“rootCaCertificatePath”参数。
`includePubsubMessage`	（可选）在载荷中包含完整的 Pub/Sub 消息。默认值为 false（只有数据元素包含在载荷中）。
`tokenSource`	令牌的来源。PLAINTEXT、KMS 或 SECRET_MANAGER 之一。如果使用了 Secret Manager，则必须提供此参数。如果 `tokenSource` 设置为 KMS，则必须提供 `tokenKMSEncryptionKey` 和加密的 `token`。如果 `tokenSource` 设置为 SECRET_MANAGER，则必须提供 `tokenSecretId`。如果 `tokenSource` 设置为 DLPINTEXT，则必须提供 `token`。
`tokenKMSEncryptionKey`	（可选）用于解密 HEC 令牌字符串的 Cloud KMS 密钥。如果 `tokenSource` 设置为 KMS，则必须提供此参数。如果提供了 Cloud KMS 密钥，则必须以加密方式传递 HEC 令牌字符串。
`tokenSecretId`	（可选）令牌的 Secret Manager Secret ID。如果 `tokenSource` 设置为 SECRET_MANAGER，则必须提供此参数。应采用以下格式：`projects/<project-id>/secrets/<secret-name>/versions/<secret-version>`。
`rootCaCertificatePath`	（可选）Cloud Storage 中根 CA 证书的完整网址。例如 `gs://mybucket/mycerts/privateCA.crt`。Cloud Storage 中提供的证书必须采用 DER 编码，并且可能以二进制或可打印 (Base64) 编码提供。如果证书是使用 Base64 编码提供的，则它必须以 -----BEGIN CERTIFICATE----- 开头为界，并且必须以 -----END CERTIFICATE----- 结尾为界。如果提供此参数，系统会提取此私有 CA 证书文件并将其添加到 Dataflow 工作器的信任库，以便验证 Splunk HEC 端点的 SSL 证书。如果未提供此参数，则使用默认信任库。
`enableBatchLogs`	（可选）指定是否应为写入 Splunk 的批次启用日志。默认值：`true`。
`enableGzipHttpCompression`	（可选）指定是否应压缩发送到 Splunk HEC 的 HTTP 请求（gzip 内容编码）。默认值：`true`。

运行 Pub/Sub to Splunk 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Splunk template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_PubSub_to_Splunk \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/INPUT_SUBSCRIPTION_NAME,\
token=TOKEN,\
url=URL,\
outputDeadletterTopic=projects/PROJECT_ID/topics/DEADLETTER_TOPIC_NAME,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
batchCount=BATCH_COUNT,\
parallelism=PARALLELISM,\
disableCertificateValidation=DISABLE_VALIDATION,\
rootCaCertificatePath=ROOT_CA_CERTIFICATE_PATH

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
INPUT_SUBSCRIPTION_NAME：Pub/Sub 订阅名称
TOKEN：Splunk 的 HTTP Event Collector 令牌
URL：Splunk 的 HTTP Event Collector 的网址路径（例如 https://splunk-hec-host:8088）
DEADLETTER_TOPIC_NAME：Pub/Sub 主题名称
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
BATCH_COUNT：用于向 Splunk 发送多个事件的批次大小
PARALLELISM：用于向 Splunk 发送事件的并行请求数
DISABLE_VALIDATION：如果要停用 SSL 证书验证则为 true
ROOT_CA_CERTIFICATE_PATH：Cloud Storage 中根 CA 证书的路径（例如 gs://your-bucket/privateCA.crt）

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_PubSub_to_Splunk
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "gs://your-bucket/temp",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
   },
   "parameters": {
       "inputSubscription": "projects/PROJECT_ID/subscriptions/INPUT_SUBSCRIPTION_NAME",
       "token": "TOKEN",
       "url": "URL",
       "outputDeadletterTopic": "projects/PROJECT_ID/topics/DEADLETTER_TOPIC_NAME",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "batchCount": "BATCH_COUNT",
       "parallelism": "PARALLELISM",
       "disableCertificateValidation": "DISABLE_VALIDATION",
       "rootCaCertificatePath": "ROOT_CA_CERTIFICATE_PATH"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
INPUT_SUBSCRIPTION_NAME：Pub/Sub 订阅名称
TOKEN：Splunk 的 HTTP Event Collector 令牌
URL：Splunk 的 HTTP Event Collector 的网址路径（例如 https://splunk-hec-host:8088）
DEADLETTER_TOPIC_NAME：Pub/Sub 主题名称
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
BATCH_COUNT：用于向 Splunk 发送多个事件的批次大小
PARALLELISM：用于向 Splunk 发送事件的并行请求数
DISABLE_VALIDATION：如果要停用 SSL 证书验证则为 true
ROOT_CA_CERTIFICATE_PATH：Cloud Storage 中根 CA 证书的路径（例如 gs://your-bucket/privateCA.crt）

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.splunk.SplunkEvent;
import com.google.cloud.teleport.splunk.SplunkEventCoder;
import com.google.cloud.teleport.splunk.SplunkIO;
import com.google.cloud.teleport.splunk.SplunkWriteError;
import com.google.cloud.teleport.templates.PubSubToSplunk.PubSubToSplunkOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubReadSubscriptionOptions;
import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubWriteDeadletterTopicOptions;
import com.google.cloud.teleport.templates.common.SplunkConverters;
import com.google.cloud.teleport.templates.common.SplunkConverters.SplunkOptions;
import com.google.cloud.teleport.util.TokenNestedValueProvider;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.annotations.VisibleForTesting;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonSyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToSplunk} pipeline is a streaming pipeline which ingests data from Cloud
 * Pub/Sub, executes a UDF, converts the output to {@link SplunkEvent}s and writes those records
 * into Splunk's HEC endpoint. Any errors which occur in the execution of the UDF, conversion to
 * {@link SplunkEvent} or writing to HEC will be streamed into a Pub/Sub topic.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The source Pub/Sub subscription exists.
 *   <li>HEC end-point is routable from the VPC where the Dataflow job executes.
 *   <li>Deadletter topic exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToSplunk \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template/PubSubToSplunk \
 * --runner=${RUNNER}
 * "
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-splunk-$USER-`date +"%Y%m%d-%H%M%S%z"`
 * BATCH_COUNT=1
 * PARALLELISM=5
 *
 * # Execute the templated pipeline:
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template/PubSubToSplunk \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * token=my-splunk-hec-token,\
 * url=http://splunk-hec-server-address:8088,\
 * batchCount=${BATCH_COUNT},\
 * parallelism=${PARALLELISM},\
 * disableCertificateValidation=false,\
 * outputDeadletterTopic=projects/${PROJECT_ID}/topics/deadletter-topic-name,\
 * javascriptTextTransformGcsPath=gs://${BUCKET_NAME}/splunk/js/my-js-udf.js,\
 * javascriptTextTransformFunctionName=myUdf"
 * </pre>
 */
@Template(
    name = "Cloud_PubSub_to_Splunk",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to Splunk",
    description =
        "A pipeline that reads from a Pub/Sub subscription and writes to Splunk's HTTP Event Collector (HEC).",
    optionsClass = PubSubToSplunkOptions.class,
    optionsOrder = {
      PubsubReadSubscriptionOptions.class,
      SplunkOptions.class,
      JavascriptTextTransformerOptions.class,
      PubsubWriteDeadletterTopicOptions.class
    },
    contactInformation = "https://cloud.google.com/support")
public class PubSubToSplunk {

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** Counter to track inbound messages from source. */
  private static final Counter INPUT_MESSAGES_COUNTER =
      Metrics.counter(PubSubToSplunk.class, "inbound-pubsub-messages");

  /** The tag for successful {@link SplunkEvent} conversion. */
  private static final TupleTag<SplunkEvent> SPLUNK_EVENT_OUT = new TupleTag<SplunkEvent>() {};

  /** The tag for failed {@link SplunkEvent} conversion. */
  private static final TupleTag<FailsafeElement<String, String>> SPLUNK_EVENT_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output for the UDF. */
  private static final TupleTag<FailsafeElement<String, String>> UDF_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the dead-letter output of the udf. */
  private static final TupleTag<FailsafeElement<String, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** GSON to process a {@link PubsubMessage}. */
  private static final Gson GSON = new Gson();

  /** Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToSplunk.class);

  private static final Boolean DEFAULT_INCLUDE_PUBSUB_MESSAGE = false;

  @VisibleForTesting protected static final String PUBSUB_MESSAGE_ATTRIBUTE_FIELD = "attributes";
  @VisibleForTesting protected static final String PUBSUB_MESSAGE_DATA_FIELD = "data";
  private static final String PUBSUB_MESSAGE_ID_FIELD = "messageId";

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for it's execution to finish. If blocking execution is required, use the {@link
   * PubSubToSplunk#run(PubSubToSplunkOptions)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {

    PubSubToSplunkOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(PubSubToSplunkOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(PubSubToSplunkOptions options) {

    Pipeline pipeline = Pipeline.create(options);

    // Register coders.
    CoderRegistry registry = pipeline.getCoderRegistry();
    registry.registerCoderForClass(SplunkEvent.class, SplunkEventCoder.of());
    registry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    /*
     * Steps:
     *  1) Read messages in from Pub/Sub
     *  2) Convert message to FailsafeElement for processing.
     *  3) Apply user provided UDF (if any) on the input strings.
     *  4) Convert successfully transformed messages into SplunkEvent objects
     *  5) Write SplunkEvents to Splunk's HEC end point.
     *  5a) Wrap write failures into a FailsafeElement.
     *  6) Collect errors from UDF transform (#3), SplunkEvent transform (#4)
     *     and writing to Splunk HEC (#5) and stream into a Pub/Sub deadletter topic.
     */

    // 1) Read messages in from Pub/Sub
    PCollection<String> stringMessages =
        pipeline.apply(
            "ReadMessages",
            new ReadMessages(options.getInputSubscription(), options.getIncludePubsubMessage()));

    // 2) Convert message to FailsafeElement for processing.
    PCollectionTuple transformedOutput =
        stringMessages
            .apply(
                "ConvertToFailsafeElement",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via(input -> FailsafeElement.of(input, input)))

            // 3) Apply user provided UDF (if any) on the input strings.
            .apply(
                "ApplyUDFTransformation",
                FailsafeJavascriptUdf.<String>newBuilder()
                    .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                    .setFunctionName(options.getJavascriptTextTransformFunctionName())
                    .setLoggingEnabled(ValueProvider.StaticValueProvider.of(true))
                    .setSuccessTag(UDF_OUT)
                    .setFailureTag(UDF_DEADLETTER_OUT)
                    .build());

    // 4) Convert successfully transformed messages into SplunkEvent objects
    PCollectionTuple convertToEventTuple =
        transformedOutput
            .get(UDF_OUT)
            .apply(
                "ConvertToSplunkEvent",
                SplunkConverters.failsafeStringToSplunkEvent(
                    SPLUNK_EVENT_OUT, SPLUNK_EVENT_DEADLETTER_OUT));

    // 5) Write SplunkEvents to Splunk's HEC end point.
    PCollection<SplunkWriteError> writeErrors =
        convertToEventTuple
            .get(SPLUNK_EVENT_OUT)
            .apply(
                "WriteToSplunk",
                SplunkIO.writeBuilder()
                    .withToken(
                        new TokenNestedValueProvider(
                            options.getTokenSecretId(),
                            options.getTokenKMSEncryptionKey(),
                            options.getToken(),
                            options.getTokenSource()))
                    .withUrl(options.getUrl())
                    .withBatchCount(options.getBatchCount())
                    .withParallelism(options.getParallelism())
                    .withDisableCertificateValidation(options.getDisableCertificateValidation())
                    .withRootCaCertificatePath(options.getRootCaCertificatePath())
                    .withEnableBatchLogs(options.getEnableBatchLogs())
                    .withEnableGzipHttpCompression(options.getEnableGzipHttpCompression())
                    .build());

    // 5a) Wrap write failures into a FailsafeElement.
    PCollection<FailsafeElement<String, String>> wrappedSplunkWriteErrors =
        writeErrors.apply(
            "WrapSplunkWriteErrors",
            ParDo.of(
                new DoFn<SplunkWriteError, FailsafeElement<String, String>>() {

                  @ProcessElement
                  public void processElement(ProcessContext context) {
                    SplunkWriteError error = context.element();
                    FailsafeElement<String, String> failsafeElement =
                        FailsafeElement.of(error.payload(), error.payload());

                    if (error.statusMessage() != null) {
                      failsafeElement.setErrorMessage(error.statusMessage());
                    }

                    if (error.statusCode() != null) {
                      failsafeElement.setErrorMessage(
                          String.format("Splunk write status code: %d", error.statusCode()));
                    }
                    context.output(failsafeElement);
                  }
                }));

    // 6) Collect errors from UDF transform (#4), SplunkEvent transform (#5)
    //     and writing to Splunk HEC (#6) and stream into a Pub/Sub deadletter topic.
    PCollectionList.of(
            ImmutableList.of(
                convertToEventTuple.get(SPLUNK_EVENT_DEADLETTER_OUT),
                wrappedSplunkWriteErrors,
                transformedOutput.get(UDF_DEADLETTER_OUT)))
        .apply("FlattenErrors", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WriteStringMessageErrorsToPubSub.newBuilder()
                .setErrorRecordsTopic(options.getOutputDeadletterTopic())
                .build());

    return pipeline.run();
  }

  /**
   * The {@link PubSubToSplunkOptions} class provides the custom options passed by the executor at
   * the command line.
   */
  public interface PubSubToSplunkOptions
      extends SplunkOptions,
          PubsubReadSubscriptionOptions,
          PubsubWriteDeadletterTopicOptions,
          JavascriptTextTransformerOptions {}

  /**
   * A {@link PTransform} that reads messages from a Pub/Sub subscription, increments a counter and
   * returns a {@link PCollection} of {@link String} messages.
   */
  private static class ReadMessages extends PTransform<PBegin, PCollection<String>> {
    private final ValueProvider<String> subscriptionName;
    private final ValueProvider<Boolean> inputIncludePubsubMessageFlag;
    private Boolean includePubsubMessage;

    ReadMessages(
        ValueProvider<String> subscriptionName,
        ValueProvider<Boolean> inputIncludePubsubMessageFlag) {
      this.subscriptionName = subscriptionName;
      this.inputIncludePubsubMessageFlag = inputIncludePubsubMessageFlag;
    }

    @Override
    public PCollection<String> expand(PBegin input) {
      return input
          .apply(
              "ReadPubsubMessage",
              PubsubIO.readMessagesWithAttributes().fromSubscription(subscriptionName))
          .apply(
              "ExtractMessageIfRequired",
              ParDo.of(
                  new DoFn<PubsubMessage, String>() {

                    @Setup
                    public void setup() {
                      if (inputIncludePubsubMessageFlag != null) {
                        includePubsubMessage = inputIncludePubsubMessageFlag.get();
                      }
                      includePubsubMessage =
                          MoreObjects.firstNonNull(
                              includePubsubMessage, DEFAULT_INCLUDE_PUBSUB_MESSAGE);
                      LOG.info("includePubsubMessage set to: {}", includePubsubMessage);
                    }

                    @ProcessElement
                    public void processElement(ProcessContext context) {
                      if (includePubsubMessage) {
                        context.output(formatPubsubMessage(context.element()));
                      } else {
                        context.output(
                            new String(context.element().getPayload(), StandardCharsets.UTF_8));
                      }
                    }
                  }))
          .apply(
              "CountMessages",
              ParDo.of(
                  new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext context) {
                      INPUT_MESSAGES_COUNTER.inc();
                      context.output(context.element());
                    }
                  }));
    }
  }

  /**
   * Utility method that formats {@link org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage} according
   * to the model defined in {@link com.google.pubsub.v1.PubsubMessage}.
   *
   * @param pubsubMessage {@link org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage}
   * @return JSON String that adheres to the model defined in {@link
   *     com.google.pubsub.v1.PubsubMessage}
   */
  @VisibleForTesting
  protected static String formatPubsubMessage(PubsubMessage pubsubMessage) {
    JsonObject messageJson = new JsonObject();

    String payload = new String(pubsubMessage.getPayload(), StandardCharsets.UTF_8);
    try {
      JsonObject data = GSON.fromJson(payload, JsonObject.class);
      messageJson.add(PUBSUB_MESSAGE_DATA_FIELD, data);
    } catch (JsonSyntaxException e) {
      messageJson.addProperty(PUBSUB_MESSAGE_DATA_FIELD, payload);
    }

    JsonObject attributes = getAttributesJson(pubsubMessage.getAttributeMap());
    messageJson.add(PUBSUB_MESSAGE_ATTRIBUTE_FIELD, attributes);

    if (pubsubMessage.getMessageId() != null) {
      messageJson.addProperty(PUBSUB_MESSAGE_ID_FIELD, pubsubMessage.getMessageId());
    }

    return messageJson.toString();
  }

  /**
   * Constructs a {@link JsonObject} from a {@link Map} of Pub/Sub attributes.
   *
   * @param attributesMap {@link Map} of Pub/Sub attributes
   * @return {@link JsonObject} of Pub/Sub attributes
   */
  private static JsonObject getAttributesJson(Map<String, String> attributesMap) {
    JsonObject attributesJson = new JsonObject();
    for (String key : attributesMap.keySet()) {
      attributesJson.addProperty(key, attributesMap.get(key));
    }

    return attributesJson;
  }
}

Pub/Sub to Avro Files on Cloud Storage

Pub/Sub to Avro files on Cloud Storage 模板是一个流处理流水线，可从 Pub/Sub 主题中读取数据，并将 Avro 文件写入指定的 Cloud Storage 存储桶。

对此流水线的要求：

Pub/Sub 输入主题必须已存在才能执行此流水线。

模板参数

参数	说明
`inputTopic`	要订阅用来处理消息的 Pub/Sub 主题。主题名称必须采用 `projects/<project-id>/topics/<topic-name>` 格式。
`outputDirectory`	要用于归档输出 Avro 文件的输出目录。末尾必须包含 `/`。例如：`gs://example-bucket/example-directory/`。
`avroTempDirectory`	临时 Avro 文件的目录。末尾必须包含 `/`。例如：`gs://example-bucket/example-directory/`。
`outputFilenamePrefix`	（可选）Avro 文件的输出文件名前缀。
`outputFilenameSuffix`	（可选）Avro 文件的输出文件名后缀。
`outputShardTemplate`	（可选）输出文件的分片模板。它被指定为字母 `S` 或 `N` 的重复序列。例如 `SSS-NNN`。这些字母会分别替换成分片编号或分片总数。如果未指定此参数，则默认模板格式为 `W-P-SS-of-NN`。

运行 Pub/Sub to Cloud Storage Avro 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Avro Files on Cloud Storage template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_PubSub_to_Avro \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputTopic=projects/PROJECT_ID/topics/TOPIC_NAME,\
outputDirectory=gs://BUCKET_NAME/output/,\
outputFilenamePrefix=FILENAME_PREFIX,\
outputFilenameSuffix=FILENAME_SUFFIX,\
outputShardTemplate=SHARD_TEMPLATE,\
avroTempDirectory=gs://BUCKET_NAME/temp/

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置；例如 gs://your-bucket/temp
TOPIC_NAME：Pub/Sub 主题名称
BUCKET_NAME - Cloud Storage 存储桶的名称。
FILENAME_PREFIX：首选输出文件名前缀
FILENAME_SUFFIX：首选输出文件名后缀
SHARD_TEMPLATE：首选输出分片模板

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_PubSub_to_Avro
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": TEMP_LOCATION,
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "inputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME",
       "outputDirectory": "gs://BUCKET_NAME/output/",
       "avroTempDirectory": "gs://BUCKET_NAME/temp/",
       "outputFilenamePrefix": "FILENAME_PREFIX",
       "outputFilenameSuffix": "FILENAME_SUFFIX",
       "outputShardTemplate": "SHARD_TEMPLATE"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置；例如 gs://your-bucket/temp
TOPIC_NAME：Pub/Sub 主题名称
BUCKET_NAME - Cloud Storage 存储桶的名称。
FILENAME_PREFIX：首选输出文件名前缀
FILENAME_SUFFIX：首选输出文件名后缀
SHARD_TEMPLATE：首选输出分片模板

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.avro.AvroPubsubMessageRecord;
import com.google.cloud.teleport.io.WindowedFilenamePolicy;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.options.WindowedFilenamePolicyOptions;
import com.google.cloud.teleport.templates.PubsubToAvro.Options;
import com.google.cloud.teleport.util.DurationUtils;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and outputs the raw data into
 * windowed Avro files at the specified output directory.
 *
 * <p>Files output will have the following schema:
 *
 * <pre>
 *   {
 *      "type": "record",
 *      "name": "AvroPubsubMessageRecord",
 *      "namespace": "com.google.cloud.teleport.avro",
 *      "fields": [
 *        {"name": "message", "type": {"type": "array", "items": "bytes"}},
 *        {"name": "attributes", "type": {"type": "map", "values": "string"}},
 *        {"name": "timestamp", "type": "long"}
 *      ]
 *   }
 * </pre>
 *
 * <p>Example Usage:
 *
 * <pre>
 * # Set the pipeline vars
 * PIPELINE_NAME=PubsubToAvro
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_BUCKET=TEMPLATE STORAGE BUCKET NAME HERE
 * OUTPUT_BUCKET=JOB OUTPUT BUCKET NAME HERE
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 * PIPELINE_FOLDER=gs://${PIPELINE_BUCKET}/dataflow/pipelines/pubsub-to-gcs-avro
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER} \
 * --useSubscription=${USE_SUBSCRIPTION}"
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * windowDuration=5m,\
 * numShards=1,\
 * userTempLocation=gs://${OUTPUT_BUCKET}/tmp/,\
 * outputDirectory=gs://${OUTPUT_BUCKET}/output/,\
 * outputFilenamePrefix=windowed-file,\
 * outputFilenameSuffix=.txt"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * windowDuration=5m,\
 * numShards=1,\
 * userTempLocation=gs://${OUTPUT_BUCKET}/tmp/,\
 * outputDirectory=gs://${OUTPUT_BUCKET}/output/,\
 * outputFilenamePrefix=windowed-file,\
 * outputFilenameSuffix=.avro"
 * </pre>
 */
@Template(
    name = "Cloud_PubSub_to_Avro",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to Avro Files on Cloud Storage",
    description =
        "Streaming pipeline. Reads from a Pub/Sub subscription and outputs windowed Avro files to"
            + " the specified directory.",
    optionsClass = Options.class,
    skipOptions = "inputSubscription",
    contactInformation = "https://cloud.google.com/support")
public class PubsubToAvro {

  /**
   * Options supported by the pipeline.
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions, StreamingOptions, WindowedFilenamePolicyOptions {
    @TemplateParameter.PubsubSubscription(
        order = 1,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'",
        example = "projects/your-project-id/subscriptions/your-subscription-name")
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Pub/Sub input topic",
        helpText =
            "Pub/Sub topic to read the input from, in the format of "
                + "'projects/your-project-id/topics/your-topic-name'")
    ValueProvider<String> getInputTopic();

    void setInputTopic(ValueProvider<String> value);

    @TemplateCreationParameter(value = "false")
    @Description(
        "This determines whether the template reads from " + "a pub/sub subscription or a topic")
    @Default.Boolean(false)
    Boolean getUseSubscription();

    void setUseSubscription(Boolean value);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime"
                + " formatting is used to parse directory path for date & time formatters.")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 5,
        description = "Output filename prefix of the files to write",
        helpText = "The prefix to place on each windowed file.")
    @Default.String("output")
    ValueProvider<String> getOutputFilenamePrefix();

    void setOutputFilenamePrefix(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 6,
        optional = true,
        description = "Output filename suffix of the files to write",
        helpText =
            "The suffix to place on each windowed file. Typically a file extension such "
                + "as .txt or .csv.")
    @Default.String("")
    ValueProvider<String> getOutputFilenameSuffix();

    void setOutputFilenameSuffix(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 7,
        description = "Temporary Avro write directory",
        helpText = "Directory for temporary Avro files.")
    @Required
    ValueProvider<String> getAvroTempDirectory();

    void setAvroTempDirectory(ValueProvider<String> value);
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    PCollection<PubsubMessage> messages = null;

    /*
     * Steps:
     *   1) Read messages from PubSub
     *   2) Window the messages into minute intervals specified by the executor.
     *   3) Output the windowed data into Avro files, one per window by default.
     */

    if (options.getUseSubscription()) {
      messages =
          pipeline.apply(
              "Read PubSub Events",
              PubsubIO.readMessagesWithAttributes()
                  .fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "Read PubSub Events",
              PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()));
    }
    messages
        .apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()))
        .apply(
            options.getWindowDuration() + " Window",
            Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))

        // Apply windowed file writes. Use a NestedValueProvider because the filename
        // policy requires a resourceId generated from the input value at runtime.
        .apply(
            "Write File(s)",
            AvroIO.write(AvroPubsubMessageRecord.class)
                .to(
                    WindowedFilenamePolicy.writeWindowedFiles()
                        .withOutputDirectory(options.getOutputDirectory())
                        .withOutputFilenamePrefix(options.getOutputFilenamePrefix())
                        .withShardTemplate(options.getOutputShardTemplate())
                        .withSuffix(options.getOutputFilenameSuffix())
                        .withYearPattern(options.getYearPattern())
                        .withMonthPattern(options.getMonthPattern())
                        .withDayPattern(options.getDayPattern())
                        .withHourPattern(options.getHourPattern())
                        .withMinutePattern(options.getMinutePattern()))
                .withTempDirectory(
                    NestedValueProvider.of(
                        options.getAvroTempDirectory(),
                        (SerializableFunction<String, ResourceId>)
                            input -> FileBasedSink.convertToFileResourceIfPossible(input)))
                /*.withTempDirectory(FileSystems.matchNewResource(
                options.getAvroTempDirectory(),
                Boolean.TRUE))
                */
                .withWindowedWrites()
                .withNumShards(options.getNumShards()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * Converts an incoming {@link PubsubMessage} to the {@link AvroPubsubMessageRecord} class by
   * copying it's fields and the timestamp of the message.
   */
  static class PubsubMessageToArchiveDoFn extends DoFn<PubsubMessage, AvroPubsubMessageRecord> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          new AvroPubsubMessageRecord(
              message.getPayload(), message.getAttributeMap(), context.timestamp().getMillis()));
    }
  }
}

Pub/Sub Topic to Text Files on Cloud Storage

Pub/Sub to Cloud Storage Text 模板是一种流处理流水线，可从 Pub/Sub 主题读取记录并将其保存为一系列文本格式的 Cloud Storage 文件。使用此模板，您可以快速地保存 Pub/Sub 中的数据以留待将来使用。默认情况下，此模板每 5 分钟生成一个新文件。

对此流水线的要求：

Pub/Sub 主题必须已存在才能执行此流水线。
发布到主题的消息必须采用文本格式。
发布到主题的消息不得包含任何换行符。请注意，每条 Pub/Sub 消息在输出文件中均会保存为一行。

模板参数

参数	说明
`inputTopic`	要从中读取输入的 Pub/Sub 主题。主题名称应采用 `projects/<project-id>/topics/<topic-name>` 格式。
`outputDirectory`	用于写入输出文件的路径和文件名前缀，例如 `gs://bucket-name/path/`。该值必须以斜杠结尾。
`outputFilenamePrefix`	要在各窗口文件上放置的前缀。例如 `output-`。
`outputFilenameSuffix`	要放置于每个窗口化文件上的后缀，通常是文件扩展名，例如 `.txt` 或 `.csv`。
`outputShardTemplate`	分片式模板定义每个窗口文件的动态部分。默认情况下，该管道使用单一碎片输出到各窗口内的文件系统。这意味着每个窗口的所有数据都会输出到单个文件中。`outputShardTemplate` 默认为 `W-P-SS-of-NN`，其中 `W` 是窗口日期范围，`P` 是窗格信息，`S` 是分片编号，而 `N` 是分片数。对于单个文件，`outputShardTemplate` 的 `SS-of-NN` 部分为 `00-of-01`。

运行 Pub/Sub Topic to Text Files on Cloud Storage 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Text Files on Cloud Storage template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_PubSub_to_GCS_Text \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputTopic=projects/PROJECT_ID/topics/TOPIC_NAME,\
outputDirectory=gs://BUCKET_NAME/output/,\
outputFilenamePrefix=output-,\
outputFilenameSuffix=.txt

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
BUCKET_NAME - Cloud Storage 存储桶的名称。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_PubSub_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "TEMP_LOCATION",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "inputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
       "outputDirectory": "gs://BUCKET_NAME/output/",
       "outputFilenamePrefix": "output-",
       "outputFilenameSuffix": ".txt",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
BUCKET_NAME - Cloud Storage 存储桶的名称。

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.io.WindowedFilenamePolicy;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.options.WindowedFilenamePolicyOptions;
import com.google.cloud.teleport.templates.PubsubToText.Options;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.cloud.teleport.util.DurationUtils;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and outputs the raw data into
 * windowed files at the specified output directory.
 *
 * <p>Example Usage:
 *
 * <pre>
 * # Set the pipeline vars
 * PIPELINE_NAME=PubsubToText
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_BUCKET=TEMPLATE STORAGE BUCKET NAME HERE
 * OUTPUT_BUCKET=JOB OUTPUT BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${PIPELINE_BUCKET}/dataflow/pipelines/pubsub-to-gcs-text
 * USE_SUBSCRIPTION=true or false depending on whether the pipeline should read
 *                  from a Pub/Sub Subscription or a Pub/Sub Topic.
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.${PIPELINE_NAME} \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER} \
 * --useSubscription=${USE_SUBSCRIPTION}"
 *
 * # Execute the template
 * JOB_NAME=pubsub-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * # Execute a pipeline to read from a Topic.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputTopic=projects/${PROJECT_ID}/topics/input-topic-name,\
 * userTempLocation=gs://${OUTPUT_BUCKET}/tmp/,\
 * windowDuration=5m,\
 * numShards=1,\
 * outputDirectory=gs://${OUTPUT_BUCKET}/output/,\
 * outputFilenamePrefix=windowed-file,\
 * outputFilenameSuffix=.txt"
 *
 * # Execute a pipeline to read from a Subscription.
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputSubscription=projects/${PROJECT_ID}/subscriptions/input-subscription-name,\
 * windowDuration=5m,\
 * numShards=1,\
 * userTempLocation=gs://${OUTPUT_BUCKET}/tmp/,\
 * outputDirectory=gs://${OUTPUT_BUCKET}/output/,\
 * outputFilenamePrefix=windowed-file,\
 * outputFilenameSuffix=.txt"
 * </pre>
 */
@Template(
    name = "Cloud_PubSub_to_GCS_Text",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to Text Files on Cloud Storage",
    description =
        "Streaming pipeline. Reads records from Pub/Sub and writes them to Cloud Storage, creating"
            + " a text file for each five minute window. Note that this pipeline assumes no"
            + " newlines in the body of the Pub/Sub message and thus each message becomes a single"
            + " line in the output file.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class PubsubToText {

  /**
   * Options supported by the pipeline.
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions, StreamingOptions, WindowedFilenamePolicyOptions {

    @TemplateParameter.PubsubSubscription(
        order = 1,
        optional = true,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'",
        example = "projects/your-project-id/subscriptions/your-subscription-name")
    ValueProvider<String> getInputSubscription();

    void setInputSubscription(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        optional = true,
        description = "Pub/Sub input topic",
        helpText =
            "Pub/Sub topic to read the input from, in the format of "
                + "'projects/your-project-id/topics/your-topic-name'")
    ValueProvider<String> getInputTopic();

    void setInputTopic(ValueProvider<String> value);

    @TemplateCreationParameter(value = "false")
    @Description(
        "This determines whether the template reads from a Pub/Sub subscription or a topic")
    @Default.Boolean(false)
    Boolean getUseSubscription();

    void setUseSubscription(Boolean value);

    @TemplateParameter.GcsWriteFolder(
        order = 3,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime"
                + " formatting is used to parse directory path for date & time formatters.")
    @Required
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        optional = true,
        description = "User provided temp location",
        helpText =
            "The user provided directory to output temporary files to. Must end with a slash.")
    ValueProvider<String> getUserTempLocation();

    void setUserTempLocation(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 5,
        description = "Output filename prefix of the files to write",
        helpText = "The prefix to place on each windowed file.")
    @Default.String("output")
    @Required
    ValueProvider<String> getOutputFilenamePrefix();

    void setOutputFilenamePrefix(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 6,
        optional = true,
        description = "Output filename suffix of the files to write",
        helpText =
            "The suffix to place on each windowed file. Typically a file extension such "
                + "as .txt or .csv.")
    @Default.String("")
    ValueProvider<String> getOutputFilenameSuffix();

    void setOutputFilenameSuffix(ValueProvider<String> value);
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    options.setStreaming(true);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    PCollection<String> messages = null;

    /*
     * Steps:
     *   1) Read string messages from PubSub
     *   2) Window the messages into minute intervals specified by the executor.
     *   3) Output the windowed files to GCS
     */
    if (options.getUseSubscription()) {
      messages =
          pipeline.apply(
              "Read PubSub Events",
              PubsubIO.readStrings().fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "Read PubSub Events", PubsubIO.readStrings().fromTopic(options.getInputTopic()));
    }
    messages
        .apply(
            options.getWindowDuration() + " Window",
            Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))

        // Apply windowed file writes. Use a NestedValueProvider because the filename
        // policy requires a resourceId generated from the input value at runtime.
        .apply(
            "Write File(s)",
            TextIO.write()
                .withWindowedWrites()
                .withNumShards(options.getNumShards())
                .to(
                    WindowedFilenamePolicy.writeWindowedFiles()
                        .withOutputDirectory(options.getOutputDirectory())
                        .withOutputFilenamePrefix(options.getOutputFilenamePrefix())
                        .withShardTemplate(options.getOutputShardTemplate())
                        .withSuffix(options.getOutputFilenameSuffix())
                        .withYearPattern(options.getYearPattern())
                        .withMonthPattern(options.getMonthPattern())
                        .withDayPattern(options.getDayPattern())
                        .withHourPattern(options.getHourPattern())
                        .withMinutePattern(options.getMinutePattern()))
                .withTempDirectory(
                    NestedValueProvider.of(
                        maybeUseUserTempLocation(
                            options.getUserTempLocation(), options.getOutputDirectory()),
                        (SerializableFunction<String, ResourceId>)
                            input -> FileBasedSink.convertToFileResourceIfPossible(input))));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * Utility method for using optional parameter userTempLocation as TempDirectory. This is useful
   * when output bucket is locked and temporary data cannot be deleted.
   *
   * @param userTempLocation user provided temp location
   * @param outputLocation user provided outputDirectory to be used as the default temp location
   * @return userTempLocation if available, otherwise outputLocation is returned.
   */
  private static ValueProvider<String> maybeUseUserTempLocation(
      ValueProvider<String> userTempLocation, ValueProvider<String> outputLocation) {
    return DualInputNestedValueProvider.of(
        userTempLocation,
        outputLocation,
        new SerializableFunction<TranslatorInput<String, String>, String>() {
          @Override
          public String apply(TranslatorInput<String, String> input) {
            return (input.getX() != null) ? input.getX() : input.getY();
          }
        });
  }
}

Pub/Sub Topic or Subscription to Text Files on Cloud Storage

Pub/Sub Topic or Subscription to Cloud Storage Text 模板是一种流处理流水线，可从 Pub/Sub 读取记录并将其保存为一系列文本格式的 Cloud Storage 文件。使用此模板，您可以快速地保存 Pub/Sub 中的数据以留待将来使用。默认情况下，此模板每 5 分钟生成一个新文件。

对此流水线的要求：

Pub/Sub 主题或订阅必须已存在才能执行此流水线。
发布到主题的消息必须采用文本格式。
发布到主题的消息不得包含任何换行符。请注意，每条 Pub/Sub 消息在输出文件中均会保存为一行。

模板参数

参数	说明
`inputTopic`	要从中读取输入的 Pub/Sub 主题。主题名称应采用 `projects/<project-id>/topics/<topic-name>` 格式。如果提供此参数，则不应提供 `inputSubscription`。
`inputSubscription`	要从中读取输入的 Pub/Sub 订阅。订阅名称应采用 `projects/<project-id>/subscription/<subscription-name>` 格式。如果提供此参数，则不应提供 `inputTopic`。
`outputDirectory`	用于写入输出文件的路径和文件名前缀，例如 `gs://bucket-name/path/`。该值必须以斜杠结尾。
`outputFilenamePrefix`	要在各窗口文件上放置的前缀。例如 `output-`。
`outputFilenameSuffix`	要放置于每个窗口化文件上的后缀，通常是文件扩展名，例如 `.txt` 或 `.csv`。
`outputShardTemplate`	分片式模板定义每个窗口文件的动态部分。默认情况下，该管道使用单一碎片输出到各窗口内的文件系统。这意味着每个窗口的所有数据都会输出到单个文件中。`outputShardTemplate` 默认为 `W-P-SS-of-NN`，其中 `W` 是窗口日期范围，`P` 是窗格信息，`S` 是分片编号，而 `N` 是分片数。对于单个文件，`outputShardTemplate` 的 `SS-of-NN` 部分为 `00-of-01`。
`windowDuration`	（可选）窗口时长是将数据写入输出目录的时间间隔。请根据流水线的吞吐量配置时长。例如，较高的吞吐量可能需要较短的窗口时长，以便数据适应内存。默认为 5 分钟，最短可为 1 秒。允许的格式如下：[int]s（表示数秒，例如 5s）、[int]m（表示数分钟，例如 12m）、[int]h（表示数小时，例如 2h）。

运行 Pub/Sub Topic or Subscription to Text Files on Cloud Storage 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub Topic or Subscription to Text Files on Cloud Storage template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template jobs run JOB_NAME \
    --project=YOUR_PROJECT_ID \
    --region REGION_NAME \
    --template-file-gcs-location gs://dataflow-templates/VERSION/flex/Cloud_PubSub_to_GCS_Text_Flex \
    --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputDirectory=gs://BUCKET_NAME/output/,\
outputFilenamePrefix=output-,\
outputFilenameSuffix=.txt

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
BUCKET_NAME：Cloud Storage 存储桶的名称

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
       "inputSubscription": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME"
       "outputDirectory": "gs://BUCKET_NAME/output/",
       "outputFilenamePrefix": "output-",
       "outputFilenameSuffix": ".txt",
    },
    "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Cloud_PubSub_to_GCS_Text_Flex",
  }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
BUCKET_NAME：Cloud Storage 存储桶的名称

模板源代码

<------>

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates.pubsubtotext;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.io.WindowedFilenamePolicy;
import com.google.cloud.teleport.v2.options.WindowedFilenamePolicyOptions;
import com.google.cloud.teleport.v2.templates.pubsubtotext.PubsubToText.Options;
import com.google.cloud.teleport.v2.utils.DurationUtils;
import com.google.common.base.Strings;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;

/**
 * This pipeline ingests incoming data from a Cloud Pub/Sub topic and outputs the raw data into
 * windowed files at the specified output directory.
 *
 * <p>Example Usage:
 *
 * <pre>
 * # Set the pipeline vars
 * export PROJECT={project id}
 * export TEMPLATE_MODULE=googlecloud-to-googlecloud
 * export TEMPLATE_NAME=pubsub-to-text
 * export BUCKET_NAME=gs://{bucket name}
 * export TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${TEMPLATE_NAME}-image
 * export BASE_CONTAINER_IMAGE=gcr.io/dataflow-templates-base/java11-template-launcher-base
 * export BASE_CONTAINER_IMAGE_VERSION=latest
 * export APP_ROOT=/template/${TEMPLATE_NAME}
 * export COMMAND_SPEC=${APP_ROOT}/resources/${TEMPLATE_NAME}-command-spec.json
 * export TEMPLATE_IMAGE_SPEC=${BUCKET_NAME}/images/${TEMPLATE_NAME}-image-spec.json
 *
 * gcloud config set project ${PROJECT}
 *
 * # Build and push image to Google Container Repository
 * mvn package \
 *   -Dimage=${TARGET_GCR_IMAGE} \
 *   -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 *   -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 *   -Dapp-root=${APP_ROOT} \
 *   -Dcommand-spec=${COMMAND_SPEC} \
 *   -Djib.applicationCache=/tmp/jib-cache \
 *   -am -pl ${TEMPLATE_MODULE}
 *
 * # Create and upload image spec
 * echo '{
 *  "image":"'${TARGET_GCR_IMAGE}'",
 *  "metadata":{
 *    "name":"Pub/Sub to text",
 *    "description":"Write Pub/Sub messages to GCS text files.",
 *    "parameters":[
 *        {
 *            "name":"inputSubscription",
 *            "label":"Pub/Sub subscription to read from",
 *            "paramType":"TEXT",
 *            "isOptional":true
 *        },
 *        {
 *            "name":"inputTopic",
 *            "label":"Pub/Sub topic to read from",
 *            "paramType":"TEXT",
 *            "isOptional":true
 *        },
 *        {
 *            "name":"outputDirectory",
 *            "label":"Directory to output files to",
 *            "paramType":"TEXT",
 *            "isOptional":false
 *        },
 *        {
 *            "name":"outputFilenamePrefix",
 *            "label":"The filename prefix of the files to write to",
 *            "paramType":"TEXT",
 *            "isOptional":false
 *        },
 *        {
 *            "name":"outputFilenameSuffix",
 *            "label":"The suffix of the files to write to",
 *            "paramType":"TEXT",
 *            "isOptional":true
 *        },
 *        {
 *            "name":"userTempLocation",
 *            "label":"The directory to output temporary files to",
 *            "paramType":"TEXT",
 *            "isOptional":true
 *        }
 *    ]
 *  },
 *  "sdk_info":{"language":"JAVA"}
 * }' > image_spec.json
 * gsutil cp image_spec.json ${TEMPLATE_IMAGE_SPEC}
 * rm image_spec.json
 *
 * # Run template
 * export JOB_NAME="${TEMPLATE_MODULE}-`date +%Y%m%d-%H%M%S-%N`"
 * gcloud beta dataflow flex-template run ${JOB_NAME} \
 *       --project=${PROJECT} --region=us-central1 \
 *       --template-file-gcs-location=${TEMPLATE_IMAGE_SPEC} \
 *       --parameters inputTopic={topic},outputDirectory={directory},outputFilenamePrefix={prefix}
 * </pre>
 */
@Template(
    name = "Cloud_PubSub_to_GCS_Text_Flex",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub Subscription or Topic to Text Files on Cloud Storage",
    description =
        "Streaming pipeline. Reads records from Pub/Sub Subscription or Topic and writes them to"
            + " Cloud Storage, creating a text file for each five minute window. Note that this"
            + " pipeline assumes no newlines in the body of the Pub/Sub message and thus each"
            + " message becomes a single line in the output file.",
    optionsClass = Options.class,
    flexContainerName = "pubsub-to-text",
    contactInformation = "https://cloud.google.com/support")
public class PubsubToText {

  /**
   * Options supported by the pipeline.
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions, StreamingOptions, WindowedFilenamePolicyOptions {

    @TemplateParameter.PubsubTopic(
        order = 1,
        optional = true,
        description = "Pub/Sub input topic",
        helpText =
            "Pub/Sub topic to read the input from, in the format of "
                + "'projects/your-project-id/topics/your-topic-name'",
        example = "projects/your-project-id/topics/your-topic-name")
    String getInputTopic();

    void setInputTopic(String value);

    @TemplateParameter.PubsubSubscription(
        order = 2,
        optional = true,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'",
        example = "projects/your-project-id/subscriptions/your-subscription-name")
    String getInputSubscription();

    void setInputSubscription(String value);

    @TemplateParameter.GcsWriteFolder(
        order = 3,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime"
                + " formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    @Required
    String getOutputDirectory();

    void setOutputDirectory(String value);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        optional = true,
        description = "User provided temp location",
        helpText =
            "The user provided directory to output temporary files to. Must end with a slash.")
    String getUserTempLocation();

    void setUserTempLocation(String value);

    @TemplateParameter.Text(
        order = 5,
        optional = true,
        description = "Output filename prefix of the files to write",
        helpText = "The prefix to place on each windowed file.",
        example = "output-")
    @Default.String("output")
    @Required
    String getOutputFilenamePrefix();

    void setOutputFilenamePrefix(String value);

    @TemplateParameter.Text(
        order = 6,
        optional = true,
        description = "Output filename suffix of the files to write",
        helpText =
            "The suffix to place on each windowed file. Typically a file extension such "
                + "as .txt or .csv.",
        example = ".txt")
    @Default.String("")
    String getOutputFilenameSuffix();

    void setOutputFilenameSuffix(String value);
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    options.setStreaming(true);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    boolean useInputSubscription = !Strings.isNullOrEmpty(options.getInputSubscription());
    boolean useInputTopic = !Strings.isNullOrEmpty(options.getInputTopic());
    if (useInputSubscription == useInputTopic) {
      throw new IllegalArgumentException(
          "Either input topic or input subscription must be provided, but not both.");
    }

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    PCollection<String> messages = null;

    /*
     * Steps:
     *   1) Read string messages from PubSub
     *   2) Window the messages into minute intervals specified by the executor.
     *   3) Output the windowed files to GCS
     */
    if (useInputSubscription) {
      messages =
          pipeline.apply(
              "Read PubSub Events",
              PubsubIO.readStrings().fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "Read PubSub Events", PubsubIO.readStrings().fromTopic(options.getInputTopic()));
    }
    messages
        .apply(
            options.getWindowDuration() + " Window",
            Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))

        // Apply windowed file writes
        .apply(
            "Write File(s)",
            TextIO.write()
                .withWindowedWrites()
                .withNumShards(options.getNumShards())
                .to(
                    WindowedFilenamePolicy.writeWindowedFiles()
                        .withOutputDirectory(options.getOutputDirectory())
                        .withOutputFilenamePrefix(options.getOutputFilenamePrefix())
                        .withShardTemplate(options.getOutputShardTemplate())
                        .withSuffix(options.getOutputFilenameSuffix())
                        .withYearPattern(options.getYearPattern())
                        .withMonthPattern(options.getMonthPattern())
                        .withDayPattern(options.getDayPattern())
                        .withHourPattern(options.getHourPattern())
                        .withMinutePattern(options.getMinutePattern()))
                .withTempDirectory(
                    FileBasedSink.convertToFileResourceIfPossible(
                        maybeUseUserTempLocation(
                            options.getUserTempLocation(), options.getOutputDirectory()))));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * Utility method for using optional parameter userTempLocation as TempDirectory. This is useful
   * when output bucket is locked and temporary data cannot be deleted.
   *
   * @param userTempLocation user provided temp location
   * @param outputLocation user provided outputDirectory to be used as the default temp location
   * @return userTempLocation if available, otherwise outputLocation is returned.
   */
  private static String maybeUseUserTempLocation(String userTempLocation, String outputLocation) {
    return !Strings.isNullOrEmpty(userTempLocation) ? userTempLocation : outputLocation;
  }
}

Pub/Sub to MongoDB

Pub/Sub to MongoDB 模板是一种流处理流水线，可从 Pub/Sub 订阅读取 JSON 编码的消息并将其以文档的形式写入 MongoDB。如果需要，此流水线还能支持额外的转换，只需通过 JavaScript 用户定义的函数 (UDF) 包含这些转换即可。因架构不匹配、JOSN 格式不正确或执行转换时发生的所有错误都会记录在未处理消息的 BigQuery 表中，并随附输入消息。如果在执行之前不存在任何未处理记录的表，则流水线会自动创建该表。

对此流水线的要求：

Pub/Sub 订阅必须存在，并且消息必须采用有效的 JSON 格式进行编码。
MongoDB 集群必须存在，并且应该可通过 Dataflow 工作器机器访问。

模板参数

参数	说明
`inputSubscription`	Pub/Sub 订阅的名称。例如：`projects/my-project-id/subscriptions/my-subscription-id`
`mongoDBUri`	以英文逗号分隔的 MongoDB 服务器列表。例如：`192.285.234.12:27017,192.287.123.11:27017`
`database`	存储集合的 MongoDB 数据库。例如：`my-db`。
`collection`	MongoDB 数据库中集合的名称。例如：`my-collection`。
`deadletterTable`	存储失败消息（架构不匹配、JSON 格式错误）的 BigQuery 表。例如：`project-id:dataset-name.table-name`。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`batchSize`	（可选）用于将文档批量插入 MongoDB 的批次大小。默认值：`1000`。
`batchSizeBytes`	（可选）批次大小（以字节为单位）。默认值：`5242880`。
`maxConnectionIdleTime`	（可选）在出现连接超时之前所允许的空闲时间上限（以秒为单位）。默认值：`60000`。
`sslEnabled`	（可选）用于指示与 MongoDB 的连接是否启用了 SSL 的布尔值。默认值：`true`。
`ignoreSSLCertificate`	（可选）用于指示是否应忽略 SSL 证书的布尔值。默认值：`true`。
`withOrdered`	（可选）允许依次批量插入 MongoDB 的布尔值。默认值：`true`。
`withSSLInvalidHostNameAllowed`	（可选）用于指示是否允许 SSL 连接使用无效主机名的布尔值。默认值：`true`。

运行 Pub/Sub to MongoDB 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to MongoDB template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Cloud_PubSub_to_MongoDB \
    --parameters \
inputSubscription=INPUT_SUBSCRIPTION,\
mongoDBUri=MONGODB_URI,\
database=DATABASE,
collection=COLLECTION,
deadletterTable=UNPROCESSED_TABLE

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
INPUT_SUBSCRIPTION：Pub/Sub 订阅（例如 projects/my-project-id/subscriptions/my-subscription-id）
MONGODB_URI：MongoDB 服务器地址（例如 192.285.234.12:27017,192.287.123.11:27017）
DATABASE：MongoDB 数据库名称（例如 users）
COLLECTION：MongoDB 集合名称（例如 profiles）
UNPROCESSED_TABLE：BigQuery 表名称（例如 your-project:your-dataset.your-table-name）

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputSubscription": "INPUT_SUBSCRIPTION",
          "mongoDBUri": "MONGODB_URI",
          "database": "DATABASE",
          "collection": "COLLECTION",
          "deadletterTable": "UNPROCESSED_TABLE"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Cloud_PubSub_to_MongoDB",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
INPUT_SUBSCRIPTION：Pub/Sub 订阅（例如 projects/my-project-id/subscriptions/my-subscription-id）
MONGODB_URI：MongoDB 服务器地址（例如 192.285.234.12:27017,192.287.123.11:27017）
DATABASE：MongoDB 数据库名称（例如 users）
COLLECTION：MongoDB 集合名称（例如 profiles）
UNPROCESSED_TABLE：BigQuery 表名称（例如 your-project:your-dataset.your-table-name）

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.auto.value.AutoValue;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.PubSubToMongoDB.Options;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonSyntaxException;
import java.nio.charset.StandardCharsets;
import javax.annotation.Nullable;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Throwables;
import org.bson.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToMongoDB} pipeline is a streaming pipeline which ingests data in JSON format
 * from PubSub, applies a Javascript UDF if provided and inserts resulting records as Bson Document
 * in MongoDB. If the element fails to be processed then it is written to a deadletter table in
 * BigQuery.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The PubSub topic and subscriptions exist
 *   <li>The MongoDB is up and running
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_NAME=my-project
 * BUCKET_NAME=my-bucket
 * INPUT_SUBSCRIPTION=my-subscription
 * MONGODB_DATABASE_NAME=testdb
 * MONGODB_HOSTNAME=my-host:port
 * MONGODB_COLLECTION_NAME=testCollection
 * DEADLETTERTABLE=project:dataset.deadletter_table_name
 *
 * mvn compile exec:java \
 *  -Dexec.mainClass=com.google.cloud.teleport.v2.templates.PubSubToMongoDB \
 *  -Dexec.cleanupDaemonThreads=false \
 *  -Dexec.args=" \
 *  --project=${PROJECT_NAME} \
 *  --stagingLocation=gs://${BUCKET_NAME}/staging \
 *  --tempLocation=gs://${BUCKET_NAME}/temp \
 *  --runner=DataflowRunner \
 *  --inputSubscription=${INPUT_SUBSCRIPTION} \
 *  --mongoDBUri=${MONGODB_HOSTNAME} \
 *  --database=${MONGODB_DATABASE_NAME} \
 *  --collection=${MONGODB_COLLECTION_NAME} \
 *  --deadletterTable=${DEADLETTERTABLE}"
 * </pre>
 */
@Template(
    name = "Cloud_PubSub_to_MongoDB",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to MongoDB",
    description =
        "Streaming pipeline that reads JSON encoded messages from a Pub/Sub subscription,"
            + " transforms them using a JavaScript user-defined function (UDF), and writes them to"
            + " a MongoDB as documents.",
    optionsClass = Options.class,
    flexContainerName = "pubsub-to-mongodb",
    contactInformation = "https://cloud.google.com/support")
public class PubSubToMongoDB {
  /**
   * Options supported by {@link PubSubToMongoDB}
   *
   * <p>Inherits standard configuration options.
   */

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToMongoDB.class);

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   *
   * <p>Inherits standard configuration options, options from {@link
   * JavascriptTextTransformer.JavascriptTextTransformerOptions}.
   */
  public interface Options
      extends JavascriptTextTransformer.JavascriptTextTransformerOptions, PipelineOptions {
    @TemplateParameter.PubsubSubscription(
        order = 1,
        description = "Pub/Sub input subscription",
        helpText =
            "Pub/Sub subscription to read the input from, in the format of"
                + " 'projects/your-project-id/subscriptions/your-subscription-name'",
        example = "projects/your-project-id/subscriptions/your-subscription-name")
    @Validation.Required
    String getInputSubscription();

    void setInputSubscription(String inputSubscription);

    @TemplateParameter.Text(
        order = 2,
        description = "MongoDB Connection URI",
        helpText = "List of Mongo DB nodes separated by comma.",
        example = "host1:port,host2:port,host3:port")
    @Validation.Required
    String getMongoDBUri();

    void setMongoDBUri(String mongoDBUri);

    @TemplateParameter.Text(
        order = 3,
        description = "MongoDB Database",
        helpText = "Database in MongoDB to store the collection.",
        example = "my-db")
    @Validation.Required
    String getDatabase();

    void setDatabase(String database);

    @TemplateParameter.Text(
        order = 4,
        description = "MongoDB collection",
        helpText = "Name of the collection inside MongoDB database to put the documents to.",
        example = "my-collection")
    @Validation.Required
    String getCollection();

    void setCollection(String collection);

    @TemplateParameter.BigQueryTable(
        order = 5,
        description = "The dead-letter table name to output failed messages to BigQuery",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. If it doesn't exist, it will"
                + " be created during pipeline execution. If not specified,"
                + " \"outputTableSpec_error_records\" is used instead.",
        example = "your-project-id:your-dataset.your-table-name")
    @Validation.Required
    String getDeadletterTable();

    void setDeadletterTable(String deadletterTable);

    @TemplateParameter.Long(
        order = 6,
        optional = true,
        description = "Batch Size",
        helpText = "Batch Size used for batch insertion of documents into MongoDB.")
    @Default.Long(1000)
    Long getBatchSize();

    void setBatchSize(Long batchSize);

    @TemplateParameter.Long(
        order = 7,
        optional = true,
        description = "Batch Size in Bytes",
        helpText =
            "Batch Size in bytes used for batch insertion of documents into MongoDB. Default:"
                + " 5242880 (5mb)")
    @Default.Long(5242880)
    Long getBatchSizeBytes();

    void setBatchSizeBytes(Long batchSizeBytes);

    @TemplateParameter.Integer(
        order = 8,
        optional = true,
        description = "Max Connection idle time",
        helpText = "Maximum idle time allowed in seconds before connection timeout occurs.")
    @Default.Integer(60000)
    int getMaxConnectionIdleTime();

    void setMaxConnectionIdleTime(int maxConnectionIdleTime);

    @TemplateParameter.Boolean(
        order = 9,
        optional = true,
        description = "SSL Enabled",
        helpText = "Indicates whether connection to MongoDB is ssl enabled.")
    @Default.Boolean(true)
    Boolean getSslEnabled();

    void setSslEnabled(Boolean sslEnabled);

    @TemplateParameter.Boolean(
        order = 10,
        optional = true,
        description = "Ignore SSL Certificate",
        helpText = "Indicates whether SSL certificate should be ignored.")
    @Default.Boolean(true)
    Boolean getIgnoreSSLCertificate();

    void setIgnoreSSLCertificate(Boolean ignoreSSLCertificate);

    @TemplateParameter.Boolean(
        order = 11,
        optional = true,
        description = "withOrdered",
        helpText = "Enables ordered bulk insertions into MongoDB.")
    @Default.Boolean(true)
    Boolean getWithOrdered();

    void setWithOrdered(Boolean withOrdered);

    @TemplateParameter.Boolean(
        order = 12,
        optional = true,
        description = "withSSLInvalidHostNameAllowed",
        helpText = "Indicates whether invalid host name is allowed for ssl connection.")
    @Default.Boolean(true)
    Boolean getWithSSLInvalidHostNameAllowed();

    void setWithSSLInvalidHostNameAllowed(Boolean withSSLInvalidHostNameAllowed);
  }

  /** DoFn that will parse the given string elements as Bson Documents. */
  private static class ParseAsDocumentsFn extends DoFn<String, Document> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    // Parse the user options passed from the command-line.
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coders for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();

    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription.
     *        2) Apply Javascript UDF if provided.
     *        3) Write to MongoDB
     *
     */

    LOG.info("Reading from subscription: " + options.getInputSubscription());

    PCollectionTuple convertedPubsubMessages =
        pipeline
            /*
             * Step #1: Read from a PubSub subscription.
             */
            .apply(
                "Read PubSub Subscription",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription(options.getInputSubscription()))
            /*
             * Step #2: Apply Javascript Transform and transform, if provided and transform
             *          the PubsubMessages into Json documents.
             */
            .apply(
                "Apply Javascript UDF",
                PubSubMessageToJsonDocument.newBuilder()
                    .setJavascriptTextTransformFunctionName(
                        options.getJavascriptTextTransformFunctionName())
                    .setJavascriptTextTransformGcsPath(options.getJavascriptTextTransformGcsPath())
                    .build());

    /*
     * Step #3a: Write Json documents into MongoDB using {@link MongoDbIO.write}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_OUT)
        .apply(
            "Get Json Documents",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply("Parse as BSON Document", ParDo.of(new ParseAsDocumentsFn()))
        .apply(
            "Put to MongoDB",
            MongoDbIO.write()
                .withBatchSize(options.getBatchSize())
                .withUri(String.format("mongodb://%s", options.getMongoDBUri()))
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection())
                .withIgnoreSSLCertificate(options.getIgnoreSSLCertificate())
                .withMaxConnectionIdleTime(options.getMaxConnectionIdleTime())
                .withOrdered(options.getWithOrdered())
                .withSSLEnabled(options.getSslEnabled())
                .withSSLInvalidHostNameAllowed(options.getWithSSLInvalidHostNameAllowed()));

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link BigQueryIO}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_DEADLETTER_OUT)
        .apply(
            "Write Transform Failures To BigQuery",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  /**
   * The {@link PubSubMessageToJsonDocument} class is a {@link PTransform} which transforms incoming
   * {@link PubsubMessage} objects into JSON objects for insertion into MongoDB while applying an
   * optional UDF to the input. The executions of the UDF and transformation to Json objects is done
   * in a fail-safe way by wrapping the element with it's original payload inside the {@link
   * FailsafeElement} class. The {@link PubSubMessageToJsonDocument} transform will output a {@link
   * PCollectionTuple} which contains all output and dead-letter {@link PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link PubSubToMongoDB#TRANSFORM_OUT} - Contains all records successfully converted to
   *       JSON objects.
   *   <li>{@link PubSubToMongoDB#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  @AutoValue
  public abstract static class PubSubMessageToJsonDocument
      extends PTransform<PCollection<PubsubMessage>, PCollectionTuple> {

    public static Builder newBuilder() {
      return new AutoValue_PubSubToMongoDB_PubSubMessageToJsonDocument.Builder();
    }

    @Nullable
    public abstract String javascriptTextTransformGcsPath();

    @Nullable
    public abstract String javascriptTextTransformFunctionName();

    @Override
    public PCollectionTuple expand(PCollection<PubsubMessage> input) {

      // Map the incoming messages into FailsafeElements so we can recover from failures
      // across multiple transforms.
      PCollection<FailsafeElement<PubsubMessage, String>> failsafeElements =
          input.apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()));

      // If a Udf is supplied then use it to parse the PubSubMessages.
      if (javascriptTextTransformGcsPath() != null) {
        return failsafeElements.apply(
            "InvokeUDF",
            JavascriptTextTransformer.FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                .setFileSystemPath(javascriptTextTransformGcsPath())
                .setFunctionName(javascriptTextTransformFunctionName())
                .setSuccessTag(TRANSFORM_OUT)
                .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                .build());
      } else {
        return failsafeElements.apply(
            "ProcessPubSubMessages",
            ParDo.of(new ProcessFailsafePubSubFn())
                .withOutputTags(TRANSFORM_OUT, TupleTagList.of(TRANSFORM_DEADLETTER_OUT)));
      }
    }

    /** Builder for {@link PubSubMessageToJsonDocument}. */
    @AutoValue.Builder
    public abstract static class Builder {
      public abstract Builder setJavascriptTextTransformGcsPath(
          String javascriptTextTransformGcsPath);

      public abstract Builder setJavascriptTextTransformFunctionName(
          String javascriptTextTransformFunctionName);

      public abstract PubSubMessageToJsonDocument build();
    }
  }

  /**
   * The {@link ProcessFailsafePubSubFn} class processes a {@link FailsafeElement} containing a
   * {@link PubsubMessage} and a String of the message's payload {@link PubsubMessage#getPayload()}
   * into a {@link FailsafeElement} of the original {@link PubsubMessage} and a JSON string that has
   * been processed with {@link Gson}.
   *
   * <p>If {@link PubsubMessage#getAttributeMap()} is not empty then the message attributes will be
   * serialized along with the message payload.
   */
  static class ProcessFailsafePubSubFn
      extends DoFn<FailsafeElement<PubsubMessage, String>, FailsafeElement<PubsubMessage, String>> {

    private static final Counter successCounter =
        Metrics.counter(PubSubMessageToJsonDocument.class, "successful-json-conversion");

    private static Gson gson = new Gson();

    private static final Counter failedCounter =
        Metrics.counter(PubSubMessageToJsonDocument.class, "failed-json-conversion");

    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage pubsubMessage = context.element().getOriginalPayload();

      JsonObject messageObject = new JsonObject();

      try {
        if (pubsubMessage.getPayload().length > 0) {
          messageObject = gson.fromJson(new String(pubsubMessage.getPayload()), JsonObject.class);
        }

        // If message attributes are present they will be serialized along with the message payload
        if (pubsubMessage.getAttributeMap() != null) {
          pubsubMessage.getAttributeMap().forEach(messageObject::addProperty);
        }

        context.output(FailsafeElement.of(pubsubMessage, messageObject.toString()));
        successCounter.inc();

      } catch (JsonSyntaxException e) {
        context.output(
            TRANSFORM_DEADLETTER_OUT,
            FailsafeElement.of(context.element())
                .setErrorMessage(e.getMessage())
                .setStacktrace(Throwables.getStackTraceAsString(e)));
        failedCounter.inc();
      }
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMessage} with the
   * {@link FailsafeElement} class so errors can be recovered from and the original message can be
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }
}

Pub/Sub to Elasticsearch

Pub/Sub to Elasticsearch 模板是一种流处理流水线，可从 Pub/Sub 订阅读取消息、执行用户定义的函数 (UDF) 并将其作为文档写入 Elasticsearch。Dataflow 模板使用 Elasticsearch 的数据流功能跨多个索引存储时间序列数据，同时为请求提供单个命名资源。数据流非常适合存储在 Pub/Sub 中的日志、指标、跟踪记录和其他持续生成的数据。

对此流水线的要求

来源 Pub/Sub 订阅必须存在，并且消息必须采用有效的 JSON 格式进行编码。
GCP 实例上或 Elastic Cloud 上可公开访问的 Elasticsearch 主机，用于 Elasticsearch 7.0 或更高版本。如需了解详情，请参阅适用于 Elastic 的 Google Cloud 集成。
用于错误输出的 Pub/Sub 主题。

模板参数

参数	说明
`inputSubscription`	要使用的 Pub/Sub 订阅。该名称应采用 `projects/<project-id>/subscriptions/<subscription-name>` 格式。
`connectionUrl`	Elasticsearch 网址，格式为 `https://hostname:[port]` 或指定 CloudID（如果使用 Elastic Cloud）。
`apiKey`	用于身份验证的 Base64 编码 API 密钥。
`errorOutputTopic`	Pub/Sub 输出主题，用于发布失败的记录，格式为 `projects/<project-id>/topics/<topic-name>`
`dataset`	（可选）通过 Pub/Sub 发送的日志类型，我们为其提供了开箱即用的信息中心。已知日志类型值为 audit、vpcflow 和 firewall。默认值：`pubsub`。
`namespace`	（可选）任意分组，例如环境（dev、prod 或 qa）、团队或战略性业务部门。默认值：`default`。
`batchSize`	（可选）文档数量中的批次大小。默认值：`1000`。
`batchSizeBytes`	（可选）批次大小（以字节为单位）。默认值：`5242880` (5mb)。
`maxRetryAttempts`	（可选）尝试次数上限，必须大于 0。默认值：`no retries`。
`maxRetryDuration`	（可选）重试时长上限（以毫秒为单位），必须大于 0。默认值：`no retries`。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`propertyAsIndex`	（可选）要编入索引的文档中的一个属性，其值将指定批量请求要包含在文档中的 `_index` 元数据（优先于 `_index` UDF）。默认值：none。
`propertyAsId`	（可选）要编入索引的文档中的一个属性，其值将指定批量请求要包含在文档中的 `_id` 元数据（优先于 `_id` UDF）。默认值：none。
`javaScriptIndexFnGcsPath`	（可选）函数的 JavaScript UDF 源的 Cloud Storage 路径，该函数将指定批量请求要包含在文档中的 `_index` 元数据。默认值：none。
`javaScriptIndexFnName`	（可选）函数的 UDF JavaScript 函数名称，该函数将指定批量请求要包含在文档中的 `_index` 元数据。默认值：none。
`javaScriptIdFnGcsPath`	（可选）函数的 JavaScript UDF 源的 Cloud Storage 路径，该函数将指定批量请求要包含在文档中的 `_id` 元数据。默认值：none。
`javaScriptIdFnName`	（可选）函数的 UDF JavaScript 函数名称，该函数将指定批量请求要包含在文档中的 `_id` 元数据。默认值：none。
`javaScriptTypeFnGcsPath`	（可选）函数的 JavaScript UDF 源的 Cloud Storage 路径，该函数将指定批量请求要包含在文档中的 `_type` 元数据。默认值：none。
`javaScriptTypeFnName`	（可选）函数的 UDF JavaScript 函数名称，该函数将指定批量请求要包含在文档中的 `_type` 元数据。默认值：none。
`javaScriptIsDeleteFnGcsPath`	（可选）函数的 JavaScript UDF 源的 Cloud Storage 路径，该函数将确定是否应删除文档，而不是插入或更新文档。该函数应返回字符串值 `"true"` 或 `"false"`。默认值：none。
`javaScriptIsDeleteFnName`	（可选）函数的 UDF JavaScript 函数名称，该函数将确定是否应删除文档，而不是插入或更新文档。该函数应返回字符串值 `"true"` 或 `"false"`。默认值：none。
`usePartialUpdate`	（可选）是否在 Elasticsearch 请求中使用部分更新（更新而不是创建或索引，允许部分文档）。默认值：`false`。
`bulkInsertMethod`	（可选）在 Elasticsearch 批量请求中使用 `INDEX`（索引，允许执行更新插入操作）还是 `CREATE`（创建，会对重复 _id 报错）。默认值：`CREATE`。

运行 Pub/Sub to Elasticsearch 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Elasticsearch template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/PubSub_to_Elasticsearch \
    --parameters \
inputSubscription=SUBSCRIPTION_NAME,\
connectionUrl=CONNECTION_URL,\
dataset=DATASET,\
namespace=NAMESPACE,\
apiKey=APIKEY,\
errorOutputTopic=ERROR_OUTPUT_TOPIC

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
ERROR_OUTPUT_TOPIC：用于错误输出的 Pub/Sub 主题
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
CONNECTION_URL：您的 Elasticsearch 网址
DATASET：您的日志类型
NAMESPACE：数据集的命名空间
APIKEY：用于身份验证的 base64 编码 API 密钥

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputSubscription": "SUBSCRIPTION_NAME",
          "connectionUrl": "CONNECTION_URL",
          "dataset": "DATASET",
          "namespace": "NAMESPACE",
          "apiKey": "APIKEY",
          "errorOutputTopic": "ERROR_OUTPUT_TOPIC"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/PubSub_to_Elasticsearch",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
ERROR_OUTPUT_TOPIC：用于错误输出的 Pub/Sub 主题
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
CONNECTION_URL：您的 Elasticsearch 网址
DATASET：您的日志类型
NAMESPACE：数据集的命名空间
APIKEY：用于身份验证的 base64 编码 API 密钥

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.PubSubToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.FailedPubsubMessageToPubsubTopicFn;
import com.google.cloud.teleport.v2.elasticsearch.transforms.ProcessEventMetadata;
import com.google.cloud.teleport.v2.elasticsearch.transforms.PubSubMessageToJsonDocument;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIndex;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToElasticsearch} pipeline is a streaming pipeline which ingests data in JSON
 * format from PubSub, applies a Javascript UDF if provided and writes the resulting records to
 * Elasticsearch. If the element fails to be processed then it is written to an error output table
 * in BigQuery.
 *
 * <p>Please refer to <b><a href=
 * "https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/googlecloud-to-elasticsearch/docs/PubSubToElasticsearch/README.md">
 * README.md</a></b> for further information.
 */
@Template(
    name = "PubSub_to_Elasticsearch",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to Elasticsearch",
    description =
        "A pipeline to read messages from Pub/Sub and writes into an Elasticsearch instance as json"
            + " documents with optional intermediate transformations using Javascript Udf.",
    optionsClass = PubSubToElasticsearchOptions.class,
    skipOptions = "index", // Template just ignores what is sent as "index"
    flexContainerName = "pubsub-to-elasticsearch",
    contactInformation = "https://cloud.google.com/support")
public class PubSubToElasticsearch {

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the error output table of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_ERROROUTPUT_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToElasticsearch.class);

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    // Parse the user options passed from the command-line.
    PubSubToElasticsearchOptions pubSubToElasticsearchOptions =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(PubSubToElasticsearchOptions.class);

    pubSubToElasticsearchOptions.setIndex(
        new ElasticsearchIndex(
                pubSubToElasticsearchOptions.getDataset(),
                pubSubToElasticsearchOptions.getNamespace())
            .getIndex());

    run(pubSubToElasticsearchOptions);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(PubSubToElasticsearchOptions options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coders for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();

    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription.
     *        2) Apply Javascript UDF if provided.
     *        3) Index Json string to output ES index.
     *
     */
    LOG.info("Reading from subscription: " + options.getInputSubscription());

    PCollectionTuple convertedPubsubMessages =
        pipeline
            /*
             * Step #1: Read from a PubSub subscription.
             */
            .apply(
                "ReadPubSubSubscription",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription(options.getInputSubscription()))
            /*
             * Step #2: Transform the PubsubMessages into Json documents.
             */
            .apply(
                "ConvertMessageToJsonDocument",
                PubSubMessageToJsonDocument.newBuilder()
                    .setJavascriptTextTransformFunctionName(
                        options.getJavascriptTextTransformFunctionName())
                    .setJavascriptTextTransformGcsPath(options.getJavascriptTextTransformGcsPath())
                    .build());

    /*
     * Step #3a: Write Json documents into Elasticsearch using {@link ElasticsearchTransforms.WriteToElasticsearch}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply("Insert metadata", new ProcessEventMetadata())
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setOptions(options.as(PubSubToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to error output PubSub topic via {@link PubSubIO}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_ERROROUTPUT_OUT)
        .apply(ParDo.of(new FailedPubsubMessageToPubsubTopicFn()))
        .apply("writeFailureMessages", PubsubIO.writeMessages().to(options.getErrorOutputTopic()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}

Datastream to Cloud Spanner

Datastream to Cloud Spanner 模板是一种流处理流水线，可从 Cloud Storage 存储桶中读取 Datastream 事件并将它们写入 Cloud Spanner 数据库。它预期用于将数据从 Datastream 迁移到 Cloud Spanner。

在执行模板之前，执行迁移所需的所有表必须存在于目标 Cloud Spanner 数据库中。因此，在数据迁移之前，必须完成从源数据库到目标 Cloud Spanner 的架构迁移。在迁移之前，数据可能存在表中。此模板不会将 DataStream 架构更改传播到 Cloud Spanner 数据库。

只有在所有数据都写入 Cloud Spanner 后，才能在迁移结束时保证数据一致性。为了存储对写入 Cloud Spanner 的每条记录的排序信息，此模板为 Cloud Spanner 数据库中的每个表创建了一个额外的表（称为影子表）。这用于确保迁移结束时的一致性。影子表在迁移后不会被删除，可在迁移结束时用于进行验证。

操作期间发生的任何错误（例如架构不匹配、JSON 文件格式错误或执行转换产生的错误）都会记录在错误队列中。错误队列是一个 Cloud Storage 文件夹，它以文本格式存储遇到错误的所有 Datastream 事件以及错误原因。这些错误可能是暂时性的，也可能是永久性的，它们存储在错误队列的相应 Cloud Storage 文件夹中。系统会自动重试暂时性错误，但不会自动重试永久性错误。如果发生永久性错误，您可以选择在模板运行期间更正更改事件，并将它们转移到可重试的存储桶。

对此流水线的要求：

处于正在运行或未启动状态的 Datastream 数据流。
要在其中复制 Datastream 事件的 Cloud Storage 存储桶。
具有现有表的 Cloud Spanner 数据库。这些表可以为空，也可以包含数据。

模板参数

参数	说明
`inputFilePattern`	Cloud Storage 中要复制的 Datastream 文件的位置。通常，这是数据流的根路径。
`streamName`	用于轮询架构信息和来源类型的数据流的名称或模板。
`instanceId`	在其中复制更改的 Cloud Spanner 实例。
`databaseId`	在其中复制更改的 Cloud Spanner 数据库。
`projectId`	Cloud Spanner 项目 ID。
`deadLetterQueueDirectory`	（可选）用于存储错误队列输出的文件路径。默认值为 Dataflow 作业的临时位置下的目录。
`inputFileFormat`	（可选）Datastream 生成的输出文件的格式。例如：`avro,json`。默认值：`avro`。
`shadowTablePrefix`	（可选）用于为影子表命名的前缀。默认值：`shadow_`。

运行 Datastream to Cloud Spanner 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Cloud Datastream to Spanner template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_Spanner \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
streamName=STREAM_NAME,\
instanceId=CLOUDSPANNER_INSTANCE,\
databaseId=CLOUDSPANNER_DATABASE,\
deadLetterQueueDirectory=DLQ

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
GCS_FILE_PATH：用于存储 Datastream 事件的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
CLOUDSPANNER_INSTANCE：您的 Cloud Spanner 实例。
CLOUDSPANNER_DATABASE：您的 Cloud Spanner 数据库。
DLQ：错误队列目录的 Cloud Storage 路径。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_Spanner",
      "parameters": {
          "inputFilePattern": "GCS_FILE_PATH",
          "streamName": "STREAM_NAME"
          "instanceId": "CLOUDSPANNER_INSTANCE"
          "databaseId": "CLOUDSPANNER_DATABASE"
          "deadLetterQueueDirectory": "DLQ"
      }
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
GCS_FILE_PATH：用于存储 Datastream 事件的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
CLOUDSPANNER_INSTANCE：您的 Cloud Spanner 实例。
CLOUDSPANNER_DATABASE：您的 Cloud Spanner 数据库。
DLQ：错误队列目录的 Cloud Storage 路径。

Text Files on Cloud Storage to BigQuery (Stream)

Text Files on Cloud Storage to BigQuery 流水线是一种流处理流水线，可用于流式传输 Cloud Storage 中存储的文本文件，使用您提供的 JavaScript 用户定义函数 (UDF) 转换这些文件，然后将结果附加到 BigQuery。

流水线无限期运行，需要通过取消而非排空手动终止，原因是其使用 Watch 转换，该转换是不支持排空的可拆分 DoFn。

对此流水线的要求：

创建一个用于描述 BigQuery 中的输出表架构的 JSON 文件。

确保有一个名为 fields 的顶级 JSON 数组，且该数组的内容遵循 {"name": "COLUMN_NAME", "type": "DATA_TYPE"} 格式。例如：

{
  "fields": [
    {
      "name": "location",
      "type": "STRING"
    },
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "STRING"
    },
    {
      "name": "color",
      "type": "STRING",
      "mode": "REQUIRED"
    },
    {
      "name": "coffee",
      "type": "STRING",
      "mode": "REQUIRED"
    }
  ]
}

使用 UDF 函数（该函数提供转换文本行的逻辑）创建一个 JavaScript (.js) 文件。请注意，此函数必须返回 JSON 字符串。

例如，以下函数将拆分 CSV 文件的每行文本，并通过转换值返回 JSON 字符串。

function transform(line) {
var values = line.split(',');

var obj = new Object();
obj.location = values[0];
obj.name = values[1];
obj.age = values[2];
obj.color = values[3];
obj.coffee = values[4];
var jsonString = JSON.stringify(obj);

return jsonString;
}

模板参数

参数	说明
`javascriptTextTransformGcsPath`	`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`JSONPath`	您的 BigQuery 架构文件的 Cloud Storage 位置，以 JSON 格式描述。例如：`gs://path/to/my/schema.json`。
`javascriptTextTransformFunctionName`	您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`outputTable`	完全限定的 BigQuery 表，例如 `my-project:dataset.table`
`inputFilePattern`	您要处理的文本的 Cloud Storage 位置，例如：`gs://my-bucket/my-files/text.txt`。
`bigQueryLoadingTemporaryDirectory`	BigQuery 加载进程的临时目录，例如：`gs://my-bucket/my-files/temp_dir`
`outputDeadletterTable`	无法到达输出表的消息表。例如 `my-project:dataset.my-unprocessed-table`。如果该表不存在，则系统会在流水线执行期间创建它。如果未指定此参数，则系统会改用 `<outputTableSpec>_error_records`。

运行 Cloud Storage Text to BigQuery (Stream) 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Text Files on Cloud Storage to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Stream_GCS_Text_to_BigQuery \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
inputFilePattern=PATH_TO_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
outputDeadletterTable=BIGQUERY_UNPROCESSED_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
PATH_TO_BIGQUERY_SCHEMA_JSON：包含架构定义的 JSON 文件的 Cloud Storage 路径
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA：文本数据集的 Cloud Storage 路径
BIGQUERY_TABLE：您的 BigQuery 表名称
BIGQUERY_UNPROCESSED_TABLE：未处理消息的 BigQuery 表名称
PATH_TO_TEMP_DIR_ON_GCS：临时目录的 Cloud Storage 路径

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Stream_GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "TEMP_LOCATION",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "outputDeadletterTable":"BIGQUERY_UNPROCESSED_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
PATH_TO_BIGQUERY_SCHEMA_JSON：包含架构定义的 JSON 文件的 Cloud Storage 路径
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA：文本数据集的 Cloud Storage 路径
BIGQUERY_TABLE：您的 BigQuery 表名称
BIGQUERY_UNPROCESSED_TABLE：未处理消息的 BigQuery 表名称
PATH_TO_TEMP_DIR_ON_GCS：临时目录的 Cloud Storage 路径

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.client.json.JsonFactory;
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.coders.FailsafeElementCoder;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.TextToBigQueryStreaming.TextToBigQueryStreamingOptions;
import com.google.cloud.teleport.templates.common.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.templates.common.ErrorConverters.WriteStringMessageErrors;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.util.ResourceUtils;
import com.google.cloud.teleport.util.ValueProviderUtils;
import com.google.cloud.teleport.values.FailsafeElement;
import com.google.common.base.Charsets;
import com.google.common.collect.ImmutableList;
import com.google.common.io.ByteStreams;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.extensions.gcp.util.Transport;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Watch.Growth;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link TextToBigQueryStreaming} is a streaming version of {@link TextIOToBigQuery} pipeline
 * that reads text files, applies a JavaScript UDF and writes the output to BigQuery. The pipeline
 * continuously polls for new files, reads them row-by-row and processes each record into BigQuery.
 * The polling interval is set at 10 seconds.
 *
 * <p>Example Usage:
 *
 * <pre>
 * {@code mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \
 * -Dexec.args="\
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${STAGING_BUCKET}/staging \
 * --tempLocation=gs://${STAGING_BUCKET}/tmp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://path/to/input* \
 * --JSONPath=gs://path/to/json/schema.json \
 * --outputTable={$PROJECT_ID}:${OUTPUT_DATASET}.${OUTPUT_TABLE} \
 * --javascriptTextTransformGcsPath=gs://path/to/transform/udf.js \
 * --javascriptTextTransformFunctionName=${TRANSFORM_NAME} \
 * --bigQueryLoadingTemporaryDirectory=gs://${STAGING_BUCKET}/tmp \
 * --outputDeadletterTable=${PROJECT_ID}:${ERROR_DATASET}.${ERROR_TABLE}"
 * }
 * </pre>
 */
@Template(
    name = "Stream_GCS_Text_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Cloud Storage Text to BigQuery (Stream)",
    description =
        "A streaming pipeline that can read text files stored in Cloud Storage, perform a transform via a user defined JavaScript function, and stream the results into BigQuery. This pipeline requires a JavaScript function and a JSON representation of the BigQuery TableSchema.",
    optionsClass = TextToBigQueryStreamingOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class TextToBigQueryStreaming {

  private static final Logger LOG = LoggerFactory.getLogger(TextToBigQueryStreaming.class);

  /** The tag for the main output for the UDF. */
  private static final TupleTag<FailsafeElement<String, String>> UDF_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the dead-letter output of the udf. */
  private static final TupleTag<FailsafeElement<String, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output of the json transformation. */
  private static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  private static final TupleTag<FailsafeElement<String, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** Default interval for polling files in GCS. */
  private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10);

  /** Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  private static final JsonFactory JSON_FACTORY = Transport.getJsonFactory();

  /**
   * Main entry point for executing the pipeline. This will run the pipeline asynchronously. If
   * blocking execution is required, use the {@link
   * TextToBigQueryStreaming#run(TextToBigQueryStreamingOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    TextToBigQueryStreamingOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(TextToBigQueryStreamingOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(TextToBigQueryStreamingOptions options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    FailsafeElementCoder<String, String> coder =
        FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);

    /*
     * Steps:
     *  1) Read from the text source continuously.
     *  2) Convert to FailsafeElement.
     *  3) Apply Javascript udf transformation.
     *    - Tag records that were successfully transformed and those
     *      that failed transformation.
     *  4) Convert records to TableRow.
     *    - Tag records that were successfully converted and those
     *      that failed conversion.
     *  5) Insert successfully converted records into BigQuery.
     *    - Errors encountered while streaming will be sent to deadletter table.
     *  6) Insert records that failed into deadletter table.
     */

    PCollectionTuple transformedOutput =
        pipeline

            // 1) Read from the text source continuously.
            .apply(
                "ReadFromSource",
                TextIO.read()
                    .from(options.getInputFilePattern())
                    .watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))

            // 2) Convert to FailsafeElement.
            .apply(
                "ConvertToFailsafeElement",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via(input -> FailsafeElement.of(input, input)))

            // 3) Apply Javascript udf transformation.
            .apply(
                "ApplyUDFTransformation",
                FailsafeJavascriptUdf.<String>newBuilder()
                    .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                    .setFunctionName(options.getJavascriptTextTransformFunctionName())
                    .setSuccessTag(UDF_OUT)
                    .setFailureTag(UDF_DEADLETTER_OUT)
                    .build());

    PCollectionTuple convertedTableRows =
        transformedOutput

            // 4) Convert records to TableRow.
            .get(UDF_OUT)
            .apply(
                "ConvertJSONToTableRow",
                FailsafeJsonToTableRow.<String>newBuilder()
                    .setSuccessTag(TRANSFORM_OUT)
                    .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                    .build());

    WriteResult writeResult =
        convertedTableRows

            // 5) Insert successfully converted records into BigQuery.
            .get(TRANSFORM_OUT)
            .apply(
                "InsertIntoBigQuery",
                BigQueryIO.writeTableRows()
                    .withJsonSchema(getSchemaFromGCS(options.getJSONPath()))
                    .to(options.getOutputTable())
                    .withExtendedErrorInfo()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withMethod(Method.STREAMING_INSERTS)
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory()));

    // Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
    PCollection<FailsafeElement<String, String>> failedInserts =
        writeResult
            .getFailedInsertsWithErr()
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via(TextToBigQueryStreaming::wrapBigQueryInsertError));

    // 6) Insert records that failed transformation or conversion into deadletter table
    PCollectionList.of(
            ImmutableList.of(
                transformedOutput.get(UDF_DEADLETTER_OUT),
                convertedTableRows.get(TRANSFORM_DEADLETTER_OUT),
                failedInserts))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            WriteStringMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ValueProviderUtils.maybeUseDefaultDeadletterTable(
                        options.getOutputDeadletterTable(),
                        options.getOutputTable(),
                        DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
                .build());

    return pipeline.run();
  }

  /**
   * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}.
   *
   * @param insertError BigQueryInsert error.
   * @return FailsafeElement object.
   * @throws IOException
   */
  static FailsafeElement<String, String> wrapBigQueryInsertError(BigQueryInsertError insertError) {

    FailsafeElement<String, String> failsafeElement;
    try {

      String rowPayload = JSON_FACTORY.toString(insertError.getRow());
      String errorMessage = JSON_FACTORY.toString(insertError.getError());

      failsafeElement = FailsafeElement.of(rowPayload, rowPayload);
      failsafeElement.setErrorMessage(errorMessage);

    } catch (IOException e) {
      throw new RuntimeException(e);
    }

    return failsafeElement;
  }

  /**
   * Method to read a BigQuery schema file from GCS and return the file contents as a string.
   *
   * @param gcsPath Path string for the schema file in GCS.
   * @return File contents as a string.
   */
  private static ValueProvider<String> getSchemaFromGCS(ValueProvider<String> gcsPath) {
    return NestedValueProvider.of(
        gcsPath,
        new SimpleFunction<String, String>() {
          @Override
          public String apply(String input) {
            ResourceId sourceResourceId = FileSystems.matchNewResource(input, false);

            String schema;
            try (ReadableByteChannel rbc = FileSystems.open(sourceResourceId)) {
              try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
                try (WritableByteChannel wbc = Channels.newChannel(baos)) {
                  ByteStreams.copy(rbc, wbc);
                  schema = baos.toString(Charsets.UTF_8.name());
                  LOG.info("Extracted schema: " + schema);
                }
              }
            } catch (IOException e) {
              LOG.error("Error extracting schema: " + e.getMessage());
              throw new RuntimeException(e);
            }
            return schema;
          }
        });
  }

  /**
   * The {@link TextToBigQueryStreamingOptions} class provides the custom execution options passed
   * by the executor at the command-line.
   */
  public interface TextToBigQueryStreamingOptions extends TextIOToBigQuery.Options {
    @TemplateParameter.BigQueryTable(
        order = 1,
        optional = true,
        description = "The dead-letter table name to output failed messages to BigQuery",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched "
                + "schema, malformed json) are written to this table. If it doesn't exist, it will be "
                + "created during pipeline execution. If not specified, \"outputTableSpec_error_records\" "
                + "is used instead.",
        example = "your-project-id:your-dataset.your-table-name")
    ValueProvider<String> getOutputDeadletterTable();

    void setOutputDeadletterTable(ValueProvider<String> value);
  }
}

Text Files on Cloud Storage to Pub/Sub (Stream)

此模板可以创建一种流处理流水线，该流水线可持续轮询新上传到 Cloud Storage 的文本文件，并逐行读取每个文件，然后将字符串发布到 Pub/Sub 主题。此外，此模板还能以包含 JSON 记录的换行符分隔文件或 CSV 文件形式，将记录发布到 Pub/Sub 主题进行实时处理。您可以使用此模板将数据重放到 Pub/Sub。

流水线无限期运行，需要通过“取消”而非“排空”手动终止，原因是其使用“Watch”转换，该转换是一个不支持排空的“SplittableDoFn”。

目前，轮询间隔固定为 10 秒。此模板不会对个别记录设置任何时间戳，因此在执行期间，事件时间与发布时间相同。如果您的流水线依赖准确的事件时间来执行处理，建议不要使用此流水线。

对此流水线的要求：

输入文件必须采用换行符分隔的 JSON 或 CSV 格式。跨越源文件中多个行的记录会导致下游问题，因为文件中的每一行都以消息形式发布到 Pub/Sub。
Pub/Sub 主题必须已存在才能执行此流水线。
流水线无限期运行，需要手动终止。

模板参数

参数	说明
`inputFilePattern`	需要读取的输入文件格式。例如 `gs://bucket-name/files/.json` 或 `gs://bucket-name/path/.csv`。
`outputTopic`	要向其写入数据的 Pub/Sub 输入主题。该名称应采用 `projects/<project-id>/topics/<topic-name>` 格式。

运行 Text Files on Cloud Storage to Pub/Sub (Stream) 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Text Files on Cloud Storage to Pub/Sub (Stream) template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Stream_GCS_Text_to_Cloud_PubSub \
    --region REGION_NAME\
    --staging-location STAGING_LOCATION\
    --parameters \
inputFilePattern=gs://BUCKET_NAME/FILE_PATTERN,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
BUCKET_NAME：Cloud Storage 存储桶的名称
FILE_PATTERN：要从 Cloud Storage 存储桶中读取的文件格式 glob（例如 path/*.csv）

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Stream_GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "gs://your-bucket/temp",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
    },
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/FILE_PATTERN",
       "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
TOPIC_NAME：您的 Pub/Sub 主题名称
BUCKET_NAME：Cloud Storage 存储桶的名称
FILE_PATTERN：要从 Cloud Storage 存储桶中读取的文件格式 glob（例如 path/*.csv）

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToPubsub.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

/**
 * The {@code TextToPubsubStream} is a streaming version of {@code TextToPubsub} pipeline that
 * publishes records to Cloud Pub/Sub from a set of files. The pipeline continuously polls for new
 * files, reads them row-by-row and publishes each record as a string message. The polling interval
 * is fixed and equals to 10 seconds. At the moment, publishing messages with attributes is
 * unsupported.
 *
 * <p>Example Usage:
 *
 * <pre>
 * {@code mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToPubsubStream \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
 * --tempLocation=gs://${STAGING_BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://path/to/*.csv \
 * --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}"
 * }
 * </pre>
 */
@Template(
    name = "Stream_GCS_Text_to_Cloud_PubSub",
    category = TemplateCategory.STREAMING,
    displayName = "Text Files on Cloud Storage to Pub/Sub",
    description =
        "A pipeline that polls every 10 seconds for new text files stored in Cloud Storage and outputs each line to a Pub/Sub topic.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class TextToPubsubStream extends TextToPubsub {
  private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(10);

  /**
   * Main entry-point for the pipeline. Reads in the command-line arguments, parses them, and
   * executes the pipeline.
   *
   * @param args Arguments passed in from the command-line.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Executes the pipeline with the provided execution parameters.
   *
   * @param options The execution parameters.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub
     */
    pipeline
        .apply(
            "Read Text Data",
            TextIO.read()
                .from(options.getInputFilePattern())
                .watchForNewFiles(DEFAULT_POLL_INTERVAL, Watch.Growth.never()))
        .apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP)

Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) 模板是一种流处理流水线，会从 Cloud Storage 存储桶读取 csv 文件，调用 Cloud Data Loss Prevention (Cloud DLP) API 进行去标识化处理，并将经过去标识化处理的数据写入指定的 BigQuery 表中。此模板支持使用 Cloud DLP 检查模板和 Cloud DLP 去标识化模板。这样用户就能够检查潜在的敏感信息并进行去标识化处理，以及对结构化数据进行去标识化处理，其中列指定进行去标识化处理并且无需检查。值得注意的是，此模板不支持去标识化模板位置的区域路径。仅支持全局路径。

对此流水线的要求：

要令牌化的输入数据必须存在。
Cloud DLP 模板（例如，DeidentifyTemplate 和 InspectTemplate）必须存在。如需了解更多详细信息，请参阅 Cloud DLP 模板。
BigQuery 数据集必须存在。

模板参数

参数	说明
`inputFilePattern`	要从中读取输入数据记录的 csv 文件。也可以使用通配符，例如 `gs://mybucket/my_csv_filename.csv` 或 `gs://mybucket/file-*.csv`。
`dlpProjectId`	拥有 Cloud DLP API 资源的 Cloud DLP 项目 ID。此 Cloud DLP 项目可以是拥有 Cloud DLP 模板的项目，也可以是其他项目；例如 `my_dlp_api_project`。
`deidentifyTemplateName`	要用于 API 请求的 Cloud DLP 去标识化模板；以 `projects/{template_project_id}/deidentifyTemplates/{deIdTemplateId}` 模式指定，例如 `projects/my_project/deidentifyTemplates/100`。
`datasetName`	用于发送标记化结果的 BigQuery 数据集。
`batchSize`	用于发送数据以进行检查和/或去标记化处理的区块/批次大小。对于 csv 文件，`batchSize` 是一个批次包含的行数。用户必须根据记录大小和文件大小确定批次大小。请注意，Cloud DLP API 的载荷大小上限为每个 API 调用 524 KB。
`inspectTemplateName`	（可选）要用于 API 请求的 Cloud DLP 检查模板；以 `projects/{template_project_id}/identifyTemplates/{idTemplateId}` 模式指定，例如 `projects/my_project/identifyTemplates/100`。

运行 Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Stream_DLP_GCS_Text_to_BigQuery \
    --region REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputFilePattern=INPUT_DATA,\
datasetName=DATASET_NAME,\
batchSize=BATCH_SIZE_VALUE,\
dlpProjectId=DLP_API_PROJECT_ID,\
deidentifyTemplateName=projects/TEMPLATE_PROJECT_ID/deidentifyTemplates/DEIDENTIFY_TEMPLATE,\
inspectTemplateName=projects/TEMPLATE_PROJECT_ID/identifyTemplates/INSPECT_TEMPLATE_NUMBER

替换以下内容：

DLP_API_PROJECT_ID：您的 Cloud DLP API 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
INPUT_DATA：输入文件路径
DEIDENTIFY_TEMPLATE：Cloud DLPDeidentify 模板编号
DATASET_NAME：BigQuery 数据集名称
INSPECT_TEMPLATE_NUMBER：Cloud DLPInspect 模板编号
BATCH_SIZE_VALUE：批次大小（对于 csv 文件，批次大小是每个 API 的行数）

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Stream_DLP_GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "environment": {
       "bypassTempDirValidation": false,
       "tempLocation": "TEMP_LOCATION",
       "ipConfiguration": "WORKER_IP_UNSPECIFIED",
       "additionalExperiments": []
   },
   "parameters": {
      "inputFilePattern":INPUT_DATA,
      "datasetName": "DATASET_NAME",
      "batchSize": "BATCH_SIZE_VALUE",
      "dlpProjectId": "DLP_API_PROJECT_ID",
      "deidentifyTemplateName": "projects/TEMPLATE_PROJECT_ID/deidentifyTemplates/DEIDENTIFY_TEMPLATE",
      "inspectTemplateName": "projects/TEMPLATE_PROJECT_ID/identifyTemplates/INSPECT_TEMPLATE_NUMBER"
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
DLP_API_PROJECT_ID：您的 Cloud DLP API 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
STAGING_LOCATION：暂存本地文件的位置（例如 gs://your-bucket/staging）
TEMP_LOCATION：写入临时文件的位置（例如 gs://your-bucket/temp）
INPUT_DATA：输入文件路径
DEIDENTIFY_TEMPLATE：Cloud DLPDeidentify 模板编号
DATASET_NAME：BigQuery 数据集名称
INSPECT_TEMPLATE_NUMBER：Cloud DLPInspect 模板编号
BATCH_SIZE_VALUE：批次大小（对于 csv 文件，批次大小是每个 API 的行数）

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableCell;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming.TokenizePipelineOptions;
import com.google.common.base.Charsets;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentRequest.Builder;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.ProjectName;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Value;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;
import org.apache.beam.sdk.io.ReadableFileCoder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link DLPTextToBigQueryStreaming} is a streaming pipeline that reads CSV files from a
 * storage location (e.g. Google Cloud Storage), uses Cloud DLP API to inspect, classify, and mask
 * sensitive information (e.g. PII Data like passport or SIN number) and at the end stores
 * obfuscated data in BigQuery (Dynamic Table Creation) to be used for various purposes. e.g. data
 * analytics, ML model. Cloud DLP inspection and masking can be configured by the user and can make
 * use of over 90 built in detectors and masking techniques like tokenization, secure hashing, date
 * shifting, partial masking, and more.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>DLP Templates exist (e.g. deidentifyTemplate, InspectTemplate)
 *   <li>The BigQuery Dataset exists
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * BUCKET_NAME=BUCKET NAME HERE
 * PIPELINE_FOLDER=gs://${BUCKET_NAME}/dataflow/pipelines/dlp-text-to-bigquery
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.DLPTextToBigQueryStreaming \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=dlp-text-to-bigquery-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "inputFilePattern=gs://{bucketName}/{fileName}.csv, batchSize=15,datasetName={BQDatasetId},
 *  dlpProjectId={projectId},
 *  deidentifyTemplateName=projects/{projectId}/deidentifyTemplates/{deIdTemplateId}
 * </pre>
 */
@Template(
    name = "Stream_DLP_GCS_Text_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP)",
    description =
        "An example pipeline that reads CSV files from Cloud Storage, uses Cloud DLP API to mask and tokenize data based on the DLP templates provided and stores output in BigQuery. Note, not all configuration settings are available in this default template. You may need to deploy a custom template to accommodate your specific environment and data needs. More details here: https://cloud.google.com/solutions/de-identification-re-identification-pii-using-cloud-dlp",
    optionsClass = TokenizePipelineOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class DLPTextToBigQueryStreaming {

  public static final Logger LOG = LoggerFactory.getLogger(DLPTextToBigQueryStreaming.class);
  /** Default interval for polling files in GCS. */
  private static final Duration DEFAULT_POLL_INTERVAL = Duration.standardSeconds(30);
  /** Expected only CSV file in GCS bucket. */
  private static final String ALLOWED_FILE_EXTENSION = String.valueOf("csv");
  /** Regular expression that matches valid BQ table IDs. */
  private static final Pattern TABLE_REGEXP = Pattern.compile("[-\\w$@]{1,1024}");
  /** Default batch size if value not provided in execution. */
  private static final Integer DEFAULT_BATCH_SIZE = 100;
  /** Regular expression that matches valid BQ column name . */
  private static final Pattern COLUMN_NAME_REGEXP = Pattern.compile("^[A-Za-z_]+[A-Za-z_0-9]*$");
  /** Default window interval to create side inputs for header records. */
  private static final Duration WINDOW_INTERVAL = Duration.standardSeconds(30);

  /**
   * Main entry point for executing the pipeline. This will run the pipeline asynchronously. If
   * blocking execution is required, use the {@link
   * DLPTextToBigQueryStreaming#run(TokenizePipelineOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    TokenizePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TokenizePipelineOptions.class);
    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(TokenizePipelineOptions options) {
    // Create the pipeline
    Pipeline p = Pipeline.create(options);
    /*
     * Steps:
     *   1) Read from the text source continuously based on default interval e.g. 30 seconds
     *       - Setup a window for 30 secs to capture the list of files emited.
     *       - Group by file name as key and ReadableFile as a value.
     *   2) Output each readable file for content processing.
     *   3) Split file contents based on batch size for parallel processing.
     *   4) Process each split as a DLP table content request to invoke API.
     *   5) Convert DLP Table Rows to BQ Table Row.
     *   6) Create dynamic table and insert successfully converted records into BQ.
     */

    PCollection<KV<String, Iterable<ReadableFile>>> csvFiles =
        p
            /*
             * 1) Read from the text source continuously based on default interval e.g. 300 seconds
             *     - Setup a window for 30 secs to capture the list of files emited.
             *     - Group by file name as key and ReadableFile as a value.
             */
            .apply(
                "Poll Input Files",
                FileIO.match()
                    .filepattern(options.getInputFilePattern())
                    .continuously(DEFAULT_POLL_INTERVAL, Watch.Growth.never()))
            .apply("Find Pattern Match", FileIO.readMatches().withCompression(Compression.AUTO))
            .apply("Add File Name as Key", WithKeys.of(file -> getFileName(file)))
            .setCoder(KvCoder.of(StringUtf8Coder.of(), ReadableFileCoder.of()))
            .apply(
                "Fixed Window(30 Sec)",
                Window.<KV<String, ReadableFile>>into(FixedWindows.of(WINDOW_INTERVAL))
                    .triggering(
                        Repeatedly.forever(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.ZERO)))
                    .discardingFiredPanes()
                    .withAllowedLateness(Duration.ZERO))
            .apply(GroupByKey.create());

    PCollection<KV<String, TableRow>> bqDataMap =
        csvFiles

            // 2) Output each readable file for content processing.
            .apply(
                "File Handler",
                ParDo.of(
                    new DoFn<KV<String, Iterable<ReadableFile>>, KV<String, ReadableFile>>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        String fileKey = c.element().getKey();
                        c.element()
                            .getValue()
                            .forEach(
                                file -> {
                                  c.output(KV.of(fileKey, file));
                                });
                      }
                    }))

            // 3) Split file contents based on batch size for parallel processing.
            .apply(
                "Process File Contents",
                ParDo.of(
                    new CSVReader(
                        NestedValueProvider.of(
                            options.getBatchSize(),
                            batchSize -> {
                              if (batchSize != null) {
                                return batchSize;
                              } else {
                                return DEFAULT_BATCH_SIZE;
                              }
                            }))))

            // 4) Create a DLP Table content request and invoke DLP API for each processsing
            .apply(
                "DLP-Tokenization",
                ParDo.of(
                    new DLPTokenizationDoFn(
                        options.getDlpProjectId(),
                        options.getDeidentifyTemplateName(),
                        options.getInspectTemplateName())))

            // 5) Convert DLP Table Rows to BQ Table Row
            .apply("Process Tokenized Data", ParDo.of(new TableRowProcessorDoFn()));

    // 6) Create dynamic table and insert successfully converted records into BQ.
    bqDataMap.apply(
        "Write To BQ",
        BigQueryIO.<KV<String, TableRow>>write()
            .to(new BQDestination(options.getDatasetName(), options.getDlpProjectId()))
            .withFormatFunction(
                element -> {
                  return element.getValue();
                })
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withoutValidation()
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    return p.run();
  }

  /**
   * The {@link TokenizePipelineOptions} interface provides the custom execution options passed by
   * the executor at the command-line.
   */
  public interface TokenizePipelineOptions extends DataflowPipelineOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.csv")
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {
          "^projects\\/[^\\n\\r\\/]+(\\/locations\\/[^\\n\\r\\/]+)?\\/deidentifyTemplates\\/[^\\n\\r\\/]+$"
        },
        description = "Cloud DLP deidentify template name",
        helpText =
            "Cloud DLP template to deidentify contents. Must be created here: https://console.cloud.google.com/security/dlp/create/template.",
        example =
            "projects/your-project-id/locations/global/deidentifyTemplates/generated_template_id")
    @Required
    ValueProvider<String> getDeidentifyTemplateName();

    void setDeidentifyTemplateName(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 3,
        optional = true,
        regexes = {
          "^projects\\/[^\\n\\r\\/]+(\\/locations\\/[^\\n\\r\\/]+)?\\/inspectTemplates\\/[^\\n\\r\\/]+$"
        },
        description = "Cloud DLP inspect template name",
        helpText = "Cloud DLP template to inspect contents.",
        example =
            "projects/your-project-id/locations/global/inspectTemplates/generated_template_id")
    ValueProvider<String> getInspectTemplateName();

    void setInspectTemplateName(ValueProvider<String> value);

    @TemplateParameter.Integer(
        order = 4,
        optional = true,
        description = "Batch size",
        helpText =
            "Batch size contents (number of rows) to optimize DLP API call. Total size of the "
                + "rows must not exceed 512 KB and total cell count must not exceed 50,000. Default batch "
                + "size is set to 100. Ex. 1000")
    @Required
    ValueProvider<Integer> getBatchSize();

    void setBatchSize(ValueProvider<Integer> value);

    @TemplateParameter.Text(
        order = 5,
        regexes = {"^[^.]*$"},
        description = "BigQuery Dataset",
        helpText =
            "BigQuery Dataset to be used. Dataset must exist prior to execution. Ex. pii_dataset")
    ValueProvider<String> getDatasetName();

    void setDatasetName(ValueProvider<String> value);

    @TemplateParameter.ProjectId(
        order = 6,
        description = "Cloud DLP project ID",
        helpText =
            "Cloud DLP project ID to be used for data masking/tokenization. Ex. your-dlp-project")
    ValueProvider<String> getDlpProjectId();

    void setDlpProjectId(ValueProvider<String> value);
  }

  /**
   * The {@link CSVReader} class uses experimental Split DoFn to split each csv file contents in
   * chunks and process it in non-monolithic fashion. For example: if a CSV file has 100 rows and
   * batch size is set to 15, then initial restrictions for the SDF will be 1 to 7 and split
   * restriction will be {{1-2},{2-3}..{7-8}} for parallel executions.
   */
  static class CSVReader extends DoFn<KV<String, ReadableFile>, KV<String, Table>> {

    private ValueProvider<Integer> batchSize;
    private PCollectionView<List<KV<String, List<String>>>> headerMap;
    /** This counter is used to track number of lines processed against batch size. */
    private Integer lineCount;

    public CSVReader(ValueProvider<Integer> batchSize) {
      lineCount = 1;
      this.batchSize = batchSize;
    }

    @ProcessElement
    public void processElement(ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker)
        throws IOException {
      for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
        String fileKey = c.element().getKey();
        try (BufferedReader br = getReader(c.element().getValue())) {
          List<Table.Row> rows = new ArrayList<>();
          Table dlpTable = null;
          /** finding out EOL for this restriction so that we know the SOL */
          int endOfLine = (int) (i * batchSize.get().intValue());
          int startOfLine = (endOfLine - batchSize.get().intValue());

          // getting the DLP table headers
          Iterator<CSVRecord> csvRows = CSVFormat.DEFAULT.parse(br).iterator();
          if (!csvRows.hasNext()) {
            LOG.info("File `" + c.element().getKey() + "` is empty");
            continue;
          }
          List<FieldId> dlpTableHeaders = toDlpTableHeaders(csvRows.next());

          /** skipping all the rows that's not part of this restriction */
          for (int line = 0; line < startOfLine; line++) {
            if (csvRows.hasNext()) {
              csvRows.next();
            }
          }
          /** looping through buffered reader and creating DLP Table Rows equals to batch */
          while (csvRows.hasNext() && lineCount <= batchSize.get()) {

            CSVRecord csvRow = csvRows.next();
            rows.add(convertCsvRowToTableRow(csvRow));
            lineCount += 1;
          }
          /** creating DLP table and output for next transformation */
          dlpTable = Table.newBuilder().addAllHeaders(dlpTableHeaders).addAllRows(rows).build();
          c.output(KV.of(fileKey, dlpTable));

          LOG.debug(
              "Current Restriction From: {}, Current Restriction To: {},"
                  + " StartofLine: {}, End Of Line {}, BatchData {}",
              tracker.currentRestriction().getFrom(),
              tracker.currentRestriction().getTo(),
              startOfLine,
              endOfLine,
              dlpTable.getRowsCount());
        }
      }
    }

    private static List<FieldId> toDlpTableHeaders(CSVRecord headerRow) {
      List<FieldId> result = new ArrayList<>();
      for (String header : headerRow) {
        result.add(FieldId.newBuilder().setName(header).build());
      }
      return result;
    }

    /**
     * SDF needs to define a @GetInitialRestriction method that can create a restriction describing
     * the complete work for a given element. For our case this would be the total number of rows
     * for each CSV file. We will calculate the number of split required based on total number of
     * rows and batch size provided.
     *
     * @throws IOException
     */
    @GetInitialRestriction
    public OffsetRange getInitialRestriction(@Element KV<String, ReadableFile> csvFile)
        throws IOException {

      int rowCount = 0;
      int totalSplit = 0;
      try (BufferedReader br = getReader(csvFile.getValue())) {
        /** assume first row is header */
        int checkRowCount = (int) br.lines().count() - 1;
        rowCount = (checkRowCount < 1) ? 1 : checkRowCount;
        totalSplit = rowCount / batchSize.get().intValue();
        int remaining = rowCount % batchSize.get().intValue();
        /**
         * Adjusting the total number of split based on remaining rows. For example: batch size of
         * 15 for 100 rows will have total 7 splits. As it's a range last split will have offset
         * range {7,8}
         */
        if (remaining > 0) {
          totalSplit = totalSplit + 2;

        } else {
          totalSplit = totalSplit + 1;
        }
      }

      LOG.debug("Initial Restriction range from 1 to: {}", totalSplit);
      return new OffsetRange(1, totalSplit);
    }

    /**
     * SDF needs to define a @SplitRestriction method that can split the intital restricton to a
     * number of smaller restrictions. For example: a intital rewstriction of (x, N) as input and
     * produces pairs (x, 0), (x, 1), …, (x, N-1) as output.
     */
    @SplitRestriction
    public void splitRestriction(
        @Element KV<String, ReadableFile> csvFile,
        @Restriction OffsetRange range,
        OutputReceiver<OffsetRange> out) {
      /** split the initial restriction by 1 */
      for (final OffsetRange p : range.split(1, 1)) {
        out.output(p);
      }
    }

    @NewTracker
    public OffsetRangeTracker newTracker(@Restriction OffsetRange range) {
      return new OffsetRangeTracker(new OffsetRange(range.getFrom(), range.getTo()));
    }

    private Table.Row convertCsvRowToTableRow(CSVRecord csvRow) {
      /** convert from CSV row to DLP Table Row */
      Iterator<String> valueIterator = csvRow.iterator();
      Table.Row.Builder tableRowBuilder = Table.Row.newBuilder();
      while (valueIterator.hasNext()) {
        String value = valueIterator.next();
        if (value != null) {
          tableRowBuilder.addValues(Value.newBuilder().setStringValue(value.toString()).build());
        } else {
          tableRowBuilder.addValues(Value.newBuilder().setStringValue("").build());
        }
      }

      return tableRowBuilder.build();
    }

    private List<String> getHeaders(List<KV<String, List<String>>> headerMap, String fileKey) {
      return headerMap.stream()
          .filter(map -> map.getKey().equalsIgnoreCase(fileKey))
          .findFirst()
          .map(e -> e.getValue())
          .orElse(null);
    }
  }

  /**
   * The {@link DLPTokenizationDoFn} class executes tokenization request by calling DLP api. It uses
   * DLP table as a content item as CSV file contains fully structured data. DLP templates (e.g.
   * de-identify, inspect) need to exist before this pipeline runs. As response from the API is
   * received, this DoFn ouptputs KV of new table with table id as key.
   */
  static class DLPTokenizationDoFn extends DoFn<KV<String, Table>, KV<String, Table>> {
    private ValueProvider<String> dlpProjectId;
    private DlpServiceClient dlpServiceClient;
    private ValueProvider<String> deIdentifyTemplateName;
    private ValueProvider<String> inspectTemplateName;
    private boolean inspectTemplateExist;
    private Builder requestBuilder;
    private final Distribution numberOfRowsTokenized =
        Metrics.distribution(DLPTokenizationDoFn.class, "numberOfRowsTokenizedDistro");
    private final Distribution numberOfBytesTokenized =
        Metrics.distribution(DLPTokenizationDoFn.class, "numberOfBytesTokenizedDistro");

    public DLPTokenizationDoFn(
        ValueProvider<String> dlpProjectId,
        ValueProvider<String> deIdentifyTemplateName,
        ValueProvider<String> inspectTemplateName) {
      this.dlpProjectId = dlpProjectId;
      this.dlpServiceClient = null;
      this.deIdentifyTemplateName = deIdentifyTemplateName;
      this.inspectTemplateName = inspectTemplateName;
      this.inspectTemplateExist = false;
    }

    @Setup
    public void setup() {
      if (this.inspectTemplateName.isAccessible()) {
        if (this.inspectTemplateName.get() != null) {
          this.inspectTemplateExist = true;
        }
      }
      if (this.deIdentifyTemplateName.isAccessible()) {
        if (this.deIdentifyTemplateName.get() != null) {
          this.requestBuilder =
              DeidentifyContentRequest.newBuilder()
                  .setParent(ProjectName.of(this.dlpProjectId.get()).toString())
                  .setDeidentifyTemplateName(this.deIdentifyTemplateName.get());
          if (this.inspectTemplateExist) {
            this.requestBuilder.setInspectTemplateName(this.inspectTemplateName.get());
          }
        }
      }
    }

    @StartBundle
    public void startBundle() throws SQLException {

      try {
        this.dlpServiceClient = DlpServiceClient.create();

      } catch (IOException e) {
        LOG.error("Failed to create DLP Service Client", e.getMessage());
        throw new RuntimeException(e);
      }
    }

    @FinishBundle
    public void finishBundle() throws Exception {
      if (this.dlpServiceClient != null) {
        this.dlpServiceClient.close();
      }
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String key = c.element().getKey();
      Table nonEncryptedData = c.element().getValue();
      ContentItem tableItem = ContentItem.newBuilder().setTable(nonEncryptedData).build();
      this.requestBuilder.setItem(tableItem);
      DeidentifyContentResponse response =
          dlpServiceClient.deidentifyContent(this.requestBuilder.build());
      Table tokenizedData = response.getItem().getTable();
      numberOfRowsTokenized.update(tokenizedData.getRowsList().size());
      numberOfBytesTokenized.update(tokenizedData.toByteArray().length);
      c.output(KV.of(key, tokenizedData));
    }
  }

  /**
   * The {@link TableRowProcessorDoFn} class process tokenized DLP tables and convert them to
   * BigQuery Table Row.
   */
  public static class TableRowProcessorDoFn extends DoFn<KV<String, Table>, KV<String, TableRow>> {

    @ProcessElement
    public void processElement(ProcessContext c) {

      Table tokenizedData = c.element().getValue();
      List<String> headers =
          tokenizedData.getHeadersList().stream()
              .map(fid -> fid.getName())
              .collect(Collectors.toList());
      List<Table.Row> outputRows = tokenizedData.getRowsList();
      if (outputRows.size() > 0) {
        for (Table.Row outputRow : outputRows) {
          if (outputRow.getValuesCount() != headers.size()) {
            throw new IllegalArgumentException(
                "CSV file's header count must exactly match with data element count");
          }
          c.output(
              KV.of(
                  c.element().getKey(),
                  createBqRow(outputRow, headers.toArray(new String[headers.size()]))));
        }
      }
    }

    private static TableRow createBqRow(Table.Row tokenizedValue, String[] headers) {
      TableRow bqRow = new TableRow();
      AtomicInteger headerIndex = new AtomicInteger(0);
      List<TableCell> cells = new ArrayList<>();
      tokenizedValue
          .getValuesList()
          .forEach(
              value -> {
                String checkedHeaderName =
                    checkHeaderName(headers[headerIndex.getAndIncrement()].toString());
                bqRow.set(checkedHeaderName, value.getStringValue());
                cells.add(new TableCell().set(checkedHeaderName, value.getStringValue()));
              });
      bqRow.setF(cells);
      return bqRow;
    }
  }

  /**
   * The {@link BQDestination} class creates BigQuery table destination and table schema based on
   * the CSV file processed in earlier transformations. Table id is same as filename Table schema is
   * same as file header columns.
   */
  public static class BQDestination
      extends DynamicDestinations<KV<String, TableRow>, KV<String, TableRow>> {

    private ValueProvider<String> datasetName;
    private ValueProvider<String> projectId;

    public BQDestination(ValueProvider<String> datasetName, ValueProvider<String> projectId) {
      this.datasetName = datasetName;
      this.projectId = projectId;
    }

    @Override
    public KV<String, TableRow> getDestination(ValueInSingleWindow<KV<String, TableRow>> element) {
      String key = element.getValue().getKey();
      String tableName = String.format("%s:%s.%s", projectId.get(), datasetName.get(), key);
      LOG.debug("Table Name {}", tableName);
      return KV.of(tableName, element.getValue().getValue());
    }

    @Override
    public TableDestination getTable(KV<String, TableRow> destination) {
      TableDestination dest =
          new TableDestination(destination.getKey(), "pii-tokenized output data from dataflow");
      LOG.debug("Table Destination {}", dest.getTableSpec());
      return dest;
    }

    @Override
    public TableSchema getSchema(KV<String, TableRow> destination) {

      TableRow bqRow = destination.getValue();
      TableSchema schema = new TableSchema();
      List<TableFieldSchema> fields = new ArrayList<TableFieldSchema>();
      List<TableCell> cells = bqRow.getF();
      for (int i = 0; i < cells.size(); i++) {
        Map<String, Object> object = cells.get(i);
        String header = object.keySet().iterator().next();
        /** currently all BQ data types are set to String */
        fields.add(new TableFieldSchema().setName(checkHeaderName(header)).setType("STRING"));
      }

      schema.setFields(fields);
      return schema;
    }
  }

  private static String getFileName(ReadableFile file) {
    String csvFileName = file.getMetadata().resourceId().getFilename().toString();
    /** taking out .csv extension from file name e.g fileName.csv->fileName */
    String[] fileKey = csvFileName.split("\\.", 2);

    if (!fileKey[1].equals(ALLOWED_FILE_EXTENSION) || !TABLE_REGEXP.matcher(fileKey[0]).matches()) {
      throw new RuntimeException(
          "[Filename must contain a CSV extension "
              + " BQ table name must contain only letters, numbers, or underscores ["
              + fileKey[1]
              + "], ["
              + fileKey[0]
              + "]");
    }
    /** returning file name without extension */
    return fileKey[0];
  }

  private static BufferedReader getReader(ReadableFile csvFile) {
    BufferedReader br = null;
    ReadableByteChannel channel = null;
    /** read the file and create buffered reader */
    try {
      channel = csvFile.openSeekable();

    } catch (IOException e) {
      LOG.error("Failed to Read File {}", e.getMessage());
      throw new RuntimeException(e);
    }

    if (channel != null) {

      br = new BufferedReader(Channels.newReader(channel, Charsets.UTF_8.name()));
    }

    return br;
  }

  private static String checkHeaderName(String name) {
    /** some checks to make sure BQ column names don't fail e.g. special characters */
    String checkedHeader = name.replaceAll("\\s", "_");
    checkedHeader = checkedHeader.replaceAll("'", "");
    checkedHeader = checkedHeader.replaceAll("/", "");
    if (!COLUMN_NAME_REGEXP.matcher(checkedHeader).matches()) {
      throw new IllegalArgumentException("Column name can't be matched to a valid format " + name);
    }
    return checkedHeader;
  }
}

Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub (Stream)

Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub 模版是一种流处理流水线，用于从 MySQL 数据库中读取包含更改数据的 Pub/Sub 消息并将记录写入 BigQuery。Debezium 连接器捕获对 MySQL 数据库的更改，并将更改后的数据发布到 Pub/Sub。然后，该模板读取 Pub/Sub 消息并将其写入 BigQuery。

您可以使用此模板同步 MySQL 数据库和 BigQuery 表。流水线将更改后的数据写入 BigQuery 暂存表，并间歇性地更新复制 MySQL 数据库的 BigQuery 表。

对此流水线的要求：

Debezium 连接器必须已部署。
Pub/Sub 消息必须在 Beam 行中序列化。

模板参数

参数	说明
`inputSubscriptions`	要读取的 Pub/Sub 输入订阅列表（以英文逗号分隔），格式为 `<subscription>,<subscription>, ...`
`changeLogDataset`	用于存储暂存表的 BigQuery 数据集，格式为 `<my-dataset>`
`replicaDataset`	用于存储副本表的 BigQuery 数据集的位置，格式为 `<my-dataset>`
`updateFrequencySecs`	（可选）流水线更新复制 MySQL 数据库的 BigQuery 表的时间间隔。

运行 Change Data Capture using Debezium and MySQL from Pub/Sub to BigQuery 模版

如需运行此模板，请执行以下步骤：

在本地机器上，克隆 DataflowTemplates 代码库。
切换到 v2/cdc-parent 目录：
确保已部署 Debezium 连接器。
使用 Maven，运行 Dataflow 模板。
```
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=SUBSCRIPTIONS \
    --updateFrequencySecs=300 \
    --changeLogDataset=CHANGELOG_DATASET \
    --replicaDataset=REPLICA_DATASET \
    --project=PROJECT_ID \
    --region=REGION_NAME"
  
```
替换以下内容：
- PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
- SUBSCRIPTIONS：以英文逗号分隔的 Pub/Sub 订阅名称列表
- CHANGELOG_DATASET：用于变更日志数据的 BigQuery 数据集
- REPLICA_DATASET：用于副本表的 BigQuery 数据集

Apache Kafka to BigQuery

Apache Kafka to BigQuery 模板是一种流处理流水线，可从 Apache Kafka 提取文本数据、执行用户定义的函数 (UDF) 并将生成的记录输出到 BigQuery。在转换数据、执行 UDF 或插入输出表时出现的任何错误都将被插入 BigQuery 中单独的错误表。如果在执行之前错误表不存在，则创建该表。

对此流水线的要求

输出 BigQuery 表必须已存在。
Apache Kafka 代理服务器必须正在运行并可从 Dataflow 工作器机器进行访问。
Apache Kafka 主题必须存在，并且消息必须采用有效的 JSON 格式编码。

模板参数

参数	说明
`outputTableSpec`	要写入 Apache Kafka 消息的 BigQuery 输出表位置，格式为 `my-project:dataset.table`
`inputTopics`	以英文逗号分隔的列表中可读取的 Apache Kafka 输入主题。例如：`messages`
`bootstrapServers`	以英文逗号分隔的列表中正在运行的 Apache Kafka Broker 服务器的主机地址，每个主机地址的格式为 `35.70.252.199:9092`
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
`outputDeadletterTable`	（可选）未能到达输出表的消息的 BigQuery 表，格式为 `my-project:dataset.my-deadletter-table`。如果不存在，则系统会在流水线执行期间创建该表。如果未指定此参数，则系统会改用 `<outputTableSpec>_error_records`。

运行 Apache Kafka to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Kafka to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Kafka_to_BigQuery \
    --parameters \
outputTableSpec=BIGQUERY_TABLE,\
inputTopics=KAFKA_TOPICS,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
bootstrapServers=KAFKA_SERVER_ADDRESSES

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
BIGQUERY_TABLE：您的 BigQuery 表名称
KAFKA_TOPICS：Apache Kakfa 主题列表。如果提供了多个主题，请按照说明了解如何转义英文逗号。
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
KAFKA_SERVER_ADDRESSES：Apache Kafka broker 服务器 IP 地址列表。每个 IP 地址都应与服务器可访问的端口号相匹配。例如：35.70.252.199:9092。如果提供了多个地址，请按照说明了解如何转义英文逗号。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "outputTableSpec": "BIGQUERY_TABLE",
          "inputTopics": "KAFKA_TOPICS",
          "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
          "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
          "bootstrapServers": "KAFKA_SERVER_ADDRESSES"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Kafka_to_BigQuery",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
BIGQUERY_TABLE：您的 BigQuery 表名称
KAFKA_TOPICS：Apache Kakfa 主题列表。如果提供了多个主题，请按照说明了解如何转义英文逗号。
PATH_TO_JAVASCRIPT_UDF_FILE：.js 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)，例如 gs://my-bucket/my-udfs/my_file.js
JAVASCRIPT_FUNCTION：您要使用的 JavaScript 用户定义的函数 (UDF) 的名称
例如，如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。
KAFKA_SERVER_ADDRESSES：Apache Kafka broker 服务器 IP 地址列表。每个 IP 地址都应与服务器可访问的端口号相匹配。例如：35.70.252.199:9092。如果提供了多个地址，请按照说明了解如何转义英文逗号。

如需了解详情，请参阅使用 Dataflow 将数据从 Kafka 写入 BigQuery。

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.kafka.transforms.KafkaTransform.readFromKafka;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.kafka.options.KafkaReadOptions;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.templates.KafkaToBigQuery.KafkaToBQOptions;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteKafkaMessageErrors;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.utils.MetadataValidator;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.collect.ImmutableMap;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.commons.lang3.ObjectUtils;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link KafkaToBigQuery} pipeline is a streaming pipeline which ingests text data from Kafka,
 * executes a UDF, and outputs the resulting records to BigQuery. Any errors which occur in the
 * transformation of the data, execution of the UDF, or inserting into the output table will be
 * inserted into a separate errors table in BigQuery. The errors table will be created if it does
 * not exist prior to execution. Both output and error tables are specified by the user as
 * parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Kafka topic exists and the message is encoded in a valid JSON format.
 *   <li>The BigQuery output table exists.
 *   <li>The Kafka brokers are reachable from the Dataflow worker machines.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 *
 * # Set some environment variables
 * PROJECT=my-project
 * TEMP_BUCKET=my-temp-bucket
 * OUTPUT_TABLE=${PROJECT}:my_dataset.my_table
 * TOPICS=my-topics
 * JS_PATH=my-js-path-on-gcs
 * JS_FUNC_NAME=my-js-func-name
 * BOOTSTRAP=my-comma-separated-bootstrap-servers
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="kafka-to-bigquery`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json"     \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${TEMP_BUCKET}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${TEMP_BUCKET}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *           "outputTableSpec":"'$OUTPUT_TABLE'",
 *           "inputTopics":"'$TOPICS'",
 *           "javascriptTextTransformGcsPath":"'$JS_PATH'",
 *           "javascriptTextTransformFunctionName":"'$JS_FUNC_NAME'",
 *           "bootstrapServers":"'$BOOTSTRAP'"
 *        }
 *       }
 *      '
 * </pre>
 */
@Template(
    name = "Kafka_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Kafka to BigQuery",
    description =
        "A streaming pipeline which ingests data in JSON format from Kafka, performs a transform"
            + " via a user defined JavaScript function, and writes to a pre-existing BigQuery"
            + " table.",
    optionsClass = KafkaToBQOptions.class,
    flexContainerName = "kafka-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public class KafkaToBigQuery {

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(KafkaToBigQuery.class);

  /** The tag for the main output for the UDF. */
  private static final TupleTag<FailsafeElement<KV<String, String>, String>> UDF_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The tag for the main output of the json transformation. */
  static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the udf. */
  static final TupleTag<FailsafeElement<KV<String, String>, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  static final TupleTag<FailsafeElement<KV<String, String>, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<KV<String, String>, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  private static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * The {@link KafkaToBQOptions} class provides the custom execution options passed by the executor
   * at the command-line.
   */
  public interface KafkaToBQOptions
      extends KafkaReadOptions,
          JavascriptTextTransformerOptions,
          BigQueryStorageApiStreamingOptions {

    @TemplateParameter.BigQueryTable(
        order = 1,
        description = "BigQuery output table",
        helpText =
            "BigQuery table location to write the output to. The name should be in the format "
                + "<project>:<dataset>.<table_name>. The table's schema must match input objects.")
    @Required
    String getOutputTableSpec();

    void setOutputTableSpec(String outputTableSpec);

    /**
     * Get bootstrap server across releases.
     *
     * @deprecated This method is no longer acceptable to get bootstrap servers.
     *     <p>Use {@link KafkaToBQOptions#getReadBootstrapServers()} instead.
     */
    @TemplateParameter.Text(
        order = 2,
        optional = true,
        regexes = {"[,:a-zA-Z0-9._-]+"},
        description = "Kafka Bootstrap Server list",
        helpText = "Kafka Bootstrap Server list, separated by commas.",
        example = "localhost:9092,127.0.0.1:9093")
    @Deprecated
    String getBootstrapServers();

    /**
     * Get bootstrap server across releases.
     *
     * @deprecated This method is no longer acceptable to set bootstrap servers.
     *     <p>Use {@link KafkaToBQOptions#setReadBootstrapServers()} instead.
     */
    @Deprecated
    void setBootstrapServers(String bootstrapServers);

    /**
     * Get bootstrap server across releases.
     *
     * @deprecated This method is no longer acceptable to get Input topics.
     *     <p>Use {@link KafkaToBQOptions#getKafkaReadTopics()} instead.
     */
    @Deprecated
    @TemplateParameter.Text(
        order = 3,
        regexes = {"[,a-zA-Z0-9._-]+"},
        description = "Kafka topic(s) to read the input from",
        helpText = "Kafka topic(s) to read the input from.",
        example = "topic1,topic2")
    String getInputTopics();

    /**
     * Get bootstrap server across releases.
     *
     * @deprecated This method is no longer acceptable to set Input topics.
     *     <p>Use {@link KafkaToBQOptions#getKafkaReadTopics()} instead.
     */
    @Deprecated
    void setInputTopics(String inputTopics);

    @TemplateParameter.BigQueryTable(
        order = 4,
        optional = true,
        description = "The dead-letter table name to output failed messages to BigQuery",
        helpText =
            "Messages failed to reach the output table for all kind of reasons (e.g., mismatched"
                + " schema, malformed json) are written to this table. If it doesn't exist, it will"
                + " be created during pipeline execution.",
        example = "your-project-id:your-dataset.your-table-name")
    String getOutputDeadletterTable();

    void setOutputDeadletterTable(String outputDeadletterTable);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for it's execution to finish. If blocking execution is required, use the {@link
   * KafkaToBigQuery#run(KafkaToBQOptions)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    KafkaToBQOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(KafkaToBQOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(KafkaToBQOptions options) {

    // Validate BQ STORAGE_WRITE_API options
    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);
    MetadataValidator.validate(options);

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    FailsafeElementCoder<KV<String, String>, String> coder =
        FailsafeElementCoder.of(
            KvCoder.of(
                NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of())),
            NullableCoder.of(StringUtf8Coder.of()));

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(coder.getEncodedTypeDescriptor(), coder);

    List<String> topicsList;
    if (options.getKafkaReadTopics() != null) {
      topicsList = new ArrayList<>(Arrays.asList(options.getKafkaReadTopics().split(",")));
    } else if (options.getInputTopics() != null) {
      topicsList = new ArrayList<>(Arrays.asList(options.getInputTopics().split(",")));
    } else {
      throw new IllegalArgumentException("Please Provide --kafkaReadTopic");
    }
    String bootstrapServers;
    if (options.getReadBootstrapServers() != null) {
      bootstrapServers = options.getReadBootstrapServers();
    } else if (options.getBootstrapServers() != null) {
      bootstrapServers = options.getBootstrapServers();
    } else {
      throw new IllegalArgumentException("Please Provide --bootstrapServers");
    }
    /*
     * Steps:
     *  1) Read messages in from Kafka
     *  2) Transform the messages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */

    PCollectionTuple convertedTableRows =
        pipeline
            /*
             * Step #1: Read messages in from Kafka
             */
            .apply(
                "ReadFromKafka",
                readFromKafka(
                    bootstrapServers,
                    topicsList,
                    ImmutableMap.of(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"),
                    null))

            /*
             * Step #2: Transform the Kafka Messages into TableRows
             */
            .apply("ConvertMessageToTableRow", new MessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection<FailsafeElement<String, String>> failedInserts =
        BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResult, options)
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via(KafkaToBigQuery::wrapBigQueryInsertError))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write failed records out to BigQuery
     */
    PCollectionList.of(convertedTableRows.get(UDF_DEADLETTER_OUT))
        .and(convertedTableRows.get(TRANSFORM_DEADLETTER_OUT))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteTransformationFailedRecords",
            WriteKafkaMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    ObjectUtils.firstNonNull(
                        options.getOutputDeadletterTable(),
                        options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    /*
     * Step #5: Insert records that failed BigQuery inserts into a deadletter table.
     */
    failedInserts.apply(
        "WriteInsertionFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                ObjectUtils.firstNonNull(
                    options.getOutputDeadletterTable(),
                    options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX))
            .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
            .build());

    return pipeline.run();
  }

  /**
   * The {@link MessageToTableRow} class is a {@link PTransform} which transforms incoming Kafka
   * Message objects into {@link TableRow} objects for insertion into BigQuery while applying a UDF
   * to the input. The executions of the UDF and transformation to {@link TableRow} objects is done
   * in a fail-safe way by wrapping the element with it's original payload inside the {@link
   * FailsafeElement} class. The {@link MessageToTableRow} transform will output a {@link
   * PCollectionTuple} which contains all output and dead-letter {@link PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link KafkaToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} records
   *       successfully processed by the UDF.
   *   <li>{@link KafkaToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link FailsafeElement} records
   *       which failed processing during the UDF execution.
   *   <li>{@link KafkaToBigQuery#TRANSFORM_OUT} - Contains all records successfully converted from
   *       JSON to {@link TableRow} objects.
   *   <li>{@link KafkaToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  static class MessageToTableRow
      extends PTransform<PCollection<KV<String, String>>, PCollectionTuple> {

    private final KafkaToBQOptions options;

    MessageToTableRow(KafkaToBQOptions options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<KV<String, String>> input) {

      PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new MessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<KV<String, String>>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.<KV<String, String>>newBuilder()
                      .setSuccessTag(TRANSFORM_OUT)
                      .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link MessageToFailsafeElementFn} wraps an Kafka Message with the {@link FailsafeElement}
   * class so errors can be recovered from and the original message can be output to a error records
   * table.
   */
  static class MessageToFailsafeElementFn
      extends DoFn<KV<String, String>, FailsafeElement<KV<String, String>, String>> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      KV<String, String> message = context.element();
      context.output(FailsafeElement.of(message, message.getValue()));
    }
  }

  /**
   * Method to wrap a {@link BigQueryInsertError} into a {@link FailsafeElement}.
   *
   * @param insertError BigQueryInsert error.
   * @return FailsafeElement object.
   */
  protected static FailsafeElement<String, String> wrapBigQueryInsertError(
      BigQueryInsertError insertError) {

    FailsafeElement<String, String> failsafeElement;
    try {

      failsafeElement =
          FailsafeElement.of(
              insertError.getRow().toPrettyString(), insertError.getRow().toPrettyString());
      failsafeElement.setErrorMessage(insertError.getError().toPrettyString());

    } catch (IOException e) {
      LOG.error("Failed to wrap BigQuery insert error.");
      throw new RuntimeException(e);
    }
    return failsafeElement;
  }
}

Datastream to BigQuery (Stream)

Datastream to BigQuery 模板是一种流处理流水线，可读取 Datastream 数据并将其复制到 BigQuery 中。该模板使用 Pub/Sub 通知从 Cloud Storage 中读取数据，并将其复制到时间分区的 BigQuery 暂存表中。复制后，该模板会在 BigQuery 中执行 MERGE，以将所有更改数据捕获 (CDC) 更改插入/更新到源表的副本中。

此模板处理创建和更新通过复制管理的 BigQuery 表。当需要数据定义语言 (DDL) 时，对 Datastream 的回调将提取源表架构并将其转换为 BigQuery 数据类型。支持的操作包括下列操作：

在插入数据时创建新表。
在 BigQuery 表中添加新列，且初始值为 null。
在 BigQuery 中忽略丢弃的列，且未来的值为 null。
重命名的列将作为新列添加到 BigQuery 中。
类型更改不会传播到 BigQuery。

对此流水线的要求：

已准备好或已在复制数据的 Datastream 数据流。
已经为 Datastream 数据启用 Cloud Storage Pub/Sub 通知。
BigQuery 目标数据集已创建，而 Compute Engine 服务帐号已获授予管理员的访问权限。
若要创建目标副本表，源表中必须有主键。

模板参数

参数	说明
`inputFilePattern`	Cloud Storage 中要复制的 Datastream 文件的位置。此文件位置通常是数据流的根路径。
`gcsPubSubSubscription`	包含 Datastream 文件通知的 Pub/Sub 订阅。例如 `projects/my-project-id/subscriptions/my-subscription-id`。
`inputFileFormat`	Datastream 生成的输出文件的格式。例如：`avro,json`。默认值：`avro`。
`outputStagingDatasetTemplate`	包含暂存表的现有数据集的名称。您可以将模板 `{_metadata_dataset}` 包括为占位符，它会被替换为源数据集/架构的名称（例如 `{_metadata_dataset}_log`）。
`outputDatasetTemplate`	包含副本表的现有数据集的名称。您可以将模板 `{_metadata_dataset}` 包括为占位符，它会被替换为源数据集/架构的名称（例如 `{_metadata_dataset}`）。
`deadLetterQueueDirectory`	用于存储任何未处理消息以及无法处理原因的文件路径。默认值为 Dataflow 作业的临时位置下的目录。在大多数情况下，默认值就可以了。
`outputStagingTableNameTemplate`	（可选）暂存表的名称模板。默认值为 {_metadata_table}_log。如果您要复制多个架构，建议使用 `{_metadata_schema}_{_metadata_table}_log`。
`outputTableNameTemplate`	（可选）副本表名称的模板。默认值：`{_metadata_table}`。如果您要复制多个架构，建议使用 `{_metadata_schema}_{_metadata_table}`。
`outputProjectId`	（可选）BigQuery 数据集的项目，用于将数据输出到其中。此参数的默认项目是 Dataflow 流水线在其中运行的项目。
`streamName`	（可选）用于轮询架构信息的数据流的名称或模板。默认值：`{_metadata_stream}`。
`mergeFrequencyMinutes`	（可选）合并给定表格的时间间隔（分钟）。默认值：5。
`dlqRetryMinutes`	（可选）死信队列 (DLQ) 重试之间的分钟数。默认值：10。
`javascriptTextTransformGcsPath`	（可选）`.js` 文件的 Cloud Storage URI，用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 `gs://my-bucket/my-udfs/my_file.js`。
`javascriptTextTransformFunctionName`	（可选）您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果您的 JavaScript 函数代码为 `myTransform(inJson) { /...do stuff.../ }`，则函数名称为 `myTransform`。如需查看 JavaScript UDF 示例，请参阅 UDF 示例。

运行 Datastream to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Datastream to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --enable-streaming-engine \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_BigQuery \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
gcsPubSubSubscription=GCS_SUBSCRIPTION_NAME,\
outputStagingDatasetTemplate=BIGQUERY_DATASET,\
outputDatasetTemplate=BIGQUERY_DATASET,\
outputStagingTableNameTemplate=BIGQUERY_TABLE,\
outputTableNameTemplate=BIGQUERY_TABLE_log

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/ the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH：Datastream 数据的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
GCS_SUBSCRIPTION_NAME：要从中读取已更改文件的 Pub/Sub 订阅。例如：projects/my-project-id/subscriptions/my-subscription-id。
BIGQUERY_DATASET：您的 BigQuery 数据集名称。
BIGQUERY_TABLE：您的 BigQuery 表模板。例如 {_metadata_schema}_{_metadata_table}_log

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {

          "inputFilePattern": "GCS_FILE_PATH",
          "gcsPubSubSubscription": "GCS_SUBSCRIPTION_NAME",
          "outputStagingDatasetTemplate": "BIGQUERY_DATASET",
          "outputDatasetTemplate": "BIGQUERY_DATASET",
          "outputStagingTableNameTemplate": "BIGQUERY_TABLE",
          "outputTableNameTemplate": "BIGQUERY_TABLE_log"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_BigQuery",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/ the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH：Datastream 数据的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
GCS_SUBSCRIPTION_NAME：要从中读取已更改文件的 Pub/Sub 订阅。例如：projects/my-project-id/subscriptions/my-subscription-id。
BIGQUERY_DATASET：您的 BigQuery 数据集名称。
BIGQUERY_TABLE：您的 BigQuery 表模板。例如 {_metadata_schema}_{_metadata_table}_log

Datastream to MySQL 或 PostgreSQL (Stream)

Datastream to SQL 模板是一种流处理流水线，可读取 Datastream 数据并将其复制到任何 MySQL 或 PostgreSQL 数据库。该模板使用 Pub/Sub 通知从 Cloud Storage 中读取数据，并将此数据复制到 SQL 副本表。

该模板不支持数据定义语言 (DDL)，要求数据库中已存在所有表。复制使用 Dataflow 有状态转换来过滤过时的数据，并确保数据无序的一致性。例如，如果传递了某行的较新版本，则会忽略该行较晚到达的版本。执行的数据操纵语言 (DML) 是将源数据最佳复制到目标数据的最佳尝试。执行的 DML 语句遵循以下规则：

如果存在主键，则插入和更新操作使用插入/更新语法（例如 INSERT INTO table VALUES (...) ON CONFLICT (...) DO UPDATE）。
如果存在主键，则删除将作为删除 DML 复制。
如果不存在主键，则会将插入和更新操作插入表中。
如果不存在主键，则会忽略删除操作。

如果您使用的是 Oracle to Postgres 实用程序，请在 SQL 中添加 ROWID 作为主键（如果主键不存在）。

对此流水线的要求如下：

已准备好或已在复制数据的 Datastream 数据流。
已经为 Datastream 数据启用 Cloud Storage Pub/Sub 通知。
为 PostgreSQL 数据库提供了所需的架构。
已在 Dataflow 工作器和 PostgreSQL 之间设置网络访问权限。

模板参数

参数	说明
`inputFilePattern`	Cloud Storage 中要复制的 Datastream 文件的位置。此文件位置通常是数据流的根路径。
`gcsPubSubSubscription`	包含 Datastream 文件通知的 Pub/Sub 订阅。例如 `projects/my-project-id/subscriptions/my-subscription-id`。
`inputFileFormat`	Datastream 生成的输出文件的格式。例如：`avro,json`。默认值：`avro`。
`databaseHost`	要连接的 SQL 主机。
`databaseUser`	具有写入副本中所有表所需的全部权限的 SQL 用户。
`databasePassword`	给定 SQL 用户的密码。
`databasePort`	（可选）要连接到的 SQL 数据库端口。默认值：5432。
`databaseName`	（可选）要连接到的 SQL 数据库的名称。默认值：postgres。
`streamName`	（可选）用于轮询架构信息的数据流的名称或模板。默认值：`{_metadata_stream}`。

运行 Datastream to SQL 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Cloud Datastream to SQL template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --enable-streaming-engine \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_SQL \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
gcsPubSubSubscription=GCS_SUBSCRIPTION_NAME,\
databaseHost=DATABASE_HOST,\
databaseUser=DATABASE_USER,\
databasePassword=DATABASE_PASSWORD

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/ the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH：Datastream 数据的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
GCS_SUBSCRIPTION_NAME：要从中读取已更改文件的 Pub/Sub 订阅。例如：projects/my-project-id/subscriptions/my-subscription-id。
DATABASE_HOST：您的 SQL 主机 IP。
DATABASE_USER：您的 SQL 用户。
DATABASE_PASSWORD：您的 SQL 密码。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {

          "inputFilePattern": "GCS_FILE_PATH",
          "gcsPubSubSubscription": "GCS_SUBSCRIPTION_NAME",
          "databaseHost": "DATABASE_HOST",
          "databaseUser": "DATABASE_USER",
          "databasePassword": "DATABASE_PASSWORD"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Cloud_Datastream_to_SQL",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates/latest/ the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH：Datastream 数据的 Cloud Storage 路径。例如 gs://bucket/path/to/data/
GCS_SUBSCRIPTION_NAME：要从中读取已更改文件的 Pub/Sub 订阅。例如：projects/my-project-id/subscriptions/my-subscription-id。
DATABASE_HOST：您的 SQL 主机 IP。
DATABASE_USER：您的 SQL 用户。
DATABASE_PASSWORD：您的 SQL 密码。

Pub/Sub to Java Database Connectivity (JDBC)

Pub/Sub to Java Database Connectivity (JDBC) 模板是一种流处理流水线，可从预先存在的 Cloud Pub/Sub 订阅注入数据作为 JSON 字符串，并将生成的记录写入 JDBC。

对此流水线的要求：

Cloud Pub/Sub 订阅必须已存在才能运行此流水线。
在运行流水线之前，JDBC 源必须已存在。
在运行流水线之前，Cloud Pub/Sub 输出死信主题必须已存在。

模板参数

参数	说明
`driverClassName`	JDBC 驱动程序类名称。例如 `com.mysql.jdbc.Driver`。
`connectionUrl`	JDBC 连接网址字符串。例如 `jdbc:mysql://some-host:3306/sampledb`。可作为 Base64 编码，然后使用 Cloud KMS 密钥加密的字符串传入。
`driverJars`	以英文逗号分隔的 JDBC 驱动程序 Cloud Storage 路径。例如 `gs://your-bucket/driver_jar1.jar,gs://your-bucket/driver_jar2.jar`。
`username`	（可选）用于 JDBC 连接的用户名。该参数可以作为使用 Cloud KMS 密钥加密的 Base64 编码字符串传入。
`password`	（可选）用于 JDBC 连接的密码。该参数可以作为使用 Cloud KMS 密钥加密的 Base64 编码字符串传入。
`connectionProperties`	（可选）用于 JDBC 连接的属性字符串。字符串的格式必须为 `[propertyName=property;]*`。例如 `unicode=true;characterEncoding=UTF-8`。
`statement`	针对数据库运行的语句。该语句必须以任意顺序指定表的列名。只会从 JSON 中读取指定列名称的值并将其添加到语句中。例如 `INSERT INTO tableName (column1, column2) VALUES (?,?)`。
`inputSubscription`	要读取的 Pub/Sub 输入订阅，格式为 `projects/<project>/subscriptions/<subscription>`。
`outputDeadletterTopic`	用于转发无法递送的消息的 Pub/Sub 主题，例如 `projects/<project-id>/topics/<topic-name>`。
`KMSEncryptionKey`	（可选）用于对用户名、密码和连接字符串进行解密的 Cloud KMS 加密密钥。如果传入了 Cloud KMS 密钥，则用户名、密码和连接字符串都必须以加密方式进行传递。
`extraFilesToStage`	用于将文件暂存在工作器中的 Cloud Storage 路径或 Secret Manager 密文，以逗号分隔。这些文件将保存在每个工作器的 `/extra_files` 目录下。例如 `gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id>`。

运行 Pub/Sub to Java Database Connectivity (JDBC) 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to JDBC template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/PubSub_to_Jdbc \
    --region REGION_NAME \
    --parameters \
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
driverJars=DRIVER_PATHS,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
connectionProperties=CONNECTION_PROPERTIES,\
statement=SQL_STATEMENT,\
inputSubscription=INPUT_SUBSCRIPTION,\
outputDeadletterTopic=OUTPUT_DEADLETTER_TOPIC,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
DRIVER_CLASS_NAME：驱动程序类名称
JDBC_CONNECTION_URL：JDBC 连接网址
DRIVER_PATHS：JDBC 驱动程序以英文逗号分隔的 Cloud Storage 路径
CONNECTION_USERNAME：JDBC 连接用户名
CONNECTION_PASSWORD：JDBC 连接密码
CONNECTION_PROPERTIES：JDBC 连接属性（如有需要）
SQL_STATEMENT：要对数据库执行的 SQL 语句
INPUT_SUBSCRIPTION：要读取的 Pub/Sub 输入订阅。
OUTPUT_DEADLETTER_TOPIC：用于转发无法递送的消息的 Pub/Sub
KMS_ENCRYPTION_KEY：Cloud KMS 加密密钥

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/PubSub_to_Jdbc
{
   "jobName": "JOB_NAME",
   "parameters": {
       "driverClassName": "DRIVER_CLASS_NAME",
       "connectionURL": "JDBC_CONNECTION_URL",
       "driverJars": "DRIVER_PATHS",
       "username": "CONNECTION_USERNAME",
       "password": "CONNECTION_PASSWORD",
       "connectionProperties": "CONNECTION_PROPERTIES",
       "statement": "SQL_STATEMENT",
       "inputSubscription": "INPUT_SUBSCRIPTION",
       "outputDeadletterTopic": "OUTPUT_DEADLETTER_TOPIC",
       "KMSEncryptionKey":"KMS_ENCRYPTION_KEY"
   },
   "environment": { "zone": "us-central1-f" },
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
DRIVER_CLASS_NAME：驱动程序类名称
JDBC_CONNECTION_URL：JDBC 连接网址
DRIVER_PATHS：JDBC 驱动程序以英文逗号分隔的 Cloud Storage 路径
CONNECTION_USERNAME：JDBC 连接用户名
CONNECTION_PASSWORD：JDBC 连接密码
CONNECTION_PROPERTIES：JDBC 连接属性（如有需要）
SQL_STATEMENT：要对数据库执行的 SQL 语句
INPUT_SUBSCRIPTION：要读取的 Pub/Sub 输入订阅。
OUTPUT_DEADLETTER_TOPIC：用于转发无法递送的消息的 Pub/Sub
KMS_ENCRYPTION_KEY：Cloud KMS 加密密钥

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.utils.KMSUtils.maybeDecrypt;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.io.DynamicJdbcIO;
import com.google.cloud.teleport.v2.options.PubsubToJdbcOptions;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.utils.JsonStringToQueryMapper;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.base.Splitter;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubsubToJdbc} streaming pipeline reads data from Google Cloud PubSub and publishes to
 * JDBC. <br>
 */
@Template(
    name = "Pubsub_to_Jdbc",
    category = TemplateCategory.STREAMING,
    displayName = "Pub/Sub to JDBC",
    description =
        "A streaming pipeline which ingests data in the form of json strings from Pub/Sub"
            + " subscription and writes to a JDBC table. JDBC connection string, user name and"
            + " password can be passed in directly as plaintext or encrypted using the Google Cloud"
            + " KMS API.  If the parameter KMSEncryptionKey is specified, connectionUrl, username,"
            + " and password should be all in encrypted format. A sample curl command for the KMS"
            + " API encrypt endpoint: curl -s -X POST"
            + " \"https://cloudkms.googleapis.com/v1/projects/your-project/locations/your-path/keyRings/your-keyring/cryptoKeys/your-key:encrypt\""
            + "  -d \"{\\\"plaintext\\\":\\\"PasteBase64EncodedString\\\"}\"  -H \"Authorization:"
            + " Bearer $(gcloud auth application-default print-access-token)\" -H \"Content-Type:"
            + " application/json\"",
    optionsClass = PubsubToJdbcOptions.class,
    flexContainerName = "pubsub-to-jdbc",
    contactInformation = "https://cloud.google.com/support")
public class PubsubToJdbc {

  /* Logger for class.*/
  private static final Logger LOG = LoggerFactory.getLogger(PubsubToJdbc.class);

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    PubsubToJdbcOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(PubsubToJdbcOptions.class);

    run(options);
  }

  /**
   * Runs a pipeline which reads message from Pub/Sub and writes to JDBC.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(PubsubToJdbcOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    LOG.info("Starting Pubsub-to-Jdbc Pipeline.");

    /*
     * Steps:
     *  1) Read data from a Pub/Sub subscription
     *  2) Write to Jdbc Table
     *  3) Write errors to deadletter topic
     */
    PCollection<String> pubsubData =
        pipeline.apply(
            "readFromPubSubSubscription",
            PubsubIO.readStrings().fromSubscription(options.getInputSubscription()));

    DynamicJdbcIO.DynamicDataSourceConfiguration dataSourceConfiguration =
        DynamicJdbcIO.DynamicDataSourceConfiguration.create(
                options.getDriverClassName(),
                maybeDecrypt(options.getConnectionUrl(), options.getKMSEncryptionKey()))
            .withDriverJars(options.getDriverJars());
    if (options.getUsername() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withUsername(
              maybeDecrypt(options.getUsername(), options.getKMSEncryptionKey()));
    }
    if (options.getPassword() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withPassword(
              maybeDecrypt(options.getPassword(), options.getKMSEncryptionKey()));
    }
    if (options.getConnectionProperties() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withConnectionProperties(options.getConnectionProperties());
    }

    PCollection<FailsafeElement<String, String>> errors =
        pubsubData
            .apply(
                "writeToJdbc",
                DynamicJdbcIO.<String>write()
                    .withDataSourceConfiguration(dataSourceConfiguration)
                    .withStatement(options.getStatement())
                    .withPreparedStatementSetter(
                        new JsonStringToQueryMapper(getKeyOrder(options.getStatement()))))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    errors.apply(
        "WriteFailedRecords",
        ErrorConverters.WriteStringMessageErrorsToPubSub.newBuilder()
            .setErrorRecordsTopic(options.getOutputDeadletterTopic())
            .build());

    return pipeline.run();
  }

  private static List<String> getKeyOrder(String statement) {
    int startIndex = statement.indexOf("(");
    int endIndex = statement.indexOf(")");
    String data = statement.substring(startIndex + 1, endIndex);
    return Splitter.on(',').splitToList(data);
  }
}

Cloud Spanner change streams to Cloud Storage

Cloud Spanner change streams to Cloud Storage 模板是一种流处理流水线，可流式传输 Spanner 数据更改记录并使用 Dataflow Runner V2 将其写入 Cloud Storage 存储桶。

流水线根据 Spanner 变更数据流记录的时间戳将其分组到窗口中，每个窗口代表一个时长，您可以使用此模板配置该时长。时间戳属于某个窗口的所有记录都保证在该窗口中；不会有延迟到达。您还可以定义多个输出分片。流水线会为每个窗口的每个分片创建一个 Cloud Storage 输出文件。在输出文件中，记录是无序的。输出文件可以采用 JSON 或 AVRO 格式编写，具体取决于用户配置。

请注意，通过在与 Cloud Spanner 实例或 Cloud Storage 存储桶相同的区域运行 Dataflow 作业，您可以最大限度地减少网络延迟和网络传输费用。如果您使用位于作业区域之外的源、接收器、暂存文件位置或临时文件位置，则数据可能会跨区域发送。详细了解 Dataflow 区域端点。

详细了解变更数据流、如何构建变更数据流 Dataflow 流水线和最佳实践。

对此流水线的要求：

在运行流水线之前，Cloud Spanner 实例必须已存在。
在运行流水线之前，Cloud Spanner 数据库必须已存在。
在运行流水线之前，Cloud Spanner 元数据实例必须已存在。
在运行流水线之前，Cloud Spanner 元数据数据库必须已存在。
在运行流水线之前，Cloud Spanner 变更数据流必须已存在。
在运行流水线之前，Cloud Storage 输出存储桶必须已存在。

模板参数

参数	说明
`spannerInstanceId`	要从中读取变更数据流数据的 Cloud Spanner 实例 ID。
`spannerDatabase`	要从中读取变更数据流数据的 Cloud Spanner 数据库。
`spannerMetadataInstanceId`	用于变更数据流连接器元数据表的 Cloud Spanner 实例 ID。
`spannerMetadataDatabase`	要用于变更数据流连接器元数据表的 Cloud Spanner 数据库。
`spannerChangeStreamName`	要读取的 Cloud Spanner 变更数据流的名称。
`gcsOutputDirectory`	变更数据流输出在 Cloud Storage 中的文件位置，格式为：“gs://${BUCKET}/${ROOT_PATH}/”。
`outputFilenamePrefix`	（可选）要写入的文件的文件名前缀。默认文件前缀设置为“output”。
`spannerProjectId`	（可选）从中读取变更数据流的项目。这也是创建变更数据流连接器元数据表的项目。此参数的默认项目是 Dataflow 流水线在其中运行的项目。
`startTimestamp`	（可选）要用于读取变更数据流的起始 DateTime（含边界值）。Ex-2021-10-12T07:20:50.52Z。默认为流水线启动时的时间戳，即当前时间。
`endTimestamp`	（可选）要用于读取变更数据流的结束 DateTime（含边界值）。Ex-2021-10-12T07:20:50.52Z。默认为未来的无限时间。
`outputFileFormat`	（可选）输出 Cloud Storage 文件的格式。允许的格式为 TEXT、AVRO。默认为 AVRO。
`windowDuration`	（可选）窗口时长是将数据写入输出目录的时间间隔。请根据流水线的吞吐量配置时长。例如，较高的吞吐量可能需要较短的窗口时长，以便数据适应内存。默认为 5 分钟，最短可为 1 秒。允许的格式如下：[int]s（表示数秒，例如 5s）、[int]m（表示数分钟，例如 12m）、[int]h（表示数小时，例如 2h）。
`rpcPriority`	（可选）Cloud Spanner 调用的请求优先级。该值必须为 [HIGH,MEDIUM,LOW] 之一。（默认值：HIGH）
`numShards`	（可选）写入时产生的最大输出分片数。默认值为 20。分片数量越多，写入 Cloud Storage 的吞吐量越高，但处理输出 Cloud Storage 文件时跨分片聚合数据的费用也可能更高。
`spannerMetadataTableName`	（可选）要使用的 Cloud Spanner 变更数据流连接器元数据表名称。如果未提供，系统会在流水线流期间自动创建 Cloud Spanner 变更数据流元数据表。更新现有流水线时必须提供此参数，其他情况下则不应提供。

运行 Cloud Spanner change streams to Cloud Storage 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Cloud Spanner change streams to Google Cloud Storage template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_Google_Cloud_Storage \
    --region REGION_NAME \
    --parameters \
spannerInstanceId=SPANNER_INSTANCE_ID,\
spannerDatabase=SPANNER_DATABASE,\
spannerMetadataInstanceId=SPANNER_METADATA_INSTANCE_ID,\
spannerMetadataDatabase=SPANNER_METADATA_DATABASE,\
spannerChangeStreamName=SPANNER_CHANGE_STREAM,\
gcsOutputDirectory=GCS_OUTPUT_DIRECTORY

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
GCS_OUTPUT_DIRECTORY：变更数据流输出的文件位置

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "spannerInstanceId": "SPANNER_INSTANCE_ID",
          "spannerDatabase": "SPANNER_DATABASE",
          "spannerMetadataInstanceId": "SPANNER_METADATA_INSTANCE_ID",
          "spannerMetadataDatabase": "SPANNER_METADATA_DATABASE",
          "spannerChangeStreamName": "SPANNER_CHANGE_STREAM",
          "gcsOutputDirectory": "GCS_OUTPUT_DIRECTORY"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_Google_Cloud_Storage",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
GCS_OUTPUT_DIRECTORY：变更数据流输出的文件位置

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.Timestamp;
import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.SpannerChangeStreamsToGcsOptions;
import com.google.cloud.teleport.v2.transforms.FileFormatFactorySpannerChangeStreams;
import com.google.cloud.teleport.v2.utils.DurationUtils;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link SpannerChangeStreamsToGcs} pipeline streams change stream record(s) and stores to
 * Google Cloud Storage bucket in user specified format. The sink data can be stored in a Text or
 * Avro file format.
 */
@Template(
    name = "Spanner_Change_Streams_to_Google_Cloud_Storage",
    category = TemplateCategory.STREAMING,
    displayName = "Cloud Spanner change streams to Cloud Storage",
    description =
        "Streaming pipeline. Streams Spanner change stream data records and writes them into a"
            + " Cloud Storage bucket using Dataflow Runner V2.",
    optionsClass = SpannerChangeStreamsToGcsOptions.class,
    flexContainerName = "spanner-changestreams-to-gcs",
    contactInformation = "https://cloud.google.com/support")
public class SpannerChangeStreamsToGcs {
  private static final Logger LOG = LoggerFactory.getLogger(SpannerChangeStreamsToGcs.class);
  private static final String USE_RUNNER_V2_EXPERIMENT = "use_runner_v2";

  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    LOG.info("Starting Input Files to GCS");

    SpannerChangeStreamsToGcsOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SpannerChangeStreamsToGcsOptions.class);

    run(options);
  }

  private static String getProjectId(SpannerChangeStreamsToGcsOptions options) {
    return options.getSpannerProjectId().isEmpty()
        ? options.getProject()
        : options.getSpannerProjectId();
  }

  public static PipelineResult run(SpannerChangeStreamsToGcsOptions options) {
    LOG.info("Requested File Format is " + options.getOutputFileFormat());
    options.setStreaming(true);
    options.setEnableStreamingEngine(true);

    final Pipeline pipeline = Pipeline.create(options);

    // Get the Spanner project, instance, database, and change stream parameters.
    String projectId = getProjectId(options);
    String instanceId = options.getSpannerInstanceId();
    String databaseId = options.getSpannerDatabase();
    String metadataInstanceId = options.getSpannerMetadataInstanceId();
    String metadataDatabaseId = options.getSpannerMetadataDatabase();
    String changeStreamName = options.getSpannerChangeStreamName();

    // Retrieve and parse the start / end timestamps.
    Timestamp startTimestamp =
        options.getStartTimestamp().isEmpty()
            ? Timestamp.now()
            : Timestamp.parseTimestamp(options.getStartTimestamp());
    Timestamp endTimestamp =
        options.getEndTimestamp().isEmpty()
            ? Timestamp.MAX_VALUE
            : Timestamp.parseTimestamp(options.getEndTimestamp());

    // Add use_runner_v2 to the experiments option, since Change Streams connector is only supported
    // on Dataflow runner v2.
    List<String> experiments = options.getExperiments();
    if (experiments == null) {
      experiments = new ArrayList<>();
    }
    if (!experiments.contains(USE_RUNNER_V2_EXPERIMENT)) {
      experiments.add(USE_RUNNER_V2_EXPERIMENT);
    }
    options.setExperiments(experiments);

    String metadataTableName =
        options.getSpannerMetadataTableName() == null
            ? null
            : options.getSpannerMetadataTableName();

    final RpcPriority rpcPriority = options.getRpcPriority();
    pipeline
        .apply(
            SpannerIO.readChangeStream()
                .withSpannerConfig(
                    SpannerConfig.create()
                        .withHost(ValueProvider.StaticValueProvider.of(options.getSpannerHost()))
                        .withProjectId(projectId)
                        .withInstanceId(instanceId)
                        .withDatabaseId(databaseId))
                .withMetadataInstance(metadataInstanceId)
                .withMetadataDatabase(metadataDatabaseId)
                .withChangeStreamName(changeStreamName)
                .withInclusiveStartAt(startTimestamp)
                .withInclusiveEndAt(endTimestamp)
                .withRpcPriority(rpcPriority)
                .withMetadataTable(metadataTableName))
        .apply(
            "Creating " + options.getWindowDuration() + " Window",
            Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))
        .apply(
            "Write To GCS",
            FileFormatFactorySpannerChangeStreams.newBuilder().setOptions(options).build());

    return pipeline.run();
  }
}

Cloud Spanner change streams to BigQuery

Cloud Spanner change stream to BigQuery 模板是一种流处理流水线，用于流式传输 Cloud Spanner 数据更改记录并使用 Dataflow Runner V2 将其写入 BigQuery 表。

如果需要的 BigQuery 表不存在，流水线会创建这些表。否则将使用现有 BigQuery 表。现有 BigQuery 表的架构必须包含 Cloud Spanner 表的相应跟踪列以及“ignoreFields”选项未显式忽略的其他元数据列（请参阅以下列表中元数据字段的说明）。每个新的 BigQuery 行都包含变更数据流在更改记录的时间戳从 Cloud Spanner 表中的对应行中监控的所有列。

所有变更数据流监控的列都会包含在每个 BigQuery 表行中，无论它们是否被 Cloud Spanner 事务修改。未监控的列不会包含在 BigQuery 行中。任何小于 Dataflow 水印的 Cloud Spanner 更改都会成功应用于 BigQuery 表，或存储在死信队列中进行重试。与原始 Cloud Spanner 提交时间戳排序相比，BigQuery 行插入是乱序的。

以下元数据字段会添加到 BigQuery 表中：

_metadata_spanner_mod_type：从变更数据流数据更改记录中提取。
_metadata_spanner_table_name：Cloud Spanner 表名称。请注意，这不是连接器的元数据表名称。
_metadata_spanner_commit_timestamp：从变更数据流数据更改记录中提取。
_metadata_spanner_server_transaction_id：从变更数据流数据更改记录中提取。
_metadata_spanner_record_sequence：从变更数据流数据更改记录中提取。
_metadata_spanner_is_last_record_in_transaction_in_partition：从变更数据流数据更改记录中提取。
_metadata_spanner_number_of_records_in_transaction：从变更数据流数据更改记录中提取。
_metadata_spanner_number_of_partitions_in_transaction：从变更数据流数据更改记录中提取。
_metadata_big_query_commit_timestamp：行插入 BigQuery 的提交时间戳。

注意：

此模板不会将 Cloud Spanner 架构更改传播到 BigQuery。由于在 Cloud Spanner 中执行架构更改可能会中断流水线，因此您可能需要在架构更改后重新创建流水线。
对于 OLD_AND_NEW_VALUES 和 NEW_VALUES 值捕获类型，当数据更改记录包含 UPDATE 更改时，模板需要在数据更改记录的提交时间戳对 Cloud Spanner 执行过时读取，以检索未更改但受监控的列。请确保为过时读取正确配置数据库“version_retention_period”。对于 NEW_ROW 值捕获类型，模板效率更高，因为数据更改记录会捕获整个新行（包括 UPDATE 中未更新的列），并且模板不需要执行过时读取。
您可以通过在与 Cloud Spanner 实例或 BigQuery 表相同的区域运行 Dataflow 作业来最大限度地减少网络延迟和网络传输费用。如果您使用位于作业区域之外的源、接收器、暂存文件位置或临时文件位置，则数据可能会跨区域发送。详细了解 Dataflow 区域端点。
此模板支持所有有效的 Cloud Spanner 数据类型，但如果 BigQuery 类型比 Cloud Spanner 类型更精确，则在转换期间可能会发生精确率下降。具体而言：
- 对于 Cloud Spanner JSON 类型，对象的成员顺序按字典顺序排列，但对于 BigQuery JSON 类型没有此类保证。
- Cloud Spanner 支持纳秒时间戳类型，BigQuery 仅支持微秒时间戳类型。

详细了解变更数据流、如何构建变更数据流 Dataflow 流水线和最佳实践。

对此流水线的要求：

在运行流水线之前，Cloud Spanner 实例必须已存在。
在运行流水线之前，Cloud Spanner 数据库必须已存在。
在运行流水线之前，Cloud Spanner 元数据实例必须已存在。
在运行流水线之前，Cloud Spanner 元数据数据库必须已存在。
在运行流水线之前，Cloud Spanner 变更数据流必须已存在。
在运行流水线之前，BigQuery 数据集必须已存在。

模板参数

参数	说明
`spannerInstanceId`	要从中读取变更数据流的 Cloud Spanner 实例。
`spannerDatabase`	要从中读取变更数据流的 Cloud Spanner 数据库。
`spannerMetadataInstanceId`	要用于变更数据流连接器元数据表的 Cloud Spanner 实例。
`spannerMetadataDatabase`	要用于变更数据流连接器元数据表的 Cloud Spanner 数据库。
`spannerChangeStreamName`	要读取的 Cloud Spanner 变更数据流的名称。
`bigQueryDataSet`	变更数据流输出的 BigQuery 数据集。
`spannerProjectId`	（可选）从中读取变更数据流的项目。这也是创建变更数据流连接器元数据表的项目。此参数的默认项目是 Dataflow 流水线在其中运行的项目。
`spannerMetadataTableName`	（可选）要使用的 Cloud Spanner 变更数据流连接器元数据表名称。如果未提供，系统会在流水线流期间自动创建 Cloud Spanner 变更数据流连接器元数据表。更新现有流水线时必须提供此参数，其他情况下则不应提供。
`rpcPriority`	（可选）Cloud Spanner 调用的请求优先级。该值必须为 [HIGH,MEDIUM,LOW] 之一。（默认值：HIGH）
`startTimestamp`	（可选）要用于读取变更数据流的起始 DateTime（含边界值）。Ex-2021-10-12T07:20:50.52Z。默认为流水线启动时的时间戳，即当前时间。
`endTimestamp`	（可选）要用于读取变更数据流的结束 DateTime（含边界值）。Ex-2021-10-12T07:20:50.52Z。默认为未来的无限时间。
`bigQueryProjectId`	（可选）BigQuery 项目。默认为 Dataflow 作业的项目。
`bigQueryChangelogTableNameTemplate`	（可选）BigQuery 更新日志表名称的模板。默认为 {_metadata_spanner_table_name}_changelog。
`deadLetterQueueDirectory`	（可选）用于存储未处理记录以及无法处理原因的文件路径。默认值为 Dataflow 作业的临时位置下的目录。在大多数情况下，默认值就可以了。
`dlqRetryMinutes`	（可选）死信队列重试之间的分钟数。默认值为 10。
`ignoreFields`	（可选）要忽略的字段（区分大小写），以逗号分隔列表形式列出。这些字段可以是被监控表的字段，也可以是流水线添加的元数据字段。被忽略的字段不会插入到 BigQuery 中。

运行 Cloud Spanner change streams to BigQuery 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Cloud Spanner change streams to BigQuery template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_BigQuery \
    --region REGION_NAME \
    --parameters \
spannerInstanceId=SPANNER_INSTANCE_ID,\
spannerDatabase=SPANNER_DATABASE,\
spannerMetadataInstanceId=SPANNER_METADATA_INSTANCE_ID,\
spannerMetadataDatabase=SPANNER_METADATA_DATABASE,\
spannerChangeStreamName=SPANNER_CHANGE_STREAM,\
bigQueryDataset=BIGQUERY_DATASET

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
BIGQUERY_DATASET：变更数据流输出的 BigQuery 数据集

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "spannerInstanceId": "SPANNER_INSTANCE_ID",
          "spannerDatabase": "SPANNER_DATABASE",
          "spannerMetadataInstanceId": "SPANNER_METADATA_INSTANCE_ID",
          "spannerMetadataDatabase": "SPANNER_METADATA_DATABASE",
          "spannerChangeStreamName": "SPANNER_CHANGE_STREAM",
          "bigQueryDataset": "BIGQUERY_DATASET"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_BigQuery",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
BIGQUERY_DATASET：变更数据流输出的 BigQuery 数据集

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates.spannerchangestreamstobigquery;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.Timestamp;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.cdc.dlq.DeadLetterQueueManager;
import com.google.cloud.teleport.v2.cdc.dlq.StringDeadLetterQueueSanitizer;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.SpannerChangeStreamsToBigQueryOptions;
import com.google.cloud.teleport.v2.templates.spannerchangestreamstobigquery.model.Mod;
import com.google.cloud.teleport.v2.templates.spannerchangestreamstobigquery.schemautils.BigQueryUtils;
import com.google.cloud.teleport.v2.transforms.DLQWriteTransform;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.collect.ImmutableSet;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.changestreams.model.DataChangeRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// TODO(haikuo-google): Add integration test.
// TODO(haikuo-google): Add README.
// TODO(haikuo-google): Add stackdriver metrics.
// TODO(haikuo-google): Ideally side input should be used to store schema information and shared
// accrss DoFns, but since side input fix is not yet deployed at the moment, we read schema
// information in the beginning of the DoFn as a work around. We should use side input instead when
// it's available.
// TODO(haikuo-google): Test the case where tables or columns are added while the pipeline is
// running.
/**
 * This pipeline ingests {@link DataChangeRecord} from Spanner change stream. The {@link
 * DataChangeRecord} is then broken into {@link Mod}, which converted into {@link TableRow} and
 * inserted into BigQuery table.
 */
@Template(
    name = "Spanner_Change_Streams_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Cloud Spanner change streams to BigQuery",
    description =
        "Streaming pipeline. Streams Spanner data change records and writes them into BigQuery"
            + " using Dataflow Runner V2.",
    optionsClass = SpannerChangeStreamsToBigQueryOptions.class,
    flexContainerName = "spanner-changestreams-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public final class SpannerChangeStreamsToBigQuery {

  /** String/String Coder for {@link FailsafeElement}. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  private static final Logger LOG = LoggerFactory.getLogger(SpannerChangeStreamsToBigQuery.class);

  // Max number of deadletter queue retries.
  private static final int DLQ_MAX_RETRIES = 5;

  private static final String USE_RUNNER_V2_EXPERIMENT = "use_runner_v2";

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    LOG.info("Starting to replicate change records from Spanner change streams to BigQuery");

    SpannerChangeStreamsToBigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(SpannerChangeStreamsToBigQueryOptions.class);

    run(options);
  }

  private static void validateOptions(SpannerChangeStreamsToBigQueryOptions options) {
    if (options.getDlqRetryMinutes() <= 0) {
      throw new IllegalArgumentException("dlqRetryMinutes must be positive.");
    }

    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);
  }

  private static void setOptions(SpannerChangeStreamsToBigQueryOptions options) {
    options.setStreaming(true);
    options.setEnableStreamingEngine(true);

    // Add use_runner_v2 to the experiments option, since change streams connector is only supported
    // on Dataflow runner v2.
    List<String> experiments = options.getExperiments();
    if (experiments == null) {
      experiments = new ArrayList<>();
    }
    if (!experiments.contains(USE_RUNNER_V2_EXPERIMENT)) {
      experiments.add(USE_RUNNER_V2_EXPERIMENT);
    }
    options.setExperiments(experiments);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(SpannerChangeStreamsToBigQueryOptions options) {
    setOptions(options);
    validateOptions(options);

    /**
     * Stages: 1) Read {@link DataChangeRecord} from change stream. 2) Create {@link
     * FailsafeElement} of {@link Mod} JSON and merge from: - {@link DataChangeRecord}. - GCS Dead
     * letter queue. 3) Convert {@link Mod} JSON into {@link TableRow} by reading from Spanner at
     * commit timestamp. 4) Append {@link TableRow} to BigQuery. 5) Write Failures from 2), 3) and
     * 4) to GCS dead letter queue.
     */
    Pipeline pipeline = Pipeline.create(options);
    DeadLetterQueueManager dlqManager = buildDlqManager(options);
    String spannerProjectId = getSpannerProjectId(options);

    String dlqDirectory = dlqManager.getRetryDlqDirectoryWithDateTime();
    String tempDlqDirectory = dlqManager.getRetryDlqDirectory() + "tmp/";

    // Retrieve and parse the startTimestamp and endTimestamp.
    Timestamp startTimestamp =
        options.getStartTimestamp().isEmpty()
            ? Timestamp.now()
            : Timestamp.parseTimestamp(options.getStartTimestamp());
    Timestamp endTimestamp =
        options.getEndTimestamp().isEmpty()
            ? Timestamp.MAX_VALUE
            : Timestamp.parseTimestamp(options.getEndTimestamp());

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            .withHost(ValueProvider.StaticValueProvider.of(options.getSpannerHost()))
            .withProjectId(spannerProjectId)
            .withInstanceId(options.getSpannerInstanceId())
            .withDatabaseId(options.getSpannerDatabase())
            .withRpcPriority(options.getRpcPriority());

    SpannerIO.ReadChangeStream readChangeStream =
        SpannerIO.readChangeStream()
            .withSpannerConfig(spannerConfig)
            .withMetadataInstance(options.getSpannerMetadataInstanceId())
            .withMetadataDatabase(options.getSpannerMetadataDatabase())
            .withChangeStreamName(options.getSpannerChangeStreamName())
            .withInclusiveStartAt(startTimestamp)
            .withInclusiveEndAt(endTimestamp)
            .withRpcPriority(options.getRpcPriority());

    String spannerMetadataTableName = options.getSpannerMetadataTableName();
    if (spannerMetadataTableName != null) {
      readChangeStream = readChangeStream.withMetadataTable(spannerMetadataTableName);
    }

    PCollection<DataChangeRecord> dataChangeRecord =
        pipeline
            .apply("Read from Spanner Change Streams", readChangeStream)
            .apply("Reshuffle DataChangeRecord", Reshuffle.viaRandomKey());

    PCollection<FailsafeElement<String, String>> sourceFailsafeModJson =
        dataChangeRecord
            .apply("DataChangeRecord To Mod JSON", ParDo.of(new DataChangeRecordToModJsonFn()))
            .apply(
                "Wrap Mod JSON In FailsafeElement",
                ParDo.of(
                    new DoFn<String, FailsafeElement<String, String>>() {
                      @ProcessElement
                      public void process(
                          @Element String input,
                          OutputReceiver<FailsafeElement<String, String>> receiver) {
                        receiver.output(FailsafeElement.of(input, input));
                      }
                    }))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    PCollectionTuple dlqModJson =
        dlqManager.getReconsumerDataTransform(
            pipeline.apply(dlqManager.dlqReconsumer(options.getDlqRetryMinutes())));
    PCollection<FailsafeElement<String, String>> retryableDlqFailsafeModJson =
        dlqModJson.get(DeadLetterQueueManager.RETRYABLE_ERRORS).setCoder(FAILSAFE_ELEMENT_CODER);

    PCollection<FailsafeElement<String, String>> failsafeModJson =
        PCollectionList.of(sourceFailsafeModJson)
            .and(retryableDlqFailsafeModJson)
            .apply("Merge Source And DLQ Mod JSON", Flatten.pCollections());

    ImmutableSet.Builder<String> ignoreFieldsBuilder = ImmutableSet.builder();
    for (String ignoreField : options.getIgnoreFields().split(",")) {
      ignoreFieldsBuilder.add(ignoreField);
    }
    ImmutableSet<String> ignoreFields = ignoreFieldsBuilder.build();
    FailsafeModJsonToTableRowTransformer.FailsafeModJsonToTableRowOptions
        failsafeModJsonToTableRowOptions =
            FailsafeModJsonToTableRowTransformer.FailsafeModJsonToTableRowOptions.builder()
                .setSpannerConfig(spannerConfig)
                .setSpannerChangeStream(options.getSpannerChangeStreamName())
                .setIgnoreFields(ignoreFields)
                .setCoder(FAILSAFE_ELEMENT_CODER)
                .build();
    FailsafeModJsonToTableRowTransformer.FailsafeModJsonToTableRow failsafeModJsonToTableRow =
        new FailsafeModJsonToTableRowTransformer.FailsafeModJsonToTableRow(
            failsafeModJsonToTableRowOptions);

    PCollectionTuple tableRowTuple =
        failsafeModJson.apply("Mod JSON To TableRow", failsafeModJsonToTableRow);

    BigQueryDynamicDestinations.BigQueryDynamicDestinationsOptions
        bigQueryDynamicDestinationsOptions =
            BigQueryDynamicDestinations.BigQueryDynamicDestinationsOptions.builder()
                .setSpannerConfig(spannerConfig)
                .setChangeStreamName(options.getSpannerChangeStreamName())
                .setIgnoreFields(ignoreFields)
                .setBigQueryProject(getBigQueryProjectId(options))
                .setBigQueryDataset(options.getBigQueryDataset())
                .setBigQueryTableTemplate(options.getBigQueryChangelogTableNameTemplate())
                .build();
    WriteResult writeResult =
        tableRowTuple
            .get(failsafeModJsonToTableRow.transformOut)
            .apply(
                "Write To BigQuery",
                BigQueryIO.<TableRow>write()
                    .to(BigQueryDynamicDestinations.of(bigQueryDynamicDestinationsOptions))
                    .withFormatFunction(element -> removeIntermediateMetadataFields(element))
                    .withFormatRecordOnFailureFunction(element -> element)
                    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    PCollection<String> transformDlqJson =
        tableRowTuple
            .get(failsafeModJsonToTableRow.transformDeadLetterOut)
            .apply(
                "Failed Mod JSON During Table Row Transformation",
                MapElements.via(new StringDeadLetterQueueSanitizer()));

    PCollection<String> bqWriteDlqJson =
        BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResult, options)
            .apply(
                "Failed Mod JSON During BigQuery Writes",
                MapElements.via(new BigQueryDeadLetterQueueSanitizer()));

    PCollectionList.of(transformDlqJson)
        .and(bqWriteDlqJson)
        .apply("Merge Failed Mod JSON From Transform And BigQuery", Flatten.pCollections())
        .apply(
            "Write Failed Mod JSON To DLQ",
            DLQWriteTransform.WriteDLQ.newBuilder()
                .withDlqDirectory(dlqDirectory)
                .withTmpDirectory(tempDlqDirectory)
                .setIncludePaneInfo(true)
                .build());

    PCollection<FailsafeElement<String, String>> nonRetryableDlqModJsonFailsafe =
        dlqModJson.get(DeadLetterQueueManager.PERMANENT_ERRORS).setCoder(FAILSAFE_ELEMENT_CODER);

    nonRetryableDlqModJsonFailsafe
        .apply(
            "Write Mod JSON With Non-retryable Error To DLQ",
            MapElements.via(new StringDeadLetterQueueSanitizer()))
        .setCoder(StringUtf8Coder.of())
        .apply(
            DLQWriteTransform.WriteDLQ.newBuilder()
                .withDlqDirectory(dlqManager.getSevereDlqDirectoryWithDateTime())
                .withTmpDirectory(dlqManager.getSevereDlqDirectory() + "tmp/")
                .setIncludePaneInfo(true)
                .build());

    return pipeline.run();
  }

  private static DeadLetterQueueManager buildDlqManager(
      SpannerChangeStreamsToBigQueryOptions options) {
    String tempLocation =
        options.as(DataflowPipelineOptions.class).getTempLocation().endsWith("/")
            ? options.as(DataflowPipelineOptions.class).getTempLocation()
            : options.as(DataflowPipelineOptions.class).getTempLocation() + "/";
    String dlqDirectory =
        options.getDeadLetterQueueDirectory().isEmpty()
            ? tempLocation + "dlq/"
            : options.getDeadLetterQueueDirectory();

    LOG.info("Dead letter queue directory: {}", dlqDirectory);
    return DeadLetterQueueManager.create(dlqDirectory, DLQ_MAX_RETRIES);
  }

  private static String getSpannerProjectId(SpannerChangeStreamsToBigQueryOptions options) {
    return options.getSpannerProjectId().isEmpty()
        ? options.getProject()
        : options.getSpannerProjectId();
  }

  private static String getBigQueryProjectId(SpannerChangeStreamsToBigQueryOptions options) {
    return options.getBigQueryProjectId().isEmpty()
        ? options.getProject()
        : options.getBigQueryProjectId();
  }

  /**
   * Remove the following intermediate metadata fields that are not user data from {@link TableRow}:
   * _metadata_error, _metadata_retry_count, _metadata_spanner_original_payload_json.
   */
  private static TableRow removeIntermediateMetadataFields(TableRow tableRow) {
    TableRow cleanTableRow = tableRow.clone();
    Set<String> rowKeys = tableRow.keySet();
    Set<String> metadataFields = BigQueryUtils.getBigQueryIntermediateMetadataFieldNames();

    for (String rowKey : rowKeys) {
      if (metadataFields.contains(rowKey)) {
        cleanTableRow.remove(rowKey);
      }
    }

    return cleanTableRow;
  }

  /**
   * DoFn that converts a {@link DataChangeRecord} to multiple {@link Mod} in serialized JSON
   * format.
   */
  static class DataChangeRecordToModJsonFn extends DoFn<DataChangeRecord, String> {

    @ProcessElement
    public void process(@Element DataChangeRecord input, OutputReceiver<String> receiver) {
      for (org.apache.beam.sdk.io.gcp.spanner.changestreams.model.Mod changeStreamsMod :
          input.getMods()) {
        Mod mod =
            new Mod(
                changeStreamsMod.getKeysJson(),
                changeStreamsMod.getNewValuesJson(),
                input.getCommitTimestamp(),
                input.getServerTransactionId(),
                input.isLastRecordInTransactionInPartition(),
                input.getRecordSequence(),
                input.getTableName(),
                input.getModType(),
                input.getValueCaptureType(),
                input.getNumberOfRecordsInTransaction(),
                input.getNumberOfPartitionsInTransaction());

        String modJsonString;

        try {
          modJsonString = mod.toJson();
        } catch (IOException e) {
          // Ignore exception and print bad format.
          modJsonString = String.format("\"%s\"", input);
        }
        receiver.output(modJsonString);
      }
    }
  }
}

Cloud Spanner change streams to Pub/Sub

Cloud Spanner change streams to the Pub/Sub 模板是一种流处理流水线，用于使用 Dataflow Runner V2 流式传输 Cloud Spanner 数据更改记录并将其写入 Pub/Sub 主题。

如需将数据输出到新的 Pub/Sub 主题，您需要先创建主题。创建后，Pub/Sub 会自动生成订阅并将其附加到新主题。如果您尝试将数据输出到不存在的 Pub/Sub 主题，则 Dataflow 流水线会抛出异常，并且流水线会在不断尝试建立连接时卡住。

如果存在必要的 Pub/Sub 主题，则可以将数据输出到该主题。

如需了解详情，请参阅关于变更数据流、使用 Dataflow 构建变更数据流连接和变更数据流最佳实践。

对此流水线的要求：

在运行流水线之前，Cloud Spanner 实例必须已存在。
在运行流水线之前，Cloud Spanner 数据库必须已存在。
在运行流水线之前，Cloud Spanner 元数据实例必须已存在。
在运行流水线之前，Cloud Spanner 元数据数据库必须已存在。
在运行流水线之前，Cloud Spanner 变更数据流必须已存在。
在运行此流水线之前，Pub/Sub 主题必须已存在。

模板参数

参数	说明
`spannerInstanceId`	要从中读取变更数据流的 Cloud Spanner 实例。
`spannerDatabase`	要从中读取变更数据流的 Cloud Spanner 数据库。
`spannerMetadataInstanceId`	要用于变更数据流连接器元数据表的 Cloud Spanner 实例。
`spannerMetadataDatabase`	要用于变更数据流连接器元数据表的 Cloud Spanner 数据库。
`spannerChangeStreamName`	要读取的 Cloud Spanner 变更数据流的名称。
`pubsubTopic`	变更数据流输出的 Pub/Sub 主题。
`spannerProjectId`	（可选）从中读取变更数据流的项目。这也是创建变更数据流连接器元数据表的项目。此参数的默认项目是 Dataflow 流水线在其中运行的项目。
`spannerMetadataTableName`	（可选）要使用的 Cloud Spanner 变更数据流连接器元数据表名称。如果未提供，Cloud Spanner 将在流水线流更改期间自动创建流连接器元数据表。更新现有流水线时，您必须提供此参数。请勿将此参数用于其他情况。
`rpcPriority`	（可选）Cloud Spanner 调用的请求优先级。该值必须为 [HIGH,MEDIUM,LOW] 之一。（默认值：HIGH）
`startTimestamp`	（可选）要用于读取变更数据流的开始 DateTime（含边界值）。例如 ex-2021-10-12T07:20:50.52Z。默认为流水线启动时的时间戳，即当前时间。
`endTimestamp`	（可选）要用于读取变更数据流的结束 DateTime（含边界值）。例如 ex-2021-10-12T07:20:50.52Z。默认为未来的无限时间。
`outputFileFormat`	（可选）输出的格式。输出会封装在许多 PubsubMessage 中，并发送到 Pub/Sub 主题。允许的格式为 JSON 和 AVRO。默认值为 JSON。
`pubsubAPI`	（可选）用于实现流水线的 Pub/Sub API。允许的 API 包括 `pubsubio` 和 `native_client`。对于少量每秒查询次数 (QPS)，`native_client` 的延迟时间较短。对于 QPS 而言，`pubsubio` 可提供更好且更稳定的性能。默认值为 `pubsubio`。

运行 Cloud Spanner change streams to the Pub/Sub 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Cloud Spanner change streams to Pub/Sub template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

    gcloud beta dataflow flex-template run JOB_NAME \
        --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_PubSub \
        --region REGION_NAME \
        --parameters \
    spannerInstanceId=SPANNER_INSTANCE_ID,\
    spannerDatabase=SPANNER_DATABASE,\
    spannerMetadataInstanceId=SPANNER_METADATA_INSTANCE_ID,\
    spannerMetadataDatabase=SPANNER_METADATA_DATABASE,\
    spannerChangeStreamName=SPANNER_CHANGE_STREAM,\
    pubsubTopic=PUBSUB_TOPIC

替换以下内容：

JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
PUBSUB_TOPIC：变更数据流输出的 Pub/Sub 主题

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

  POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
  {
    "launch_parameter": {
        "jobName": "JOB_NAME",
        "parameters": {
            "spannerInstanceId": "SPANNER_INSTANCE_ID",
            "spannerDatabase": "SPANNER_DATABASE",
            "spannerMetadataInstanceId": "SPANNER_METADATA_INSTANCE_ID",
            "spannerMetadataDatabase": "SPANNER_METADATA_DATABASE",
            "spannerChangeStreamName": "SPANNER_CHANGE_STREAM",
            "pubsubTopic": "PUBSUB_TOPIC"
        },
        "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Spanner_Change_Streams_to_PubSub",
    }
  }

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
SPANNER_INSTANCE_ID：Cloud Spanner 实例 ID
SPANNER_DATABASE：Cloud Spanner 数据库
SPANNER_METADATA_INSTANCE_ID：Cloud Spanner 元数据实例 ID
SPANNER_METADATA_DATABASE：Cloud Spanner 元数据数据库
SPANNER_CHANGE_STREAM：Cloud Spanner 变更数据流
PUBSUB_TOPIC：变更数据流输出的 Pub/Sub 主题

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2022 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.cloud.Timestamp;
import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.SpannerChangeStreamsToPubSubOptions;
import com.google.cloud.teleport.v2.transforms.FileFormatFactorySpannerChangeStreamsToPubSub;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link SpannerChangeStreamsToPubSub} pipeline streams change stream record(s) and stores to
 * pubsub topic in user specified format. The sink data can be stored in a JSON Text or Avro data
 * format.
 */
@Template(
    name = "Spanner_Change_Streams_to_PubSub",
    category = TemplateCategory.STREAMING,
    displayName = "Cloud Spanner change streams to Pub/Sub",
    description =
        "Streaming pipeline. Streams Spanner change stream data records and writes them into a"
            + " Pub/Sub topic using Dataflow Runner V2.",
    optionsClass = SpannerChangeStreamsToPubSubOptions.class,
    flexContainerName = "spanner-changestreams-to-pubsub",
    contactInformation = "https://cloud.google.com/support")
public class SpannerChangeStreamsToPubSub {
  private static final Logger LOG = LoggerFactory.getLogger(SpannerChangeStreamsToPubSub.class);
  private static final String USE_RUNNER_V2_EXPERIMENT = "use_runner_v2";

  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    LOG.info("Starting Input Messages to Pub/Sub");

    SpannerChangeStreamsToPubSubOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SpannerChangeStreamsToPubSubOptions.class);

    run(options);
  }

  private static String getProjectId(SpannerChangeStreamsToPubSubOptions options) {
    return options.getSpannerProjectId().isEmpty()
        ? options.getProject()
        : options.getSpannerProjectId();
  }

  public static PipelineResult run(SpannerChangeStreamsToPubSubOptions options) {
    LOG.info("Requested Message Format is " + options.getOutputDataFormat());
    options.setStreaming(true);
    options.setEnableStreamingEngine(true);

    final Pipeline pipeline = Pipeline.create(options);
    // Get the Spanner project, instance, database, metadata instance, metadata database
    // change stream, pubsub topic, and pubsub api parameters.
    String projectId = getProjectId(options);
    String instanceId = options.getSpannerInstanceId();
    String databaseId = options.getSpannerDatabase();
    String metadataInstanceId = options.getSpannerMetadataInstanceId();
    String metadataDatabaseId = options.getSpannerMetadataDatabase();
    String changeStreamName = options.getSpannerChangeStreamName();
    String pubsubTopicName = options.getPubsubTopic();
    String pubsubAPI = options.getPubsubAPI();

    // Retrieve and parse the start / end timestamps.
    Timestamp startTimestamp =
        options.getStartTimestamp().isEmpty()
            ? Timestamp.now()
            : Timestamp.parseTimestamp(options.getStartTimestamp());
    Timestamp endTimestamp =
        options.getEndTimestamp().isEmpty()
            ? Timestamp.MAX_VALUE
            : Timestamp.parseTimestamp(options.getEndTimestamp());

    // Add use_runner_v2 to the experiments option, since Change Streams connector is only supported
    // on Dataflow runner v2.
    List<String> experiments = options.getExperiments();
    if (experiments == null) {
      experiments = new ArrayList<>();
    }
    if (!experiments.contains(USE_RUNNER_V2_EXPERIMENT)) {
      experiments.add(USE_RUNNER_V2_EXPERIMENT);
    }
    options.setExperiments(experiments);

    String metadataTableName =
        options.getSpannerMetadataTableName() == null
            ? null
            : options.getSpannerMetadataTableName();

    final RpcPriority rpcPriority = options.getRpcPriority();

    final String errorMessage =
        "Invalid api:" + pubsubAPI + ". Supported apis: pubsubio, native_client";

    pipeline
        .apply(
            SpannerIO.readChangeStream()
                .withSpannerConfig(
                    SpannerConfig.create()
                        .withHost(ValueProvider.StaticValueProvider.of(options.getSpannerHost()))
                        .withProjectId(projectId)
                        .withInstanceId(instanceId)
                        .withDatabaseId(databaseId))
                .withMetadataInstance(metadataInstanceId)
                .withMetadataDatabase(metadataDatabaseId)
                .withChangeStreamName(changeStreamName)
                .withInclusiveStartAt(startTimestamp)
                .withInclusiveEndAt(endTimestamp)
                .withRpcPriority(rpcPriority)
                .withMetadataTable(metadataTableName))
        .apply(
            "Convert each record to a PubsubMessage",
            FileFormatFactorySpannerChangeStreamsToPubSub.newBuilder()
                .setOutputDataFormat(options.getOutputDataFormat())
                .setProjectId(projectId)
                .setPubsubAPI(pubsubAPI)
                .setPubsubTopicName(pubsubTopicName)
                .build());
    return pipeline.run();
  }
}

MongoDB to BigQuery (CDC)

MongoDB to BigQuery CDC（变更数据捕获）模板是一种流处理流水线，可与 MongoDB 变更数据流结合使用。该流水线读取通过 MongoDB 变更数据流推送到 Pub/Sub 的 JSON 记录，并将其写入 BigQuery（由 userOption 参数指定）。

对此流水线的要求

目标 BigQuery 数据集必须已存在。
必须可从 Dataflow 工作器机器访问源 MongoDB 实例。
从 MongoDB 向 Pub/Sub 推送更改的变更数据流应该正在运行。

模板参数

参数	说明
`mongoDbUri`	MongoDB 连接 URI，格式为 `mongodb+srv://:@`。
`database`	从中读取集合的 MongoDB 数据库。例如：`my-db`。
`collection`	MongoDB 数据库中集合的名称。例如：`my-collection`。
`outputTableSpec`	要写入的 BigQuery 表。例如 `bigquery-project:dataset.output_table`。
`userOption`	`FLATTEN` 或 `NONE`。`FLATTEN` 将文档展平至第一级。`NONE` 将整个文档存储为 JSON 字符串。
`inputTopic`	要读取的 Cloud Pub/Sub 输入主题，格式为 `projects/<project>/topics/<topic>`。

运行 MongoDB to BigQuery (CDC) 模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域性端点为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the MongoDB to BigQuery (CDC) template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery_CDC \
    --parameters \
outputTableSpec=OUTPUT_TABLE_SPEC,\
mongoDbUri=MONGO_DB_URI,\
database=DATABASE,\
collection=COLLECTION,\
userOption=USER_OPTION,\
inputTopic=INPUT_TOPIC

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
OUTPUT_TABLE_SPEC：您的 BigQuery 目标表的名称。
MONGO_DB_URI：您的 MongoDB URI。
DATABASE：您的 MongoDB 数据库。
COLLECTION：您的 MongoDB 集合。
USER_OPTION：FLATTEN 或 NONE。
INPUT_TOPIC：您的 Pub/Sub 输入主题。

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "mongoDbUri": "MONGO_DB_URI",
          "database": "DATABASE",
          "collection": "COLLECTION",
          "userOption": "USER_OPTION",
          "inputTopic": "INPUT_TOPIC"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery_CDC",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域端点，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates/latest/) 中可用
- 版本名称（如 2021-09-20-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
OUTPUT_TABLE_SPEC：您的 BigQuery 目标表的名称。
MONGO_DB_URI：您的 MongoDB URI。
DATABASE：您的 MongoDB 数据库。
COLLECTION：您的 MongoDB 集合。
USER_OPTION：FLATTEN 或 NONE。
INPUT_TOPIC：您的 Pub/Sub 输入主题。

模板源代码

Java

在 GitHub 上查看反馈

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.mongodb.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.BigQueryWriteOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.JavascriptDocumentTransformerOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.MongoDbOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.PubSubOptions;
import com.google.cloud.teleport.v2.mongodb.templates.MongoDbToBigQueryCdc.Options;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.transforms.JavascriptDocumentTransformer.TransformDocumentViaJavascript;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import java.io.IOException;
import javax.script.ScriptException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.bson.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link BigQueryToMongoDbCDC} pipeline is a streaming pipeline which reads data pushed to
 * PubSub from MongoDB Changestream and outputs the resulting records to BigQuery.
 */
@Template(
    name = "MongoDB_to_BigQuery_CDC",
    category = TemplateCategory.STREAMING,
    displayName = "MongoDB to BigQuery (CDC)",
    description =
        "A streaming pipeline which reads data pushed to Pub/Sub from MongoDB Changestream and"
            + " writes the resulting records to BigQuery.",
    optionsClass = Options.class,
    flexContainerName = "mongodb-to-bigquery-cdc",
    contactInformation = "https://cloud.google.com/support")
public class MongoDbToBigQueryCdc {

  private static final Logger LOG = LoggerFactory.getLogger(MongoDbToBigQuery.class);

  /** Options interface. */
  public interface Options
      extends PipelineOptions,
          MongoDbOptions,
          PubSubOptions,
          BigQueryWriteOptions,
          JavascriptDocumentTransformerOptions,
          BigQueryStorageApiStreamingOptions {}

  /** class ParseAsDocumentsFn. */
  private static class ParseAsDocumentsFn extends DoFn<String, Document> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args)
      throws ScriptException, IOException, NoSuchMethodException {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);
    run(options);
  }

  /** Pipeline to read data from PubSub and write to MongoDB. */
  public static boolean run(Options options)
      throws ScriptException, IOException, NoSuchMethodException {
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);
    String userOption = options.getUserOption();
    String inputOption = options.getInputTopic();

    TableSchema bigquerySchema;

    if (options.getJavascriptDocumentTransformFunctionName() != null
        && options.getJavascriptDocumentTransformGcsPath() != null) {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchemaForUDF(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getJavascriptDocumentTransformGcsPath(),
              options.getJavascriptDocumentTransformFunctionName(),
              options.getUserOption());
    } else {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchema(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getUserOption());
    }

    pipeline
        .apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(inputOption))
        .apply(
            "RTransform string to document",
            ParDo.of(
                new DoFn<String, Document>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document document = Document.parse(c.element());
                    c.output(document);
                  }
                }))
        .apply(
            "UDF",
            TransformDocumentViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptDocumentTransformGcsPath())
                .setFunctionName(options.getJavascriptDocumentTransformFunctionName())
                .build())
        .apply(
            "Read and transform data",
            ParDo.of(
                new DoFn<Document, TableRow>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document document = c.element();
                    TableRow row = MongoDbUtils.getTableSchema(document, userOption);
                    c.output(row);
                  }
                }))
        .apply(
            BigQueryIO.writeTableRows()
                .to(options.getOutputTableSpec())
                .withSchema(bigquerySchema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    pipeline.run();
    return true;
  }
}