Pub/Sub to BigQuery 템플릿

Pub/Sub to BigQuery 템플릿은 Pub/Sub에서 JSON 형식의 메시지를 읽고 BigQuery 테이블에 쓰는 스트리밍 파이프라인입니다. 선택적으로 수신 메시지를 처리하기 위해 JavaScript로 작성된 사용자 정의 함수(UDF)를 제공할 수 있습니다.

이 시나리오에 Dataflow 파이프라인을 실행하기 전에 UDF가 있는 Pub/Sub BigQuery 구독이 요구사항을 충족하는지 고려하세요.

파이프라인 요구사항

BigQuery 테이블이 있어야 하며 스키마가 포함되어야 합니다.
Pub/Sub 메시지 데이터는 JSON 형식을 사용해야 합니다. 또는 메시지 데이터를 JSON으로 변환하는 UDF를 제공해야 합니다. JSON 데이터는 BigQuery 테이블 스키마와 일치해야 합니다. 예를 들어 JSON 페이로드가 {"k1":"v1", "k2":"v2"} 형식인 경우 BigQuery 테이블에는 k1 및 k2라는 두 개의 문자열 열이 있어야 합니다.
inputSubscription 또는 inputTopic 파라미터 중 하나만 지정합니다.

템플릿 매개변수

필수 매개변수

outputTableSpec : 작성할 BigQuery 테이블이며 PROJECT_ID:DATASET_NAME.TABLE_NAME 형식입니다.

선택적 매개변수

inputTopic: 읽어올 Pub/Sub 주제이며 projects/<PROJECT_ID>/topics/<TOPIC_NAME> 형식입니다.
inputSubscription: 읽어올 Pub/Sub 구독이며 projects/<PROJECT_ID>/subscriptions/<SUBCRIPTION_NAME> 형식입니다.
outputDeadletterTable: 출력 테이블에 도달할 수 없는 메시지에 사용할 BigQuery 테이블로, PROJECT_ID:DATASET_NAME.TABLE_NAME 형식입니다. 테이블이 없으면 파이프라인이 실행될 때 생성됩니다. 이 파라미터를 지정하지 않으면 OUTPUT_TABLE_SPEC_error_records 값이 대신 사용됩니다.
useStorageWriteApiAtLeastOnce: Storage Write API를 사용할 경우 쓰기 시맨틱스를 지정합니다. 1회 이상 실행되는 시맨틱스(https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics)를 사용하려면 이 파라미터를 true로 설정합니다. 정확히 1회만 실행되는 시맨틱스를 사용하려면 파라미터를 false로 설정합니다. 이 파라미터는 useStorageWriteApi가 true인 경우에만 적용됩니다. 기본값은 false입니다.
useStorageWriteApi: true이면 파이프라인에서 BigQuery Storage Write API(https://cloud.google.com/bigquery/docs/write-api)를 사용합니다. 기본값은 false입니다. 자세한 내용은 Storage Write API(https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api) 사용을 참조하세요.
numStorageWriteApiStreams: Storage Write API를 사용할 경우 쓰기 스트림 수를 지정합니다. useStorageWriteApi가 true이고 useStorageWriteApiAtLeastOnce가 false이면 이 파라미터를 설정해야 합니다. 기본값은 0입니다.
storageWriteApiTriggeringFrequencySec: Storage Write API를 사용할 경우 트리거 빈도를 초 단위로 지정합니다. useStorageWriteApi가 true이고 useStorageWriteApiAtLeastOnce가 false이면 이 매개변수를 설정해야 합니다.
javascriptTextTransformGcsPath: 사용할 JavaScript 사용자 정의 함수(UDF)를 정의하는 .js 파일의 Cloud Storage URI입니다. 예를 들면 gs://my-bucket/my-udfs/my_file.js입니다.
javascriptTextTransformFunctionName: 사용할 JavaScript 사용자 정의 함수(UDF) 이름입니다. 예를 들어 JavaScript 함수가 myTransform(inJson) { /*...do stuff...*/ }이면 함수 이름은 myTransform입니다. 샘플 JavaScript UDF는 UDF 예시(https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples)를 참조하세요.
javascriptTextTransformReloadIntervalMinutes: UDF를 새로고침할 빈도(분)를 지정합니다. 값이 0보다 크면 Dataflow가 Cloud Storage에서 UDF 파일을 주기적으로 검사하고 파일이 수정된 경우 UDF를 새로고침합니다. 이 파라미터를 사용하면 파이프라인이 실행 중일 때 작업을 다시 시작할 필요 없이 UDF를 업데이트할 수 있습니다. 값이 0이면 UDF 새로고침이 사용 중지됩니다. 기본값은 0입니다.

사용자 정의 함수

선택적으로 사용자 정의 함수(UDF)를 작성하여 이 템플릿을 확장할 수 있습니다. 템플릿이 각 입력 요소에 대해 UDF를 호출합니다. 요소 페이로드는 JSON 문자열로 직렬화됩니다. 자세한 내용은 Dataflow 템플릿에 대한 사용자 정의 함수 만들기를 참조하세요.

함수 사양

UDF의 사양은 다음과 같습니다.

입력: JSON 문자열로 직렬화된 Pub/Sub 메시지 데이터 필드입니다.

출력: BigQuery 대상 테이블의 스키마와 일치하는 JSON 문자열입니다.

템플릿 실행

콘솔

Dataflow 템플릿에서 작업 만들기 페이지로 이동합니다.

템플릿에서 작업 만들기로 이동

작업 이름 필드에 고유한 작업 이름을 입력합니다.
(선택사항) 리전 엔드포인트의 드롭다운 메뉴에서 값을 선택합니다. 기본 리전은 us-central1입니다.
Dataflow 작업을 실행할 수 있는 리전 목록은 Dataflow 위치를 참조하세요.
Dataflow 템플릿 드롭다운 메뉴에서 the Pub/Sub to BigQuery template을 선택합니다.
제공된 파라미터 필드에 파라미터 값을 입력합니다.
(선택사항) 정확히 한 번 처리에서 적어도 한 번 스트리밍 모드로 전환하려면 적어도 한 번를 선택합니다.
작업 실행을 클릭합니다.

gcloud

셸 또는 터미널에서 템플릿을 실행합니다.

gcloud dataflow flex-template run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/flex/PubSub_to_BigQuery_Flex \
    --template-file-gcs-location REGION_NAME \
    --staging-location STAGING_LOCATION \
    --parameters \
inputTopic=projects/PROJECT_ID/topics/TOPIC_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME

다음을 바꿉니다.

JOB_NAME: 선택한 고유한 작업 이름
REGION_NAME: Dataflow 작업을 배포할 리전(예: us-central1)
VERSION: 사용할 템플릿 버전
다음 값을 사용할 수 있습니다.
- latest: 버킷의 날짜가 지정되지 않은 상위 폴더(gs://dataflow-templates-REGION_NAME/latest/)에서 사용할 수 있는 최신 버전의 템플릿을 사용합니다.
- 버전 이름(예: 2023-09-12-00_RC00): 버킷의 날짜가 지정된 해당 상위 폴더(gs://dataflow-templates-REGION_NAME/)에 중첩되어 있는 특정 버전의 템플릿을 사용합니다.
주의: 최신 버전의 템플릿이 브레이킹 체인지로 업데이트될 수 있습니다. 프로덕션 환경에서는 이러한 브레이킹 체인지가 프로덕션 워크플로에 영향을 미치지 않도록 최신 날짜가 지정된 상위 폴더에 보관된 템플릿을 사용해야 합니다.
STAGING_LOCATION: 로컬 파일의 스테이징 위치입니다(예: gs://your-bucket/staging).
TOPIC_NAME: Pub/Sub 주제 이름
DATASET: BigQuery 데이터 세트
TABLE_NAME: BigQuery 테이블 이름

API

REST API를 사용하여 템플릿을 실행하려면 HTTP POST 요청을 전송합니다. API 및 승인 범위에 대한 자세한 내용은 projects.templates.launch를 참조하세요.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
       "inputTopic": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME",
       "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/PubSub_to_BigQuery_Flex",
   }
}

다음을 바꿉니다.

PROJECT_ID: Dataflow 작업을 실행하려는 Google Cloud 프로젝트 ID
JOB_NAME: 선택한 고유한 작업 이름
LOCATION: Dataflow 작업을 배포할 리전(예: us-central1)
VERSION: 사용할 템플릿 버전
다음 값을 사용할 수 있습니다.
- latest: 버킷의 날짜가 지정되지 않은 상위 폴더(gs://dataflow-templates-REGION_NAME/latest/)에서 사용할 수 있는 최신 버전의 템플릿을 사용합니다.
- 버전 이름(예: 2023-09-12-00_RC00): 버킷의 날짜가 지정된 해당 상위 폴더(gs://dataflow-templates-REGION_NAME/)에 중첩되어 있는 특정 버전의 템플릿을 사용합니다.
주의: 최신 버전의 템플릿이 브레이킹 체인지로 업데이트될 수 있습니다. 프로덕션 환경에서는 이러한 브레이킹 체인지가 프로덕션 워크플로에 영향을 미치지 않도록 최신 날짜가 지정된 상위 폴더에 보관된 템플릿을 사용해야 합니다.
STAGING_LOCATION: 로컬 파일의 스테이징 위치입니다(예: gs://your-bucket/staging).
TOPIC_NAME: Pub/Sub 주제 이름
DATASET: BigQuery 데이터 세트
TABLE_NAME: BigQuery 테이블 이름

템플릿 소스 코드

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.templates.PubSubToBigQuery.Options;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.FailsafeJsonToTableRow;
import com.google.cloud.teleport.v2.transforms.ErrorConverters;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.FailsafeJavascriptUdf;
import com.google.cloud.teleport.v2.transforms.PythonExternalTextTransformer;
import com.google.cloud.teleport.v2.transforms.PythonExternalTextTransformer.PythonExternalTextTransformerOptions;
import com.google.cloud.teleport.v2.transforms.PythonExternalTextTransformer.RowToPubSubFailsafeElementFn;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.utils.ResourceUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.base.Strings;
import com.google.common.collect.ImmutableList;
import java.nio.charset.StandardCharsets;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToBigQuery} pipeline is a streaming pipeline which ingests data in JSON format
 * from Cloud Pub/Sub, executes a UDF, and outputs the resulting records to BigQuery. Any errors
 * which occur in the transformation of the data or execution of the UDF will be output to a
 * separate errors table in BigQuery. The errors table will be created if it does not exist prior to
 * execution. Both output and error tables are specified by the user as template parameters.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>The Pub/Sub topic exists.
 *   <li>The BigQuery output table exists.
 * </ul>
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-googlecloud/README_PubSub_to_BigQuery_Flex.md">README</a>
 * for instructions on how to use or modify this template.
 */
@MultiTemplate({
  @Template(
      name = "PubSub_to_BigQuery_Flex",
      category = TemplateCategory.STREAMING,
      displayName = "Pub/Sub to BigQuery",
      description =
          "The Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic or subscription, and writes them to a BigQuery table. "
              + "You can use the template as a quick solution to move Pub/Sub data to BigQuery. "
              + "The template reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.",
      optionsClass = Options.class,
      skipOptions = {
        "pythonExternalTextTransformGcsPath",
        "pythonExternalTextTransformFunctionName",
      },
      flexContainerName = "pubsub-to-bigquery",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-bigquery",
      contactInformation = "https://cloud.google.com/support",
      requirements = {
        "The <a href=\"https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage\">`data` field</a> of Pub/Sub messages must use the JSON format, described in this <a href=\"https://developers.google.com/api-client-library/java/google-http-java-client/json\">JSON guide</a>. For example, messages with values in the `data` field formatted as `{\"k1\":\"v1\", \"k2\":\"v2\"}` can be inserted into a BigQuery table with two columns, named `k1` and `k2`, with a string data type.",
        "The output table must exist prior to running the pipeline. The table schema must match the input JSON objects."
      },
      streaming = true,
      supportsAtLeastOnce = true,
      supportsExactlyOnce = true),
  @Template(
      name = "PubSub_to_BigQuery_Xlang",
      category = TemplateCategory.STREAMING,
      displayName = "Pub/Sub to BigQuery with Python UDFs",
      type = Template.TemplateType.XLANG,
      description =
          "The Pub/Sub to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic or subscription, and writes them to a BigQuery table. "
              + "You can use the template as a quick solution to move Pub/Sub data to BigQuery. "
              + "The template reads JSON-formatted messages from Pub/Sub and converts them to BigQuery elements.",
      optionsClass = Options.class,
      skipOptions = {
        "javascriptTextTransformGcsPath",
        "javascriptTextTransformFunctionName",
        "javascriptTextTransformReloadIntervalMinutes"
      },
      flexContainerName = "pubsub-to-bigquery-xlang",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-bigquery",
      contactInformation = "https://cloud.google.com/support",
      requirements = {
        "The <a href=\"https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage\">`data` field</a> of Pub/Sub messages must use the JSON format, described in this <a href=\"https://developers.google.com/api-client-library/java/google-http-java-client/json\">JSON guide</a>. For example, messages with values in the `data` field formatted as `{\"k1\":\"v1\", \"k2\":\"v2\"}` can be inserted into a BigQuery table with two columns, named `k1` and `k2`, with a string data type.",
        "The output table must exist prior to running the pipeline. The table schema must match the input JSON objects."
      },
      streaming = true,
      supportsAtLeastOnce = true,
      supportsExactlyOnce = true)
})
public class PubSubToBigQuery {

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToBigQuery.class);

  /** The tag for the main output for the UDF. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** The tag for the dead-letter output of the udf. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> UDF_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The default suffix for error tables if dead letter table is not specified. */
  public static final String DEFAULT_DEADLETTER_TABLE_SUFFIX = "_error_records";

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /**
   * The {@link Options} class provides the custom execution options passed by the executor at the
   * command-line.
   */
  public interface Options
      extends PipelineOptions,
          BigQueryStorageApiStreamingOptions,
          PythonExternalTextTransformerOptions,
          DataflowPipelineWorkerPoolOptions {
    @TemplateParameter.BigQueryTable(
        order = 1,
        groupName = "Target",
        description = "BigQuery output table",
        helpText =
            "The BigQuery table to write to, formatted as `PROJECT_ID:DATASET_NAME.TABLE_NAME`.")
    String getOutputTableSpec();

    void setOutputTableSpec(String value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        groupName = "Source",
        optional = true,
        description = "Input Pub/Sub topic",
        helpText =
            "The Pub/Sub topic to read from, formatted as `projects/<PROJECT_ID>/topics/<TOPIC_NAME>`.")
    String getInputTopic();

    void setInputTopic(String value);

    @TemplateParameter.PubsubSubscription(
        order = 3,
        groupName = "Source",
        optional = true,
        description = "Pub/Sub input subscription",
        helpText =
            "The Pub/Sub subscription to read from, "
                + "formatted as `projects/<PROJECT_ID>/subscriptions/<SUBCRIPTION_NAME>`.")
    String getInputSubscription();

    void setInputSubscription(String value);

    @TemplateParameter.BigQueryTable(
        order = 4,
        optional = true,
        description =
            "Table for messages failed to reach the output table (i.e., Deadletter table)",
        helpText =
            "The BigQuery table to use for messages that failed to reach the output table, "
                + "formatted as `PROJECT_ID:DATASET_NAME.TABLE_NAME`. If the table "
                + "doesn't exist, it is created when the pipeline runs. "
                + "If this parameter is not specified, "
                + "the value `OUTPUT_TABLE_SPEC_error_records` is used instead.")
    String getOutputDeadletterTable();

    void setOutputDeadletterTable(String value);

    @TemplateParameter.Boolean(
        order = 5,
        optional = true,
        parentName = "useStorageWriteApi",
        parentTriggerValues = {"true"},
        description = "Use at at-least-once semantics in BigQuery Storage Write API",
        helpText =
            "When using the Storage Write API, specifies the write semantics. "
                + "To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics)"
                + ", set this parameter to true. "
                + "To use exactly-once semantics, set the parameter to `false`. "
                + "This parameter applies only when `useStorageWriteApi` is `true`. "
                + "The default value is `false`.")
    @Default.Boolean(false)
    @Override
    Boolean getUseStorageWriteApiAtLeastOnce();

    void setUseStorageWriteApiAtLeastOnce(Boolean value);
  }

  /**
   * The main entry-point for pipeline execution. This method will start the pipeline but will not
   * wait for it's execution to finish. If blocking execution is required, use the {@link
   * PubSubToBigQuery#run(Options)} method to start the pipeline and invoke {@code
   * result.waitUntilFinish()} on the {@link PipelineResult}.
   *
   * @param args The command-line args passed by the executor.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);
    //    options.setWorkerDiskType(
    //
    // "compute.googleapis.com/projects/cloud-teleport-testing/zones/us-central1-a/diskTypes/t2a-test");

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options. This method does not wait until the
   * pipeline is finished before returning. Invoke {@code result.waitUntilFinish()} on the result
   * object to block until the pipeline is finished running if blocking programmatic execution is
   * required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    boolean useInputSubscription = !Strings.isNullOrEmpty(options.getInputSubscription());
    boolean useInputTopic = !Strings.isNullOrEmpty(options.getInputTopic());
    if (useInputSubscription == useInputTopic) {
      throw new IllegalArgumentException(
          "Either input topic or input subscription must be provided, but not both.");
    }

    Pipeline pipeline = Pipeline.create(options);

    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps:
     *  1) Read messages in from Pub/Sub
     *  2) Transform the PubsubMessages into TableRows
     *     - Transform message payload via UDF
     *     - Convert UDF result to TableRow objects
     *  3) Write successful records out to BigQuery
     *  4) Write failed records out to BigQuery
     */

    /*
     * Step #1: Read messages in from Pub/Sub
     * Either from a Subscription or Topic
     */

    PCollection<PubsubMessage> messages = null;
    if (useInputSubscription) {
      messages =
          pipeline.apply(
              "ReadPubSubSubscription",
              PubsubIO.readMessagesWithAttributes()
                  .fromSubscription(options.getInputSubscription()));
    } else {
      messages =
          pipeline.apply(
              "ReadPubSubTopic",
              PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()));
    }

    PCollectionTuple convertedTableRows =
        messages
            /*
             * Step #2: Transform the PubsubMessages into TableRows
             */
            .apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options));

    /*
     * Step #3: Write the successful records out to BigQuery
     */
    WriteResult writeResult =
        convertedTableRows
            .get(TRANSFORM_OUT)
            .apply(
                "WriteSuccessfulRecords",
                BigQueryIO.writeTableRows()
                    .withoutValidation()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo()
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                    .to(options.getOutputTableSpec()));

    /*
     * Step 3 Contd.
     * Elements that failed inserts into BigQuery are extracted and converted to FailsafeElement
     */
    PCollection<FailsafeElement<String, String>> failedInserts =
        BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResult, options)
            .apply(
                "WrapInsertionErrors",
                MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
                    .via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    /*
     * Step #4: Write records that failed table row transformation
     * or conversion out to BigQuery deadletter table.
     */
    PCollectionList.of(
            ImmutableList.of(
                convertedTableRows.get(UDF_DEADLETTER_OUT),
                convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
        .apply("Flatten", Flatten.pCollections())
        .apply(
            "WriteFailedRecords",
            ErrorConverters.WritePubsubMessageErrors.newBuilder()
                .setErrorRecordsTable(
                    !Strings.isNullOrEmpty(options.getOutputDeadletterTable())
                        ? options.getOutputDeadletterTable()
                        : options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX)
                .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
                .build());

    // 5) Insert records that failed insert into deadletter table
    failedInserts.apply(
        "WriteFailedRecords",
        ErrorConverters.WriteStringMessageErrors.newBuilder()
            .setErrorRecordsTable(
                !Strings.isNullOrEmpty(options.getOutputDeadletterTable())
                    ? options.getOutputDeadletterTable()
                    : options.getOutputTableSpec() + DEFAULT_DEADLETTER_TABLE_SUFFIX)
            .setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
            .build());

    return pipeline.run();
  }

  /**
   * The {@link PubsubMessageToTableRow} class is a {@link PTransform} which transforms incoming
   * {@link PubsubMessage} objects into {@link TableRow} objects for insertion into BigQuery while
   * applying an optional UDF to the input. The executions of the UDF and transformation to {@link
   * TableRow} objects is done in a fail-safe way by wrapping the element with it's original payload
   * inside the {@link FailsafeElement} class. The {@link PubsubMessageToTableRow} transform will
   * output a {@link PCollectionTuple} which contains all output and dead-letter {@link
   * PCollection}.
   *
   * <p>The {@link PCollectionTuple} output will contain the following {@link PCollection}:
   *
   * <ul>
   *   <li>{@link PubSubToBigQuery#UDF_OUT} - Contains all {@link FailsafeElement} records
   *       successfully processed by the optional UDF.
   *   <li>{@link PubSubToBigQuery#UDF_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which failed processing during the UDF execution.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_OUT} - Contains all records successfully converted from
   *       JSON to {@link TableRow} objects.
   *   <li>{@link PubSubToBigQuery#TRANSFORM_DEADLETTER_OUT} - Contains all {@link FailsafeElement}
   *       records which couldn't be converted to table rows.
   * </ul>
   */
  static class PubsubMessageToTableRow
      extends PTransform<PCollection<PubsubMessage>, PCollectionTuple> {

    private final Options options;

    PubsubMessageToTableRow(Options options) {
      this.options = options;
    }

    @Override
    public PCollectionTuple expand(PCollection<PubsubMessage> input) {
      boolean useJavascriptUdf =
          !Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath());
      boolean usePythonUdf =
          !Strings.isNullOrEmpty(options.getPythonExternalTextTransformGcsPath());
      if (useJavascriptUdf && usePythonUdf) {
        throw new IllegalArgumentException(
            "Either javascript or Python gcs path must be provided, but not both.");
      }
      PCollectionTuple udfOut;
      if (usePythonUdf) {
        PCollection<Row> udfRowsOut =
            input
                // Map the incoming messages into FailsafeElements so we can recover from failures
                // across multiple transforms.
                .apply(
                    "MapToRecord",
                    PythonExternalTextTransformer.FailsafeRowPythonExternalUdf
                        .pubSubMappingFunction())
                .setRowSchema(PythonExternalTextTransformer.FailsafeRowPythonExternalUdf.ROW_SCHEMA)
                .apply(
                    "InvokeUDF",
                    PythonExternalTextTransformer.FailsafePythonExternalUdf.newBuilder()
                        .setFileSystemPath(options.getPythonExternalTextTransformGcsPath())
                        .setFunctionName(options.getPythonExternalTextTransformFunctionName())
                        .build());
        udfOut =
            udfRowsOut.apply(
                "MapRowsToFailsafeElements",
                ParDo.of(new RowToPubSubFailsafeElementFn(UDF_OUT, UDF_DEADLETTER_OUT))
                    .withOutputTags(UDF_OUT, TupleTagList.of(UDF_DEADLETTER_OUT)));
      } else {
        udfOut =
            input
                // Map the incoming messages into FailsafeElements so we can recover from failures
                // across multiple transforms.
                .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
                .apply(
                    "InvokeUDF",
                    FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                        .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                        .setFunctionName(options.getJavascriptTextTransformFunctionName())
                        .setReloadIntervalMinutes(
                            options.getJavascriptTextTransformReloadIntervalMinutes())
                        .setSuccessTag(UDF_OUT)
                        .setFailureTag(UDF_DEADLETTER_OUT)
                        .build());
      }

      // Convert the records which were successfully processed by the UDF into TableRow objects.
      PCollectionTuple jsonToTableRowOut =
          udfOut
              .get(UDF_OUT)
              .apply(
                  "JsonToTableRow",
                  FailsafeJsonToTableRow.<PubsubMessage>newBuilder()
                      .setSuccessTag(TRANSFORM_OUT)
                      .setFailureTag(TRANSFORM_DEADLETTER_OUT)
                      .build());

      // Re-wrap the PCollections so we can return a single PCollectionTuple
      return PCollectionTuple.of(UDF_OUT, udfOut.get(UDF_OUT))
          .and(UDF_DEADLETTER_OUT, udfOut.get(UDF_DEADLETTER_OUT))
          .and(TRANSFORM_OUT, jsonToTableRowOut.get(TRANSFORM_OUT))
          .and(TRANSFORM_DEADLETTER_OUT, jsonToTableRowOut.get(TRANSFORM_DEADLETTER_OUT));
    }
  }

  /**
   * The {@link PubsubMessageToFailsafeElementFn} wraps an incoming {@link PubsubMessage} with the
   * {@link FailsafeElement} class so errors can be recovered from and the original message can be
   * output to a error records table.
   */
  static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }
}

다음 단계

Dataflow 템플릿 알아보기
Google 제공 템플릿 목록 참조