이 페이지는 Cloud Translation API를 통해 번역되었습니다.

Datastream to BigQuery(Stream) 템플릿

Datastream to BigQuery 템플릿은 Datastream 데이터를 읽고 BigQuery에 복제하는 스트리밍 파이프라인입니다. 이 템플릿은 Pub/Sub 알림을 사용하여 Cloud Storage에서 데이터를 읽고 시간으로 파티션을 나눈 BigQuery 스테이징 테이블에 복제합니다. 복제 후 템플릿은 BigQuery에서 MERGE를 실행하여 모든 변경 데이터 캡처(CDC) 변경사항을 소스 테이블의 복제본에 적용합니다.

템플릿은 복제에서 관리하는 BigQuery 테이블의 생성 및 업데이트를 처리합니다. 데이터 정의 언어(DDL)가 필요한 경우 Datastream에 대한 콜백은 소스 테이블 스키마를 추출하여 BigQuery 데이터 유형으로 변환합니다. 지원되는 작업은 다음과 같습니다.

데이터가 삽입될 때 새 테이블이 생성됩니다.
초기 값이 null인 새 열이 BigQuery 테이블에 추가됩니다.
삭제된 열은 BigQuery에서 무시되며 향후 값은 null입니다.
이름이 변경된 열은 BigQuery에 새 열로 추가됩니다.
형식 변경사항은 BigQuery로 전파되지 않습니다.

임시 BigQuery 테이블의 데이터를 기본 BigQuery 테이블에 병합할 때 템플릿이 중복 삭제를 실행하므로 한 번 이상 스트리밍 모드를 사용하여 이 파이프라인을 실행하는 것이 좋습니다. 파이프라인의 이 단계는 정확히 한 번 스트리밍 모드를 사용하는 데 추가 이점이 없다는 것을 의미합니다.

파이프라인 요구사항

데이터 복제가 가능하거나 이미 복제하고 있는 Datastream 스트림
Datastream 데이터에 Cloud Storage Pub/Sub 알림이 사용 설정되어 있습니다.
BigQuery 대상 데이터 세트가 생성되었고 Compute Engine 서비스 계정에 이에 대한 관리자 액세스 권한이 부여되었습니다.
대상 복제본 테이블을 만들 소스 테이블에 기본 키가 있어야 합니다.
MySQL 또는 Oracle 소스 데이터베이스 PostgreSQL 및 SQL Server 데이터베이스는 지원되지 않습니다.

템플릿 매개변수

필수 매개변수

inputFilePattern: Cloud Storage에서 Datastream 파일 출력을 위한 파일 위치(gs://<BUCKET_NAME>/<ROOT_PATH>/ 형식)입니다.
inputFileFormat: Datastream에서 생성한 출력 파일의 형식입니다. 허용되는 값은 avro 및 json입니다. 기본값은 avro입니다.
gcsPubSubSubscription: Cloud Storage에서 처리할 수 있는 새 파일을 Dataflow에 알리는 데 사용하는 Pub/Sub 구독으로, 형식은 projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>입니다.
outputStagingDatasetTemplate: 스테이징 테이블이 포함된 데이터 세트의 이름입니다. 이 매개변수는 템플릿(예: {_metadata_dataset}_log 또는 my_dataset_log)을 지원합니다. 일반적으로 이 매개변수는 데이터 세트 이름입니다. 기본값은 {_metadata_dataset}입니다.
outputDatasetTemplate: 복제본 테이블이 포함된 데이터 세트의 이름입니다. 이 매개변수는 템플릿(예: {_metadata_dataset} 또는 my_dataset)을 지원합니다. 일반적으로 이 매개변수는 데이터 세트 이름입니다. 기본값은 {_metadata_dataset}입니다.
deadLetterQueueDirectory: Dataflow에서 데드 레터 큐 출력을 쓰는 데 사용하는 경로입니다. 이 경로는 Datastream 파일 출력과 동일한 경로에 있으면 안 됩니다. 기본값은 empty입니다.

선택적 매개변수

streamName: 스키마 정보를 폴링할 스트림의 이름 또는 템플릿입니다. 기본값은 {_metadata_stream}입니다. 보통 기본값으로 충분합니다.
rfcStartDateTime: Cloud Storage에서 데이터를 가져오는 데 사용할 시작 DateTime입니다 (https://tools.ietf.org/html/rfc3339). 기본값은 1970-01-01T00:00:00.00Z입니다.
fileReadConcurrency: 읽을 동시 DataStream 파일 수입니다. 기본값은 10입니다.
outputProjectId: 데이터를 출력할 BigQuery 데이터 세트가 포함된 Google Cloud 프로젝트의 ID입니다. 이 매개변수의 기본값은 Dataflow 파이프라인이 실행되는 프로젝트입니다.
outputStagingTableNameTemplate: 스테이징 테이블의 이름을 지정하는 데 사용할 템플릿입니다. 예를 들면 {_metadata_table}입니다. 기본값은 {_metadata_table}_log입니다.
outputTableNameTemplate: 복제본 테이블 이름에 사용할 템플릿입니다(예: {_metadata_table}). 기본값은 {_metadata_table}입니다.
ignoreFields: BigQuery에서 무시할 필드를 쉼표로 구분하여 표시합니다. 기본값은 _metadata_stream,_metadata_schema,_metadata_table,_metadata_source,_metadata_tx_id,_metadata_dlq_reconsumed,_metadata_primary_keys,_metadata_error,_metadata_retry_count입니다. 예를 들면 _metadata_stream,_metadata_schema입니다.
mergeFrequencyMinutes: 지정된 테이블의 병합 간격(분)입니다. 기본값은 5입니다.
dlqRetryMinutes: DLQ 재시도 간격(분)입니다. 기본값은 10입니다.
dataStreamRootUrl: Datastream API 루트 URL입니다. 기본값은 https://datastream.googleapis.com/입니다.
applyMerge: 작업에 대해 MERGE 쿼리를 사용 중지할지 여부입니다. 기본값은 true입니다.
mergeConcurrency: 동시 실행되는 BigQuery MERGE 쿼리의 수입니다. applyMerge가 true로 설정된 경우에만 유효합니다. 기본값은 30입니다.
partitionRetentionDays: BigQuery 병합을 실행할 때 파티션 보관에 사용할 일 수입니다. 기본값은 1입니다.
useStorageWriteApiAtLeastOnce: 이 매개변수는 Use BigQuery Storage Write API가 사용 설정된 경우에만 적용됩니다. true인 경우 Storage Write API에 최소 1회의 시맨틱스가 사용됩니다. 그렇지 않으면 1회만 실행되는 시맨틱이 사용됩니다. 기본값은 false입니다.
javascriptTextTransformGcsPath: 사용할 JavaScript 사용자 정의 함수 (UDF)를 정의하는 .js 파일의 Cloud Storage URI입니다. 예를 들면 gs://my-bucket/my-udfs/my_file.js입니다.
javascriptTextTransformFunctionName: 사용할 JavaScript 사용자 정의 함수 (UDF)의 이름입니다. 예를 들어 JavaScript 함수가 myTransform(inJson) { /*...do stuff...*/ }이면 함수 이름은 myTransform입니다. 샘플 JavaScript UDF는 UDF 예시(https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples)를 참조하세요.
javascriptTextTransformReloadIntervalMinutes: UDF를 새로고침할 빈도(분)를 지정합니다. 값이 0보다 크면 Dataflow가 Cloud Storage에서 UDF 파일을 주기적으로 검사하고 파일이 수정된 경우 UDF를 새로고침합니다. 이 매개변수를 사용하면 파이프라인이 실행 중일 때 작업을 다시 시작할 필요 없이 UDF를 업데이트할 수 있습니다. 값이 0이면 UDF 새로고침이 사용 중지됩니다. 기본값은 0입니다.
pythonTextTransformGcsPath: 사용자 정의 함수가 포함된 Python 코드의 Cloud Storage 경로 패턴입니다. 예를 들면 gs://your-bucket/your-transforms/*.py입니다.
pythonRuntimeVersion: 이 Python UDF에 사용할 런타임 버전입니다.
pythonTextTransformFunctionName: JavaScript 파일에서 호출할 함수의 이름입니다. 문자, 숫자, 밑줄만 사용합니다. 예를 들면 transform_udf1입니다.
runtimeRetries: 런타임이 실패하기 전에 재시도되는 횟수입니다. 기본값은 5입니다.
useStorageWriteApi: true인 경우 파이프라인은 BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api)를 사용합니다. 기본값은 false입니다. 자세한 내용은 Storage Write API(https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api) 사용을 참조하세요.
numStorageWriteApiStreams: Storage Write API를 사용할 때 쓰기 스트림 수를 지정합니다. useStorageWriteApi가 true이고 useStorageWriteApiAtLeastOnce가 false이면 이 매개변수를 설정해야 합니다. 기본값은 0입니다.
storageWriteApiTriggeringFrequencySec: Storage Write API를 사용할 때 트리거 빈도를 초 단위로 지정합니다. useStorageWriteApi가 true이고 useStorageWriteApiAtLeastOnce가 false이면 이 매개변수를 설정해야 합니다.

사용자 정의 함수

선택적으로 사용자 정의 함수(UDF)를 작성하여 이 템플릿을 확장할 수 있습니다. 템플릿이 각 입력 요소에 대해 UDF를 호출합니다. 요소 페이로드는 JSON 문자열로 직렬화됩니다. 자세한 내용은 Dataflow 템플릿에 대한 사용자 정의 함수 만들기를 참조하세요.

함수 사양

UDF의 사양은 다음과 같습니다.

입력: JSON 문자열로 직렬화된 CDC 데이터입니다.

출력: BigQuery 대상 테이블의 스키마와 일치하는 JSON 문자열입니다.

템플릿 실행

콘솔

Dataflow 템플릿에서 작업 만들기 페이지로 이동합니다.

템플릿에서 작업 만들기로 이동

작업 이름 필드에 고유한 작업 이름을 입력합니다.
(선택사항): 리전 엔드포인트의 드롭다운 메뉴에서 값을 선택합니다. 기본 리전은 us-central1입니다.
Dataflow 작업을 실행할 수 있는 리전 목록은 Dataflow 위치를 참조하세요.
Dataflow 템플릿 드롭다운 메뉴에서 the Datastream to BigQuery template을 선택합니다.
제공된 매개변수 필드에 매개변수 값을 입력합니다.
선택사항: 정확히 한 번 처리에서 적어도 한 번 스트리밍 모드로 전환하려면 적어도 한 번를 선택합니다.
작업 실행을 클릭합니다.

gcloud

셸 또는 터미널에서 템플릿을 실행합니다.

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --enable-streaming-engine \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Datastream_to_BigQuery \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
gcsPubSubSubscription=GCS_SUBSCRIPTION_NAME,\
outputStagingDatasetTemplate=BIGQUERY_DATASET,\
outputDatasetTemplate=BIGQUERY_DATASET,\
outputStagingTableNameTemplate=BIGQUERY_TABLE,\
outputTableNameTemplate=BIGQUERY_TABLE_log

다음을 바꿉니다.

PROJECT_ID: Dataflow 작업을 실행하려는 Google Cloud 프로젝트 ID
JOB_NAME: 선택한 고유한 작업 이름
REGION_NAME: Dataflow 작업을 배포할 리전(예: us-central1)
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/ the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH: Datastream 데이터의 Cloud Storage 경로입니다. 예를 들면 gs://bucket/path/to/data/입니다.
GCS_SUBSCRIPTION_NAME: 변경된 파일을 읽을 Pub/Sub 구독입니다. 예를 들면 projects/my-project-id/subscriptions/my-subscription-id입니다.
BIGQUERY_DATASET: BigQuery 데이터 세트 이름입니다.
BIGQUERY_TABLE: BigQuery 테이블 템플릿입니다. 예를 들면 {_metadata_schema}_{_metadata_table}_log입니다.

API

REST API를 사용하여 템플릿을 실행하려면 HTTP POST 요청을 전송합니다. API 및 승인 범위에 대한 자세한 내용은 projects.templates.launch를 참조하세요.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {

          "inputFilePattern": "GCS_FILE_PATH",
          "gcsPubSubSubscription": "GCS_SUBSCRIPTION_NAME",
          "outputStagingDatasetTemplate": "BIGQUERY_DATASET",
          "outputDatasetTemplate": "BIGQUERY_DATASET",
          "outputStagingTableNameTemplate": "BIGQUERY_TABLE",
          "outputTableNameTemplate": "BIGQUERY_TABLE_log"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Datastream_to_BigQuery",
   }
}

다음을 바꿉니다.

PROJECT_ID: Dataflow 작업을 실행하려는 Google Cloud 프로젝트 ID
JOB_NAME: 선택한 고유한 작업 이름
LOCATION: Dataflow 작업을 배포할 리전(예: us-central1)
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/ the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH: Datastream 데이터의 Cloud Storage 경로입니다. 예를 들면 gs://bucket/path/to/data/입니다.
GCS_SUBSCRIPTION_NAME: 변경된 파일을 읽을 Pub/Sub 구독입니다. 예를 들면 projects/my-project-id/subscriptions/my-subscription-id입니다.
BIGQUERY_DATASET: BigQuery 데이터 세트 이름입니다.
BIGQUERY_TABLE: BigQuery 테이블 템플릿입니다. 예를 들면 {_metadata_schema}_{_metadata_table}_log입니다.

템플릿 소스 코드

Java

/*
 * Copyright (C) 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.transforms.StatefulRowCleaner.RowCleanerDeadLetterQueueSanitizer;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.metadata.TemplateParameter.TemplateEnumOption;
import com.google.cloud.teleport.v2.cdc.dlq.DeadLetterQueueManager;
import com.google.cloud.teleport.v2.cdc.dlq.StringDeadLetterQueueSanitizer;
import com.google.cloud.teleport.v2.cdc.mappers.BigQueryDefaultSchemas;
import com.google.cloud.teleport.v2.cdc.merge.BigQueryMerger;
import com.google.cloud.teleport.v2.cdc.merge.MergeConfiguration;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.datastream.mappers.DataStreamMapper;
import com.google.cloud.teleport.v2.datastream.mappers.MergeInfoMapper;
import com.google.cloud.teleport.v2.datastream.sources.DataStreamIO;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiStreamingOptions;
import com.google.cloud.teleport.v2.templates.DataStreamToBigQuery.Options;
import com.google.cloud.teleport.v2.transforms.DLQWriteTransform;
import com.google.cloud.teleport.v2.transforms.StatefulRowCleaner;
import com.google.cloud.teleport.v2.transforms.StatefulRowCleaner.RowCleanerDeadLetterQueueSanitizer;
import com.google.cloud.teleport.v2.transforms.UDFTextTransformer.InputUDFOptions;
import com.google.cloud.teleport.v2.transforms.UDFTextTransformer.InputUDFToTableRow;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.base.Splitter;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * This pipeline ingests DataStream data from GCS. The data is then cleaned and validated against a
 * BigQuery Table. If new columns or tables appear, they are automatically added to BigQuery. The
 * data is then inserted into BigQuery staging tables and Merged into a final replica table.
 *
 * <p>NOTE: Future versions are planned to support: Pub/Sub, GCS, or Kafka as per DataStream
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/datastream-to-bigquery/README_Cloud_Datastream_to_BigQuery.md">README</a>
 * for instructions on how to use or modify this template.
 */
@Template(
    name = "Cloud_Datastream_to_BigQuery",
    category = TemplateCategory.STREAMING,
    displayName = "Datastream to BigQuery",
    description = {
      "The Datastream to BigQuery template is a streaming pipeline that reads <a href=\"https://cloud.google.com/datastream/docs\">Datastream</a> data and replicates it into BigQuery. "
          + "The template reads data from Cloud Storage using Pub/Sub notifications and replicates it into a time partitioned BigQuery staging table. "
          + "Following replication, the template executes a MERGE in BigQuery to upsert all change data capture (CDC) changes into a replica of the source table.\n",
      "The template handles creating and updating the BigQuery tables managed by the replication. "
          + "When data definition language (DDL) is required, a callback to Datastream extracts the source table schema and translates it into BigQuery data types. Supported operations include the following:\n"
          + "- New tables are created as data is inserted.\n"
          + "- New columns are added to BigQuery tables with null initial values.\n"
          + "- Dropped columns are ignored in BigQuery and future values are null.\n"
          + "- Renamed columns are added to BigQuery as new columns.\n"
          + "- Type changes are not propagated to BigQuery."
    },
    optionsClass = Options.class,
    flexContainerName = "datastream-to-bigquery",
    documentation =
        "https://cloud.google.com/dataflow/docs/guides/templates/provided/datastream-to-bigquery",
    contactInformation = "https://cloud.google.com/support",
    requirements = {
      "A Datastream stream that is ready to or already replicating data.",
      "<a href=\"https://cloud.google.com/storage/docs/reporting-changes\">Cloud Storage Pub/Sub notifications</a> are enabled for the Datastream data.",
      "BigQuery destination datasets are created and the Compute Engine Service Account has been granted admin access to them.",
      "A primary key is necessary in the source table for the destination replica table to be created.",
      "A MySQL or Oracle source database. PostgreSQL databases are not supported."
    },
    streaming = true,
    supportsAtLeastOnce = true,
    supportsExactlyOnce = false)
public class DataStreamToBigQuery {

  private static final Logger LOG = LoggerFactory.getLogger(DataStreamToBigQuery.class);
  private static final String AVRO_SUFFIX = "avro";
  private static final String JSON_SUFFIX = "json";

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<TableRow> TRANSFORM_OUT = new TupleTag<TableRow>() {};

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** The tag for the dead-letter output of the json to table row transform. */
  public static final TupleTag<FailsafeElement<String, String>> TRANSFORM_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /**
   * Options supported by the pipeline.
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions,
          StreamingOptions,
          InputUDFOptions,
          BigQueryStorageApiStreamingOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        groupName = "Source",
        description = "File location for Datastream file output in Cloud Storage.",
        helpText =
            "The file location for Datastream file output in Cloud Storage, in the format `gs://<BUCKET_NAME>/<ROOT_PATH>/`.")
    String getInputFilePattern();

    void setInputFilePattern(String value);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {@TemplateEnumOption("avro"), @TemplateEnumOption("json")},
        description = "Datastream output file format (avro/json).",
        helpText =
            "The format of the output files produced by Datastream. Allowed values are `avro` and `json`. Defaults to `avro`.")
    @Default.String("avro")
    String getInputFileFormat();

    void setInputFileFormat(String value);

    @TemplateParameter.PubsubSubscription(
        order = 3,
        description = "The Pub/Sub subscription on the Cloud Storage bucket.",
        helpText =
            "The Pub/Sub subscription used by Cloud Storage to notify Dataflow of new files available for processing, in the format: `projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>`.")
    String getGcsPubSubSubscription();

    void setGcsPubSubSubscription(String value);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "Name or template for the stream to poll for schema information.",
        helpText =
            "The name or the template for the stream to poll for schema information. Defaults to: {_metadata_stream}. The default value is usually enough.")
    String getStreamName();

    void setStreamName(String value);

    @TemplateParameter.DateTime(
        order = 5,
        optional = true,
        description =
            "The starting DateTime used to fetch from Cloud Storage "
                + "(https://tools.ietf.org/html/rfc3339).",
        helpText =
            "The starting DateTime to use to fetch data from Cloud Storage (https://tools.ietf.org/html/rfc3339). Defaults to: `1970-01-01T00:00:00.00Z`.")
    @Default.String("1970-01-01T00:00:00.00Z")
    String getRfcStartDateTime();

    void setRfcStartDateTime(String value);

    @TemplateParameter.Integer(
        order = 6,
        optional = true,
        description = "File read concurrency",
        helpText = "The number of concurrent DataStream files to read. Default is `10`.")
    @Default.Integer(10)
    Integer getFileReadConcurrency();

    void setFileReadConcurrency(Integer value);

    @TemplateParameter.ProjectId(
        order = 7,
        optional = true,
        description = "Project Id for BigQuery datasets.",
        groupName = "Target",
        helpText =
            "The ID of the Google Cloud project that contains the BigQuery datasets to output data into. The default for this parameter is the project where the Dataflow pipeline is running.")
    String getOutputProjectId();

    void setOutputProjectId(String projectId);

    @TemplateParameter.Text(
        order = 8,
        groupName = "Target",
        description = "Name or template for the dataset to contain staging tables.",
        helpText =
            "The name of the dataset that contains staging tables. This parameter supports templates, for example `{_metadata_dataset}_log` or `my_dataset_log`. Normally, this parameter is a dataset name. Defaults to `{_metadata_dataset}`.")
    @Default.String("{_metadata_dataset}")
    String getOutputStagingDatasetTemplate();

    void setOutputStagingDatasetTemplate(String value);

    @TemplateParameter.Text(
        order = 9,
        optional = true,
        groupName = "Target",
        description = "Template for the name of staging tables.",
        helpText =
            "The template to use to name the staging tables. For example, `{_metadata_table}`. Defaults to `{_metadata_table}_log`.")
    @Default.String("{_metadata_table}_log")
    String getOutputStagingTableNameTemplate();

    void setOutputStagingTableNameTemplate(String value);

    @TemplateParameter.Text(
        order = 10,
        groupName = "Target",
        description = "Template for the dataset to contain replica tables.",
        helpText =
            "The name of the dataset that contains the replica tables. This parameter supports templates, for example `{_metadata_dataset}` or `my_dataset`. Normally, this parameter is a dataset name. Defaults to `{_metadata_dataset}`.")
    @Default.String("{_metadata_dataset}")
    String getOutputDatasetTemplate();

    void setOutputDatasetTemplate(String value);

    @TemplateParameter.Text(
        order = 11,
        groupName = "Target",
        optional = true,
        description = "Template for the name of replica tables.",
        helpText =
            "The template to use for the name of the replica tables, for example `{_metadata_table}`. Defaults to `{_metadata_table}`.")
    @Default.String("{_metadata_table}")
    String getOutputTableNameTemplate();

    void setOutputTableNameTemplate(String value);

    @TemplateParameter.Text(
        order = 12,
        optional = true,
        description = "Fields to be ignored",
        helpText =
            "Comma-separated fields to ignore in BigQuery. Defaults to: `_metadata_stream,_metadata_schema,_metadata_table,_metadata_source,_metadata_tx_id,_metadata_dlq_reconsumed,_metadata_primary_keys,_metadata_error,_metadata_retry_count`.",
        example = "_metadata_stream,_metadata_schema")
    @Default.String(
        "_metadata_stream,_metadata_schema,_metadata_table,_metadata_source,"
            + "_metadata_tx_id,_metadata_dlq_reconsumed,_metadata_primary_keys,"
            + "_metadata_error,_metadata_retry_count")
    String getIgnoreFields();

    void setIgnoreFields(String value);

    @TemplateParameter.Integer(
        order = 13,
        optional = true,
        description = "The number of minutes between merges for a given table",
        helpText = "The number of minutes between merges for a given table. Defaults to `5`.")
    @Default.Integer(5)
    Integer getMergeFrequencyMinutes();

    void setMergeFrequencyMinutes(Integer value);

    @TemplateParameter.Text(
        order = 14,
        description = "Dead letter queue directory.",
        helpText =
            "The path that Dataflow uses to write the dead-letter queue output. This path must not be in the same path as the Datastream file output. Defaults to `empty`.")
    @Default.String("")
    String getDeadLetterQueueDirectory();

    void setDeadLetterQueueDirectory(String value);

    @TemplateParameter.Integer(
        order = 15,
        optional = true,
        description = "The number of minutes between DLQ Retries.",
        helpText = "The number of minutes between DLQ Retries. Defaults to `10`.")
    @Default.Integer(10)
    Integer getDlqRetryMinutes();

    void setDlqRetryMinutes(Integer value);

    @TemplateParameter.Text(
        order = 16,
        optional = true,
        description = "Datastream API Root URL (only required for testing)",
        helpText = "The Datastream API root URL. Defaults to: https://datastream.googleapis.com/.")
    @Default.String("https://datastream.googleapis.com/")
    String getDataStreamRootUrl();

    void setDataStreamRootUrl(String value);

    @TemplateParameter.Boolean(
        order = 17,
        optional = true,
        description = "A switch to disable MERGE queries for the job.",
        helpText = "Whether to disable MERGE queries for the job. Defaults to `true`.")
    @Default.Boolean(true)
    Boolean getApplyMerge();

    void setApplyMerge(Boolean value);

    @TemplateParameter.Integer(
        order = 18,
        optional = true,
        parentName = "applyMerge",
        parentTriggerValues = {"true"},
        description = "Concurrent queries for merge.",
        helpText =
            "The number of concurrent BigQuery MERGE queries. Only effective when applyMerge is set to true. Defaults to `30`.")
    @Default.Integer(MergeConfiguration.DEFAULT_MERGE_CONCURRENCY)
    Integer getMergeConcurrency();

    void setMergeConcurrency(Integer value);

    @TemplateParameter.Integer(
        order = 19,
        optional = true,
        description = "Partition retention days.",
        helpText =
            "The number of days to use for partition retention when running BigQuery merges. Defaults to `1`.")
    @Default.Integer(MergeConfiguration.DEFAULT_PARTITION_RETENTION_DAYS)
    Integer getPartitionRetentionDays();

    void setPartitionRetentionDays(Integer value);

    @TemplateParameter.Boolean(
        order = 20,
        optional = true,
        parentName = "useStorageWriteApi",
        parentTriggerValues = {"true"},
        description = "Use at at-least-once semantics in BigQuery Storage Write API",
        helpText =
            "This parameter takes effect only if `Use BigQuery Storage Write API` is enabled. If `true`, at-least-once semantics are used for the Storage Write API. Otherwise, exactly-once semantics are used. Defaults to `false`.",
        hiddenUi = true)
    @Default.Boolean(false)
    @Override
    Boolean getUseStorageWriteApiAtLeastOnce();

    void setUseStorageWriteApiAtLeastOnce(Boolean value);
  }

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    LOG.info("Starting Input Files to BigQuery");

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    options.setStreaming(true);
    options.setEnableStreamingEngine(true);

    validateOptions(options);
    run(options);
  }

  private static void validateOptions(Options options) {
    String outputDataset = options.getOutputDatasetTemplate();
    String outputStagingDs = options.getOutputStagingDatasetTemplate();

    String outputTable = options.getOutputTableNameTemplate();
    String outputStagingTb = options.getOutputStagingTableNameTemplate();

    if (outputDataset.equals(outputStagingDs) && outputTable.equals(outputStagingTb)) {
      throw new IllegalArgumentException(
          "Can not have equal templates for output tables and staging tables.");
    }

    String inputFileFormat = options.getInputFileFormat();
    if (!(inputFileFormat.equals(AVRO_SUFFIX) || inputFileFormat.equals(JSON_SUFFIX))) {
      throw new IllegalArgumentException(
          "Input file format must be one of: avro, json or left empty - found " + inputFileFormat);
    }

    BigQueryIOUtils.validateBQStorageApiOptionsStreaming(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(Options options) {
    /*
     * Stages:
     *   1) Ingest and Normalize Data to FailsafeElement with JSON Strings
     *   2) Write JSON Strings to TableRow Collection
     *       - Optionally apply a UDF
     *   3) BigQuery Output of TableRow Data
     *     a) Map New Columns & Write to Staging Tables
     *     b) Map New Columns & Merge Staging to Target Table
     *   4) Write Failures to GCS Dead Letter Queue
     */

    Pipeline pipeline = Pipeline.create(options);
    DeadLetterQueueManager dlqManager = buildDlqManager(options);

    String bigqueryProjectId = getBigQueryProjectId(options);
    String dlqDirectory = dlqManager.getRetryDlqDirectoryWithDateTime();
    String tempDlqDir = dlqManager.getRetryDlqDirectory() + "tmp/";

    InputUDFToTableRow<String> failsafeTableRowTransformer =
        new InputUDFToTableRow<String>(
            options.getJavascriptTextTransformGcsPath(),
            options.getJavascriptTextTransformFunctionName(),
            options.getJavascriptTextTransformReloadIntervalMinutes(),
            options.getPythonTextTransformGcsPath(),
            options.getPythonTextTransformFunctionName(),
            options.getRuntimeRetries(),
            FAILSAFE_ELEMENT_CODER);

    StatefulRowCleaner statefulCleaner = StatefulRowCleaner.of();

    /*
     * Stage 1: Ingest and Normalize Data to FailsafeElement with JSON Strings
     *   a) Read DataStream data from GCS into JSON String FailsafeElements (datastreamJsonRecords)
     *   b) Reconsume Dead Letter Queue data from GCS into JSON String FailsafeElements
     *     (dlqJsonRecords)
     *   c) Flatten DataStream and DLQ Streams (jsonRecords)
     */
    PCollection<FailsafeElement<String, String>> datastreamJsonRecords =
        pipeline.apply(
            new DataStreamIO(
                    options.getStreamName(),
                    options.getInputFilePattern(),
                    options.getInputFileFormat(),
                    options.getGcsPubSubSubscription(),
                    options.getRfcStartDateTime())
                .withFileReadConcurrency(options.getFileReadConcurrency()));

    // Elements sent to the Dead Letter Queue are to be reconsumed.
    // A DLQManager is to be created using PipelineOptions, and it is in charge
    // of building pieces of the DLQ.
    PCollection<FailsafeElement<String, String>> dlqJsonRecords =
        pipeline
            .apply("DLQ Consumer/reader", dlqManager.dlqReconsumer(options.getDlqRetryMinutes()))
            .apply(
                "DLQ Consumer/cleaner",
                ParDo.of(
                    new DoFn<String, FailsafeElement<String, String>>() {
                      @ProcessElement
                      public void process(
                          @Element String input,
                          OutputReceiver<FailsafeElement<String, String>> receiver) {
                        receiver.output(FailsafeElement.of(input, input));
                      }
                    }))
            .setCoder(FAILSAFE_ELEMENT_CODER);

    PCollection<FailsafeElement<String, String>> jsonRecords =
        PCollectionList.of(datastreamJsonRecords)
            .and(dlqJsonRecords)
            .apply("Merge Datastream & DLQ", Flatten.pCollections());

    /*
     * Stage 2: Write JSON Strings to TableRow PCollectionTuple
     *   a) Optionally apply a Javascript or Python UDF
     *   b) Convert JSON String FailsafeElements to TableRow's (tableRowRecords)
     */
    PCollectionTuple tableRowRecords =
        jsonRecords.apply("UDF to TableRow/udf", failsafeTableRowTransformer);

    PCollectionTuple cleanedRows =
        tableRowRecords
            .get(failsafeTableRowTransformer.transformOut)
            .apply("UDF to TableRow/Oracle Cleaner", statefulCleaner);

    PCollection<TableRow> shuffledTableRows =
        cleanedRows
            .get(statefulCleaner.successTag)
            .apply(
                "UDF to TableRow/ReShuffle",
                Reshuffle.<TableRow>viaRandomKey().withNumBuckets(100));

    /*
     * Stage 3: BigQuery Output of TableRow Data
     *   a) Map New Columns & Write to Staging Tables (writeResult)
     *   b) Map New Columns & Merge Staging to Target Table (null)
     *
     *   failsafe: writeResult.getFailedInsertsWithErr()
     */
    // TODO(beam 2.23): InsertRetryPolicy should be CDC compliant
    Set<String> fieldsToIgnore = getFieldsToIgnore(options.getIgnoreFields());

    WriteResult writeResult =
        shuffledTableRows
            .apply(
                "Map to Staging Tables",
                new DataStreamMapper(
                        options.as(GcpOptions.class),
                        options.getOutputProjectId(),
                        options.getOutputStagingDatasetTemplate(),
                        options.getOutputStagingTableNameTemplate())
                    .withDataStreamRootUrl(options.getDataStreamRootUrl())
                    .withDefaultSchema(BigQueryDefaultSchemas.DATASTREAM_METADATA_SCHEMA)
                    .withDayPartitioning(true)
                    .withIgnoreFields(fieldsToIgnore))
            .apply(
                "Write Successful Records",
                BigQueryIO.<KV<TableId, TableRow>>write()
                    .to(new BigQueryDynamicConverters().bigQueryDynamicDestination())
                    .withFormatFunction(
                        element -> removeTableRowFields(element.getValue(), fieldsToIgnore))
                    .withFormatRecordOnFailureFunction(element -> element.getValue())
                    .withoutValidation()
                    .ignoreInsertIds()
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                    .withExtendedErrorInfo() // takes effect only when Storage Write API is off
                    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    if (options.getApplyMerge()) {
      shuffledTableRows
          .apply(
              "Map To Replica Tables",
              new DataStreamMapper(
                      options.as(GcpOptions.class),
                      options.getOutputProjectId(),
                      options.getOutputDatasetTemplate(),
                      options.getOutputTableNameTemplate())
                  .withDataStreamRootUrl(options.getDataStreamRootUrl())
                  .withDefaultSchema(BigQueryDefaultSchemas.DATASTREAM_METADATA_SCHEMA)
                  .withIgnoreFields(fieldsToIgnore))
          .apply(
              "BigQuery Merge/Build MergeInfo",
              new MergeInfoMapper(
                  bigqueryProjectId,
                  options.getOutputStagingDatasetTemplate(),
                  options.getOutputStagingTableNameTemplate(),
                  options.getOutputDatasetTemplate(),
                  options.getOutputTableNameTemplate()))
          .apply(
              "BigQuery Merge/Merge into Replica Tables",
              BigQueryMerger.of(
                  MergeConfiguration.bigQueryConfiguration()
                      .withProjectId(bigqueryProjectId)
                      .withMergeWindowDuration(
                          Duration.standardMinutes(options.getMergeFrequencyMinutes()))
                      .withMergeConcurrency(options.getMergeConcurrency())
                      .withPartitionRetention(options.getPartitionRetentionDays())));
    }

    /*
     * Stage 4: Write Failures to GCS Dead Letter Queue
     */
    PCollection<String> udfDlqJson =
        PCollectionList.of(tableRowRecords.get(failsafeTableRowTransformer.udfDeadletterOut))
            .and(tableRowRecords.get(failsafeTableRowTransformer.transformDeadletterOut))
            .apply("Transform Failures/Flatten", Flatten.pCollections())
            .apply(
                "Transform Failures/Sanitize",
                MapElements.via(new StringDeadLetterQueueSanitizer()));

    PCollection<String> rowCleanerJson =
        cleanedRows
            .get(statefulCleaner.failureTag)
            .apply(
                "Transform Failures/Oracle Cleaner Failures",
                MapElements.via(new RowCleanerDeadLetterQueueSanitizer()));

    PCollection<String> bqWriteDlqJson =
        BigQueryIOUtils.writeResultToBigQueryInsertErrors(writeResult, options)
            .apply("BigQuery Failures", MapElements.via(new BigQueryDeadLetterQueueSanitizer()));

    PCollectionList.of(udfDlqJson)
        .and(rowCleanerJson)
        .and(bqWriteDlqJson)
        .apply("Write To DLQ/Flatten", Flatten.pCollections())
        .apply(
            "Write To DLQ/Writer",
            DLQWriteTransform.WriteDLQ.newBuilder()
                .withDlqDirectory(dlqDirectory)
                .withTmpDirectory(tempDlqDir)
                .setIncludePaneInfo(true)
                .build());

    // Execute the pipeline and return the result.
    return pipeline.run();
  }

  private static Set<String> getFieldsToIgnore(String fields) {
    return new HashSet<>(Splitter.on(Pattern.compile("\\s*,\\s*")).splitToList(fields));
  }

  private static TableRow removeTableRowFields(TableRow tableRow, Set<String> ignoreFields) {
    LOG.debug("BigQuery Writes: {}", tableRow);
    TableRow cleanTableRow = tableRow.clone();
    Set<String> rowKeys = tableRow.keySet();

    for (String rowKey : rowKeys) {
      if (ignoreFields.contains(rowKey)) {
        cleanTableRow.remove(rowKey);
      }
    }

    return cleanTableRow;
  }

  private static String getBigQueryProjectId(Options options) {
    return options.getOutputProjectId() == null
        ? options.as(GcpOptions.class).getProject()
        : options.getOutputProjectId();
  }

  private static DeadLetterQueueManager buildDlqManager(Options options) {
    String tempLocation =
        options.as(DataflowPipelineOptions.class).getTempLocation().endsWith("/")
            ? options.as(DataflowPipelineOptions.class).getTempLocation()
            : options.as(DataflowPipelineOptions.class).getTempLocation() + "/";

    String dlqDirectory =
        options.getDeadLetterQueueDirectory().isEmpty()
            ? tempLocation + "dlq/"
            : options.getDeadLetterQueueDirectory();

    LOG.info("Dead-letter queue directory: {}", dlqDirectory);
    return DeadLetterQueueManager.create(dlqDirectory);
  }
}

다음 단계

분석을 위해 Datastream 및 Dataflow를 구현하는 방법을 알아보세요.
Dataflow 템플릿 알아보기
Google 제공 템플릿 목록 참조