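Google-provided Dataflow batch templates

Google provides a set of open source Dataflow templates.

These Dataflow templates can help you handle large data tasks, including data import, data export, data backup, data restore, and bulk API operations, all without the use of a dedicated development environment. The templates are built on Apache Beam and use Dataflow to transform the data.

For general information about templates, see Dataflow templates. For a list of all Google-provided templates, see Get started with Google-provided templates.

This guide documents the batch templates.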

BigQuery to Cloud Storage TFRecords

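The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation split percentages. By default, the split is 1 (100%) for the training set and 0 (0%) for the testing and validation sets. When you set the dataset split, the training, testing, and validation values must add up to 1 (100%), for example 0.6 + 0.2 + 0.2. Dataflow automatically determines the optimal number of shards for each output dataset.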
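The split rule described above can be sketched as follows. This is a minimal illustration, not the template's implementation: the class name `SplitAssigner`, the method `assign`, and the `EPSILON` tolerance are hypothetical names introduced here. The tolerance is an assumption of this sketch; it avoids rejecting valid inputs because of floating-point rounding.

```java
import java.util.Random;

/** Hypothetical helper: maps a random draw to a training, testing, or validation bucket. */
final class SplitAssigner {
  // Tolerance for the sum check. This is an assumption of the sketch, not a template setting.
  private static final float EPSILON = 1e-6f;

  /** Returns 0 (training), 1 (testing), or 2 (validation) for a draw in [0, 1). */
  static int assign(double draw, float train, float test, float validation) {
    if (Math.abs(train + test + validation - 1.0f) > EPSILON) {
      throw new IllegalArgumentException(
          String.format("Splits %.2f + %.2f + %.2f must add up to 1", train, test, validation));
    }
    if (draw < train) {
      return 0;
    }
    return draw < train + test ? 1 : 2;
  }

  public static void main(String[] args) {
    Random rand = new Random(100); // fixed seed so that runs are repeatable
    int[] counts = new int[3];
    for (int i = 0; i < 10_000; i++) {
      counts[assign(rand.nextDouble(), 0.6f, 0.2f, 0.2f)]++;
    }
    System.out.printf("train=%d test=%d validation=%d%n", counts[0], counts[1], counts[2]);
  }
}
```

Because each record is assigned by an independent random draw, the resulting set sizes approximate the requested percentages rather than matching them exactly.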

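Requirements for this pipeline:

The BigQuery dataset and table must exist.
The output Cloud Storage bucket must exist before the pipeline runs. The training, testing, and validation subdirectories don't need to exist beforehand; they are generated automatically.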

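Template parameters

Parameter	Description
`readQuery`	A BigQuery SQL query that extracts data from the source. For example, `select * from dataset1.sample_table`.
`outputDirectory`	The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, `gs://mybucket/output`. Subdirectories for the resulting training, testing, and validation TFRecord files are generated automatically from `outputDirectory`. For example, `gs://mybucket/output/train`.
`trainingPercentage`	(Optional) The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%.
`testingPercentage`	(Optional) The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%.
`validationPercentage`	(Optional) The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%.
`outputSuffix`	(Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is `.tfrecord`.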

Running the BigQuery to Cloud Storage TFRecord files template

Console

Go to the Dataflow Create job from template page.

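In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to TFRecords template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: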

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region REGION_NAME \
    --parameters \
readQuery=READ_QUERY,\
outputDirectory=OUTPUT_DIRECTORY,\
trainingPercentage=TRAINING_PERCENTAGE,\
testingPercentage=TESTING_PERCENTAGE,\
validationPercentage=VALIDATION_PERCENTAGE,\
outputSuffix=OUTPUT_FILENAME_SUFFIX

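Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep these changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
READ_QUERY: the BigQuery query to run
OUTPUT_DIRECTORY: the Cloud Storage path prefix for the output datasets
TRAINING_PERCENTAGE: the split for the training dataset, as a decimal fraction
TESTING_PERCENTAGE: the split for the testing dataset, as a decimal fraction
VALIDATION_PERCENTAGE: the split for the validation dataset, as a decimal fraction
OUTPUT_FILENAME_SUFFIX: the preferred file suffix for the output TensorFlow Record files

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.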

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
   "jobName": "JOB_NAME",
   "parameters": {
       "readQuery":"READ_QUERY",
       "outputDirectory":"OUTPUT_DIRECTORY",
       "trainingPercentage":"TRAINING_PERCENTAGE",
       "testingPercentage":"TESTING_PERCENTAGE",
       "validationPercentage":"VALIDATION_PERCENTAGE",
       "outputSuffix":"OUTPUT_FILENAME_SUFFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

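Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep these changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
READ_QUERY: the BigQuery query to run
OUTPUT_DIRECTORY: the Cloud Storage path prefix for the output datasets
TRAINING_PERCENTAGE: the split for the training dataset, as a decimal fraction
TESTING_PERCENTAGE: the split for the testing dataset, as a decimal fraction
VALIDATION_PERCENTAGE: the split for the validation dataset, as a decimal fraction
OUTPUT_FILENAME_SUFFIX: the preferred file suffix for the output TensorFlow Record files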
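To make the placeholder substitution concrete, the following sketch assembles the request URL and JSON body for the call described above. It is an illustration under stated assumptions: `LaunchRequestBuilder`, `launchUrl`, `launchBody`, and `quote` are hypothetical names, the optional `environment` block is left out, and the JSON escaping only covers backslashes and double quotes. Sending the request and obtaining an access token are out of scope here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that fills in the placeholders of the templates:launch request. */
final class LaunchRequestBuilder {

  /** Builds the launch URL for the TFRecords template from the placeholder values. */
  static String launchUrl(String projectId, String location, String version) {
    return String.format(
        "https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/templates:launch"
            + "?gcsPath=gs://dataflow-templates/%s/Cloud_BigQuery_to_GCS_TensorFlow_Records",
        projectId, location, version);
  }

  /** Builds the JSON body with the job name and the template parameters. */
  static String launchBody(String jobName, Map<String, String> parameters) {
    String params =
        parameters.entrySet().stream()
            .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
            .collect(Collectors.joining(","));
    return "{\"jobName\":" + quote(jobName) + ",\"parameters\":{" + params + "}}";
  }

  // Minimal escaping for this sketch: handles backslashes and double quotes only.
  private static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("readQuery", "select * from dataset1.sample_table");
    parameters.put("outputDirectory", "gs://mybucket/output");
    parameters.put("trainingPercentage", "0.6");
    parameters.put("testingPercentage", "0.2");
    parameters.put("validationPercentage", "0.2");

    System.out.println(launchUrl("my-project", "us-central1", "latest"));
    System.out.println(launchBody("bq-to-tfrecords-example", parameters));
  }
}
```

All parameter values are passed as strings, which matches the quoted placeholders in the request body shown above.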

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.BigQueryToTFRecord.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters.BigQueryReadOptions;
import com.google.protobuf.ByteString;
import java.util.Iterator;
import java.util.Random;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.annotations.VisibleForTesting;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

/**
 * Dataflow template which reads BigQuery data and writes it to GCS as a set of TFRecords. The
 * source is a SQL query.
 */
@Template(
    name = "Cloud_BigQuery_to_GCS_TensorFlow_Records",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to TensorFlow Records",
    description =
        "A pipeline that reads rows from BigQuery and writes them as TFRecords in Cloud Storage. (NOTE: Nested BigQuery columns are currently not supported and should be unnested within the SQL query.)",
    optionsClass = Options.class,
    optionsOrder = {BigQueryReadOptions.class, Options.class},
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToTFRecord {

  private static final String TRAIN = "train/";
  private static final String TEST = "test/";
  private static final String VAL = "val/";

  /**
   * The {@link BigQueryToTFRecord#buildFeatureFromIterator(Class, Object, Feature.Builder)} method
   * handles {@link GenericData.Array} that are passed into the {@link
   * BigQueryToTFRecord#buildFeature} method creating a TensorFlow feature from the record.
   */
  private static void buildFeatureFromIterator(
      Class<?> fieldType, Object field, Feature.Builder feature) {
    ByteString byteString;
    GenericData.Array f = (GenericData.Array) field;
    if (fieldType == Long.class) {
      Iterator<Long> longIterator = f.iterator();
      while (longIterator.hasNext()) {
        Long longValue = longIterator.next();
        feature.getInt64ListBuilder().addValue(longValue);
      }
    } else if (fieldType == double.class) {
      Iterator<Double> doubleIterator = f.iterator();
      while (doubleIterator.hasNext()) {
        double doubleValue = doubleIterator.next();
        feature.getFloatListBuilder().addValue((float) doubleValue);
      }
    } else if (fieldType == String.class) {
      Iterator<Utf8> stringIterator = f.iterator();
      while (stringIterator.hasNext()) {
        String stringValue = stringIterator.next().toString();
        byteString = ByteString.copyFromUtf8(stringValue);
        feature.getBytesListBuilder().addValue(byteString);
      }
    } else if (fieldType == boolean.class) {
      Iterator<Boolean> booleanIterator = f.iterator();
      while (booleanIterator.hasNext()) {
        Boolean boolValue = booleanIterator.next();
        int boolAsInt = boolValue ? 1 : 0;
        feature.getInt64ListBuilder().addValue(boolAsInt);
      }
    }
  }

  /**
   * The {@link BigQueryToTFRecord#buildFeature} method takes in an individual field and type
   * corresponding to a column value from a SchemaAndRecord Object returned from a BigQueryIO.read()
   * step. The method builds a TensorFlow Feature based on the type of the object, for example
   * STRING, TIME, or INTEGER.
   */
  private static Feature buildFeature(Object field, String type) {
    Feature.Builder feature = Feature.newBuilder();
    ByteString byteString;

    switch (type) {
      case "STRING":
      case "TIME":
      case "DATE":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(String.class, field, feature);
        } else {
          byteString = ByteString.copyFromUtf8(field.toString());
          feature.getBytesListBuilder().addValue(byteString);
        }
        break;
      case "BYTES":
        byteString = ByteString.copyFrom((byte[]) field);
        feature.getBytesListBuilder().addValue(byteString);
        break;
      case "INTEGER":
      case "INT64":
      case "TIMESTAMP":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(Long.class, field, feature);
        } else {
          feature.getInt64ListBuilder().addValue((long) field);
        }
        break;
      case "FLOAT":
      case "FLOAT64":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(double.class, field, feature);
        } else {
          feature.getFloatListBuilder().addValue((float) (double) field);
        }
        break;
      case "BOOLEAN":
      case "BOOL":
        if (field instanceof GenericData.Array) {
          buildFeatureFromIterator(boolean.class, field, feature);
        } else {
          int boolAsInt = (boolean) field ? 1 : 0;
          feature.getInt64ListBuilder().addValue(boolAsInt);
        }
        break;
      default:
        throw new RuntimeException("Unsupported type: " + type);
    }
    return feature.build();
  }

  /**
   * The {@link BigQueryToTFRecord#record2Example(SchemaAndRecord)} method takes in a
   * SchemaAndRecord Object returned from a BigQueryIO.read() step and builds a TensorFlow Example
   * from the record.
   */
  @VisibleForTesting
  protected static byte[] record2Example(SchemaAndRecord schemaAndRecord) {
    Example.Builder example = Example.newBuilder();
    Features.Builder features = example.getFeaturesBuilder();
    GenericRecord record = schemaAndRecord.getRecord();
    for (TableFieldSchema field : schemaAndRecord.getTableSchema().getFields()) {
      Object fieldValue = record.get(field.getName());
      if (fieldValue != null) {
        Feature feature = buildFeature(fieldValue, field.getType());
        features.putFeature(field.getName(), feature);
      }
    }
    return example.build().toByteArray();
  }

  /**
   * The {@link BigQueryToTFRecord#concatURI} method takes in a Cloud Storage URI and a
   * subdirectory name and safely concatenates them. The resulting String is used as a sink for
   * TFRecords.
   */
  private static String concatURI(String dir, String folder) {
    if (dir.endsWith("/")) {
      return dir + folder;
    } else {
      return dir + "/" + folder;
    }
  }

  /**
   * The {@link BigQueryToTFRecord#applyTrainTestValSplit} method transforms the PCollection by
   * randomly partitioning it into PCollections for each dataset.
   */
  static PCollectionList<byte[]> applyTrainTestValSplit(
      PCollection<byte[]> input,
      ValueProvider<Float> trainingPercentage,
      ValueProvider<Float> testingPercentage,
      ValueProvider<Float> validationPercentage,
      Random rand) {
    return input.apply(
        Partition.of(
            3,
            (Partition.PartitionFn<byte[]>)
                (number, numPartitions) -> {
                  Float train = trainingPercentage.get();
                  Float test = testingPercentage.get();
                  Float validation = validationPercentage.get();
                  Double d = rand.nextDouble();
                  if (train + test + validation != 1) {
                    throw new RuntimeException(
                        String.format(
                            "Train %.2f, Test %.2f, Validation"
                                + " %.2f percentages must add up to 100 percent",
                            train, test, validation));
                  }
                  if (d < train) {
                    return 0;
                  } else if (d >= train && d < train + test) {
                    return 1;
                  } else {
                    return 2;
                  }
                }));
  }

  /** Run the pipeline. */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  /**
   * Runs the pipeline with the specified options. This method does not wait until the pipeline is
   * finished before returning. Invoke {@code result.waitUntilFinish()} on the result object to
   * block until the pipeline is finished running if blocking programmatic execution is required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {
    Random rand = new Random(100); // set random seed
    Pipeline pipeline = Pipeline.create(options);

    PCollection<byte[]> bigQueryToExamples =
        pipeline
            .apply(
                "RecordToExample",
                BigQueryIO.read(BigQueryToTFRecord::record2Example)
                    .fromQuery(options.getReadQuery())
                    .withCoder(ByteArrayCoder.of())
                    .withTemplateCompatibility()
                    .withoutValidation()
                    .usingStandardSql()
                    .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
                // Enable BigQuery Storage API
                )
            .apply("ReshuffleResults", Reshuffle.viaRandomKey());

    PCollectionList<byte[]> partitionedExamples =
        applyTrainTestValSplit(
            bigQueryToExamples,
            options.getTrainingPercentage(),
            options.getTestingPercentage(),
            options.getValidationPercentage(),
            rand);

    partitionedExamples
        .get(0)
        .apply(
            "WriteTFTrainingRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, TRAIN)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    partitionedExamples
        .get(1)
        .apply(
            "WriteTFTestingRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, TEST)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    partitionedExamples
        .get(2)
        .apply(
            "WriteTFValidationRecord",
            FileIO.<byte[]>write()
                .via(TFRecordIO.sink())
                .to(
                    ValueProvider.NestedValueProvider.of(
                        options.getOutputDirectory(), dir -> concatURI(dir, VAL)))
                .withNumShards(0)
                .withSuffix(options.getOutputSuffix()));

    return pipeline.run();
  }

  /** Define command line arguments. */
  public interface Options extends BigQueryReadOptions {

    @TemplateParameter.GcsWriteFolder(
        order = 1,
        description = "Output Cloud Storage directory.",
        helpText = "Cloud Storage directory to store output TFRecord files.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDirectory();

    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 2,
        optional = true,
        regexes = {"^[A-Za-z_0-9.]*"},
        description = "The output suffix for TFRecord files",
        helpText = "File suffix to append to TFRecord files. Defaults to .tfrecord")
    @Default.String(".tfrecord")
    ValueProvider<String> getOutputSuffix();

    void setOutputSuffix(ValueProvider<String> outputSuffix);

    @TemplateParameter.Text(
        order = 3,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the training set ",
        helpText = "Defaults to 1 or 100%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(1)
    ValueProvider<Float> getTrainingPercentage();

    void setTrainingPercentage(ValueProvider<Float> trainingPercentage);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the testing set ",
        helpText = "Defaults to 0 or 0%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(0)
    ValueProvider<Float> getTestingPercentage();

    void setTestingPercentage(ValueProvider<Float> testingPercentage);

    @TemplateParameter.Text(
        order = 5,
        optional = true,
        regexes = {"(^\\.[1-9]*$)|(^[01]*)"},
        description = "Percentage of data to be in the validation set ",
        helpText = "Defaults to 0 or 0%. Should be decimal between 0 and 1 inclusive")
    @Default.Float(0)
    ValueProvider<Float> getValidationPercentage();

    void setValidationPercentage(ValueProvider<Float> validationPercentage);
  }
}

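BigQuery export to Parquet (via Storage API)

The BigQuery export to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template uses the BigQuery Storage API to export the data.

Requirements for this pipeline:

The input BigQuery table must exist before you run the pipeline.
The output Cloud Storage bucket must exist before you run the pipeline.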

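Template parameters

Parameter	Description
`tableRef`	The BigQuery input table location. For example, `<my-project>:<my-dataset>.<my-table>`.
`bucket`	The Cloud Storage folder in which to write the Parquet files. For example, `gs://mybucket/exports`.
`numShards`	(Optional) The number of output file shards. The default value is 1.
`fields`	(Optional) A comma-separated list of fields to select from the input BigQuery table.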
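The formats of `tableRef` and `fields` can be illustrated with a small parsing sketch. This is not how the template itself parses its options; `ParquetExportArgs`, `parseTableRef`, and `parseFields` are hypothetical names, and the sketch assumes a simple project ID that contains neither a colon nor a dot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shows the expected shape of the tableRef and fields values. */
final class ParquetExportArgs {

  /** Splits "<project>:<dataset>.<table>" into {project, dataset, table}. */
  static String[] parseTableRef(String tableRef) {
    int colon = tableRef.indexOf(':');
    int dot = tableRef.indexOf('.', colon + 1);
    if (colon <= 0 || dot <= colon + 1 || dot == tableRef.length() - 1) {
      throw new IllegalArgumentException(
          "Expected <project>:<dataset>.<table>, got: " + tableRef);
    }
    return new String[] {
      tableRef.substring(0, colon),
      tableRef.substring(colon + 1, dot),
      tableRef.substring(dot + 1)
    };
  }

  /** Splits a comma-separated field list, trimming whitespace and skipping empty entries. */
  static List<String> parseFields(String fields) {
    List<String> result = new ArrayList<>();
    if (fields == null) {
      return result; // the parameter is optional
    }
    for (String field : fields.split(",")) {
      String trimmed = field.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(parseTableRef("my-project:my_dataset.my_table")));
    System.out.println(parseFields("field1, field2"));
  }
}
```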

Running the BigQuery to Cloud Storage Parquet template

Console

Go to the Dataflow Create job from template page.

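In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery export to Parquet (via Storage API) template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: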

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet \
    --region=REGION_NAME \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS

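Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep these changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGQUERY_TABLE: your BigQuery table name
OUTPUT_DIRECTORY: your Cloud Storage folder for the output files
NUM_SHARDS: the desired number of output file shards
FIELDS: the comma-separated list of fields to select from the input BigQuery table

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.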

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "tableRef": "BIGQUERY_TABLE",
          "bucket": "OUTPUT_DIRECTORY",
          "numShards": "NUM_SHARDS",
          "fields": "FIELDS"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet"
   }
}

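Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep these changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGQUERY_TABLE: your BigQuery table name
OUTPUT_DIRECTORY: your Cloud Storage folder for the output files
NUM_SHARDS: the desired number of output file shards
FIELDS: the comma-separated list of fields to select from the input BigQuery table

Template source code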

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import com.google.api.gax.rpc.InvalidArgumentException;
import com.google.api.services.bigquery.model.TableReference;
import com.google.cloud.bigquery.storage.v1beta1.BigQueryStorageClient;
import com.google.cloud.bigquery.storage.v1beta1.ReadOptions.TableReadOptions;
import com.google.cloud.bigquery.storage.v1beta1.Storage.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1beta1.Storage.ReadSession;
import com.google.cloud.bigquery.storage.v1beta1.TableReferenceProto;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.templates.BigQueryToParquet.BigQueryToParquetOptions;
import com.google.common.base.Splitter;
import com.google.common.base.Strings;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link BigQueryToParquet} pipeline exports data from a BigQuery table to Parquet file(s) in a
 * Google Cloud Storage bucket.
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>BigQuery Table exists.
 *   <li>Google Cloud Storage bucket exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 * # Set the pipeline vars
 * PROJECT=my-project
 * BUCKET_NAME=my-bucket
 * TABLE=${PROJECT}:my-dataset.my-table
 *
 * # Set containerization vars
 * IMAGE_NAME=my-image-name
 * TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
 * BASE_CONTAINER_IMAGE=my-base-container-image
 * BASE_CONTAINER_IMAGE_VERSION=my-base-container-image-version
 * APP_ROOT=/path/to/app-root
 * COMMAND_SPEC=/path/to/command-spec
 *
 * # Build and upload image
 * mvn clean package \
 * -Dimage=${TARGET_GCR_IMAGE} \
 * -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
 * -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
 * -Dapp-root=${APP_ROOT} \
 * -Dcommand-spec=${COMMAND_SPEC}
 *
 * # Create an image spec in GCS that contains the path to the image
 * {
 *    "docker_template_spec": {
 *       "docker_image": $TARGET_GCR_IMAGE
 *     }
 *  }
 *
 * # Execute template:
 * API_ROOT_URL="https://dataflow.googleapis.com"
 * TEMPLATES_LAUNCH_API="${API_ROOT_URL}/v1b3/projects/${PROJECT}/templates:launch"
 * JOB_NAME="bigquery-to-parquet-`date +%Y%m%d-%H%M%S-%N`"
 *
 * time curl -X POST -H "Content-Type: application/json"     \
 *     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 *     "${TEMPLATES_LAUNCH_API}"`
 *     `"?validateOnly=false"`
 *     `"&dynamicTemplate.gcsPath=${BUCKET_NAME}/path/to/image-spec"`
 *     `"&dynamicTemplate.stagingLocation=${BUCKET_NAME}/staging" \
 *     -d '
 *      {
 *       "jobName":"'$JOB_NAME'",
 *       "parameters": {
 *           "tableRef":"'$TABLE'",
 *           "bucket":"'$BUCKET_NAME/results'",
 *           "numShards":"5",
 *           "fields":"field1,field2"
 *        }
 *       }
 *      '
 * </pre>
 */
@Template(
    name = "BigQuery_to_Parquet",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery export to Parquet (via Storage API)",
    description =
        "A pipeline to export a BigQuery table into Parquet files using the BigQuery Storage API.",
    optionsClass = BigQueryToParquetOptions.class,
    flexContainerName = "bigquery-to-parquet",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToParquet {

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(BigQueryToParquet.class);

  /** File suffix for file to be written. */
  private static final String FILE_SUFFIX = ".parquet";

  /** Factory to create BigQueryStorageClients. */
  static class BigQueryStorageClientFactory {

    /**
     * Creates BigQueryStorage client for use in extracting table schema.
     *
     * @return BigQueryStorageClient
     */
    static BigQueryStorageClient create() {
      try {
        return BigQueryStorageClient.create();
      } catch (IOException e) {
        LOG.error("Error connecting to BigQueryStorage API: " + e.getMessage());
        throw new RuntimeException(e);
      }
    }
  }

  /** Factory to create ReadSessions. */
  static class ReadSessionFactory {

    /**
     * Creates ReadSession for schema extraction.
     *
     * @param client BigQueryStorage client used to create ReadSession.
     * @param tableString String that represents table to export from.
     * @param tableReadOptions TableReadOptions that specify any fields in the table to filter on.
     * @return session ReadSession object that contains the schema for the export.
     */
    static ReadSession create(
        BigQueryStorageClient client, String tableString, TableReadOptions tableReadOptions) {
      TableReference tableReference = BigQueryHelpers.parseTableSpec(tableString);
      String parentProjectId = "projects/" + tableReference.getProjectId();

      TableReferenceProto.TableReference storageTableRef =
          TableReferenceProto.TableReference.newBuilder()
              .setProjectId(tableReference.getProjectId())
              .setDatasetId(tableReference.getDatasetId())
              .setTableId(tableReference.getTableId())
              .build();

      CreateReadSessionRequest.Builder builder =
          CreateReadSessionRequest.newBuilder()
              .setParent(parentProjectId)
              .setReadOptions(tableReadOptions)
              .setTableReference(storageTableRef);
      try {
        return client.createReadSession(builder.build());
      } catch (InvalidArgumentException iae) {
        LOG.error("Error creating ReadSession: " + iae.getMessage());
        throw new RuntimeException(iae);
      }
    }
  }

  /**
   * The {@link BigQueryToParquetOptions} class provides the custom execution options passed by the
   * executor at the command-line.
   */
  public interface BigQueryToParquetOptions extends PipelineOptions {
    @TemplateParameter.BigQueryTable(
        order = 1,
        description = "BigQuery table to export",
        helpText = "BigQuery table location to export in the format <project>:<dataset>.<table>.",
        example = "your-project:your-dataset.your-table-name")
    @Required
    String getTableRef();

    void setTableRef(String tableRef);

    @TemplateParameter.GcsWriteFile(
        order = 2,
        description = "Output Cloud Storage file(s)",
        helpText = "Path and filename prefix for writing output files.",
        example = "gs://your-bucket/export/")
    @Required
    String getBucket();

    void setBucket(String bucket);

    @TemplateParameter.Integer(
        order = 3,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of shards"
                + " means higher throughput for writing to Cloud Storage, but potentially higher"
                + " data aggregation cost across shards when processing output Cloud Storage"
                + " files.")
    @Default.Integer(0)
    Integer getNumShards();

    void setNumShards(Integer numShards);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "List of field names",
        helpText = "Comma separated list of fields to select from the table.")
    String getFields();

    void setFields(String fields);

    @TemplateParameter.Text(
        order = 5,
        optional = true,
        description = "Row restrictions/filter.",
        helpText =
            "Read only rows which match the specified filter, which must be a SQL expression"
                + " compatible with Google standard SQL"
                + " (https://cloud.google.com/bigquery/docs/reference/standard-sql). If no value is"
                + " specified, then all rows are returned.")
    String getRowRestriction();

    void setRowRestriction(String restriction);
  }

  /**
   * The {@link BigQueryToParquet#getTableSchema(ReadSession)} method gets the Avro schema for the
   * table from the {@link ReadSession} object.
   *
   * @param session ReadSession that contains schema for table, filtered by fields if any.
   * @return avroSchema Avro schema for table. If fields are provided then schema will only contain
   *     those fields.
   */
  private static Schema getTableSchema(ReadSession session) {
    Schema avroSchema;

    avroSchema = new Schema.Parser().parse(session.getAvroSchema().getSchema());
    LOG.info("Schema for export is: " + avroSchema.toString());

    return avroSchema;
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    BigQueryToParquetOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryToParquetOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(BigQueryToParquetOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    TableReadOptions.Builder builder = TableReadOptions.newBuilder();

    /* Add fields to filter export on, if any. */
    if (options.getFields() != null) {
      builder.addAllSelectedFields(Arrays.asList(options.getFields().split(",\\s*")));
    }

    TableReadOptions tableReadOptions = builder.build();
    BigQueryStorageClient client = BigQueryStorageClientFactory.create();
    ReadSession session =
        ReadSessionFactory.create(client, options.getTableRef(), tableReadOptions);

    // Extract schema from ReadSession
    Schema schema = getTableSchema(session);
    client.close();

    TypedRead<GenericRecord> readFromBQ =
        BigQueryIO.read(SchemaAndRecord::getRecord)
            .from(options.getTableRef())
            .withTemplateCompatibility()
            .withMethod(Method.DIRECT_READ)
            .withCoder(AvroCoder.of(schema));

    if (options.getFields() != null) {
      List<String> selectedFields = Splitter.on(",").splitToList(options.getFields());
      readFromBQ =
          selectedFields.isEmpty() ? readFromBQ : readFromBQ.withSelectedFields(selectedFields);
    }

    // Add row restrictions/filter if any.
    if (!Strings.isNullOrEmpty(options.getRowRestriction())) {
      readFromBQ = readFromBQ.withRowRestriction(options.getRowRestriction());
    }

    /*
     * Steps: 1) Read records from BigQuery via BigQueryIO.
     *        2) Write records to Google Cloud Storage in Parquet format.
     */
    pipeline
        /*
         * Step 1: Read records via BigQueryIO using supplied schema as a PCollection of
         *         {@link GenericRecord}.
         */
        .apply("ReadFromBigQuery", readFromBQ)
        /*
         * Step 2: Write records to Google Cloud Storage as one or more Parquet files
         *         via {@link ParquetIO}.
         */
        .apply(
            "WriteToParquet",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(schema))
                .to(options.getBucket())
                .withNumShards(options.getNumShards())
                .withSuffix(FILE_SUFFIX));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}
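The listing above splits the comma-separated `fields` option in two places: once with the regular expression `,\s*` and once on the bare comma. The following minimal sketch shows how the two approaches differ when the list contains a space after a comma. It uses only `String.split` as a stand-in for both call sites, and the class and method names are hypothetical, not part of the template.

```java
import java.util.Arrays;
import java.util.List;

class FieldListSplitDemo {

  // Splits on a comma followed by optional whitespace, so "name, age" yields clean names.
  static List<String> splitTrimmed(String fields) {
    return Arrays.asList(fields.split(",\\s*"));
  }

  // Splits on the bare comma only, so whitespace after a comma stays in the field name.
  static List<String> splitRaw(String fields) {
    return Arrays.asList(fields.split(","));
  }

  public static void main(String[] args) {
    String fields = "name, age";
    System.out.println(splitTrimmed(fields)); // second element is "age"
    System.out.println(splitRaw(fields));     // second element is " age" (leading space kept)
  }
}
```

Passing the field list without spaces, for example `name,age`, makes both approaches return the same names.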

BigQuery to Elasticsearch

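The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. The template can read the entire table, or it can read specific records by using a supplied query.

Pipeline requirements

The source BigQuery table must exist.
An Elasticsearch host running Elasticsearch version 7.0 or later must be available on a Google Cloud instance or on Elastic Cloud, and it must be reachable from the Dataflow worker machines.

Template parameters

Parameter	Description
`connectionUrl`	The Elasticsearch URL in the format `https://hostname:[port]`, or the CloudID if you use Elastic Cloud.
`apiKey`	The Base64-encoded API key used for authentication.
`index`	The Elasticsearch index that requests are issued to, for example `my-index`.
`inputTableSpec`	(Optional) The BigQuery table to read from and insert into Elasticsearch. Either a table or a query must be provided. For example `projectId:datasetId.tablename`.
`query`	(Optional) The SQL query used to pull data from BigQuery. Either a table or a query must be provided.
`useLegacySql`	(Optional) Set to true to use legacy SQL (applies only when a query is supplied). Default: `false`.
`batchSize`	(Optional) The batch size in number of documents. Default: `1000`.
`batchSizeBytes`	(Optional) The batch size in bytes. Default: `5242880` (5 MB).
`maxRetryAttempts`	(Optional) The maximum number of retry attempts. Must be greater than 0. Default: `no retries`.
`maxRetryDuration`	(Optional) The maximum retry duration in milliseconds. Must be greater than 0. Default: `no retries`.
`propertyAsIndex`	(Optional) A property in the document being indexed whose value specifies the `_index` metadata to include with the document in the bulk request (takes precedence over an `_index` UDF). Default: none.
`propertyAsId`	(Optional) A property in the document being indexed whose value specifies the `_id` metadata to include with the document in the bulk request (takes precedence over an `_id` UDF). Default: none.
`javaScriptIndexFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_index` metadata to include with the document in the bulk request. Default: none.
`javaScriptIndexFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_index` metadata to include with the document in the bulk request. Default: none.
`javaScriptIdFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_id` metadata to include with the document in the bulk request. Default: none.
`javaScriptIdFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_id` metadata to include with the document in the bulk request. Default: none.
`javaScriptTypeFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_type` metadata to include with the document in the bulk request. Default: none.
`javaScriptTypeFnName`	(Optional) The name of the JavaScript UDF function that specifies the `_type` metadata to include with the document in the bulk request. Default: none.
`javaScriptIsDeleteFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that determines whether the document should be deleted rather than inserted or updated. The function should return the string value `"true"` or `"false"`. Default: none.
`javaScriptIsDeleteFnName`	(Optional) The name of the JavaScript UDF function that determines whether the document should be deleted rather than inserted or updated. The function should return the string value `"true"` or `"false"`. Default: none.
`usePartialUpdate`	(Optional) Whether to use partial updates in Elasticsearch requests (update rather than create or index, which allows partial documents). Default: `false`.
`bulkInsertMethod`	(Optional) Whether to use `INDEX` (index, which allows upserts) or `CREATE` (create, which errors on a duplicate _id) in Elasticsearch bulk requests. Default: `CREATE`.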
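The `batchSize` and `batchSizeBytes` parameters both bound the size of a bulk request. The following minimal sketch illustrates one plausible reading: a batch is flushed as soon as either limit is reached. That flush rule is an assumption made for illustration, the class and method names are hypothetical, and the connector's actual batching logic may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BulkBatchSketch {

  // Groups JSON documents into batches, closing a batch when it holds
  // batchSize documents or at least batchSizeBytes of UTF-8 data.
  static List<List<String>> batch(List<String> docs, int batchSize, long batchSizeBytes) {
    List<List<String>> batches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentBytes = 0;
    for (String doc : docs) {
      current.add(doc);
      currentBytes += doc.getBytes(StandardCharsets.UTF_8).length;
      if (current.size() >= batchSize || currentBytes >= batchSizeBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
    }
    // Emit whatever is left over as a final, smaller batch.
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> docs = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      docs.add("{\"a\":1}"); // 7 bytes each
    }
    // Count limit dominates: batches of 2, 2, and 1 documents.
    System.out.println(batch(docs, 2, 5242880L).size());
    // Byte limit dominates: each batch closes once it reaches 14 bytes.
    System.out.println(batch(docs, 1000, 14L).size());
  }
}
```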

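Run the BigQuery to Elasticsearch template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: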

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX

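Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
INPUT_TABLE_SPEC: your BigQuery table name.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.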

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch"
   }
}

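Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
INPUT_TABLE_SPEC: your BigQuery table name.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.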
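As a rough illustration of the REST call described above, the following sketch assembles the POST request with the JDK's `java.net.http` API (Java 11 or later) without sending it. The class and method names are hypothetical. The sketch assumes that you already hold an OAuth 2.0 access token, for example one printed by `gcloud auth print-access-token`, and that the parameter values contain no characters that need JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class FlexTemplateLaunchSketch {

  // Builds the launch URL from the PROJECT_ID and LOCATION placeholders.
  static String launchUrl(String projectId, String location) {
    return "https://dataflow.googleapis.com/v1b3/projects/" + projectId
        + "/locations/" + location + "/flexTemplates:launch";
  }

  // Builds the JSON body. Values are inserted verbatim (no JSON escaping).
  static String launchBody(String jobName, String version, String inputTableSpec,
      String connectionUrl, String apiKey, String index) {
    return "{\n"
        + "  \"launch_parameter\": {\n"
        + "    \"jobName\": \"" + jobName + "\",\n"
        + "    \"parameters\": {\n"
        + "      \"inputTableSpec\": \"" + inputTableSpec + "\",\n"
        + "      \"connectionUrl\": \"" + connectionUrl + "\",\n"
        + "      \"apiKey\": \"" + apiKey + "\",\n"
        + "      \"index\": \"" + index + "\"\n"
        + "    },\n"
        + "    \"containerSpecGcsPath\": \"gs://dataflow-templates/" + version
        + "/flex/BigQuery_to_Elasticsearch\"\n"
        + "  }\n"
        + "}";
  }

  // Assembles the POST request; sending it is left to the caller.
  static HttpRequest buildRequest(String accessToken, String url, String body) {
    return HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    // All values below are placeholders for illustration only.
    String url = launchUrl("my-project", "us-central1");
    String body = launchBody("es-job", "latest", "my-project:my_dataset.my_table",
        "https://example.com:9200", "EXAMPLE_API_KEY", "my-index");
    HttpRequest request = buildRequest("ACCESS_TOKEN", url, body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

To actually submit the job, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; a production client would also use a JSON library rather than string concatenation.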

Template source code

Java


/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.BigQueryToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.ReadBigQuery;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.TableRowToJsonFn;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

/**
 * The {@link BigQueryToElasticsearch} pipeline exports data from a BigQuery table to Elasticsearch.
 *
 * <p>Please refer to <b><a href=
 * "https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/googlecloud-to-elasticsearch/docs/BigQueryToElasticsearch/README.md">
 * README.md</a></b> for further information.
 */
@Template(
    name = "BigQuery_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to Elasticsearch",
    description =
        "A pipeline which sends BigQuery records into an Elasticsearch instance as json documents.",
    optionsClass = BigQueryToElasticsearchOptions.class,
    flexContainerName = "bigquery-to-elasticsearch",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToElasticsearch {
  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    BigQueryToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(BigQueryToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(BigQueryToElasticsearchOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);
    /*
     * Steps: 1) Read records from BigQuery via BigQueryIO.
     *        2) Create json string from Table Row.
     *        3) Write records to Elasticsearch.
     *
     *
     * Step #1: Read from BigQuery. If a query is provided then it is used to get the TableRows.
     */
    pipeline
        .apply(
            "ReadFromBigQuery",
            ReadBigQuery.newBuilder()
                .setOptions(options.as(BigQueryToElasticsearchOptions.class))
                .build())

        /*
         * Step #2: Convert table rows to JSON documents.
         */
        .apply("TableRowsToJsonDocument", ParDo.of(new TableRowToJsonFn()))

        /*
         * Step #3: Write converted records to Elasticsearch
         */
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setOptions(options.as(BigQueryToElasticsearchOptions.class))
                .build());

    return pipeline.run();
  }
}

BigQuery to MongoDB

The BigQuery to MongoDB template is a batch pipeline that reads rows from BigQuery and writes them to MongoDB as documents. Currently, each row is stored as one document.
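The row-to-document mapping can be pictured with the following minimal sketch. It uses plain `Map` objects as stand-ins for the BigQuery `TableRow` and the BSON `Document` types, and it mirrors the intent of the template source listed later in this section, which copies every column except one named `_id`. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowToDocumentSketch {

  // Copies every column of the row into a new document, skipping "_id".
  static Map<String, Object> toDocument(Map<String, Object> row) {
    Map<String, Object> doc = new LinkedHashMap<>();
    row.forEach(
        (key, value) -> {
          // Compare string contents with equals(), not object identity.
          if (!"_id".equals(key)) {
            doc.put(key, value);
          }
        });
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("_id", "row-key-1");
    row.put("name", "Ada");
    row.put("age", 36);
    System.out.println(toDocument(row)); // the "_id" column is not copied
  }
}
```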

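Pipeline requirements

The source BigQuery table must exist.
The target MongoDB instance must be reachable from the Dataflow worker machines.

Template parameters

Parameter	Description
`mongoDbUri`	The MongoDB connection URI in the format `mongodb+srv://:@`.
`database`	The MongoDB database in which to store the collection. For example: `my-db`.
`collection`	The name of the collection in the MongoDB database. For example: `my-collection`.
`inputTableSpec`	The BigQuery table to read from. For example `bigquery-project:dataset.input_table`.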

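Run the BigQuery to MongoDB template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to MongoDB template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: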

  gcloud beta dataflow flex-template run JOB_NAME \
      --project=PROJECT_ID \
      --region=REGION_NAME \
      --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB \
      --parameters \
  inputTableSpec=INPUT_TABLE_SPEC,\
  mongoDbUri=MONGO_DB_URI,\
  database=DATABASE,\
  collection=COLLECTION

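Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
INPUT_TABLE_SPEC: the name of your source BigQuery table.
MONGO_DB_URI: your MongoDB URI.
DATABASE: your MongoDB database.
COLLECTION: your MongoDB collection.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.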

  POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
  {
     "launch_parameter": {
        "jobName": "JOB_NAME",
        "parameters": {
            "inputTableSpec": "INPUT_TABLE_SPEC",
            "mongoDbUri": "MONGO_DB_URI",
            "database": "DATABASE",
            "collection": "COLLECTION"
        },
        "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_MongoDB"
     }
  }

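Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
INPUT_TABLE_SPEC: the name of your source BigQuery table.
MONGO_DB_URI: your MongoDB URI.
DATABASE: your MongoDB database.
COLLECTION: your MongoDB collection.

Template source code

Java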


/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.mongodb.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.mongodb.options.BigQueryToMongoDbOptions.BigQueryReadOptions;
import com.google.cloud.teleport.v2.mongodb.options.BigQueryToMongoDbOptions.MongoDbOptions;
import com.google.cloud.teleport.v2.mongodb.templates.BigQueryToMongoDb.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.bson.Document;

/**
 * The {@link BigQueryToMongoDb} pipeline is a batch pipeline which reads data from BigQuery and
 * outputs the resulting records to MongoDB.
 */
@Template(
    name = "BigQuery_to_MongoDB",
    category = TemplateCategory.BATCH,
    displayName = "BigQuery to MongoDB",
    description =
        "A batch pipeline which reads data rows from BigQuery and writes them to MongoDB as"
            + " documents.",
    optionsClass = Options.class,
    flexContainerName = "bigquery-to-mongodb",
    contactInformation = "https://cloud.google.com/support")
public class BigQueryToMongoDb {
  /**
   * Options supported by {@link BigQueryToMongoDb}
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options extends PipelineOptions, MongoDbOptions, BigQueryReadOptions {}

  private static class ParseAsDocumentsFn extends DoFn<String, Document> {

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  public static boolean run(Options options) {
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(BigQueryIO.readTableRows().withoutValidation().from(options.getInputTableSpec()))
        .apply(
            "bigQueryDataset",
            ParDo.of(
                new DoFn<TableRow, Document>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document doc = new Document();
                    TableRow row = c.element();
                    row.forEach(
                        (key, value) -> {
                          if (!"_id".equals(key)) {
                            doc.append(key, value);
                          }
                        });
                    c.output(doc);
                  }
                }))
        .apply(
            MongoDbIO.write()
                .withUri(options.getMongoDbUri())
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection()));
    pipeline.run();
    return true;
  }
}

Bigtable to Cloud Storage Avro

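The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.

Pipeline requirements:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`outputDirectory`	The Cloud Storage path where data is written. For example `gs://mybucket/somefolder`.
`filenamePrefix`	The prefix of the Avro file name. For example `output-`.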
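The template source listed later in this section builds the output path prefix by appending `filenamePrefix` directly to `outputDirectory`, with no separator added. The following minimal sketch reproduces only that concatenation to show why the directory value should end with a slash. The class and method names are hypothetical.

```java
class AvroPathPrefixSketch {

  // Joins the two parameter values exactly as given, without inserting a "/".
  static String filePathPrefix(String outputDirectory, String filenamePrefix) {
    return new StringBuilder(outputDirectory).append(filenamePrefix).toString();
  }

  public static void main(String[] args) {
    // Without a trailing slash, the prefix is glued onto the folder name.
    System.out.println(filePathPrefix("gs://mybucket/somefolder", "output-"));
    // With a trailing slash, files land inside the folder.
    System.out.println(filePathPrefix("gs://mybucket/somefolder/", "output-"));
  }
}
```

The first call yields `gs://mybucket/somefolderoutput-`, while the second yields `gs://mybucket/somefolder/output-`.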

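Run the Bigtable to Cloud Storage Avro file template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: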

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX

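Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from.
INSTANCE_ID: the ID of the Bigtable instance that contains the table.
TABLE_ID: the ID of the Bigtable table to export.
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Avro file name, for example output-

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.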

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

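Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from.
INSTANCE_ID: the ID of the Bigtable instance that contains the table.
TABLE_ID: the ID of the Bigtable table to export.
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Avro file name, for example output-

Template source code

Java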


/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.google.bigtable.v2.Cell;
import com.google.bigtable.v2.Column;
import com.google.bigtable.v2.Family;
import com.google.bigtable.v2.Row;
import com.google.cloud.teleport.bigtable.BigtableToAvro.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.util.DualInputNestedValueProvider;
import com.google.cloud.teleport.util.DualInputNestedValueProvider.TranslatorInput;
import com.google.protobuf.ByteOutput;
import com.google.protobuf.ByteString;
import com.google.protobuf.UnsafeByteOperations;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.SimpleFunction;

/**
 * Dataflow pipeline that exports data from a Cloud Bigtable table to Avro files in GCS. Currently,
 * filtering on Cloud Bigtable table is not supported.
 */
@Template(
    name = "Cloud_Bigtable_to_GCS_Avro",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Bigtable to Avro Files in Cloud Storage",
    description =
        "A pipeline which reads in Cloud Bigtable table and writes it to Cloud Storage in Avro format.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BigtableToAvro {

  /** Options for the export pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to read")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDirectory();

    @SuppressWarnings("unused")
    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 5,
        description = "Avro file prefix",
        helpText = "The prefix of the Avro file name. For example, \"table1-\"")
    ValueProvider<String> getFilenamePrefix();

    @SuppressWarnings("unused")
    void setFilenamePrefix(ValueProvider<String> filenamePrefix);
  }

  /**
   * Runs a pipeline to export data from a Cloud Bigtable table to Avro files in GCS.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Read read =
        BigtableIO.read()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    // Do not validate input fields if it is running as a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() != null) {
      read = read.withoutValidation();
    }

    ValueProvider<String> filePathPrefix =
        DualInputNestedValueProvider.of(
            options.getOutputDirectory(),
            options.getFilenamePrefix(),
            new SerializableFunction<TranslatorInput<String, String>, String>() {
              @Override
              public String apply(TranslatorInput<String, String> input) {
                return new StringBuilder(input.getX()).append(input.getY()).toString();
              }
            });

    pipeline
        .apply("Read from Bigtable", read)
        .apply("Transform to Avro", MapElements.via(new BigtableToAvroFn()))
        .apply(
            "Write to Avro in GCS",
            AvroIO.write(BigtableRow.class).to(filePathPrefix).withSuffix(".avro"));

    return pipeline.run();
  }

  /** Translates Bigtable {@link Row} to Avro {@link BigtableRow}. */
  static class BigtableToAvroFn extends SimpleFunction<Row, BigtableRow> {
    @Override
    public BigtableRow apply(Row row) {
      ByteBuffer key = ByteBuffer.wrap(toByteArray(row.getKey()));
      List<BigtableCell> cells = new ArrayList<>();
      for (Family family : row.getFamiliesList()) {
        String familyName = family.getName();
        for (Column column : family.getColumnsList()) {
          ByteBuffer qualifier = ByteBuffer.wrap(toByteArray(column.getQualifier()));
          for (Cell cell : column.getCellsList()) {
            long timestamp = cell.getTimestampMicros();
            ByteBuffer value = ByteBuffer.wrap(toByteArray(cell.getValue()));
            cells.add(new BigtableCell(familyName, qualifier, timestamp, value));
          }
        }
      }
      return new BigtableRow(key, cells);
    }
  }

  /**
   * Extracts the byte array from the given {@link ByteString} without copy.
   *
   * @param byteString A {@link ByteString} from which to extract the array.
   * @return an array of byte.
   */
  protected static byte[] toByteArray(final ByteString byteString) {
    try {
      ZeroCopyByteOutput byteOutput = new ZeroCopyByteOutput();
      UnsafeByteOperations.unsafeWriteTo(byteString, byteOutput);
      return byteOutput.bytes;
    } catch (IOException e) {
      return byteString.toByteArray();
    }
  }

  private static final class ZeroCopyByteOutput extends ByteOutput {
    private byte[] bytes;

    @Override
    public void writeLazy(byte[] value, int offset, int length) {
      if (offset != 0 || length != value.length) {
        throw new UnsupportedOperationException();
      }
      bytes = value;
    }

    @Override
    public void write(byte value) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void write(byte[] value, int offset, int length) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void write(ByteBuffer value) {
      throw new UnsupportedOperationException();
    }

    @Override
    public void writeLazy(ByteBuffer value) {
      throw new UnsupportedOperationException();
    }
  }
}

Bigtable to Cloud Storage Parquet

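The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.

Pipeline requirements:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`outputDirectory`	The Cloud Storage path where data is written. For example `gs://mybucket/somefolder`.
`filenamePrefix`	The prefix of the Parquet file name. For example `output-`.
`numShards`	The number of output file shards. For example `2`.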

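Run the Bigtable to Cloud Storage Parquet file template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: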

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX,\
numShards=NUM_SHARDS

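Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from.
INSTANCE_ID: the ID of the Bigtable instance that contains the table.
TABLE_ID: the ID of the Bigtable table to export.
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Parquet file name, for example output-
NUM_SHARDS: the number of Parquet files to output, for example 1

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.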

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
       "numShards": "NUM_SHARDS"
   },
   "environment": { "zone": "us-central1-f" }
}

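Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest template version might be updated with breaking changes. To keep breaking changes from affecting your production workflows, production environments should use a template stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from.
INSTANCE_ID: the ID of the Bigtable instance that contains the table.
TABLE_ID: the ID of the Bigtable table to export.
OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the Parquet file name, for example output-
NUM_SHARDS: the number of Parquet files to output, for example 1

Template source code

Java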


/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import static com.google.cloud.teleport.bigtable.BigtableToAvro.toByteArray;

import com.google.bigtable.v2.Cell;
import com.google.bigtable.v2.Column;
import com.google.bigtable.v2.Family;
import com.google.bigtable.v2.Row;
import com.google.cloud.teleport.bigtable.BigtableToParquet.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

/**
 * Dataflow pipeline that exports data from a Cloud Bigtable table to Parquet files in GCS.
 * Currently, filtering on Cloud Bigtable table is not supported.
 */
@Template(
    name = "Cloud_Bigtable_to_GCS_Parquet",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Bigtable to Parquet Files on Cloud Storage",
    description =
        "A pipeline which reads in Cloud Bigtable table and writes it to Cloud Storage in Parquet format.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class BigtableToParquet {

  /** Options for the export pipeline. */
  public interface Options extends PipelineOptions {

    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to export")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        description = "Output file directory in Cloud Storage",
        helpText =
            "The path and filename prefix for writing output files. Must end with a slash. DateTime formatting is used to parse directory path for date & time formatters.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDirectory();

    @SuppressWarnings("unused")
    void setOutputDirectory(ValueProvider<String> outputDirectory);

    @TemplateParameter.Text(
        order = 5,
        description = "Parquet file prefix",
        helpText = "The prefix of the Parquet file name. For example, \"table1-\"")
    @Default.String("output")
    ValueProvider<String> getFilenamePrefix();

    @SuppressWarnings("unused")
    void setFilenamePrefix(ValueProvider<String> filenamePrefix);

    @TemplateParameter.Integer(
        order = 6,
        optional = true,
        description = "Maximum output shards",
        helpText =
            "The maximum number of output shards produced when writing. A higher number of "
                + "shards means higher throughput for writing to Cloud Storage, but potentially higher "
                + "data aggregation cost across shards when processing output Cloud Storage files. "
                + "Default value is decided by the runner.")
    @Default.Integer(0)
    ValueProvider<Integer> getNumShards();

    @SuppressWarnings("unused")
    void setNumShards(ValueProvider<Integer> numShards);
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  /**
   * Runs a pipeline to export data from a Cloud Bigtable table to Parquet file(s) in GCS.
   *
   * @param options arguments to the pipeline
   */
  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));
    BigtableIO.Read read =
        BigtableIO.read()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    // Do not validate input fields if it is running as a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() != null) {
      read = read.withoutValidation();
    }

    /**
     * Steps: 1) Read records from Bigtable. 2) Convert a Bigtable Row to a GenericRecord. 3) Write
     * GenericRecord(s) to GCS in parquet format.
     */
    pipeline
        .apply("Read from Bigtable", read)
        .apply("Transform to Parquet", MapElements.via(new BigtableToParquetFn()))
        .setCoder(AvroCoder.of(GenericRecord.class, BigtableRow.getClassSchema()))
        .apply(
            "Write to Parquet in GCS",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(BigtableRow.getClassSchema()))
                .to(options.getOutputDirectory())
                .withPrefix(options.getFilenamePrefix())
                .withSuffix(".parquet")
                .withNumShards(options.getNumShards()));

    return pipeline.run();
  }

  /**
   * Translates a {@link PCollection} of Bigtable {@link Row} to a {@link PCollection} of {@link
   * GenericRecord}.
   */
  static class BigtableToParquetFn extends SimpleFunction<Row, GenericRecord> {
    @Override
    public GenericRecord apply(Row row) {
      ByteBuffer key = ByteBuffer.wrap(toByteArray(row.getKey()));
      List<BigtableCell> cells = new ArrayList<>();
      for (Family family : row.getFamiliesList()) {
        String familyName = family.getName();
        for (Column column : family.getColumnsList()) {
          ByteBuffer qualifier = ByteBuffer.wrap(toByteArray(column.getQualifier()));
          for (Cell cell : column.getCellsList()) {
            long timestamp = cell.getTimestampMicros();
            ByteBuffer value = ByteBuffer.wrap(toByteArray(cell.getValue()));
            cells.add(new BigtableCell(familyName, qualifier, timestamp, value));
          }
        }
      }
      return new GenericRecordBuilder(BigtableRow.getClassSchema())
          .set("key", key)
          .set("cells", cells)
          .build();
    }
  }
}
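
The Options interface above declares regular expressions for the instance ID and the table ID. The following sketch shows which values those two patterns accept. It assumes the whole value must match the pattern, which is an assumption about how the template framework applies them. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative check (hypothetical name) using the regexes from the Options interface. */
final class BigtableIdPatternsSketch {

  // Patterns copied verbatim from the regexes attributes in the source above.
  static final Pattern INSTANCE_ID = Pattern.compile("[a-z][a-z0-9\\-]+[a-z0-9]");
  static final Pattern TABLE_ID = Pattern.compile("[_a-zA-Z0-9][-_.a-zA-Z0-9]*");

  // Assumption: full-string matching, hence matches() rather than find().
  static boolean isValidInstanceId(String id) {
    return INSTANCE_ID.matcher(id).matches();
  }

  static boolean isValidTableId(String id) {
    return TABLE_ID.matcher(id).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValidInstanceId("my-instance")); // true
    System.out.println(isValidInstanceId("My-Instance")); // false: uppercase letters
    System.out.println(isValidInstanceId("ab"));          // false: needs at least 3 characters
    System.out.println(isValidTableId("_table.1"));       // true
    System.out.println(isValidTableId("-table"));         // false: cannot start with a hyphen
  }
}
```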

Bigtable to Cloud Storage SequenceFile

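The Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Bigtable to Cloud Storage.

Requirements for this pipeline:

The Bigtable table must exist.
The output Cloud Storage bucket must exist before you run the pipeline.

Template parameters

Parameter	Description
`bigtableProject`	The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to export.
`bigtableAppProfileId`	The ID of the Bigtable application profile to use for the export. If you do not specify an app profile, Bigtable uses the instance's default app profile.
`destinationPath`	The Cloud Storage path where data is written. For example, `gs://mybucket/somefolder`.
`filenamePrefix`	The prefix of the SequenceFile file name. For example, `output-`.

Running the Bigtable to Cloud Storage SequenceFile template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Bigtable to SequenceFile Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: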

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
destinationPath=DESTINATION_PATH,\
filenamePrefix=FILENAME_PREFIX

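Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the export
DESTINATION_PATH: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the SequenceFile file name, for example output-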
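
The gcloud command above passes the template parameters as a single comma-separated list of name=value pairs. The following is a rough sketch of how a script might assemble that value from a map. The class and method names are hypothetical, and the sketch assumes that no value contains a comma, because it does no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative helper (hypothetical name): joins template parameters for --parameters. */
final class ParametersFlagSketch {

  // Joins entries as name=value pairs separated by commas, in insertion order.
  // Assumption: values contain no commas; no escaping is applied.
  static String toParametersFlag(Map<String, String> parameters) {
    return parameters.entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(","));
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the parameters in the order they were added.
    Map<String, String> parameters = new LinkedHashMap<>();
    parameters.put("bigtableProject", "my-project");
    parameters.put("bigtableInstanceId", "my-instance");
    parameters.put("bigtableTableId", "my-table");
    parameters.put("destinationPath", "gs://mybucket/somefolder");
    parameters.put("filenamePrefix", "output-");
    System.out.println("--parameters " + toParametersFlag(parameters));
  }
}
```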

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "destinationPath": "DESTINATION_PATH",
       "filenamePrefix": "FILENAME_PREFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

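Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to export
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the export
DESTINATION_PATH: the Cloud Storage path where data is written, for example gs://mybucket/somefolder
FILENAME_PREFIX: the prefix of the SequenceFile file name, for example output-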

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/cloud-bigtable-client repository on GitHub.

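Datastore to Cloud Storage Text [Deprecated]

This template is deprecated and will be removed in Q1 2022. Please migrate to the Firestore to Cloud Storage Text template.

The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file is a JSON-serialized entity.

Requirements for this pipeline:

Datastore must be set up in the project before you run the pipeline.

Template parameters

Parameter	Description
`datastoreReadGqlQuery`	A GQL query that specifies which entities to fetch. For example, `SELECT * FROM MyKind`.
`datastoreReadProjectId`	The ID of the Google Cloud project of the Datastore instance that you want to read data from.
`datastoreReadNamespace`	The namespace of the requested entities. To use the default namespace, leave this parameter blank.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`textWritePrefix`	The Cloud Storage path prefix that specifies where the data is written. For example, `gs://mybucket/somefolder/`.

Running the Datastore to Cloud Storage Text template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Datastore to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: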

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\
datastoreReadProjectId=DATASTORE_PROJECT_ID,\
datastoreReadNamespace=DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

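Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
DATASTORE_PROJECT_ID: the ID of the Cloud project where the Datastore instance exists
DATASTORE_KIND: the kind of your Datastore entities
DATASTORE_NAMESPACE: the namespace of your Datastore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js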

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND",
       "datastoreReadProjectId": "DATASTORE_PROJECT_ID",
       "datastoreReadNamespace": "DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

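Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
DATASTORE_PROJECT_ID: the ID of the Cloud project where the Datastore instance exists
DATASTORE_KIND: the kind of your Datastore entities
DATASTORE_NAMESPACE: the namespace of your Datastore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js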

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToText.DatastoreToTextOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemWriteOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/**
 * Dataflow template which copies Datastore Entities to a Text sink. Text is encoded using JSON
 * encoded entity in the v1/Entity rest format:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@Template(
    name = "Datastore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Datastore to Text Files on Cloud Storage [Deprecated]",
    description =
        "Batch pipeline. Reads Datastore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"firestoreReadNamespace", "firestoreReadGqlQuery", "firestoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Firestore (Datastore mode) to Text Files on Cloud Storage",
    description =
        "Batch pipeline. Reads Firestore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"datastoreReadNamespace", "datastoreReadGqlQuery", "datastoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToText {

  public static ValueProvider<String> selectProvidedInput(
      ValueProvider<String> datastoreInput, ValueProvider<String> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToTextOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          FilesystemWriteOptions {}

  /**
   * Runs a pipeline which reads in Entities from Datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and writes the JSON to TextIO sink.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToTextOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DatastoreToTextOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(TextIO.write().to(options.getTextWritePrefix()).withSuffix(".json"));

    pipeline.run();
  }
}

Firestore to Cloud Storage Text

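The Firestore to Cloud Storage Text template is a batch pipeline that reads Firestore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file is a JSON-serialized entity.

Requirements for this pipeline:

Firestore must be set up in the project before you run the pipeline.

Template parameters

Parameter	Description
`firestoreReadGqlQuery`	A GQL query that specifies which entities to fetch. For example, `SELECT * FROM MyKind`.
`firestoreReadProjectId`	The ID of the Google Cloud project of the Firestore instance that you want to read data from.
`firestoreReadNamespace`	The namespace of the requested entities. To use the default namespace, leave this parameter blank.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`textWritePrefix`	The Cloud Storage path prefix that specifies where the data is written. For example, `gs://mybucket/somefolder/`.

Running the Firestore to Cloud Storage Text template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Firestore to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: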

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM FIRESTORE_KIND",\
firestoreReadProjectId=FIRESTORE_PROJECT_ID,\
firestoreReadNamespace=FIRESTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

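Replace the following:

JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
FIRESTORE_PROJECT_ID: the ID of the Cloud project where the Firestore instance exists
FIRESTORE_KIND: the kind of your Firestore entities
FIRESTORE_NAMESPACE: the namespace of your Firestore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js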

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "firestoreReadGqlQuery": "SELECT * FROM FIRESTORE_KIND",
       "firestoreReadProjectId": "FIRESTORE_PROJECT_ID",
       "firestoreReadNamespace": "FIRESTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

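Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
BUCKET_NAME: the name of your Cloud Storage bucket
FIRESTORE_PROJECT_ID: the ID of the Cloud project where the Firestore instance exists
FIRESTORE_KIND: the kind of your Firestore entities
FIRESTORE_NAMESPACE: the namespace of your Firestore entities
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example gs://my-bucket/my-udfs/my_file.js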

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.DatastoreToText.DatastoreToTextOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreReadOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.ReadJsonEntities;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemWriteOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

/**
 * Dataflow template which copies Datastore Entities to a Text sink. Text is encoded using JSON
 * encoded entity in the v1/Entity rest format:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@Template(
    name = "Datastore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Datastore to Text Files on Cloud Storage [Deprecated]",
    description =
        "Batch pipeline. Reads Datastore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"firestoreReadNamespace", "firestoreReadGqlQuery", "firestoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
@Template(
    name = "Firestore_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Firestore (Datastore mode) to Text Files on Cloud Storage",
    description =
        "Batch pipeline. Reads Firestore entities and writes them to Cloud Storage as text files.",
    optionsClass = DatastoreToTextOptions.class,
    skipOptions = {"datastoreReadNamespace", "datastoreReadGqlQuery", "datastoreReadProjectId"},
    contactInformation = "https://cloud.google.com/support")
public class DatastoreToText {

  public static ValueProvider<String> selectProvidedInput(
      ValueProvider<String> datastoreInput, ValueProvider<String> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** Custom PipelineOptions. */
  public interface DatastoreToTextOptions
      extends PipelineOptions,
          DatastoreReadOptions,
          JavascriptTextTransformerOptions,
          FilesystemWriteOptions {}

  /**
   * Runs a pipeline which reads in Entities from Datastore, passes in the JSON encoded Entities to
   * a Javascript UDF, and writes the JSON to TextIO sink.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    DatastoreToTextOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DatastoreToTextOptions.class);

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(
            ReadJsonEntities.newBuilder()
                .setGqlQuery(
                    selectProvidedInput(
                        options.getDatastoreReadGqlQuery(), options.getFirestoreReadGqlQuery()))
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreReadProjectId(), options.getFirestoreReadProjectId()))
                .setNamespace(
                    selectProvidedInput(
                        options.getDatastoreReadNamespace(), options.getFirestoreReadNamespace()))
                .build())
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(TextIO.write().to(options.getTextWritePrefix()).withSuffix(".json"));

    pipeline.run();
  }
}

Cloud Spanner to Cloud Storage Avro

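The Cloud Spanner to Avro Files on Cloud Storage template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:

A spanner-export.json file.
A TableName-manifest.json file for each table in the database you exported.
One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set: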

Albums-manifest.json
Albums.avro-00000-of-00002
Albums.avro-00001-of-00002
Singers-manifest.json
Singers.avro-00000-of-00003
Singers.avro-00001-of-00003
Singers.avro-00002-of-00003
spanner-export.json
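
The per-table naming pattern in the listing above can be reproduced with a short sketch. This only illustrates the file name format. The number of shards per table is decided by the export pipeline rather than by the caller, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name): lists the export file names for one table. */
final class SpannerExportFilesSketch {

  // Returns the manifest name followed by one Avro file name per shard,
  // using the TableName.avro-#####-of-##### pattern described above.
  static List<String> tableFiles(String tableName, int shardCount) {
    if (shardCount < 1) {
      throw new IllegalArgumentException("shardCount must be at least 1");
    }
    List<String> files = new ArrayList<>();
    files.add(tableName + "-manifest.json");
    for (int shard = 0; shard < shardCount; shard++) {
      // Both the shard index and the total are zero-padded to five digits.
      files.add(String.format("%s.avro-%05d-of-%05d", tableName, shard, shardCount));
    }
    return files;
  }

  public static void main(String[] args) {
    tableFiles("Albums", 2).forEach(System.out::println);
    tableFiles("Singers", 3).forEach(System.out::println);
  }
}
```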

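Requirements for this pipeline:

The Cloud Spanner database must exist.
The output Cloud Storage bucket must exist.
In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database that you want to export.
`databaseId`	The database ID of the Cloud Spanner database that you want to export.
`outputDir`	The Cloud Storage path that you want to export Avro files to. The export job creates a new directory under this path that contains the exported files.
`snapshotTime`	(Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in RFC 3339 UTC "Zulu" format. For example, `1990-12-31T23:59:60Z`. The timestamp must be in the past, and the maximum timestamp staleness applies.
`tableNames`	(Optional) A comma-separated list of tables that specifies the subset of the Cloud Spanner database to export. The list must include all related tables (parent tables and tables referenced by foreign keys). If they are not explicitly listed, the `shouldExportRelatedTables` flag must be set for the export to succeed.
`shouldExportRelatedTables`	(Optional) The flag used together with the `tableNames` parameter to include all related tables in the export.
`spannerProjectId`	(Optional) The ID of the Google Cloud project of the Cloud Spanner database that you want to read data from.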
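
The snapshotTime parameter expects an RFC 3339 UTC timestamp that lies in the past. The following sketch shows one way to produce such a string with java.time. The class and method names are hypothetical, truncating to whole seconds is a choice made here, and the sketch does not check the maximum timestamp staleness.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Illustrative helper (hypothetical name): formats a value for snapshotTime. */
final class SnapshotTimeSketch {

  // Returns the snapshot instant as an RFC 3339 UTC ("Zulu") string.
  // Assumption: the staleness limit is not validated here.
  static String snapshotTime(Instant snapshot, Instant now) {
    if (!snapshot.isBefore(now)) {
      throw new IllegalArgumentException("snapshotTime must be in the past");
    }
    // Instant.toString() emits ISO-8601 in UTC with a trailing Z.
    return snapshot.truncatedTo(ChronoUnit.SECONDS).toString();
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // Example: read the database as it was ten minutes ago.
    System.out.println(snapshotTime(now.minus(10, ChronoUnit.MINUTES), now));
  }
}
```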

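Running the Cloud Spanner to Avro Files on Cloud Storage template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
For your job to show up on the Spanner Instances page of the Google Cloud console, the job name must match the following format:
```
cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME
```
Replace the following:
- SPANNER_INSTANCE_ID: the ID of your Spanner instance
- SPANNER_DATABASE_NAME: the name of your Spanner database
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Spanner to Avro Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: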

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
outputDir=GCS_DIRECTORY

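Replace the following:

JOB_NAME: a unique job name of your choice
For your job to show up in the Cloud Spanner section of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
GCS_STAGING_LOCATION: the location for writing temporary files, for example gs://mybucket/temp
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to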
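
The job name format mentioned for JOB_NAME above can be built and checked with a small sketch. The class and method names are hypothetical, and the check only compares against the format stated in this section. It does not validate any other Dataflow job name rules.

```java
/** Illustrative helper (hypothetical name): builds the job name format described above. */
final class SpannerExportJobNameSketch {

  static final String PREFIX = "cloud-spanner-export-";

  // Builds cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
  static String jobName(String instanceId, String databaseId) {
    return PREFIX + instanceId + "-" + databaseId;
  }

  // True when the given name equals the expected name for this instance and database.
  static boolean matchesFormat(String jobName, String instanceId, String databaseId) {
    return jobName.equals(jobName(instanceId, databaseId));
  }

  public static void main(String[] args) {
    // Prints cloud-spanner-export-test-instance-example-db
    System.out.println(jobName("test-instance", "example-db"));
  }
}
```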

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "outputDir": "gs://GCS_DIRECTORY"
   }
}

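Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
For your job to show up in the Cloud Spanner section of the Google Cloud console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep those changes from affecting production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
GCS_STAGING_LOCATION: the location for writing temporary files, for example gs://mybucket/temp
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to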

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.spanner;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.spanner.ExportPipeline.ExportPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Dataflow template that exports a Cloud Spanner database to Avro files in GCS. */
@Template(
    name = "Cloud_Spanner_to_GCS_Avro",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Spanner to Avro Files on Cloud Storage",
    description =
        "A pipeline to export a Cloud Spanner database to a set of Avro files in Cloud Storage.",
    optionsClass = ExportPipelineOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class ExportPipeline {

  /** Options for Export pipeline. */
  public interface ExportPipelineOptions extends PipelineOptions {
    @TemplateParameter.Text(
        order = 1,
        regexes = {"[a-z][a-z0-9\\-]*[a-z0-9]"},
        description = "Cloud Spanner instance id",
        helpText = "The instance id of the Cloud Spanner database that you want to export.")
    ValueProvider<String> getInstanceId();

    void setInstanceId(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9_\\-]*[a-z0-9]"},
        description = "Cloud Spanner database id",
        helpText = "The database id of the Cloud Spanner database that you want to export.")
    ValueProvider<String> getDatabaseId();

    void setDatabaseId(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 3,
        description = "Cloud Storage output directory",
        helpText =
            "The Cloud Storage path where the Avro files should be exported to. A new directory will be created under this path that contains the export.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getOutputDir();

    void setOutputDir(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 4,
        optional = true,
        description = "Cloud Storage temp directory for storing Avro files",
        helpText =
            "The Cloud Storage path where the temporary Avro files can be created. Ex: gs://your-bucket/your-path")
    ValueProvider<String> getAvroTempDirectory();

    void setAvroTempDirectory(ValueProvider<String> value);

    @TemplateCreationParameter(value = "")
    @Description("Test dataflow job identifier for Beam Direct Runner")
    @Default.String(value = "")
    ValueProvider<String> getTestJobId();

    void setTestJobId(ValueProvider<String> jobId);

    @TemplateParameter.Text(
        order = 6,
        optional = true,
        description = "Cloud Spanner Endpoint to call",
        helpText = "The Cloud Spanner endpoint to call in the template. Only used for testing.",
        example = "https://batch-spanner.googleapis.com")
    @Default.String("https://batch-spanner.googleapis.com")
    ValueProvider<String> getSpannerHost();

    void setSpannerHost(ValueProvider<String> value);

    @TemplateCreationParameter(value = "false")
    @Description("If true, wait for job finish")
    @Default.Boolean(true)
    boolean getWaitUntilFinish();

    void setWaitUntilFinish(boolean value);

    @TemplateParameter.Text(
        order = 7,
        optional = true,
        regexes = {
          "^([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):(([0-9]{2})(\\.[0-9]+)?)Z$"
        },
        description = "Snapshot time",
        helpText =
            "Specifies the snapshot time as RFC 3339 format in UTC time without the timezone offset(always ends in 'Z'). Timestamp must be in the past and Maximum timestamp staleness applies. See https://cloud.google.com/spanner/docs/timestamp-bounds#maximum_timestamp_staleness",
        example = "1990-12-31T23:59:59Z")
    @Default.String(value = "")
    ValueProvider<String> getSnapshotTime();

    void setSnapshotTime(ValueProvider<String> value);

    @TemplateParameter.ProjectId(
        order = 8,
        optional = true,
        description = "Cloud Spanner Project Id",
        helpText = "The project id of the Cloud Spanner instance.")
    ValueProvider<String> getSpannerProjectId();

    void setSpannerProjectId(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 9,
        optional = true,
        description = "Export Timestamps as Timestamp-micros type",
        helpText =
            "If true, Timestamps are exported as timestamp-micros type. Timestamps are exported as ISO8601 strings at nanosecond precision by default.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getShouldExportTimestampAsLogicalType();

    void setShouldExportTimestampAsLogicalType(ValueProvider<Boolean> value);

    @TemplateParameter.Text(
        order = 10,
        optional = true,
        regexes = {"^[a-zA-Z0-9_]+(,[a-zA-Z0-9_]+)*$"},
        description = "Cloud Spanner table name(s).",
        helpText =
            "If provided, only this comma separated list of tables are exported. Ancestor tables and tables that are referenced via foreign keys are required. If not explicitly listed, the `shouldExportRelatedTables` flag must be set for a successful export.")
    @Default.String(value = "")
    ValueProvider<String> getTableNames();

    void setTableNames(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 11,
        optional = true,
        description = "Export necessary Related Spanner tables.",
        helpText =
            "Used in conjunction with `tableNames`. If true, add related tables necessary for the export, such as interleaved parent tables and foreign keys tables.  If `tableNames` is specified but doesn't include related tables, this option must be set to true for a successful export.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getShouldExportRelatedTables();

    void setShouldExportRelatedTables(ValueProvider<Boolean> value);

    @TemplateParameter.Enum(
        order = 12,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  /**
   * Runs a pipeline to export a Cloud Spanner database to Avro files.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {

    ExportPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ExportPipelineOptions.class);

    Pipeline p = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            // Temporary fix explicitly setting SpannerConfig.projectId to the default project
            // if spannerProjectId is not provided as a parameter. Required as of Beam 2.38,
            // which no longer accepts null label values on metrics, and SpannerIO#setup() has
            // a bug resulting in the label value being set to the original parameter value,
            // with no fallback to the default project.
            // TODO: remove NestedValueProvider when this is fixed in Beam.
            .withProjectId(
                NestedValueProvider.of(
                    options.getSpannerProjectId(),
                    (SerializableFunction<String, String>)
                        input -> input != null ? input : SpannerOptions.getDefaultProjectId()))
            .withHost(options.getSpannerHost())
            .withInstanceId(options.getInstanceId())
            .withDatabaseId(options.getDatabaseId())
            .withRpcPriority(options.getSpannerPriority());
    p.begin()
        .apply(
            "Run Export",
            new ExportTransform(
                spannerConfig,
                options.getOutputDir(),
                options.getTestJobId(),
                options.getSnapshotTime(),
                options.getTableNames(),
                options.getShouldExportRelatedTables(),
                options.getShouldExportTimestampAsLogicalType(),
                options.getAvroTempDirectory()));
    PipelineResult result = p.run();
    if (options.getWaitUntilFinish()
        &&
        /* Only if template location is null, there is a dataflow job to wait for. Else it's
         * template generation which doesn't start a dataflow job.
         */
        options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }
}

Cloud Spanner to Cloud Storage Text

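The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table and writes it to Cloud Storage as CSV text files.

Requirements for this pipeline:

The input Spanner table must exist before you run the pipeline.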

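Template parameters

Parameter	Description
`spannerProjectId`	The Google Cloud project ID of the Cloud Spanner database that you want to read data from.
`spannerDatabaseId`	The database ID of the requested table.
`spannerInstanceId`	The instance ID of the requested table.
`spannerTable`	The table to read the data from.
`textWritePrefix`	The directory where output text files are written. Add / at the end. For example: `gs://mybucket/somefolder/`.
`spannerSnapshotTime`	(Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in RFC 3339 UTC "Zulu" format. For example, `1990-12-31T23:59:60Z`. The timestamp must be in the past, and maximum timestamp staleness applies.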
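
As a rough illustration of the timestamp shape that `spannerSnapshotTime` expects, the sketch below checks a value against an RFC 3339 UTC "Zulu" pattern before a job is submitted. The class `SnapshotTimeCheck` is hypothetical. The regular expression is borrowed from the `snapshotTime` parameter of the Cloud Spanner to Avro template source shown earlier on this page; that this template applies the same pattern is an assumption. The check covers the shape only: it does not verify that the timestamp is in the past or within the staleness limit.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-flight check for the snapshot timestamp format (shape only). */
final class SnapshotTimeCheck {
  // Borrowed from the snapshotTime parameter of the Avro export template (assumed to apply here).
  private static final Pattern RFC3339_ZULU =
      Pattern.compile(
          "^([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):(([0-9]{2})(\\.[0-9]+)?)Z$");

  private SnapshotTimeCheck() {}

  /** Returns true if the value looks like yyyy-MM-ddTHH:mm:ss[.fraction]Z. */
  static boolean looksValid(String snapshotTime) {
    return snapshotTime != null && RFC3339_ZULU.matcher(snapshotTime).matches();
  }
}
```

A value with a numeric offset such as `+01:00` fails this check, because the pattern requires the trailing `Z`.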

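Running the Cloud Spanner to Cloud Storage Text template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.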

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/

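Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the templates kept in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database that you want to read data from
DATABASE_ID: the Spanner database ID
BUCKET_NAME: the name of your Cloud Storage bucket
INSTANCE_ID: the Spanner instance ID
TABLE_ID: the Spanner table ID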
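
When the command above is generated from code rather than typed by hand, the value passed to `--parameters` is a comma-separated list of `name=value` pairs. The sketch below assembles that string. The class `GcloudParametersSketch` is a hypothetical helper, and it assumes that no value contains a comma, since a comma would otherwise be read as a pair separator.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that formats template parameters for the --parameters flag. */
final class GcloudParametersSketch {
  private GcloudParametersSketch() {}

  /** Joins entries as name=value pairs separated by commas, in the map's iteration order. */
  static String join(Map<String, String> parameters) {
    StringJoiner joined = new StringJoiner(",");
    for (Map.Entry<String, String> entry : parameters.entrySet()) {
      if (entry.getValue().contains(",")) {
        // Assumption of this sketch: values with commas are rejected rather than escaped.
        throw new IllegalArgumentException("Value contains a comma: " + entry.getKey());
      }
      joined.add(entry.getKey() + "=" + entry.getValue());
    }
    return joined.toString();
  }
}
```

Passing a `LinkedHashMap` keeps the pairs in insertion order, which makes the generated command easier to compare with the one shown above.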

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "SPANNER_PROJECT_ID",
       "spannerDatabaseId": "DATABASE_ID",
       "spannerInstanceId": "INSTANCE_ID",
       "spannerTable": "TABLE_ID",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

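Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the templates kept in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database that you want to read data from
DATABASE_ID: the Spanner database ID
BUCKET_NAME: the name of your Cloud Storage bucket
INSTANCE_ID: the Spanner instance ID
TABLE_ID: the Spanner table ID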
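
The REST call above can also be prepared from Java. The following is a minimal sketch that uses the JDK's `java.net.http` API (Java 11 or later) to build the launch request without sending it. The class `LaunchRequestSketch` and its method names are hypothetical. The access token is assumed to be an OAuth 2.0 bearer token obtained separately, and the JSON is assembled by plain string concatenation on the assumption that no value needs JSON escaping. The optional `environment` field is omitted.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch that builds (but does not send) a templates:launch request. */
final class LaunchRequestSketch {
  private LaunchRequestSketch() {}

  /** Builds the launch URL in the form shown in the REST example. */
  static URI launchUri(String projectId, String location, String version, String templateName) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version + "/" + templateName);
  }

  /** Builds the JSON body; assumes names and values need no JSON escaping. */
  static String body(String jobName, Map<String, String> parameters) {
    StringJoiner fields = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> entry : parameters.entrySet()) {
      fields.add("\"" + entry.getKey() + "\":\"" + entry.getValue() + "\"");
    }
    return "{\"jobName\":\"" + jobName + "\",\"parameters\":" + fields + "}";
  }

  /** Wraps the URL and body in an authorized POST request. */
  static HttpRequest build(URI uri, String accessToken, String jsonBody) {
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
        .build();
  }
}
```

The resulting `HttpRequest` can be sent with `java.net.http.HttpClient`. In real code, a JSON library is a safer choice than string concatenation for building the body.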

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import static com.google.cloud.teleport.util.ValueProviderUtils.eitherOrValueProvider;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.SpannerToText.SpannerToTextOptions;
import com.google.cloud.teleport.templates.common.SpannerConverters;
import com.google.cloud.teleport.templates.common.SpannerConverters.CreateTransactionFnWithTimestamp;
import com.google.cloud.teleport.templates.common.SpannerConverters.SpannerReadOptions;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemWriteOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.io.gcp.spanner.LocalSpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.ReadOperation;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.Transaction;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Dataflow template which copies a Spanner table to a Text sink. It exports a Spanner table using
 * <a href="https://cloud.google.com/spanner/docs/reads#read_data_in_parallel">Batch API</a>, which
 * creates multiple workers in parallel for better performance. The result is written to a CSV file
 * in Google Cloud Storage. The table schema file is saved in json format along with the exported
 * table.
 *
 * <p>Schema file sample: { "id":"INT64", "name":"STRING(MAX)" }
 *
 * <p>A sample run:
 *
 * <pre>
 * mvn compile exec:java \
 *   -Dexec.mainClass=com.google.cloud.teleport.templates.SpannerToText \
 *   -Dexec.args="--runner=DataflowRunner \
 *                --spannerProjectId=projectId \
 *                --gcpTempLocation=gs://gsTmpLocation \
 *                --spannerInstanceId=instanceId \
 *                --spannerDatabaseId=databaseId \
 *                --spannerTable=table_name \
 *                --spannerSnapshotTime=snapshot_time \
 *                --textWritePrefix=gcsOutputPath"
 * </pre>
 */
@Template(
    name = "Spanner_to_GCS_Text",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Spanner to Text Files on Cloud Storage",
    description =
        "A pipeline which reads in Cloud Spanner table and writes it to Cloud Storage as CSV text files.",
    optionsClass = SpannerToTextOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class SpannerToText {

  private static final Logger LOG = LoggerFactory.getLogger(SpannerToText.class);

  /** Custom PipelineOptions. */
  public interface SpannerToTextOptions
      extends PipelineOptions, SpannerReadOptions, FilesystemWriteOptions {

    @TemplateParameter.GcsWriteFolder(
        order = 1,
        optional = true,
        description = "Cloud Storage temp directory for storing CSV files",
        helpText = "The Cloud Storage path where the temporary CSV files can be stored.",
        example = "gs://your-bucket/your-path")
    ValueProvider<String> getCsvTempDirectory();

    @SuppressWarnings("unused")
    void setCsvTempDirectory(ValueProvider<String> value);

    @TemplateParameter.Enum(
        order = 2,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  /**
   * Runs a pipeline which reads in Records from Spanner, and writes the CSV to TextIO sink.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    LOG.info("Starting pipeline setup");
    PipelineOptionsFactory.register(SpannerToTextOptions.class);
    SpannerToTextOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(SpannerToTextOptions.class);

    FileSystems.setDefaultPipelineOptions(options);
    Pipeline pipeline = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            .withHost(options.getSpannerHost())
            .withProjectId(options.getSpannerProjectId())
            .withInstanceId(options.getSpannerInstanceId())
            .withDatabaseId(options.getSpannerDatabaseId())
            .withRpcPriority(options.getSpannerPriority());

    PTransform<PBegin, PCollection<ReadOperation>> spannerExport =
        SpannerConverters.ExportTransformFactory.create(
            options.getSpannerTable(),
            spannerConfig,
            options.getTextWritePrefix(),
            options.getSpannerSnapshotTime());

    /* CreateTransaction and CreateTransactionFn classes in LocalSpannerIO
     * only take a timestamp object for exact staleness which works when
     * parameters are provided during template compile time. They do not work with
     * a Timestamp valueProvider which can take parameters at runtime. Hence a new
     * ParDo class CreateTransactionFnWithTimestamp had to be created for this
     * purpose.
     */
    PCollectionView<Transaction> tx =
        pipeline
            .apply("Setup for Transaction", Create.of(1))
            .apply(
                "Create transaction",
                ParDo.of(
                    new CreateTransactionFnWithTimestamp(
                        spannerConfig, options.getSpannerSnapshotTime())))
            .apply("As PCollectionView", View.asSingleton());

    PCollection<String> csv =
        pipeline
            .apply("Create export", spannerExport)
            // We need to use LocalSpannerIO.readAll() instead of LocalSpannerIO.read()
            // because ValueProvider parameters such as table name required for
            // LocalSpannerIO.read() can be read only inside DoFn but LocalSpannerIO.read() is of
            // type PTransform<PBegin, Struct>, which prevents prepending it with DoFn that reads
            // these parameters at the pipeline execution time.
            .apply(
                "Read all records",
                LocalSpannerIO.readAll().withTransaction(tx).withSpannerConfig(spannerConfig))
            .apply(
                "Struct To Csv",
                MapElements.into(TypeDescriptors.strings())
                    .via(struct -> (new SpannerConverters.StructCsvPrinter()).print(struct)));

    ValueProvider<ResourceId> tempDirectoryResource =
        ValueProvider.NestedValueProvider.of(
            eitherOrValueProvider(options.getCsvTempDirectory(), options.getTextWritePrefix()),
            (SerializableFunction<String, ResourceId>) s -> FileSystems.matchNewResource(s, true));

    csv.apply(
        "Write to storage",
        TextIO.write()
            .to(options.getTextWritePrefix())
            .withSuffix(".csv")
            .withTempDirectory(tempDirectoryResource));

    pipeline.run();
    LOG.info("Completed pipeline setup");
  }
}

Cloud Storage Avro to Bigtable

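The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist and have the same column families as those exported in the Avro files.
The input Avro files must exist in a Cloud Storage bucket before you run the pipeline.
Bigtable expects a specific schema from the input Avro files.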

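Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`inputFilePattern`	The Cloud Storage path pattern where the data is located, for example `gs://mybucket/somefolder/prefix*`.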

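Running the Cloud Storage Avro file to Bigtable template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click Run job.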

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

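Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the templates kept in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*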

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

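Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the templates kept in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*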

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.SetCell;
import com.google.cloud.teleport.bigtable.AvroToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.common.base.MoreObjects;
import com.google.common.collect.ImmutableList;
import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Dataflow pipeline that imports data from Avro files in GCS to a Cloud Bigtable table. The Cloud
 * Bigtable table must be created before running the pipeline and must have a compatible table
 * schema. For example, if {@link BigtableCell} from the Avro files has a 'family' of "f1", the
 * Bigtable table should have a column family of "f1".
 */
@Template(
    name = "GCS_Avro_to_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Avro Files on Cloud Storage to Cloud Bigtable",
    description =
        "A pipeline which reads data from Avro files in Cloud Storage and writes it to Cloud Bigtable table.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public final class AvroToBigtable {
  private static final Logger LOG = LoggerFactory.getLogger(AvroToBigtable.class);

  /** Maximum number of mutations allowed per row by Cloud bigtable. */
  private static final int MAX_MUTATIONS_PER_ROW = 100000;

  private static final Boolean DEFAULT_SPLIT_LARGE_ROWS = false;

  /** Options for the import pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 4,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to write")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsReadFile(
        order = 5,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.avro")
    ValueProvider<String> getInputFilePattern();

    @SuppressWarnings("unused")
    void setInputFilePattern(ValueProvider<String> inputFilePattern);

    @TemplateParameter.Boolean(
        order = 6,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to import Avro files in GCS to a Cloud Bigtable table.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    PipelineResult result = run(options);

    // Wait for pipeline to finish only if it is not constructing a template.
    if (options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Write write =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    pipeline
        .apply("Read from Avro", AvroIO.read(BigtableRow.class).from(options.getInputFilePattern()))
        .apply(
            "Transform to Bigtable",
            ParDo.of(
                AvroToBigtableFn.createWithSplitLargeRows(
                    options.getSplitLargeRows(), MAX_MUTATIONS_PER_ROW)))
        .apply("Write to Bigtable", write);

    return pipeline.run();
  }

  /**
   * Translates {@link BigtableRow} to {@link Mutation}s along with a row key. The mutations are
   * {@link SetCell}s that set the value for specified cells with family name, column qualifier and
   * timestamp.
   */
  static class AvroToBigtableFn extends DoFn<BigtableRow, KV<ByteString, Iterable<Mutation>>> {
    private final ValueProvider<Boolean> splitLargeRowsFlag;
    private Boolean splitLargeRows;
    private final int maxMutationsPerRow;

    public static AvroToBigtableFn create() {
      return new AvroToBigtableFn(StaticValueProvider.of(false), MAX_MUTATIONS_PER_ROW);
    }

    public static AvroToBigtableFn createWithSplitLargeRows(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      return new AvroToBigtableFn(splitLargeRowsFlag, maxMutationsPerRequest);
    }

    private AvroToBigtableFn(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      this.splitLargeRowsFlag = splitLargeRowsFlag;
      this.maxMutationsPerRow = maxMutationsPerRequest;
    }

    @Setup
    public void setup() {
      if (splitLargeRowsFlag != null) {
        splitLargeRows = splitLargeRowsFlag.get();
      }
      splitLargeRows = MoreObjects.firstNonNull(splitLargeRows, DEFAULT_SPLIT_LARGE_ROWS);
      LOG.info("splitLargeRows set to: " + splitLargeRows);
    }

    @ProcessElement
    public void processElement(
        @Element BigtableRow row, OutputReceiver<KV<ByteString, Iterable<Mutation>>> out) {
      ByteString key = toByteString(row.getKey());
      // BulkMutation doesn't split rows. Currently, if a single row contains more than 100,000
      // mutations, the service will fail the request.
      ImmutableList.Builder<Mutation> mutations = ImmutableList.builder();
      int cellsProcessed = 0;
      for (BigtableCell cell : row.getCells()) {
        SetCell setCell =
            SetCell.newBuilder()
                .setFamilyName(cell.getFamily().toString())
                .setColumnQualifier(toByteString(cell.getQualifier()))
                .setTimestampMicros(cell.getTimestamp())
                .setValue(toByteString(cell.getValue()))
                .build();

        mutations.add(Mutation.newBuilder().setSetCell(setCell).build());
        cellsProcessed++;

        if (this.splitLargeRows && cellsProcessed % maxMutationsPerRow == 0) {
          // Send a MutateRow request when we have accumulated max mutations per row.
          out.output(KV.of(key, mutations.build()));
          mutations = ImmutableList.builder();
        }
      }

      // Flush any remaining mutations.
      ImmutableList<Mutation> remainingMutations = mutations.build();
      if (!remainingMutations.isEmpty()) {
        out.output(KV.of(key, remainingMutations));
      }
    }
  }

  /** Copies the content in {@code byteBuffer} into a {@link ByteString}. */
  protected static ByteString toByteString(ByteBuffer byteBuffer) {
    return ByteString.copyFrom(byteBuffer.array());
  }
}
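
The `toByteString` helper above copies `byteBuffer.array()`, which returns the buffer's whole backing array regardless of its position and limit. That is fine when each buffer wraps exactly its own bytes. The sketch below is not the template's behavior: it only illustrates, under that assumption, how one might copy just the readable region. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper, not part of the template: copies only the readable bytes of a buffer. */
final class ByteBufferCopySketch {

  /** Copies the bytes between position and limit without moving the caller's position. */
  static byte[] copyRemaining(ByteBuffer buffer) {
    ByteBuffer view = buffer.duplicate(); // independent position and limit, shared content
    byte[] out = new byte[view.remaining()];
    view.get(out);
    return out;
  }

  public static void main(String[] args) {
    byte[] backing = {1, 2, 3, 4};
    // A buffer that exposes only the two middle bytes of the backing array.
    ByteBuffer slice = ByteBuffer.wrap(backing, 1, 2);
    System.out.println(Arrays.toString(slice.array())); // the whole backing array
    System.out.println(Arrays.toString(copyRemaining(slice))); // the readable region only
  }
}
```

The two calls differ only for buffers whose position or limit does not span the whole backing array.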

Cloud Storage Avro to Cloud Spanner

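The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files exported from Cloud Spanner and stored in Cloud Storage, and imports them into a Cloud Spanner database.

Requirements for this pipeline:

The target Cloud Spanner database must exist and must be empty.
You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of the files to import.

Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database.
`databaseId`	The database ID of the Cloud Spanner database.
`inputDir`	The Cloud Storage path that the Avro files are imported from.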

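Running the Cloud Storage Avro to Cloud Spanner template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
For the job to appear on the Spanner Instances page of the Google Cloud console, the job name must match the following format: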
```
cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME
```
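Replace the following:
- SPANNER_INSTANCE_ID: the ID of your Spanner instance
- SPANNER_DATABASE_NAME: the name of your Spanner database
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Spanner template.
In the provided parameter fields, enter your parameter values.
Click Run job.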
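
The job name format above can be assembled programmatically. The following is a minimal sketch, assuming the instance ID and database name follow the same patterns as the `instanceId` and `databaseId` parameter regexes in the template source. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the template: builds the job name format shown above. */
final class SpannerImportJobNameSketch {
  // Patterns copied from the instanceId and databaseId parameter regexes in the template source.
  private static final Pattern INSTANCE_ID = Pattern.compile("^[a-z0-9\\-]+$");
  private static final Pattern DATABASE_ID = Pattern.compile("^[a-z_0-9\\-]+$");

  /** Returns cloud-spanner-import-INSTANCE-DATABASE after checking both identifiers. */
  static String build(String instanceId, String databaseName) {
    if (!INSTANCE_ID.matcher(instanceId).matches()) {
      throw new IllegalArgumentException("Unexpected instance ID: " + instanceId);
    }
    if (!DATABASE_ID.matcher(databaseName).matches()) {
      throw new IllegalArgumentException("Unexpected database name: " + databaseName);
    }
    return "cloud-spanner-import-" + instanceId + "-" + databaseName;
  }

  public static void main(String[] args) {
    System.out.println(build("my-instance", "my-database"));
  }
}
```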

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
inputDir=GCS_DIRECTORY

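Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
GCS_STAGING_LOCATION: the Cloud Storage path for writing temporary files, for example gs://mybucket/temp
INSTANCE_ID: the ID of the Spanner instance that contains the database
DATABASE_ID: the ID of the Spanner database to import into
GCS_DIRECTORY: the Cloud Storage path that the Avro files are imported from, for example gs://mybucket/somefolder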

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "inputDir": "GCS_DIRECTORY"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

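Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INSTANCE_ID: the ID of the Spanner instance that contains the database
DATABASE_ID: the ID of the Spanner database to import into
GCS_DIRECTORY: the Cloud Storage path that the Avro files are imported from, for example gs://mybucket/somefolder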
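
The request above can be assembled in code before it is sent. The following is a minimal sketch that only builds the URL and the JSON body. It assumes the argument values contain no characters that need JSON escaping, and it leaves authentication (an OAuth 2.0 access token) and the actual HTTP call to the caller. The class and method names are hypothetical.

```java
import java.net.URI;

/** Hypothetical helper, not part of the template: assembles the launch request shown above. */
final class TemplateLaunchRequestSketch {

  /** Builds the templates:launch URL for the GCS_Avro_to_Cloud_Spanner template. */
  static URI launchUri(String projectId, String location, String version) {
    return URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
            + "/GCS_Avro_to_Cloud_Spanner");
  }

  /** Builds the JSON request body. Assumes no argument needs JSON escaping. */
  static String launchBody(String jobName, String instanceId, String databaseId, String inputDir) {
    return String.join(
        "\n",
        "{",
        "  \"jobName\": \"" + jobName + "\",",
        "  \"parameters\": {",
        "    \"instanceId\": \"" + instanceId + "\",",
        "    \"databaseId\": \"" + databaseId + "\",",
        "    \"inputDir\": \"" + inputDir + "\"",
        "  },",
        "  \"environment\": {\"machineType\": \"n1-standard-2\"}",
        "}");
  }

  public static void main(String[] args) {
    System.out.println("POST " + launchUri("my-project", "us-central1", "latest"));
    System.out.println(launchBody("my-job", "my-instance", "my-database", "gs://mybucket/somefolder"));
  }
}
```

The resulting URL and body can then be sent with any HTTP client, with the access token passed in an Authorization header.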

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.spanner;

import com.google.cloud.spanner.Options.RpcPriority;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateCreationParameter;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.spanner.ImportPipeline.Options;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

/** Avro to Cloud Spanner Import pipeline. */
@Template(
    name = "GCS_Avro_to_Cloud_Spanner",
    category = TemplateCategory.BATCH,
    displayName = "Avro Files on Cloud Storage to Cloud Spanner",
    description =
        "A pipeline to import a Cloud Spanner database from a set of Avro files in Cloud Storage.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class ImportPipeline {

  /** Options for {@link ImportPipeline}. */
  public interface Options extends PipelineOptions {

    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[a-z0-9\\-]+$"},
        description = "Cloud Spanner instance id",
        helpText = "The instance id of the Cloud Spanner database that you want to import to.")
    ValueProvider<String> getInstanceId();

    void setInstanceId(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"^[a-z_0-9\\-]+$"},
        description = "Cloud Spanner database id",
        helpText =
            "The database id of the Cloud Spanner database that you want to import into (must already exist).")
    ValueProvider<String> getDatabaseId();

    void setDatabaseId(ValueProvider<String> value);

    @TemplateParameter.GcsReadFolder(
        order = 3,
        description = "Cloud storage input directory",
        helpText = "The Cloud Storage path where the Avro files should be imported from.")
    ValueProvider<String> getInputDir();

    void setInputDir(ValueProvider<String> value);

    @TemplateParameter.Text(
        order = 4,
        optional = true,
        description = "Cloud Spanner Endpoint to call",
        helpText = "The Cloud Spanner endpoint to call in the template. Only used for testing.",
        example = "https://batch-spanner.googleapis.com")
    @Default.String("https://batch-spanner.googleapis.com")
    ValueProvider<String> getSpannerHost();

    void setSpannerHost(ValueProvider<String> value);

    @TemplateParameter.Boolean(
        order = 5,
        optional = true,
        description = "Wait for Indexes",
        helpText =
            "By default the import pipeline is not blocked on index creation, and it "
                + "may complete with indexes still being created in the background. In testing, it may "
                + "be useful to set this option to true so that the pipeline waits until indexes are "
                + "finished.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getWaitForIndexes();

    void setWaitForIndexes(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 6,
        optional = true,
        description = "Wait for Foreign Keys",
        helpText =
            "By default the import pipeline is not blocked on foreign key creation, and it may complete"
                + " with foreign keys still being created in the background. In testing, it may be"
                + " useful to set this option to true so that the pipeline waits until foreign keys"
                + " are finished.")
    @Default.Boolean(false)
    ValueProvider<Boolean> getWaitForForeignKeys();

    void setWaitForForeignKeys(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 7,
        optional = true,
        description = "Wait for Change Streams",
        helpText =
            "By default the import pipeline is blocked on change stream creation. If false, it may"
                + " complete with change streams still being created in the background.")
    @Default.Boolean(true)
    ValueProvider<Boolean> getWaitForChangeStreams();

    void setWaitForChangeStreams(ValueProvider<Boolean> value);

    @TemplateParameter.Boolean(
        order = 8,
        optional = true,
        description = "Create Indexes early",
        helpText =
            "Flag to turn off early index creation if there are many indexes. Indexes and Foreign keys are created after dataload. If there are more than "
                + "40 DDL statements to be executed after dataload, it is preferable to create the "
                + "indexes before dataload. This is the flag to turn the feature off.")
    @Default.Boolean(true)
    ValueProvider<Boolean> getEarlyIndexCreateFlag();

    void setEarlyIndexCreateFlag(ValueProvider<Boolean> value);

    @TemplateCreationParameter(value = "false")
    @Description("If true, wait for job finish")
    @Default.Boolean(true)
    boolean getWaitUntilFinish();

    @TemplateParameter.ProjectId(
        order = 9,
        optional = true,
        description = "Cloud Spanner Project Id",
        helpText = "The project id of the Cloud Spanner instance.")
    ValueProvider<String> getSpannerProjectId();

    void setSpannerProjectId(ValueProvider<String> value);

    void setWaitUntilFinish(boolean value);

    @TemplateParameter.Text(
        order = 10,
        optional = true,
        regexes = {"[0-9]+"},
        description = "DDL Creation timeout in minutes",
        helpText = "DDL Creation timeout in minutes.")
    @Default.Integer(30)
    ValueProvider<Integer> getDDLCreationTimeoutInMinutes();

    void setDDLCreationTimeoutInMinutes(ValueProvider<Integer> value);

    @TemplateParameter.Enum(
        order = 11,
        enumOptions = {"LOW", "MEDIUM", "HIGH"},
        optional = true,
        description = "Priority for Spanner RPC invocations",
        helpText =
            "The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW].")
    ValueProvider<RpcPriority> getSpannerPriority();

    void setSpannerPriority(ValueProvider<RpcPriority> value);
  }

  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    Pipeline p = Pipeline.create(options);

    SpannerConfig spannerConfig =
        SpannerConfig.create()
            // Temporary fix explicitly setting SpannerConfig.projectId to the default project
            // if spannerProjectId is not provided as a parameter. Required as of Beam 2.38,
            // which no longer accepts null label values on metrics, and SpannerIO#setup() has
            // a bug resulting in the label value being set to the original parameter value,
            // with no fallback to the default project.
            // TODO: remove NestedValueProvider when this is fixed in Beam.
            .withProjectId(
                NestedValueProvider.of(
                    options.getSpannerProjectId(),
                    (SerializableFunction<String, String>)
                        input -> input != null ? input : SpannerOptions.getDefaultProjectId()))
            .withHost(options.getSpannerHost())
            .withInstanceId(options.getInstanceId())
            .withDatabaseId(options.getDatabaseId())
            .withRpcPriority(options.getSpannerPriority());

    p.apply(
        new ImportTransform(
            spannerConfig,
            options.getInputDir(),
            options.getWaitForIndexes(),
            options.getWaitForForeignKeys(),
            options.getWaitForChangeStreams(),
            options.getEarlyIndexCreateFlag(),
            options.getDDLCreationTimeoutInMinutes()));

    PipelineResult result = p.run();

    if (options.getWaitUntilFinish()
        &&
        /* Only if template location is null, there is a dataflow job to wait for. Else it's
         * template generation which doesn't start a dataflow job.
         */
        options.as(DataflowPipelineOptions.class).getTemplateLocation() == null) {
      result.waitUntilFinish();
    }
  }
}

Cloud Storage Parquet to Bigtable

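The Cloud Storage Parquet to Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist and have the same column families as those exported in the Parquet files.
The input Parquet files must exist in a Cloud Storage bucket before you run the pipeline.
Bigtable expects a specific schema from the input Parquet files.

Template parameters

Parameter	Description
`bigtableProjectId`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`inputFilePattern`	The Cloud Storage path pattern where the data is located, for example `gs://mybucket/somefolder/prefix*`.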

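Running the Cloud Storage Parquet file to Bigtable template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Parquet Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click Run job.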

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN

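Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*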

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

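Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
INPUT_FILE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*

Template source code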

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import static com.google.cloud.teleport.bigtable.AvroToBigtable.toByteString;

import com.google.bigtable.v2.Mutation;
import com.google.cloud.teleport.bigtable.ParquetToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.protobuf.ByteString;
import java.nio.ByteBuffer;
import java.util.List;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects;
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link ParquetToBigtable} pipeline imports data from Parquet files in GCS to a Cloud Bigtable
 * table. The Cloud Bigtable table must be created before running the pipeline and must have a
 * compatible table schema. For example, if {@link BigtableCell} from the Parquet files has a
 * 'family' of "f1", the Bigtable table should have a column family of "f1".
 *
 * <p><b>Pipeline Requirements</b>
 *
 * <ul>
 *   <li>Bigtable instance.
 *   <li>Bigtable table with compatible table schema.
 *   <li>Google Cloud Storage input bucket and parquet file(s) exists.
 * </ul>
 *
 * <p><b>Example Usage</b>
 *
 * <pre>
 *
 * # Set the pipeline vars
 * PROJECT_ID=PROJECT ID HERE
 * PIPELINE_FOLDER=gs://${PROJECT_ID}/dataflow/pipelines/parquet-to-bigtable
 * BIGTABLE_INSTANCE_ID=BIGTABLE INSTANCE ID HERE
 * BIGTABLE_TABLE_ID=BIGTABLE TABLE ID HERE
 *
 * # Set the runner
 * RUNNER=DataflowRunner
 *
 * # Build the template
 * mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.bigtable.ParquetToBigtable \
 * -Dexec.cleanupDaemonThreads=false \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=${PIPELINE_FOLDER}/staging \
 * --tempLocation=${PIPELINE_FOLDER}/temp \
 * --templateLocation=${PIPELINE_FOLDER}/template \
 * --runner=${RUNNER}"
 *
 * # Execute the template
 * JOB_NAME=parquet-to-bigtable-$USER-`date +"%Y%m%d-%H%M%S%z"`
 *
 * gcloud dataflow jobs run ${JOB_NAME} \
 * --gcs-location=${PIPELINE_FOLDER}/template \
 * --zone=us-east1-d \
 * --parameters \
 * "bigtableProjectId=${PROJECT_ID},\
 * bigtableInstanceId=${BIGTABLE_INSTANCE_ID},\
 * bigtableTableId=${BIGTABLE_TABLE_ID},\
 * inputFilePattern=${PIPELINE_FOLDER}/path/to/file/filename-*.parquet"
 * </pre>
 */
@Template(
    name = "GCS_Parquet_to_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Parquet Files on Cloud Storage to Cloud Bigtable",
    description =
        "A pipeline which reads data from Parquet files in Cloud Storage and writes it to Cloud Bigtable table.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class ParquetToBigtable {
  private static final Logger LOG = LoggerFactory.getLogger(ParquetToBigtable.class);

  /** Maximum number of mutations allowed per row by Cloud bigtable. */
  private static final int MAX_MUTATIONS_PER_ROW = 100000;

  private static final Boolean DEFAULT_SPLIT_LARGE_ROWS = false;

  /** Options for the import pipeline. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.ProjectId(
        order = 1,
        description = "Project ID",
        helpText =
            "The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 2,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Instance ID",
        helpText = "The ID of the Cloud Bigtable instance that contains the table")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> instanceId);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Table ID",
        helpText = "The ID of the Cloud Bigtable table to write")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> tableId);

    @TemplateParameter.GcsReadFile(
        order = 4,
        description = "Input Cloud Storage File(s)",
        helpText = "The Cloud Storage location of the files you'd like to process.",
        example = "gs://your-bucket/your-files/*.parquet")
    ValueProvider<String> getInputFilePattern();

    @SuppressWarnings("unused")
    void setInputFilePattern(ValueProvider<String> inputFilePattern);

    @TemplateParameter.Boolean(
        order = 5,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to import Parquet files in GCS to a Cloud Bigtable table.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    PipelineResult result = run(options);
  }

  public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    BigtableIO.Write write =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    /**
     * Steps: 1) Read records from Parquet File. 2) Convert a GenericRecord to a
     * KV<ByteString,Iterable<Mutation>>. 3) Write KV to Bigtable's table.
     */
    pipeline
        .apply(
            "Read from Parquet",
            ParquetIO.read(BigtableRow.getClassSchema()).from(options.getInputFilePattern()))
        .apply(
            "Transform to Bigtable",
            ParDo.of(
                ParquetToBigtableFn.createWithSplitLargeRows(
                    options.getSplitLargeRows(), MAX_MUTATIONS_PER_ROW)))
        .apply("Write to Bigtable", write);

    return pipeline.run();
  }

  static class ParquetToBigtableFn extends DoFn<GenericRecord, KV<ByteString, Iterable<Mutation>>> {

    private final ValueProvider<Boolean> splitLargeRowsFlag;
    private Boolean splitLargeRows;
    private final int maxMutationsPerRow;

    public static ParquetToBigtableFn create() {
      return new ParquetToBigtableFn(StaticValueProvider.of(false), MAX_MUTATIONS_PER_ROW);
    }

    public static ParquetToBigtableFn createWithSplitLargeRows(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      return new ParquetToBigtableFn(splitLargeRowsFlag, maxMutationsPerRequest);
    }

    @Setup
    public void setup() {
      if (splitLargeRowsFlag != null) {
        splitLargeRows = splitLargeRowsFlag.get();
      }
      splitLargeRows = MoreObjects.firstNonNull(splitLargeRows, DEFAULT_SPLIT_LARGE_ROWS);
      LOG.info("splitLargeRows set to: " + splitLargeRows);
    }

    private ParquetToBigtableFn(
        ValueProvider<Boolean> splitLargeRowsFlag, int maxMutationsPerRequest) {
      this.splitLargeRowsFlag = splitLargeRowsFlag;
      this.maxMutationsPerRow = maxMutationsPerRequest;
    }

    @ProcessElement
    public void processElement(ProcessContext ctx) {
      Class runner = ctx.getPipelineOptions().getRunner();
      ByteString key = toByteString((ByteBuffer) ctx.element().get(0));

      // BulkMutation doesn't split rows. Currently, if a single row contains more than 100,000
      // mutations, the service will fail the request.
      ImmutableList.Builder<Mutation> mutations = ImmutableList.builder();
      List<Object> cells = (List) ctx.element().get(1);
      int cellsProcessed = 0;
      for (Object element : cells) {
        Mutation.SetCell setCell = null;
        if (runner.isAssignableFrom(DirectRunner.class)) {
          setCell =
              Mutation.SetCell.newBuilder()
                  .setFamilyName(((GenericData.Record) element).get(0).toString())
                  .setColumnQualifier(
                      toByteString((ByteBuffer) ((GenericData.Record) element).get(1)))
                  .setTimestampMicros((Long) ((GenericData.Record) element).get(2))
                  .setValue(toByteString((ByteBuffer) ((GenericData.Record) element).get(3)))
                  .build();
        } else {
          BigtableCell bigtableCell = (BigtableCell) element;
          setCell =
              Mutation.SetCell.newBuilder()
                  .setFamilyName(bigtableCell.getFamily().toString())
                  .setColumnQualifier(toByteString(bigtableCell.getQualifier()))
                  .setTimestampMicros(bigtableCell.getTimestamp())
                  .setValue(toByteString(bigtableCell.getValue()))
                  .build();
        }
        mutations.add(Mutation.newBuilder().setSetCell(setCell).build());
        cellsProcessed++;

        if (this.splitLargeRows && cellsProcessed % maxMutationsPerRow == 0) {
          // Send a MutateRow request when we have accumulated max mutations per row.
          ctx.output(KV.of(key, mutations.build()));
          mutations = ImmutableList.builder();
        }
      }

      // Flush any remaining mutations.
      ImmutableList<Mutation> remainingMutations = mutations.build();
      if (!remainingMutations.isEmpty()) {
        ctx.output(KV.of(key, remainingMutations));
      }
    }
  }
}
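
The row-splitting logic in `processElement` above can be looked at in isolation. The following is a minimal sketch with no Beam or Bigtable dependencies: it groups a row's cells into batches of at most `maxPerRequest` entries when splitting is enabled and flushes whatever remains at the end. The class and method names are hypothetical, and generic list elements stand in for `Mutation` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone version of the batching loop; not part of the template. */
final class RowSplitSketch {

  /** Returns the batches that would be emitted for one row's cells. */
  static <T> List<List<T>> split(List<T> cells, int maxPerRequest, boolean splitLargeRows) {
    List<List<T>> batches = new ArrayList<>();
    List<T> current = new ArrayList<>();
    int processed = 0;
    for (T cell : cells) {
      current.add(cell);
      processed++;
      // Emit a batch each time the running count reaches a multiple of the limit.
      if (splitLargeRows && processed % maxPerRequest == 0) {
        batches.add(current);
        current = new ArrayList<>();
      }
    }
    // Flush any remaining cells, mirroring the final output call in the template.
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    System.out.println(split(Arrays.asList(1, 2, 3, 4, 5), 2, true));
    System.out.println(split(Arrays.asList(1, 2, 3, 4, 5), 2, false));
  }
}
```

With splitting disabled, every cell of a row ends up in a single batch, which is why a row above the mutation limit fails unless `splitLargeRows` is set.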

Cloud Storage SequenceFile to Bigtable

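The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

The Bigtable table must exist.
The input SequenceFiles must exist in a Cloud Storage bucket before you run the pipeline.
The input SequenceFiles must have been exported from Bigtable or HBase.

Template parameters

Parameter	Description
`bigtableProject`	The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
`bigtableInstanceId`	The ID of the Bigtable instance that contains the table.
`bigtableTableId`	The ID of the Bigtable table to import into.
`bigtableAppProfileId`	The ID of the Bigtable application profile to use for the import. If you don't specify an app profile, Bigtable uses the instance's default app profile.
`sourcePattern`	The Cloud Storage path pattern where the data is located, for example `gs://mybucket/somefolder/prefix*`.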

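Running the Cloud Storage SequenceFile to Bigtable template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click Run job.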

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
sourcePattern=SOURCE_PATTERN

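Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the import
SOURCE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*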

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "sourcePattern": "SOURCE_PATTERN"
   },
   "environment": { "zone": "us-central1-f" }
}

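Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
INSTANCE_ID: the ID of the Bigtable instance that contains the table
TABLE_ID: the ID of the Bigtable table to import into
APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to use for the import
SOURCE_PATTERN: the Cloud Storage path pattern where the data is located, for example gs://mybucket/somefolder/prefix*

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/cloud-bigtable-client repository on GitHub.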

Cloud Storage Text to BigQuery

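The Cloud Storage Text to BigQuery pipeline is a batch pipeline that reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF) that you provide, and appends the result to a BigQuery table.

Note: To overwrite the data in the BigQuery table instead of appending to it, change the WriteDisposition in the template source code from WRITE_APPEND to WRITE_TRUNCATE.

Requirements for this pipeline:

Create a JSON file that describes your BigQuery schema.
Ensure that there is a top-level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

The Cloud Storage Text to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

The following JSON describes an example BigQuery schema: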
```
{
  "BigQuery Schema": [
    {
      "name": "location",
      "type": "STRING"
    },
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "STRING"
    },
    {
      "name": "color",
      "type": "STRING"
    },
    {
      "name": "coffee",
      "type": "STRING"
    }
  ]
}
```
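
Before uploading a schema file, it can be checked against the requirements listed above. The following is a minimal sketch, assuming a hypothetical helper named `checkSchema` that is not part of the template. It only looks for the top-level `BigQuery Schema` array, string `name` and `type` entries, and the unsupported STRUCT (Record) types; it does not validate type names against BigQuery.

```javascript
// Hypothetical pre-flight check: returns a list of problems found in the
// schema file's text. An empty list means no problem was detected.
function checkSchema(jsonText) {
  var parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch (e) {
    return ['not valid JSON: ' + e.message];
  }
  var fields = parsed && parsed['BigQuery Schema'];
  if (!Array.isArray(fields)) {
    return ['missing top-level "BigQuery Schema" array'];
  }
  var problems = [];
  fields.forEach(function (field, i) {
    if (!field || typeof field.name !== 'string' || typeof field.type !== 'string') {
      problems.push('entry ' + i + ': needs string "name" and "type"');
    } else if (/^(STRUCT|RECORD)$/i.test(field.type)) {
      problems.push('entry ' + i + ': STRUCT (Record) fields are not supported');
    }
  });
  return problems;
}

// Made-up sample with one unsupported field.
var sampleSchema = JSON.stringify({
  'BigQuery Schema': [
    { name: 'location', type: 'STRING' },
    { name: 'address', type: 'RECORD' }
  ]
});
console.log(checkSchema(sampleSchema));
```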

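Create a JavaScript (.js) file with your UDF function, which supplies the logic to transform the lines of text. Your function must return a JSON string.

For example, the following function splits each line of a CSV file and returns a JSON string after transforming the values.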

function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}
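
Before uploading the .js file to Cloud Storage, it can help to exercise the function locally, for example with Node.js. The following sketch is a variant of the function above that also trims whitespace around each value. The name `transformTrimmed`, the `columns` list, and the sample line are illustrative assumptions. The code sticks to ES5 syntax on the assumption that the template's JavaScript engine may not support newer language features.

```javascript
// Column order is assumed to match the example schema above.
var columns = ['location', 'name', 'age', 'color', 'coffee'];

function transformTrimmed(line) {
  var values = line.split(',');
  var obj = {};
  for (var i = 0; i < columns.length; i++) {
    // Missing trailing columns become empty strings instead of undefined.
    obj[columns[i]] = (values[i] || '').trim();
  }
  return JSON.stringify(obj);
}

// Local check with a made-up input line.
console.log(transformTrimmed('US, Ana ,30,blue,latte'));
// prints {"location":"US","name":"Ana","age":"30","color":"blue","coffee":"latte"}
```

Note that a plain `split(',')` does not handle quoted CSV fields that contain commas; such input would need a real CSV parser.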

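Template parameters

Parameter	Description
`javascriptTextTransformFunctionName`	The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`JSONPath`	The `gs://` path to the JSON file, stored in Cloud Storage, that defines your BigQuery schema. For example, `gs://path/to/my/schema.json`.
`javascriptTextTransformGcsPath`	The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`inputFilePattern`	The `gs://` path to the text in Cloud Storage that you want to process. For example, `gs://path/to/my/text/data.txt`.
`outputTable`	The name of the BigQuery table to create for storing the processed data. If you reuse an existing BigQuery table, the data is appended to the destination table. For example, `my-project-name:my-dataset.my-table`.
`bigQueryLoadingTemporaryDirectory`	The temporary directory for the BigQuery loading process. For example, `gs://my-bucket/my-files/temp_dir`.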

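Running the Cloud Storage Text to BigQuery template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to BigQuery (Batch) template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: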

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery \
    --region REGION_NAME \
    --parameters \
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

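Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA: the Cloud Storage path to your text dataset
BIGQUERY_TABLE: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory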
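
When the command is generated by a script, the `--parameters` value is the comma-separated `name=value` list shown in the command above. The following is a minimal sketch, assuming a hypothetical helper named `toParametersFlag`. It rejects values that contain a comma, because those would need gcloud's own escaping rules, which are not covered here. All example values are placeholders.

```javascript
// Hypothetical helper: joins template parameters into the value expected
// after --parameters, in the order the keys were defined.
function toParametersFlag(params) {
  return Object.keys(params).map(function (name) {
    var value = String(params[name]);
    if (value.indexOf(',') !== -1) {
      throw new Error('value for ' + name + ' contains a comma');
    }
    return name + '=' + value;
  }).join(',');
}

console.log(toParametersFlag({
  javascriptTextTransformFunctionName: 'transform',
  JSONPath: 'gs://my-bucket/schema.json',
  javascriptTextTransformGcsPath: 'gs://my-bucket/my-udfs/my_file.js',
  inputFilePattern: 'gs://my-bucket/data/*.csv',
  outputTable: 'my-project-name:my-dataset.my-table',
  bigQueryLoadingTemporaryDirectory: 'gs://my-bucket/my-files/temp_dir'
}));
```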

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

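Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
PATH_TO_TEXT_DATA: the Cloud Storage path to your text dataset
BIGQUERY_TABLE: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory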

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.TextIOToBigQuery.Options;
import com.google.cloud.teleport.templates.common.BigQueryConverters;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.json.JSONArray;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Templated pipeline to read text from TextIO, apply a JavaScript UDF to it, and write it to
 * BigQuery.
 */
@Template(
    name = "GCS_Text_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "Text Files on Cloud Storage to BigQuery",
    description =
        "Batch pipeline. Reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF), and outputs the result to BigQuery.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class TextIOToBigQuery {

  /** Options supported by {@link TextIOToBigQuery}. */
  public interface Options extends DataflowPipelineOptions, JavascriptTextTransformerOptions {

    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage Input File(s)",
        helpText = "Path of the file pattern glob to read from.",
        example = "gs://your-bucket/path/*.csv")
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.GcsReadFile(
        order = 2,
        description = "Cloud Storage location of your BigQuery schema file, described as a JSON",
        helpText =
            "JSON file with BigQuery Schema description. JSON Example: {\n"
                + "\t\"BigQuery Schema\": [\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"location\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"name\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"age\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"color\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t},\n"
                + "\t\t{\n"
                + "\t\t\t\"name\": \"coffee\",\n"
                + "\t\t\t\"type\": \"STRING\"\n"
                + "\t\t}\n"
                + "\t]\n"
                + "}")
    ValueProvider<String> getJSONPath();

    void setJSONPath(ValueProvider<String> value);

    @TemplateParameter.BigQueryTable(
        order = 3,
        description = "BigQuery output table",
        helpText =
            "BigQuery table location to write the output to. The table's schema must match the "
                + "input objects.")
    ValueProvider<String> getOutputTable();

    void setOutputTable(ValueProvider<String> value);

    @TemplateParameter.GcsWriteFolder(
        order = 6,
        description = "Temporary directory for BigQuery loading process",
        helpText = "Temporary directory for BigQuery loading process",
        example = "gs://your-bucket/your-files/temp_dir")
    @Validation.Required
    ValueProvider<String> getBigQueryLoadingTemporaryDirectory();

    void setBigQueryLoadingTemporaryDirectory(ValueProvider<String> directory);
  }

  private static final Logger LOG = LoggerFactory.getLogger(TextIOToBigQuery.class);

  private static final String BIGQUERY_SCHEMA = "BigQuery Schema";
  private static final String NAME = "name";
  private static final String TYPE = "type";
  private static final String MODE = "mode";

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("Read from source", TextIO.read().from(options.getInputFilePattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(BigQueryConverters.jsonToTableRow())
        .apply(
            "Insert into Bigquery",
            BigQueryIO.writeTableRows()
                .withSchema(
                    NestedValueProvider.of(
                        options.getJSONPath(),
                        new SerializableFunction<String, TableSchema>() {

                          @Override
                          public TableSchema apply(String jsonPath) {

                            TableSchema tableSchema = new TableSchema();
                            List<TableFieldSchema> fields = new ArrayList<>();
                            SchemaParser schemaParser = new SchemaParser();
                            JSONObject jsonSchema;

                            try {

                              jsonSchema = schemaParser.parseSchema(jsonPath);

                              JSONArray bqSchemaJsonArray =
                                  jsonSchema.getJSONArray(BIGQUERY_SCHEMA);

                              for (int i = 0; i < bqSchemaJsonArray.length(); i++) {
                                JSONObject inputField = bqSchemaJsonArray.getJSONObject(i);
                                TableFieldSchema field =
                                    new TableFieldSchema()
                                        .setName(inputField.getString(NAME))
                                        .setType(inputField.getString(TYPE));

                                if (inputField.has(MODE)) {
                                  field.setMode(inputField.getString(MODE));
                                }

                                fields.add(field);
                              }
                              tableSchema.setFields(fields);

                            } catch (Exception e) {
                              throw new RuntimeException(e);
                            }
                            return tableSchema;
                          }
                        }))
                .to(options.getOutputTable())
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory()));

    pipeline.run();
  }
}

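Cloud Storage Text to Datastore [Deprecated]

This template is deprecated and will be removed in Q1 2022. Please migrate to the Cloud Storage Text to Firestore template.

The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore. Every line in the input text files must be in the specified JSON format.

Requirements for this pipeline:

Datastore must be enabled in the destination project.

Template parameters

Parameter	Description
`textReadPattern`	A Cloud Storage path pattern that specifies the location of your text data files. For example, `gs://mybucket/somepath/*.json`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`datastoreWriteProjectId`	The ID of the Google Cloud project to write the Datastore entities to.
`datastoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Datastore ramp-up throttling step. Default is `500`.
`errorWritePath`	The error log output file to use for write failures that occur during processing. For example, `gs://bucket-name/errors.txt`.

Running the Cloud Storage Text to Datastore template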

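Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Datastore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: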

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Datastore \
    --region REGION_NAME \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

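Replace the following:

PROJECT_ID: the ID of the Google Cloud project to write the Datastore entities to
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage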

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "datastoreWriteProjectId": "PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

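Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage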

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToDatastore.TextToDatastoreOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemReadOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.TupleTag;

/**
 * Dataflow template which reads from a Text Source and writes JSON encoded Entities into Datastore.
 * The Json is expected to be in the format of:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@MultiTemplate({
  @Template(
      name = "GCS_Text_to_Datastore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Datastore [Deprecated]",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "firestoreWriteProjectId",
        "firestoreWriteEntityKind",
        "firestoreWriteNamespace",
        "firestoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support"),
  @Template(
      name = "GCS_Text_to_Firestore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Firestore (Datastore mode)",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Firestore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "datastoreWriteProjectId",
        "datastoreWriteEntityKind",
        "datastoreWriteNamespace",
        "datastoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support")
})
public class TextToDatastore {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** TextToDatastore Pipeline Options. */
  public interface TextToDatastoreOptions
      extends PipelineOptions,
          FilesystemReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreWriteOptions,
          ErrorWriteOptions {}

  /**
   * Runs a pipeline which reads from a Text Source, passes the Text to a Javascript UDF, writes the
   * JSON encoded Entities to Datastore.
   *
   * <p>If your Text Source does not contain JSON encoded Entities, then you'll need to supply a
   * Javascript UDF which transforms your data to be JSON encoded Entities.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    TextToDatastoreOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TextToDatastoreOptions.class);

    TupleTag<String> errorTag = new TupleTag<String>("errors") {};

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(TextIO.read().from(options.getTextReadPattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            WriteJsonEntities.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreWriteProjectId(), options.getFirestoreWriteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .setErrorTag(errorTag)
                .build())
        .apply(
            LogErrors.newBuilder()
                .setErrorWritePath(options.getErrorWritePath())
                .setErrorTag(errorTag)
                .build());

    pipeline.run();
  }
}

Cloud Storage Text to Firestore

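The Cloud Storage Text to Firestore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Firestore. Every line in the input text files must be in the specified JSON format.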
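
The template source code in this section points to the Datastore REST v1 Entity resource for the expected JSON shape. If the input lines are not already in that shape, a UDF can build it. The following is a minimal sketch, assuming CSV input lines of the form `id,name,age`. The function name `toEntity`, the kind `Person`, and the property names are hypothetical, and depending on your setup the key might need additional fields (such as a partition ID) that are not shown here.

```javascript
// Hypothetical UDF: turns one CSV line into a JSON-encoded entity.
function toEntity(line) {
  var values = line.split(',');
  var entity = {
    key: {
      path: [{ kind: 'Person', name: values[0] }]
    },
    properties: {
      name: { stringValue: values[1] },
      // The REST representation carries 64-bit integers as strings.
      age: { integerValue: values[2] }
    }
  };
  return JSON.stringify(entity);
}

// Local check with a made-up input line.
console.log(toEntity('p-001,Ana,30'));
```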

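Requirements for this pipeline:

Firestore must be enabled in the destination project.

Template parameters

Parameter	Description
`textReadPattern`	A Cloud Storage path pattern that specifies the location of your text data files. For example, `gs://mybucket/somepath/*.json`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`firestoreWriteProjectId`	The ID of the Google Cloud project to write the Firestore entities to.
`firestoreHintNumWorkers`	(Optional) Hint for the expected number of workers in the Firestore ramp-up throttling step. Default is `500`.
`errorWritePath`	The error log output file to use for write failures that occur during processing. For example, `gs://bucket-name/errors.txt`.

Running the Cloud Storage Text to Firestore template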

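Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Firestore template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template: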

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Firestore \
    --region REGION_NAME \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
firestoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

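Replace the following:

PROJECT_ID: the ID of the Google Cloud project to write the Firestore entities to
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage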

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Firestore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "firestoreWriteProjectId": "PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

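Replace the following:

PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of the templates might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use templates kept in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use
For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) that you want to use, for example, gs://my-bucket/my-udfs/my_file.js
ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage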

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToDatastore.TextToDatastoreOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemReadOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.TupleTag;

/**
 * Dataflow template which reads from a Text Source and writes JSON encoded Entities into Datastore.
 * The Json is expected to be in the format of:
 * https://cloud.google.com/datastore/docs/reference/rest/v1/Entity
 */
@MultiTemplate({
  @Template(
      name = "GCS_Text_to_Datastore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Datastore [Deprecated]",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "firestoreWriteProjectId",
        "firestoreWriteEntityKind",
        "firestoreWriteNamespace",
        "firestoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support"),
  @Template(
      name = "GCS_Text_to_Firestore",
      category = TemplateCategory.BATCH,
      displayName = "Text Files on Cloud Storage to Firestore (Datastore mode)",
      description =
          "Batch pipeline. Reads from text files stored in Cloud Storage and writes JSON-encoded entities to Firestore.",
      optionsClass = TextToDatastoreOptions.class,
      skipOptions = {
        "datastoreWriteProjectId",
        "datastoreWriteEntityKind",
        "datastoreWriteNamespace",
        "datastoreHintNumWorkers"
      },
      contactInformation = "https://cloud.google.com/support")
})
public class TextToDatastore {

  public static <T> ValueProvider<T> selectProvidedInput(
      ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
    return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
  }

  /** TextToDatastore Pipeline Options. */
  public interface TextToDatastoreOptions
      extends PipelineOptions,
          FilesystemReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreWriteOptions,
          ErrorWriteOptions {}

  /**
   * Runs a pipeline which reads from a Text Source, passes the Text to a Javascript UDF, writes the
   * JSON encoded Entities to Datastore.
   *
   * <p>If your Text Source does not contain JSON encoded Entities, then you'll need to supply a
   * Javascript UDF which transforms your data to be JSON encoded Entities.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {
    TextToDatastoreOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TextToDatastoreOptions.class);

    TupleTag<String> errorTag = new TupleTag<String>("errors") {};

    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(TextIO.read().from(options.getTextReadPattern()))
        .apply(
            TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
        .apply(
            WriteJsonEntities.newBuilder()
                .setProjectId(
                    selectProvidedInput(
                        options.getDatastoreWriteProjectId(), options.getFirestoreWriteProjectId()))
                .setHintNumWorkers(
                    selectProvidedInput(
                        options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
                .setErrorTag(errorTag)
                .build())
        .apply(
            LogErrors.newBuilder()
                .setErrorWritePath(options.getErrorWritePath())
                .setErrorTag(errorTag)
                .build());

    pipeline.run();
  }
}

Cloud Storage Text to Pub/Sub (Batch)

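This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. You can use the template to publish records from a newline-delimited file containing JSON records, or from a CSV file, to a Pub/Sub topic for real-time processing. You can also use this template to replay data to Pub/Sub.

This template doesn't set a timestamp on the individual records. The event time equals the publishing time during execution. If your pipeline relies on an accurate event time for processing, don't use this pipeline.

Requirements for this pipeline:

The files to read must be in newline-delimited JSON or CSV format. Records that span multiple lines in the source files can cause issues downstream, because each line in the files is published as a message to Pub/Sub.
The Pub/Sub topic must exist before you run the pipeline.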
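
Because each line becomes one message, a record that is pretty-printed across several lines ends up as several broken messages. For JSON input, this can be spotted before running the pipeline. The following is a minimal sketch, assuming a hypothetical helper named `findInvalidJsonLines` that reports the 1-based numbers of lines that don't parse as standalone JSON; it is not part of the template.

```javascript
// Hypothetical pre-flight check for newline-delimited JSON text.
function findInvalidJsonLines(text) {
  var bad = [];
  text.split('\n').forEach(function (line, index) {
    if (line.trim() === '') {
      return; // Blank lines are ignored in this check.
    }
    try {
      JSON.parse(line);
    } catch (e) {
      bad.push(index + 1);
    }
  });
  return bad;
}

// The second record is split across lines 2 and 3.
var sampleLines = '{"id":1}\n{"id":2,\n"name":"x"}\n{"id":3}';
console.log(findInvalidJsonLines(sampleLines)); // prints [ 2, 3 ]
```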

Template parameters

Parameter	Description
`inputFilePattern`	The input file pattern to read from. For example, `gs://bucket-name/files/*.json`.
`outputTopic`	The Pub/Sub topic to write to. The name must be in the format `projects/<project-id>/topics/<topic-name>`.

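Running the Cloud Storage Text to Pub/Sub (Batch) template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Pub/Sub (Batch) template.
In the provided parameter fields, enter your parameter values.
Click Run job.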

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub \
    --region REGION_NAME \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/files/*.json,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME

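Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
TOPIC_NAME: the name of your Pub/Sub topic
BUCKET_NAME: the name of your Cloud Storage bucket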

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/files/*.json",
       "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

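Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
TOPIC_NAME: the name of your Pub/Sub topic
BUCKET_NAME: the name of your Cloud Storage bucket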
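
The REST call for this template can also be assembled from Java. The following is a minimal sketch that uses the JDK's `java.net.http` API (Java 11 or later) and only builds the request. `LaunchRequestSketch` and its method names are hypothetical, the JSON body is concatenated by hand on the assumption that no value contains quotes or backslashes, and the access token is assumed to come from elsewhere (for example, from `gcloud auth print-access-token`).

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustration only: LaunchRequestSketch is a hypothetical helper, not part of the template. */
final class LaunchRequestSketch {

  /** Builds the JSON body; assumes the values need no JSON escaping. */
  static String body(String projectId, String jobName, String bucket, String topic) {
    return "{"
        + "\"jobName\": \"" + jobName + "\", "
        + "\"parameters\": {"
        + "\"inputFilePattern\": \"gs://" + bucket + "/files/*.json\", "
        + "\"outputTopic\": \"projects/" + projectId + "/topics/" + topic + "\""
        + "}}";
  }

  /** Builds (but does not send) the templates:launch request shown in this section. */
  static HttpRequest build(
      String projectId, String location, String version,
      String jobName, String bucket, String topic, String accessToken) {
    URI uri = URI.create(
        "https://dataflow.googleapis.com/v1b3/projects/" + projectId
            + "/locations/" + location
            + "/templates:launch?gcsPath=gs://dataflow-templates/" + version
            + "/GCS_Text_to_Cloud_PubSub");
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body(projectId, jobName, bucket, topic)))
        .build();
  }
}
```

To send the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and inspect the status code and response body.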

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import com.google.cloud.teleport.templates.TextToPubsub.Options;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;

/**
 * The {@code TextToPubsub} pipeline publishes records to Cloud Pub/Sub from a set of files. The
 * pipeline reads each file row-by-row and publishes each record as a string message. At the moment,
 * publishing messages with attributes is unsupported.
 *
 * <p>Example Usage:
 *
 * <pre>
 * {@code mvn compile exec:java \
 * -Dexec.mainClass=com.google.cloud.teleport.templates.TextToPubsub \
 * -Dexec.args=" \
 * --project=${PROJECT_ID} \
 * --stagingLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
 * --tempLocation=gs://${PROJECT_ID}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
 * --runner=DataflowRunner \
 * --inputFilePattern=gs://path/to/demo_file.csv \
 * --outputTopic=projects/${PROJECT_ID}/topics/${TOPIC_NAME}"
 * }
 * </pre>
 */
@Template(
    name = "GCS_Text_to_Cloud_PubSub",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Storage Text File to Pub/Sub (Batch)",
    description =
        "Batch pipeline. Reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
public class TextToPubsub {

  /** The custom options supported by the pipeline. Inherits standard configuration options. */
  public interface Options extends PipelineOptions {
    @TemplateParameter.GcsReadFile(
        order = 1,
        description = "Cloud Storage Input File(s)",
        helpText = "Path of the file pattern glob to read from.",
        example = "gs://your-bucket/path/*.txt")
    @Required
    ValueProvider<String> getInputFilePattern();

    void setInputFilePattern(ValueProvider<String> value);

    @TemplateParameter.PubsubTopic(
        order = 2,
        description = "Output Pub/Sub topic",
        helpText =
            "The name of the topic to which data should be published, in the format of 'projects/your-project-id/topics/your-topic-name'",
        example = "projects/your-project-id/topics/your-topic-name")
    @Required
    ValueProvider<String> getOutputTopic();

    void setOutputTopic(ValueProvider<String> value);
  }

  /**
   * Main entry-point for the pipeline. Reads in the command-line arguments, parses them, and
   * executes the pipeline.
   *
   * @param args Arguments passed in from the command-line.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    run(options);
  }

  /**
   * Executes the pipeline with the provided execution parameters.
   *
   * @param options The execution parameters.
   */
  public static PipelineResult run(Options options) {
    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps:
     *  1) Read from the text source.
     *  2) Write each text record to Pub/Sub
     */
    pipeline
        .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
        .apply("Write to PubSub", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

Cloud Storage Text to Cloud Spanner

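The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them into a Cloud Spanner database.

Requirements for this pipeline:

The target Cloud Spanner database and table must exist.
You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
The input Cloud Storage path that contains the CSV files must exist.
You must create an import manifest file that contains a JSON description of the CSV files, and you must store that manifest in Cloud Storage.
If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as the corresponding columns in the target database's schema.

The manifest file, encoded in ASCII or UTF-8, must match the following format:

Manifest format and example

The format of the manifest file corresponds to the following message type, shown here in protocol buffer format: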

message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the
      // destination table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;

  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}

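The following example shows a manifest file for importing tables named Albums and Singers into a GoogleSQL-dialect database. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies: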

{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}
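
If you generate manifests programmatically, a minimal sketch like the following could produce the schema-less form used for the Albums table above. `ManifestSketch` and its methods are hypothetical names, and the sketch assumes that table names and file patterns contain no characters that need JSON escaping; a real tool would use a JSON library instead.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustration only: ManifestSketch is a hypothetical helper, not part of the template. */
final class ManifestSketch {

  /** Builds one "tables" entry that relies on the column schema already in the database. */
  static String tableEntry(String tableName, List<String> filePatterns) {
    String patterns = filePatterns.stream()
        .map(pattern -> "\"" + pattern + "\"")
        .collect(Collectors.joining(", "));
    return "{\"table_name\": \"" + tableName + "\", \"file_patterns\": [" + patterns + "]}";
  }

  /** Wraps the table entries in the top-level manifest object. */
  static String manifest(List<String> tableEntries) {
    return "{\"tables\": [" + String.join(", ", tableEntries) + "]}";
  }

  public static void main(String[] args) {
    String albums = tableEntry(
        "Albums", List.of("gs://bucket1/Albums_1.csv", "gs://bucket1/Albums_2.csv"));
    System.out.println(manifest(List.of(albums)));
  }
}
```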

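The text files to import must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.

The data must match one of the following types: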

GoogleSQL

    BOOL
    INT64
    FLOAT64
    NUMERIC
    STRING
    DATE
    TIMESTAMP
    BYTES
    JSON

PostgreSQL

    boolean
    bigint
    double precision
    numeric
    character varying, text
    date
    timestamp with time zone
    bytea

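Note: If you don't specify the column names and data types of the target table in the import manifest file, the columns in the CSV file must be in the same order as the columns in the target database. You can view the order of the columns in a table by running the following query, replacing TABLE_NAME with the name of your table:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME =
      'TABLE_NAME' ORDER BY ORDINAL_POSITION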

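Template parameters

Parameter	Description
`instanceId`	The instance ID of the Cloud Spanner database.
`databaseId`	The database ID of the Cloud Spanner database.
`importManifest`	The path in Cloud Storage to the import manifest file.
`columnDelimiter`	The column delimiter that the source file uses. The default value is `,`.
`fieldQualifier`	The character that must surround any value in the source file that contains the `columnDelimiter`. The default value is `"`.
`trailingDelimiter`	Specifies whether the lines in the source files have trailing delimiters (that is, whether the `columnDelimiter` character appears at the end of each line, after the last column value). The default value is `true`.
`escape`	The escape character that the source file uses. By default, this parameter is not set and the template does not use an escape character.
`nullString`	The string that represents a `NULL` value. By default, this parameter is not set and the template does not use a null string.
`dateFormat`	The format used to parse date columns. By default, the pipeline tries to parse date columns as `yyyy-M-d[' 00:00:00']`, for example, 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using `java.time.format.DateTimeFormatter` patterns.
`timestampFormat`	The format used to parse timestamp columns. If the timestamp is a long integer, it is parsed as Unix time. Otherwise, it is parsed as a string using the `java.time.format.DateTimeFormatter.ISO_INSTANT` format. For other cases, specify your own pattern string. For example, you can use `MMM dd yyyy HH:mm:ss.SSSVV` for timestamps in the form `"Jan 21 1998 01:02:03.456+08:00"`.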

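If you need to use custom date or timestamp formats, make sure that they are valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of custom formats for date and timestamp columns:

Type	Input value	Format	Remark
`DATE`	2011-3-31		By default, the template can parse this format. You don't need to specify the `dateFormat` parameter.
`DATE`	2011-3-31 00:00:00		By default, the template can parse this format. You don't need to specify the format. If you like, you can use `yyyy-M-d' 00:00:00'`.
`DATE`	01 Apr, 18	dd MMM, yy
`DATE`	Wednesday, April 3, 2019 AD	EEEE, LLLL d, yyyy G
`TIMESTAMP`	2019-01-02T11:22:33Z 2019-01-02T11:22:33.123Z 2019-01-02T11:22:33.12356789Z		The default format `ISO_INSTANT` can parse this type of timestamp. You don't need to provide the `timestampFormat` parameter.
`TIMESTAMP`	1568402363		By default, the template can parse this type of timestamp and treats it as Unix time.
`TIMESTAMP`	Tue, 3 Jun 2008 11:05:30 GMT	EEE, d MMM yyyy HH:mm:ss VV
`TIMESTAMP`	2018/12/31 110530.123PST	yyyy/MM/dd HHmmss.SSSz
`TIMESTAMP`	2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z	yyyy-MM-dd'T'HH:mm:ss[.SSS]VV	If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp, and you don't need to provide your own format parameter. You can also use `yyyy-MM-dd'T'HH:mm:ss[.SSS]VV` to handle both cases. You cannot use `yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'`, because the suffix "Z" must be parsed as a time-zone ID, not as a character literal. Internally, the timestamp column is converted to a `java.time.Instant`. Therefore, the timestamp column must be specified in UTC or have time zone information associated with it. A local date-time, such as 2019-01-02 11:22:33, cannot be parsed as a valid `java.time.Instant`.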
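
Before you launch a job, you can try a custom pattern locally with `java.time.format.DateTimeFormatter`. The following is a minimal sketch, not the template's own parsing code: `FormatSketch` and its methods are hypothetical names, and the use of `Locale.ENGLISH` for month names is an assumption made for this example.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Illustration only: FormatSketch is a hypothetical helper, not part of the template. */
final class FormatSketch {

  /** Parses a date value with the given pattern; throws DateTimeParseException on mismatch. */
  static LocalDate parseDate(String value, String pattern) {
    return LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
  }

  /** Parses a zoned timestamp value with the given pattern and converts it to an Instant. */
  static Instant parseTimestamp(String value, String pattern) {
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
    return ZonedDateTime.parse(value, formatter).toInstant();
  }

  public static void main(String[] args) {
    // The default date pattern accepts both the short and the padded form.
    System.out.println(parseDate("2019-1-1 00:00:00", "yyyy-M-d[' 00:00:00']"));
    System.out.println(parseDate("2019-01-31", "yyyy-M-d[' 00:00:00']"));
    // A custom date pattern from the table above.
    System.out.println(parseDate("01 Apr, 18", "dd MMM, yy"));
    // Timestamps must carry an offset or zone so that they can become an Instant.
    System.out.println(
        parseTimestamp("Jan 21 1998 01:02:03.456+08:00", "MMM dd yyyy HH:mm:ss.SSSVV"));
    // The optional [.SSS] section handles values with and without milliseconds.
    System.out.println(parseTimestamp("2019-01-02T11:22:33Z", "yyyy-MM-dd'T'HH:mm:ss[.SSS]VV"));
    System.out.println(
        parseTimestamp("2019-01-02T11:22:33.123Z", "yyyy-MM-dd'T'HH:mm:ss[.SSS]VV"));
  }
}
```

A pattern that throws `DateTimeParseException` here is likely to fail in the pipeline as well, so a check like this is a cheap way to catch mistakes before you start an import.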

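Running the Text Files on Cloud Storage to Cloud Spanner template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
In the provided parameter fields, enter your parameter values.
Click Run job.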

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner \
    --region REGION_NAME \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST

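Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file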

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

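Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INSTANCE_ID: your Cloud Spanner instance ID
DATABASE_ID: your Cloud Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file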

Template source code

Java

The source code for this template is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

Cloud Storage to Elasticsearch

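The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.

Requirements for this pipeline:

The Cloud Storage bucket must exist.
An Elasticsearch host that is reachable from Dataflow must exist on a Google Cloud instance or on Elastic Cloud.
The BigQuery table for error output must exist.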

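Template parameters

Parameter	Description
`inputFileSpec`	The Cloud Storage file pattern to search for CSV files. Example: `gs://mybucket/test-*.csv`.
`connectionUrl`	The Elasticsearch URL in the format `https://hostname:[port]`, or the CloudID if you use Elastic Cloud.
`apiKey`	The Base64-encoded API key used for authentication.
`index`	The Elasticsearch index that requests are issued against, such as `my-index`.
`deadletterTable`	The BigQuery dead-letter table to send failed inserts to. Example: `<your-project>:<your-dataset>.<your-table-name>`.
`containsHeaders`	(Optional) Boolean that indicates whether the CSV contains headers. Default: `true`.
`delimiter`	(Optional) The delimiter that the CSV uses. Example: `,`
`csvFormat`	(Optional) The CSV format, according to the Apache Commons CSV format. Default: `Default`.
`jsonSchemaPath`	(Optional) The path to the JSON schema. Default: `null`.
`largeNumFiles`	(Optional) Set to true if the number of files is in the tens of thousands. Default: `false`.
`javascriptTextTransformGcsPath`	(Optional) The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) that you want to use. For example, `gs://my-bucket/my-udfs/my_file.js`.
`javascriptTextTransformFunctionName`	(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples.
`batchSize`	(Optional) The batch size in number of documents. Default: `1000`.
`batchSizeBytes`	(Optional) The batch size in number of bytes. Default: `5242880` (5 MB).
`maxRetryAttempts`	(Optional) The maximum number of retry attempts; must be greater than 0. Default: no retries.
`maxRetryDuration`	(Optional) The maximum retry duration in milliseconds; must be greater than 0. Default: no retries.
`csvFileEncoding`	(Optional) The CSV file encoding.
`propertyAsIndex`	(Optional) A property in the document being indexed whose value specifies the `_index` metadata to include with the document in the bulk request (takes precedence over an `_index` UDF). Default: none.
`propertyAsId`	(Optional) A property in the document being indexed whose value specifies the `_id` metadata to include with the document in the bulk request (takes precedence over an `_id` UDF). Default: none.
`javaScriptIndexFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_index` metadata to include with the document in the bulk request. Default: none.
`javaScriptIndexFnName`	(Optional) The UDF JavaScript function name for a function that specifies the `_index` metadata to include with the document in the bulk request. Default: none.
`javaScriptIdFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_id` metadata to include with the document in the bulk request. Default: none.
`javaScriptIdFnName`	(Optional) The UDF JavaScript function name for a function that specifies the `_id` metadata to include with the document in the bulk request. Default: none.
`javaScriptTypeFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that specifies the `_type` metadata to include with the document in the bulk request. Default: none.
`javaScriptTypeFnName`	(Optional) The UDF JavaScript function name for a function that specifies the `_type` metadata to include with the document in the bulk request. Default: none.
`javaScriptIsDeleteFnGcsPath`	(Optional) The Cloud Storage path to the JavaScript UDF source for a function that determines whether the document should be deleted rather than inserted or updated. The function should return the string value `"true"` or `"false"`. Default: none.
`javaScriptIsDeleteFnName`	(Optional) The UDF JavaScript function name for a function that determines whether the document should be deleted rather than inserted or updated. The function should return the string value `"true"` or `"false"`. Default: none.
`usePartialUpdate`	(Optional) Whether to use partial updates (update rather than create or index, allowing partial documents) with Elasticsearch requests. Default: `false`.
`bulkInsertMethod`	(Optional) Whether to use `INDEX` (index, allows upserts) or `CREATE` (create, errors on a duplicate _id) with Elasticsearch bulk requests. Default: `CREATE`.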
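
Elasticsearch API key credentials are conventionally the Base64 encoding of the key ID and the key secret joined by a colon. The following minimal sketch produces a value in that form. It assumes this is the encoding that the `apiKey` parameter expects, which this page does not spell out, and `ApiKeySketch` is a hypothetical name. If your Elasticsearch deployment already hands you an encoded key, use that value directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustration only: ApiKeySketch is a hypothetical helper, not part of the template. */
final class ApiKeySketch {

  /** Returns Base64("id:apiKey") using UTF-8, the usual Elasticsearch API key credential form. */
  static String encode(String id, String apiKey) {
    byte[] raw = (id + ":" + apiKey).getBytes(StandardCharsets.UTF_8);
    return Base64.getEncoder().encodeToString(raw);
  }

  public static void main(String[] args) {
    // Placeholder values; substitute the ID and secret of your own API key.
    System.out.println(encode("KEY_ID", "KEY_SECRET"));
  }
}
```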

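Running the Cloud Storage to Elasticsearch template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Storage to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.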

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/GCS_to_Elasticsearch \
    --parameters \
inputFileSpec=INPUT_FILE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX,\
deadletterTable=DEADLETTER_TABLE

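Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern
CONNECTION_URL: your Elasticsearch URL
APIKEY: your Base64-encoded API key for authentication
INDEX: your Elasticsearch index
DEADLETTER_TABLE: your BigQuery table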

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileSpec": "INPUT_FILE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX",
          "deadletterTable": "DEADLETTER_TABLE"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/GCS_to_Elasticsearch"
   }
}

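Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern
CONNECTION_URL: your Elasticsearch URL
APIKEY: your Base64-encoded API key for authentication
INDEX: your Elasticsearch index
DEADLETTER_TABLE: your BigQuery table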

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.GCSToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.CsvConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteStringMessageErrors;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link GCSToElasticsearch} pipeline exports data from one or more CSV files in Cloud Storage
 * to Elasticsearch.
 *
 * <p>Please refer to <b><a href=
 * "https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/googlecloud-to-elasticsearch/docs/GCSToElasticsearch/README.md">
 * README.md</a></b> for further information.
 */
@Template(
    name = "GCS_to_Elasticsearch",
    category = TemplateCategory.BATCH,
    displayName = "Cloud Storage to Elasticsearch",
    description =
        "A pipeline to ingest csv files from Cloud Storage and writes each line into Elasticsearch"
            + " as a json document.",
    optionsClass = GCSToElasticsearchOptions.class,
    flexContainerName = "gcs-to-elasticsearch",
    contactInformation = "https://cloud.google.com/support")
public class GCSToElasticsearch {

  /** The tag for the headers of the CSV if required. */
  static final TupleTag<String> CSV_HEADERS = new TupleTag<String>() {};

  /** The tag for the lines of the CSV. */
  static final TupleTag<String> CSV_LINES = new TupleTag<String>() {};

  /** The tag for the dead-letter output of the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output for the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(GCSToElasticsearch.class);

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    GCSToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(GCSToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  private static PipelineResult run(GCSToElasticsearchOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    // Throw error if containsHeaders is true and a schema or Udf is also set.
    if (options.getContainsHeaders()) {
      checkArgument(
          options.getJavascriptTextTransformGcsPath() == null
              && options.getJsonSchemaPath() == null,
          "Cannot parse file containing headers with UDF or Json schema.");
    }

    // Throw error if only one retry configuration parameter is set.
    checkArgument(
        (options.getMaxRetryAttempts() == null && options.getMaxRetryDuration() == null)
            || (options.getMaxRetryAttempts() != null && options.getMaxRetryDuration() != null),
        "To specify retry configuration both max attempts and max duration must be set.");

    /*
     * Steps: 1) Read records from CSV(s) via {@link CsvConverters.ReadCsv}.
     *        2) Convert lines to JSON strings via {@link CsvConverters.LineToFailsafeJson}.
     *        3a) Write JSON strings as documents to Elasticsearch via {@link ElasticsearchIO}.
     *        3b) Write elements that failed processing to {@link org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO}.
     */
    PCollectionTuple convertedCsvLines =
        pipeline
            /*
             * Step 1: Read CSV file(s) from Cloud Storage using {@link CsvConverters.ReadCsv}.
             */
            .apply(
                "ReadCsv",
                CsvConverters.ReadCsv.newBuilder()
                    .setCsvFormat(options.getCsvFormat())
                    .setDelimiter(options.getDelimiter())
                    .setHasHeaders(options.getContainsHeaders())
                    .setInputFileSpec(options.getInputFileSpec())
                    .setHeaderTag(CSV_HEADERS)
                    .setLineTag(CSV_LINES)
                    .setFileEncoding(options.getCsvFileEncoding())
                    .build())
            /*
             * Step 2: Convert lines to Elasticsearch document.
             */
            .apply(
                "ConvertLine",
                CsvConverters.LineToFailsafeJson.newBuilder()
                    .setDelimiter(options.getDelimiter())
                    .setUdfFileSystemPath(options.getJavascriptTextTransformGcsPath())
                    .setUdfFunctionName(options.getJavascriptTextTransformFunctionName())
                    .setJsonSchemaPath(options.getJsonSchemaPath())
                    .setHeaderTag(CSV_HEADERS)
                    .setLineTag(CSV_LINES)
                    .setUdfOutputTag(PROCESSING_OUT)
                    .setUdfDeadletterTag(PROCESSING_DEADLETTER_OUT)
                    .build());
    /*
     * Step 3a: Write elements that were successfully processed to Elasticsearch using {@link WriteToElasticsearch}.
     */
    convertedCsvLines
        .get(PROCESSING_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setOptions(options.as(GCSToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link BigQueryIO}.
     */
    convertedCsvLines
        .get(PROCESSING_DEADLETTER_OUT)
        .apply(
            "AddTimestamps",
            WithTimestamps.of((FailsafeElement<String, String> failures) -> new Instant()))
        .apply(
            "WriteFailedElementsToBigQuery",
            WriteStringMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    return pipeline.run();
  }
}

Java Database Connectivity (JDBC) to BigQuery

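The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key along with Base64-encoded username, password, and connection string parameters that are encrypted with that Cloud KMS key. For details about encrypting your username, password, and connection string parameters, see the Cloud KMS API encryption endpoint.

Requirements for this pipeline:

The JDBC drivers for the relational database must be available.
The BigQuery table must exist before you run the pipeline.
The BigQuery table must have a compatible schema.
The relational database must be accessible from the subnet where Dataflow runs.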

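Template parameters

Parameter	Description
`driverJars`	The comma-separated list of driver JAR files. For example, `gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar`.
`driverClassName`	The JDBC driver class name. For example, `com.mysql.jdbc.Driver`.
`connectionURL`	The JDBC connection URL string. For example, `jdbc:mysql://some-host:3306/sampledb`. Can be passed in as a string that is Base64-encoded and then encrypted with a Cloud KMS key.
`query`	The query to run on the source to extract the data. For example, `select * from sampledb.sample_table`.
`outputTable`	The BigQuery output table location, in the format `<my-project>:<my-dataset>.<my-table>`.
`bigQueryLoadingTemporaryDirectory`	The temporary directory for the BigQuery loading process. For example, `gs://<my-bucket>/my-files/temp_dir`.
`connectionProperties`	(Optional) The properties string to use for the JDBC connection. The format of the string must be `[propertyName=property;]*`. For example, `unicode=true;characterEncoding=UTF-8`.
`username`	(Optional) The username to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`password`	(Optional) The password to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`KMSEncryptionKey`	(Optional) The Cloud KMS encryption key used to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.
`disabledAlgorithms`	(Optional) Comma-separated algorithms to disable. If this value is set to `none`, no algorithm is disabled. Use with care, because the algorithms that are disabled by default are known to have either vulnerabilities or performance issues. For example, `SSLv3, RC4`.
`extraFilesToStage`	Comma-separated Cloud Storage paths or Secret Manager secrets for files to stage in the worker. These files are saved under the `/extra_files` directory in each worker. For example, `gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id>`.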
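
To check that a `connectionProperties` value follows the `[propertyName=property;]*` shape before you submit a job, a small local check can help. The following is a minimal sketch and not the template's own parser: `ConnectionPropertiesSketch` is a hypothetical name, and trimming whitespace around names and values is an assumption of this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: ConnectionPropertiesSketch is a hypothetical helper. */
final class ConnectionPropertiesSketch {

  /** Splits "name=value;name=value" into an ordered map; rejects entries without a name or "=". */
  static Map<String, String> parse(String properties) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String pair : properties.split(";")) {
      if (pair.isBlank()) {
        continue; // tolerate a trailing or doubled semicolon
      }
      int equalsAt = pair.indexOf('=');
      if (equalsAt <= 0) {
        throw new IllegalArgumentException("Expected propertyName=property but got: " + pair);
      }
      result.put(pair.substring(0, equalsAt).trim(), pair.substring(equalsAt + 1).trim());
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("unicode=true;characterEncoding=UTF-8"));
  }
}
```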

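Running the JDBC to BigQuery template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the JDBC to BigQuery template.
In the provided parameter fields, enter your parameter values.
Click Run job.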

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Jdbc_to_BigQuery \
    --region REGION_NAME \
    --parameters \
driverJars=DRIVER_PATHS,\
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
query=SOURCE_SQL_QUERY,\
outputTable=PROJECT_ID:DATASET.TABLE_NAME,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\
connectionProperties=CONNECTION_PROPERTIES,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

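Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
SOURCE_SQL_QUERY: the SQL query to run on the source database
DATASET: your BigQuery dataset
TABLE_NAME: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key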

API

To run the template using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "driverJars": "DRIVER_PATHS",
       "driverClassName": "DRIVER_CLASS_NAME",
       "connectionURL": "JDBC_CONNECTION_URL",
       "query": "SOURCE_SQL_QUERY",
       "outputTable": "PROJECT_ID:DATASET.TABLE_NAME",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS",
       "connectionProperties": "CONNECTION_PROPERTIES",
       "username": "CONNECTION_USERNAME",
       "password": "CONNECTION_PASSWORD",
       "KMSEncryptionKey":"KMS_ENCRYPTION_KEY"
   },
   "environment": { "zone": "us-central1-f" }
}

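Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use a template stored in the most recent dated parent folder in production environments.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
DRIVER_CLASS_NAME: the driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
SOURCE_SQL_QUERY: the SQL query to run on the source database
DATASET: your BigQuery dataset
TABLE_NAME: your BigQuery table name
PATH_TO_TEMP_DIR_ON_GCS: the Cloud Storage path to the temporary directory
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key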

Template source code

Java

/*
 * Copyright (C) 2018 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.teleport.io.DynamicJdbcIO;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.common.JdbcConverters;
import com.google.cloud.teleport.util.KMSEncryptedNestedValueProvider;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * A template that copies data from a relational database using JDBC to an existing BigQuery table.
 */
@Template(
    name = "Jdbc_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "JDBC to BigQuery",
    description =
        "A pipeline that reads from a JDBC source and writes to a BigQuery table. JDBC connection string, user name and password can be passed in directly as plaintext or encrypted using the Google Cloud KMS API.  If the parameter KMSEncryptionKey is specified, connectionURL, username, and password should be all in encrypted format. A sample curl command for the KMS API encrypt endpoint: curl -s -X POST \"https://cloudkms.googleapis.com/v1/projects/your-project/locations/your-path/keyRings/your-keyring/cryptoKeys/your-key:encrypt\"  -d \"{\\\"plaintext\\\":\\\"PasteBase64EncodedString\\\"}\" -H \"Authorization: Bearer $(gcloud auth application-default print-access-token)\" -H \"Content-Type: application/json\"",
    optionsClass = JdbcConverters.JdbcToBigQueryOptions.class,
    contactInformation = "https://cloud.google.com/support")
public class JdbcToBigQuery {

  private static final Logger LOG = LoggerFactory.getLogger(JdbcToBigQuery.class);

  private static ValueProvider<String> maybeDecrypt(
      ValueProvider<String> unencryptedValue, ValueProvider<String> kmsKey) {
    return new KMSEncryptedNestedValueProvider(unencryptedValue, kmsKey);
  }

  /**
   * Main entry point for executing the pipeline. This will run the pipeline asynchronously. If
   * blocking execution is required, use the {@link
   * JdbcToBigQuery#run(JdbcConverters.JdbcToBigQueryOptions)} method to start the pipeline and
   * invoke {@code result.waitUntilFinish()} on the {@link PipelineResult}
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {

    // Parse the user options passed from the command-line
    JdbcConverters.JdbcToBigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(JdbcConverters.JdbcToBigQueryOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(JdbcConverters.JdbcToBigQueryOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Steps: 1) Read records via JDBC and convert to TableRow via RowMapper
     *        2) Append TableRow to BigQuery via BigQueryIO
     */
    pipeline
        /*
         * Step 1: Read records via JDBC and convert to TableRow
         *         via {@link org.apache.beam.sdk.io.jdbc.JdbcIO.RowMapper}
         */
        .apply(
            "Read from JdbcIO",
            DynamicJdbcIO.<TableRow>read()
                .withDataSourceConfiguration(
                    DynamicJdbcIO.DynamicDataSourceConfiguration.create(
                            options.getDriverClassName(),
                            maybeDecrypt(options.getConnectionURL(), options.getKMSEncryptionKey()))
                        .withUsername(
                            maybeDecrypt(options.getUsername(), options.getKMSEncryptionKey()))
                        .withPassword(
                            maybeDecrypt(options.getPassword(), options.getKMSEncryptionKey()))
                        .withDriverJars(options.getDriverJars())
                        .withConnectionProperties(options.getConnectionProperties()))
                .withQuery(options.getQuery())
                .withCoder(TableRowJsonCoder.of())
                .withRowMapper(JdbcConverters.getResultSetToTableRow(options.getUseColumnAlias())))
        /*
         * Step 2: Append TableRow to an existing BigQuery table
         */
        .apply(
            "Write to BigQuery",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}

Java Database Connectivity (JDBC) to Pub/Sub

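The Java Database Connectivity (JDBC) to Pub/Sub template is a batch pipeline that ingests data from a JDBC source and writes the resulting records to a pre-existing Pub/Sub topic as JSON strings.

Requirements for this pipeline:

The JDBC source must exist before you run the pipeline.
The Cloud Pub/Sub output topic must exist before you run the pipeline.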

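Template parameters

Parameter	Description
`driverClassName`	The JDBC driver class name. For example, `com.mysql.jdbc.Driver`.
`connectionUrl`	The JDBC connection URL string. For example, `jdbc:mysql://some-host:3306/sampledb`. Can be passed in as a string that is Base64-encoded and then encrypted with a Cloud KMS key.
`driverJars`	Comma-separated Cloud Storage paths of the JDBC drivers. For example, `gs://your-bucket/driver_jar1.jar,gs://your-bucket/driver_jar2.jar`.
`username`	(Optional) The username to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`password`	(Optional) The password to use for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
`connectionProperties`	(Optional) The properties string to use for the JDBC connection. The string must use the format `[propertyName=property;]*`. For example, `unicode=true;characterEncoding=UTF-8`.
`query`	The query to run on the source to extract the data. For example, `select * from sampledb.sample_table`.
`outputTopic`	The Pub/Sub topic to publish to, in the format `projects/<project>/topics/<topic>`.
`KMSEncryptionKey`	(Optional) The Cloud KMS encryption key used to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.
`extraFilesToStage`	Comma-separated Cloud Storage paths or Secret Manager secrets for files to stage in the worker. These files are saved under the `/extra_files` directory in each worker. For example, `gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id>`.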
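
The template description in the source code suggests calling the Cloud KMS encrypt endpoint with a JSON body of the form `{"plaintext":"PasteBase64EncodedString"}`. As a minimal sketch of preparing that body, the hypothetical helper below (not part of the template) Base64-encodes a value such as the username and wraps it in the request JSON. It performs no network call, and it assumes that the `ciphertext` field of the encrypt response is the value you then pass as the template parameter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that prepares the body of a Cloud KMS encrypt request. */
final class KmsEncryptRequest {

  /** Base64-encodes the UTF-8 bytes of the plaintext value. */
  static String base64(String plaintext) {
    return Base64.getEncoder().encodeToString(plaintext.getBytes(StandardCharsets.UTF_8));
  }

  /** Builds the JSON body; Base64 output never contains quotes or backslashes. */
  static String requestBody(String plaintext) {
    return "{\"plaintext\":\"" + base64(plaintext) + "\"}";
  }

  public static void main(String[] args) {
    // "user" is a placeholder username chosen for illustration.
    System.out.println(requestBody("user"));
  }
}
```

When `KMSEncryptionKey` is set, the same preparation applies to the connection URL, the username, and the password, because all three must then be passed in encrypted.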

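Running the Java Database Connectivity (JDBC) to Pub/Sub template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the JDBC to Pub/Sub template.
In the provided parameter fields, enter your parameter values.
Click Run job.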

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location gs://dataflow-templates/VERSION/flex/Jdbc_to_PubSub \
    --region REGION_NAME \
    --parameters \
driverClassName=DRIVER_CLASS_NAME,\
connectionUrl=JDBC_CONNECTION_URL,\
driverJars=DRIVER_PATHS,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
connectionProperties=CONNECTION_PROPERTIES,\
query=SOURCE_SQL_QUERY,\
outputTopic=OUTPUT_TOPIC,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

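Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the template stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_CLASS_NAME: the JDBC driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
SOURCE_SQL_QUERY: the SQL query to run on the source database
OUTPUT_TOPIC: the Pub/Sub topic to publish to
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key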
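
The CONNECTION_PROPERTIES value uses the `[propertyName=property;]*` format from the parameter table, for example `unicode=true;characterEncoding=UTF-8`. As a minimal sketch, the hypothetical helper below (not part of the template) joins property pairs into that shape. It assumes that names and values contain no `=` or `;` characters and that a trailing semicolon is not required, as in the documented example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that formats JDBC connection properties for the template. */
final class ConnectionProperties {

  /** Joins name=value pairs with ';' in the iteration order of the map. */
  static String format(Map<String, String> properties) {
    return properties.entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(";"));
  }

  public static void main(String[] args) {
    Map<String, String> properties = new LinkedHashMap<>();
    properties.put("unicode", "true");
    properties.put("characterEncoding", "UTF-8");
    System.out.println(format(properties));
  }
}
```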

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "driverClassName": "DRIVER_CLASS_NAME",
          "connectionUrl": "JDBC_CONNECTION_URL",
          "driverJars": "DRIVER_PATHS",
          "username": "CONNECTION_USERNAME",
          "password": "CONNECTION_PASSWORD",
          "connectionProperties": "CONNECTION_PROPERTIES",
          "query": "SOURCE_SQL_QUERY",
          "outputTopic": "OUTPUT_TOPIC",
          "KMSEncryptionKey": "KMS_ENCRYPTION_KEY"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/Jdbc_to_PubSub",
      "environment": { "zone": "us-central1-f" }
   }
}

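Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the template stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
DRIVER_CLASS_NAME: the JDBC driver class name
JDBC_CONNECTION_URL: the JDBC connection URL
DRIVER_PATHS: the comma-separated Cloud Storage paths of the JDBC drivers
CONNECTION_USERNAME: the JDBC connection username
CONNECTION_PASSWORD: the JDBC connection password
CONNECTION_PROPERTIES: the JDBC connection properties, if needed
SOURCE_SQL_QUERY: the SQL query to run on the source database
OUTPUT_TOPIC: the Pub/Sub topic to publish to
KMS_ENCRYPTION_KEY: the Cloud KMS encryption key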

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.templates;

import static com.google.cloud.teleport.v2.utils.KMSUtils.maybeDecrypt;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.io.DynamicJdbcIO;
import com.google.cloud.teleport.v2.options.JdbcToPubsubOptions;
import java.sql.Clob;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link JdbcToPubsub} batch pipeline reads data from JDBC and publishes to Google Cloud
 * PubSub. <br>
 */
@Template(
    name = "Jdbc_to_PubSub",
    category = TemplateCategory.BATCH,
    displayName = "JDBC to Pub/Sub",
    description =
        "A batch pipeline which ingests data from JDBC source and writes to a pre-existing Pub/Sub"
            + " topic as a JSON string. JDBC connection string, user name and password can be"
            + " passed in directly as plaintext or encrypted using the Google Cloud KMS API.  If"
            + " the parameter KMSEncryptionKey is specified, connectionUrl, username, and password"
            + " should be all in encrypted format. A sample curl command for the KMS API encrypt"
            + " endpoint: curl -s -X POST"
            + " \"https://cloudkms.googleapis.com/v1/projects/your-project/locations/your-path/keyRings/your-keyring/cryptoKeys/your-key:encrypt\""
            + "  -d \"{\\\"plaintext\\\":\\\"PasteBase64EncodedString\\\"}\"  -H \"Authorization:"
            + " Bearer $(gcloud auth application-default print-access-token)\"  -H \"Content-Type:"
            + " application/json\"",
    optionsClass = JdbcToPubsubOptions.class,
    flexContainerName = "jdbc-to-pubsub",
    contactInformation = "https://cloud.google.com/support")
public class JdbcToPubsub {

  /* Logger for class.*/
  private static final Logger LOG = LoggerFactory.getLogger(JdbcToPubsub.class);

  /**
   * {@link JdbcIO.RowMapper} implementation to convert Jdbc ResultSet rows to UTF-8 encoded JSONs.
   */
  public static class ResultSetToJSONString implements JdbcIO.RowMapper<String> {

    @Override
    public String mapRow(ResultSet resultSet) throws Exception {
      ResultSetMetaData metaData = resultSet.getMetaData();
      JSONObject json = new JSONObject();

      for (int i = 1; i <= metaData.getColumnCount(); i++) {
        Object value = resultSet.getObject(i);

        // JSONObject.put() does not support null values. The exception is JSONObject.NULL
        if (value == null) {
          json.put(metaData.getColumnLabel(i), JSONObject.NULL);
          continue;
        }

        switch (metaData.getColumnTypeName(i).toLowerCase()) {
          case "clob":
            Clob clobObject = resultSet.getClob(i);
            if (clobObject.length() > Integer.MAX_VALUE) {
              LOG.warn(
                  "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
                  clobObject.length(),
                  metaData.getColumnLabel(i));
            }
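            // Clamp the length before casting so that an oversized value is truncated
            // instead of overflowing the int cast.
            json.put(
                metaData.getColumnLabel(i),
                clobObject.getSubString(
                    1, (int) Math.min(clobObject.length(), Integer.MAX_VALUE)));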
            break;
          default:
            json.put(metaData.getColumnLabel(i), value);
        }
      }
      return json.toString();
    }
  }

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    JdbcToPubsubOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(JdbcToPubsubOptions.class);

    run(options);
  }

  /**
   * Runs a pipeline which reads message from JDBC and writes to Pub/Sub.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(JdbcToPubsubOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    LOG.info("Starting Jdbc-To-PubSub Pipeline.");

    /*
     * Steps:
     *  1) Read data from a Jdbc Table
     *  2) Write to Pub/Sub topic
     */
    DynamicJdbcIO.DynamicDataSourceConfiguration dataSourceConfiguration =
        DynamicJdbcIO.DynamicDataSourceConfiguration.create(
                options.getDriverClassName(),
                maybeDecrypt(options.getConnectionUrl(), options.getKMSEncryptionKey()))
            .withDriverJars(options.getDriverJars());
    if (options.getUsername() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withUsername(
              maybeDecrypt(options.getUsername(), options.getKMSEncryptionKey()));
    }
    if (options.getPassword() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withPassword(
              maybeDecrypt(options.getPassword(), options.getKMSEncryptionKey()));
    }
    if (options.getConnectionProperties() != null) {
      dataSourceConfiguration =
          dataSourceConfiguration.withConnectionProperties(options.getConnectionProperties());
    }

    PCollection<String> jdbcData =
        pipeline.apply(
            "readFromJdbc",
            DynamicJdbcIO.<String>read()
                .withDataSourceConfiguration(dataSourceConfiguration)
                .withQuery(options.getQuery())
                .withCoder(StringUtf8Coder.of())
                .withRowMapper(new ResultSetToJSONString()));

    jdbcData.apply("writeSuccessMessages", PubsubIO.writeStrings().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

Apache Cassandra to Cloud Bigtable

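The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. The template requires minimal configuration and replicates the table structure in Cassandra as closely as possible in Cloud Bigtable.

The Apache Cassandra to Cloud Bigtable template is useful for the following:

Migrating an Apache Cassandra database when a short downtime is acceptable.
Periodically replicating Cassandra tables to Cloud Bigtable to serve data to users worldwide.

Requirements for this pipeline:

The target Bigtable table must exist before you run the pipeline.
A network connection must exist between the Dataflow workers and the Apache Cassandra nodes.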

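Type conversion

The Apache Cassandra to Cloud Bigtable template automatically converts Apache Cassandra data types to Cloud Bigtable data types.

Most primitives are represented the same way in Cloud Bigtable and Apache Cassandra; however, the following primitives are represented differently:

Date and Timestamp are converted to DateTime objects
UUID is converted to String
Varint is converted to BigDecimal

Apache Cassandra also natively supports more complex types, such as Tuple, List, Set, and Map. This pipeline does not support tuples because Apache Beam has no corresponding type.

For example, in Apache Cassandra you can have a column of type List named "mylist" with values like those in the following table.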

row	mylist
1	`(a,b,c)`

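The pipeline expands this List column into three separate columns (known in Cloud Bigtable as column qualifiers). The column name is "mylist", and the pipeline appends the index of each item in the list, for example "mylist[0]".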

row	mylist[0]	mylist[1]	mylist[2]
1	a	b	c
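
The list expansion shown in the table above can be sketched as follows. This is an illustrative, hypothetical helper and not the template's own conversion code. It only shows how the qualifier names are derived by appending the item index to the column name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of expanding a Cassandra List column into column qualifiers. */
final class ListColumnExpansion {

  /** Maps each list item to a qualifier named column[index], preserving list order. */
  static Map<String, String> expand(String column, List<String> values) {
    Map<String, String> cells = new LinkedHashMap<>();
    for (int i = 0; i < values.size(); i++) {
      cells.put(column + "[" + i + "]", values.get(i));
    }
    return cells;
  }

  public static void main(String[] args) {
    // Yields the qualifiers mylist[0], mylist[1], and mylist[2].
    System.out.println(expand("mylist", Arrays.asList("a", "b", "c")));
  }
}
```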

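The pipeline handles sets the same way as lists. For maps, it also adds a suffix that indicates whether the cell is a key or a value. For example, suppose that you have a Map column named "mymap":

row	mymap
1	`{"first_key":"first_value","another_key":"different_value"}`

After conversion, the table looks like this: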

row	mymap[0].key	mymap[0].value	mymap[1].key	mymap[1].value
1	first_key	first_value	another_key	different_value
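
The map conversion shown in the table above can be sketched in the same spirit. The hypothetical helper below is not the template's own code. It assumes that entries are numbered in the map's iteration order and that each entry produces one `.key` cell and one `.value` cell.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of expanding a Cassandra Map column into column qualifiers. */
final class MapColumnExpansion {

  /** Emits column[index].key and column[index].value for every map entry. */
  static Map<String, String> expand(String column, Map<String, String> entries) {
    Map<String, String> cells = new LinkedHashMap<>();
    int index = 0;
    for (Map.Entry<String, String> entry : entries.entrySet()) {
      cells.put(column + "[" + index + "].key", entry.getKey());
      cells.put(column + "[" + index + "].value", entry.getValue());
      index++;
    }
    return cells;
  }

  public static void main(String[] args) {
    Map<String, String> mymap = new LinkedHashMap<>();
    mymap.put("first_key", "first_value");
    mymap.put("another_key", "different_value");
    System.out.println(expand("mymap", mymap));
  }
}
```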

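Primary key conversion

In Apache Cassandra, a primary key is defined using the data definition language. The primary key can be simple, composite, or compound with clustering columns. Cloud Bigtable supports manual row key construction, with rows ordered lexicographically on a byte array. The pipeline collects information about the key type and constructs a key by following the best practices for building row keys from multiple values.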
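
As a rough illustration of building a row key from multiple values, the hypothetical helper below joins primary key values with a separator, using `#` as in the template's `rowKeySeparator` default. It is a sketch under simplifying assumptions: all key values are already strings, they do not contain the separator, and they are supplied in the order in which they should appear in the key. It is not the template's own key construction.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of joining primary key values into a Bigtable row key. */
final class RowKeySketch {

  /** Joins the key values in the given order with the separator between them. */
  static String build(List<String> primaryKeyValues, String separator) {
    return String.join(separator, primaryKeyValues);
  }

  public static void main(String[] args) {
    // The key values here are placeholders chosen for illustration.
    System.out.println(build(Arrays.asList("tenant1", "2021-09-20", "42"), "#"));
  }
}
```

Because row keys sort lexicographically, numeric values joined this way sort as text (for example, "10" sorts before "9"), which is worth keeping in mind when choosing key columns.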

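Template parameters

Parameter	Description
`cassandraHosts`	The hosts of the Apache Cassandra nodes, as a comma-separated list.
`cassandraPort`	(Optional) The TCP port used to reach Apache Cassandra on the nodes (defaults to `9042`).
`cassandraKeyspace`	The Apache Cassandra keyspace where the table is located.
`cassandraTable`	The Apache Cassandra table to copy.
`bigtableProjectId`	The Google Cloud project ID of the Bigtable instance to which the Apache Cassandra table is copied.
`bigtableInstanceId`	The ID of the Bigtable instance to which the Apache Cassandra table is copied.
`bigtableTableId`	The name of the Bigtable table to which the Apache Cassandra table is copied.
`defaultColumnFamily`	(Optional) The name of the Bigtable table's column family (defaults to `default`).
`rowKeySeparator`	(Optional) The separator used to build row keys (defaults to `#`).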

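Running the Apache Cassandra to Cloud Bigtable template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cassandra to Cloud Bigtable template.
In the provided parameter fields, enter your parameter values.
Click Run job.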

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=BIGTABLE_INSTANCE_ID,\
bigtableTableId=BIGTABLE_TABLE_ID,\
cassandraHosts=CASSANDRA_HOSTS,\
cassandraKeyspace=CASSANDRA_KEYSPACE,\
cassandraTable=CASSANDRA_TABLE

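Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the template stored in the most recent dated parent folder.
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the project where Cloud Bigtable is located
BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
CASSANDRA_HOSTS: the list of Apache Cassandra hosts; if you provide multiple hosts, follow the instructions on how to escape commas
CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
CASSANDRA_TABLE: the Apache Cassandra table to migrate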

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "BIGTABLE_PROJECT_ID",
       "bigtableInstanceId": "BIGTABLE_INSTANCE_ID",
       "bigtableTableId": "BIGTABLE_TABLE_ID",
       "cassandraHosts": "CASSANDRA_HOSTS",
       "cassandraKeyspace": "CASSANDRA_KEYSPACE",
       "cassandraTable": "CASSANDRA_TABLE"
   },
   "environment": { "zone": "us-central1-f" }
}

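Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the template stored in the most recent dated parent folder.
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
BIGTABLE_PROJECT_ID: the ID of the project where Cloud Bigtable is located
BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
CASSANDRA_HOSTS: the list of Apache Cassandra hosts; if you provide multiple hosts, follow the instructions on how to escape commas
CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
CASSANDRA_TABLE: the Apache Cassandra table to migrate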

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.bigtable;

import com.datastax.driver.core.Session;
import com.google.cloud.teleport.bigtable.CassandraToBigtable.Options;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.metadata.TemplateParameter;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.io.cassandra.Mapper;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;

/**
 * This Dataflow Template performs a one off copy of one table from Apache Cassandra to Cloud
 * Bigtable. It is designed to require minimal configuration and aims to replicate the table
 * structure in Cassandra as closely as possible in Cloud Bigtable. To run the pipeline go to
 * "Create a job from Template", enter the required configuration and press "Run job"
 *
 * <p>The minimum required configuration required to run the pipeline is:
 *
 * <ul>
 *   <li><b>cassandraHosts:</b> The hosts of the Cassandra nodes in a comma separated value list.
 *   <li><b>cassandraPort:</b> The tcp port where Cassandra can be reached on the nodes.
 *   <li><b>cassandraKeyspace:</b> The Cassandra keyspace where the table is located.
 *   <li><b>cassandraTable:</b> The Cassandra table to be copied.
 *   <li><b>bigtableProjectId:</b> The Project ID of the Bigtable instance where the Cassandra table
 *       should be copied.
 *   <li><b>bigtableInstanceId:</b> The Bigtable Instance ID where the Cassandra table should be
 *       copied.
 *   <li><b>bigtableTableId:</b> The name of the Bigtable table where the Cassandra table should be
 *       copied.
 * </ul>
 */
@Template(
    name = "Cassandra_To_Cloud_Bigtable",
    category = TemplateCategory.BATCH,
    displayName = "Cassandra to Cloud Bigtable",
    description = "A pipeline to import an Apache Cassandra table into Cloud Bigtable.",
    optionsClass = Options.class,
    contactInformation = "https://cloud.google.com/support")
final class CassandraToBigtable {

  public interface Options extends PipelineOptions {

    @TemplateParameter.Text(
        order = 1,
        regexes = {"^[a-zA-Z0-9\\.\\-,]*$"},
        description = "Cassandra Hosts",
        helpText = "Comma separated value list of hostnames or ips of the Cassandra nodes.")
    ValueProvider<String> getCassandraHosts();

    @SuppressWarnings("unused")
    void setCassandraHosts(ValueProvider<String> hosts);

    @TemplateParameter.Text(
        order = 2,
        optional = true,
        regexes = {
          "^([0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])$"
        },
        description = "Cassandra Port",
        helpText = "The port where cassandra can be reached. Defaults to 9042.")
    @Default.Integer(9042)
    ValueProvider<Integer> getCassandraPort();

    @SuppressWarnings("unused")
    void setCassandraPort(ValueProvider<Integer> port);

    @TemplateParameter.Text(
        order = 3,
        regexes = {"^[a-zA-Z0-9][a-zA-Z0-9_]{0,47}$"},
        description = "Cassandra Keyspace",
        helpText = "Cassandra Keyspace where the table to be migrated can be located.")
    ValueProvider<String> getCassandraKeyspace();

    @SuppressWarnings("unused")
    void setCassandraKeyspace(ValueProvider<String> keyspace);

    @TemplateParameter.Text(
        order = 4,
        regexes = {"^[a-zA-Z][a-zA-Z0-9_]*$"},
        description = "Cassandra Table",
        helpText = "The name of the Cassandra table to Migrate")
    ValueProvider<String> getCassandraTable();

    @SuppressWarnings("unused")
    void setCassandraTable(ValueProvider<String> cassandraTable);

    @TemplateParameter.ProjectId(
        order = 5,
        description = "Bigtable Project ID",
        helpText = "The Project ID where the target Bigtable Instance is running.")
    ValueProvider<String> getBigtableProjectId();

    @SuppressWarnings("unused")
    void setBigtableProjectId(ValueProvider<String> projectId);

    @TemplateParameter.Text(
        order = 6,
        regexes = {"[a-z][a-z0-9\\-]+[a-z0-9]"},
        description = "Target Bigtable Instance",
        helpText = "The target Bigtable Instance where you want to write the data.")
    ValueProvider<String> getBigtableInstanceId();

    @SuppressWarnings("unused")
    void setBigtableInstanceId(ValueProvider<String> bigtableInstanceId);

    @TemplateParameter.Text(
        order = 7,
        regexes = {"[_a-zA-Z0-9][-_.a-zA-Z0-9]*"},
        description = "Target Bigtable Table",
        helpText = "The target Bigtable table where you want to write the data.")
    ValueProvider<String> getBigtableTableId();

    @SuppressWarnings("unused")
    void setBigtableTableId(ValueProvider<String> bigtableTableId);

    @TemplateParameter.Text(
        order = 8,
        optional = true,
        regexes = {"[-_.a-zA-Z0-9]+"},
        description = "The Default Bigtable Column Family",
        helpText =
            "This specifies the default column family to write data into. If no columnFamilyMapping is specified all Columns will be written into this column family. Default value is \"default\"")
    @Default.String("default")
    ValueProvider<String> getDefaultColumnFamily();

    @SuppressWarnings("unused")
    void setDefaultColumnFamily(ValueProvider<String> defaultColumnFamily);

    @TemplateParameter.Text(
        order = 9,
        optional = true,
        description = "The Row Key Separator",
        helpText =
            "All primary key fields will be appended to form your Bigtable Row Key. The rowKeySeparator allows you to specify a character separator. Default separator is '#'.")
    @Default.String("#")
    ValueProvider<String> getRowKeySeparator();

    @SuppressWarnings("unused")
    void setRowKeySeparator(ValueProvider<String> rowKeySeparator);

    @TemplateParameter.Boolean(
        order = 10,
        optional = true,
        description = "If true, large rows will be split into multiple MutateRows requests",
        helpText =
            "The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic. ")
    ValueProvider<Boolean> getSplitLargeRows();

    void setSplitLargeRows(ValueProvider<Boolean> splitLargeRows);
  }

  /**
   * Runs a pipeline to copy one Cassandra table to Cloud Bigtable.
   *
   * @param args arguments to the pipeline
   */
  public static void main(String[] args) {

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    // Split the Cassandra Hosts value provider into a list value provider.
    ValueProvider.NestedValueProvider<List<String>, String> hosts =
        ValueProvider.NestedValueProvider.of(
            options.getCassandraHosts(),
            (SerializableFunction<String, List<String>>) value -> Arrays.asList(value.split(",")));

    Pipeline p = Pipeline.create(PipelineUtils.tweakPipelineOptions(options));

    // Create a factory method to inject the CassandraRowMapperFn to allow custom type mapping.
    SerializableFunction<Session, Mapper> cassandraObjectMapperFactory =
        new CassandraRowMapperFactory(options.getCassandraTable(), options.getCassandraKeyspace());

    CassandraIO.Read<Row> source =
        CassandraIO.<Row>read()
            .withHosts(hosts)
            .withPort(options.getCassandraPort())
            .withKeyspace(options.getCassandraKeyspace())
            .withTable(options.getCassandraTable())
            .withMapperFactoryFn(cassandraObjectMapperFactory)
            .withEntity(Row.class)
            .withCoder(SerializableCoder.of(Row.class));

    BigtableIO.Write sink =
        BigtableIO.write()
            .withProjectId(options.getBigtableProjectId())
            .withInstanceId(options.getBigtableInstanceId())
            .withTableId(options.getBigtableTableId());

    p.apply("Read from Cassandra", source)
        .apply(
            "Convert Row",
            ParDo.of(
                BeamRowToBigtableFn.createWithSplitLargeRows(
                    options.getRowKeySeparator(),
                    options.getDefaultColumnFamily(),
                    options.getSplitLargeRows(),
                    BeamRowToBigtableFn.MAX_MUTATION_PER_REQUEST)))
        .apply("Write to Bigtable", sink);
    p.run();
  }
}

MongoDB to BigQuery

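The MongoDB to BigQuery template is a batch pipeline that reads documents from MongoDB and writes them to BigQuery according to the option specified by the userOption parameter.

Requirements for this pipeline:

The target BigQuery dataset must exist.
The source MongoDB instance must be accessible from the Dataflow worker machines.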

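Template parameters

Parameter	Description
`mongoDbUri`	The MongoDB connection URI, in the format `mongodb+srv://:@`.
`database`	The MongoDB database to read the collection from. For example, `my-db`.
`collection`	The name of the collection in the MongoDB database. For example, `my-collection`.
`outputTableSpec`	The BigQuery table to write to. For example, `bigquery-project:dataset.output_table`.
`userOption`	`FLATTEN` or `NONE`. `FLATTEN` flattens the documents to the first level. `NONE` stores the whole document as a JSON string.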
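
To make the difference between the two `userOption` values concrete, the sketch below models a document as a nested map and shows one possible row shape for each option. It is a hypothetical illustration, not the template's implementation. The column name `source_data`, the choice to render nested documents as JSON strings under `FLATTEN`, and the tiny JSON renderer (which handles only maps, strings, numbers, booleans, and null) are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch contrasting the FLATTEN and NONE row shapes. */
final class UserOptionSketch {

  /** Renders maps, strings, numbers, booleans, and null as JSON text. */
  static String toJson(Object value) {
    if (value == null) {
      return "null";
    }
    if (value instanceof Map) {
      StringBuilder json = new StringBuilder("{");
      boolean first = true;
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        if (!first) {
          json.append(',');
        }
        json.append(quote(String.valueOf(entry.getKey())));
        json.append(':').append(toJson(entry.getValue()));
        first = false;
      }
      return json.append('}').toString();
    }
    if (value instanceof Number || value instanceof Boolean) {
      return value.toString();
    }
    return quote(value.toString());
  }

  /** Wraps a string in double quotes, escaping backslashes and quotes only. */
  static String quote(String text) {
    return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** FLATTEN (assumed shape): one column per top-level field of the document. */
  static Map<String, Object> flatten(Map<String, Object> document) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : document.entrySet()) {
      Object value = entry.getValue();
      row.put(entry.getKey(), value instanceof Map ? toJson(value) : value);
    }
    return row;
  }

  /** NONE (assumed shape): the whole document as one JSON string column. */
  static Map<String, Object> none(Map<String, Object> document) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("source_data", toJson(document)); // hypothetical column name
    return row;
  }

  public static void main(String[] args) {
    Map<String, Object> address = new LinkedHashMap<>();
    address.put("city", "x");
    Map<String, Object> document = new LinkedHashMap<>();
    document.put("name", "a");
    document.put("address", address);
    System.out.println(flatten(document));
    System.out.println(none(document));
  }
}
```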

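Running the MongoDB to BigQuery template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the MongoDB to BigQuery template.
In the provided parameter fields, enter your parameter values.
Click Run job.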

gcloud

In your shell or terminal, run the template:

gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery \
    --parameters \
outputTableSpec=OUTPUT_TABLE_SPEC,\
mongoDbUri=MONGO_DB_URI,\
database=DATABASE,\
collection=COLLECTION,\
userOption=USER_OPTION

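Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- the version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, production environments should use the template stored in the most recent dated parent folder.
OUTPUT_TABLE_SPEC: the name of your target BigQuery table.
MONGO_DB_URI: your MongoDB URI.
DATABASE: your MongoDB database.
COLLECTION: your MongoDB collection.
USER_OPTION: FLATTEN or NONE.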

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "outputTableSpec": "OUTPUT_TABLE_SPEC",
          "mongoDbUri": "MONGO_DB_URI",
          "database": "DATABASE",
          "collection": "COLLECTION",
          "userOption": "USER_OPTION"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/MongoDB_to_BigQuery"
   }
}

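Replace the following:

PROJECT_ID: the ID of the Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest, to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates/latest/)
- a version name, such as 2021-09-20-00_RC00, to use a specific version of the template, which is nested in the corresponding dated parent folder in the bucket (gs://dataflow-templates/)
Note: The latest version of a template might be updated with breaking changes. To keep such changes from affecting your production workflows, use the templates stored in the most recent dated parent folder in production.
OUTPUT_TABLE_SPEC: the name of your target BigQuery table.
MONGO_DB_URI: your MongoDB URI.
DATABASE: your MongoDB database.
COLLECTION: your MongoDB collection.
USER_OPTION: FLATTEN or NONE.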
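
As a rough illustration of the request described above, the following sketch assembles the launch call with the JDK's `java.net.http` API (Java 11 or later). It only builds the request and does not send it. The class and method names are hypothetical, the body uses the `outputTableSpec` parameter from the template parameters table, the values are assumed to need no JSON escaping, and the access token is assumed to come from elsewhere, for example from `gcloud auth print-access-token`.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative only: assembles (but does not send) the flexTemplates:launch request. */
public class LaunchRequestSketch {

  /** Builds the JSON body. Values are inserted verbatim, so they must not need escaping. */
  static String buildBody(String jobName, String version, String outputTableSpec,
      String mongoDbUri, String database, String collection, String userOption) {
    return String.format(
        "{\"launch_parameter\":{\"jobName\":\"%s\",\"parameters\":{"
            + "\"outputTableSpec\":\"%s\",\"mongoDbUri\":\"%s\",\"database\":\"%s\","
            + "\"collection\":\"%s\",\"userOption\":\"%s\"},"
            + "\"containerSpecGcsPath\":"
            + "\"gs://dataflow-templates/%s/flex/MongoDB_to_BigQuery\"}}",
        jobName, outputTableSpec, mongoDbUri, database, collection, userOption, version);
  }

  /** Builds the authenticated POST request for the given project and location. */
  static HttpRequest buildRequest(
      String projectId, String location, String accessToken, String body) {
    URI uri = URI.create(String.format(
        "https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/flexTemplates:launch",
        projectId, location));
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    // Placeholder values; replace them as described in the list above.
    String body = buildBody("JOB_NAME", "latest", "bigquery-project:dataset.output_table",
        "MONGO_DB_URI", "my-db", "my-collection", "FLATTEN");
    HttpRequest request = buildRequest("PROJECT_ID", "us-central1", "ACCESS_TOKEN", body);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(body);
  }
}
```

To actually launch the job, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and the response checked for a job ID. In production code, a JSON library would be a safer way to build the body than string formatting.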

Template source code

Java

/*
 * Copyright (C) 2019 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.mongodb.templates;

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.BigQueryWriteOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.JavascriptDocumentTransformerOptions;
import com.google.cloud.teleport.v2.mongodb.options.MongoDbToBigQueryOptions.MongoDbOptions;
import com.google.cloud.teleport.v2.mongodb.templates.MongoDbToBigQuery.Options;
import com.google.cloud.teleport.v2.options.BigQueryStorageApiBatchOptions;
import com.google.cloud.teleport.v2.transforms.JavascriptDocumentTransformer.TransformDocumentViaJavascript;
import com.google.cloud.teleport.v2.utils.BigQueryIOUtils;
import java.io.IOException;
import javax.script.ScriptException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.bson.Document;

/**
 * The {@link MongoDbToBigQuery} pipeline is a batch pipeline which ingests data from MongoDB and
 * outputs the resulting records to BigQuery.
 */
@Template(
    name = "MongoDB_to_BigQuery",
    category = TemplateCategory.BATCH,
    displayName = "MongoDB to BigQuery",
    description =
        "A batch pipeline which reads data documents from MongoDB and writes them to BigQuery.",
    optionsClass = Options.class,
    flexContainerName = "mongodb-to-bigquery",
    contactInformation = "https://cloud.google.com/support")
public class MongoDbToBigQuery {
  /**
   * Options supported by {@link MongoDbToBigQuery}
   *
   * <p>Inherits standard configuration options.
   */
  public interface Options
      extends PipelineOptions,
          MongoDbOptions,
          BigQueryWriteOptions,
          BigQueryStorageApiBatchOptions,
          JavascriptDocumentTransformerOptions {}

  private static class ParseAsDocumentsFn extends DoFn<String, Document> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(Document.parse(context.element()));
    }
  }

  public static void main(String[] args)
      throws ScriptException, IOException, NoSuchMethodException {
    UncaughtExceptionLogger.register();

    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);

    BigQueryIOUtils.validateBQStorageApiOptionsBatch(options);

    run(options);
  }

  public static boolean run(Options options)
      throws ScriptException, IOException, NoSuchMethodException {
    Pipeline pipeline = Pipeline.create(options);
    String userOption = options.getUserOption();

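    // Resolve the BigQuery schema before building the pipeline. The UDF-aware helper is used
    // only when both the UDF function name and its Cloud Storage path are set.
    TableSchema bigquerySchema;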

    if (options.getJavascriptDocumentTransformFunctionName() != null
        && options.getJavascriptDocumentTransformGcsPath() != null) {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchemaForUDF(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getJavascriptDocumentTransformGcsPath(),
              options.getJavascriptDocumentTransformFunctionName(),
              options.getUserOption());
    } else {
      bigquerySchema =
          MongoDbUtils.getTableFieldSchema(
              options.getMongoDbUri(),
              options.getDatabase(),
              options.getCollection(),
              options.getUserOption());
    }

    pipeline
        .apply(
            "Read Documents",
            MongoDbIO.read()
                .withUri(options.getMongoDbUri())
                .withDatabase(options.getDatabase())
                .withCollection(options.getCollection()))
        .apply(
            "UDF",
            TransformDocumentViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptDocumentTransformGcsPath())
                .setFunctionName(options.getJavascriptDocumentTransformFunctionName())
                .build())
        .apply(
            "Transform to TableRow",
            ParDo.of(
                new DoFn<Document, TableRow>() {

                  @ProcessElement
                  public void process(ProcessContext c) {
                    Document document = c.element();
                    TableRow row = MongoDbUtils.getTableSchema(document, userOption);
                    c.output(row);
                  }
                }))
        .apply(
            "Write to Bigquery",
            BigQueryIO.writeTableRows()
                .to(options.getOutputTableSpec())
                .withSchema(bigquerySchema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    pipeline.run();
    return true;
  }
}