Cloud Storage to Elasticsearch template

The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.

Pipeline requirements

The Cloud Storage bucket must exist.
An Elasticsearch host on a Google Cloud instance or on Elasticsearch Cloud that is accessible from Dataflow must exist.
A BigQuery table for error output must exist.

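CSV schema

If the CSV files contain headers, set the containsHeaders template parameter to true.

Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. The following example shows a JSON schema: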

[{"name":"id", "type":"text"}, {"name":"age", "type":"integer"}]
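To make the role of the schema concrete, here is a minimal sketch of how a schema in this format could map one CSV line to a JSON document. It is an illustration only, not the template's parser: the helper name lineToDocument is hypothetical, the split is naive (quoted fields are not handled), and the rule that "integer" columns become JSON numbers is an assumption.

```javascript
// Schema in the same format as the example above.
var schema = [{ name: "id", type: "text" }, { name: "age", type: "integer" }];

// Hypothetical helper: converts one CSV line to a stringified JSON document.
function lineToDocument(line, schema, delimiter) {
  // Naive split: quoted fields that contain the delimiter are not handled.
  var values = line.split(delimiter || ",");
  if (values.length !== schema.length) {
    throw new Error("Expected " + schema.length + " columns, got " + values.length);
  }
  var doc = {};
  schema.forEach(function (field, i) {
    var raw = values[i].trim();
    // Assumption: "integer" columns become numbers; all other types stay strings.
    doc[field.name] = field.type === "integer" ? parseInt(raw, 10) : raw;
  });
  return JSON.stringify(doc);
}
```

With this sketch, lineToDocument("abc,42", schema) yields the string {"id":"abc","age":42}.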

Alternatively, you can provide a user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents.

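Template parameters

Required parameters

deadletterTable: The BigQuery dead-letter table to send failed inserts to. For example, your-project:your-dataset.your-table-name.
inputFileSpec: The Cloud Storage file pattern to search for CSV files. For example, gs://mybucket/test-*.csv.
connectionUrl: The Elasticsearch URL in the format https://hostname:[port]. If using Elastic Cloud, specify the CloudID. For example, https://elasticsearch-host:9200.
apiKey: The Base64-encoded API key to use for authentication.
index: The Elasticsearch index that the requests are issued to. For example, my-index.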

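Optional parameters

inputFormat: The input file format. Defaults to CSV.
containsHeaders: Whether the input CSV files contain a header record (true/false). Only required when reading CSV files. Defaults to false.
delimiter: The column delimiter of the input text files. Defaults to a comma (,).
csvFormat: The CSV format specification to use for parsing records. Defaults to Default. For details, see https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html. The value must exactly match one of the format names listed at https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Predefined.html.
jsonSchemaPath: The path to the JSON schema. Defaults to null. For example, gs://path/to/schema.
largeNumFiles: Set to true if the number of files is in the tens of thousands. Defaults to false.
csvFileEncoding: The character encoding of the CSV files. Allowed values are US-ASCII, ISO-8859-1, UTF-8, and UTF-16. Defaults to UTF-8.
logDetailedCsvConversionErrors: Set to true to enable detailed error logging when CSV parsing fails. Note that this might expose sensitive data in the logs (for example, if the CSV file contains passwords). Defaults to false.
elasticsearchUsername: The Elasticsearch username to authenticate with. If specified, the value of apiKey is ignored.
elasticsearchPassword: The Elasticsearch password to authenticate with. If specified, the value of apiKey is ignored.
batchSize: The batch size in number of documents. Defaults to 1000.
batchSizeBytes: The batch size in number of bytes. Defaults to 5242880 (5 MB).
maxRetryAttempts: The maximum number of retry attempts. Must be greater than zero. Defaults to no retries.
maxRetryDuration: The maximum retry duration in milliseconds. Must be greater than zero. Defaults to no retries.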
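propertyAsIndex: The property in the document being indexed whose value specifies the _index metadata to include with the document in bulk requests. Takes precedence over an _index UDF. Defaults to none.
javaScriptIndexFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
javaScriptIndexFnName: The name of the UDF JavaScript function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
propertyAsId: The property in the document being indexed whose value specifies the _id metadata to include with the document in bulk requests. Takes precedence over an _id UDF. Defaults to none.
javaScriptIdFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptIdFnName: The name of the UDF JavaScript function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnName: The name of the UDF JavaScript function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptIsDeleteFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that determines whether to delete the document instead of inserting or updating it. The function returns the string value true or false. Defaults to none.
javaScriptIsDeleteFnName: The name of the UDF JavaScript function that determines whether to delete the document instead of inserting or updating it. The function returns the string value true or false. Defaults to none.
usePartialUpdate: Whether to use partial updates (update rather than create or index, allowing partial documents) with Elasticsearch requests. Defaults to false.
bulkInsertMethod: Whether to use INDEX (index, allows upserts) or CREATE (create, errors on duplicate _id) with Elasticsearch bulk requests. Defaults to CREATE.
trustSelfSignedCerts: Whether to trust self-signed certificates. An installed Elasticsearch instance might have a self-signed certificate. Set this parameter to true to bypass validation of the SSL certificate. Defaults to false.
disableCertificateValidation: If true, trust the self-signed SSL certificate. An Elasticsearch instance might have a self-signed certificate. To bypass validation of the certificate, set this parameter to true. Defaults to false.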
apiKeyKMSEncryptionKey: The Cloud KMS key to decrypt the API key. This parameter is required if apiKeySource is set to KMS. If this parameter is provided, pass in an encrypted apiKey string. Encrypt parameters using the KMS API encrypt endpoint. For the key, use the format projects/<PROJECT_ID>/locations/<KEY_REGION>/keyRings/<KEY_RING>/cryptoKeys/<KMS_KEY_NAME>. See: https://cloud.google.com/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt. For example, projects/your-project-id/locations/global/keyRings/your-keyring/cryptoKeys/your-key-name.
apiKeySecretId: The Secret Manager secret ID for the apiKey. If apiKeySource is set to SECRET_MANAGER, provide this parameter. Use the format projects/<PROJECT_ID>/secrets/<SECRET_ID>/versions/<SECRET_VERSION>. For example, projects/your-project-id/secrets/your-secret/versions/your-secret-version.
apiKeySource: The source of the API key. Allowed values are PLAINTEXT, KMS, and SECRET_MANAGER. This parameter is required when you use Secret Manager or KMS. If apiKeySource is set to KMS, apiKeyKMSEncryptionKey and the encrypted apiKey must be provided. If apiKeySource is set to SECRET_MANAGER, apiKeySecretId must be provided. If apiKeySource is set to PLAINTEXT, apiKey must be provided. Defaults to PLAINTEXT.
socketTimeout: If set, overwrites the default max retry timeout and the default socket timeout (30,000 ms) in the Elastic RestClient.
javascriptTextTransformGcsPath: The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName: The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples (https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples).
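
Several of the parameters above depend on each other. The following sketch shows one way to check a parameters object before launching a job. The function name checkParameters is hypothetical and the template performs its own validation; this only mirrors the apiKeySource rules listed above and the retry rule enforced in the template source shown later in this document (maxRetryAttempts and maxRetryDuration must be set together).

```javascript
// Hypothetical pre-flight check; returns a list of human-readable problems.
function checkParameters(params) {
  var errors = [];
  var source = params.apiKeySource || "PLAINTEXT"; // documented default

  if (source === "KMS") {
    if (!params.apiKeyKMSEncryptionKey || !params.apiKey) {
      errors.push("KMS requires apiKeyKMSEncryptionKey and an encrypted apiKey");
    }
  } else if (source === "SECRET_MANAGER") {
    if (!params.apiKeySecretId) {
      errors.push("SECRET_MANAGER requires apiKeySecretId");
    }
  } else if (source === "PLAINTEXT") {
    if (!params.apiKey) {
      errors.push("PLAINTEXT requires apiKey");
    }
  } else {
    errors.push("Unknown apiKeySource: " + source);
  }

  // Retry settings must be provided as a pair and must be positive.
  var hasAttempts = params.maxRetryAttempts !== undefined;
  var hasDuration = params.maxRetryDuration !== undefined;
  if (hasAttempts !== hasDuration) {
    errors.push("maxRetryAttempts and maxRetryDuration must be set together");
  }
  ["maxRetryAttempts", "maxRetryDuration"].forEach(function (name) {
    if (params[name] !== undefined && !(Number(params[name]) > 0)) {
      errors.push(name + " must be greater than zero");
    }
  });
  return errors;
}
```

An empty array means that none of these particular rules is violated; it does not guarantee that the job will launch.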

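User-defined functions

This template supports user-defined functions (UDFs) at several points in the pipeline, described below. For more information, see Create user-defined functions for Dataflow templates.

Text transform function

Transforms the CSV data into an Elasticsearch document.

Template parameters:

javascriptTextTransformGcsPath: the Cloud Storage URI of the JavaScript file.
javascriptTextTransformFunctionName: the name of the JavaScript function.

Function specification:

Input: a single line from an input CSV file.
Output: a stringified JSON document to insert into Elasticsearch.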
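
As an illustration of this contract, the following is a minimal sketch of a text transform UDF. The function name transform and the two-column layout (id, age) are assumptions chosen to match the JSON schema example earlier in this document; real CSV files with quoted fields need a more careful parser. The code sticks to ES5 syntax to stay conservative about the JavaScript engine that runs the UDF.

```javascript
/**
 * Sketch of a text transform UDF.
 * Input: one line of CSV text. Output: a stringified JSON document.
 * Assumes two comma-separated columns: id (text) and age (integer).
 */
function transform(line) {
  var columns = line.split(",");
  var doc = {
    id: columns[0].trim(),
    age: parseInt(columns[1], 10)
  };
  return JSON.stringify(doc);
}
```

To use a function like this, upload the file to Cloud Storage, pass its URI in javascriptTextTransformGcsPath, and pass transform in javascriptTextTransformFunctionName.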

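Index function

Returns the index to which the document belongs.

Template parameters:

javaScriptIndexFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIndexFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _index metadata field.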
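
A minimal sketch of an index function follows. The function name getIndex, the region field, and the customers- index prefix are all hypothetical; the only part taken from the specification above is the shape of the function (JSON string in, index name out).

```javascript
/**
 * Sketch of an index UDF.
 * Input: the document as a JSON string. Output: the _index value.
 */
function getIndex(docJson) {
  var doc = JSON.parse(docJson);
  // Hypothetical routing rule: one index per region, with a fallback index.
  if (doc.region) {
    return "customers-" + String(doc.region).toLowerCase();
  }
  return "customers-default";
}
```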

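Document ID function

Returns the document ID.

Template parameters:

javaScriptIdFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIdFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _id metadata field.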
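
A minimal sketch of a document ID function follows. The function name getId is hypothetical, and the code assumes that every document carries an id field, as in the JSON schema example earlier in this document.

```javascript
/**
 * Sketch of a document ID UDF.
 * Input: the document as a JSON string. Output: the _id value.
 */
function getId(docJson) {
  var doc = JSON.parse(docJson);
  if (doc.id === undefined || doc.id === null) {
    // Failing loudly is a design choice in this sketch, not a template requirement.
    throw new Error("Document has no id field");
  }
  return String(doc.id);
}
```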

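Document deletion function

Specifies whether to delete a document. To use this function, set the bulk insert mode to INDEX and provide a document ID function.

Template parameters:

javaScriptIsDeleteFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIsDeleteFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: return the string "true" to delete the document, or "false" to upsert the document.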
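
A minimal sketch of a deletion function follows. The function name isDelete and the deleted flag are hypothetical conventions. Note that the function returns the strings "true" and "false", not booleans, as the specification above requires.

```javascript
/**
 * Sketch of a deletion UDF.
 * Input: the document as a JSON string.
 * Output: the string "true" to delete the document, "false" to upsert it.
 */
function isDelete(docJson) {
  var doc = JSON.parse(docJson);
  // Hypothetical convention: a "deleted" flag stored as a boolean or as text.
  var flagged = doc.deleted === true || doc.deleted === "true";
  return flagged ? "true" : "false";
}
```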

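Mapping type function

Returns the document's mapping type.

Template parameters:

javaScriptTypeFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptTypeFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _type metadata field.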
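
A minimal sketch of a mapping type function follows. The function name getType and the kind field are hypothetical, and the _doc fallback is an assumption about what suits your Elasticsearch version, so check how your cluster handles mapping types before relying on it.

```javascript
/**
 * Sketch of a mapping type UDF.
 * Input: the document as a JSON string. Output: the _type value.
 */
function getType(docJson) {
  var doc = JSON.parse(docJson);
  // Hypothetical rule: use the document's "kind" field when it is a non-empty string.
  if (typeof doc.kind === "string" && doc.kind.length > 0) {
    return doc.kind;
  }
  return "_doc"; // assumed fallback
}
```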

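Run the template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Storage to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.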

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/GCS_to_Elasticsearch \
    --parameters \
inputFileSpec=INPUT_FILE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX,\
deadletterTable=DEADLETTER_TABLE

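Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates-REGION_NAME/latest/)
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket (gs://dataflow-templates-REGION_NAME/)
Note: The latest version of the template might be updated with breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.
DEADLETTER_TABLE: your BigQuery table.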

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileSpec": "INPUT_FILE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX",
          "deadletterTable": "DEADLETTER_TABLE"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/GCS_to_Elasticsearch"
   }
}

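Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket (gs://dataflow-templates-LOCATION/latest/)
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket (gs://dataflow-templates-LOCATION/)
Note: The latest version of the template might be updated with breaking changes. To keep such changes from affecting production workflows, use the template stored in the most recent dated parent folder in production environments.
LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.
DEADLETTER_TABLE: your BigQuery table.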
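
If you assemble the request programmatically, the placeholders above can be filled in with a small helper. The following sketch only builds the URL and body shown in the request above; the function name buildLaunchRequest and the shape of its options object are hypothetical, and sending the request (including authentication) is left out.

```javascript
// Hypothetical helper: builds the URL and JSON body for the launch request.
function buildLaunchRequest(opts) {
  var base = "https://dataflow.googleapis.com/v1b3";
  return {
    url: base + "/projects/" + opts.projectId +
      "/locations/" + opts.location + "/flexTemplates:launch",
    body: {
      launch_parameter: {
        jobName: opts.jobName,
        parameters: {
          inputFileSpec: opts.inputFileSpec,
          connectionUrl: opts.connectionUrl,
          apiKey: opts.apiKey,
          index: opts.index,
          deadletterTable: opts.deadletterTable
        },
        containerSpecGcsPath: "gs://dataflow-templates-" + opts.location +
          "/" + opts.version + "/flex/GCS_to_Elasticsearch"
      }
    }
  };
}
```

Serializing the body with JSON.stringify(request.body) produces valid JSON without the trailing-comma mistakes that are easy to make when editing the body by hand.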

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.GCSToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.CsvConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteStringMessageErrors;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.base.Strings;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link GCSToElasticsearch} pipeline exports data from one or more CSV files in Cloud Storage
 * to Elasticsearch.
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/README_GCS_to_Elasticsearch.md">README</a>
 * for instructions on how to use or modify this template.
 */
@MultiTemplate({
  @Template(
      name = "GCS_to_Elasticsearch",
      category = TemplateCategory.BATCH,
      displayName = "Cloud Storage to Elasticsearch",
      description = {
        "The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.",
        "If the CSV files contain headers, set the <code>containsHeaders</code> template parameter to <code>true</code>.\n"
            + "Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. "
            + "The following example shows a JSON schema:\n"
            + "<code>[{\"name\":\"id\", \"type\":\"text\"}, {\"name\":\"age\", \"type\":\"integer\"}]</code>\n"
            + "Alternatively, you can provide a Javascript user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents."
      },
      optionsClass = GCSToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformReloadIntervalMinutes",
        "pythonExternalTextTransformGcsPath",
        "pythonExternalTextTransformFunctionName"
      },
      flexContainerName = "gcs-to-elasticsearch",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The Cloud Storage bucket must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elasticsearch Cloud that is accessible from Dataflow must exist.",
        "A BigQuery table for error output must exist."
      }),
  @Template(
      name = "GCS_to_Elasticsearch_Xlang",
      category = TemplateCategory.BATCH,
      displayName = "Cloud Storage to Elasticsearch with Python UDFs",
      type = Template.TemplateType.XLANG,
      description = {
        "The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.",
        "If the CSV files contain headers, set the <code>containsHeaders</code> template parameter to <code>true</code>.\n"
            + "Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. "
            + "The following example shows a JSON schema:\n"
            + "<code>[{\"name\":\"id\", \"type\":\"text\"}, {\"name\":\"age\", \"type\":\"integer\"}]</code>\n"
            + "Alternatively, you can provide a Python user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents."
      },
      optionsClass = GCSToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformGcsPath",
        "javascriptTextTransformFunctionName",
        "javascriptTextTransformReloadIntervalMinutes"
      },
      flexContainerName = "gcs-to-elasticsearch-xlang",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The Cloud Storage bucket must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elasticsearch Cloud that is accessible from Dataflow must exist.",
        "A BigQuery table for error output must exist."
      })
})
public class GCSToElasticsearch {

  /** The tag for the headers of the CSV if required. */
  static final TupleTag<String> CSV_HEADERS = new TupleTag<String>() {};

  /** The tag for the lines of the CSV. */
  static final TupleTag<String> CSV_LINES = new TupleTag<String>() {};

  /** The tag for the dead-letter output of the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output for the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(GCSToElasticsearch.class);

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    GCSToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(GCSToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  private static PipelineResult run(GCSToElasticsearchOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    // Throw error if containsHeaders is true and a schema or Udf is also set.
    if (options.getContainsHeaders()) {
      checkArgument(
          options.getJavascriptTextTransformGcsPath() == null
              && options.getJsonSchemaPath() == null
              && options.getPythonExternalTextTransformGcsPath() == null,
          "Cannot parse file containing headers with UDF or Json schema.");
    }

    // Throw error if only one retry configuration parameter is set.
    checkArgument(
        (options.getMaxRetryAttempts() == null && options.getMaxRetryDuration() == null)
            || (options.getMaxRetryAttempts() != null && options.getMaxRetryDuration() != null),
        "To specify retry configuration both max attempts and max duration must be set.");

    // Throw error if both Javascript UDF and Python UDF are set. We can only apply one or the
    // other.
    boolean useJavascriptUdf = !Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath());
    boolean usePythonUdf = !Strings.isNullOrEmpty(options.getPythonExternalTextTransformGcsPath());
    if (useJavascriptUdf && usePythonUdf) {
      throw new IllegalArgumentException(
          "Either javascript or Python gcs path must be provided, but not both.");
    }

    /*
     * Steps: 1) Read records from CSV(s) via {@link CsvConverters.ReadCsv}.
     *        2) Convert lines to JSON strings via {@link CsvConverters.LineToFailsafeJson}.
     *        3a) Write JSON strings as documents to Elasticsearch via {@link ElasticsearchIO}.
     *        3b) Write elements that failed processing to {@link org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO}.
     */
    PCollectionTuple readCsvLines =
        pipeline
            /*
             * Step 1: Read CSV file(s) from Cloud Storage using {@link CsvConverters.ReadCsv}.
             */
            .apply(
            "ReadCsv",
            CsvConverters.ReadCsv.newBuilder()
                .setCsvFormat(options.getCsvFormat())
                .setDelimiter(options.getDelimiter())
                .setHasHeaders(options.getContainsHeaders())
                .setInputFileSpec(options.getInputFileSpec())
                .setHeaderTag(CSV_HEADERS)
                .setLineTag(CSV_LINES)
                .setFileEncoding(options.getCsvFileEncoding())
                .build());
    /*
     * Step 2: Convert lines to Elasticsearch document.
     */
    CsvConverters.LineToFailsafeJson.Builder lineToFailsafeJsonBuilder =
        CsvConverters.LineToFailsafeJson.newBuilder()
            .setDelimiter(options.getDelimiter())
            .setJsonSchemaPath(options.getJsonSchemaPath())
            .setHeaderTag(CSV_HEADERS)
            .setLineTag(CSV_LINES)
            .setUdfOutputTag(PROCESSING_OUT)
            .setUdfDeadletterTag(PROCESSING_DEADLETTER_OUT);
    if (options.getPythonExternalTextTransformGcsPath() != null) {
      lineToFailsafeJsonBuilder
          .setPythonUdfFileSystemPath(options.getPythonExternalTextTransformGcsPath())
          .setPythonUdfFunctionName(options.getPythonExternalTextTransformFunctionName());
    } else {
      lineToFailsafeJsonBuilder
          .setJavascriptUdfFileSystemPath(options.getJavascriptTextTransformGcsPath())
          .setJavascriptUdfFunctionName(options.getJavascriptTextTransformFunctionName());
    }
    PCollectionTuple convertedCsvLines =
        readCsvLines.apply("ConvertLine", lineToFailsafeJsonBuilder.build());
    /*
     * Step 3a: Write elements that were successfully processed to Elasticsearch using {@link WriteToElasticsearch}.
     */
    convertedCsvLines
        .get(PROCESSING_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setUserAgent("dataflow-gcs-to-elasticsearch-template/v2")
                .setOptions(options.as(GCSToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link BigQueryIO}.
     */
    convertedCsvLines
        .get(PROCESSING_DEADLETTER_OUT)
        .apply(
            "AddTimestamps",
            WithTimestamps.of((FailsafeElement<String, String> failures) -> new Instant()))
        .apply(
            "WriteFailedElementsToBigQuery",
            WriteStringMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    return pipeline.run();
  }
}

What's next

Learn about Dataflow templates.
See the list of Google-provided templates.
