Pub/Sub to Elasticsearch 模板

Pub/Sub to Elasticsearch 模板是一种流处理流水线，可从 Pub/Sub 订阅读取消息、执行用户定义的函数 (UDF) 并将其作为文档写入 Elasticsearch。Dataflow 模板使用 Elasticsearch 的数据流功能跨多个索引存储时序数据，同时为请求提供单个命名资源。数据流非常适合存储在 Pub/Sub 中的日志、指标、跟踪记录和其他持续生成的数据。

该模板会创建一个名为 logs-gcp.DATASET-NAMESPACE 的数据流，其中：

DATASET 是 dataset 模板参数的值；如果未指定，则为 pubsub。
NAMESPACE 是 namespace 模板参数的值；如果未指定，则为 default。

流水线要求

来源 Pub/Sub 订阅必须存在，并且消息必须采用有效的 JSON 格式进行编码。
Google Cloud 实例上或 Elastic Cloud 上使用 Elasticsearch 7.0 版或更高版本的可公开访问的 Elasticsearch 主机。如需了解详情，请参阅适用于 Elastic 的 Google Cloud 集成。
用于错误输出的 Pub/Sub 主题。

模板参数

必需参数

inputSubscription：要从中使用输入的 Pub/Sub 订阅。例如 projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>。
errorOutputTopic：用于发布失败记录的 Pub/Sub 输出主题，格式为 projects/<PROJECT_ID>/topics/<TOPIC_NAME>。
connectionUrl：Elasticsearch 网址，格式为 https://hostname:[port]。如果使用 Elastic Cloud，请指定 CloudID。例如 https://elasticsearch-host:9200。
apiKey：用于身份验证的 Base64 编码 API 密钥。

可选参数

dataset：使用 Pub/Sub 发送的日志类型，我们为其提供了开箱即用的信息中心。已知的日志类型值包括 audit、vpcflow 和 firewall。默认值为 pubsub。
namespace：任意分组，例如环境（dev、prod 或 qa）、团队或战略性业务部门。默认值为 default。
elasticsearchTemplateVersion：Dataflow 模板版本标识符，通常由 Google Cloud 定义。默认值为 1.0.0。
javascriptTextTransformGcsPath：.js 文件的 Cloud Storage URI，用于定义要使用的 JavaScript 用户定义的函数 (UDF)。例如 gs://my-bucket/my-udfs/my_file.js。
javascriptTextTransformFunctionName：要使用的 JavaScript 用户定义的函数 (UDF) 的名称。例如，如果 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ }，则函数名称为 myTransform。如需查看 JavaScript UDF 示例，请参阅 UDF 示例 (https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples)。
javascriptTextTransformReloadIntervalMinutes：指定重新加载 UDF 的频率（以分钟为单位）。如果值大于 0，则 Dataflow 会定期检查 Cloud Storage 中的 UDF 文件，并在文件修改时重新加载 UDF。此参数可让您在流水线运行时更新 UDF，而无需重启作业。如果值为 0，则停用 UDF 重新加载。默认值为 0。
elasticsearchUsername：用于进行身份验证的 Elasticsearch 用户名。如果指定，则系统会忽略 apiKey 的值。
elasticsearchPassword：用于进行身份验证的 Elasticsearch 密码。如果指定，则系统会忽略 apiKey 的值。
batchSize：按文档数量的批次大小。默认值为 1000。
batchSizeBytes：批次大小（以字节数为单位）。默认值为 5242880 (5mb)。
maxRetryAttempts：重试次数上限。必须大于零。默认值为 no retries。
maxRetryDuration：重试时长上限（以毫秒为单位）。必须大于零。默认值为 no retries。
propertyAsIndex：要编入索引的文档中的一个属性，其值指定批量请求要包含在文档中的 _index 元数据。优先于 _index UDF。默认值为 none。
javaScriptIndexFnGcsPath：函数的 JavaScript UDF 来源的 Cloud Storage 路径，该函数指定批量请求要包含在文档中的 _index 元数据。默认值为 none。
javaScriptIndexFnName：UDF JavaScript 函数的名称，该函数指定要将 _index 元数据包含在批量请求的文档中。默认值为 none。
propertyAsId：要编入索引的文档中的一个属性，其值指定批量请求要包含在文档中的 _id 元数据。优先于 _id UDF。默认值为 none。
javaScriptIdFnGcsPath：函数的 JavaScript UDF 来源的 Cloud Storage 路径，该函数指定批量请求要包含在文档中的 _id 元数据。默认值为 none。
javaScriptIdFnName：UDF JavaScript 函数的名称，该函数指定批量请求要包含在文档中的 _id 元数据。默认值为 none。
javaScriptTypeFnGcsPath：函数的 JavaScript UDF 来源的 Cloud Storage 路径，该函数指定要将 _type 元数据包含在批量请求的文档中。默认值为 none。
javaScriptTypeFnName：UDF JavaScript 函数的名称，该函数指定批量请求要包含在文档中的 _type 元数据。默认值为 none。
javaScriptIsDeleteFnGcsPath：函数的 JavaScript UDF 来源的 Cloud Storage 路径，该函数确定是否应删除文档，而不是插入或更新文档。该函数会返回字符串值 true 或 false。默认值为 none。
javaScriptIsDeleteFnName：UDF JavaScript 函数的名称，该函数确定是否应删除文档，而不是插入或更新文档。该函数会返回字符串值 true 或 false。默认值为 none。
usePartialUpdate：是否在 Elasticsearch 请求中使用部分更新（更新而不是创建或编入索引，允许部分文档）。默认值为 false。
bulkInsertMethod：在 Elasticsearch 批量请求中使用 INDEX（编入索引，允许 upsert）还是 CREATE（创建，会对重复 _id 报错）。默认值为 CREATE。
trustSelfSignedCerts：是否信任自签名证书。已安装的 Elasticsearch 实例可能具有自签名证书，将此参数设置为 true 可绕过对 SSL 证书的验证。（默认值为：false。）
disableCertificateValidation：如果为 true，则信任自签名 SSL 证书。Elasticsearch 实例可能具有自签名证书。如需绕过证书的验证，请将此参数设置为 true。默认值为 false。
apiKeyKMSEncryptionKey：用于解密 API 密钥的 Cloud KMS 密钥。如果 apiKeySource 设置为 KMS，则此参数是必需的。如果提供了此参数，请传入加密的 apiKey 字符串。使用 KMS API 加密端点对参数进行加密。对于密钥，请使用 projects/<PROJECT_ID>/locations/<KEY_REGION>/keyRings/<KEY_RING>/cryptoKeys/<KMS_KEY_NAME> 格式。请参阅：https://cloud.google.com/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt。例如，projects/your-project-id/locations/global/keyRings/your-keyring/cryptoKeys/your-key-name。
apiKeySecretId：apiKey 的 Secret Manager Secret ID。如果 apiKeySource 设置为 SECRET_MANAGER，请提供此参数。请使用格式 projects/<PROJECT_ID>/secrets/<SECRET_ID>/versions/<SECRET_VERSION>. For example, projects/your-project-id/secrets/your-secret/versions/your-secret-version`。
apiKeySource：API 密钥的来源。允许的值包括 PLAINTEXT、KMS 和 SECRET_MANAGER。如果您使用 Secret Manager 或 KMS，则此参数是必需的。如果 apiKeySource 设置为 KMS、apiKeyKMSEncryptionKey 和已加密，则必须提供 apiKey。如果 apiKeySource 设置为 SECRET_MANAGER，则必须提供 apiKeySecretId。如果 apiKeySource 设置为 PLAINTEXT，则必须提供 apiKey。默认值为 PLAINTEXT。
socketTimeout：如果设置，则会覆盖 Elastic RestClient 中的默认重试超时上限和默认套接字超时（30,000 毫秒）。

用户定义的函数

此模板支持流水线中多个位置的用户定义的函数 (UDF)，如下所述。如需了解详情，请参阅为 Dataflow 模板创建用户定义的函数。

文本转换函数

将 Pub/Sub 消息转换为 Elasticsearch 文档。

模板参数：

javascriptTextTransformGcsPath：JavaScript 文件的 Cloud Storage URI。
javascriptTextTransformFunctionName：JavaScript 函数的名称。

函数规范：

输入：Pub/Sub 消息数据字段，序列化为 JSON 字符串。
输出：要插入到 Elasticsearch 中的字符串化 JSON 文档。

索引函数

返回文档所属的索引。

模板参数：

javaScriptIndexFnGcsPath：JavaScript 文件的 Cloud Storage URI。
javaScriptIndexFnName：JavaScript 函数的名称。

函数规范：

输入：Elasticsearch 文档，序列化为 JSON 字符串。
输出：文档的 _index 元数据字段的值。

文档 ID 函数

返回文档 ID。

模板参数：

javaScriptIdFnGcsPath：JavaScript 文件的 Cloud Storage URI。
javaScriptIdFnName：JavaScript 函数的名称。

函数规范：

输入：Elasticsearch 文档，序列化为 JSON 字符串。
输出：文档的 _id 元数据字段的值。

文档删除函数

指定是否删除文档。如需使用此函数，请将批量插入模式设置为 INDEX 并提供文档 ID 函数。

模板参数：

javaScriptIsDeleteFnGcsPath：JavaScript 文件的 Cloud Storage URI。
javaScriptIsDeleteFnName：JavaScript 函数的名称。

函数规范：

输入：Elasticsearch 文档，序列化为 JSON 字符串。
输出：返回字符串 "true" 可删除文档，返回 "false" 可更新/插入文档。

映射类型函数

返回文档的映射类型。

模板参数：

javaScriptTypeFnGcsPath：JavaScript 文件的 Cloud Storage URI。
javaScriptTypeFnName：JavaScript 函数的名称。

函数规范：

输入：Elasticsearch 文档，序列化为 JSON 字符串。
输出：文档的 _type 元数据字段的值。

运行模板

控制台

转到 Dataflow 基于模板创建作业页面。

转到“基于模板创建作业”

在作业名称字段中，输入唯一的作业名称。
可选：对于区域性端点，从下拉菜单中选择一个值。默认区域为 us-central1。
如需查看可以在其中运行 Dataflow 作业的区域列表，请参阅 Dataflow 位置。
从 Dataflow 模板下拉菜单中，选择 the Pub/Sub to Elasticsearch template。
在提供的参数字段中，输入您的参数值。
点击运行作业。

gcloud

在 shell 或终端中，运行模板：

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/PubSub_to_Elasticsearch_Flex \
    --parameters \
inputSubscription=SUBSCRIPTION_NAME,\
connectionUrl=CONNECTION_URL,\
dataset=DATASET,\
namespace=NAMESPACE,\
apiKey=APIKEY,\
errorOutputTopic=ERROR_OUTPUT_TOPIC

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Google Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
REGION_NAME：要在其中部署 Dataflow 作业的区域，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates-REGION_NAME/latest/) 中可用
- 版本名称（如 2023-09-12-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates-REGION_NAME/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
ERROR_OUTPUT_TOPIC：用于错误输出的 Pub/Sub 主题
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
CONNECTION_URL：您的 Elasticsearch 网址
DATASET：您的日志类型
NAMESPACE：数据集的命名空间
APIKEY：用于身份验证的 base64 编码 API 密钥

API

如需使用 REST API 来运行模板，请发送 HTTP POST 请求。如需详细了解 API 及其授权范围，请参阅 projects.templates.launch。

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputSubscription": "SUBSCRIPTION_NAME",
          "connectionUrl": "CONNECTION_URL",
          "dataset": "DATASET",
          "namespace": "NAMESPACE",
          "apiKey": "APIKEY",
          "errorOutputTopic": "ERROR_OUTPUT_TOPIC"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/PubSub_to_Elasticsearch_Flex",
   }
}

替换以下内容：

PROJECT_ID：您要在其中运行 Dataflow 作业的 Google Cloud 项目 ID
JOB_NAME：您选择的唯一性作业名称
LOCATION：要在其中部署 Dataflow 作业的区域，例如 us-central1
VERSION：您要使用的模板的版本
您可使用以下值：
- latest，以使用模板的最新版本，该模板在存储桶的未标示日期的父文件夹 (gs://dataflow-templates-REGION_NAME/latest/) 中可用
- 版本名称（如 2023-09-12-00_RC00），以使用模板的特定版本，该版本嵌套在存储桶的相应日期父文件夹 (gs://dataflow-templates-REGION_NAME/) 中
注意：最新版模板可能会随着重大更改而更新。为了防止这些重大更改影响您的生产工作流程，生产环境应使用有最近标示日期的父文件夹中保存的模板。
ERROR_OUTPUT_TOPIC：用于错误输出的 Pub/Sub 主题
SUBSCRIPTION_NAME：您的 Pub/Sub 订阅名称
CONNECTION_URL：您的 Elasticsearch 网址
DATASET：您的日志类型
NAMESPACE：数据集的命名空间
APIKEY：用于身份验证的 base64 编码 API 密钥

模板源代码

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.PubSubToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.FailedPubsubMessageToPubsubTopicFn;
import com.google.cloud.teleport.v2.elasticsearch.transforms.ProcessEventMetadata;
import com.google.cloud.teleport.v2.elasticsearch.transforms.PubSubMessageToJsonDocument;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIndex;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToElasticsearch} pipeline is a streaming pipeline which ingests data in JSON
 * format from PubSub, applies a Javascript UDF if provided and writes the resulting records to
 * Elasticsearch. If the element fails to be processed then it is written to an error output table
 * in BigQuery.
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/README_PubSub_to_Elasticsearch.md">README</a>
 * for instructions on how to use or modify this template.
 */
@MultiTemplate({
  @Template(
      name = "PubSub_to_Elasticsearch_Flex",
      category = TemplateCategory.STREAMING,
      displayName = "Pub/Sub to Elasticsearch",
      description = {
        "The Pub/Sub to Elasticsearch template is a streaming pipeline that reads messages from a Pub/Sub subscription, executes a user-defined function (UDF), and writes them to Elasticsearch as documents. "
            + "The Dataflow template uses Elasticsearch's <a href=\"https://www.elastic.co/guide/en/elasticsearch/reference/master/data-streams.html\">data streams</a> feature to store time series data across multiple indices while giving you a single named resource for requests. "
            + "Data streams are well-suited for logs, metrics, traces, and other continuously generated data stored in Pub/Sub.\n",
        "The template creates a datastream named <code>logs-gcp.DATASET-NAMESPACE</code>, where:\n"
            + "- <code>DATASET</code> is the value of the <code>dataset</code> template parameter, or <code>pubsub</code> if not specified.\n"
            + "- <code>NAMESPACE</code> is the value of the <code>namespace</code> template parameter, or <code>default</code> if not specified."
      },
      optionsClass = PubSubToElasticsearchOptions.class,
      skipOptions = {
        "index",
        "pythonExternalTextTransformGcsPath",
        "pythonExternalTextTransformFunctionName",
      }, // Template just ignores what is sent as "index"
      flexContainerName = "pubsub-to-elasticsearch",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The source Pub/Sub subscription must exist and the messages must be encoded in a valid JSON format.",
        "A publicly reachable Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above. See <a href=\"https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/docs/PubSubToElasticsearch/README.md#google-cloud-integration-for-elastic\">Google Cloud Integration for Elastic</a> for more details.",
        "A Pub/Sub topic for error output.",
      },
      streaming = true,
      supportsAtLeastOnce = true),
  @Template(
      name = "PubSub_to_Elasticsearch_Xlang",
      category = TemplateCategory.STREAMING,
      displayName = "Pub/Sub to Elasticsearch With Python UDFs",
      type = Template.TemplateType.XLANG,
      description = {
        "The Pub/Sub to Elasticsearch template is a streaming pipeline that reads messages from a Pub/Sub subscription, executes a Python user-defined function (UDF), and writes them to Elasticsearch as documents. "
            + "The Dataflow template uses Elasticsearch's <a href=\"https://www.elastic.co/guide/en/elasticsearch/reference/master/data-streams.html\">data streams</a> feature to store time series data across multiple indices while giving you a single named resource for requests. "
            + "Data streams are well-suited for logs, metrics, traces, and other continuously generated data stored in Pub/Sub.\n",
        "The template creates a datastream named <code>logs-gcp.DATASET-NAMESPACE</code>, where:\n"
            + "- <code>DATASET</code> is the value of the <code>dataset</code> template parameter, or <code>pubsub</code> if not specified.\n"
            + "- <code>NAMESPACE</code> is the value of the <code>namespace</code> template parameter, or <code>default</code> if not specified."
      },
      optionsClass = PubSubToElasticsearchOptions.class,
      skipOptions = {
        "index",
        "javascriptTextTransformGcsPath",
        "javascriptTextTransformFunctionName",
        "javascriptTextTransformReloadIntervalMinutes"
      }, // Template just ignores what is sent as "index" and javascript udf as this is for python
      // udf only.
      flexContainerName = "pubsub-to-elasticsearch-xlang",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The source Pub/Sub subscription must exist and the messages must be encoded in a valid JSON format.",
        "A publicly reachable Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above. See <a href=\"https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/docs/PubSubToElasticsearch/README.md#google-cloud-integration-for-elastic\">Google Cloud Integration for Elastic</a> for more details.",
        "A Pub/Sub topic for error output.",
      },
      streaming = true,
      supportsAtLeastOnce = true)
})
public class PubSubToElasticsearch {

  /** The tag for the main output of the json transformation. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for the error output table of the json to table row transform. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_ERROR_OUTPUT_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** Pubsub message/string coder for pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String Coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** The log to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToElasticsearch.class);

  /**
   * Main entry point for executing the pipeline.
   *
   * @param args The command-line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    // Parse the user options passed from the command-line.
    PubSubToElasticsearchOptions pubSubToElasticsearchOptions =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(PubSubToElasticsearchOptions.class);

    pubSubToElasticsearchOptions.setIndex(
        new ElasticsearchIndex(
                pubSubToElasticsearchOptions.getDataset(),
                pubSubToElasticsearchOptions.getNamespace())
            .getIndex());

    validateOptions(pubSubToElasticsearchOptions);
    run(pubSubToElasticsearchOptions);
  }

  public static void validateOptions(PubSubToElasticsearchOptions options) {
    switch (options.getApiKeySource()) {
      case "PLAINTEXT":
        return;
      case "KMS":
        // validate that the encryption key is provided.
        if (StringUtils.isEmpty(options.getApiKeyKMSEncryptionKey())) {
          throw new IllegalArgumentException(
              "If apiKeySource is set to KMS, apiKeyKMSEncryptionKey should be provided.");
        }
        return;
      case "SECRET_MANAGER":
        // validate that secretId is provided.
        if (StringUtils.isEmpty(options.getApiKeySecretId())) {
          throw new IllegalArgumentException(
              "If apiKeySource is set to SECRET_MANAGER, apiKeySecretId should be provided.");
        }
    }
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  public static PipelineResult run(PubSubToElasticsearchOptions options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coders for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();

    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);

    /*
     * Steps: 1) Read PubSubMessage with attributes from input PubSub subscription.
     *        2) Apply Javascript UDF if provided.
     *        3) Index Json string to output ES index.
     *
     */
    LOG.info("Reading from subscription: " + options.getInputSubscription());

    PCollectionTuple convertedPubsubMessages =
        pipeline
            /*
             * Step #1: Read from a PubSub subscription.
             */
            .apply(
                "ReadPubSubSubscription",
                PubsubIO.readMessagesWithAttributes()
                    .fromSubscription(options.getInputSubscription()))
            /*
             * Step #2: Transform the PubsubMessages into Json documents.
             */
            .apply(
                "ConvertMessageToJsonDocument",
                PubSubMessageToJsonDocument.newBuilder()
                    .setJavascriptTextTransformFunctionName(
                        options.getJavascriptTextTransformFunctionName())
                    .setJavascriptTextTransformGcsPath(options.getJavascriptTextTransformGcsPath())
                    .setPythonExternalTextTransformGcsPath(
                        options.getPythonExternalTextTransformGcsPath())
                    .setPythonExternalTextTransformFunctionName(
                        options.getPythonExternalTextTransformFunctionName())
                    .build());

    /*
     * Step #3a: Write Json documents into Elasticsearch using {@link ElasticsearchTransforms.WriteToElasticsearch}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply("Insert metadata", new ProcessEventMetadata())
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setUserAgent("dataflow-pubsub-to-elasticsearch-template/v2")
                .setOptions(options.as(PubSubToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to error output PubSub topic via {@link PubSubIO}.
     */
    convertedPubsubMessages
        .get(TRANSFORM_ERROR_OUTPUT_OUT)
        .apply(ParDo.of(new FailedPubsubMessageToPubsubTopicFn()))
        .apply("writeFailureMessages", PubsubIO.writeMessages().to(options.getErrorOutputTopic()));

    // Execute the pipeline and return the result.
    return pipeline.run();
  }
}

后续步骤

了解 Dataflow 模板。
参阅 Google 提供的模板列表。

Pub/Sub to Elasticsearch 模板 使用集合让一切井井有条 根据您的偏好保存内容并对其进行分类。

流水线要求

模板参数

必需参数

可选参数

用户定义的函数

文本转换函数

索引函数

文档 ID 函数

文档删除函数

映射类型函数

运行模板

控制台

gcloud

API

模板源代码

Java

后续步骤

Pub/Sub to Elasticsearch 模板