/*
* Copyright (C) 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.PubSubToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.FailedPubsubMessageToPubsubTopicFn;
import com.google.cloud.teleport.v2.elasticsearch.transforms.ProcessEventMetadata;
import com.google.cloud.teleport.v2.elasticsearch.transforms.PubSubMessageToJsonDocument;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIndex;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link PubSubToElasticsearch} pipeline is a streaming pipeline which ingests data in JSON
 * format from Pub/Sub, applies a JavaScript UDF if provided, and writes the resulting documents to
 * Elasticsearch. Elements that fail processing are written to an error output Pub/Sub topic.
*
* <p>Check out <a
* href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/README_PubSub_to_Elasticsearch.md">README</a>
* for instructions on how to use or modify this template.
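 *
 * <p>A minimal launch sketch (illustrative only): the parameter names below mirror the getters on
 * {@link PubSubToElasticsearchOptions} used in this class, and the Elasticsearch connection
 * settings required by {@link WriteToElasticsearch} are omitted and must be supplied for a real
 * run.
 *
 * <pre>{@code
 * PubSubToElasticsearch.main(
 *     new String[] {
 *       "--inputSubscription=projects/my-project/subscriptions/input-sub",
 *       "--errorOutputTopic=projects/my-project/topics/error-topic",
 *       "--dataset=pubsub",
 *       "--namespace=default"
 *     });
 * }</pre>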
*/
@Template(
name = "PubSub_to_Elasticsearch",
category = TemplateCategory.STREAMING,
displayName = "Pub/Sub to Elasticsearch",
description = {
"The Pub/Sub to Elasticsearch template is a streaming pipeline that reads messages from a Pub/Sub subscription, executes a user-defined function (UDF), and writes them to Elasticsearch as documents. "
+ "The Dataflow template uses Elasticsearch's <a href=\"https://www.elastic.co/guide/en/elasticsearch/reference/master/data-streams.html\">data streams</a> feature to store time series data across multiple indices while giving you a single named resource for requests. "
+ "Data streams are well-suited for logs, metrics, traces, and other continuously generated data stored in Pub/Sub.\n",
"The template creates a datastream named <code>logs-gcp.DATASET-NAMESPACE</code>, where:\n"
+ "- <code>DATASET</code> is the value of the <code>dataset</code> template parameter, or <code>pubsub</code> if not specified.\n"
+ "- <code>NAMESPACE</code> is the value of the <code>namespace</code> template parameter, or <code>default</code> if not specified."
},
optionsClass = PubSubToElasticsearchOptions.class,
skipOptions = "index", // Template just ignores what is sent as "index"
flexContainerName = "pubsub-to-elasticsearch",
documentation =
"https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-elasticsearch",
contactInformation = "https://cloud.google.com/support",
preview = true,
requirements = {
"The source Pub/Sub subscription must exist and the messages must be encoded in a valid JSON format.",
"A publicly reachable Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above. See <a href=\"https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/docs/PubSubToElasticsearch/README.md#google-cloud-integration-for-elastic\">Google Cloud Integration for Elastic</a> for more details.",
"A Pub/Sub topic for error output.",
},
streaming = true,
supportsAtLeastOnce = true)
public class PubSubToElasticsearch {

  /** The tag for the main output of the JSON transformation. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** The tag for elements that failed the JSON transformation. */
  public static final TupleTag<FailsafeElement<PubsubMessage, String>> TRANSFORM_ERROR_OUTPUT_OUT =
      new TupleTag<FailsafeElement<PubsubMessage, String>>() {};

  /** PubsubMessage/String coder for the pipeline. */
  public static final FailsafeElementCoder<PubsubMessage, String> CODER =
      FailsafeElementCoder.of(PubsubMessageWithAttributesCoder.of(), StringUtf8Coder.of());

  /** String/String coder for FailsafeElement. */
  public static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());

  /** The logger to output status messages to. */
  private static final Logger LOG = LoggerFactory.getLogger(PubSubToElasticsearch.class);

/**
* Main entry point for executing the pipeline.
*
* @param args The command-line arguments to the pipeline.
*/
public static void main(String[] args) {
UncaughtExceptionLogger.register();
// Parse the user options passed from the command-line.
PubSubToElasticsearchOptions pubSubToElasticsearchOptions =
PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(PubSubToElasticsearchOptions.class);
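
    // Derive the write index from the dataset and namespace options. Per the template description
    // above, the defaults ("pubsub" and "default") produce the data stream "logs-gcp.pubsub-default".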
pubSubToElasticsearchOptions.setIndex(
new ElasticsearchIndex(
pubSubToElasticsearchOptions.getDataset(),
pubSubToElasticsearchOptions.getNamespace())
.getIndex());
validateOptions(pubSubToElasticsearchOptions);
run(pubSubToElasticsearchOptions);
}
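
  /** Validates that the options required by the selected {@code apiKeySource} are provided. */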
public static void validateOptions(PubSubToElasticsearchOptions options) {
switch (options.getApiKeySource()) {
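      // Only KMS and SECRET_MANAGER need companion options; PLAINTEXT (and any other value)
      // requires no further validation here.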
case "PLAINTEXT":
return;
case "KMS":
// validate that the encryption key is provided.
if (StringUtils.isEmpty(options.getApiKeyKMSEncryptionKey())) {
throw new IllegalArgumentException(
"If apiKeySource is set to KMS, apiKeyKMSEncryptionKey should be provided.");
}
return;
case "SECRET_MANAGER":
// validate that secretId is provided.
if (StringUtils.isEmpty(options.getApiKeySecretId())) {
throw new IllegalArgumentException(
"If apiKeySource is set to SECRET_MANAGER, apiKeySecretId should be provided.");
}
}
  }

/**
* Runs the pipeline with the supplied options.
*
* @param options The execution parameters to the pipeline.
* @return The result of the pipeline execution.
*/
public static PipelineResult run(PubSubToElasticsearchOptions options) {
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);
    // Register the coders for the pipeline.
CoderRegistry coderRegistry = pipeline.getCoderRegistry();
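    // FailsafeElement pairs each original input with its (possibly transformed) payload; register
    // its coders up front so Beam can encode these elements in the transforms below.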
coderRegistry.registerCoderForType(
FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);
coderRegistry.registerCoderForType(CODER.getEncodedTypeDescriptor(), CODER);
    /*
     * Steps:
     *  1) Read PubsubMessages with attributes from the input Pub/Sub subscription.
     *  2) Apply the JavaScript UDF if provided and convert each message to a JSON document.
     *  3a) Write the JSON documents to the output Elasticsearch index.
     *  3b) Write elements that failed processing to the error output Pub/Sub topic.
     */
LOG.info("Reading from subscription: " + options.getInputSubscription());
PCollectionTuple convertedPubsubMessages =
pipeline
/*
* Step #1: Read from a PubSub subscription.
*/
.apply(
"ReadPubSubSubscription",
PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getInputSubscription()))
/*
* Step #2: Transform the PubsubMessages into Json documents.
*/
.apply(
"ConvertMessageToJsonDocument",
PubSubMessageToJsonDocument.newBuilder()
.setJavascriptTextTransformFunctionName(
options.getJavascriptTextTransformFunctionName())
.setJavascriptTextTransformGcsPath(options.getJavascriptTextTransformGcsPath())
.build());
    /*
     * Step #3a: Write the JSON documents into Elasticsearch using {@link WriteToElasticsearch}.
     */
convertedPubsubMessages
.get(TRANSFORM_OUT)
.apply(
"GetJsonDocuments",
MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
.apply("Insert metadata", new ProcessEventMetadata())
.apply(
"WriteToElasticsearch",
WriteToElasticsearch.newBuilder()
.setUserAgent("dataflow-pubsub-to-elasticsearch-template/v2")
.setOptions(options.as(PubSubToElasticsearchOptions.class))
.build());
    /*
     * Step #3b: Write elements that failed processing to the error output Pub/Sub topic using
     * {@link PubsubIO}.
     */
convertedPubsubMessages
.get(TRANSFORM_ERROR_OUTPUT_OUT)
.apply(ParDo.of(new FailedPubsubMessageToPubsubTopicFn()))
.apply("writeFailureMessages", PubsubIO.writeMessages().to(options.getErrorOutputTopic()));
// Execute the pipeline and return the result.
return pipeline.run();
}
}