Cloud Storage Text to Firestore 模板是一种批处理流水线,可从存储在 Cloud Storage 中的 JSON 文档导入到 Firestore。
流水线要求
必须在目标项目中启用 Firestore。
每个输入文件必须包含以换行符分隔的 JSON,其中每行包含 Datastore Entity
数据类型的 JSON 表示法。
例如,以下 JSON 表示名为 Users
的集合中的文档。该示例已经过格式设置,便于阅读,但每个文档都必须显示为单行输入。
{
"key": {
"partitionId": {
"projectId": "my-project"
},
"path": [
{
"kind": "users",
"name": "alovelace"
}
]
},
"properties": {
"first": {
"stringValue": "Ada"
},
"last": {
"stringValue": "Lovelace"
},
"born": {
"integerValue": "1815",
"excludeFromIndexes": true
}
}
}
如需详细了解文档模型,请参阅实体、属性和键。
模板参数
参数 |
说明 |
textReadPattern |
指定文本数据文件位置的 Cloud Storage 路径模式。例如 gs://mybucket/somepath/*.json 。 |
javascriptTextTransformGcsPath |
(可选).js 文件的 Cloud Storage URI,用于定义您要使用的 JavaScript 用户定义的函数 (UDF)。例如 gs://my-bucket/my-udfs/my_file.js 。 |
javascriptTextTransformFunctionName |
(可选)
您要使用的 JavaScript 用户定义的函数 (UDF) 的名称。
例如,如果您的 JavaScript 函数代码为 myTransform(inJson) { /*...do stuff...*/ } ,则函数名称为 myTransform 。如需查看 JavaScript UDF 示例,请参阅 UDF 示例。
|
firestoreWriteProjectId |
要向其写入 Firestore 实体的位置的 Google Cloud 项目 ID |
firestoreHintNumWorkers |
(可选)Firestore 逐步增加限制步骤中的预期工作器数量的提示。默认值为 500 。 |
errorWritePath |
错误日志输出文件,用于写入在处理期间发生的故障。例如 gs://bucket-name/errors.txt 。 |
用户定义的函数
(可选)您可以通过编写用户定义的函数 (UDF) 来扩展此模板。该模板会为每个输入元素调用 UDF。元素载荷会序列化为 JSON 字符串。如需了解详情,请参阅为 Dataflow 模板创建用户定义的函数。
函数规范
UDF 具有以下规范:
- 输入:Cloud Storage 输入文件中的一行文本。
- 输出:一个
Entity
,序列化为 JSON 字符串。
运行模板
控制台
- 转到 Dataflow 基于模板创建作业页面。
转到“基于模板创建作业”- 在作业名称字段中,输入唯一的作业名称。
- 可选:对于区域性端点,从下拉菜单中选择一个值。默认区域为
us-central1
。
如需查看可以在其中运行 Dataflow 作业的区域列表,请参阅 Dataflow 位置。
- 从 Dataflow 模板下拉菜单中,选择 the Text Files on Cloud Storage to Firestore template。
- 在提供的参数字段中,输入您的参数值。
- 点击运行作业。
gcloud
在 shell 或终端中,运行模板:
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Text_to_Firestore \
--region REGION_NAME \
--parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
firestoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH
请替换以下内容:
API
如需使用 REST API 来运行模板,请发送 HTTP POST 请求。如需详细了解 API 及其授权范围,请参阅 projects.templates.launch
。
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Firestore
{
"jobName": "JOB_NAME",
"parameters": {
"textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
"javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
"javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
"firestoreWriteProjectId": "PROJECT_ID",
"errorWritePath": "ERROR_FILE_WRITE_PATH"
},
"environment": { "zone": "us-central1-f" }
}
请替换以下内容:
模板源代码
Java
/*
* Copyright (C) 2018 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.google.cloud.teleport.templates;
import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.templates.TextToDatastore.TextToDatastoreOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
import com.google.cloud.teleport.templates.common.FirestoreNestedValueProvider;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.templates.common.TextConverters.FilesystemReadOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.TupleTag;
/**
* Dataflow template which reads from a Text Source and writes JSON encoded Entities into Datastore.
* The JSON is expected to be in the format of: <a
* href="https://cloud.google.com/datastore/docs/reference/rest/v1/Entity">Entity</a>
*
* <p>Check out <a
* href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/README_GCS_Text_to_Datastore.md">README
* Datastore</a> or <a
* href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/README_GCS_Text_to_Firestore.md">README
* Firestore</a> for instructions on how to use or modify this template.
*/
@MultiTemplate({
@Template(
name = "GCS_Text_to_Datastore",
category = TemplateCategory.LEGACY,
displayName = "Text Files on Cloud Storage to Datastore [Deprecated]",
description =
"The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in "
+ "Cloud Storage and writes JSON encoded Entities to Datastore. "
+ "Each line in the input text files must be in the <a href=\"https://cloud.google.com/datastore/docs/reference/rest/v1/Entity\">specified JSON format</a>.",
optionsClass = TextToDatastoreOptions.class,
skipOptions = {
"firestoreWriteProjectId",
"firestoreWriteEntityKind",
"firestoreWriteNamespace",
"firestoreHintNumWorkers",
"javascriptTextTransformReloadIntervalMinutes",
// This template doesn't use neither firestoreWriteEntityKind/Namespace
// nor datastoreWriteEntityKind/Namespace pipeline options.
"datastoreWriteEntityKind",
"datastoreWriteNamespace"
},
documentation =
"https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-datastore",
contactInformation = "https://cloud.google.com/support",
preview = true,
requirements = {"Datastore must be enabled in the destination project."}),
@Template(
name = "GCS_Text_to_Firestore",
category = TemplateCategory.BATCH,
displayName = "Text Files on Cloud Storage to Firestore (Datastore mode)",
description =
"The Cloud Storage Text to Firestore template is a batch pipeline that reads from text files stored in "
+ "Cloud Storage and writes JSON encoded Entities to Firestore. "
+ "Each line in the input text files must be in the <a href=\"https://cloud.google.com/datastore/docs/reference/rest/v1/Entity\">specified JSON format</a>.",
optionsClass = TextToDatastoreOptions.class,
skipOptions = {
"datastoreWriteProjectId",
"datastoreWriteEntityKind",
"datastoreWriteNamespace",
"datastoreHintNumWorkers",
"javascriptTextTransformReloadIntervalMinutes",
// This template doesn't use neither firestoreWriteEntityKind/Namespace
// nor datastoreWriteEntityKind/Namespace pipeline options.
"firestoreWriteEntityKind",
"firestoreWriteNamespace"
},
documentation =
"https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-firestore",
contactInformation = "https://cloud.google.com/support",
requirements = {"Firestore must be enabled in the destination project."})
})
public class TextToDatastore {
public static <T> ValueProvider<T> selectProvidedInput(
ValueProvider<T> datastoreInput, ValueProvider<T> firestoreInput) {
return new FirestoreNestedValueProvider(datastoreInput, firestoreInput);
}
/** TextToDatastore Pipeline Options. */
public interface TextToDatastoreOptions
extends PipelineOptions,
FilesystemReadOptions,
JavascriptTextTransformerOptions,
DatastoreWriteOptions,
ErrorWriteOptions {}
/**
* Runs a pipeline which reads from a Text Source, passes the Text to a Javascript UDF, writes the
* JSON encoded Entities to a TextIO sink.
*
* <p>If your Text Source does not contain JSON encoded Entities, then you'll need to supply a
* Javascript UDF which transforms your data to be JSON encoded Entities.
*
* @param args arguments to the pipeline
*/
public static void main(String[] args) {
TextToDatastoreOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().as(TextToDatastoreOptions.class);
TupleTag<String> errorTag = new TupleTag<String>("errors") {};
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply(TextIO.read().from(options.getTextReadPattern()))
.apply(
TransformTextViaJavascript.newBuilder()
.setFileSystemPath(options.getJavascriptTextTransformGcsPath())
.setFunctionName(options.getJavascriptTextTransformFunctionName())
.build())
.apply(
WriteJsonEntities.newBuilder()
.setProjectId(
selectProvidedInput(
options.getDatastoreWriteProjectId(), options.getFirestoreWriteProjectId()))
.setHintNumWorkers(
selectProvidedInput(
options.getDatastoreHintNumWorkers(), options.getFirestoreHintNumWorkers()))
.setErrorTag(errorTag)
.build())
.apply(
LogErrors.newBuilder()
.setErrorWritePath(options.getErrorWritePath())
.setErrorTag(errorTag)
.build());
pipeline.run();
}
}