Create a generic recommendations data store

To create a data store and ingest data for generic recommendations, go to the section for the source that you plan to use:

Website URLs

Console

To use the Google Cloud console to create a data store and index data from websites, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. In the navigation menu, click Data Stores.

  3. Click New data store.

  4. On the Select a data source page, select Website Content.

  5. Choose whether to turn on Advanced website indexing for this data store. This option can't be turned off later.

    Advanced website indexing provides additional features such as search summarization, search with follow-ups, and extractive answers. Advanced website indexing costs extra and requires that you verify domain ownership for any website that you index. For more information, see Advanced website indexing pricing.

  6. In the Sites to include field, specify the URLs of the websites that you want to index. Enter one URL per line, without comma separators.

  7. Optional: In the Sites to exclude field, enter websites that you want to exclude from your app.

  8. Click Continue.

  9. Enter a name for your data store.

  10. Select a location for your data store. Advanced website indexing must be turned on to select a location.

  11. Click Create. Vertex AI Agent Builder creates your data store and displays it on the Data Stores page.

  12. To view information about your data store, click its name in the Name column. Your data store page is displayed.

    If you turned on Advanced website indexing, a warning appears prompting you to verify domain ownership. If you have a quota shortfall (that is, the number of pages in the websites that you specified exceeds the "Number of documents per project" quota for your project), an additional warning appears prompting you to upgrade your quota. The following steps describe how to verify domain ownership and upgrade your quota.

  13. To verify domain ownership, follow these steps:

    1. Click Verify in Google Search Console. The Welcome to Google Search Console page is displayed.
    2. Follow the on-screen instructions to verify a domain or a URL prefix, depending on whether you are verifying an entire domain or a URL prefix that is part of a domain. For more information, see Verify your site ownership in the Search Console Help.
    3. When you have finished the domain verification workflow, return to the Agent Builder page and click Data Stores in the navigation menu.
    4. Click the name of your data store in the Name column. Your data store page is displayed.
    5. Click Refresh status to update the values in the Status column. The Status column for your websites indicates that indexing is in progress.
    6. Repeat the domain verification steps for every website that requires it until all of your websites start indexing. When the Status column for a URL shows Indexed, advanced website indexing features are available for that URL or URL pattern.
  14. To upgrade your quota, follow these steps:

    1. Click Upgrade quota. The Discovery Engine API pane is displayed with the Quotas tab selected.
    2. Follow the instructions in the Request a higher quota limit section of the Google Cloud documentation. The quota to increase is Number of documents.
    3. After submitting your request for a higher quota limit, return to the Agent Builder page and click Data Stores in the navigation menu.
    4. Click the name of your data store in the Name column. The Status column indicates that indexing is in progress for the websites that exceeded the quota. When the Status column for a URL shows Indexed, advanced website indexing features are available for that URL or URL pattern.
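
If you prefer to script this setup rather than use the console, the data store itself can also be created with the Discovery Engine REST API, using the same dataStores.create call shown in the BigQuery and Cloud Storage sections below. The following is only a hedged sketch: the PUBLIC_WEBSITE content config and the siteSearchEngine/targetSites call for registering URL patterns are assumptions drawn from the API reference rather than steps from this guide, so verify the field names against the current reference before relying on them.

# Sketch only: create a website data store and register one URL pattern to index.
# PROJECT_ID, DATA_STORE_ID, DATA_STORE_DISPLAY_NAME, and the URI pattern are placeholders.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
-d '{
  "displayName": "DATA_STORE_DISPLAY_NAME",
  "industryVertical": "GENERIC",
  "solutionTypes": ["SOLUTION_TYPE_RECOMMENDATION"],
  "contentConfig": "PUBLIC_WEBSITE"
}'

# Assumed targetSites endpoint for adding a site to index (check the API reference):
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine/targetSites" \
-d '{
  "providedUriPattern": "www.example.com/*",
  "type": "INCLUDE"
}'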

Next steps

  • To attach your data store to an app, create an app and select your data store by following the steps in Create a generic recommendations app.

  • To preview how your recommendations appear after your app and data store are set up, see Get recommendations.

BigQuery

To ingest data from BigQuery, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Before importing your data, review Prepare data for ingesting.

Console

To use the Google Cloud console to ingest data from BigQuery, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Type page, select BigQuery.

  5. In the BigQuery path field, click Browse, select a table that you have prepared for ingesting, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.

  6. Select the type of data that you are importing.

  7. Click Continue.

  8. If you are doing a one-time import of structured data:

    1. Map the fields to key properties.

    2. If important fields are missing from the schema, use Add new field to add them.

      For more information, see About auto-detect and edit.

    3. Click Continue.

  9. Choose a region for your data store.

  10. Enter a name for your data store.

  11. Click Create.

  12. To confirm that your data store was created, go to the Data Stores page and click your data store name to see details about it on its Data page.

  13. To check the status of your ingestion, go to the Data Stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take from several minutes to several hours.

REST

To use the command line to create a data store and import data from BigQuery, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_RECOMMENDATION"]
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the recommendations data store that you want to create.
  2. Optional: If you are uploading structured data with your own schema, you can provide the schema. Providing a schema usually gives you better results. Otherwise, the schema is auto-detected. For more information, see Provide or auto-detect a schema.

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/schemas/default_schema" \
    -d '{
      "structSchema": JSON_SCHEMA_OBJECT
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store.
    • JSON_SCHEMA_OBJECT: your JSON schema, as a JSON object. For example:

      {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "keyPropertyMapping": "title"
          },
          "categories": {
            "type": "array",
            "items": {
              "type": "string",
              "keyPropertyMapping": "category"
            }
          },
          "uri": {
            "type": "string",
            "keyPropertyMapping": "uri"
          }
        }
      }
      
  3. Import data from BigQuery.

    If you defined a schema, make sure that the data conforms to it. This call returns a long-running operation; a sketch for checking its status from the command line follows the parameter list below.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
    -d '{
      "bigquerySource": {
        "projectId": "PROJECT_ID",
        "datasetId":"DATASET_ID",
        "tableId": "TABLE_ID",
        "dataSchema": "DATA_SCHEMA",
      },
      "reconciliationMode": "RECONCILIATION_MODE",
      "autoGenerateIds": "AUTO_GENERATE_IDS",
      "idField": "ID_FIELD",
      "errorConfig": {
        "gcsPrefix": "ERROR_DIRECTORY"
      }
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store.
    • DATASET_ID: the ID of the BigQuery dataset.
    • TABLE_ID: the ID of the BigQuery table.
      • If the BigQuery table is not under PROJECT_ID, you need to grant the service account service-<project number>@gcp-sa-discoveryengine.iam.gserviceaccount.com the "BigQuery Data Viewer" role on the BigQuery table. For example, if you are importing a BigQuery table from source project "123" into destination project "456", grant service-456@gcp-sa-discoveryengine.iam.gserviceaccount.com permissions on the BigQuery table under project "123".
    • DATA_SCHEMA: Optional. Values are document and custom. The default is document.
      • document: the BigQuery table that you use must conform to the default BigQuery schema provided in Prepare data for ingesting. You can define the ID of each document yourself, while wrapping all of the data in the jsonData string.
      • custom: any BigQuery table schema is accepted, and recommendations automatically generates an ID for each document that is imported.
    • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import, for example gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty so that recommendations automatically creates a temporary directory.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. The default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from BigQuery to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in BigQuery are removed from it. FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent across multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to keep document IDs consistent.

      Specify autoGenerateIds only when bigquerySource.dataSchema is set to custom. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise, the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs. For BigQuery source files, idField is the name of the column in the BigQuery table that contains the document IDs.

      Specify idField only when both of the following are true: (1) bigquerySource.dataSchema is set to custom, and (2) auto_generate_ids is set to false or unspecified. Otherwise an INVALID_ARGUMENT error is returned.

      The values in the BigQuery column must be of string type, must be between 1 and 63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.
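
    The documents:import call returns a long-running operation rather than an immediate result. As a rough sketch, assuming the standard long-running operations GET endpoint, you can poll that operation from the command line instead of watching the Activity tab in the console; OPERATION_NAME is the "name" field returned by the import request.

    # Sketch: poll the import operation. OPERATION_NAME is the full resource path
    # from the import response, ending in .../branches/0/operations/import-documents-<id>.
    curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://discoveryengine.googleapis.com/v1/OPERATION_NAME"
    # When the response contains "done": true, check "response" for success and
    # failure counts, and the errorConfig directory (if set) for per-document errors.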

C#

For more information, see the Vertex AI Agent Builder C# API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

using Google.Cloud.DiscoveryEngine.V1;
using Google.LongRunning;
using Google.Protobuf.WellKnownTypes;

public sealed partial class GeneratedDocumentServiceClientSnippets
{
    /// <summary>Snippet for ImportDocuments</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public void ImportDocumentsRequestObject()
    {
        // Create client
        DocumentServiceClient documentServiceClient = DocumentServiceClient.Create();
        // Initialize request argument(s)
        ImportDocumentsRequest request = new ImportDocumentsRequest
        {
            ParentAsBranchName = BranchName.FromProjectLocationDataStoreBranch("[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]"),
            InlineSource = new ImportDocumentsRequest.Types.InlineSource(),
            ErrorConfig = new ImportErrorConfig(),
            ReconciliationMode = ImportDocumentsRequest.Types.ReconciliationMode.Unspecified,
            UpdateMask = new FieldMask(),
            AutoGenerateIds = false,
            IdField = "",
        };
        // Make the request
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> response = documentServiceClient.ImportDocuments(request);

        // Poll until the returned long-running operation is complete
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completedResponse = response.PollUntilCompleted();
        // Retrieve the operation result
        ImportDocumentsResponse result = completedResponse.Result;

        // Or get the name of the operation
        string operationName = response.Name;
        // This name can be stored, then the long-running operation retrieved later by name
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> retrievedResponse = documentServiceClient.PollOnceImportDocuments(operationName);
        // Check if the retrieved long-running operation has completed
        if (retrievedResponse.IsCompleted)
        {
            // If it has completed, then access the result
            ImportDocumentsResponse retrievedResult = retrievedResponse.Result;
        }
    }
}

Go

For more information, see the Vertex AI Agent Builder Go API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.


package main

import (
	"context"

	discoveryengine "cloud.google.com/go/discoveryengine/apiv1"
	discoveryenginepb "cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb"
)

func main() {
	ctx := context.Background()
	// This snippet has been automatically generated and should be regarded as a code template only.
	// It will require modifications to work:
	// - It may require correct/in-range values for request initialization.
	// - It may require specifying regional endpoints when creating the service client as shown in:
	//   https://pkg.go.dev/cloud.google.com/go#hdr-Client_Options
	c, err := discoveryengine.NewDocumentClient(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	defer c.Close()

	req := &discoveryenginepb.ImportDocumentsRequest{
		// TODO: Fill request struct fields.
		// See https://pkg.go.dev/cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb#ImportDocumentsRequest.
	}
	op, err := c.ImportDocuments(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Use resp.
	_ = resp
}

Java

For more information, see the Vertex AI Agent Builder Java API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

import com.google.cloud.discoveryengine.v1.BranchName;
import com.google.cloud.discoveryengine.v1.DocumentServiceClient;
import com.google.cloud.discoveryengine.v1.ImportDocumentsRequest;
import com.google.cloud.discoveryengine.v1.ImportDocumentsResponse;
import com.google.cloud.discoveryengine.v1.ImportErrorConfig;
import com.google.protobuf.FieldMask;

public class SyncImportDocuments {

  public static void main(String[] args) throws Exception {
    syncImportDocuments();
  }

  public static void syncImportDocuments() throws Exception {
    // This snippet has been automatically generated and should be regarded as a code template only.
    // It will require modifications to work:
    // - It may require correct/in-range values for request initialization.
    // - It may require specifying regional endpoints when creating the service client as shown in
    // https://cloud.google.com/java/docs/setup#configure_endpoints_for_the_client_library
    try (DocumentServiceClient documentServiceClient = DocumentServiceClient.create()) {
      ImportDocumentsRequest request =
          ImportDocumentsRequest.newBuilder()
              .setParent(
                  BranchName.ofProjectLocationDataStoreBranchName(
                          "[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]")
                      .toString())
              .setErrorConfig(ImportErrorConfig.newBuilder().build())
              .setUpdateMask(FieldMask.newBuilder().build())
              .setAutoGenerateIds(true)
              .setIdField("idField1629396127")
              .build();
      ImportDocumentsResponse response = documentServiceClient.importDocumentsAsync(request).get();
    }
  }
}

Node.js

For more information, see the Vertex AI Agent Builder Node.js API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

/**
 * This snippet has been automatically generated and should be regarded as a code template only.
 * It will require modifications to work.
 * It may require correct/in-range values for request initialization.
 * TODO(developer): Uncomment these variables before running the sample.
 */
/**
 *  The Inline source for the input content for documents.
 */
// const inlineSource = {}
/**
 *  Cloud Storage location for the input content.
 */
// const gcsSource = {}
/**
 *  BigQuery input source.
 */
// const bigquerySource = {}
/**
 *  FhirStore input source.
 */
// const fhirStoreSource = {}
/**
 *  Spanner input source.
 */
// const spannerSource = {}
/**
 *  Cloud SQL input source.
 */
// const cloudSqlSource = {}
/**
 *  Firestore input source.
 */
// const firestoreSource = {}
/**
 *  AlloyDB input source.
 */
// const alloyDbSource = {}
/**
 *  Cloud Bigtable input source.
 */
// const bigtableSource = {}
/**
 *  Required. The parent branch resource name, such as
 *  `projects/{project}/locations/{location}/collections/{collection}/dataStores/{data_store}/branches/{branch}`.
 *  Requires create/update permission.
 */
// const parent = 'abc123'
/**
 *  The desired location of errors incurred during the Import.
 */
// const errorConfig = {}
/**
 *  The mode of reconciliation between existing documents and the documents to
 *  be imported. Defaults to
 *  ReconciliationMode.INCREMENTAL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL.
 */
// const reconciliationMode = {}
/**
 *  Indicates which fields in the provided imported documents to update. If
 *  not set, the default is to update all fields.
 */
// const updateMask = {}
/**
 *  Whether to automatically generate IDs for the documents if absent.
 *  If set to `true`,
 *  Document.id google.cloud.discoveryengine.v1.Document.id s are
 *  automatically generated based on the hash of the payload, where IDs may not
 *  be consistent during multiple imports. In which case
 *  ReconciliationMode.FULL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.FULL 
 *  is highly recommended to avoid duplicate contents. If unset or set to
 *  `false`, Document.id google.cloud.discoveryengine.v1.Document.id s have
 *  to be specified using
 *  id_field google.cloud.discoveryengine.v1.ImportDocumentsRequest.id_field,
 *  otherwise, documents without IDs fail to be imported.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const autoGenerateIds = true
/**
 *  The field indicates the ID field or column to be used as unique IDs of
 *  the documents.
 *  For GcsSource google.cloud.discoveryengine.v1.GcsSource  it is the key of
 *  the JSON field. For instance, `my_id` for JSON `{"my_id": "some_uuid"}`.
 *  For others, it may be the column name of the table where the unique ids are
 *  stored.
 *  The values of the JSON field or the table column are used as the
 *  Document.id google.cloud.discoveryengine.v1.Document.id s. The JSON field
 *  or the table column must be of string type, and the values must be set as
 *  valid strings conform to RFC-1034 (https://tools.ietf.org/html/rfc1034)
 *  with 1-63 characters. Otherwise, documents without valid IDs fail to be
 *  imported.
 *  Only set this field when
 *  auto_generate_ids google.cloud.discoveryengine.v1.ImportDocumentsRequest.auto_generate_ids 
 *  is unset or set as `false`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  If it is unset, a default value `_id` is used when importing from the
 *  allowed data sources.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const idField = 'abc123'

// Imports the Discoveryengine library
const {DocumentServiceClient} = require('@google-cloud/discoveryengine').v1;

// Instantiates a client
const discoveryengineClient = new DocumentServiceClient();

async function callImportDocuments() {
  // Construct request
  const request = {
    parent,
  };

  // Run request
  const [operation] = await discoveryengineClient.importDocuments(request);
  const [response] = await operation.promise();
  console.log(response);
}

callImportDocuments();

Python

For more information, see the Vertex AI Agent Builder Python API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.



def import_documents_bigquery_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    bigquery_dataset: str,
    bigquery_table: str,
) -> str:

    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"
    # bigquery_dataset = "YOUR_BIGQUERY_DATASET"
    # bigquery_table = "YOUR_BIGQUERY_TABLE"

    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        bigquery_source=discoveryengine.BigQuerySource(
            project_id=project_id,
            dataset_id=bigquery_dataset,
            table_id=bigquery_table,
            data_schema="custom",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name


def import_documents_gcs_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
) -> str:
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"

    # Examples:
    # - Unstructured documents
    #   - `gs://bucket/directory/file.pdf`
    #   - `gs://bucket/directory/*.pdf`
    # - Unstructured documents with JSONL Metadata
    #   - `gs://bucket/directory/file.json`
    # - Unstructured documents with CSV Metadata
    #   - `gs://bucket/directory/file.csv`
    # gcs_uri = "YOUR_GCS_PATH"

    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # Multiple URIs are supported
            input_uris=[gcs_uri],
            # Options:
            # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
            # - `custom` - Unstructured documents with custom JSONL metadata
            # - `document` - Structured documents in the discoveryengine.Document format.
            # - `csv` - Unstructured documents with CSV metadata
            data_schema="content",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

Ruby

For more information, see the Vertex AI Agent Builder Ruby API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

require "google/cloud/discovery_engine/v1"

##
# Snippet for the import_documents call in the DocumentService service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DiscoveryEngine::V1::DocumentService::Client#import_documents.
#
def import_documents
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DiscoveryEngine::V1::DocumentService::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DiscoveryEngine::V1::ImportDocumentsRequest.new

  # Call the import_documents method.
  result = client.import_documents request

  # The returned object is of type Gapic::Operation. You can use it to
  # check the status of an operation, cancel it, or wait for results.
  # Here is how to wait for a response.
  result.wait_until_done! timeout: 60
  if result.response?
    p result.response
  else
    puts "No response received."
  end
end

Next steps

  • To attach your data store to an app, create an app and select your data store by following the steps in Create a generic recommendations app.

  • To preview how your recommendations appear after your app and data store are set up, see Get recommendations.

Cloud Storage

To ingest data from Cloud Storage, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Before importing your data, review Prepare data for ingesting.

Console

To use the console to ingest data from a Cloud Storage bucket, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Type page, select Cloud Storage.

  5. In the Select a folder or file you want to import section, select Folder or File.

  6. Click Browse, choose the data that you have prepared for ingesting, and then click Select. Alternatively, enter the location directly in the gs:// field.

  7. Select the type of data that you are importing.

  8. Click Continue.

  9. If you are doing a one-time import of structured data:

    1. Map the fields to key properties.

    2. If important fields are missing from the schema, use Add new field to add them.

      For more information, see About auto-detect and edit.

    3. Click Continue.

  10. Choose a region for your data store.

  11. Enter a name for your data store.

  12. Click Create.

  13. To confirm that your data store was created, go to the Data Stores page and click your data store name to see details about it on its Data page.

  14. To check the status of your ingestion, go to the Data Stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take from several minutes to several hours.

REST

To use the command line to create a data store and ingest data from Cloud Storage, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_RECOMMENDATION"],
      "contentConfig": "CONTENT_REQUIRED"
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the recommendations data store that you want to create.
  2. Import data from Cloud Storage.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "gcsSource": {
          "inputUris": ["INPUT_FILE_PATTERN_1", "INPUT_FILE_PATTERN_2"],
          "dataSchema": "DATA_SCHEMA",
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
        "errorConfig": {
          "gcsPrefix": "ERROR_DIRECTORY"
        }
      }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store.
    • INPUT_FILE_PATTERN: a file pattern in Cloud Storage containing your documents.

      For structured data, or for unstructured data with metadata for unstructured documents, an example of the input file pattern is gs://<your-gcs-bucket>/directory/object.json, or a pattern matching one or more files, such as gs://<your-gcs-bucket>/directory/*.json.

      For unstructured documents, an example is gs://<your-gcs-bucket>/directory/*.pdf. Each file that matches the pattern becomes a document.

      If <your-gcs-bucket> is not under PROJECT_ID, you need to grant the service account service-<project number>@gcp-sa-discoveryengine.iam.gserviceaccount.com the "Storage Object Viewer" role on the Cloud Storage bucket. For example, if you are importing a Cloud Storage bucket from source project "123" into destination project "456", grant service-456@gcp-sa-discoveryengine.iam.gserviceaccount.com permissions on the Cloud Storage bucket under project "123" (a command-line sketch for this follows the list).

    • DATA_SCHEMA: Optional. Values are document, custom, csv, and content. The default is document.

      • document: upload unstructured data with metadata for unstructured documents. Each line of the file has to follow one of the following formats. You can define the ID of each document:

        • { "id": "<your-id>", "jsonData": "<JSON string>", "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
        • { "id": "<your-id>", "structData": <JSON object>, "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
      • custom: upload JSON for structured documents. The data is organized according to a schema. You can specify the schema; otherwise it is auto-detected. You can put the JSON string of the document in a consistent format directly in each line, and recommendations automatically generates an ID for each document that is imported.

      • content: upload unstructured documents (PDF, HTML, DOC, TXT, PPTX). The ID of each document is automatically generated as the first 128 bits of SHA256(GCS_URI), encoded as a hexadecimal string. You can specify multiple input file patterns as long as the matched files don't exceed the 100,000-file limit.

      • csv: include a header row in your CSV file, with each header mapped to a document field. Specify the path to the CSV file using the inputUris field.

    • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import, for example gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty so that recommendations automatically creates a temporary directory.

    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. The default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Cloud Storage to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Cloud Storage are removed from it. FULL mode is helpful if you want to automatically delete documents that you no longer need.

    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent across multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to keep document IDs consistent.

      Specify autoGenerateIds only when gcsSource.dataSchema is set to custom or csv. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise, the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs. For Cloud Storage source documents, idField specifies the name of the JSON field that is the document ID. For example, if {"my_id":"some_uuid"} is the document ID field in one of your documents, specify "idField":"my_id". This identifies all JSON fields with the name "my_id" as document IDs.

      Specify this field only when both of the following are true: (1) gcsSource.dataSchema is set to custom or csv, and (2) auto_generate_ids is set to false or unspecified. Otherwise an INVALID_ARGUMENT error is returned.

      Note that the values of the Cloud Storage JSON field must be of string type, must be between 1 and 63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.

      Note that the JSON field name specified by id_field must be of string type, must be between 1 and 63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.
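
    If the source bucket belongs to a different project, as noted for INPUT_FILE_PATTERN above, the Discovery Engine service account needs read access before the import can succeed. The command below is a hedged sketch of one way to grant that access; the bucket name and the project number in the service account are placeholders.

    # Sketch: grant the service agent of destination project 456 the Storage Object
    # Viewer role on a bucket in the source project (replace bucket and project number).
    gcloud storage buckets add-iam-policy-binding gs://your-gcs-bucket \
    --member="serviceAccount:service-456@gcp-sa-discoveryengine.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"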

C#

For more information, see the Vertex AI Agent Builder C# API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

using Google.Cloud.DiscoveryEngine.V1;
using Google.LongRunning;
using Google.Protobuf.WellKnownTypes;

public sealed partial class GeneratedDocumentServiceClientSnippets
{
    /// <summary>Snippet for ImportDocuments</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public void ImportDocumentsRequestObject()
    {
        // Create client
        DocumentServiceClient documentServiceClient = DocumentServiceClient.Create();
        // Initialize request argument(s)
        ImportDocumentsRequest request = new ImportDocumentsRequest
        {
            ParentAsBranchName = BranchName.FromProjectLocationDataStoreBranch("[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]"),
            InlineSource = new ImportDocumentsRequest.Types.InlineSource(),
            ErrorConfig = new ImportErrorConfig(),
            ReconciliationMode = ImportDocumentsRequest.Types.ReconciliationMode.Unspecified,
            UpdateMask = new FieldMask(),
            AutoGenerateIds = false,
            IdField = "",
        };
        // Make the request
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> response = documentServiceClient.ImportDocuments(request);

        // Poll until the returned long-running operation is complete
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completedResponse = response.PollUntilCompleted();
        // Retrieve the operation result
        ImportDocumentsResponse result = completedResponse.Result;

        // Or get the name of the operation
        string operationName = response.Name;
        // This name can be stored, then the long-running operation retrieved later by name
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> retrievedResponse = documentServiceClient.PollOnceImportDocuments(operationName);
        // Check if the retrieved long-running operation has completed
        if (retrievedResponse.IsCompleted)
        {
            // If it has completed, then access the result
            ImportDocumentsResponse retrievedResult = retrievedResponse.Result;
        }
    }
}

Go

For more information, see the Vertex AI Agent Builder Go API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.


package main

import (
	"context"

	discoveryengine "cloud.google.com/go/discoveryengine/apiv1"
	discoveryenginepb "cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb"
)

func main() {
	ctx := context.Background()
	// This snippet has been automatically generated and should be regarded as a code template only.
	// It will require modifications to work:
	// - It may require correct/in-range values for request initialization.
	// - It may require specifying regional endpoints when creating the service client as shown in:
	//   https://pkg.go.dev/cloud.google.com/go#hdr-Client_Options
	c, err := discoveryengine.NewDocumentClient(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	defer c.Close()

	req := &discoveryenginepb.ImportDocumentsRequest{
		// TODO: Fill request struct fields.
		// See https://pkg.go.dev/cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb#ImportDocumentsRequest.
	}
	op, err := c.ImportDocuments(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Use resp.
	_ = resp
}

Java

For more information, see the Vertex AI Agent Builder Java API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

import com.google.cloud.discoveryengine.v1.BranchName;
import com.google.cloud.discoveryengine.v1.DocumentServiceClient;
import com.google.cloud.discoveryengine.v1.ImportDocumentsRequest;
import com.google.cloud.discoveryengine.v1.ImportDocumentsResponse;
import com.google.cloud.discoveryengine.v1.ImportErrorConfig;
import com.google.protobuf.FieldMask;

public class SyncImportDocuments {

  public static void main(String[] args) throws Exception {
    syncImportDocuments();
  }

  public static void syncImportDocuments() throws Exception {
    // This snippet has been automatically generated and should be regarded as a code template only.
    // It will require modifications to work:
    // - It may require correct/in-range values for request initialization.
    // - It may require specifying regional endpoints when creating the service client as shown in
    // https://cloud.google.com/java/docs/setup#configure_endpoints_for_the_client_library
    try (DocumentServiceClient documentServiceClient = DocumentServiceClient.create()) {
      ImportDocumentsRequest request =
          ImportDocumentsRequest.newBuilder()
              .setParent(
                  BranchName.ofProjectLocationDataStoreBranchName(
                          "[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]")
                      .toString())
              .setErrorConfig(ImportErrorConfig.newBuilder().build())
              .setUpdateMask(FieldMask.newBuilder().build())
              .setAutoGenerateIds(true)
              .setIdField("idField1629396127")
              .build();
      ImportDocumentsResponse response = documentServiceClient.importDocumentsAsync(request).get();
    }
  }
}

Node.js

For more information, see the Vertex AI Agent Builder Node.js API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

/**
 * This snippet has been automatically generated and should be regarded as a code template only.
 * It will require modifications to work.
 * It may require correct/in-range values for request initialization.
 * TODO(developer): Uncomment these variables before running the sample.
 */
/**
 *  The Inline source for the input content for documents.
 */
// const inlineSource = {}
/**
 *  Cloud Storage location for the input content.
 */
// const gcsSource = {}
/**
 *  BigQuery input source.
 */
// const bigquerySource = {}
/**
 *  FhirStore input source.
 */
// const fhirStoreSource = {}
/**
 *  Spanner input source.
 */
// const spannerSource = {}
/**
 *  Cloud SQL input source.
 */
// const cloudSqlSource = {}
/**
 *  Firestore input source.
 */
// const firestoreSource = {}
/**
 *  AlloyDB input source.
 */
// const alloyDbSource = {}
/**
 *  Cloud Bigtable input source.
 */
// const bigtableSource = {}
/**
 *  Required. The parent branch resource name, such as
 *  `projects/{project}/locations/{location}/collections/{collection}/dataStores/{data_store}/branches/{branch}`.
 *  Requires create/update permission.
 */
// const parent = 'abc123'
/**
 *  The desired location of errors incurred during the Import.
 */
// const errorConfig = {}
/**
 *  The mode of reconciliation between existing documents and the documents to
 *  be imported. Defaults to
 *  ReconciliationMode.INCREMENTAL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL.
 */
// const reconciliationMode = {}
/**
 *  Indicates which fields in the provided imported documents to update. If
 *  not set, the default is to update all fields.
 */
// const updateMask = {}
/**
 *  Whether to automatically generate IDs for the documents if absent.
 *  If set to `true`,
 *  Document.id google.cloud.discoveryengine.v1.Document.id s are
 *  automatically generated based on the hash of the payload, where IDs may not
 *  be consistent during multiple imports. In which case
 *  ReconciliationMode.FULL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.FULL 
 *  is highly recommended to avoid duplicate contents. If unset or set to
 *  `false`, Document.id google.cloud.discoveryengine.v1.Document.id s have
 *  to be specified using
 *  id_field google.cloud.discoveryengine.v1.ImportDocumentsRequest.id_field,
 *  otherwise, documents without IDs fail to be imported.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const autoGenerateIds = true
/**
 *  The field indicates the ID field or column to be used as unique IDs of
 *  the documents.
 *  For GcsSource google.cloud.discoveryengine.v1.GcsSource  it is the key of
 *  the JSON field. For instance, `my_id` for JSON `{"my_id": "some_uuid"}`.
 *  For others, it may be the column name of the table where the unique ids are
 *  stored.
 *  The values of the JSON field or the table column are used as the
 *  Document.id google.cloud.discoveryengine.v1.Document.id s. The JSON field
 *  or the table column must be of string type, and the values must be set as
 *  valid strings conform to RFC-1034 (https://tools.ietf.org/html/rfc1034)
 *  with 1-63 characters. Otherwise, documents without valid IDs fail to be
 *  imported.
 *  Only set this field when
 *  auto_generate_ids google.cloud.discoveryengine.v1.ImportDocumentsRequest.auto_generate_ids 
 *  is unset or set as `false`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  If it is unset, a default value `_id` is used when importing from the
 *  allowed data sources.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const idField = 'abc123'

// Imports the Discoveryengine library
const {DocumentServiceClient} = require('@google-cloud/discoveryengine').v1;

// Instantiates a client
const discoveryengineClient = new DocumentServiceClient();

async function callImportDocuments() {
  // Construct request
  const request = {
    parent,
  };

  // Run request
  const [operation] = await discoveryengineClient.importDocuments(request);
  const [response] = await operation.promise();
  console.log(response);
}

callImportDocuments();

Python

For more information, see the Vertex AI Agent Builder Python API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.



def import_documents_bigquery_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    bigquery_dataset: str,
    bigquery_table: str,
) -> str:

    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"
    # bigquery_dataset = "YOUR_BIGQUERY_DATASET"
    # bigquery_table = "YOUR_BIGQUERY_TABLE"

    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        bigquery_source=discoveryengine.BigQuerySource(
            project_id=project_id,
            dataset_id=bigquery_dataset,
            table_id=bigquery_table,
            data_schema="custom",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name


def import_documents_gcs_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
) -> str:
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = "YOUR_PROJECT_ID"
    # location = "YOUR_LOCATION" # Values: "global"
    # data_store_id = "YOUR_DATA_STORE_ID"

    # Examples:
    # - Unstructured documents
    #   - `gs://bucket/directory/file.pdf`
    #   - `gs://bucket/directory/*.pdf`
    # - Unstructured documents with JSONL Metadata
    #   - `gs://bucket/directory/file.json`
    # - Unstructured documents with CSV Metadata
    #   - `gs://bucket/directory/file.csv`
    # gcs_uri = "YOUR_GCS_PATH"

    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # Multiple URIs are supported
            input_uris=[gcs_uri],
            # Options:
            # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
            # - `custom` - Unstructured documents with custom JSONL metadata
            # - `document` - Structured documents in the discoveryengine.Document format.
            # - `csv` - Unstructured documents with CSV metadata
            data_schema="content",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

Ruby

For more information, see the Vertex AI Agent Builder Ruby API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

This example ingests unstructured data from BigQuery or Cloud Storage into an existing data store.

require "google/cloud/discovery_engine/v1"

##
# Snippet for the import_documents call in the DocumentService service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DiscoveryEngine::V1::DocumentService::Client#import_documents.
#
def import_documents
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DiscoveryEngine::V1::DocumentService::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DiscoveryEngine::V1::ImportDocumentsRequest.new

  # Call the import_documents method.
  result = client.import_documents request

  # The returned object is of type Gapic::Operation. You can use it to
  # check the status of an operation, cancel it, or wait for results.
  # Here is how to wait for a response.
  result.wait_until_done! timeout: 60
  if result.response?
    p result.response
  else
    puts "No response received."
  end
end

Next steps

  • To attach your data store to an app, create an app and select your data store by following the steps in Create a generic recommendations app.

  • To preview how your recommendations appear after your app and data store are set up, see Get recommendations.

Upload structured JSON data with the API

To directly upload a JSON document or object using the API, follow these steps.

Before importing your data, prepare your data for ingesting.

REST

To use the command line to create a data store and import structured JSON data, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_RECOMMENDATION"]
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the recommendations data store that you want to create.
  2. Optional: Provide your own schema. Providing a schema usually gives you better results. For more information, see Provide or auto-detect a schema.

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/schemas/default_schema" \
    -d '{
      "structSchema": JSON_SCHEMA_OBJECT
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the recommendations data store.
    • JSON_SCHEMA_OBJECT: your JSON schema, as a JSON object. For example:

      {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "keyPropertyMapping": "title"
          },
          "categories": {
            "type": "array",
            "items": {
              "type": "string",
              "keyPropertyMapping": "category"
            }
          },
          "uri": {
            "type": "string",
            "keyPropertyMapping": "uri"
          }
        }
      }
      
  3. Import structured data that conforms to the defined schema.

    There are a few approaches that you can use to upload data, including the following. A sketch for reading a document back after upload follows this list.

    • Upload a JSON document.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents?documentId=DOCUMENT_ID" \
      -d '{
        "jsonData": "JSON_DOCUMENT_STRING"
      }'
      

      Replace JSON_DOCUMENT_STRING with the JSON document as a single string. It must conform to the JSON schema that you provided in the previous step, for example:

      ```none
      { \"title\": \"test title\", \"categories\": [\"cat_1\", \"cat_2\"], \"uri\": \"test uri\"}
      ```
      
    • Upload a JSON object.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents?documentId=DOCUMENT_ID" \
      -d '{
        "structData": JSON_DOCUMENT_OBJECT
      }'
      

      Replace JSON_DOCUMENT_OBJECT with the JSON document as a JSON object. It must conform to the JSON schema that you provided in the previous step, for example:

      ```json
      {
        "title": "test title",
        "categories": [
          "cat_1",
          "cat_2"
        ],
        "uri": "test uri"
      }
      ```
      
    • Update with a JSON document.

      curl -X PATCH \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents/DOCUMENT_ID" \
      -d '{
        "jsonData": "JSON_DOCUMENT_STRING"
      }'
      
    • Update with a JSON object.

      curl -X PATCH \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents/DOCUMENT_ID" \
      -d '{
        "structData": JSON_DOCUMENT_OBJECT
      }'
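
    After a create or patch call succeeds, you can read the document back to confirm what was stored. The request below is a sketch that assumes the corresponding documents GET endpoint on the same branch; PROJECT_ID, DATA_STORE_ID, and DOCUMENT_ID are the values used above.

    # Sketch: fetch the document that you just created or updated.
    curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents/DOCUMENT_ID"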
      

Next steps

  • To attach your data store to an app, create an app and select your data store by following the steps in Create a generic recommendations app.

  • To preview how your recommendations appear after your app and data store are set up, see Get recommendations.

Create a data store using Terraform

You can use Terraform to create an empty data store. After the empty data store is created, you can ingest data into it using the Google Cloud console or API commands.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

To create an empty data store using Terraform, see google_discovery_engine_data_store.
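
As a minimal sketch, assuming you have already written a google_discovery_engine_data_store block in a local Terraform configuration, the standard Terraform command sequence creates and later removes the empty data store:

# Run these commands in the directory that contains your Terraform configuration.
terraform init      # download the Google provider
terraform plan      # preview the empty data store that will be created
terraform apply     # create the empty data store
terraform destroy   # remove the data store when it is no longer needed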