Create a search data store

To create a data store and ingest data for search, go to the section for the source you plan to use:

To sync data from a third-party data source instead, see Connect a third-party data source.

Import from website URLs

Use the following procedure to make a data store and index websites.

To use a website data store after creating it, you must attach it to an app that has Enterprise features turned on. You can turn on Enterprise Edition for an app when you create it. This incurs additional costs. See Create a search app and About advanced features.

Console

To use the Google Cloud console to make a data store and index websites, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. In the navigation menu, click Data stores.

  3. Click New data store.

  4. In the Create data store pane, select Website URLs.

  5. Choose whether to turn on Advanced website indexing for this data store. This option can't be turned on or off later.

    Advanced website indexing provides additional features such as search summarization, search with follow-ups, and extractive answers. Advanced website indexing incurs additional cost, and requires that you verify domain ownership for any website that you index. For more information, see Advanced website indexing and Pricing.

  6. In the Sites to include field, enter the URLs of the websites that you want to include in your data store. Include one URL per line, without comma separators.

  7. Optional: In the Sites to exclude field, enter websites that you want to exclude from your data store.

  8. Click Continue.

  9. Enter a name for your data store.

  10. Select a location for your data store. Advanced website indexing must be turned on to select a location.

  11. Click Create. Vertex AI Search creates your data store and displays your data stores on the Data stores page.

  12. To view information about your data store, click the name of your data store in the Name column. Your data store page appears.

    If you turned on Advanced website indexing, a warning appears prompting you to verify your domain ownership. If you have a quota shortfall (the number of pages in the websites that you specified exceeds the "Number of documents per project" quota for your project), an additional warning appears prompting you to upgrade your quota. The following steps show you how to verify domain ownership and upgrade your quota.

  13. To verify your domain ownership, follow these steps:

    1. Click Verify in Google Search console. The Welcome to the Google Search Console page appears.
    2. Follow the on-screen instructions to verify a domain or a URL prefix, depending on whether you are verifying an entire domain or a URL prefix that is part of a domain. For more information, see Verify your site ownership in the Search Console Help.
    3. When you have finished the domain verification workflow, go back to the Agent Builder page and click Data stores in the navigation menu.
    4. Click the name of your data store in the Name column. Your data store page appears.
    5. Click Refresh status to update the values in the Status column. The Status column for your website indicates that indexing is in progress.
    6. Repeat the domain verification steps for every website that requires domain verification until all of them begin indexing. When the Status column for a URL shows Indexed, advanced website indexing features are available for that URL or URL pattern.
  14. To upgrade your quota, follow these steps:

    1. Click Upgrade quota. The IAM and Admin page of the Google Cloud console appears.
    2. Follow the instructions at Request a higher quota limit in the Google Cloud documentation. The quota to increase is Number of documents in the Discovery Engine API service.
    3. After submitting your request for a higher quota limit, go back to the Agent Builder page and click Data stores in the navigation menu.
    4. Click the name of your data store in the Name column. The Status column indicates that indexing is in progress for the websites that had surpassed the quota. When the Status column for a URL shows Indexed, advanced website indexing features are available for that URL or URL pattern.

    For more information, see Quota for webpage indexing in the "Quotas and limits" page.

Next steps

  • To attach your website data store to an app, create an app with Enterprise features enabled and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Import from BigQuery

To ingest data from BigQuery, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Before importing your data, review Prepare data for ingesting.

Console

To use the Google Cloud console to ingest data from BigQuery, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select BigQuery.

  5. In the BigQuery path field, click Browse, select a table that you have prepared for ingesting, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.

  6. Select what kind of data you are importing.

  7. Click Continue.

  8. Choose a region for your data store.

  9. Enter a name for your data store.

  10. Click Create.

  11. To check the status of your ingestion, go to the Data stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes to several hours.

REST

To use the command line to create a data store and import data from BigQuery, follow these steps.

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"]
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the Vertex AI Search data store that you want to create.

    Optional: If you're uploading unstructured data and want to configure document parsing or to turn on document chunking for RAG, specify the documentProcessingConfig object and include it in your data store creation request. Configuring an OCR parser for PDFs is recommended if you're ingesting scanned PDFs. For how to configure parsing or chunking options, see Parse and chunk documents.

  2. Optional: If you're uploading structured data with your own schema, you can provide the schema. When you provide the schema, you typically get better results. Otherwise, the schema is auto-detected. For more information, see Schemas: auto-detecting versus providing your own.

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/schemas/default_schema" \
    -d '{
      "structSchema": JSON_SCHEMA_OBJECT
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store.
    • JSON_SCHEMA_OBJECT: your JSON schema as a JSON object—for example:

      {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "keyPropertyMapping": "title"
          },
          "categories": {
            "type": "array",
            "items": {
              "type": "string",
              "keyPropertyMapping": "category"
            }
          },
          "uri": {
            "type": "string",
            "keyPropertyMapping": "uri"
          }
        }
      }
      
  3. Import data from BigQuery.

    If you defined a schema, make sure the data conforms to that schema.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
    -d '{
      "bigquerySource": {
        "projectId": "PROJECT_ID",
        "datasetId":"DATASET_ID",
        "tableId": "TABLE_ID",
        "dataSchema": "DATA_SCHEMA",
        "aclEnabled": "BOOLEAN"
      },
      "reconciliationMode": "RECONCILIATION_MODE",
      "autoGenerateIds": "AUTO_GENERATE_IDS",
      "idField": "ID_FIELD",
      "errorConfig": {
        "gcsPrefix": "ERROR_DIRECTORY"
      }
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store.
    • DATASET_ID: the ID of the BigQuery dataset.
    • TABLE_ID: the ID of the BigQuery table.
      • If the BigQuery table is not under PROJECT_ID, you need to give the service account service-<project number>@gcp-sa-discoveryengine.iam.gserviceaccount.com "BigQuery Data Viewer" permission for the BigQuery table. For example, if you are importing a BigQuery table from source project "123" to destination project "456", give service-456@gcp-sa-discoveryengine.iam.gserviceaccount.com permissions for the BigQuery table under project "123".
    • DATA_SCHEMA: Optional. Values are document and custom. The default is document.
      • document: the BigQuery table that you use must conform to the default BigQuery schema provided in Prepare data for ingesting. You can define the ID of each document yourself, while wrapping all the data in the jsonData string.
      • custom: Any BigQuery table schema is accepted, and Vertex AI Search automatically generates the IDs for each document that is imported.
    • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import—for example, gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty to let Vertex AI Search automatically create a temporary directory.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from BigQuery to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in BigQuery are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.

      Specify autoGenerateIds only when bigquerySource.dataSchema is set to custom. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs. For BigQuery source files, idField indicates the name of the column in the BigQuery table that contains the document IDs.

      Specify idField only when: (1) bigquerySource.dataSchema is set to custom, and (2) auto_generate_ids is set to false or is unspecified. Otherwise an INVALID_ARGUMENT error is returned.

      The value of the BigQuery column name must be of string type, must be between 1 and 63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.

C#

For more information, see the Vertex AI Agent Builder C# API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

using Google.Cloud.DiscoveryEngine.V1;
using Google.LongRunning;
using Google.Protobuf.WellKnownTypes;

public sealed partial class GeneratedDocumentServiceClientSnippets
{
    /// <summary>Snippet for ImportDocuments</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public void ImportDocumentsRequestObject()
    {
        // Create client
        DocumentServiceClient documentServiceClient = DocumentServiceClient.Create();
        // Initialize request argument(s)
        ImportDocumentsRequest request = new ImportDocumentsRequest
        {
            ParentAsBranchName = BranchName.FromProjectLocationDataStoreBranch("[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]"),
            InlineSource = new ImportDocumentsRequest.Types.InlineSource(),
            ErrorConfig = new ImportErrorConfig(),
            ReconciliationMode = ImportDocumentsRequest.Types.ReconciliationMode.Unspecified,
            UpdateMask = new FieldMask(),
            AutoGenerateIds = false,
            IdField = "",
        };
        // Make the request
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> response = documentServiceClient.ImportDocuments(request);

        // Poll until the returned long-running operation is complete
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completedResponse = response.PollUntilCompleted();
        // Retrieve the operation result
        ImportDocumentsResponse result = completedResponse.Result;

        // Or get the name of the operation
        string operationName = response.Name;
        // This name can be stored, then the long-running operation retrieved later by name
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> retrievedResponse = documentServiceClient.PollOnceImportDocuments(operationName);
        // Check if the retrieved long-running operation has completed
        if (retrievedResponse.IsCompleted)
        {
            // If it has completed, then access the result
            ImportDocumentsResponse retrievedResult = retrievedResponse.Result;
        }
    }
}

Go

For more information, see the Vertex AI Agent Builder Go API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


package main

import (
	"context"

	discoveryengine "cloud.google.com/go/discoveryengine/apiv1"
	discoveryenginepb "cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb"
)

func main() {
	ctx := context.Background()
	// This snippet has been automatically generated and should be regarded as a code template only.
	// It will require modifications to work:
	// - It may require correct/in-range values for request initialization.
	// - It may require specifying regional endpoints when creating the service client as shown in:
	//   https://pkg.go.dev/cloud.google.com/go#hdr-Client_Options
	c, err := discoveryengine.NewDocumentClient(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	defer c.Close()

	req := &discoveryenginepb.ImportDocumentsRequest{
		// TODO: Fill request struct fields.
		// See https://pkg.go.dev/cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb#ImportDocumentsRequest.
	}
	op, err := c.ImportDocuments(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Use resp.
	_ = resp
}

Java

For more information, see the Vertex AI Agent Builder Java API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.discoveryengine.v1.BranchName;
import com.google.cloud.discoveryengine.v1.DocumentServiceClient;
import com.google.cloud.discoveryengine.v1.ImportDocumentsRequest;
import com.google.cloud.discoveryengine.v1.ImportDocumentsResponse;
import com.google.cloud.discoveryengine.v1.ImportErrorConfig;
import com.google.protobuf.FieldMask;

public class SyncImportDocuments {

  public static void main(String[] args) throws Exception {
    syncImportDocuments();
  }

  public static void syncImportDocuments() throws Exception {
    // This snippet has been automatically generated and should be regarded as a code template only.
    // It will require modifications to work:
    // - It may require correct/in-range values for request initialization.
    // - It may require specifying regional endpoints when creating the service client as shown in
    // https://cloud.google.com/java/docs/setup#configure_endpoints_for_the_client_library
    try (DocumentServiceClient documentServiceClient = DocumentServiceClient.create()) {
      ImportDocumentsRequest request =
          ImportDocumentsRequest.newBuilder()
              .setParent(
                  BranchName.ofProjectLocationDataStoreBranchName(
                          "[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]")
                      .toString())
              .setErrorConfig(ImportErrorConfig.newBuilder().build())
              .setUpdateMask(FieldMask.newBuilder().build())
              .setAutoGenerateIds(true)
              .setIdField("idField1629396127")
              .build();
      ImportDocumentsResponse response = documentServiceClient.importDocumentsAsync(request).get();
    }
  }
}

Node.js

For more information, see the Vertex AI Agent Builder Node.js API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * This snippet has been automatically generated and should be regarded as a code template only.
 * It will require modifications to work.
 * It may require correct/in-range values for request initialization.
 * TODO(developer): Uncomment these variables before running the sample.
 */
/**
 *  The Inline source for the input content for documents.
 */
// const inlineSource = {}
/**
 *  Cloud Storage location for the input content.
 */
// const gcsSource = {}
/**
 *  BigQuery input source.
 */
// const bigquerySource = {}
/**
 *  FhirStore input source.
 */
// const fhirStoreSource = {}
/**
 *  Spanner input source.
 */
// const spannerSource = {}
/**
 *  Cloud SQL input source.
 */
// const cloudSqlSource = {}
/**
 *  Firestore input source.
 */
// const firestoreSource = {}
/**
 *  Cloud Bigtable input source.
 */
// const bigtableSource = {}
/**
 *  Required. The parent branch resource name, such as
 *  `projects/{project}/locations/{location}/collections/{collection}/dataStores/{data_store}/branches/{branch}`.
 *  Requires create/update permission.
 */
// const parent = 'abc123'
/**
 *  The desired location of errors incurred during the Import.
 */
// const errorConfig = {}
/**
 *  The mode of reconciliation between existing documents and the documents to
 *  be imported. Defaults to
 *  ReconciliationMode.INCREMENTAL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL.
 */
// const reconciliationMode = {}
/**
 *  Indicates which fields in the provided imported documents to update. If
 *  not set, the default is to update all fields.
 */
// const updateMask = {}
/**
 *  Whether to automatically generate IDs for the documents if absent.
 *  If set to `true`,
 *  Document.id google.cloud.discoveryengine.v1.Document.id s are
 *  automatically generated based on the hash of the payload, where IDs may not
 *  be consistent during multiple imports. In which case
 *  ReconciliationMode.FULL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.FULL 
 *  is highly recommended to avoid duplicate contents. If unset or set to
 *  `false`, Document.id google.cloud.discoveryengine.v1.Document.id s have
 *  to be specified using
 *  id_field google.cloud.discoveryengine.v1.ImportDocumentsRequest.id_field,
 *  otherwise, documents without IDs fail to be imported.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const autoGenerateIds = true
/**
 *  The field indicates the ID field or column to be used as unique IDs of
 *  the documents.
 *  For GcsSource google.cloud.discoveryengine.v1.GcsSource  it is the key of
 *  the JSON field. For instance, `my_id` for JSON `{"my_id": "some_uuid"}`.
 *  For others, it may be the column name of the table where the unique ids are
 *  stored.
 *  The values of the JSON field or the table column are used as the
 *  Document.id google.cloud.discoveryengine.v1.Document.id s. The JSON field
 *  or the table column must be of string type, and the values must be set as
 *  valid strings conform to RFC-1034 (https://tools.ietf.org/html/rfc1034)
 *  with 1-63 characters. Otherwise, documents without valid IDs fail to be
 *  imported.
 *  Only set this field when
 *  auto_generate_ids google.cloud.discoveryengine.v1.ImportDocumentsRequest.auto_generate_ids 
 *  is unset or set as `false`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  If it is unset, a default value `_id` is used when importing from the
 *  allowed data sources.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const idField = 'abc123'

// Imports the Discoveryengine library
const {DocumentServiceClient} = require('@google-cloud/discoveryengine').v1;

// Instantiates a client
const discoveryengineClient = new DocumentServiceClient();

async function callImportDocuments() {
  // Construct request
  const request = {
    parent,
  };

  // Run request
  const [operation] = await discoveryengineClient.importDocuments(request);
  const [response] = await operation.promise();
  console.log(response);
}

callImportDocuments();

Python

For more information, see the Vertex AI Agent Builder Python API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_LOCATION" # Values: "global"
# data_store_id = "YOUR_DATA_STORE_ID"

# Must specify either `gcs_uri` or (`bigquery_dataset` and `bigquery_table`)
# Format: `gs://bucket/directory/object.json` or `gs://bucket/directory/*.json`
# gcs_uri = "YOUR_GCS_PATH"
# bigquery_dataset = "YOUR_BIGQUERY_DATASET"
# bigquery_table = "YOUR_BIGQUERY_TABLE"


def import_documents_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: Optional[str] = None,
    bigquery_dataset: Optional[str] = None,
    bigquery_table: Optional[str] = None,
) -> str:
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    if gcs_uri:
        request = discoveryengine.ImportDocumentsRequest(
            parent=parent,
            gcs_source=discoveryengine.GcsSource(
                input_uris=[gcs_uri], data_schema="custom"
            ),
            # Options: `FULL`, `INCREMENTAL`
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
        )
    else:
        request = discoveryengine.ImportDocumentsRequest(
            parent=parent,
            bigquery_source=discoveryengine.BigQuerySource(
                project_id=project_id,
                dataset_id=bigquery_dataset,
                table_id=bigquery_table,
                data_schema="custom",
            ),
            # Options: `FULL`, `INCREMENTAL`
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
        )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # Once the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

Ruby

For more information, see the Vertex AI Agent Builder Ruby API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

require "google/cloud/discovery_engine/v1"

##
# Snippet for the import_documents call in the DocumentService service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DiscoveryEngine::V1::DocumentService::Client#import_documents.
#
def import_documents
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DiscoveryEngine::V1::DocumentService::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DiscoveryEngine::V1::ImportDocumentsRequest.new

  # Call the import_documents method.
  result = client.import_documents request

  # The returned object is of type Gapic::Operation. You can use it to
  # check the status of an operation, cancel it, or wait for results.
  # Here is how to wait for a response.
  result.wait_until_done! timeout: 60
  if result.response?
    p result.response
  else
    puts "No response received."
  end
end

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Cloud Storage

To ingest data from Cloud Storage, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Before importing your data, review Prepare data for ingesting.

Console

To use the console to ingest data from a Cloud Storage bucket, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select Cloud Storage.

  5. In the Select a folder or file you want to import section, select Folder or File.

  6. Click Browse and choose the data you have prepared for ingesting, and then click Select. Alternatively, enter the location directly in the gs:// field.

  7. Select what kind of data you are importing.

  8. Optional: If you selected Unstructured documents, you can turn on OCR processing for PDFs. For more information about OCR processing and limitations, see About OCR parsing for PDFs.

    Google recommends OCR processing for PDFs if most of your unstructured data are scanned or image-based PDFs.

    • To turn on OCR processing, select Process and enrich your PDF files with OCR.
    • To turn on OCR processing for machine-readable PDF text with tables, select Use native text.
  9. Click Continue.

  10. Choose a region for your data store.

  11. Enter a name for your data store.

  12. Click Create.

  13. To check the status of your ingestion, go to the Data stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes or several hours.

REST

To use the command line to create a data store and ingest data from Cloud Storage, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
      "contentConfig": "CONTENT_REQUIRED",
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the Vertex AI Search data store that you want to create.

    Optional: To configure document parsing or to turn on document chunking for RAG, specify the documentProcessingConfig object and include it in your data store creation request. Configuring an OCR parser for PDFs is recommended if you're ingesting scanned PDFs. For how to configure parsing or chunking options, see Parse and chunk documents.

  2. Import data from Cloud Storage.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "gcsSource": {
          "inputUris": ["INPUT_FILE_PATTERN_1", "INPUT_FILE_PATTERN_2"],
          "dataSchema": "DATA_SCHEMA",
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
        "errorConfig": {
          "gcsPrefix": "ERROR_DIRECTORY"
        }
      }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store.
    • INPUT_FILE_PATTERN: A file pattern in Cloud Storage containing your documents.

      For structured data or for unstructured data with metadata, an example of the input file pattern is gs://<your-gcs-bucket>/directory/object.jsonand an example of pattern matching one or more files is gs://<your-gcs-bucket>/directory/*.json.

      For unstructured documents, an example is gs://<your-gcs-bucket>/directory/*.pdf. Each file that is matched by the pattern becomes a document.

      If <your-gcs-bucket> is not under PROJECT_ID, you need to give the service account service-<project number>@gcp-sa-discoveryengine.iam.gserviceaccount.com "Storage Object Viewer" permissions for the Cloud Storage bucket. For example, if you are importing a Cloud Storage bucket from source project "123" to destination project "456", give service-456@gcp-sa-discoveryengine.iam.gserviceaccount.com permissions on the Cloud Storage bucket under project "123".

    • DATA_SCHEMA: Optional. Values are document, custom, csv, and content. The default is document.

      • document: Upload unstructured data with metadata for unstructured documents. Each line of the file has to follow one of the following formats. You can define the ID of each document:

        • { "id": "<your-id>", "jsonData": "<JSON string>", "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
        • { "id": "<your-id>", "structData": <JSON object>, "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
      • custom: Upload JSON for structured documents. The data is organized according to a schema. You can specify the schema; otherwise it is auto-detected. You can put the JSON string of the document in a consistent format directly in each line, and Recommendations automatically generates the IDs for each document imported.

      • content: Upload unstructured documents (PDF, HTML, DOC, TXT, PPTX). The ID of each document is automatically generated as the first 128 bits of SHA256(GCS_URI) encoded as a hex string. You can specify multiple input file patterns as long as the matched files don't exceed the 100K files limit.

      • csv: Include a header row in your CSV file, with each header mapped to a document field. Specify the path to the CSV file using the inputUris field.

    • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import—for example, gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty to let Vertex AI Search automatically create a temporary directory.

    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Cloud Storage to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Cloud Storage are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.

    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.

      Specify autoGenerateIds only when gcsSource.dataSchema is set to custom or csv. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs. For Cloud Storage source documents, idField specifies the name in the JSON fields that are document IDs. For example, if {"my_id":"some_uuid"} is the document ID field in one of your documents, specify "idField":"my_id". This identifies all JSON fields with the name "my_id" as document IDs.

      Specify this field only when: (1) gcsSource.dataSchema is set to custom or csv, and (2) auto_generate_ids is set to false or is unspecified. Otherwise an INVALID_ARGUMENT error is returned.

      Note that the value of the Cloud Storage JSON field must be of string type, must be between 1-63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.

      Note that the JSON field name specified by id_field must be of string type, must be between 1 and 63 characters, and must conform to RFC-1034. Otherwise, the documents fail to import.

C#

For more information, see the Vertex AI Agent Builder C# API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

using Google.Cloud.DiscoveryEngine.V1;
using Google.LongRunning;
using Google.Protobuf.WellKnownTypes;

public sealed partial class GeneratedDocumentServiceClientSnippets
{
    /// <summary>Snippet for ImportDocuments</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public void ImportDocumentsRequestObject()
    {
        // Create client
        DocumentServiceClient documentServiceClient = DocumentServiceClient.Create();
        // Initialize request argument(s)
        ImportDocumentsRequest request = new ImportDocumentsRequest
        {
            ParentAsBranchName = BranchName.FromProjectLocationDataStoreBranch("[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]"),
            InlineSource = new ImportDocumentsRequest.Types.InlineSource(),
            ErrorConfig = new ImportErrorConfig(),
            ReconciliationMode = ImportDocumentsRequest.Types.ReconciliationMode.Unspecified,
            UpdateMask = new FieldMask(),
            AutoGenerateIds = false,
            IdField = "",
        };
        // Make the request
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> response = documentServiceClient.ImportDocuments(request);

        // Poll until the returned long-running operation is complete
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completedResponse = response.PollUntilCompleted();
        // Retrieve the operation result
        ImportDocumentsResponse result = completedResponse.Result;

        // Or get the name of the operation
        string operationName = response.Name;
        // This name can be stored, then the long-running operation retrieved later by name
        Operation<ImportDocumentsResponse, ImportDocumentsMetadata> retrievedResponse = documentServiceClient.PollOnceImportDocuments(operationName);
        // Check if the retrieved long-running operation has completed
        if (retrievedResponse.IsCompleted)
        {
            // If it has completed, then access the result
            ImportDocumentsResponse retrievedResult = retrievedResponse.Result;
        }
    }
}

Go

For more information, see the Vertex AI Agent Builder Go API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


package main

import (
	"context"

	discoveryengine "cloud.google.com/go/discoveryengine/apiv1"
	discoveryenginepb "cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb"
)

func main() {
	ctx := context.Background()
	// This snippet has been automatically generated and should be regarded as a code template only.
	// It will require modifications to work:
	// - It may require correct/in-range values for request initialization.
	// - It may require specifying regional endpoints when creating the service client as shown in:
	//   https://pkg.go.dev/cloud.google.com/go#hdr-Client_Options
	c, err := discoveryengine.NewDocumentClient(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	defer c.Close()

	req := &discoveryenginepb.ImportDocumentsRequest{
		// TODO: Fill request struct fields.
		// See https://pkg.go.dev/cloud.google.com/go/discoveryengine/apiv1/discoveryenginepb#ImportDocumentsRequest.
	}
	op, err := c.ImportDocuments(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Use resp.
	_ = resp
}

Java

For more information, see the Vertex AI Agent Builder Java API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.discoveryengine.v1.BranchName;
import com.google.cloud.discoveryengine.v1.DocumentServiceClient;
import com.google.cloud.discoveryengine.v1.ImportDocumentsRequest;
import com.google.cloud.discoveryengine.v1.ImportDocumentsResponse;
import com.google.cloud.discoveryengine.v1.ImportErrorConfig;
import com.google.protobuf.FieldMask;

public class SyncImportDocuments {

  public static void main(String[] args) throws Exception {
    syncImportDocuments();
  }

  public static void syncImportDocuments() throws Exception {
    // This snippet has been automatically generated and should be regarded as a code template only.
    // It will require modifications to work:
    // - It may require correct/in-range values for request initialization.
    // - It may require specifying regional endpoints when creating the service client as shown in
    // https://cloud.google.com/java/docs/setup#configure_endpoints_for_the_client_library
    try (DocumentServiceClient documentServiceClient = DocumentServiceClient.create()) {
      ImportDocumentsRequest request =
          ImportDocumentsRequest.newBuilder()
              .setParent(
                  BranchName.ofProjectLocationDataStoreBranchName(
                          "[PROJECT]", "[LOCATION]", "[DATA_STORE]", "[BRANCH]")
                      .toString())
              .setErrorConfig(ImportErrorConfig.newBuilder().build())
              .setUpdateMask(FieldMask.newBuilder().build())
              .setAutoGenerateIds(true)
              .setIdField("idField1629396127")
              .build();
      ImportDocumentsResponse response = documentServiceClient.importDocumentsAsync(request).get();
    }
  }
}

Node.js

For more information, see the Vertex AI Agent Builder Node.js API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * This snippet has been automatically generated and should be regarded as a code template only.
 * It will require modifications to work.
 * It may require correct/in-range values for request initialization.
 * TODO(developer): Uncomment these variables before running the sample.
 */
/**
 *  The Inline source for the input content for documents.
 */
// const inlineSource = {}
/**
 *  Cloud Storage location for the input content.
 */
// const gcsSource = {}
/**
 *  BigQuery input source.
 */
// const bigquerySource = {}
/**
 *  FhirStore input source.
 */
// const fhirStoreSource = {}
/**
 *  Spanner input source.
 */
// const spannerSource = {}
/**
 *  Cloud SQL input source.
 */
// const cloudSqlSource = {}
/**
 *  Firestore input source.
 */
// const firestoreSource = {}
/**
 *  Cloud Bigtable input source.
 */
// const bigtableSource = {}
/**
 *  Required. The parent branch resource name, such as
 *  `projects/{project}/locations/{location}/collections/{collection}/dataStores/{data_store}/branches/{branch}`.
 *  Requires create/update permission.
 */
// const parent = 'abc123'
/**
 *  The desired location of errors incurred during the Import.
 */
// const errorConfig = {}
/**
 *  The mode of reconciliation between existing documents and the documents to
 *  be imported. Defaults to
 *  ReconciliationMode.INCREMENTAL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL.
 */
// const reconciliationMode = {}
/**
 *  Indicates which fields in the provided imported documents to update. If
 *  not set, the default is to update all fields.
 */
// const updateMask = {}
/**
 *  Whether to automatically generate IDs for the documents if absent.
 *  If set to `true`,
 *  Document.id google.cloud.discoveryengine.v1.Document.id s are
 *  automatically generated based on the hash of the payload, where IDs may not
 *  be consistent during multiple imports. In which case
 *  ReconciliationMode.FULL google.cloud.discoveryengine.v1.ImportDocumentsRequest.ReconciliationMode.FULL 
 *  is highly recommended to avoid duplicate contents. If unset or set to
 *  `false`, Document.id google.cloud.discoveryengine.v1.Document.id s have
 *  to be specified using
 *  id_field google.cloud.discoveryengine.v1.ImportDocumentsRequest.id_field,
 *  otherwise, documents without IDs fail to be imported.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const autoGenerateIds = true
/**
 *  The field indicates the ID field or column to be used as unique IDs of
 *  the documents.
 *  For GcsSource google.cloud.discoveryengine.v1.GcsSource  it is the key of
 *  the JSON field. For instance, `my_id` for JSON `{"my_id": "some_uuid"}`.
 *  For others, it may be the column name of the table where the unique ids are
 *  stored.
 *  The values of the JSON field or the table column are used as the
 *  Document.id google.cloud.discoveryengine.v1.Document.id s. The JSON field
 *  or the table column must be of string type, and the values must be set as
 *  valid strings conform to RFC-1034 (https://tools.ietf.org/html/rfc1034)
 *  with 1-63 characters. Otherwise, documents without valid IDs fail to be
 *  imported.
 *  Only set this field when
 *  auto_generate_ids google.cloud.discoveryengine.v1.ImportDocumentsRequest.auto_generate_ids 
 *  is unset or set as `false`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  If it is unset, a default value `_id` is used when importing from the
 *  allowed data sources.
 *  Supported data sources:
 *  * GcsSource google.cloud.discoveryengine.v1.GcsSource.
 *  GcsSource.data_schema google.cloud.discoveryengine.v1.GcsSource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * BigQuerySource google.cloud.discoveryengine.v1.BigQuerySource.
 *  BigQuerySource.data_schema google.cloud.discoveryengine.v1.BigQuerySource.data_schema 
 *  must be `custom` or `csv`. Otherwise, an INVALID_ARGUMENT error is thrown.
 *  * SpannerSource google.cloud.discoveryengine.v1.SpannerSource.
 *  * CloudSqlSource google.cloud.discoveryengine.v1.CloudSqlSource.
 *  * FirestoreSource google.cloud.discoveryengine.v1.FirestoreSource.
 *  * BigtableSource google.cloud.discoveryengine.v1.BigtableSource.
 */
// const idField = 'abc123'

// Imports the Discoveryengine library
const {DocumentServiceClient} = require('@google-cloud/discoveryengine').v1;

// Instantiates a client
const discoveryengineClient = new DocumentServiceClient();

async function callImportDocuments() {
  // Construct request
  const request = {
    parent,
  };

  // Run request
  const [operation] = await discoveryengineClient.importDocuments(request);
  const [response] = await operation.promise();
  console.log(response);
}

callImportDocuments();

Python

For more information, see the Vertex AI Agent Builder Python API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_LOCATION" # Values: "global"
# data_store_id = "YOUR_DATA_STORE_ID"

# Must specify either `gcs_uri` or (`bigquery_dataset` and `bigquery_table`)
# Format: `gs://bucket/directory/object.json` or `gs://bucket/directory/*.json`
# gcs_uri = "YOUR_GCS_PATH"
# bigquery_dataset = "YOUR_BIGQUERY_DATASET"
# bigquery_table = "YOUR_BIGQUERY_TABLE"


def import_documents_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: Optional[str] = None,
    bigquery_dataset: Optional[str] = None,
    bigquery_table: Optional[str] = None,
) -> str:
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    if gcs_uri:
        request = discoveryengine.ImportDocumentsRequest(
            parent=parent,
            gcs_source=discoveryengine.GcsSource(
                input_uris=[gcs_uri], data_schema="custom"
            ),
            # Options: `FULL`, `INCREMENTAL`
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
        )
    else:
        request = discoveryengine.ImportDocumentsRequest(
            parent=parent,
            bigquery_source=discoveryengine.BigQuerySource(
                project_id=project_id,
                dataset_id=bigquery_dataset,
                table_id=bigquery_table,
                data_schema="custom",
            ),
            # Options: `FULL`, `INCREMENTAL`
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
        )

    # Make the request
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # Once the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

Ruby

For more information, see the Vertex AI Agent Builder Ruby API reference documentation.

To authenticate to Vertex AI Agent Builder, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

require "google/cloud/discovery_engine/v1"

##
# Snippet for the import_documents call in the DocumentService service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DiscoveryEngine::V1::DocumentService::Client#import_documents.
#
def import_documents
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DiscoveryEngine::V1::DocumentService::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DiscoveryEngine::V1::ImportDocumentsRequest.new

  # Call the import_documents method.
  result = client.import_documents request

  # The returned object is of type Gapic::Operation. You can use it to
  # check the status of an operation, cancel it, or wait for results.
  # Here is how to wait for a response.
  result.wait_until_done! timeout: 60
  if result.response?
    p result.response
  else
    puts "No response received."
  end
end

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Sync from Google Drive

To sync data from Google Drive, use the following steps to create a data store and ingest data using the Google Cloud console.

Data from Google Drive continuously syncs to Vertex AI Search after you create your data store.

Before you begin:

  • You must be signed into the Google Cloud console with the same account that you use for the Google Drive instance that you plan to connect. Vertex AI Search uses your Google Workspace customer ID to connect to Google Drive.

  • Set up access control for Google Drive. For information about setting up access control, see Use data source access control.

Console

To use the console to ingest data from Google Drive, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select Google Drive.

  5. Choose a region for your data store.

  6. Enter a name for your data store.

  7. Click Create. Depending on the size of your data, ingestion can take several minutes to several hours. Wait at least an hour before using your data store for searching.

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Import from Cloud SQL

To ingest data from Cloud SQL, use the following steps to set up Cloud SQL access, create a data store, and ingest data.

Set up staging bucket access for Cloud SQL instances

When ingesting data from Cloud SQL, data is first staged to a Cloud Storage bucket. Follow these steps to give a Cloud SQL instance access to Cloud Storage buckets.

  1. In the Google Cloud console, go to the SQL page.

    SQL

  2. Click the Cloud SQL instance that you plan to import from.

  3. Copy the identifier for the instance's service account, which looks like an email address—for example, p9876-abcd33f@gcp-sa-cloud-sql.iam.gserviceaccount.com.

  4. Go to the IAM & Admin page.

    IAM & Admin

  5. Click Grant access.

  6. For New principals, enter the instance's service account identifier and select the Cloud Storage > Storage Admin role.

  7. Click Save.

Next:

Set up Cloud SQL access from a different project

To give Vertex AI Search access to Cloud SQL data that's in a different project, follow these steps:

  1. Replace the following PROJECT_NUMBER variable with your Vertex AI Search project number, and then copy the contents of the code block. This is your Vertex AI Search service account identifier:

    service-PROJECT_NUMBER@gcp-sa-discoveryengine.iam.gserviceaccount.com`
    
  2. Go to the IAM & Admin page.

    IAM & Admin

  3. Switch to your Cloud SQL project on the IAM & Admin page and click Grant Access.

  4. For New principals, enter the identifier for the service account and select the Cloud SQL > Cloud SQL Viewer role.

  5. Click Save.

Next, go to Import data from Cloud SQL.

Import data from Cloud SQL

Console

To use the console to ingest data from Cloud SQL, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select Cloud SQL.

  5. Specify the project ID, instance ID, database ID, and table ID of the data that you plan to import.

  6. Click Browse and choose an intermediate Cloud Storage location to export data to, and then click Select. Alternatively, enter the location directly in the gs:// field.

  7. Select whether to turn on serverless export. Serverless export incurs additional cost. For information about serverless export, see Minimize the performance impact of exports in the Cloud SQL documentation.

  8. Click Continue.

  9. Choose a region for your data store.

  10. Enter a name for your data store.

  11. Click Create.

  12. To check the status of your ingestion, go to the Data stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes or several hours.

REST

To use the command line to create a data store and ingest data from Cloud SQL, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
    }'
    

    Replace the following:

    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DISPLAY_NAME: The display name of the data store. This might be displayed in the Google Cloud console.
  2. Import data from Cloud SQL.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "cloudSqlSource": {
          "projectId": "SQL_PROJECT_ID",
          "instanceId": "INSTANCE_ID",
          "databaseId": "DATABASE_ID",
          "tableId": "TABLE_ID",
          "gcsStagingDir": "STAGING_DIRECTORY"
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
      }'
    

    Replace the following:

    • PROJECT_ID: The ID of your Vertex AI Search project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • SQL_PROJECT_ID: The ID of your Cloud SQL project.
    • INSTANCE_ID: The ID of your Cloud SQL instance.
    • DATABASE_ID: The ID of your Cloud SQL database.
    • TABLE_ID: The ID of your Cloud SQL table.
    • STAGING_DIRECTORY: Optional. A Cloud Storage directory—for example, gs://<your-gcs-bucket>/directory/import_errors.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Cloud SQL to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Cloud SQL are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Import from Spanner

To ingest data from Spanner, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Set up Spanner access from a different project

If your Spanner data is in the same project as Vertex AI Search, skip to Import data from Spanner.

To give Vertex AI Search access to Spanner data that is in a different project, follow these steps:

  1. Replace the following PROJECT_NUMBER variable with your Vertex AI Search project number, and then copy the contents of this code block. This is your Vertex AI Search service account identifier:

    service-PROJECT_NUMBER@gcp-sa-discoveryengine.iam.gserviceaccount.com
    
  2. Go to the IAM & Admin page.

    IAM & Admin

  3. Switch to your Spanner project on the IAM & Admin page and click Grant Access.

  4. For New principals, enter the identifier for the service account and select one of the following:

    • If you won't use data boost during import, select the Cloud Spanner > Cloud Spanner Database Reader role.
    • If you plan to use data boost during import, select the Cloud Spanner > Cloud Spanner Admin role, or a custom role with the permissions of Cloud Spanner Database Reader and spanner.databases.useDataBoost. For information about Data Boost, see Data Boost overview in the Spanner documentation.
  5. Click Save.

Next, go to Import data from Spanner.

Import data from Spanner

Console

To use the console to ingest data from Spanner, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select Cloud Spanner.

  5. Specify the project ID, instance ID, database ID, and table ID of the data that you plan to import.

  6. Select whether to turn on Data Boost. For information about Data Boost, see Data Boost overview in the Spanner documentation.

  7. Click Continue.

  8. Choose a region for your data store.

  9. Enter a name for your data store.

  10. Click Create.

  11. To check the status of your ingestion, go to the Data stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes or several hours.

REST

To use the command line to create a data store and ingest data from Spanner, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
      "contentConfig": "CONTENT_REQUIRED",
    }'
    

    Replace the following:

    • PROJECT_ID: The ID of your Vertex AI Search project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DISPLAY_NAME: The display name of the data store. This might be displayed in the Google Cloud console.
  2. Import data from Spanner.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "cloudSpannerSource": {
          "projectId": "SPANNER_PROJECT_ID",
          "instanceId": "INSTANCE_ID",
          "databaseId": "DATABASE_ID",
          "tableId": "TABLE_ID",
          "enableDataBoost": "DATA_BOOST_BOOLEAN"
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
      }'
    

    Replace the following:

    • PROJECT_ID: The ID of your Vertex AI Search project.
    • DATA_STORE_ID: The ID of the data store.
    • SPANNER_PROJECT_ID: The ID of your Spanner project.
    • INSTANCE_ID: The ID of your Spanner instance.
    • DATABASE_ID: The ID of your Spanner database.
    • TABLE_ID: The ID of your Spanner table.
    • DATA_BOOST_BOOLEAN: Optional. Whether to turn on Data Boost. For information about Data Boost, see Data Boost overview in the Spanner documentation.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Spanner to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Spanner are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.

    • ID_FIELD: Optional. Specifies which fields are the document IDs.

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Import from Firestore

To ingest data from Firestore, use the following steps to create a data store and ingest data using either the Google Cloud console or the API.

Set up staging bucket access for Firestore from a different project

When ingesting data from Firestore, data is first staged to a Cloud Storage bucket. If Firestore is in a different project from the one that you use for Cloud Storage and Vertex AI Search, follow these steps to give Firestore access to Cloud Storage buckets.

  1. Replace the following FIRESTORE_PROJECT_NUMBER variable with your Firestore project number, and then copy the contents of this code block. This is your Firestore service account identifier:

    service-FIRESTORE_PROJECT_NUMBER@gcp-sa-discoveryengine.iam.gserviceaccount.com`
    
  2. Go to the IAM & Admin page.

    IAM & Admin

  3. Select the project that you use for Cloud Storage and Vertex AI Search.

  4. Click Grant access.

  5. For New principals, enter the Firestore service account identifier and select the Cloud Storage > Storage Admin role.

  6. Click Save.

Next, if your Firestore data is in the same project as Vertex AI Search, go to Import data from Firestore.

If your Firestore data is in a different project than your Vertex AI Search project, go to Set up Firestore access.

Set up Firestore access from a different project

To give Vertex AI Search access to Firestore data that's in a different project, follow these steps:

  1. Replace the following PROJECT_NUMBER variable with your Vertex AI Search project number, and then copy the contents of this code block. This is your Vertex AI Search service account identifier:

    service-PROJECT_NUMBER@gcp-sa-discoveryengine.iam.gserviceaccount.com
    
  2. Go to the IAM & Admin page.

    IAM & Admin

  3. Switch to your Firestore project on the IAM & Admin page and click Grant Access.

  4. For New principals, enter the instance's service account identifier and select the Datastore > Cloud Datastore Import Export Admin role.

  5. Click Save.

  6. Switch back to your Vertex AI Search project.

Next, go to Import data from Firestore.

Import data from Firestore

Console

To use the console to ingest data from Firestore, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

    Agent Builder

  2. Go to the Data Stores page.

  3. Click New data store.

  4. On the Source page, select Firestore.

  5. Specify the project ID, database ID, and collection ID of the data that you plan to import.

  6. Click Browse and choose an intermediate Cloud Storage location to export data to, and then click Select. Alternatively, enter the location directly in the gs:// field.

  7. Click Continue.

  8. Choose a region for your data store.

  9. Enter a name for your data store.

  10. Click Create.

  11. To check the status of your ingestion, go to the Data stores page and click your data store name to see details about it on its Data page. When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes or several hours.

REST

To use the command line to create a data store and ingest data from Firestore, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
    }'
    

    Replace the following:

    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DISPLAY_NAME: The display name of the data store. This might be displayed in the Google Cloud console.
  2. Import data from Firestore.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "firestoreSource": {
          "projectId": "FIRESTORE_PROJECT_ID",
          "databaseId": "DATABASE_ID",
          "collectionId": "COLLECTION_ID",
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
      }'
    

    Replace the following:

    • PROJECT_ID: The ID of your Vertex AI Search project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • FIRESTORE_PROJECT_ID: The ID of your Firestore project.
    • DATABASE_ID: The ID of your Firestore database.
    • COLLECTION_ID: The ID of your Firestore collection.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Firestore to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Firestore are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.
    • ID_FIELD: Optional. Specifies which fields are the document IDs.

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Import from Bigtable

To ingest data from Bigtable, use the following steps to create a data store and ingest data using the API.

Set up Bigtable access

To give Vertex AI Search access to Bigtable data that's in a different project, follow these steps:

  1. Replace the following PROJECT_NUMBER variable with your Vertex AI Search project number, then copy the contents of this code block. This is your Vertex AI Search service account identifier:

    service-PROJECT_NUMBER@gcp-sa-discoveryengine.iam.gserviceaccount.com`
    
  2. Go to the IAM & Admin page.

    IAM & Admin

  3. Switch to your Bigtable project on the IAM & Admin page and click Grant Access.

  4. For New principals, enter the instance's service account identifier and select the Bigtable > Bigtable Reader role.

  5. Click Save.

  6. Switch back to your Vertex AI Search project.

Next, go to Import data from Bigtable.

Import data from Bigtable

REST

To use the command line to create a data store and ingest data from Bigtable, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
    }'
    

    Replace the following:

    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DISPLAY_NAME: The display name of the data store. This might be displayed in the Google Cloud console.
  2. Import data from Bigtable.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
      -d '{
        "bigtableSource ": {
          "projectId": "BIGTABLE_PROJECT_ID",
          "instanceId": "INSTANCE_ID",
          "tableId": "TABLE_ID",
          "bigtableOptions": {
            "keyFieldName": "KEY_FIELD_NAME",
            "families": {
              "key": "KEY",
              "value": {
                "fieldName": "FIELD_NAME",
                "encoding": "ENCODING",
                "type": "TYPE",
                "columns": [
                  {
                    "qualifier": "QUALIFIER",
                    "fieldName": "FIELD_NAME",
                    "encoding": "COLUMN_ENCODING",
                    "type": "COLUMN_VALUES_TYPE"
                  }
                ]
              }
             }
             ...
          }
        },
        "reconciliationMode": "RECONCILIATION_MODE",
        "autoGenerateIds": "AUTO_GENERATE_IDS",
        "idField": "ID_FIELD",
      }'
    

    Replace the following:

    • PROJECT_ID: The ID of your Vertex AI Search project.
    • DATA_STORE_ID: The ID of the data store. The ID can contain only lowercase letters, digits, underscores, and hyphens.
    • BIGTABLE_PROJECT_ID: The ID of your Bigtable project.
    • INSTANCE_ID: The ID of your Bigtable instance.
    • TABLE_ID: The ID of your Bigtable table.
    • KEY_FIELD_NAME: Optional but recommended. The field name to use for the row key value after ingesting to Vertex AI Search.
    • KEY: Required. A string value for the column family key.
    • ENCODING: Optional. The encoding mode of the values when the type is not STRING.This can be overridden for a specific column by listing that column in columns and specifying an encoding for it.
    • COLUMN_TYPE: Optional. The type of values in this column family.
    • QUALIFIER: Required. Qualifier of the column.
    • FIELD_NAME: Optional but recommended. The field name to use for this column after ingesting to Vertex AI Search.
    • COLUMN_ENCODING: Optional. The encoding mode of the values for a specific column when the type is not STRING.
    • RECONCILIATION_MODE: Optional. Values are FULL and INCREMENTAL. Default is INCREMENTAL. Specifying INCREMENTAL causes an incremental refresh of data from Bigtable to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents with the same ID. Specifying FULL causes a full rebase of the documents in your data store. In other words, new and updated documents are added to your data store, and documents that are not in Bigtable are removed from your data store. The FULL mode is helpful if you want to automatically delete documents that you no longer need.
    • AUTO_GENERATE_IDS: Optional. Specifies whether to automatically generate document IDs. If set to true, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.

      Specify autoGenerateIds only when bigquerySource.dataSchema is set to custom. Otherwise an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or set it to false, you must specify idField. Otherwise the documents fail to import.

    • ID_FIELD: Optional. Specifies which fields are the document IDs.

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Upload structured JSON data with the API

To directly upload a JSON document or object using the API, follow these steps.

Before importing your data, Prepare data for ingesting.

REST

To use the command line to create a data store and import structured JSON data, follow these steps:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "GENERIC",
      "solutionTypes": ["SOLUTION_TYPE_SEARCH"]
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the Vertex AI Search data store that you want to create.
  2. Optional: Provide your own schema. When you provide a schema, you typically get better results. For more information, see Schemas: auto-detecting versus providing your own.

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/schemas/default_schema" \
    -d '{
      "structSchema": JSON_SCHEMA_OBJECT
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store.
    • JSON_SCHEMA_OBJECT: Your JSON schema as a JSON object—for example:

      {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {
          "title": {
            "type": "string",
            "keyPropertyMapping": "title"
          },
          "categories": {
            "type": "array",
            "items": {
              "type": "string",
              "keyPropertyMapping": "category"
            }
          },
          "uri": {
            "type": "string",
            "keyPropertyMapping": "uri"
          }
        }
      }
      
  3. Import structured data that conforms to the defined schema.

    There are a few approaches that you can use to upload data, including:

    • Upload a JSON document.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents?documentId=DOCUMENT_ID" \
      -d '{
        "jsonData": "JSON_DOCUMENT_STRING"
      }'
      

      Replace the following:

      • DOCUMENT_ID: a unique ID for the document. This ID can be up to 63 characters long and contain only lowercase letters, digits, underscores, and hyphens.
      • JSON_DOCUMENT_STRING: the JSON document as a single string. This must conform to the JSON schema that you provided in the previous step—for example:

        { \"title\": \"test title\", \"categories\": [\"cat_1\", \"cat_2\"], \"uri\": \"test uri\"}
        
    • Upload a JSON object.

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents?documentId=DOCUMENT_ID" \
      -d '{
        "structData": JSON_DOCUMENT_OBJECT
      }'
      

      Replace JSON_DOCUMENT_OBJECT with the JSON document as a JSON object. This must conform to the JSON schema that you provided in the previous step—for example:

      ```json
      {
        "title": "test title",
        "categories": [
          "cat_1",
          "cat_2"
        ],
        "uri": "test uri"
      }
      ```
      
    • Update with a JSON document.

      curl -X PATCH \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents/DOCUMENT_ID" \
      -d '{
        "jsonData": "JSON_DOCUMENT_STRING"
      }'
      
    • Update with a JSON object.

      curl -X PATCH \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents/DOCUMENT_ID" \
      -d '{
        "structData": JSON_DOCUMENT_OBJECT
      }'
      

Next steps

  • To attach your data store to an app, create an app and select your data store following the steps in Create a search app.

  • To preview how your search results appear after your app and data store are set up, see Get search results.

Troubleshoot data ingestion

If you are having problems with data ingestion, review these tips:

  • If you're using customer-managed encryption keys and data import fails (with error message The caller does not have permission), then make sure that the CryptoKey Encrypter/Decrypter IAM role (roles/cloudkms.cryptoKeyEncrypterDecrypter) on the key has been granted to the Cloud Storage service agent. For more information, see Before you begin in "Customer-managed encryption keys".

  • If you are using advanced website indexing and the Document usage for the data store is much lower than you expect, then review the URLs that you specified for indexing and make sure that the URL patterns specified cover the pages that you want to index and expand them if needed. For example, if you used www.example.com/*, you might need to add *.example.com/* to the sites you want indexed.

Create a data store using Terraform

You can use Terraform to create an empty data store. After the empty data store is created, you can ingest data into the data store using the Google Cloud console or API commands.

To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.

To create an empty data store using Terraform, see google_discovery_engine_data_store.

About connecting multiple data stores

You can create a blended search app, where multiple data stores can be connected to a single generic search app. This feature lets you use one app to search across multiple sources and types of data.

To make a blended search app, select multiple data stores when creating a new generic search app.

When getting search results, you can either search across all data stores, or filter for results from a single data store by specifying dataStoreSpecs in the API or by selecting the data store in the search UI.

The following limitations apply:

  • Website data stores need to have advanced website indexing turned on in order to be used for blended search. For more information, see Advanced website indexing.
  • Data stores that contain unstructured data imported using BigQuery are not supported.
  • Adding and removing data stores:
    • To turn on blended search for an app, you must connect at least two data stores to it during app creation. You can add or remove data stores from a blended search app, but the app can't have fewer than two data stores connected to it at any time.
    • If you connect a single data store to a search app during app creation, then you can't add or remove that data store.
  • Blended search allows the following fields in search requests:
    • query
    • pageSize
    • offset
    • dataStoreSpec
    • pageToken
  • Blended search apps support the following fields only when search requests are filtered to get results from a single data store:
    • facetSpec
    • filter
  • Blended search apps don't support the following features:
    • Search with follow-ups
    • Serving controls
    • Summarization
    • Spell correction