Data stores

Data store agents use data stores to find answers to end-user questions in your data. A data store is a collection of websites and documents, each of which references your data.

When an end-user asks the agent a question, the agent searches the source content for an answer and summarizes the findings into a coherent agent response. It also provides supporting links to the sources of the response so that the end-user can learn more. The agent can provide up to five answer snippets for a given question.

Data store sources

There are different sources that you can supply for your data:

  • Website URLs: Automatically crawl website content from a list of domains or web pages.
  • BigQuery: Import data from your BigQuery table.
  • Cloud Storage: Import data from your Cloud Storage bucket.

Website content

When adding website content as a source, you can add and exclude multiple sites. When you specify a site, you can use individual pages or * as a wildcard for a pattern. All HTML and PDF content will be processed.

You must verify your domain when using website content as a source.

Limitations:

  • Files from public URLs must have been crawled by the Google Search indexer, so that they exist in the search index. You can check this with the Google Search Console.
  • A maximum of 200,000 pages is indexed. If the data store contains more pages, indexing fails and the last indexed content remains.

Import data

You can import your data from either BigQuery or Cloud Storage. This data can be structured or unstructured, and it can be provided with or without metadata.

The following Data Import Options are available:

  • Add/Update Data: The provided documents are added to the data store. If a new document has the same ID as an old document, the new document replaces the old document.
  • Override Existing Data: All old data is deleted, then new data is uploaded. This is irreversible.
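
The difference between the two options can be pictured with a plain dictionary keyed by document ID. This is only an illustrative model under that assumption, not how the service is implemented; the function names add_or_update and override_existing are hypothetical.

```python
# Illustrative model only: a data store as a dict mapping document ID -> content.
# The function names below are hypothetical, not part of any Google API.

def add_or_update(store: dict, new_docs: dict) -> dict:
    """Add/Update Data: keep old documents, replace those whose ID matches."""
    merged = dict(store)
    merged.update(new_docs)
    return merged


def override_existing(store: dict, new_docs: dict) -> dict:
    """Override Existing Data: discard all old documents, keep only the new ones."""
    return dict(new_docs)


old = {"d001": "v1", "d002": "v1"}
new = {"d002": "v2", "d003": "v1"}

updated = add_or_update(old, new)         # d001 kept, d002 replaced, d003 added
overridden = override_existing(old, new)  # d001 is gone
```

In this model, only the override path loses d001, which is why that option is described as irreversible.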

Structured data store

Structured data stores can hold answers to frequently asked questions (FAQ). When user questions are matched with high confidence to an uploaded question, the agent returns the answer to that question without any modification. You can provide a title and a URL for each question and answer pair that is displayed by the agent.

Data uploaded to the data store must be in CSV format. Each file must have a header row describing the columns.

For example:

"question","answer","title","url"
"Why is the sky blue?","The sky is blue because of Rayleigh scattering.","Rayleigh scattering","https://en.wikipedia.org/wiki/Rayleigh_scattering"
"What is the meaning of life?","42","",""

The title and url columns are optional and can be omitted:

"answer","question"
"42","What is the meaning of life?"

During the upload process, you can select a folder. Each file in the folder is treated as a CSV file regardless of its extension.

Limitations:

  • An extra space character after a comma causes an error.
  • Blank lines (even at the end of the file) cause an error.
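
One way to avoid both limitations is to generate the file programmatically rather than by hand. The following is a minimal sketch using Python's standard csv module. The helper name build_faq_csv is hypothetical, and stripping the final newline is a cautious assumption about what counts as a blank line at the end of the file.

```python
import csv
import io


def build_faq_csv(pairs):
    """Return CSV text for dicts with question/answer and optional title/url keys."""
    buffer = io.StringIO()
    # QUOTE_ALL quotes every field; the writer never emits a space after a comma.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["question", "answer", "title", "url"])  # header row
    for pair in pairs:
        writer.writerow([
            pair["question"],
            pair["answer"],
            pair.get("title", ""),
            pair.get("url", ""),
        ])
    # Assumption: drop the final newline so the file cannot end with a blank line.
    return buffer.getvalue().rstrip("\n")


faq_csv = build_faq_csv([
    {
        "question": "Why is the sky blue?",
        "answer": "The sky is blue because of Rayleigh scattering.",
        "title": "Rayleigh scattering",
        "url": "https://en.wikipedia.org/wiki/Rayleigh_scattering",
    },
    {"question": "What is the meaning of life?", "answer": "42"},
])
```

When saving faq_csv to disk, opening the file with newline="" keeps Python from translating line endings.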

Unstructured data store

Unstructured data stores can contain content in the following formats:

  • HTML
  • PDF
  • TXT
  • CSV

Limitations:

  • The maximum file size is 2.5 MB for text-based formats and 100 MB for other formats.
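
A pre-upload check can catch oversized or unsupported files early. The sketch below rests on assumptions that the text above does not confirm: that HTML, TXT, and CSV are the "text-based" formats, that PDF falls under "other formats", and that 1 MB means 1,000,000 bytes. The function name check_file is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: HTML, TXT, and CSV count as "text-based"; PDF is an "other" format.
TEXT_EXTENSIONS = {".html", ".txt", ".csv"}
OTHER_EXTENSIONS = {".pdf"}

# Assumption: 1 MB = 1,000,000 bytes.
TEXT_LIMIT_BYTES = 2_500_000
OTHER_LIMIT_BYTES = 100_000_000


def check_file(name: str, size_bytes: int) -> str:
    """Return "ok", "too large", or "unsupported format" for one file."""
    extension = PurePosixPath(name).suffix.lower()
    if extension in TEXT_EXTENSIONS:
        limit = TEXT_LIMIT_BYTES
    elif extension in OTHER_EXTENSIONS:
        limit = OTHER_LIMIT_BYTES
    else:
        return "unsupported format"
    return "ok" if size_bytes <= limit else "too large"


results = {
    "faq.csv": check_file("faq.csv", 3_000_000),
    "manual.pdf": check_file("manual.pdf", 3_000_000),
    "logo.png": check_file("logo.png", 10),
}
```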

Data store with metadata

A title and URL can be provided as metadata. When the agent is in a conversation with a user, the agent can provide this information to the user. This can help users to quickly link to internal web pages not accessible by the Google Search indexer.

To import content with metadata, you provide one or more JSON Lines files. Each line of this file describes one document. You do not directly upload the actual documents; URIs that link to the Cloud Storage paths are provided in the JSON Lines file.

When providing your JSON Lines files, you provide a Cloud Storage folder that contains these files. Do not put any other files in this folder.

Field descriptions:

Field             Type    Description
id                string  Unique identifier for the document.
content.mimeType  string  MIME type of the document. "application/pdf" and "text/html" are supported.
content.uri       string  URI for the document in Cloud Storage.
structData        object  Single-line JSON object with optional title and url fields.

For example:

{ "id": "d001", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/first_doc.pdf"}, "structData": {"title": "First Document", "url": "https://internal.example.com/documents/first_doc.pdf"} }
{ "id": "d002", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/second_doc.pdf"}, "structData": {"title": "Second Document", "url": "https://internal.example.com/documents/second_doc.pdf"} }
{ "id": "d003", "content": {"mimeType": "text/html", "uri": "gs://example-import/unstructured/mypage.html"}, "structData": {"title": "My Page", "url": "https://internal.example.com/mypage.html"} }
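
Files like the example above can be generated with a short script. The following is a minimal sketch that mirrors the layout of those example lines. The helper name metadata_line and the sample values are hypothetical, and the field placement is taken from the example rather than from a schema definition.

```python
import json


def metadata_line(doc_id, mime_type, uri, title=None, url=None):
    """Build one JSON Lines entry laid out like the example lines above."""
    struct_data = {}
    if title is not None:
        struct_data["title"] = title
    if url is not None:
        struct_data["url"] = url
    entry = {
        "id": doc_id,
        "content": {"mimeType": mime_type, "uri": uri},
        "structData": struct_data,
    }
    # json.dumps emits no raw newlines by default, so each entry stays on one line.
    return json.dumps(entry)


documents = [
    ("d001", "application/pdf", "gs://example-import/unstructured/first_doc.pdf",
     "First Document", "https://internal.example.com/documents/first_doc.pdf"),
    ("d003", "text/html", "gs://example-import/unstructured/mypage.html",
     "My Page", "https://internal.example.com/mypage.html"),
]

# One document per line, with no blank lines in between.
jsonl_text = "\n".join(metadata_line(*doc) for doc in documents)
```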

Data store without metadata

This type of content has no metadata. Just provide the documents to import. The content type is determined by the file extension.

Parse and chunk configuration

Depending on the data source, you might be able to configure parse and chunk settings as defined by Vertex AI Search.

Create a data store

To create a data store:

  1. Go to the Agent Builder console.

  2. Select your project from the console drop-down.

  3. Read and agree to the Terms of Service, then click Continue and activate the API.

  4. Click Data Stores in the left navigation.

  5. Click New Data Store.

  6. Choose a data source.

  7. Enable Advanced website indexing. This is required for data store agents.

  8. Provide data and configuration for the data store source you selected. Your data store location should correspond to the agent location.

  9. Click Create to create the data store.

  10. Optionally set the data store language:

    1. From the list of data stores, click the data store you just created.
    2. Click the edit button for the language setting.
    3. Select a language and click the check to apply.

  11. If you chose website content as the source, verify your website domain.

Using Cloud Storage for a data store document

If your content is not public, storing your content in Cloud Storage is the recommended option. When creating data store documents, you provide the URLs for your Cloud Storage objects in the form: gs://bucket-name/folder-name. Each document within the folder is added to the data store.
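
Before submitting a path, it can help to confirm that it really has the gs://bucket-name/folder-name form. This is a minimal sketch using only the Python standard library; the function name split_gs_uri is hypothetical, and the check covers only the URI shape, not whether the bucket or folder exists.

```python
from urllib.parse import urlsplit


def split_gs_uri(uri: str):
    """Split gs://bucket-name/folder-name into (bucket, folder path)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    # Strip surrounding slashes so "folder-name/" and "folder-name" compare equal.
    return parts.netloc, parts.path.strip("/")


bucket, folder = split_gs_uri("gs://bucket-name/folder-name")
```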

To create the Cloud Storage bucket, follow the Cloud Storage quickstart instructions to create a bucket and upload files.

Languages

For supported languages, see the data store column in the Dialogflow language reference.

For best performance, create each data store in a single language.

After creating a data store, you can optionally specify the data store language. If you set the data store language, you can connect the data store to a data store agent that is configured for a different language. For example, you can create a French data store that is connected to an English agent.

Supported regions

For supported regions, see the Dialogflow region reference.