Data stores are used by data store handlers and playbook data store tools to find answers to end-users' questions in your data. A data store is a collection of websites and documents, each of which references your data.
When an end-user asks the agent a question, the agent searches for an answer from the given source content and summarizes the findings into a coherent agent response. It also provides supporting links to the sources of the response for the end-user to learn more. The agent can provide up to five answer snippets for a given question.
Data store sources
There are different sources that you can supply for your data:
- Website URLs: Automatically crawl website content from a list of domains or web pages.
- BigQuery: Import data from your BigQuery table.
- Cloud Storage: Import data from your Cloud Storage bucket.
Website content
When adding website content as a source, you can add and exclude multiple sites. When you specify a site, you can use individual pages or * as a wildcard for a pattern.
All HTML and PDF content will be processed.
You must verify your domain when using website content as a source.
Limitations:
- Files from public URLs must have been crawled by the Google Search indexer, so that they exist in the search index. You can check this with the Google Search Console.
- A maximum of 200,000 pages is indexed. If the data store contains more pages, indexing fails and the last indexed content remains.
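The include and exclude behavior described in this section can be pictured with a small sketch. It assumes shell-style matching, where * matches any run of characters, and it assumes that an exclusion takes precedence over an inclusion. The service's actual matching rules are not specified here, and the function name and sample patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def is_crawled(url, include_patterns, exclude_patterns):
    """Return True if url matches an include pattern and no exclude pattern."""
    # Assumption: exclusions take precedence over inclusions.
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)


# Hypothetical site lists: one wildcard pattern, one individual page,
# and one excluded subtree.
include = ["www.example.com/docs/*", "www.example.com/faq.html"]
exclude = ["www.example.com/docs/internal/*"]

print(is_crawled("www.example.com/docs/setup.html", include, exclude))
print(is_crawled("www.example.com/docs/internal/notes.html", include, exclude))
```

A check like this can help you review a site list before you submit it, but it does not replace verifying the indexed result in the console.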
Import data
You can import your data from either BigQuery or Cloud Storage. This data can be structured or unstructured, and it can be provided with or without metadata.
The following Data Import Options are available:
- Add/Update Data: The provided documents are added to the data store. If a new document has the same ID as an old document, the new document replaces the old document.
- Override Existing Data: All old data is deleted, then new data is uploaded. This is irreversible.
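The difference between the two import options can be illustrated with plain dictionaries keyed by document ID. This is only a conceptual sketch of the semantics described above, not the service's implementation; the function names and sample documents are hypothetical.

```python
def add_or_update(store, new_docs):
    """Add/Update Data: keep old documents; a matching ID is replaced."""
    merged = dict(store)
    merged.update(new_docs)
    return merged


def override_existing(store, new_docs):
    """Override Existing Data: all old documents are dropped."""
    return dict(new_docs)


# Hypothetical data store contents and a new import batch.
store = {"d001": "old first", "d002": "old second"}
incoming = {"d002": "new second", "d003": "new third"}

print(add_or_update(store, incoming))      # d001 kept, d002 replaced, d003 added
print(override_existing(store, incoming))  # only d002 and d003 remain
```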
Structured data store
Structured data stores can hold answers to frequently asked questions (FAQ). When user questions are matched with high confidence to an uploaded question, the agent returns the answer to that question without any modification. You can provide a title and a URL for each question and answer pair that is displayed by the agent.
Data must be uploaded to the data store in CSV format. Each file must have a header row describing the columns.
For example:
"question","answer","title","url"
"Why is the sky blue?","The sky is blue because of Rayleigh scattering.","Rayleigh scattering","https://en.wikipedia.org/wiki/Rayleigh_scattering"
"What is the meaning of life?","42","",""
The title and url columns are optional and can be omitted:
"answer","question"
"42","What is the meaning of life?"
During the upload process, you can select a folder, in which case each file in the folder is treated as a CSV file regardless of its extension.
Limitations:
- An extra space character after a comma (,) causes an error.
- Blank lines (even at the end of the file) cause an error.
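One way to avoid both limitations is to generate the file with a CSV library instead of writing it by hand. The following is a minimal sketch using Python's standard csv module; the function name and sample rows are hypothetical. It quotes every field, as in the examples above, writes no space after commas, and ends each row with a single newline so that no blank line is produced.

```python
import csv
import io

FIELDS = ["question", "answer", "title", "url"]


def build_faq_csv(rows):
    """Return CSV text with a header row and one line per question/answer pair."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        quoting=csv.QUOTE_ALL,   # quote every field, as in the examples above
        lineterminator="\n",     # one newline per row, no blank lines
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "question": row["question"],
            "answer": row["answer"],
            "title": row.get("title", ""),  # optional column
            "url": row.get("url", ""),      # optional column
        })
    return buffer.getvalue()


faq_text = build_faq_csv([
    {"question": "What is the meaning of life?", "answer": "42"},
])
print(faq_text, end="")
```

Whether the service accepts the single newline that terminates the last row is not stated in this document; if an upload fails, removing it is one thing to try.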
Unstructured data store
Unstructured data stores can contain content in the following formats:
- HTML
- TXT
- CSV
Limitations:
- The maximum file size is 2.5 MB for text-based formats and 100 MB for other formats.
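A size check before upload can catch oversized files early. The sketch below makes two assumptions that this document does not confirm: that 1 MB means 1024 * 1024 bytes, and that the text-based formats are the ones with .html, .txt, and .csv extensions. The function and constant names are hypothetical.

```python
import os

# Assumption: these extensions count as "text-based formats".
TEXT_EXTENSIONS = {".html", ".txt", ".csv"}

# Assumption: 1 MB is 1024 * 1024 bytes.
MB = 1024 * 1024
TEXT_LIMIT_BYTES = int(2.5 * MB)
OTHER_LIMIT_BYTES = 100 * MB


def within_size_limit(filename, size_bytes):
    """Return True if a file of this size fits the limit for its format."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in TEXT_EXTENSIONS:
        return size_bytes <= TEXT_LIMIT_BYTES
    return size_bytes <= OTHER_LIMIT_BYTES


print(within_size_limit("faq.txt", 2 * MB))
print(within_size_limit("faq.txt", 3 * MB))
```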
Data store with metadata
A title and URL can be provided as metadata. When the agent is in a conversation with a user, the agent can provide this information to the user. This can help users to quickly link to internal web pages not accessible by the Google Search indexer.
To import content with metadata, you provide one or more JSON Lines files. Each line of this file describes one document. You do not directly upload the actual documents; URIs that link to the Cloud Storage paths are provided in the JSON Lines file.
When providing your JSON Lines files, you provide a Cloud Storage folder that contains these files. Do not put any other files in this folder.
Field descriptions:
Field | Type | Description |
---|---|---|
id | string | Unique identifier for the document. |
content.mimeType | string | MIME type of the document. "application/pdf" and "text/html" are supported. |
content.uri | string | URI for the document in Cloud Storage. |
structData | string | Single line JSON object with optional title and url fields. |
For example:
{ "id": "d001", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/first_doc.pdf"}, "structData": {"title": "First Document", "url": "https://internal.example.com/documents/first_doc.pdf"} }
{ "id": "d002", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/second_doc.pdf"}, "structData": {"title": "Second Document", "url": "https://internal.example.com/documents/second_doc.pdf"} }
{ "id": "d003", "content": {"mimeType": "text/html", "uri": "gs://example-import/unstructured/mypage.html"}, "structData": {"title": "My Page", "url": "https://internal.example.com/mypage.html"} }
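A JSON Lines file in this shape can be generated with a short script. The following is a minimal sketch using Python's standard json module; the function names are hypothetical, and structData is written as a nested object to match the example above. It rejects MIME types other than the two listed in the field descriptions.

```python
import json

# The two MIME types listed as supported in the field descriptions.
SUPPORTED_MIME_TYPES = {"application/pdf", "text/html"}


def metadata_line(doc_id, mime_type, uri, title=None, url=None):
    """Return one single-line JSON record describing a document."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    struct_data = {}
    if title is not None:
        struct_data["title"] = title
    if url is not None:
        struct_data["url"] = url
    record = {
        "id": doc_id,
        "content": {"mimeType": mime_type, "uri": uri},
        "structData": struct_data,
    }
    return json.dumps(record)  # json.dumps emits no line breaks by default


def build_jsonl(docs):
    """Return JSON Lines text with one record per document."""
    return "".join(metadata_line(**doc) + "\n" for doc in docs)


jsonl_text = build_jsonl([
    {
        "doc_id": "d001",
        "mime_type": "application/pdf",
        "uri": "gs://example-import/unstructured/first_doc.pdf",
        "title": "First Document",
        "url": "https://internal.example.com/documents/first_doc.pdf",
    },
    {
        "doc_id": "d003",
        "mime_type": "text/html",
        "uri": "gs://example-import/unstructured/mypage.html",
        "title": "My Page",
        "url": "https://internal.example.com/mypage.html",
    },
])
print(jsonl_text, end="")
```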
Data store without metadata
This type of content has no metadata. Just provide the documents to import. The content type is determined by the file extension.
Parse and chunk configuration
Depending on the data source, you might be able to configure parse and chunk settings as defined by Vertex AI Search.
Create a data store
To create a data store:
Go to the Agent Builder console.
Select your project from the console drop-down.
Read and agree to the Terms of Service, then click Continue and activate the API.
Click Data Stores in the left navigation.
Click New Data Store.
Choose a data source.
Enable Advanced website indexing. This is required for data store agents.
Provide data and configuration for the data store source you selected. Your data store location should correspond to the agent location.
Click Create to create the data store.
Optionally set the data store language:
- From the list of data stores, click the data store you just created.
- Click the edit button for the language setting.
- Select a language and click the check to apply.
Using Cloud Storage for a data store document
If your content is not public, storing your content in Cloud Storage is the recommended option. When creating data store documents, you provide the URLs for your Cloud Storage objects in the form gs://bucket-name/folder-name.
Each document within the folder is added to the data store.
When creating the Cloud Storage bucket:
- Be sure that you have selected the project you use for the agent.
- Use the Standard Storage class.
- Set the bucket location to the same location as your agent.
Follow the Cloud Storage quickstart instructions to create a bucket and upload files.
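Before supplying a folder URL, it can be useful to confirm that it has the gs://bucket-name/folder-name form. The sketch below uses only Python's standard library to split such a URL into its bucket and folder parts; the function name is hypothetical, and the check covers the URL shape only, not whether the bucket or folder exists.

```python
from urllib.parse import urlsplit


def parse_gcs_folder(uri):
    """Split a gs://bucket-name/folder-name URL into (bucket, folder)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a Cloud Storage URL: {uri}")
    # Drop the leading and trailing slashes around the folder path.
    return parts.netloc, parts.path.strip("/")


bucket, folder = parse_gcs_folder("gs://bucket-name/folder-name")
print(bucket, folder)
```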
Languages
For supported languages, see the data store column in the language reference.
For best performance, create each data store in a single language.
After creating a data store, you can optionally specify the data store language. If you set the data store language, you can connect the data store to an agent that is configured for a different language. For example, you can create a French data store that is connected to an English agent.
Supported regions
For supported regions, see the region reference.