How you prepare data depends on the kind of data you're importing and the way you choose to import it. Start with what kind of data you plan to import:
For limitations that apply to blended search, where multiple data stores can be connected to a single app, see About connecting multiple data stores.
Third-party data
After you connect a third-party data source, Agentspace Enterprise ingests the data from the data source and syncs with it at a frequency you specify.
Before creating a third-party data connector:
- An administrator for your identity provider must set up access control for your data source. For information about setting up access control and to review permissions needed for third-party data, see Identity and permissions.
- An administrator for the third-party application must review the required credentials to connect a data source and set up authentication and permissions. For information about obtaining credentials, go to the section for the third-party data source that you plan to ingest from in Connect a third-party data source.
Unstructured data
Agentspace Enterprise supports search over documents that are in HTML, PDF with embedded text, and TXT format. PPTX and DOCX formats are available in Preview.
You import your documents from a Cloud Storage bucket. You can import using the Google Cloud console, the `ImportDocuments` method, or streaming ingestion through CRUD methods. For API reference information, see `DocumentService` and `documents`.
The following table lists the file size limits of each file type with different configurations (for more information, see Parse and chunk documents). You can import up to 100,000 files at a time.
File type | Default import | Import with layout-aware document chunking | Import with layout parser |
---|---|---|---|
Text-based files such as HTML, TXT, JSON, XHTML, and XML | < 2.5 MB | < 10 MB | < 10 MB |
PPTX, DOCX, and XLSX | < 200 MB | < 200 MB | < 200 MB |
PDF | < 200 MB | < 200 MB | < 40 MB |
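As an illustration, these limits can be enforced in a local pre-flight check before you stage files for import. The extension-to-limit mapping below is an assumption based on the default-import column of the table, so verify the thresholds against the current documentation before relying on them:

```python
import os

# Assumed default-import size limits (bytes), taken from the table above.
DEFAULT_LIMITS = {
    ".html": int(2.5 * 1024 * 1024),
    ".txt": int(2.5 * 1024 * 1024),
    ".json": int(2.5 * 1024 * 1024),
    ".xhtml": int(2.5 * 1024 * 1024),
    ".xml": int(2.5 * 1024 * 1024),
    ".pptx": 200 * 1024 * 1024,
    ".docx": 200 * 1024 * 1024,
    ".xlsx": 200 * 1024 * 1024,
    ".pdf": 200 * 1024 * 1024,
}

MAX_FILES_PER_IMPORT = 100_000  # per-import file-count limit from the docs


def check_files(paths_and_sizes):
    """Return the (path, size) pairs that exceed the default-import limits."""
    if len(paths_and_sizes) > MAX_FILES_PER_IMPORT:
        raise ValueError("more than 100,000 files in one import")
    too_big = []
    for path, size in paths_and_sizes:
        ext = os.path.splitext(path)[1].lower()
        limit = DEFAULT_LIMITS.get(ext)
        if limit is not None and size >= limit:
            too_big.append((path, size))
    return too_big
```

Running the check against the list of files you plan to import lets you split oversized documents, or switch to a layout-parser configuration, before an import fails.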
If you plan to include embeddings in your unstructured data, see Use custom embeddings.
If you have non-searchable PDFs (scanned PDFs or PDFs with text inside images, such as infographics), we recommend turning on optical character recognition (OCR) processing during data store creation. This lets Agentspace Enterprise extract elements such as text blocks and tables. If you have searchable PDFs that are mostly machine-readable text and contain many tables, you can turn on OCR processing with the machine-readable text option enabled to improve detection and parsing. For more information, see Parse and chunk documents.
If you want to use Agentspace Enterprise for retrieval-augmented generation (RAG), turn on document chunking when you create your data store. For more information, see Parse and chunk documents.
You can import unstructured data from the following sources:
Cloud Storage
You can import data from Cloud Storage with or without metadata.
Data import is not recursive. That is, if there are folders within the bucket or folder that you specify, files within those folders are not imported.
If you plan to import documents from Cloud Storage without metadata, put your documents directly into a Cloud Storage bucket. The document ID is an example of metadata.
For testing, you can use the following publicly available Cloud Storage folders, which contain PDFs:
gs://cloud-samples-data/agentspace/search/alphabet-investor-pdfs
gs://cloud-samples-data/agentspace/search/CUAD_v1
gs://cloud-samples-data/agentspace/search/kaiser-health-surveys
gs://cloud-samples-data/agentspace/search/stanford-cs-224
If you plan to import data from Cloud Storage with metadata, put a JSON file that contains the metadata into a Cloud Storage bucket whose location you provide during import.
Your unstructured documents can be in the same Cloud Storage bucket as your metadata or a different one.
The metadata file must be a JSON Lines or an NDJSON file. The document ID is an example of metadata. Each row of the metadata file must follow one of the following JSON formats:
- Using `jsonData`:

  `{ "id": "<your-id>", "jsonData": "<JSON string>", "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }`

- Using `structData`:

  `{ "id": "<your-id>", "structData": { <JSON object> }, "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }`

Use the `uri` field in each row to point to the Cloud Storage location of the document.
Here is an example of an NDJSON metadata file for an unstructured document. In this example, each line of the metadata file points to a PDF document and contains the metadata for that document. The first two lines use `jsonData`, and the last two lines use `structData`. With `structData`, you don't need to escape quotation marks that appear within quotation marks.
```json
{"id":"doc-0","jsonData":"{\"title\":\"test_doc_0\",\"description\":\"This document uses a blue color theme\",\"color_theme\":\"blue\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_0.pdf"}}
{"id":"doc-1","jsonData":"{\"title\":\"test_doc_1\",\"description\":\"This document uses a green color theme\",\"color_theme\":\"green\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_1.pdf"}}
{"id":"doc-2","structData":{"title":"test_doc_2","description":"This document uses a red color theme","color_theme":"red"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_2.pdf"}}
{"id":"doc-3","structData":{"title":"test_doc_3","description":"This document uses a yellow color theme","color_theme":"yellow"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_3.pdf"}}
```
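Rather than escaping the quotation marks in `jsonData` by hand, you can generate rows like the ones above with Python's standard `json` module; a nested `json.dumps` call produces the escaped string for you. The bucket and document names below are placeholders:

```python
import json


def jsondata_row(doc_id, metadata, uri, mime_type="application/pdf"):
    """Build a metadata row in the jsonData format (metadata as an escaped JSON string)."""
    return json.dumps({
        "id": doc_id,
        "jsonData": json.dumps(metadata),  # nested dump yields the escaped string
        "content": {"mimeType": mime_type, "uri": uri},
    })


def structdata_row(doc_id, metadata, uri, mime_type="application/pdf"):
    """Build a metadata row in the structData format (metadata as a plain JSON object)."""
    return json.dumps({
        "id": doc_id,
        "structData": metadata,
        "content": {"mimeType": mime_type, "uri": uri},
    })


# Join one row per line to produce an NDJSON metadata file.
rows = [
    jsondata_row("doc-0", {"title": "test_doc_0", "color_theme": "blue"},
                 "gs://test-bucket-12345678/test_doc_0.pdf"),
    structdata_row("doc-2", {"title": "test_doc_2", "color_theme": "red"},
                   "gs://test-bucket-12345678/test_doc_2.pdf"),
]
ndjson = "\n".join(rows)
```

Because `json.dumps` never emits literal newlines inside a value, each row is guaranteed to stay on one line, which is what the JSON Lines format requires.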
To create your data store, see Create a first-party data store.
BigQuery
If you plan to import metadata from BigQuery, create a BigQuery table that contains metadata. The document ID is an example of metadata.
Put your unstructured documents into a Cloud Storage bucket.
Use the following BigQuery schema. Use the `uri` field in each record to point to the Cloud Storage location of the document.
```json
[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "content",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "mimeType",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "uri",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  }
]
```
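Rows destined for a table with this schema can be sanity-checked locally before you load them. The following helper is a hypothetical sketch (not part of any Google library) that enforces the shape above:

```python
import json


def validate_row(row):
    """Check one metadata row against the BigQuery schema above:
    id is a required STRING; jsonData, if present, must itself be a
    valid JSON string; content.uri should point at a gs:// location."""
    errors = []
    if not isinstance(row.get("id"), str) or not row["id"]:
        errors.append("id: required STRING")
    if "jsonData" in row:
        try:
            json.loads(row["jsonData"])
        except (TypeError, ValueError):
            errors.append("jsonData: must be a valid JSON string")
    content = row.get("content")
    if content is not None:
        if not str(content.get("uri", "")).startswith("gs://"):
            errors.append("content.uri: expected a gs:// URI")
    return errors
```

Catching a malformed `jsonData` string or a local file path in `uri` at this stage is cheaper than debugging a partially failed import afterward.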
For more information, see Create and use tables in the BigQuery documentation.
To create your data store, see Create a first-party data store.
Google Drive
Syncing data from Google Drive is supported for search.
If you plan to import data from Google Drive, you must set up Google Identity as your identity provider in Agentspace Enterprise. For information about setting up access control, see Identity and permissions.
To create your data store, see Create a first-party data store.
Structured data
Prepare your data according to the import method that you plan to use.
You can import structured data from the following sources:
When you import structured data from BigQuery or from Cloud Storage, you are given the option to import the data with metadata. (Structured with metadata is also referred to as enhanced structured data.)
BigQuery
You can import structured data from BigQuery datasets.
Your schema is auto-detected. After importing, Google recommends that you edit the auto-detected schema to map key properties, such as titles. If you import using the API instead of the Google Cloud console, you have the option to provide your own schema as a JSON object. For more information, see Provide or auto-detect a schema.
For examples of publicly available structured data, see the BigQuery public datasets.
If you plan to include embeddings in your structured data, see Use custom embeddings.
If you choose to import structured data with metadata, you include two fields in your BigQuery tables:

- An `id` field to identify the document. If you import structured data without metadata, the `id` is generated for you. Including metadata lets you specify the value of `id`.
- A `jsonData` field that contains the data. For examples of `jsonData` strings, see the preceding section Cloud Storage.
Use the following BigQuery schema for structured data with metadata imports:
```json
[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
```
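As a sketch, a structured record can be serialized into this `id`/`jsonData` shape with the standard `json` module; the record's field names below are illustrative, not required by the schema:

```python
import json


def to_metadata_row(doc_id, record):
    """Serialize a structured record into the id/jsonData row shape above."""
    return {"id": doc_id, "jsonData": json.dumps(record)}


# Example: a structured record with illustrative hotel fields.
row = to_metadata_row("hotel-10001", {
    "title": "Hotel 1",
    "rating": 3.7,
    "room_types": ["Deluxe", "Single", "Suite"],
})
```

The entire record, including nested objects and arrays, survives the round trip because `jsonData` is just the record's JSON serialization stored as a string.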
For instructions on creating your data store, see Create a first-party data store.
Cloud Storage
Structured data in Cloud Storage must be in either JSON Lines or NDJSON format. Each file must be 2 GB or smaller. You can import up to 100 files at a time.
For examples of publicly available structured data, refer to the following folders in Cloud Storage, which contain NDJSON files:
gs://cloud-samples-data/agentspace/search/kaggle_movies
gs://cloud-samples-data/agentspace/search/austin_311
If you plan to include embeddings in your structured data, see Use custom embeddings.
Here is an example of an NDJSON metadata file of structured data. Each line of the file represents a document and is made up of a set of fields.
```json
{"hotel_id": 10001, "title": "Hotel 1", "location": {"address": "1600 Amphitheatre Parkway, Mountain View, CA 94043"}, "available_date": "2024-02-10", "non_smoking": true, "rating": 3.7, "room_types": ["Deluxe", "Single", "Suite"]}
{"hotel_id": 10002, "title": "Hotel 2", "location": {"address": "Manhattan, New York, NY 10001"}, "available_date": "2023-07-10", "non_smoking": false, "rating": 5.0, "room_types": ["Deluxe", "Double", "Suite"]}
{"hotel_id": 10003, "title": "Hotel 3", "location": {"address": "Moffett Park, Sunnyvale, CA 94089"}, "available_date": "2023-06-24", "non_smoking": true, "rating": 2.5, "room_types": ["Double", "Penthouse", "Suite"]}
```
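Before importing, you can verify locally that a file parses as NDJSON and respects the 2 GB per-file limit. A minimal sketch, assuming a local copy of the file:

```python
import json
import os

MAX_FILE_BYTES = 2 * 1024 ** 3  # 2 GB per-file limit
MAX_FILES_PER_IMPORT = 100      # per-import file-count limit


def check_ndjson(path):
    """Verify an NDJSON structured-data file: size limit and one JSON object per line."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError(f"{path}: exceeds the 2 GB per-file limit")
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{lineno}: expected a JSON object")
```

Running this over each file before import surfaces the exact line number of any malformed record, which the import error messages may not.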
To create your data store, see Create a first-party data store.
Local JSON data
You can directly upload a JSON document or object using the API.
Google recommends providing your own schema as a JSON object for better results. If you don't provide your own schema, the schema is auto-detected. After importing, we recommend that you edit the auto-detected schema to map key properties, such as titles. For more information, see Provide or auto-detect a schema.
If you plan to include embeddings in your structured data, see Use custom embeddings.
To create your data store, see Create a first-party data store.