This page describes how to refresh structured and unstructured data in your data store.
Refresh structured data
You can refresh the data in a structured data store as long as you use a schema that is the same or backward compatible with the schema in the data store. For example, adding only new fields to an existing schema is backward compatible.
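For illustration, if the existing schema declares only a `title` property, then an updated schema like the following is backward compatible because it only adds a new `category` field; removing `title` or changing its type would not be. The field names here are assumptions for the example, not part of any required schema:

```json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "category": {"type": "string"}
  }
}
```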
You can refresh structured data in the Google Cloud console or using the API.
Console
To use the Google Cloud console to refresh structured data from a branch of a data store, follow these steps:
In the Google Cloud console, go to the Agentspace page.
In the navigation menu, click Data Stores.
In the Name column, click the data store that you want to edit.
On the Documents tab, click Import data.
To refresh from Cloud Storage:
- In the Select a data source pane, select Cloud Storage.
- In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
- Under Data Import Options, select an import option.
- Click Import.
To refresh from BigQuery:
- In the Select a data source pane, select BigQuery.
- In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
- Under Data Import Options, select an import option.
- Click Import.
REST
Use the `documents.import` method to refresh your data, specifying the appropriate `reconciliationMode` value.
To refresh structured data from BigQuery or Cloud Storage using the command line, follow these steps:
Find your data store ID. If you already have your data store ID, skip to the next step.
In the Google Cloud console, go to the Agentspace page and in the navigation menu, click Data Stores.
Click the name of your data store.
On the Data page for your data store, get the data store ID.
To import your structured data from BigQuery, call the following method. You can import either from BigQuery or Cloud Storage. To import from Cloud Storage, skip to the next step.
```sh
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
  -d '{
    "bigquerySource": {
      "projectId": "PROJECT_ID",
      "datasetId": "DATASET_ID",
      "tableId": "TABLE_ID",
      "dataSchema": "DATA_SCHEMA_BQ"
    },
    "reconciliationMode": "RECONCILIATION_MODE",
    "autoGenerateIds": AUTO_GENERATE_IDS,
    "idField": "ID_FIELD",
    "errorConfig": {
      "gcsPrefix": "ERROR_DIRECTORY"
    }
  }'
```
Replace the following:
- PROJECT_ID: the ID of your project.
- DATA_STORE_ID: the ID of the data store.
- DATASET_ID: the name of your BigQuery dataset.
- TABLE_ID: the name of your BigQuery table.
- DATA_SCHEMA_BQ: an optional field to specify the schema to use when parsing data from the BigQuery source. Can have the following values:
  - `document`: the default value. The BigQuery table that you use must conform to the following default BigQuery schema. You can define the ID of each document yourself, while wrapping all of the data in the `json_data` string.
  - `custom`: any BigQuery table schema is accepted, and Google Agentspace Enterprise automatically generates the IDs for each document that is imported.
- ERROR_DIRECTORY: an optional field to specify a Cloud Storage directory for error information about the import, for example, `gs://<your-gcs-bucket>/directory/import_errors`. Google recommends leaving this field empty to let Agentspace Enterprise automatically create a temporary directory.
- RECONCILIATION_MODE: an optional field to specify how the imported documents are reconciled with the existing documents in the destination data store. Can have the following values:
  - `INCREMENTAL`: the default value. Causes an incremental refresh of data from BigQuery to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID.
  - `FULL`: causes a full rebase of the documents in your data store. New and updated documents are added to your data store, and documents that are not in BigQuery are removed from your data store. The `FULL` mode is helpful if you want to automatically delete documents that you no longer need.
- AUTO_GENERATE_IDS: an optional field to specify whether to automatically generate document IDs. If set to `true`, document IDs are generated based on a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting `reconciliationMode` to `FULL` to maintain consistent document IDs.
  Specify `autoGenerateIds` only when `bigquerySource.dataSchema` is set to `custom`. Otherwise, an `INVALID_ARGUMENT` error is returned. If you don't specify `autoGenerateIds` or set it to `false`, you must specify `idField`. Otherwise, the documents fail to import.
- ID_FIELD: an optional field to specify which fields are the document IDs. For BigQuery source files, `idField` indicates the name of the column in the BigQuery table that contains the document IDs.
  Specify `idField` only when both of these conditions are satisfied; otherwise, an `INVALID_ARGUMENT` error is returned:
  - `bigquerySource.dataSchema` is set to `custom`.
  - `autoGenerateIds` is set to `false` or is unspecified.
  Additionally, the value of the BigQuery column name must be of string type, must be between 1 and 63 characters, and must conform to RFC 1034. Otherwise, the documents fail to import.
Here is the default BigQuery schema. Your BigQuery table must conform to this schema when you set `dataSchema` to `document`.

```json
[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "json_data",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
```
To import your structured data from Cloud Storage, call the following method. You can import either from BigQuery or Cloud Storage. To import from BigQuery, go to the previous step.
```sh
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
  -d '{
    "gcsSource": {
      "inputUris": ["GCS_PATHS"],
      "dataSchema": "DATA_SCHEMA_GCS"
    },
    "reconciliationMode": "RECONCILIATION_MODE",
    "idField": "ID_FIELD",
    "errorConfig": {
      "gcsPrefix": "ERROR_DIRECTORY"
    }
  }'
```
Replace the following:
- PROJECT_ID: the ID of your project.
- DATA_STORE_ID: the ID of the data store.
- GCS_PATHS: a list of comma-separated URIs to Cloud Storage locations from where you want to import. Each URI can be up to 2,000 characters long. URIs can match the full path for a storage object or can match the pattern for one or more objects. For example, `gs://bucket/directory/*.json` is a valid path.
- DATA_SCHEMA_GCS: an optional field to specify the schema to use when parsing data from the Cloud Storage source. Can have the following values:
  - `document`: the default value. The data that you import must conform to the default schema shown earlier in this procedure. You can define the ID of each document yourself, while wrapping all of the data in the `json_data` string.
  - `custom`: any schema is accepted, and Google Agentspace Enterprise automatically generates the IDs for each document that is imported.
- ERROR_DIRECTORY: an optional field to specify a Cloud Storage directory for error information about the import, for example, `gs://<your-gcs-bucket>/directory/import_errors`. Google recommends leaving this field empty to let Agentspace Enterprise automatically create a temporary directory.
- RECONCILIATION_MODE: an optional field to specify how the imported documents are reconciled with the existing documents in the destination data store. Can have the following values:
  - `INCREMENTAL`: the default value. Causes an incremental refresh of data from Cloud Storage to your data store. This does an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID.
  - `FULL`: causes a full rebase of the documents in your data store. New and updated documents are added to your data store, and documents that are not in Cloud Storage are removed from your data store. The `FULL` mode is helpful if you want to automatically delete documents that you no longer need.
Python
Before trying this sample, follow the Python setup instructions in the Agentspace Enterprise quickstart using client libraries. For more information, see the Agentspace Enterprise Python API reference documentation.
To authenticate to Agentspace Enterprise, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
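The following is a minimal sketch of a structured-data refresh using the `google-cloud-discoveryengine` client library. All project, data store, dataset, and table identifiers below are placeholder assumptions; substitute your own values.

```python
from google.cloud import discoveryengine

# Placeholder identifiers for this sketch.
PROJECT_ID = "my-project"
DATA_STORE_ID = "my-data-store"

client = discoveryengine.DocumentServiceClient()

# The import target is the default branch of the data store.
parent = client.branch_path(
    project=PROJECT_ID,
    location="global",
    data_store=DATA_STORE_ID,
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    # Refresh from BigQuery. To refresh from Cloud Storage instead, pass
    # gcs_source=discoveryengine.GcsSource(...) in place of bigquery_source.
    bigquery_source=discoveryengine.BigQuerySource(
        project_id=PROJECT_ID,
        dataset_id="my_dataset",
        table_id="my_table",
        data_schema="document",
    ),
    # INCREMENTAL upserts documents; FULL also removes documents that are
    # no longer present in the source.
    reconciliation_mode=(
        discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL
    ),
)

# import_documents returns a long-running operation; block until it finishes.
operation = client.import_documents(request=request)
response = operation.result()
print(response)
```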
Refresh unstructured data
You can refresh unstructured data in the Google Cloud console or using the API.
Console
To use the Google Cloud console to refresh unstructured data from a branch of a data store, follow these steps:
In the Google Cloud console, go to the Agentspace page.
In the navigation menu, click Data Stores.
In the Name column, click the data store that you want to edit.
On the Documents tab, click Import data.
To ingest from a Cloud Storage bucket (with or without metadata):
- In the Select a data source pane, select Cloud Storage.
- In the Import data from Cloud Storage pane, click Browse, select the bucket that contains your refreshed data, and then click Select. Alternatively, enter the bucket location directly in the gs:// field.
- Under Data Import Options, select an import option.
- Click Import.
To ingest from BigQuery:
- In the Select a data source pane, select BigQuery.
- In the Import data from BigQuery pane, click Browse, select a table that contains your refreshed data, and then click Select. Alternatively, enter the table location directly in the BigQuery path field.
- Under Data Import Options, select an import option.
- Click Import.
REST
To refresh unstructured data using the API, re-import it using the `documents.import` method, specifying the appropriate `reconciliationMode` value. For more information about importing unstructured data, see Unstructured data.
Python
Before trying this sample, follow the Python setup instructions in the Agentspace Enterprise quickstart using client libraries. For more information, see the Agentspace Enterprise Python API reference documentation.
To authenticate to Agentspace Enterprise, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
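As a minimal sketch, assuming the same `google-cloud-discoveryengine` client library, an unstructured refresh from Cloud Storage might look like the following. The bucket path and data store ID are placeholder assumptions.

```python
from google.cloud import discoveryengine

client = discoveryengine.DocumentServiceClient()

# Placeholder identifiers; substitute your own values.
parent = client.branch_path(
    project="my-project",
    location="global",
    data_store="my-unstructured-data-store",
    branch="default_branch",
)

operation = client.import_documents(
    request=discoveryengine.ImportDocumentsRequest(
        parent=parent,
        # "content" is the data schema typically used for unstructured
        # documents such as PDFs and HTML files.
        gcs_source=discoveryengine.GcsSource(
            input_uris=["gs://my-bucket/directory/*.pdf"],
            data_schema="content",
        ),
        # FULL removes documents from the data store that are no longer
        # present in Cloud Storage; INCREMENTAL only upserts.
        reconciliation_mode=(
            discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL
        ),
    )
)

# Block until the long-running import operation completes.
print(operation.result())
```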