Schemas: auto-detecting versus providing your own

When you import structured data using the Google Cloud console, Vertex AI Agent Builder auto-detects the schema. You can either use this auto-detected schema in your engine or use the API to provide a schema to indicate the structure of the data.

If you provide a schema and later update it with a new schema, the new schema must be backward compatible with the original. Otherwise the schema update fails.

For reference information about the schema, see dataStores.schemas.

Approaches to providing the schema for your data store

There are three approaches to determining the schema for structured data.

Auto-detect alone

Let Vertex AI Agent Builder auto-detect the data in the data store and provide the schema for you.

  • Pros: This is the most straightforward approach and is recommended when you create your first test engine.

  • Cons: The quality of the search results and recommendations is typically not as good as with the other approaches.

Auto-detect and edit

Let Vertex AI Agent Builder auto-detect and suggest an initial schema, and then refine the schema through the console interface or through the command line. In particular, we highly recommend that, after your fields have been auto-detected, you map key properties to all the important fields.

  • Pros: This approach provides higher quality results than using auto-detect alone.

  • Cons: This approach takes the longest of the three because it is a multi-step process and requires indexing and re-indexing the data. However, this can be partially mitigated by uploading a small amount of data to start. See About auto-detect and edit.

Providing the schema as a JSON object

Using the command line, or programmatically, you provide the schema to Vertex AI Agent Builder as a JSON object.

  • Pros: This approach is faster than auto-detect followed by editing and can provide equally good results.

  • Cons: This approach is not available from the console. You need to have prepared a correct JSON object.

Default schema for media data stores

If you create a data store for media, the default schema is applied to that data store. Auto-detect appends to the schema any new fields that it finds during data ingestion.

For information about creating a data store for media, see Create media apps and data stores.

About auto-detect and edit

Because it takes a long time to index large data stores and because data stores have to be re-indexed after the schema has been edited, the fastest way to apply the auto-detect and edit approach is as follows:

  1. Ingest a small but representative subset of your data.

    For instructions on ingesting data, see Create a search data store or Create a recommendations data store.

    Auto-detect will create a schema based on this ingested data. If you are ingesting only a small amount of data, it won't take long to create the schema and index the data.

  2. Review the schema suggested by auto-detect:

    1. Map the key properties to fields. For more information about mapping properties, see Configure field settings.

    2. Add new fields, if necessary.

      For example, if a few of your documents include a thumbnail, but none of the documents in your initial ingestion had the thumbnail, then add a thumbnail field to the schema. Or, if you know that a field isn't yet in your data but that it will be coming soon, add it now.

      For instructions on how to add fields to a schema, see Update your schema.

  3. After confirming that everything looks good, save the schema.

  4. Ingest the rest of your data into the data store. For a large data store, you can expect to wait some hours before the data store is indexed.
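Step 1 above depends on picking a small subset of documents. The following is a minimal Python sketch of one way to draw that subset, assuming your documents are available as one JSON object per line. The function name sample_lines and the sample documents are hypothetical and are not part of Vertex AI Agent Builder.

```python
import random


def sample_lines(lines, k, seed=0):
    """Return up to k items chosen uniformly at random (reservoir sampling).

    Works on any iterable, so a large file can be streamed line by line
    without loading it all into memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Hypothetical documents, one JSON object per line.
docs = ['{"id": %d, "title": "Doc %d"}' % (n, n) for n in range(1000)]
subset = sample_lines(docs, k=50)
print(len(subset))
```

A uniform random sample can miss fields that appear in only a few documents, which is why step 2 includes adding fields by hand.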

About providing your own schema as a JSON object

Except in the case of media data stores, you can define your own schema using the JSON Schema format, which is an open source, declarative language to define, annotate, and validate JSON documents. For example, this is a valid JSON schema annotation:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "keyPropertyMapping": "title",
      "retrievable": true,
      "completable": true
    },
    "description": {
      "type": "string",
      "keyPropertyMapping": "description"
    },
    "categories": {
      "type": "array",
      "items": {
        "type": "string",
        "keyPropertyMapping": "category",
        "dynamicFacetable": true
      }
    },
    "uri": {
      "type": "string",
      "keyPropertyMapping": "uri"
    },
    "locations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "address": {
            "type": "string",
            "indexable": true,
            "searchable": true
          },
          "lat": {
            "type": "number",
            "indexable": true
          },
          "lng": {
            "type": "number",
            "indexable": true
          }
        }
      }
    }
  }
}
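Before submitting a schema like this one, it can help to confirm which of your fields carry each key property. The following Python sketch walks a schema and collects the keyPropertyMapping values it finds, including those placed on array items. The helper name key_property_mappings and the trimmed schema (with a field named my_title) are illustrative assumptions, not part of the API.

```python
import json

# A trimmed, hypothetical schema in the same shape as the example above.
SCHEMA_JSON = """
{
  "type": "object",
  "properties": {
    "my_title": {"type": "string", "keyPropertyMapping": "title"},
    "categories": {
      "type": "array",
      "items": {"type": "string", "keyPropertyMapping": "category"}
    },
    "uri": {"type": "string", "keyPropertyMapping": "uri"}
  }
}
"""


def key_property_mappings(schema, path=""):
    """Yield (key_property, field_path) pairs found anywhere in the schema."""
    if "keyPropertyMapping" in schema:
        yield schema["keyPropertyMapping"], path
    # For arrays, the annotation may sit on the item definition.
    if schema.get("type") == "array" and "items" in schema:
        yield from key_property_mappings(schema["items"], path)
    # For objects, descend into each named property.
    for name, sub in schema.get("properties", {}).items():
        child = f"{path}.{name}" if path else name
        yield from key_property_mappings(sub, child)


schema = json.loads(SCHEMA_JSON)
mappings = dict(key_property_mappings(schema))
print(mappings)
```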

Here are some of the fields in this schema example:

  • keyPropertyMapping. A field that maps predefined keywords to critical fields in your documents, helping to clarify their semantic meaning. Values include title, description, uri, and category. Note that your field name doesn't need to match the keyPropertyMapping value. For example, for a field that you named my_title, you can include a keyPropertyMapping field with a value of title. For Vertex AI Search: fields marked with keyPropertyMapping are by default indexable and searchable, but not retrievable, completable, or dynamicFacetable. This means that you don't need to include the indexable or searchable fields with a keyPropertyMapping field to get the expected default behavior.

  • type. The type of the field. This is a string value that is integer, datetime, geolocation, or one of the primitive types (boolean, object, array, number, or string).
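The default behavior described for keyPropertyMapping can be sketched in Python as follows. This is only an illustration of the rules stated in this section for Vertex AI Search, not the service's implementation. The function name effective_flags is hypothetical.

```python
def effective_flags(field):
    """Return the settings a field would have once defaults are applied.

    Assumption (from the text): a field with keyPropertyMapping is indexable
    and searchable by default; every other setting defaults to False.
    Explicit values in the field definition override the defaults.
    """
    has_key_property = "keyPropertyMapping" in field
    defaults = {
        "indexable": has_key_property,
        "searchable": has_key_property,
        "retrievable": False,
        "completable": False,
        "dynamicFacetable": False,
    }
    return {flag: field.get(flag, defaults[flag]) for flag in defaults}


title_field = {"type": "string", "keyPropertyMapping": "title", "retrievable": True}
plain_field = {"type": "number"}
print(effective_flags(title_field))
print(effective_flags(plain_field))
```

Under these assumptions, the title field comes out indexable, searchable, and retrievable, while the plain number field has every setting off until you enable it explicitly.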

The following fields apply only for search apps:

  • retrievable. Indicates whether this field can be returned in a search response. This can be set for fields of type number, string, boolean, integer, datetime, and geolocation. A maximum of 50 fields can be set as retrievable. User-defined fields and keyPropertyMapping fields are not retrievable by default. To make a field retrievable, include "retrievable": true with the field.

  • indexable. Indicates whether this field can be filtered, faceted, boosted, or sorted in the servingConfigs.search method. This can be set for fields of type number, string, boolean, integer, datetime, and geolocation. A maximum of 50 fields can be set as indexable. User-defined fields are not indexable by default, except for fields containing the keyPropertyMapping field. To make a field indexable, include "indexable": true with the field.

  • dynamicFacetable. Indicates that the field can be used as a dynamic facet. This can be set for fields of type number, string, boolean, and integer. To make a field dynamically facetable, include "dynamicFacetable": true with the field.

  • searchable. Indicates whether this field can be reverse indexed to match unstructured text queries. This can only be set for fields of type string. A maximum of 50 fields can be set as searchable. User-defined fields are not searchable by default, except for fields containing the keyPropertyMapping field. To make a field searchable, include "searchable": true with the field.

  • completable. Indicates whether this field can be returned as an autocomplete suggestion. This can only be set for fields of type string. To make a field completable, include "completable": true with the field.
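The type restrictions and field limits listed above can be checked locally before you submit a schema. The following Python sketch is one possible pre-flight check. The names iter_nodes and lint_schema are hypothetical. The sketch counts only fields that set a flag explicitly, because this section doesn't state whether fields that are indexable or searchable by default count toward the limit of 50.

```python
# Types each setting may be used with, as listed in this section.
ALLOWED_TYPES = {
    "retrievable": {"number", "string", "boolean", "integer", "datetime", "geolocation"},
    "indexable": {"number", "string", "boolean", "integer", "datetime", "geolocation"},
    "dynamicFacetable": {"number", "string", "boolean", "integer"},
    "searchable": {"string"},
    "completable": {"string"},
}
# Settings with a documented maximum number of fields.
MAX_FIELDS = {"retrievable": 50, "indexable": 50, "searchable": 50}


def iter_nodes(schema, path=""):
    """Yield (path, node) for the schema root and every nested definition."""
    yield path, schema
    if isinstance(schema.get("items"), dict):
        yield from iter_nodes(schema["items"], path)
    for name, sub in schema.get("properties", {}).items():
        yield from iter_nodes(sub, f"{path}.{name}" if path else name)


def lint_schema(schema):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    counts = {flag: 0 for flag in MAX_FIELDS}
    for path, node in iter_nodes(schema):
        for flag, allowed in ALLOWED_TYPES.items():
            if node.get(flag) is not True:
                continue
            if node.get("type") not in allowed:
                problems.append(f"{path}: {flag} is not allowed for type {node.get('type')!r}")
            if flag in counts:
                counts[flag] += 1
    for flag, count in counts.items():
        if count > MAX_FIELDS[flag]:
            problems.append(f"{count} fields set {flag}; the limit is {MAX_FIELDS[flag]}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "keyPropertyMapping": "title", "retrievable": True},
        "lat": {"type": "number", "indexable": True, "searchable": True},
    },
}
print(lint_schema(schema))
```

Here the lat field is flagged because searchable can only be set on string fields. A passing check doesn't guarantee that the service accepts the schema.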

What's next