This page provides information about documents and data stores for media. If you're using media recommendations or media search, review the schema requirements for your documents and data stores on this page before uploading your data.
Overview
A document is any item that you upload into a Vertex AI Agent Builder data store. For media, a document typically contains metadata about media content, such as videos, news articles, music files, or podcasts. The Document object in the API captures this metadata.
Your data store contains a collection of documents that you have uploaded. When you create a data store, you specify that it will contain media documents. Data stores for media can only be attached to media apps, not to other app types such as generic search and recommendations. Data stores are represented in the API by the DataStore resource.
The quality of the data that you upload has a direct effect on the quality of the results that media apps provide. In general, the more accurate and specific the information you provide, the higher the quality of your results.
The data that you upload to the data store must be formatted in a specific JSON schema. The data must be in a BigQuery table, in a file or set of files in Cloud Storage, or in a JSON object that you upload directly using the Google Cloud console.
Google predefined schema versus custom schema
You have two options for your media data schema.
The Google predefined schema. If you haven't already designed a schema for your media data, the Google predefined schema is a good choice.
Your own schema. If you have your data already formatted in a schema, you can use your own schema, with the following requirement.
You must have fields in your schema that can be mapped to the five key properties for media:
title
uri
category
media_available_time
media_duration
The media_duration field is important for media recommendations apps where the business objective is to maximize the conversion rate (CVR) or the watch duration per visitor.
There are additional key properties that are not required, but for quality results, map as many of these as you can to your schema. These media properties are as follows:
description (highly recommended)
image
image_name
image_uri
language-code
media_aggregated_rating
media_aggregated_rating_count
media_aggregated_rating_score
media_aggregated_rating_source
media_content_index
media_content_rating
media_country_of_origin
media_expire_time
media_filter_tag
media_hash_tag
media_in_language
media_organization
media_organization_custom_role
media_organization_name
media_organization_rank
media_organization_role
media_organization_uri
media_person
media_person_custom_role
media_person_name
media_person_rank
media_person_role
media_person_uri
media_production_year
media_type
For more information about these properties, see Key properties. The names are similar but some vary slightly. (For example, some names are prefaced with media_ and some are pluralized.)
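If you bring your own schema, a quick check that every required key property has a field mapped to it can catch gaps before you upload. The following is a minimal sketch: the key property names come from the list above, while the custom field names (`headline`, `page_url`, and so on) and the helper `missing_key_properties` are hypothetical and used only for illustration.

```python
# The five required key properties, as listed in this section.
REQUIRED_KEY_PROPERTIES = {
    "title",
    "uri",
    "category",
    "media_available_time",
    "media_duration",
}


def missing_key_properties(field_mapping):
    """Return the required key properties that no custom field is mapped to.

    field_mapping maps a custom schema field name to a key property name.
    """
    mapped = set(field_mapping.values())
    return sorted(REQUIRED_KEY_PROPERTIES - mapped)


# Hypothetical custom schema: custom field name -> key property.
custom_mapping = {
    "headline": "title",
    "page_url": "uri",
    "genres": "category",
    "published_at": "media_available_time",
}

print(missing_key_properties(custom_mapping))
```

In this hypothetical mapping, nothing is mapped to media_duration, so the check reports it as missing.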
JSON Schema for Document
When using media, documents can use the Google predefined JSON schema for media.
Documents are uploaded with either a JSON or Struct data representation. Make sure the document JSON or Struct conforms to the following JSON schema. The JSON schema uses JSON Schema 2020-12 for validation. For more about JSON Schema, also see the JSON Schema specification documentation at json-schema.org.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "description": { "type": "string" },
    "media_type": { "type": "string" },
    "language_code": { "type": "string" },
    "categories": { "type": "array", "items": { "type": "string" } },
    "uri": { "type": "string" },
    "images": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "uri": { "type": "string" },
          "name": { "type": "string" }
        }
      }
    },
    "in_languages": { "type": "array", "items": { "type": "string" } },
    "country_of_origin": { "type": "string" },
    "content_index": { "type": "integer" },
    "persons": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" },
          "custom_role": { "type": "string" },
          "rank": { "type": "integer" },
          "uri": { "type": "string" }
        },
        "required": ["name", "role"]
      }
    },
    "organizations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" },
          "custom_role": { "type": "string" },
          "rank": { "type": "integer" },
          "uri": { "type": "string" }
        },
        "required": ["name", "role"]
      }
    },
    "hash_tags": { "type": "array", "items": { "type": "string" } },
    "filter_tags": { "type": "array", "items": { "type": "string" } },
    "duration": { "type": "string" },
    "content_rating": { "type": "array", "items": { "type": "string" } },
    "aggregate_ratings": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "rating_source": { "type": "string" },
          "rating_score": { "type": "number" },
          "rating_count": { "type": "integer" }
        },
        "required": ["rating_source"]
      }
    },
    "available_time": { "type": "string" },
    "expire_time": { "type": "string" },
    "production_year": { "type": "integer" }
  },
  "required": ["title", "categories", "uri", "available_time"]
}
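A lightweight pre-upload check can catch the most common violations of the schema above. The following sketch uses only the Python standard library and covers just the required fields and a handful of top-level types; it is not a full JSON Schema 2020-12 validator, and the helper name `check_document` is hypothetical.

```python
# Required top-level fields, from the "required" list in the schema above.
REQUIRED_FIELDS = ("title", "categories", "uri", "available_time")

# Expected Python types for a subset of the top-level properties.
EXPECTED_TYPES = {
    "title": str,
    "description": str,
    "media_type": str,
    "language_code": str,
    "categories": list,
    "uri": str,
    "duration": str,
    "available_time": str,
    "expire_time": str,
    "content_index": int,
    "production_year": int,
}


def check_document(doc):
    """Return a list of human-readable problems found in a media document."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in doc:
            problems.append(f"missing required field: {field}")
    for field, expected in EXPECTED_TYPES.items():
        if field not in doc:
            continue
        value = doc[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    # persons and organizations entries require both name and role.
    for key in ("persons", "organizations"):
        for entry in doc.get(key, []):
            for sub in ("name", "role"):
                if sub not in entry:
                    problems.append(f"{key} entry missing: {sub}")
    return problems


print(check_document({"title": "Test document title", "uri": 5}))
```

An empty list means that none of these particular checks failed; it does not guarantee that the document passes full schema validation.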
Sample JSON Document object
The following is an example of a JSON Document object.
{
  "title": "Test document title",
  "description": "Test document description",
  "media_type": "sports-game",
  "in_languages": [ "en-US" ],
  "language_code": "en-US",
  "categories": [ "sports > clip", "sports > highlight" ],
  "uri": "http://www.example.com",
  "images": [
    { "uri": "http://example.com/img1", "name": "image_1" }
  ],
  "country_of_origin": "US",
  "content_index": 0,
  "persons": [
    { "name": "sports person", "role": "player", "rank": 0, "uri": "http://example.com/person" }
  ],
  "organizations": [
    { "name": "sports team", "role": "team", "rank": 0, "uri": "http://example.com/team" }
  ],
  "hash_tags": [ "tag1" ],
  "filter_tags": [ "filter_tag" ],
  "duration": "100s",
  "production_year": 1900,
  "content_rating": [ "PG-13" ],
  "aggregate_ratings": [
    { "rating_source": "imdb", "rating_score": 4.5, "rating_count": 1250 }
  ],
  "available_time": "2022-08-26T23:00:17Z"
}
Document fields
This section lists the field values you provide when you create documents for your data store. The values should correspond to the values used in your internal document database, and should accurately reflect the item represented.
Document object fields
The following fields are top-level fields for the Document object. Also refer to these fields on the Document reference page.
Field | Notes |
---|---|
name
|
The full, unique resource name of the document. Required for all Document methods except for create and import. During import, the name is automatically generated and does not need to be manually provided.
|
id
|
The document ID used by your internal database. The ID field must be unique across your entire data store. The same value is used when you record a user event, and is also returned by the recommend and search methods.
|
schemaId
|
Required. The identifier of the schema located in the same data store. Should be set as "default_schema", which is automatically created when the default data store is created. |
parentDocumentId
|
The ID of the parent document. For top-level (root) documents, parentDocumentId can be empty or can point to itself. For child documents, parentDocumentId should point to a valid root document.
|
Key properties
The following properties are defined using the predefined JSON Schema format for media.
For more information about JSON properties, see the Understanding JSON Schema documentation for properties at json-schema.org.
The following table defines flat key properties.
Field name | Notes |
---|---|
title
|
String - required Document title from your database. A UTF-8 encoded string. Limited to 1000 characters. |
categories
|
String - required - repeated Document categories. This property is repeated so that one document can belong to several parallel categories. Use the full category path for higher quality results.
To represent the full path of a category, separate the levels with the > symbol, as in the sample Document object. For example: "sports > highlight"
A document can contain at most 250 categories. Each category is a UTF-8 encoded string with a length limit of 5000 characters. |
uri
|
String - required URI of the document. Length limit of 5000 characters. |
description
|
String - highly recommended Description of the document. Length limit of 5000 characters. |
media_type
|
String - required for movies and shows Top-level category.
Supported types:
The values |
language_code
|
String - optional Language of the title/description and other string attributes. Use language tags defined by BCP 47. For document recommendation, this field is ignored and the text language is detected automatically. The document can include text in different languages, but duplicating documents to provide text in multiple languages can result in degraded performance.
For document search, this field is used. It defaults to "en-US" if unset.
For example, |
duration
|
String - required for media recommendations apps where the business objective is conversion rate (CVR) or watch duration per session.
Duration of the media content. Duration should be encoded as a string.
Encoding should be the same as the |
available_time
|
String - required The time that the content becomes available to end users. This field identifies the freshness of the content for end users. The timestamp should conform to the RFC 3339 standard. For example:
|
expire_time
|
String - optional The time that the content expires for end users. This field identifies the freshness of the content for end users. The timestamp should conform to the RFC 3339 standard. For example:
|
in_languages
|
String - optional - repeated Language of the media contents. Use language tags defined by BCP 47.
For example: |
country_of_origin
|
String - optional Media document country of origin. Length limit of 128 characters.
For example: |
content_index
|
Int - optional Content index of the media document. The content index can be used to order the documents relative to others. For example, the episode number can be used as the content index. The content index should be a non-negative integer.
For example: |
filter_tags
|
String - optional - repeated Filter tags for the document. At most 250 values are allowed per document with a length limit of 1000 characters. Otherwise, an INVALID_ARGUMENT error is returned.
This tag can be used for filtering recommendation results by passing the
tag as part of the
For example: |
hash_tags
|
String - optional - repeated Hashtags for the document. At most 100 values are allowed per document, with a length limit of 5000 characters.
For example: |
content_rating
|
String - optional - repeated The content rating, used for content advisory systems and content filtering based on the audience. At most 100 values are allowed per document with a length limit of 128 characters.
This tag can be used for filtering recommendation results by passing the
tag as part of the
For example: |
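Several of the flat key properties above carry format and size constraints that are easy to check before upload. The following sketch formats a timestamp for available_time or expire_time as an RFC 3339 UTC string and checks a few of the documented limits. The helper names `to_rfc3339` and `check_limits` are hypothetical, and Python's `len` counts Unicode code points, which is assumed here to be an acceptable approximation of the documented character limits.

```python
from datetime import datetime, timezone


def to_rfc3339(dt):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    if dt.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def check_limits(doc):
    """Return a list of limit violations, using the limits in the table above."""
    problems = []
    if len(doc.get("title", "")) > 1000:
        problems.append("title exceeds 1000 characters")
    categories = doc.get("categories", [])
    if len(categories) > 250:
        problems.append("more than 250 categories")
    if any(len(category) > 5000 for category in categories):
        problems.append("a category exceeds 5000 characters")
    if len(doc.get("filter_tags", [])) > 250:
        problems.append("more than 250 filter_tags")
    if len(doc.get("hash_tags", [])) > 100:
        problems.append("more than 100 hash_tags")
    if doc.get("content_index", 0) < 0:
        problems.append("content_index must be non-negative")
    return problems


available = datetime(2022, 8, 26, 23, 0, 17, tzinfo=timezone.utc)
print(to_rfc3339(available))
print(check_limits({"title": "x" * 1001, "content_index": -1}))
```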
The following table defines hierarchical key properties.
Field name | Notes |
---|---|
images
|
Object - optional - repeated Root key property for encapsulating image-related properties. |
images.uri
|
String - optional URI of the image. Length limit of 5,000 characters. |
images.name
|
String - optional Name of the image. Length limit of 128 characters. |
persons
|
Object - optional - repeated Root key property for encapsulating the person-related properties.
For example:
|
persons.name
|
String - required Name of the person. |
persons.role
|
String - required The role of the person in the media item. Supported values: director, actor, player, team, league, editor, author, character, contributor, creator, funder, producer, provider, publisher, sponsor, translator, music-by, channel, custom-role
If none of the supported values are applied to |
persons.custom_role
|
String - optional
|
persons.rank
|
Int - optional
Used for role ranking. For example, for first actor,
|
persons.uri
|
String - optional URI of the person. |
organizations
|
Object - optional - repeated
Root key property for encapsulating the organization-related properties.
For example:
|
organizations.name
|
String - required Name of the organization. |
organizations.role
|
String - required The role of the organization in the media item. Supported values: director, actor, player, team, league, editor, author, character, contributor, creator, funder, producer, provider, publisher, sponsor, translator, music-by, channel, custom-role
If none of the supported values are applied to |
organizations.custom_role
|
String - optional
|
organizations.rank
|
Int - optional
Used for role ranking. For example, for first publisher:
|
organizations.uri
|
String - optional URI of the organization. |
aggregate_ratings
|
Object - optional - repeated
Root key property for encapsulating the rating-related properties.
|
aggregate_ratings.rating_source
|
String - required
The source for rating. For example, |
aggregate_ratings.rating_score
|
Double - optional The aggregated rating. The rating should be normalized to the [1, 5] range. |
aggregate_ratings.rating_count
|
Int - optional The number of individual reviews. Should be a non-negative value. |
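The table above asks for rating_score to be normalized to the [1, 5] range but does not say how. One possible approach, shown here purely as an assumption, is a linear rescale from the source's own scale; the helper name `normalize_rating` is hypothetical.

```python
def normalize_rating(score, source_min, source_max):
    """Linearly rescale a score from [source_min, source_max] to [1, 5].

    Linear rescaling is an assumption of this sketch, not a documented rule.
    """
    if source_max <= source_min:
        raise ValueError("source_max must be greater than source_min")
    if not source_min <= score <= source_max:
        raise ValueError("score is outside the source range")
    return 1.0 + 4.0 * (score - source_min) / (source_max - source_min)


# A score of 7.5 on a hypothetical 0-10 scale.
rating = {
    "rating_source": "example-source",
    "rating_score": normalize_rating(7.5, 0, 10),
    "rating_count": 1250,
}
print(rating["rating_score"])
```

A rating that is already on a 1-to-5 scale passes through unchanged under this mapping.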
Document levels
Document levels determine the hierarchy in your data store. Typically, you should have a single-level data store or a two-level data store. At most two levels are supported.
For example, you can have a single-level data store where each document is an individual item. Alternatively, you might choose a two-level data store that contains both groups of items and individual items.
Document level types
There are two document level types:
Parent. Parent documents are what Vertex AI Search returns in recommendations and searches. Parents can be individual documents or groups of similar documents. This level type is recommended.
Child. Child documents are versions of a group's parent document. Children can only be individual documents. For example, if the parent document is "Example TV Show", children could be "Episode 1" and "Episode 2". This level type can be difficult to configure and maintain, and is not recommended.
About data store hierarchy
When planning your data store hierarchy, decide if your data store should contain only parents or parents and children. The key point to remember is that recommendations and searches only return parent documents.
For example, a parent-only data store might work well for audiobooks, where a recommendations panel returns a selection of individual audiobooks. On the other hand, if you uploaded TV show episodes as parent documents to a parent-only data store, several out-of-order episodes could be recommended in the same panel.
A TV show data store could work with both parents and children, where each parent document represents a TV show with child documents that represent the episodes of that TV show. This two-level data store allows the recommendation panel to show a range of similar TV shows. The end-user can click a particular show to select an episode to watch.
Because parent-child hierarchies can be difficult to configure and maintain, parent-only data stores are recommended.
For example, a TV show data store can work well as a parent-only data store where each parent document represents a TV show that can be recommended, and individual episodes are not included (and therefore not recommended).
If you determine that your data store needs to have both parents and children, that is, groups and singular items, but you only have singular items now, you need to create parents for the groups. The minimum information that you need to provide for a parent is id, title, and categories. For more information, see the section Document fields.
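Creating parents from existing singular items could look something like the following sketch. It groups episode records by show and emits one parent per show with the minimum id, title, and categories, then links each episode to its parent through parentDocumentId. The input keys `show_id` and `show_title` and both helper names are hypothetical; your own records will have different field names.

```python
def build_parents(episodes):
    """Create one parent record per show with id, title, and categories."""
    parents = {}
    for episode in episodes:
        parent = parents.setdefault(
            episode["show_id"],
            {
                "id": episode["show_id"],
                "title": episode["show_title"],
                "categories": [],
            },
        )
        # Merge the categories of all episodes, keeping first-seen order.
        for category in episode["categories"]:
            if category not in parent["categories"]:
                parent["categories"].append(category)
    return list(parents.values())


def link_children(episodes):
    """Pair each episode ID with the ID of its parent document."""
    return [
        {"id": episode["id"], "parentDocumentId": episode["show_id"]}
        for episode in episodes
    ]


# Hypothetical episode records.
episodes = [
    {"id": "ep-1", "show_id": "show-1", "show_title": "Example TV Show",
     "categories": ["drama"]},
    {"id": "ep-2", "show_id": "show-1", "show_title": "Example TV Show",
     "categories": ["drama", "drama > crime"]},
]

print(build_parents(episodes))
print(link_children(episodes))
```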
BigQuery schema for media
If you plan to import your documents from BigQuery, use the predefined BigQuery schema to create a BigQuery table with the correct format, and load your document data into it before you import your documents.
[
  { "name": "id", "mode": "REQUIRED", "type": "STRING", "fields": [] },
  { "name": "schemaId", "mode": "REQUIRED", "type": "STRING", "fields": [] },
  { "name": "parentDocumentId", "mode": "NULLABLE", "type": "STRING", "fields": [] },
  { "name": "jsonData", "mode": "NULLABLE", "type": "STRING", "fields": [] }
]
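Because jsonData is a STRING column, the media fields need to be serialized to a JSON string before they are loaded. The following sketch builds one row that matches the table schema above and prints it as a line of newline-delimited JSON. The helper name `to_bigquery_row` and the document values are hypothetical, and the sketch assumes that jsonData holds the media fields defined by the JSON schema for Document.

```python
import json


def to_bigquery_row(doc_id, media_fields, parent_id=None):
    """Build a row dict matching the predefined BigQuery schema for media."""
    return {
        "id": doc_id,
        # "default_schema" is the value named in the Document fields table.
        "schemaId": "default_schema",
        "parentDocumentId": parent_id,
        # jsonData is a STRING column, so serialize the media fields.
        "jsonData": json.dumps(media_fields),
    }


row = to_bigquery_row(
    "doc-1",
    {
        "title": "Test document title",
        "categories": ["sports > clip"],
        "uri": "http://www.example.com",
        "available_time": "2022-08-26T23:00:17Z",
    },
)

# One row per line, suitable for a newline-delimited JSON load file.
print(json.dumps(row))
```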