This preview documentation is deprecated as of October 27, 2023. For GA documentation, go to the Vertex AI Search documentation.

Changes in GA:

Name: Discovery for Media is renamed to Vertex AI Search for media. Vertex AI Search includes media recommendations and media search.
Google Cloud Console page: Use the Agent Builder page in the console. The Discovery Engine console page is deprecated.
API reference: Continue to use the discoveryengine.googleapis.com service. The API remains the same but the documentation has moved. Go to the up-to-date, GA version of the Discovery Engine API reference in the Vertex AI Search documentation.
All changes: To see what else changed between preview and GA, see Switch from Discovery for Media to media recommendations in the Vertex AI Search documentation.

About documents and datastore

This page provides best practices for creating your datastore and populating it with document information.

Overview

Your datastore contains a collection of document objects, such as videos or news articles, that you have uploaded to Discovery for Media. This is represented in Discovery Engine by the DataStore resource.

A document is any item that you upload to Discovery for Media to be recommended. For Discovery for Media, you might upload a video, news article, music file, or podcast as a document. The Document object in Discovery Engine captures all metadata information of items to be recommended.

The data you import into Discovery for Media directly affects the quality of the resulting model, and therefore the quality of the results that Discovery for Media provides. In general, the more accurate and specific the information you provide, the higher the quality of your model.

Your datastore should be kept up to date. You can upload datastore changes as often as needed; ideally, every day for a datastore with a high rate of change. There is no charge for uploading datastore information. For more information, see Keeping your datastore up to date.

You can have one datastore per Google Cloud project. After you enable Discovery for Media in the Discovery Engine console and accept Discovery Engine terms of use, the console prompts you to create the default datastore.

Documents

The datastore is a collection of document objects. The Document object in Discovery Engine captures all metadata information of items to be recommended.

A document is any item you upload to Discovery for Media to be recommended, such as a video or news article.

JSON Schema for `Document`

JSON Schema is a JSON-based format that describes the structure of JSON documents. When using Media Recommendations, documents must use the predefined JSON schema shown below.

For more about JSON Schema, see the JSON Schema specification documentation at json-schema.org.

Documents are uploaded with either a JSON or Struct data representation. Make sure the document JSON or Struct conforms to the following JSON schema. The JSON schema uses JSON Schema 2020-12 for validation.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "description": {
      "type": "string"
    },
    "language_code": {
      "type": "string"
    },
    "categories": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "uri": {
      "type": "string"
    },
    "images": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "uri": {
            "type": "string"
          },
          "name": {
            "type": "string"
          }
        }
      }
    },
    "media_type": {
      "type": "string"
    },
    "in_languages": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "country_of_origin": {
      "type": "string"
    },
    "content_index": {
      "type": "integer"
    },
    "persons": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "role": {
            "type": "string"
          },
          "custom_role": {
            "type": "string"
          },
          "rank": {
            "type": "integer"
          },
          "uri": {
            "type": "string"
          }
        },
        "required": ["name", "role"]
      }
    },
    "organizations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "role": {
            "type": "string"
          },
          "custom_role": {
            "type": "string"
          },
          "rank": {
            "type": "integer"
          },
          "uri": {
            "type": "string"
          }
        },
        "required": ["name", "role"]
      }
    },
    "hash_tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "filter_tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "duration": {
      "type": "string"
    },
    "content_rating": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "aggregate_ratings": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "rating_source": {
            "type": "string"
          },
          "rating_score": {
            "type": "number"
          },
          "rating_count": {
            "type": "integer"
          }
        },
        "required": ["rating_source"]
      }
    },
    "availability_start_time": {
      "type": "string"
    },
    "production_year": {
      "type": "integer"
    }
  },
  "required": ["title", "uri", "availability_start_time"]
}

Sample JSON `Document` object

The following example shows a JSON Document object.

{
  "title": "Test document title",
  "description": "Test document description",
  "language_code": "en-US",
  "categories": [
    "sports > clip",
    "sports > highlight"
  ],
  "uri": "http://www.example.com",
  "images": [
    {
      "uri": "http://example.com/img1",
      "name": "image_1"
    }
  ],
  "media_type": "sports-game",
  "in_languages": [
    "en-US"
  ],
  "country_of_origin": "US",
  "content_index": 0,
  "persons": [
    {
      "name": "sports person",
      "role": "player",
      "rank": 0,
      "uri": "http://example.com/person"
    }
  ],
  "organizations": [
    {
      "name": "sports team",
      "role": "team",
      "rank": 0,
      "uri": "http://example.com/team"
    }
  ],
  "hash_tags": [
    "tag1"
  ],
  "filter_tags": [
    "filter_tag"
  ],
  "duration": "100s",
  "production_year": 1900,
  "content_rating": [
    "PG-13"
  ],
  "aggregate_ratings": [
    {
      "rating_source": "imdb",
      "rating_score": 4.5,
      "rating_count": 1250
    }
  ],
  "availability_start_time": "2022-08-26T23:00:17Z"
}
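As a rough pre-upload check, the required properties from the schema above can be verified in a few lines of Python. This is a minimal sketch rather than a full JSON Schema validator: the function name `find_problems` is hypothetical, and `datetime.fromisoformat` accepts only a subset of RFC 3339, so treat a reported timestamp problem as a hint, not a verdict.

```python
from datetime import datetime

# Required properties, taken from the "required" list in the schema above.
REQUIRED_PROPERTIES = ("title", "uri", "availability_start_time")


def find_problems(document: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in REQUIRED_PROPERTIES:
        value = document.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or non-string required property: {key}")

    start = document.get("availability_start_time")
    if isinstance(start, str) and start:
        try:
            # Older Python versions reject a trailing "Z" in fromisoformat(),
            # so it is rewritten as an explicit UTC offset first.
            datetime.fromisoformat(start.replace("Z", "+00:00"))
        except ValueError:
            problems.append("availability_start_time is not a parseable timestamp")
    return problems
```

Calling `find_problems` on the sample document above should return an empty list; removing `uri` from it should produce one reported problem.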

Document fields

This section lists the field values you provide when you create document items in your datastore. The values should correspond with the values used in your internal document database, and should accurately reflect the item represented, because they are included in training your models.

All document information you provide can be used to improve the quality of recommendations. Be sure to provide as many fields as possible.

`Document` object fields

The following fields are top-level API fields for the Document object. Also refer to these fields on the Document reference page.

Field	Notes
`name`	The full, unique resource name of the document. Required for all `Document` methods except for `create` and `import`. During import, the name is automatically generated and does not need to be manually provided.
`id`	The document ID used by your internal database. The ID field must be unique across your entire datastore. The same value is used when you record a user event, and is also returned by the `recommend` method.
`schemaId`	Required. The identifier of the schema located in the same datastore. Should be set as "default_schema", which is automatically created when the default datastore is created.
`parentDocumentId`	The ID of the parent document. For top-level (root) documents, `parentDocumentId` can be empty or can point to the document itself. For child documents, `parentDocumentId` must point to a valid root document.

Key properties

The following properties are defined using the predefined JSON Schema format for Discovery for Media.

For more about JSON properties, also see the Understanding JSON Schema documentation for properties at json-schema.org.

The following table defines flat properties.

Field name	Notes
`title`	String - required Document title from your database. A UTF-8 encoded string. Limited to 1000 characters.
`categories`	String - required Document categories. This property is repeated so that one document can belong to several parallel categories. Use the full category path for higher quality results. To represent the full path of a category, use the `>` symbol to separate hierarchy levels. If `>` is part of a category name, replace it with one or more other characters. For example: `"categories": [ "sports > highlight" ]` A document can contain at most 250 categories. Each category is a UTF-8 encoded string with a length limit of 5000 characters.
`uri`	String - required URI of the document. Length limit of 5000 characters.
`description`	String - optional Description of the document. Length limit of 5000 characters.
`language_code`	String - optional Language of the title/description and other string attributes. Use language tags defined by BCP 47. For document recommendation, this field is ignored and the model automatically detects the text language. The document can include text in different languages, but duplicating documents to provide text in multiple languages can result in degraded model performance. For document search this field is in use. It defaults to "en-US" if unset. For example, `"language_code": "en-US"`.
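The category rules in the table above (the `>` separator, the replacement of `>` inside names, and the 250-value and 5000-character limits) could be applied with a small helper. This is a sketch under assumptions: the function names and the choice of `-` as the replacement character are illustrative and not part of Discovery for Media.

```python
MAX_CATEGORIES = 250          # per-document limit from the table above
MAX_CATEGORY_LENGTH = 5000    # per-category character limit from the table above


def category_path(levels: list[str], replacement: str = "-") -> str:
    """Join hierarchy levels into a full path such as "sports > highlight".

    Any ">" inside a level name is replaced, because ">" is reserved as
    the hierarchy separator.
    """
    cleaned = [level.replace(">", replacement).strip() for level in levels]
    return " > ".join(cleaned)


def check_categories(categories: list[str]) -> list[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    if len(categories) > MAX_CATEGORIES:
        problems.append(f"too many categories: {len(categories)}")
    for category in categories:
        if len(category) > MAX_CATEGORY_LENGTH:
            problems.append(f"category too long: {category[:30]}...")
    return problems
```

For example, `category_path(["sports", "highlight"])` yields `"sports > highlight"`.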

The following table defines hierarchical key properties.

Field name	Notes
`images`	Object - optional - repeated Root key property for encapsulating the image related properties.
`images.uri`	String - optional URI of the image. Length limit of 5,000 characters.
`images.name`	String - optional Name of the image. Length limit of 128 characters.

The following table defines flat key media properties.

Field name	Notes
`duration`	String - optional Duration of the media content. Duration should be encoded as a string, using the same encoding as the `google::protobuf::Duration` JSON string encoding. For example: `"5s"`, `"1m"`
`availability_start_time`	String - required The time that the content is available to the end-users. This field identifies the freshness of a content for end-users. The timestamp should conform to RFC 3339 standard. For example: `"2022-08-26T23:00:17Z"`
`media_type`	String - optional Top-level category. Supported types: episode, movie, concert, event, live-event, broadcast, tv-series, video-game, clip, vlog, audio, audio-book, music, album, articles, news, radio, podcast, book, sports-game
`in_languages`	String - optional - repeated Language of the media contents. Use language tags defined by BCP 47. For example: `"in_languages": [ "en-US"]`
`country_of_origin`	String - optional Media document country of origin. Length limit of 128 characters. For example: `"country_of_origin": "US"`
`content_index`	Int - optional Content index of the media document. Content index field can be used to order the documents relative to others. For example, episode number can be used as the content index. Content index should be a non-negative integer. For example: `"content_index": 0`
`filter_tags`	String - optional - repeated Filter tags for the document. At most 250 values are allowed per document with a length limit of 1000 characters. Otherwise, an INVALID_ARGUMENT error is returned. This tag can be used for filtering recommendation results by passing the tag as part of the `RecommendRequest.filter`. For example: `"filter_tags": [ "filter_tag"]`
`hash_tags`	String - optional - repeated Hashtags for the document. At most 100 values are allowed per document, with a length limit of 5000 characters. For example: `"hash_tags": [ "soccer", "world cup"]`
`content_rating`	String - optional - repeated The content rating, used for content advisory systems and content filtering based on the audience. At most 100 values are allowed per document with a length limit of 128 characters. This tag can be used for filtering recommendation results by passing the tag as part of the `RecommendRequest.filter`. For example: `"content_rating": ["PG-13"]`
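The duration encoding and the per-property tag limits from the table above can be checked before upload. The following is a minimal sketch: `encode_duration` and `check_tag_limits` are hypothetical helper names, and the seconds-with-`s`-suffix output is what the sample document uses (`"100s"`).

```python
# (maximum number of values, maximum characters per value), from the table above.
TAG_LIMITS = {
    "filter_tags": (250, 1000),
    "hash_tags": (100, 5000),
    "content_rating": (100, 128),
}


def encode_duration(seconds: float) -> str:
    """Encode a non-negative number of seconds as a string such as "100s"."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    # Up to nine fractional digits, with trailing zeros removed.
    text = f"{seconds:.9f}".rstrip("0").rstrip(".")
    return text + "s"


def check_tag_limits(document: dict) -> list[str]:
    """Return a list of limit violations for the repeated tag properties."""
    problems = []
    for key, (max_values, max_length) in TAG_LIMITS.items():
        values = document.get(key, [])
        if len(values) > max_values:
            problems.append(f"{key}: {len(values)} values exceeds {max_values}")
        for value in values:
            if len(value) > max_length:
                problems.append(f"{key}: value longer than {max_length} characters")
    return problems
```

Running the check locally catches the cases that would otherwise surface as an `INVALID_ARGUMENT` error for `filter_tags`.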

The following table defines hierarchical key media properties.

Field name	Notes
`persons`	Object - optional - repeated Root key property for encapsulating the person-related properties. For example: `"persons":[{"name":"sports person","role":"player","rank":0,"uri":"http://example.com/person"}]`
`persons.name`	String - required Name of the person.
`persons.role`	String - required The role of the person in the media. Supported values: director, actor, player, team, league, editor, author, character, contributor, creator, funder, producer, provider, publisher, sponsor, translator, music-by, channel, custom-role. If none of the supported values applies to `role`, set `role` to `custom-role` and provide the value in the `custom_role` field.
`persons.custom_role`	String - optional `custom_role` is set if and only if `role` is set to `custom-role`. Must be a UTF-8 encoded string with a length limit of 128 characters. Must match the pattern: `[a-zA-Z0-9][a-zA-Z0-9_]*`.
`persons.rank`	Int - optional Used for role ranking. For example, for first actor, `role = "actor", rank = 1`
`persons.uri`	String - optional URI of the person.
`organizations`	Object - optional - repeated Root key property for encapsulating the organization-related properties. For example: `"organizations":[{"name":"sports team","role":"team","rank":0,"uri":"http://example.com/team"}]`
`organizations.name`	String - required Name of the organization.
`organizations.role`	String - required The role of the organization in the media. Supported values: director, actor, player, team, league, editor, author, character, contributor, creator, funder, producer, provider, publisher, sponsor, translator, music-by, channel, custom-role. If none of the supported values applies to `role`, set `role` to `custom-role` and provide the value in the `custom_role` field.
`organizations.custom_role`	String - optional `custom_role` is set if and only if `role` is set to `custom-role`. Must be a UTF-8 encoded string with a length limit of 128 characters. Must match the pattern: `[a-zA-Z0-9][a-zA-Z0-9_]*`.
`organizations.rank`	Int - optional Used for role ranking. For example, for first publisher: `role = "publisher", rank = 1`.
`organizations.uri`	String - optional URI of the organization.
`aggregate_ratings`	Object - optional - repeated Root key property for encapsulating the `aggregate_rating` related properties.
`aggregate_ratings.rating_source`	String - required The source for rating. For example, `imdb` or `rotten_tomatoes`. Must be a UTF-8 encoded string with a length limit of 128 characters. Must match the pattern: `[a-zA-Z0-9][a-zA-Z0-9_]*`.
`aggregate_ratings.rating_score`	Double - optional The aggregated rating. The rating should be normalized to the [1, 5] range.
`aggregate_ratings.rating_count`	Int - optional The number of individual reviews. Should be a non-negative value.
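The `role` and `custom_role` rule and the [1, 5] rating range from the table above lend themselves to a small check. This sketch makes two assumptions: the helper names are illustrative, and linear rescaling is only one possible way to normalize a rating, since the source does not prescribe a method.

```python
import re

# Supported role values, from the persons.role and organizations.role rows.
SUPPORTED_ROLES = {
    "director", "actor", "player", "team", "league", "editor", "author",
    "character", "contributor", "creator", "funder", "producer", "provider",
    "publisher", "sponsor", "translator", "music-by", "channel", "custom-role",
}
NAME_PATTERN = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_]*")


def check_role_entry(entry: dict) -> list[str]:
    """Check one persons[] or organizations[] entry against the table rules."""
    problems = []
    if not entry.get("name"):
        problems.append("name is required")
    role = entry.get("role")
    if role not in SUPPORTED_ROLES:
        problems.append(f"unsupported role: {role!r}")
    custom = entry.get("custom_role")
    # custom_role is set if and only if role is "custom-role".
    if (role == "custom-role") != (custom is not None):
        problems.append("custom_role must be set if and only if role is custom-role")
    elif custom is not None and (len(custom) > 128 or not NAME_PATTERN.fullmatch(custom)):
        problems.append("custom_role has an invalid format")
    return problems


def normalize_rating(score: float, source_min: float, source_max: float) -> float:
    """Linearly rescale a score from [source_min, source_max] to [1, 5]."""
    if source_max <= source_min:
        raise ValueError("source_max must be greater than source_min")
    if not source_min <= score <= source_max:
        raise ValueError("score is outside the source range")
    return 1 + 4 * (score - source_min) / (source_max - source_min)
```

For example, a score of 7.5 on a 0 to 10 scale maps to 4.0 under this rescaling.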

Document levels

Document levels determine the hierarchy in your datastore. Your datastore can be single-level or two-level; more than two levels are not supported.

For example, you can have a single-level datastore where each document is an individual item. Alternatively, you might choose a two-level datastore that contains both groups of items and individual items.

Document-level types

There are two document-level types:

Parent items are what Discovery for Media returns in recommendations. Parents can be individual items or groups of similar items.
Child items are versions of a group's parent document. Children can only be individual items. For example, if the parent document is "Example TV Show", children could be "Episode 1" and "Episode 2".

About datastore hierarchy

When planning your datastore hierarchy, decide if your datastore should contain only parents or parents and children. The key point to remember is that recommendations only return parent items.

For example, a parent-only datastore might work well for audiobooks, where a recommendations panel returns a selection of individual audiobooks. However, a parent-only datastore for TV shows might return several out-of-order episodes of that show in the recommendation panel.

The TV show datastore could work better with both parents and children, with episodes as children and a parent representing a TV show as a group representing all those episodes. This two-level datastore allows the recommendation panel to show a range of similar TV shows. The end-user can click a particular show to select an episode to watch.

If you determine that your datastore should have both parents and children, that is, groups and singular items, but you only have singular items now, you need to create parents for the groups. The minimum information that you need to provide for a parent is id, title, and categories. For more information, see the section Document fields.
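Creating parents for existing singular items might look like the following sketch. The input shapes (`shows` keyed by ID, `episodes` carrying a `show_id`) and the intermediate `properties` key are hypothetical; only `id`, `parentDocumentId`, `title`, and `categories` come from this page.

```python
def build_two_level_documents(shows: dict, episodes: list) -> list:
    """Create one parent per show and attach each episode to its parent.

    shows:    {show_id: {"title": str, "categories": [str, ...]}}
    episodes: [{"id": str, "show_id": str, "title": str}, ...]
    """
    documents = []
    for show_id, show in shows.items():
        # Minimum information for a parent: id, title, and categories.
        documents.append({
            "id": show_id,
            "properties": {"title": show["title"], "categories": show["categories"]},
        })
    for episode in episodes:
        if episode["show_id"] not in shows:
            raise ValueError(f"episode {episode['id']} has no parent show")
        # A child points at a valid root document through parentDocumentId.
        documents.append({
            "id": episode["id"],
            "parentDocumentId": episode["show_id"],
            "properties": {"title": episode["title"]},
        })
    return documents
```

Parents are emitted before children here so that every `parentDocumentId` refers to a document that appears earlier in the list; whether import order matters is not stated on this page.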

Datastore import

If your document data is stored in Cloud Storage, BigQuery, or another storage system, you can import it in bulk.

For detailed information about how to upload a datastore, see Import datastore information.

Branches

Although Discovery for Media has datastore branching, Media Recommendations does not support multiple datastore branches.

Always use branch 0.

BigQuery schema for Discovery for Media

When importing a datastore from BigQuery, use the following Discovery for Media schema to create a BigQuery table with the correct format, and load the table with your document data. Then, import the documents.

[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "schemaId",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "parentDocumentId",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
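Because `jsonData` is a `STRING` column in the schema above, the document properties need to be serialized into a JSON string for each row. The following sketch builds such rows and writes them as newline-delimited JSON, a format that BigQuery can load; the helper names are hypothetical, and loading the resulting file into the table is left out.

```python
import json
from typing import Optional


def to_bigquery_row(doc_id: str, properties: dict,
                    parent_document_id: Optional[str] = None) -> dict:
    """Build one row matching the id/schemaId/parentDocumentId/jsonData schema."""
    return {
        "id": doc_id,
        "schemaId": "default_schema",
        "parentDocumentId": parent_document_id,  # NULLABLE column
        # jsonData is a STRING column, so the properties are serialized here.
        "jsonData": json.dumps(properties),
    }


def to_ndjson(rows: list) -> str:
    """Serialize rows as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

A row for the sample document could be built with `to_bigquery_row("doc-1", sample_document)`, where `sample_document` holds the JSON object shown earlier on this page.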