Create a media data store

This page explains how to create a data store for media and import data into it.

Before you begin

Make sure that you do the following:

  • Review the concepts related to media data and schema.

  • Decide whether you are using the predefined Google schema for your media data or your own schema.

  • If you're using your own schema, make sure your schema has fields that map well to the media key properties: title, url, category, and so on.

  • Format your media documents according to the JSON schema, and then upload the data to BigQuery or Cloud Storage.

  • Review About user events and prepare your user events for import. User events are required for media recommendations and are recommended for media search.

Choose the procedure according to your data source

To create a media data store and import documents, go to the section for the source that you plan to use:

Import from BigQuery

Console

To use the Google Cloud console to create a media data store and import documents and user events from BigQuery, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

  2. Go to the Data Stores page.

  3. Click Create data store.

  4. On the Source page, select BigQuery.

  5. Select Media - BigQuery table with structured media data as the kind of data that you are importing.

  6. In the BigQuery path field, click Browse, select the BigQuery data that you have prepared for ingesting, and then click Select. Alternatively, enter the location directly in the BigQuery path field.

  7. If your data is in the predefined Google schema, choose Google predefined schema, click Continue, and skip to step 11.

  8. If your data is in your own schema, choose Custom schema and click Continue.

  9. Review the detected schema and use the Key properties menu to assign properties to your schema fields.

  10. Click Continue.

    You can't continue until the required key properties are mapped. Mapped properties show green checkmarks instead of orange warning marks.

  11. Enter a name for your data store and click Create.

Import from Cloud Storage

Console

To use the Google Cloud console to create a media data store and import documents from Cloud Storage, follow these steps:

  1. In the Google Cloud console, go to the Agent Builder page.

  2. Go to the Data Stores page.

  3. Click Create data store.

  4. On the Source page, select Cloud Storage.

  5. Select Structured media data (JSONL containing media files) as the kind of data that you are importing.

  6. In the Select a folder or file you want to import section, select Folder or File.

  7. Click Browse and choose the data that you have prepared for ingesting, and then click Select. Alternatively, enter the location directly in the gs:// field.

  8. If your data is in the predefined Google schema, choose Google predefined schema, click Continue, and skip to step 12.

  9. If your data is in your own schema, choose Custom schema and click Continue.

  10. Review the detected schema and use the Key properties menu to assign properties to your schema fields.

  11. Click Continue.

    You can't continue until the required key properties are mapped. Mapped properties show green checkmarks instead of orange warning marks.

  12. Enter a name for your data store and click Create.

Import documents using the API

If you are using the Google predefined schema, you can import your documents by making a POST request to the Documents:import REST method, using the InlineSource object to specify your data.

For an example of the JSON document format, see JSON document format.

Import requirements

Here are the requirements for importing media documents using the API:

  • Each document must be on its own line.

  • The maximum number of documents in a single import is 100.
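Because a single import accepts at most 100 documents, a larger document set has to be divided into batches before you call the API. The following is a minimal sketch, assuming that your documents are stored one per line in a JSONL file with no blank lines. The function name `split_into_batches` and the `batch_` file prefix are illustrative choices, not part of the API.

```shell
#!/usr/bin/env bash
# Sketch: split a JSONL file (one document per line) into files of at most
# 100 lines each, matching the per-import document limit stated above.
# split_into_batches and the batch_ prefix are illustrative names.

split_into_batches() {
  local input_file="$1"   # JSONL file, one document per line
  local output_dir="$2"   # directory that receives the batch files
  mkdir -p "$output_dir"
  # -l 100 caps every output file at 100 lines, that is, 100 documents.
  # Output files are named batch_aa, batch_ab, and so on.
  split -l 100 "$input_file" "$output_dir/batch_"
}
```

For example, `split_into_batches docs.jsonl ./batches` turns a 250-line file into three batch files of 100, 100, and 50 documents, and each file can then be sent in its own import request.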

Procedure

To import media documents using the API, do the following:

  1. Create a data store.

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores?dataStoreId=DATA_STORE_ID" \
    -d '{
      "displayName": "DATA_STORE_DISPLAY_NAME",
      "industryVertical": "MEDIA"
    }'
    

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store that you want to create. This ID can contain only lowercase letters, digits, underscores, and hyphens.
    • DATA_STORE_DISPLAY_NAME: the display name of the Vertex AI Search data store that you want to create.
  2. Create the JSON file for your document and call it ./data.json:

    {
      "inlineSource": {
        "documents": [
          { DOCUMENT_1 },
          { DOCUMENT_2 }
        ]
      }
    }
    
  3. Call the POST method:

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data @./data.json \
    "https://discoveryengine.googleapis.com/v1/projects/PROJECT_ID/locations/global/dataStores/DATA_STORE_ID/branches/0/documents:import"

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of your data store.
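Parts of this procedure can be scripted. The following sketch assumes a JSONL file that holds one complete document per line and no blank lines. Both helper names, `is_valid_data_store_id` and `wrap_inline_source`, are hypothetical and not part of any Google tooling. The first checks an ID against the character rule given for DATA_STORE_ID, and the second builds the `inlineSource` request body for `./data.json`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the API procedure; not part of any Google tooling.

# Succeeds only if the ID is non-empty and contains nothing but lowercase
# letters, digits, underscores, and hyphens. Any other limits the service
# may enforce (for example, on length) are not checked here.
is_valid_data_store_id() {
  local LC_ALL=C   # make the a-z range mean ASCII lowercase only
  case "$1" in
    '' | *[!a-z0-9_-]*) return 1 ;;
  esac
  return 0
}

# Prints an inlineSource request body built from a JSONL file.
# Assumes every line is one complete JSON document and no line is blank.
wrap_inline_source() {
  local jsonl_file="$1"
  printf '{"inlineSource":{"documents":['
  # Join the lines with commas, then drop the newline that paste appends.
  paste -sd, "$jsonl_file" | tr -d '\n'
  printf ']}}\n'
}
```

A possible use is `wrap_inline_source docs.jsonl > ./data.json`, followed by the `curl` call from step 3. The output is not validated as JSON, so malformed input lines pass through unchanged.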

JSON document format

The following examples show Document entries in JSON format.

Provide each document in full on its own line. The examples below are wrapped across several lines for readability only.

Minimum required fields:

{
   "id": "sample-01",
   "schemaId": "default_schema",
   "jsonData": "{\"title\":\"Test document title\",\"categories\":[\"sports > clip\",\"sports > highlight\"],\"uri\":\"http://www.example.com\",\"media_type\":\"sports-game\",\"available_time\":\"2022-08-26T23:00:17Z\"}"
}

Complete object:

{
   "id": "child-sample-0",
   "schemaId": "default_schema",
   "jsonData": "{\"title\":\"Test document title\",\"description\":\"Test document description\",\"language_code\":\"en-US\",\"categories\":[\"sports > clip\",\"sports > highlight\"],\"uri\":\"http://www.example.com\",\"images\":[{\"uri\":\"http://example.com/img1\",\"name\":\"image_1\"}],\"media_type\":\"sports-game\",\"in_languages\":[\"en-US\"],\"country_of_origin\":\"US\",\"content_index\":0,\"persons\":[{\"name\":\"sports person\",\"role\":\"player\",\"rank\":0,\"uri\":\"http://example.com/person\"}],\"organizations\":[{\"name\":\"sports team\",\"role\":\"team\",\"rank\":0,\"uri\":\"http://example.com/team\"}],\"hash_tags\":[\"tag1\"],\"filter_tags\":[\"filter_tag\"],\"production_year\":1900,\"duration\":\"100s\",\"content_rating\":[\"PG-13\"],\"aggregate_ratings\":[{\"rating_source\":\"imdb\",\"rating_score\":4.5,\"rating_count\":1250}],\"available_time\":\"2022-08-26T23:00:17Z\"}"
}
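Before importing, it can help to run a quick check over a file of documents. The following is a rough heuristic sketch, not a JSON validator. It assumes a JSONL file and flags two common problems: lines that are not a single `{ ... }` object, and commas placed directly before a closing bracket or brace, which JSON does not allow. The function name `lint_documents_file` is hypothetical, and the comma check can misfire on string values that legitimately contain text such as `,]`.

```shell
#!/usr/bin/env bash
# Heuristic pre-import check for a JSONL documents file.
# lint_documents_file is an illustrative name; this is not a JSON parser.

lint_documents_file() {
  local file="$1"
  local status=0
  # Report lines (with line numbers) that do not start with "{" and end
  # with "}". Blank lines and documents split over several lines land here.
  if grep -nvE '^\{.*\}$' "$file"; then
    status=1
  fi
  # Report lines with a comma directly before "]" or "}", including
  # inside an escaped jsonData string.
  if grep -nE ',[[:space:]]*[]}]' "$file"; then
    status=1
  fi
  return "$status"
}
```

The function prints each suspicious line and returns a non-zero status if it finds any, so it can gate an import script, for example `lint_documents_file docs.jsonl && echo "looks OK"`.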

Monitor import and view data

  1. To check the status of your ingestion, go to the Data Stores page and click your data store name to see details about it on its Data page.

  2. Click the Activity tab.

    When the status column on the Activity tab changes from In progress to Import completed, the ingestion is complete.

    Depending on the size of your data, ingestion can take several minutes or several hours.

  3. Click Documents to view the data you imported.

Import user events

User events are required if you want to use your data store with a media recommendations app.

Although user events aren't required for media search apps, include user events to get better quality search results.

To import user events to your media data store:

What's next

  • Create a media recommendations app or a media search app.

  • Keep your document data fresh.

    Ideally, update your data store daily by importing fresh data. Scheduling periodic imports prevents model quality from degrading over time. You can use Cloud Scheduler to automate imports.

    You can update only new or changed documents, or you can import the entire data store. If you import documents that are already in your data store, they are not added again. Any document that has changed is updated.

  • Keep your user-event data fresh.

    It is particularly important that you keep your user events fresh. The recommendations app stops working if there aren't enough fresh user events to meet the data requirements.

    For information about importing user event data in real time, see Record real-time user events.

    For information about monitoring user-event requirements, see Check data quality for media recommendations.