Import documents

This page describes how to import your documents to the Discovery for Media datastore and keep it up to date.

Before you begin

Before you can import your documents, complete the steps in Before you begin.

You must have the Discovery Engine Admin IAM role to perform the import.
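For example, a project owner can grant this role from the command line. The following is a minimal sketch; the role ID roles/discoveryengine.admin is assumed to correspond to the Discovery Engine Admin role, and the project ID and user email are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/discoveryengine.admin"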

Document import best practices

Discovery for Media requires high-quality data to generate high-quality results. If your data is missing fields or has placeholder values instead of actual values, the quality of your predictions and search results suffers.

When you import documents, ensure that you implement the following best practices:

  • Think carefully about which documents or groups of documents are primary and which are variants. Before you upload any data, see Document levels.

    Parent items are returned as recommendations. Child items are not.

    For example, if you are recommending shows and a parent group is the show "Awesome TV Show", then the model returns a recommendation of "Awesome TV Show", and, perhaps, recommendations for "Cool TV Show" and "Fantastic TV Show".

    However, if child documents are not used and instead every document is a top-level parent item, then every episode of "Awesome TV Show" is returned as a distinct item on the recommendation panel, such as "Awesome TV Show episode 1" and "Awesome TV Show episode 2".

  • Observe the document item import limits.

    For bulk import from Cloud Storage, the size of each file must be 2 GB or smaller. You can include up to 100 files at a time in a single bulk import request.

    For inline import, import no more than 100 document items at a time.

  • Make sure that all required document information is included and correct.

    Do not use placeholder values.

    See the documentation for Document fields.

  • Include as much optional information as possible.

  • Keep your datastore up to date.

    Ideally, update your datastore daily. Scheduling periodic imports prevents model quality from degrading over time. You can schedule automatic, recurring imports when you import your documents using the Google Cloud console. Alternatively, you can use Cloud Scheduler to automate imports.

  • Do not record user events for documents that have not been imported yet.

About importing documents

You can import your documents from Cloud Storage, from BigQuery, or inline in the request. Each of these procedures is a one-time import. Schedule regular imports (ideally, daily) to ensure that your datastore is current. See Keep your datastore up to date.

You can also import individual documents. For more information, see Upload a document.

Datastore import considerations

This section describes the methods that you can use to batch import your document data, when you might use each method, and some of their limitations.

BigQuery

  Description: Import data from a previously loaded BigQuery table that uses the BigQuery schema for Discovery for Media. Can be performed using the Google Cloud console or curl.

  When to use: If you have large volumes of data (BigQuery import does not have a data limit), or if you already use BigQuery.

  Limitations: Requires the extra step of creating a BigQuery table that maps to the BigQuery schema for Discovery for Media.

Cloud Storage

  Description: Import data in JSON format from files loaded in a Cloud Storage bucket. Each file must be 2 GB or smaller, and up to 100 files at a time can be imported. The import can be done using the Google Cloud console or curl.

  When to use: If you need to load a large amount of data in a single step.

  Limitations: Requires the extra step of mapping your data to the Cloud Storage schema.

Inline import

  Description: Import using a call to the Documents:import method with the InlineSource object.

  When to use: If you have flat, non-relational data or a high frequency of updates.

  Limitations: No more than 100 documents can be imported at a time. However, many load steps can be performed.

Import documents from BigQuery

To import documents in the correct format from BigQuery, use the BigQuery schema for Discovery for Media to create a BigQuery table in the correct format, load the table with your documents, and then import your data to Discovery for Media.

For more help with BigQuery tables, see Introduction to tables. For help with BigQuery queries, see Overview of querying BigQuery data.
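For example, if you have saved the BigQuery schema for Discovery for Media locally as discovery_media_schema.json and exported your documents as newline-delimited JSON, creating and loading the table with the bq command-line tool might look like the following sketch (the dataset, table, and file names are placeholders):

# Create a dataset and a table that uses the Discovery for Media schema.
bq mk --dataset PROJECT_ID:media_documents
bq mk --table PROJECT_ID:media_documents.documents discovery_media_schema.json

# Load newline-delimited JSON documents into the table.
bq load --source_format=NEWLINE_DELIMITED_JSON \
  PROJECT_ID:media_documents.documents ./documents.json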

To import your datastore:

  1. If your BigQuery dataset is in another project, configure the required permissions so that Discovery for Media can access the BigQuery dataset. Learn more.

  2. Import your documents to Discovery for Media.

    Console

    1. Go to the Discovery Engine Data page in the Google Cloud console.

      Go to the Data page
    2. Click Import to open the Import data panel.
    3. For Data type, select Media catalog.
    4. For Source of data, select BigQuery.
    5. Enter the BigQuery table where your data is located.
    6. Click Import.

    curl

    1. Create a data file for the input parameters for the import.

      Use the BigQuerySource object to point to your BigQuery dataset.

      • DATASET_ID: The ID of the BigQuery dataset.
      • TABLE_ID: The ID of the BigQuery table holding your data.
      • PROJECT_ID: The project ID that the BigQuery source is in. If not specified, the project ID is inherited from the parent request.
      • STAGING_DIRECTORY: Optional. A Cloud Storage directory that is used as an interim location for your data before it is imported into Discovery for Media. Leave this field empty to let Discovery for Media automatically create a temporary directory (recommended).
      • ERROR_DIRECTORY: Optional. A Cloud Storage directory for error information about the import. Leave this field empty to let Discovery for Media automatically create a temporary directory (recommended).
      • dataSchema: For the dataSchema property, use the value document (the default).

      We recommend you don't specify staging or error directories so that Discovery for Media can automatically create a Cloud Storage bucket with new staging and error directories. These are created in the same region as the BigQuery dataset, and are unique to each import (which prevents multiple import jobs from staging data to the same directory, and potentially re-importing the same data). After three days, the bucket and directories are automatically deleted to reduce storage costs.

      An automatically created bucket name includes the project ID, bucket region, and data schema name, separated by underscores. The automatically created directories are called staging or errors, with a number appended (for example, staging2345 or errors5678).

      If you specify directories, the Cloud Storage bucket must be in the same region as the BigQuery dataset, or the import fails. Provide the staging and error directories in the format gs://<bucket>/<folder>/, and make sure that they are different directories. For an example request that specifies these directories, see the sketch after these steps.

      {
        "bigquerySource": {
          "projectId":"PROJECT_ID",
          "datasetId":"DATASET_ID",
          "tableId":"TABLE_ID",
          "dataSchema":"document"
        }
      }
    2. Import your documents to Discovery for Media by making a POST request to the Documents:import REST method, providing the name of the data file (here, shown as input.json).

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json; charset=utf-8" -d @./input.json \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/documents:import"

      You can check the status programmatically using the API. You should receive a response object that looks something like this:

      {
        "name": "projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456",
        "done": false
      }

      The name field is the ID of the operation object. To request the status of this operation, use the name value returned by the import method in the request URL, and poll until the done field returns as true:

      curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456"

      When the operation completes, the returned object has a done value of true, and includes a Status object similar to the following example:

      {
        "name": "projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456",
        "metadata": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1beta.ImportDocumentsMetadata",
          "createTime": "2022-01-01T03:33:33.000001Z",
          "updateTime": "2022-01-01T03:34:33.000001Z",
          "successCount": "2",
          "failureCount": "1"
        },
        "done": true,
        "response": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1beta.ImportDocumentsResponse"
        },
        "errorConfig": {
          "gcsPrefix": "gs://error-bucket/error-directory"
        }
      }

      You can inspect the files in the error directory in Cloud Storage to see if errors occurred during the import.
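      If you specify your own staging and error directories instead of letting Discovery for Media create them, the input file might look like the following sketch. The gcsStagingDir field name is an assumption based on the BigQuerySource object, and the bucket paths are placeholders:

      {
        "bigquerySource": {
          "projectId":"PROJECT_ID",
          "datasetId":"DATASET_ID",
          "tableId":"TABLE_ID",
          "gcsStagingDir":"gs://my-bucket/staging/",
          "dataSchema":"document"
        },
        "errorConfig": {
          "gcsPrefix":"gs://my-bucket/errors/"
        }
      }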

Set up access to your BigQuery dataset

To set up access when your BigQuery dataset is in a different project than your Discovery Engine service, complete the following steps.

  1. Open the IAM page in the Google Cloud console.

    Open the IAM page

  2. Select your Discovery Engine project.

  3. Select the Include Google-provided role grants checkbox.

  4. Find the service account with the name Discovery Engine Service Account.

    If you have not previously initiated an import operation with Discovery Engine, this service account might not be listed. If you do not see this service account, return to the import task and initiate the import. When it fails due to permission errors, return here and complete this task.

  5. Copy the identifier for the service account, which looks like an email address (for example, service-525@gcp-sa-discoveryengine.iam.gserviceaccount.com).

  6. Switch to your BigQuery project (on the same IAM & Admin page) and click Grant Access.

  7. For New principals, enter the identifier for the Discovery Engine service account and select the BigQuery > BigQuery Data Viewer role.

  8. Click Add another role and select BigQuery > BigQuery Job User.

  9. Click Save.
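If you prefer the command line, the following sketch grants the same roles with gcloud. The service account address is a placeholder; use the value that you copied in step 5:

gcloud projects add-iam-policy-binding BIGQUERY_PROJECT_ID \
  --member="serviceAccount:service-525@gcp-sa-discoveryengine.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding BIGQUERY_PROJECT_ID \
  --member="serviceAccount:service-525@gcp-sa-discoveryengine.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"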

For more about BigQuery access, see Controlling access to datasets in the BigQuery documentation.

Import documents from Cloud Storage

To import documents in JSON format, create one or more JSON files that contain the documents that you want to import, and upload them to Cloud Storage. From there, you can import them to Discovery for Media.

For an example of the JSON document format, see Document JSON data format.

For help with uploading files to Cloud Storage, see Upload objects.
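For example, uploading the files with the gcloud CLI might look like the following sketch (the bucket and file names are placeholders):

# Upload one or more JSON document files to a Cloud Storage bucket.
gcloud storage cp ./documents-0001.json ./documents-0002.json gs://my-media-documents/import/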

  1. If your Cloud Storage dataset is in another project, configure the required permissions so that Discovery for Media can access the Cloud Storage dataset. Learn more.

  2. Import your documents to Discovery for Media.

    Console

    1. Go to the Discovery Engine Data page in the Google Cloud console.

      Go to the Data page
    2. Click Import to open the Import data panel.
    3. For Data type, select Media catalog.
    4. For Source of data, select Cloud Storage.
    5. Enter the Cloud Storage location of your data.
    6. Click Import.

    curl

    1. Create a data file for the input parameters for the import. Use the GcsSource object to point to your Cloud Storage bucket.

      You can provide multiple files, or just one; this example uses two files.

      • INPUT_FILE: A file or files in Cloud Storage containing your documents.
      • ERROR_DIRECTORY: A Cloud Storage directory for error information about the import.

      The input file values must be in the format gs://<bucket>/<path-to-file>. The error directory must be in the format gs://<bucket>/<folder>/. If the error directory does not exist, Discovery for Media creates it. The bucket must already exist.

      {
        "gcsSource": {
          "inputUris": ["INPUT_FILE_1", "INPUT_FILE_2"]
        },
        "errorConfig": {
          "gcsPrefix": "ERROR_DIRECTORY"
        }
      }
    2. Import your documents to Discovery for Media by making a POST request to the Documents:import REST method, providing the name of the data file (here, shown as input.json).

      curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json; charset=utf-8" -d @./input.json \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/documents:import"

      The easiest way to check the status of your import operation is to use the Activity Status panel on the Google Cloud console Data page.

      You can also check the status programmatically using the API. You should receive a response object that looks something like this:

      {
        "name": "projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456",
        "done": false
      }

      The name field is the ID of the operation object. To request the status of this operation, use the name value returned by the import method in the request URL, and poll until the done field returns as true (see the polling sketch after these steps):

      curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456"

      When the operation completes, the returned object has a done value of true, and includes a Status object similar to the following example:

      {
        "name": "projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456",
        "metadata": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1beta.ImportDocumentsMetadata",
          "createTime": "2022-01-01T03:33:33.000001Z",
          "updateTime": "2022-01-01T03:34:33.000001Z",
          "successCount": "2",
          "failureCount": "1"
        },
        "done": true,
        "response": {
          "@type": "type.googleapis.com/google.cloud.discoveryengine.v1beta.ImportDocumentsResponse"
        },
        "errorConfig": {
          "gcsPrefix": "gs://error-bucket/error-directory"
        }
      }

      You can inspect the files in the error directory in Cloud Storage to see what kind of errors occurred during the import.
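      If you want to poll the operation from a script instead of checking it manually, the following sketch uses curl and jq (this assumes jq is installed; the operation name is a placeholder taken from the earlier response):

      OPERATION="projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/operations/import-documents-123456"

      # Poll the long-running operation until the done field is true.
      until curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://discoveryengine.googleapis.com/v1beta/${OPERATION}" | jq -e '.done == true' > /dev/null; do
        echo "Import still running..."
        sleep 60
      done
      echo "Import finished."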

Set up access to your Cloud Storage data

To set up access when your Cloud Storage data is in a different project than your Discovery Engine service, complete the following steps.

  1. Open the IAM page in the Google Cloud console.

    Open the IAM page

  2. Select your Discovery Engine project.

  3. Select the Include Google-provided role grants checkbox.

  4. Find the service account with the name Discovery Engine Service Account.

    If you have not previously initiated an import operation with Discovery Engine, this service account might not be listed. If you do not see this service account, return to the import task and initiate the import. When it fails due to permission errors, return here and complete this task.

  5. Copy the identifier for the service account, which looks like an email address (for example, service-525@gcp-sa-discoveryengine.iam.gserviceaccount.com).

  6. Switch to your Cloud Storage project (on the same IAM & Admin page) and click Grant Access.

  7. For New principals, enter the identifier for the Discovery Engine service account and select the Cloud Storage > Storage Admin role.

  8. Click Save.

For more about Cloud Storage access, see Overview of access control in the Cloud Storage documentation.

Import documents inline

You import your documents to Discovery for Media inline by making a POST request to the Documents:import REST method, using the InlineSource object to specify your data.

Provide an entire document on a single line. Each document should be on its own line.

For an example of the JSON document format, see Document JSON data format.

  1. Create the JSON file for your documents and call it ./data.json:

    {
      "inlineSource": {
        "documents": [
          { DOCUMENT_1 },
          { DOCUMENT_2 }
        ]
      }
    }
    
  2. Call the POST method:

    curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data @./data.json \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/documents:import"

Document JSON examples

The following examples show Document entries in JSON format.

Provide an entire document on a single line. Each document should be on its own line.

Minimum required fields:

{
   "id": "sample-01",
   "schemaId": "default_schema",
   "jsonData": "{\"title\":\"Test document title\",\"categories\":[\"sports > clip\",\"sports > highlight\"],\"uri\":\"http://www.example.com\",\"media_type\":\"sports-game\",\"availability_start_time\":\"2022-08-26T23:00:17Z\"}"  
}

Complete object:

{
   "id": "child-sample-0",
   "schemaId": "default_schema",
   "jsonData": "{\"title\":\"Test document title\",\"description\":\"Test document description\",\"language_code\":\"en-US\",\"categories\":[\"sports > clip\",\"sports > highlight\"],\"uri\":\"http://www.example.com\",\"images\":[{\"uri\":\"http://example.com/img1\",\"name\":\"image_1\"}],\"media_type\":\"sports-game\",\"in_languages\":[\"en-US\"],\"country_of_origin\":\"US\",\"content_index\":0,\"persons\":[{\"name\":\"sports person\",\"role\":\"player\",\"rank\":0,\"uri\":\"http://example.com/person\"}],\"organizations\":[{\"name\":\"sports team\",\"role\":\"team\",\"rank\":0,\"uri\":\"http://example.com/team\"}],\"hash_tags\":[\"tag1\"],\"filter_tags\":[\"filter_tag\"],\"production_year\":1900,\"duration\":\"100s\",\"content_rating\":[\"PG-13\"],\"aggregate_ratings\":[{\"rating_source\":\"imdb\",\"rating_score\":4.5,\"rating_count\":1250}],\"availability_start_time\":\"2022-08-26T23:00:17Z\"}"
}
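For reference, a minimal document of this shape placed in an inline import request body (./data.json) would look like the following sketch; the id and field values are illustrative:

{
  "inlineSource": {
    "documents": [
      {
        "id": "sample-01",
        "schemaId": "default_schema",
        "jsonData": "{\"title\":\"Test document title\",\"categories\":[\"sports > clip\",\"sports > highlight\"],\"uri\":\"http://www.example.com\",\"media_type\":\"sports-game\",\"availability_start_time\":\"2022-08-26T23:00:17Z\"}"
      }
    ]
  }
}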

Keep your datastore up to date

Discovery for Media relies on having current document information to provide you with the best results. We recommend that you import your documents daily to keep your datastore current. You can use Cloud Scheduler to schedule imports, or choose an automatic scheduling option when you import data using the Google Cloud console.
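For example, a daily Cloud Scheduler job that calls the Documents:import method might look like the following sketch. The job name, location, schedule, service account, and request body file are placeholders; the body file contains the same JSON that you would use for a manual import:

gcloud scheduler jobs create http daily-media-document-import \
  --location="us-central1" \
  --schedule="0 2 * * *" \
  --time-zone="Etc/UTC" \
  --http-method=POST \
  --uri="https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_NUMBER/locations/global/dataStores/default_data_store/branches/0/documents:import" \
  --headers="Content-Type=application/json; charset=utf-8" \
  --message-body-from-file=./input.json \
  --oauth-service-account-email="SERVICE_ACCOUNT_EMAIL"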

You can update only new or changed document items, or you can import the entire datastore. If you import documents that are already in your datastore, they are not added again. Any item that has changed is updated.

To update a single item, see Update document information.

Batch update

You can use the import method to batch update your datastore. You do this the same way you do the initial import; follow the steps in Import documents.

What's next