Exporting metadata and annotations from a dataset

Vertex AI lets you export the metadata and annotation sets from a Dataset resource. This is useful if you want to keep a record of a specific set of annotation changes, additions, or deletions.

When you export a Dataset, Vertex AI creates one or more JSONL files that contain your Dataset's metadata and annotations and saves these JSONL files to a Cloud Storage directory of your choice.

You can export image, text, and video Dataset resources. You cannot export tabular Dataset resources.

Exporting a Dataset does not create additional copies of the image, text, or video data that your Dataset is based on. The JSONL files created by the export process reference the original Cloud Storage URIs that you specified when you imported the data into the Dataset.

Exporting a Dataset using the Cloud Console or the API

You can use the Google Cloud Console or the Vertex AI API to export a Dataset. Follow the steps in the corresponding tab:

Console

  1. In the Cloud Console, in the Vertex AI section, go to the Datasets page.

  2. In the Region drop-down list, select the location where the Dataset is stored.

  3. Find the row of the Dataset. You can export metadata and annotations for all annotation sets or for a specific annotation set:

    • If you want to export metadata and annotations for all of the Dataset's annotation sets, then click View more and then click Export dataset.

      This tells Vertex AI to create a set of JSONL files for each annotation set.

    • If you want to export metadata and annotations for a specific annotation set, then do the following:

      1. Click Expand node to show rows for each of the Dataset's annotation sets.

      2. In the row of the annotation set that you want to export, click View more and then click Export annotation set.

      This tells Vertex AI to create a set of JSONL files for the annotation set that you specify.

  4. In the Export data dialog, enter a Cloud Storage directory where you want Vertex AI to save the exported JSONL files. Click Export.

REST & CMD LINE

Getting the Dataset's ID

To export a Dataset, you must know the numerical ID of the Dataset. If you know the display name of the Dataset but not the ID, expand the following section to learn how to get the ID using the API:

Get a Dataset's ID from its display name

Before using any of the request data below, make the following replacements:

  • LOCATION: The region where the Dataset is stored. For example, us-central1.

  • PROJECT_ID: Your Google Cloud project ID.

  • DATASET_DISPLAY_NAME: The display name of the Dataset.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME"

PowerShell

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME" | Select-Object -Expand Content

The following example response has been truncated with ... to emphasize where you can find your Dataset's ID: it is the number that takes the place of DATASET_ID.

{
  "datasets": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID",
      "displayName": "DATASET_DISPLAY_NAME",
      ...
    }
  ]
}

Alternatively, you can get the Dataset's ID from the Cloud Console: Go to the Vertex AI Datasets page and find the number in the ID column.
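
If you use the Vertex AI SDK for Python, you can look up the ID programmatically instead. The following is a minimal sketch that assumes an image Dataset and placeholder project, region, and display name values; for text or video Datasets, substitute aiplatform.TextDataset or aiplatform.VideoDataset.

from google.cloud import aiplatform

# Placeholder values -- replace with your own project, region, and display name.
PROJECT_ID = "my-project"
LOCATION = "us-central1"
DATASET_DISPLAY_NAME = "my-image-dataset"

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# List the image Datasets in the project and match on display name.
# The numerical Dataset ID is the last segment of the resource name.
for dataset in aiplatform.ImageDataset.list():
    if dataset.display_name == DATASET_DISPLAY_NAME:
        print(dataset.resource_name)  # projects/.../locations/.../datasets/DATASET_ID
        print(dataset.name)           # DATASET_ID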

Exporting one or more annotation sets

Before using any of the request data below, make the following replacements:

  • LOCATION: The region where the Dataset is stored. For example, us-central1.

  • PROJECT_ID: Your Google Cloud project ID.

  • DATASET_ID: The numerical ID of the Dataset.

  • EXPORT_DIRECTORY: Cloud Storage URI (beginning with gs://) of a directory where you want Vertex AI to save the exported JSONL files. This must be in a Cloud Storage bucket that you have access to, but the directory does not need to exist yet.

  • FILTER: A filter string that determines which annotation sets get exported.

    • If you want to export metadata and annotations for all of the Dataset's annotation sets, replace FILTER with an empty string (or omit the annotationsFilter field from the request body entirely). This tells Vertex AI to create a set of JSONL files for each annotation set.

    • If you want to export metadata and annotations for a specific annotation set, replace FILTER with the following:

      labels.aiplatform.googleapis.com/annotation_set_name=ANNOTATION_SET_ID
      

      This tells Vertex AI to create a set of JSONL files for the annotation set with the numerical ID ANNOTATION_SET_ID.

      To find the numerical ID of the annotation set that you want to specify, view the annotation set in the Cloud Console and look for the value following annotationSetId in the URL.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export

Request JSON body:

{
  "exportConfig": {
    "gcsDestination": {
      "outputUriPrefix": "EXPORT_DIRECTORY"
    },
    "annotationsFilter": "FILTER"
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.ExportDataOperationMetadata",
    "genericMetadata": {
      "createTime": "2021-02-17T00:54:58.827429Z",
      "updateTime": "2021-02-17T00:54:58.827429Z"
    },
    "gcsOutputDirectory": "EXPORT_DIRECTORY/export-data-DATASET_DISPLAY_NAME-2021-02-17T00:54:58.734772Z"
  }
}

Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
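
If you prefer to make this call from Python instead of curl or PowerShell, the following is a minimal sketch that uses the google-cloud-aiplatform client library. The project, region, Dataset ID, export directory, and annotation set ID are placeholders; leave annotations_filter unset to export every annotation set, as described earlier.

from google.cloud import aiplatform_v1

# Placeholder values -- replace with your own project, region, Dataset ID,
# export directory, and (optionally) annotation set ID.
PROJECT_ID = "my-project"
LOCATION = "us-central1"
DATASET_ID = "1234567890"
EXPORT_DIRECTORY = "gs://my-bucket/exports"
ANNOTATION_SET_ID = "9876543210"

client = aiplatform_v1.DatasetServiceClient(
    client_options={"api_endpoint": f"{LOCATION}-aiplatform.googleapis.com"}
)

export_config = aiplatform_v1.ExportDataConfig(
    gcs_destination=aiplatform_v1.GcsDestination(output_uri_prefix=EXPORT_DIRECTORY),
    # Same filter string as in the REST example; omit this field to export
    # all of the Dataset's annotation sets.
    annotations_filter=(
        "labels.aiplatform.googleapis.com/annotation_set_name="
        f"{ANNOTATION_SET_ID}"
    ),
)

# export_data starts a long-running operation; result() blocks until it completes.
operation = client.export_data(
    name=f"projects/{PROJECT_ID}/locations/{LOCATION}/datasets/{DATASET_ID}",
    export_config=export_config,
)
response = operation.result()
print(response.exported_files)  # URIs of the exported JSONL files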

Understanding the exported files

Within the export directory that you specified in the previous section, Vertex AI creates a new directory labeled with the Dataset's display name and a timestamp; for example, export-data-DATASET_DISPLAY_NAME-2021-02-17T00:54:58.734772Z. Within this directory, you can find a subdirectory for each annotation set that you exported.

For each annotation set, you can find one or more JSONL files. Each row of each JSONL file represents a data item from the annotation set. Each data item may contain metadata and annotations that you specified when you imported the data to Vertex AI, as well as metadata and annotations that you added after you imported the data. For example, if you requested data labeling for your Dataset or if you manually added labels or annotations to the Dataset in the Cloud Console, then this information is included in the exported files.

If you export multiple annotation sets, the same data items might appear in multiple JSONL files. For example, if you export an image Dataset with multiple annotation sets, one JSONL file might contain a data item with a single-label classification annotation; another JSONL file for a different annotation set might contain the same data item, but with an object detection annotation instead.

The format of the exported files matches the format of the JSONL import files that you can use to import data into Vertex AI. For example, if you export an annotation set for single-label image classification, then each line of each JSONL file is formatted according to the gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml schema file, as described in Preparing image data.
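
To spot-check an export, you can read the JSONL files back from Cloud Storage and parse each line. The following is a minimal sketch that uses the google-cloud-storage library; the bucket name and export prefix are placeholders, and the .jsonl suffix check is only an assumption about the exported file names.

import json

from google.cloud import storage

# Placeholder values -- replace with your bucket and the export subdirectory
# that Vertex AI created (shown in the operation response above).
BUCKET_NAME = "my-bucket"
EXPORT_PREFIX = "exports/export-data-my-dataset-2021-02-17T00:54:58.734772Z/"

client = storage.Client()

# Each file under the prefix belongs to one annotation set; each line of a
# file is one data item with its metadata and annotations.
for blob in client.list_blobs(BUCKET_NAME, prefix=EXPORT_PREFIX):
    if not blob.name.endswith(".jsonl"):  # assumption about the file suffix
        continue
    for line in blob.download_as_text().splitlines():
        data_item = json.loads(line)
        print(blob.name, sorted(data_item.keys()))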

To learn more about the different JSONL formats for different types of annotation sets, view the following guides:

What's next