Export metadata and annotations from a dataset

Vertex AI lets you export the metadata and annotation sets from a Dataset resource. This capability can be useful if you want to maintain a record of a specific collection of annotation changes, additions, or deletions.

When you export a Dataset, Vertex AI creates one or more JSON Lines files that contain your Dataset's metadata and annotations, and saves these JSON Lines files to a Cloud Storage directory of your choice.

You can export image, text, and video Dataset resources. You cannot export tabular Dataset resources.

Exporting a Dataset doesn't create additional copies of the image, text, or video data that your Dataset is based on. The JSON Lines files created by the export process include the original Cloud Storage URIs that you specified when you imported that data into the Dataset.

Export a Dataset using the Google Cloud console or the API

You can use the Google Cloud console or the Vertex AI API to export a Dataset. Follow the steps in the corresponding tab:

Console

  1. In the Google Cloud console, in the Vertex AI section, go to the Datasets page.

  2. In the Region drop-down list, select the location where the Dataset is stored.

  3. Find the row of the Dataset. You can export metadata and annotations for all annotation sets or for a specific annotation set:

    • If you want to export metadata and annotations for all of the Dataset's annotation sets, then click View more and then click Export dataset.

      This tells Vertex AI to create a set of JSON Lines files for each annotation set.

    • If you want to export metadata and annotations for a specific annotation set, then do the following:

      1. Click Expand node to show rows for each of the Dataset's annotation sets.

      2. In the row of the annotation set that you want to export, click View more and then click Export annotation set.

      This tells Vertex AI to create a set of JSON Lines files for the annotation set that you specified.

  4. In the Export data dialog, enter a Cloud Storage directory where you want Vertex AI to save the exported JSON Lines files. Click Export.

REST

Get the Dataset's ID

To export a Dataset, you must know the numerical ID of the Dataset. If you know the display name of the Dataset but not the ID, the following section shows how to get the ID using the API:

Get a Dataset's ID from its display name

Before using any of the request data, make the following replacements:

  • LOCATION: The location where the Dataset is stored. For example, us-central1.

  • PROJECT_ID: Your project ID.

  • DATASET_DISPLAY_NAME: The display name of the Dataset.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME"

PowerShell

Execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets?filter=displayName=DATASET_DISPLAY_NAME" | Select-Object -Expand Content

The following example response has been truncated with ... to emphasize where you can find your Dataset's ID: it is the number that takes the place of DATASET_ID.

{
  "datasets": [
    {
      "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID",
      "displayName": "DATASET_DISPLAY_NAME",
      ...
    }
  ]
}
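
As a rough illustration of reading the ID out of that response, the following Python sketch takes the final path segment of each matching name field. The function name dataset_ids_by_display_name and the sample values are hypothetical and not part of the Vertex AI API; the sketch returns a list in case more than one Dataset matches the display name.

```python
import json


def dataset_ids_by_display_name(response_text, display_name):
    """Return the IDs of the Datasets in a list response that match display_name."""
    payload = json.loads(response_text)
    ids = []
    for dataset in payload.get("datasets", []):
        if dataset.get("displayName") == display_name:
            # "name" has the form projects/.../locations/.../datasets/DATASET_ID,
            # so the ID is the last path segment.
            ids.append(dataset["name"].rsplit("/", 1)[-1])
    return ids


# Hypothetical response, shaped like the truncated example above.
sample = json.dumps({
    "datasets": [
        {
            "name": "projects/123/locations/us-central1/datasets/4567890",
            "displayName": "flowers",
        }
    ]
})
print(dataset_ids_by_display_name(sample, "flowers"))  # ['4567890']
```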

Alternatively, you can get the Dataset's ID from the Google Cloud console: Go to the Vertex AI Datasets page and find the number in the ID column.

Export one or more annotation sets

Before using any of the request data, make the following replacements:

  • LOCATION: The location where the Dataset is stored. For example, us-central1.

  • PROJECT_ID: Your project ID.

  • DATASET_ID: The numerical ID of the Dataset.

  • EXPORT_DIRECTORY: Cloud Storage URI (beginning with gs://) of a directory where you want Vertex AI to save the exported JSON Lines files. This must be in a Cloud Storage bucket that you have access to, but the directory does not need to exist yet.

  • FILTER: A filter string that determines which annotation sets get exported.

    • If you want to export metadata and annotations for all of the Dataset's annotation sets, replace FILTER with an empty string (or omit the annotationsFilter field from the request body entirely). This tells Vertex AI to create a set of JSON Lines files for each annotation set.

    • If you want to export metadata and annotations for a specific annotation set, replace FILTER with the following:

      labels.aiplatform.googleapis.com/annotation_set_name=ANNOTATION_SET_ID
      

      This tells Vertex AI to create a set of JSON Lines files for the annotation set with the numerical ID ANNOTATION_SET_ID.

      To find the numerical ID of the annotation set that you want to specify, view the annotation set in the Google Cloud console and look for the value following annotationSetId in the URL.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export

Request JSON body:

{
  "exportConfig": {
    "gcsDestination": {
      "outputUriPrefix": "EXPORT_DIRECTORY"
    },
    "annotationsFilter": "FILTER"
  }
}
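
If you prefer to generate request.json rather than edit it by hand, a minimal Python sketch such as the following could build the same body. The helper name build_export_request and the example bucket and annotation set ID are assumptions made for illustration; the field names and the filter label come from the request body and the FILTER description above. The sketch omits annotationsFilter when no annotation set is given, which exports all annotation sets.

```python
import json

# Label key used in the FILTER value described above.
ANNOTATION_SET_LABEL = "labels.aiplatform.googleapis.com/annotation_set_name"


def build_export_request(export_directory, annotation_set_id=None):
    """Build the export request body as a dict (hypothetical helper)."""
    if not export_directory.startswith("gs://"):
        raise ValueError("export_directory must be a Cloud Storage URI beginning with gs://")
    export_config = {"gcsDestination": {"outputUriPrefix": export_directory}}
    if annotation_set_id is not None:
        # Restrict the export to one annotation set.
        export_config["annotationsFilter"] = f"{ANNOTATION_SET_LABEL}={annotation_set_id}"
    return {"exportConfig": export_config}


# Example values are placeholders, not real resources.
body = build_export_request("gs://my-bucket/exports", annotation_set_id="1234567890")
print(json.dumps(body, indent=2))
```

Writing the result of json.dumps to request.json gives a file that can be passed to the commands in the curl and PowerShell sections.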

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:export" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.ExportDataOperationMetadata",
    "genericMetadata": {
      "createTime": "2021-02-17T00:54:58.827429Z",
      "updateTime": "2021-02-17T00:54:58.827429Z"
    },
    "gcsOutputDirectory": "EXPORT_DIRECTORY/export-data-DATASET_DISPLAY_NAME-2021-02-17T00:54:58.734772Z"
  }
}

This request starts a long-running operation that takes time to complete. The response includes the operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
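
The following Python sketch shows one way to pull the useful fields out of such a response. The function name summarize_export_operation and the sample values are hypothetical. The done field is the standard completion flag on long-running operations; the sketch assumes it is absent until the operation finishes, as in the example response above.

```python
import json


def summarize_export_operation(response_text):
    """Extract the operation name, operation ID, and output directory."""
    operation = json.loads(response_text)
    name = operation["name"]
    return {
        "operation_name": name,
        # The operation ID is the last path segment of the name.
        "operation_id": name.rsplit("/", 1)[-1],
        "output_directory": operation.get("metadata", {}).get("gcsOutputDirectory"),
        # Treated as not finished when the field is missing.
        "done": operation.get("done", False),
    }


# Hypothetical response, shaped like the example above.
sample_response = json.dumps({
    "name": "projects/123/locations/us-central1/datasets/456/operations/789",
    "metadata": {
        "gcsOutputDirectory": "gs://my-bucket/exports/export-data-flowers-2021-02-17T00:54:58.734772Z"
    },
})
summary = summarize_export_operation(sample_response)
print(summary["operation_id"], summary["output_directory"])
```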

Exported files explained

Within the export directory that you specified in the previous section, Vertex AI creates a new directory labeled with the Dataset's display name and a timestamp; for example, export-data-DATASET_DISPLAY_NAME-2021-02-17T00:54:58.734772Z. Within this directory, you can find a subdirectory for each annotation set that you exported.
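
If you need to recover the display name and export time from such a directory name, for example to find the most recent export, a sketch along the following lines may help. It assumes that the name always follows the export-data-DISPLAY_NAME-TIMESTAMP pattern shown above, with a UTC timestamp ending in Z; the function name parse_export_directory is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed pattern: export-data-<display name>-<UTC timestamp ending in Z>.
_EXPORT_DIR = re.compile(
    r"^export-data-(?P<display_name>.+)-"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?Z)$"
)


def parse_export_directory(directory_name):
    """Split an export directory name into (display_name, UTC datetime)."""
    match = _EXPORT_DIR.match(directory_name)
    if match is None:
        raise ValueError(f"unexpected export directory name: {directory_name!r}")
    stamp = match.group("timestamp")
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in stamp else "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return match.group("display_name"), created


# The display name may itself contain hyphens.
name, created = parse_export_directory("export-data-my-flowers-2021-02-17T00:54:58.734772Z")
print(name, created.isoformat())
```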

For each annotation set, you can find one or more JSON Lines files. Each row of each JSON Lines file represents a data item from the annotation set. Each data item may contain metadata and annotations that you specified when you imported the data to Vertex AI, as well as metadata and annotations that you added after importing the data. For example, if you requested data labeling for your Dataset or if you manually added labels or annotations to the Dataset in the Google Cloud console, then this information is included in the exported files.
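
As an illustration of working with these files, the following Python sketch parses JSON Lines text and counts labels. The field names imageGcsUri and classificationAnnotation.displayName are assumptions based on the single-label image classification format; other data types and objectives use different fields. The function names and sample data are hypothetical.

```python
import json
from collections import Counter


def iter_data_items(jsonl_text):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in jsonl_text.split("\n"):
        if line.strip():
            yield json.loads(line)


def count_labels(items):
    """Count classification labels; items without an annotation are skipped."""
    counts = Counter()
    for item in items:
        annotation = item.get("classificationAnnotation")
        if annotation:
            counts[annotation["displayName"]] += 1
    return counts


# Hypothetical file contents: three labeled items and one unlabeled item.
sample_jsonl = "\n".join([
    json.dumps({"imageGcsUri": "gs://my-bucket/img1.jpg",
                "classificationAnnotation": {"displayName": "daisy"}}),
    json.dumps({"imageGcsUri": "gs://my-bucket/img2.jpg",
                "classificationAnnotation": {"displayName": "rose"}}),
    json.dumps({"imageGcsUri": "gs://my-bucket/img3.jpg",
                "classificationAnnotation": {"displayName": "daisy"}}),
    json.dumps({"imageGcsUri": "gs://my-bucket/img4.jpg"}),
])
label_counts = count_labels(iter_data_items(sample_jsonl))
print(dict(label_counts))  # {'daisy': 2, 'rose': 1}
```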

If you export multiple annotation sets, the same data items might appear in multiple JSON Lines files. For example, if you export an image Dataset with multiple annotation sets, one JSON Lines file might contain a data item with a single-label classification annotation; another JSON Lines file for a different annotation set might contain the same data item, but with an object detection annotation instead.
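
To see every annotation for a given data item in one place, you could group the exported lines by their Cloud Storage URI. The following Python sketch does this for image data under stated assumptions: the imageGcsUri, classificationAnnotation, and boundingBoxAnnotations field names are taken to match the image import formats, and the function name, annotation set names, and sample values are hypothetical.

```python
import json
from collections import defaultdict


def merge_by_uri(files_by_annotation_set, uri_key="imageGcsUri"):
    """Group annotations by data item URI across annotation sets.

    files_by_annotation_set maps an annotation set name to the JSON Lines
    text exported for that set.
    """
    merged = defaultdict(dict)
    for set_name, jsonl_text in files_by_annotation_set.items():
        for line in jsonl_text.split("\n"):
            if not line.strip():
                continue
            item = json.loads(line)
            # Keep everything except the URI, keyed by annotation set.
            merged[item[uri_key]][set_name] = {
                key: value for key, value in item.items() if key != uri_key
            }
    return dict(merged)


# The same image appears in two hypothetical annotation sets.
exports = {
    "classification": json.dumps({
        "imageGcsUri": "gs://my-bucket/img1.jpg",
        "classificationAnnotation": {"displayName": "daisy"},
    }),
    "detection": json.dumps({
        "imageGcsUri": "gs://my-bucket/img1.jpg",
        "boundingBoxAnnotations": [
            {"displayName": "daisy", "xMin": 0.1, "yMin": 0.2, "xMax": 0.5, "yMax": 0.6}
        ],
    }),
}
merged = merge_by_uri(exports)
print(sorted(merged["gs://my-bucket/img1.jpg"]))  # ['classification', 'detection']
```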

The format of the exported files matches the format of the JSON Lines import files that you can use to import data into Vertex AI. This format depends on the data type (image, tabular, text, video) and objective (such as object tracking, entity extraction, or classification). For example, if you export an annotation set for single-label image classification, then each line of each JSON Lines file is formatted according to the gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml schema file, as described in Preparing image data.

To learn more about the JSON Lines formats for different types of annotation sets, see the data preparation guide for your data type; for example, Preparing image data.

What's next