Use managed datasets

Using Vertex AI managed datasets with custom training is optional; you can write your training application to ingest training data directly.

Before you can use a managed dataset in your training application, you must create your dataset. You must create the dataset and the training pipeline that you use for training in the same region. You must use a region where Dataset resources are available.

Access a dataset from your training application

When you create a custom training pipeline, you can specify that your training application uses a Vertex AI dataset.

At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.

  • AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include: jsonl, csv, or bigquery.
  • AIP_TRAINING_DATA_URI: The location that your training data is stored at.
  • AIP_VALIDATION_DATA_URI: The location that your validation data is stored at.
  • AIP_TEST_DATA_URI: The location that your test data is stored at.

If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values refer to Cloud Storage URIs, like gs://bucket_name/path/training-*. To keep the size of each data file relatively small, Vertex AI splits your dataset into multiple files. Because your training, validation, or test data may be split into multiple files, the URIs are provided in wildcard format.

Learn more about downloading objects using the Cloud Storage code samples.

If your AIP_DATA_FORMAT is bigquery, the data URI values refer to BigQuery URIs, like bq://project.dataset.table.

Learn more about paging through BigQuery data.

Dataset format

Use the following sections to learn more about how Vertex AI formats your data when passing a dataset to your training application.

Image datasets

Image datasets are passed to your training application in JSONL format. Select the tab for your dataset's objective, to learn more about how Vertex AI formats your dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
   },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels - Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSONL



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...

Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.


{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotations": [
    {
      "displayName": "LABEL1",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "flower_type"
      }
    },
    {
      "displayName": "LABEL2",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "image_shot_type"
      }
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels - Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSONL



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...

Object detection

Vertex AI uses the following publicly accessible schema when exporting an object detection dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "boundingBoxAnnotations": [
    {
      "displayName": "OBJECT1_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
    },
    {
      "displayName": "OBJECT2_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX"
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "test/train/validation"
  }
}

Field notes:

  • imageGcsUri: The Cloud Storage URI of this image.
  • annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
  • dataItemResourceLabels - Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSONL



{"imageGcsUri": "gs://bucket/filename1.jpeg", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.3", "yMin": "0.3", "xMax": "0.7", "yMax": "0.6"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.8", "yMin": "0.2", "xMax": "1.0", "yMax": "0.4"},{"displayName": "Salad", "xMin": "0.0", "yMin": "0.0", "xMax": "1.0", "yMax": "1.0"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "boundingBoxAnnotations": [{"displayName": "Baked goods", "xMin": "0.5", "yMin": "0.7", "xMax": "0.8", "yMax": "0.8"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.tiff", "boundingBoxAnnotations": [{"displayName": "Salad", "xMin": "0.1", "yMin": "0.2", "xMax": "0.8", "yMax": "0.9"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...

Tabular datasets

Vertex AI passes tabular data to your training application in CSV format or as a URI to a BigQuery table or view. For more information about the data source format and requirements, see Preparing your import source. Refer to the dataset in Google Cloud Console for more information about the dataset schema.

Text datasets

Text datasets are passed to your training application in JSONL format. Select the tab for your dataset's objective, to learn more about how Vertex AI formats your dataset.

Single-label classification

Vertex AI uses the following publicly accessible schema when exporting a single-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

Multi-label classification

Vertex AI uses the following publicly accessible schema when exporting a multi-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "classificationAnnotations": [{
    "displayName": "label1"
    },{
    "displayName": "label2"
  }],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "classificationAnnotations": [{
    "displayName": "label2"
    },{
    "displayName": "label3"
  }],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

Entity extraction

Vertex AI uses the following publicly accessible schema when exporting an entity extraction dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml.

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text"|"textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}

Sentiment analysis

Vertex AI uses the following publicly accessible schema when exporting a sentiment analysis dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_text_sentiment_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
},
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

Video datasets

Video datasets are passed to your training application in JSONL format. Select the tab for your dataset's objective, to learn more about how Vertex AI formats your dataset.

Action recognition

Vertex AI uses the following publicly accessible schema when exporting an action recognition dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
  "videoGcsUri': "gs://bucket/filename.ext",
  "timeSegments": [{
    "startTime": "start_time_of_fully_annotated_segment",
    "endTime": "end_time_of_segment"}],
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "ml_use": "train|test"
  }
}

Note: The time segments here are used to calculate the timestamps of the actions. startTime and endTime of timeSegmentAnnotations can be equal, and corresponds to the key frame of the action.

Example JSONL



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}}
...

Classification

Vertex AI uses the following publicly accessible schema when exporting a classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"timeSegmentAnnotations": [{
		"displayName": "LABEL",
		"startTime": "start_time_of_segment",
		"endTime": "end_time_of_segment"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSONL - Video classification:



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

Object tracking

Vertex AI uses the following publicly accessible schema when exporting an object tracking dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

Each data item in your exported dataset uses the following format. This example includes line breaks for readability.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"TemporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding box",
		"xMax": "rightmost_coordinate_of_the_bounding box",
		"yMin": "topmost_coordinate_of_the_bounding box",
		"yMax": "bottommost_coordinate_of_the_bounding box",
		"timeOffset": "timeframe_object-detected"
                "instanceId": "instance_of_object
                "annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSONL



{'videoGcsUri': 'gs://demo-data/video1.mp4', 'temporal_bounding_box_annotations': [{'displayName': 'horse', 'instance_id': '-1', 'time_offset': '4.000000s', 'xMin': '0.668912', 'yMin': '0.560642', 'xMax': '1.000000', 'yMax': '1.000000'}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{'videoGcsUri': 'gs://demo-data/video2.mp4', 'temporal_bounding_box_annotations': [{'displayName': 'horse', 'instance_id': '-1', 'time_offset': '71.000000s', 'xMin': '0.679056', 'yMin': '0.070957', 'xMax': '0.801716', 'yMax': '0.290358'}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

What's next