Use managed datasets

This page shows you how to use Vertex AI managed datasets to train your custom models. Managed datasets offer the following benefits:

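Manage your datasets in a central location.
Create labels and multiple annotation sets.
Create tasks for human labeling by using integrated data labeling.
Track lineage to models for governance and iterative development.
Compare model performance by training AutoML and custom models on the same datasets.
Generate data statistics and visualizations.
Automatically split data into training, test, and validation sets.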

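Before you begin

Before you can use a managed dataset in your training application, you must create the dataset. The dataset and the training pipeline that you use for training must be in the same region, and that region must be one where Dataset resources are available.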

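Access a dataset from your training application

When you create a custom training pipeline, you can specify that your training application uses a Vertex AI dataset.

At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.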

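AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include jsonl, csv, or bigquery.
AIP_TRAINING_DATA_URI: The BigQuery URI of your training data or the Cloud Storage URI of your training data files.
AIP_VALIDATION_DATA_URI: The BigQuery URI of your validation data or the Cloud Storage URI of your validation data files.
AIP_TEST_DATA_URI: The BigQuery URI of your test data or the Cloud Storage URI of your test data files.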
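
As a minimal sketch of how a training application might read these variables, the following Python collects them in one place. The DatasetEnv type and the read_dataset_env helper are hypothetical names used for illustration, not part of any Vertex AI library. A variable that is not set is returned as None.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DatasetEnv(NamedTuple):
    """Hypothetical container for the dataset metadata variables."""
    data_format: Optional[str]
    training_uri: Optional[str]
    validation_uri: Optional[str]
    test_uri: Optional[str]


def read_dataset_env(environ: Mapping[str, str] = os.environ) -> DatasetEnv:
    """Read the four AIP_* dataset variables; unset variables become None."""
    return DatasetEnv(
        data_format=environ.get("AIP_DATA_FORMAT"),
        training_uri=environ.get("AIP_TRAINING_DATA_URI"),
        validation_uri=environ.get("AIP_VALIDATION_DATA_URI"),
        test_uri=environ.get("AIP_TEST_DATA_URI"),
    )
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise outside a training container.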

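If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values refer to Cloud Storage URIs, such as gs://bucket_name/path/training-*. To keep each data file relatively small, Vertex AI splits your dataset into multiple files. Because your training, validation, or test data might be split across multiple files, the URIs are provided in wildcard format.

Learn more about downloading objects using the Cloud Storage code samples.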
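
Because the URI ends in a wildcard, a training application usually has to expand it into individual object names before it can read anything. The following sketch uses only the Python standard library to split such a URI into a bucket name and an object pattern, and to filter a list of object names against that pattern. The helper names are hypothetical, and the list of object names is assumed to come from whichever Cloud Storage client you use.

```python
import fnmatch
from typing import Iterable, List, Tuple


def split_gcs_pattern(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/training-*' into (bucket, object_pattern)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, pattern = uri[len(scheme):].partition("/")
    return bucket, pattern


def listing_prefix(pattern: str) -> str:
    """Return the literal part of the pattern before the first '*'.

    This prefix can be used to narrow an object listing before matching.
    """
    return pattern.split("*", 1)[0]


def matching_objects(pattern: str, object_names: Iterable[str]) -> List[str]:
    """Keep the object names that match the wildcard pattern, sorted.

    Note: fnmatch's '*' also matches '/' characters.
    """
    return sorted(n for n in object_names if fnmatch.fnmatchcase(n, pattern))
```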

If your AIP_DATA_FORMAT is bigquery, the data URI values refer to BigQuery URIs, such as bq://project.dataset.table.

Learn more about paging through BigQuery data.
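
A BigQuery client generally needs the project, dataset, and table as separate values. The following sketch shows one way to take a URI of the form bq://project.dataset.table apart. The BigQueryTable type and parse_bq_uri helper are hypothetical, and the sketch assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class BigQueryTable(NamedTuple):
    """Hypothetical container for the three parts of a BigQuery URI."""
    project: str
    dataset: str
    table: str


def parse_bq_uri(uri: str) -> BigQueryTable:
    """Parse 'bq://project.dataset.table' into its three parts.

    Assumption: no part contains a '.', so a plain split is enough.
    """
    scheme = "bq://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a BigQuery URI: {uri!r}")
    parts = uri[len(scheme):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected bq://project.dataset.table, got {uri!r}")
    return BigQueryTable(*parts)
```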

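Dataset format

Use the following sections to learn how Vertex AI formats your data when it passes a dataset to your training application.

Image datasets

Image datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn how Vertex AI formats your dataset.

Single-label classification

When exporting a single-label image classification dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.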

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
   },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

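Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines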



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
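
As an illustration of how a training application might consume these lines, the following sketch parses single-label records and groups them by their ml_use value. The function name group_by_ml_use and the "unassigned" fallback are assumptions made for this example. The field names are taken from the format shown above.

```python
import json
from typing import Dict, Iterable, List, Tuple

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def group_by_ml_use(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each ml_use value to a list of (imageGcsUri, label) pairs."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        item = json.loads(line)
        label = item["classificationAnnotation"]["displayName"]
        # "unassigned" is a hypothetical fallback for items without ml_use.
        ml_use = item.get("dataItemResourceLabels", {}).get(ML_USE_KEY, "unassigned")
        groups.setdefault(ml_use, []).append((item["imageGcsUri"], label))
    return groups
```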

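Multi-label classification

When exporting a multi-label image classification dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.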

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.


{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotations": [
    {
      "displayName": "LABEL1",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "flower_type"
      }
    },
    {
      "displayName": "LABEL2",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "image_shot_type"
      }
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

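Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines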



{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
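
Multi-label records are often converted to multi-hot vectors before training. The following sketch builds a sorted label vocabulary from the displayName values and encodes each image against it. The function name encode_multi_label is hypothetical, and the encoding is one common choice rather than something Vertex AI prescribes.

```python
import json
from typing import Iterable, List, Tuple


def encode_multi_label(
    lines: Iterable[str],
) -> Tuple[List[str], List[Tuple[str, List[int]]]]:
    """Return (vocabulary, [(imageGcsUri, multi_hot_vector), ...])."""
    items = [json.loads(line) for line in lines if line.strip()]
    # Collect every label that appears anywhere in the export.
    vocabulary = sorted(
        {a["displayName"] for item in items for a in item["classificationAnnotations"]}
    )
    index = {label: i for i, label in enumerate(vocabulary)}
    encoded = []
    for item in items:
        vector = [0] * len(vocabulary)
        for annotation in item["classificationAnnotations"]:
            vector[index[annotation["displayName"]]] = 1
        encoded.append((item["imageGcsUri"], vector))
    return vocabulary, encoded
```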

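Object detection

When exporting an object detection dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.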

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.



{
  "imageGcsUri": "gs://bucket/filename.ext",
  "boundingBoxAnnotations": [
    {
      "displayName": "OBJECT1_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
    },
    {
      "displayName": "OBJECT2_LABEL",
      "xMin": "X_MIN",
      "yMin": "Y_MIN",
      "xMax": "X_MAX",
      "yMax": "Y_MAX"
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "test/train/validation"
  }
}

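Field notes:

imageGcsUri: The Cloud Storage URI of this image.
annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.

Example JSON Lines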



{"imageGcsUri": "gs://bucket/filename1.jpeg", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.3", "yMin": "0.3", "xMax": "0.7", "yMax": "0.6"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.8", "yMin": "0.2", "xMax": "1.0", "yMax": "0.4"},{"displayName": "Salad", "xMin": "0.0", "yMin": "0.0", "xMax": "1.0", "yMax": "1.0"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png", "boundingBoxAnnotations": [{"displayName": "Baked goods", "xMin": "0.5", "yMin": "0.7", "xMax": "0.8", "yMax": "0.8"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.tiff", "boundingBoxAnnotations": [{"displayName": "Salad", "xMin": "0.1", "yMin": "0.2", "xMax": "0.8", "yMax": "0.9"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
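
In the example lines, the bounding box coordinates are strings with values between 0 and 1. The following sketch converts one annotation to pixel coordinates, on the assumption that the values are fractions of the image width and height. The to_pixel_box helper is a hypothetical name, and the image size has to come from the image itself.

```python
from typing import Dict, Tuple


def to_pixel_box(
    annotation: Dict[str, str], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a bounding box annotation to (left, top, right, bottom) pixels.

    Assumption: xMin, yMin, xMax, and yMax are numeric strings normalized
    to the range [0, 1] relative to the image dimensions.
    """
    x_min = float(annotation["xMin"])
    y_min = float(annotation["yMin"])
    x_max = float(annotation["xMax"])
    y_max = float(annotation["yMax"])
    if not (0.0 <= x_min <= x_max <= 1.0 and 0.0 <= y_min <= y_max <= 1.0):
        raise ValueError(f"unexpected box coordinates: {annotation!r}")
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )
```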

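Tabular datasets

Vertex AI passes tabular data to your training application in CSV format or as a URI to a BigQuery table or view. For more information about the data source format and requirements, see Preparing your import source. For more information about the dataset schema, refer to the dataset in the Google Cloud console.

Text datasets

Text datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn how Vertex AI formats your dataset.

Single-label classification

When exporting a single-label text classification dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.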

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.

{
  "classificationAnnotation": {
    "displayName": "label"
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotation": {
    "displayName": "label2"
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
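
As the two items above show, a text data item carries its text either inline in textContent or as a Cloud Storage reference in textGcsUri. The following sketch tells the two cases apart so that a training application can decide whether it still has to download the text. The helper names and the "inline" and "gcs" tags are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def text_source(item: Dict[str, Any]) -> Tuple[str, str]:
    """Return ("inline", text) or ("gcs", uri) for a text data item."""
    if "textContent" in item:
        return "inline", item["textContent"]
    if "textGcsUri" in item:
        return "gcs", item["textGcsUri"]
    raise ValueError("data item has neither textContent nor textGcsUri")


def parse_text_item(line: str) -> Tuple[str, str, str]:
    """Parse one single-label JSON line into (source_kind, value, label)."""
    item = json.loads(line)
    kind, value = text_source(item)
    return kind, value, item["classificationAnnotation"]["displayName"]
```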

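Multi-label classification

When exporting a multi-label text classification dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.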

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.

{
  "classificationAnnotations": [{
    "displayName": "label1"
    },{
    "displayName": "label2"
  }],
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "classificationAnnotations": [{
    "displayName": "label2"
    },{
    "displayName": "label3"
  }],
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}

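Entity extraction

When exporting an entity extraction dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.

gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.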

{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textContent": "inline_text",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}
{
    "textSegmentAnnotations": [
      {
        "startOffset":number,
        "endOffset":number,
        "displayName": "label"
      },
      ...
    ],
    "textGcsUri": "gcs_uri_to_file",
    "dataItemResourceLabels": {
      "aiplatform.googleapis.com/ml_use": "training|test|validation"
    }
}

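Sentiment analysis

When exporting a sentiment analysis dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.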

gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_text_sentiment_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.

{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textContent": "inline_text",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
{
  "sentimentAnnotation": {
    "sentiment": number,
    "sentimentMax": number
  },
  "textGcsUri": "gcs_uri_to_file",
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training|test|validation"
  }
}
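
Each sentiment annotation carries both a score and the maximum score used in the dataset, so a training application can rescale scores to a common range. The following sketch divides sentiment by sentimentMax. It assumes that sentiment is a number between 0 and sentimentMax, which the format above does not state explicitly, and normalized_sentiment is a hypothetical helper name.

```python
from typing import Any, Dict


def normalized_sentiment(item: Dict[str, Any]) -> float:
    """Scale the sentiment score of a data item to the range [0, 1].

    Assumption: 0 <= sentiment <= sentimentMax and sentimentMax > 0.
    """
    annotation = item["sentimentAnnotation"]
    sentiment = annotation["sentiment"]
    sentiment_max = annotation["sentimentMax"]
    if sentiment_max <= 0:
        raise ValueError("sentimentMax must be positive")
    if not 0 <= sentiment <= sentiment_max:
        raise ValueError("sentiment must lie between 0 and sentimentMax")
    return sentiment / sentiment_max
```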

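Video datasets

Video datasets are passed to your training application in JSON Lines format. See the section for your dataset's objective to learn how Vertex AI formats your dataset.

Action recognition

When exporting an action recognition dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.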

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.



{
  "videoGcsUri": "gs://bucket/filename.ext",
  "timeSegments": [{
    "startTime": "start_time_of_fully_annotated_segment",
    "endTime": "end_time_of_segment"}],
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "ml_use": "train|test"
  }
}

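Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, in which case they correspond to the key frame of the action.

Example JSON Lines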



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}}
...
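
The startTime and endTime values in the example lines are strings such as "1.0s". The following sketch converts them to seconds and reports the duration of each annotated segment. It assumes that every time value is a decimal number followed by the letter s, as in the examples. The helper names are hypothetical.

```python
import json
from typing import List, Tuple


def parse_seconds(value: str) -> float:
    """Convert a time string such as "12.0s" to a number of seconds."""
    if not value.endswith("s"):
        raise ValueError(f"expected a value ending in 's', got {value!r}")
    return float(value[:-1])


def segment_durations(line: str) -> List[Tuple[str, float]]:
    """Return (displayName, duration_in_seconds) for each annotation.

    A duration of 0.0 means startTime equals endTime, which marks a key frame.
    """
    item = json.loads(line)
    return [
        (a["displayName"], parse_seconds(a["endTime"]) - parse_seconds(a["startTime"]))
        for a in item["timeSegmentAnnotations"]
    ]
```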

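Classification

When exporting a classification dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.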

gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"timeSegmentAnnotations": [{
		"displayName": "LABEL",
		"startTime": "start_time_of_segment",
		"endTime": "end_time_of_segment"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSON Lines for video classification:



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

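Object tracking

When exporting an object tracking dataset, Vertex AI uses the following publicly accessible schema. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.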

gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

Each data item of your exported dataset uses the following format. This example includes line breaks for readability.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"TemporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding box",
		"xMax": "rightmost_coordinate_of_the_bounding box",
		"yMin": "topmost_coordinate_of_the_bounding box",
		"yMax": "bottommost_coordinate_of_the_bounding box",
		"timeOffset": "timeframe_object-detected",
		"instanceId": "instance_of_object",
		"annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSON Lines



{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

What's next

Learn how to use your managed dataset in custom training by creating a training pipeline.