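Prepare image training data for classification

This page describes how to prepare image training data for use in a Vertex AI dataset to train an image classification model.

The following objective sections cover the data requirements, the input/output schema file, and the format of the data import files (JSON Lines and CSV) that the schema defines.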

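Single-label classification

Data requirements

Training data: The following image formats are supported when training your model. After the Vertex AI API preprocesses these imported images, they serve as the data used to train the model. The maximum file size is 30MB.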

JPEG
GIF
PNG
BMP
ICO

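Prediction data: The following image formats are supported when requesting a prediction from (querying) your model. The maximum file size is 1.5MB.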

JPEG
GIF
PNG
WEBP
BMP
TIFF
ICO
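
The format and size limits above can be checked locally before upload. The following is a minimal sketch, not an official Vertex AI check: the function name `check_image`, the extension-based format detection, the `jpg`/`tif` aliases, and the reading of 1MB as 1024 * 1024 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Formats and size limits copied from the data requirements above.
# Assumption: 1MB is read as 1024 * 1024 bytes.
LIMITS = {
    "training": ({"jpeg", "gif", "png", "bmp", "ico"}, 30 * 1024 * 1024),
    "prediction": (
        {"jpeg", "gif", "png", "webp", "bmp", "tiff", "ico"},
        int(1.5 * 1024 * 1024),
    ),
}

# Assumption: these common alternate extensions denote the listed formats.
ALIASES = {"jpg": "jpeg", "tif": "tiff"}


def check_image(filename, size_bytes, purpose="training"):
    """Return a list of problems; an empty list means no problem was found."""
    formats, max_bytes = LIMITS[purpose]
    ext = Path(filename).suffix.lower().lstrip(".")
    ext = ALIASES.get(ext, ext)
    problems = []
    if ext not in formats:
        problems.append(f"format '{ext}' is not listed for {purpose}")
    if size_bytes > max_bytes:
        problems.append(
            f"{size_bytes} bytes exceeds the {purpose} limit of {max_bytes}"
        )
    return problems


print(check_image("photo.JPG", 2_000_000))
print(check_image("photo.webp", 2_000_000, "prediction"))
```

A file extension is only a hint about the real encoding, so a stricter check would inspect the file contents instead.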

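Best practices for image data used to train AutoML models

The following best practices apply to datasets that train models using AutoML.

AutoML models are optimized for photographs of objects in the real world.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry, low-resolution images (such as images from a security camera), your training data should consist of blurry, low-resolution images. In general, also consider providing multiple angles, resolutions, and backgrounds for your training images.
Vertex AI models generally can't predict labels that humans can't assign. If a trained person can't assign a label after looking at an image for 1-2 seconds, the model likely can't be trained to do it either.
We recommend about 1000 training images per label. Each label needs at least 10 training images. In general, it takes more examples per label to train models that assign multiple labels per image, and the resulting scores are harder to interpret.
The model works best when the most common label has at most 100 times as many images as the least common label. We recommend removing labels with very low frequency.
Consider including a None_of_the_above label and images that don't match any of your defined labels. For example, for a flower dataset, include images of flowers outside your labeled varieties and label them as None_of_the_above.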
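
The per-label minimum of 10 and the 100x imbalance guideline above can be checked with a few lines of code before import. This is a minimal sketch; the function name `check_label_balance` and its parameters are hypothetical and not part of any Vertex AI API.

```python
from collections import Counter


def check_label_balance(labels, min_per_label=10, max_ratio=100):
    """Return warnings for labels that fall outside the guidelines above."""
    counts = Counter(labels)
    warnings = []
    # Guideline: each label needs at least `min_per_label` examples.
    for label, count in sorted(counts.items()):
        if count < min_per_label:
            warnings.append(f"{label}: {count} images, fewer than {min_per_label}")
    # Guideline: the most common label has at most `max_ratio` times
    # as many examples as the least common label.
    if counts:
        top_label, top = counts.most_common(1)[0]
        rare_label, rare = min(counts.items(), key=lambda item: item[1])
        if top > max_ratio * rare:
            warnings.append(
                f"{top_label} ({top}) has more than {max_ratio}x "
                f"the images of {rare_label} ({rare})"
            )
    return warnings


# Illustrative label list, one entry per training image.
labels = ["daisy"] * 1200 + ["tulips"] * 10 + ["roses"] * 4
for warning in check_label_balance(labels):
    print(warning)
```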

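YAML schema file

Use the following publicly accessible schema file to import single-label image classification annotations. This schema file dictates the format of the data input files. The file's structure follows the OpenAPI schema.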

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml

Full schema file

title: ImageClassificationSingleLabel
description: >
 Import and export format for importing/exporting images together with
 single-label classification annotation. Can be used in
 Dataset.import_schema_uri field.
type: object
required:
- imageGcsUri
properties:
 imageGcsUri:
   type: string
   description: >
     A Cloud Storage URI pointing to an image. Up to 30MB in size.
     Supported file mime types: `image/jpeg`, `image/gif`, `image/png`,
     `image/webp`, `image/bmp`, `image/tiff`, `image/vnd.microsoft.icon`.
 classificationAnnotation:
   type: object
   description: Single classification Annotation on the image.
   properties:
     displayName:
       type: string
       description: >
         It will be imported as/exported from AnnotationSpec's display name,
         i.e. the name of the label/class.
     annotationResourceLabels:
       description: Resource labels on the Annotation.
       type: object
       additionalProperties:
         type: string
 dataItemResourceLabels:
   description: Resource labels on the DataItem.
   type: object
   additionalProperties:
     type: string

Input files

JSON Lines

JSON on each line:


{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotation": {
    "displayName": "LABEL",
    "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name": "displayName",
        "env": "prod"
      }
   },
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

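Field notes:

imageGcsUri - The only required field.
annotationResourceLabels - Can contain any number of key-value string pairs. The only system-reserved key-value pair is the following:
- "aiplatform.googleapis.com/annotation_set_name" : "value"
Where value is one of the display names of the existing annotation sets in the dataset.
dataItemResourceLabels - Can contain any number of key-value string pairs. The only system-reserved key-value pair is the following, which specifies the machine learning use set of the data item:
- "aiplatform.googleapis.com/ml_use" : "training/test/validation"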
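
The field notes above translate into a small pre-import check for a single-label record. The sketch below is illustrative only: `validate_record` is a hypothetical helper, and treating a `gs://` prefix as the test for a Cloud Storage URI is an assumption based on the examples on this page.

```python
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"
ML_USE_VALUES = ("training", "test", "validation")


def validate_record(line):
    """Return a list of problems found in one JSON Lines record."""
    record = json.loads(line)
    problems = []

    # imageGcsUri is the only required field.
    uri = record.get("imageGcsUri")
    if not isinstance(uri, str) or not uri.startswith("gs://"):
        problems.append("imageGcsUri is required and must be a gs:// URI")

    item_labels = record.get("dataItemResourceLabels", {})
    annotation = record.get("classificationAnnotation", {})
    annotation_labels = annotation.get("annotationResourceLabels", {})

    # Resource labels are key-value string pairs.
    for name, labels in (
        ("dataItemResourceLabels", item_labels),
        ("annotationResourceLabels", annotation_labels),
    ):
        for key, value in labels.items():
            if not isinstance(value, str):
                problems.append(f"{name}[{key!r}] must be a string")

    # The reserved ml_use key takes one of three values.
    ml_use = item_labels.get(ML_USE_KEY)
    if ml_use is not None and ml_use not in ML_USE_VALUES:
        problems.append(f"{ML_USE_KEY} must be one of {ML_USE_VALUES}")
    return problems


good = '{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotation": {"displayName": "daisy"}}'
print(validate_record(good))
```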

Example JSON Lines - `image_classification_single_label.jsonl`:


{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
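
Lines like the ones above are usually generated rather than written by hand. The following sketch shows one way to do that with the standard `json` module; the helper name `to_jsonl_line` and the in-memory `rows` list are hypothetical.

```python
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def to_jsonl_line(image_uri, label=None, ml_use=None):
    """Serialize one image as a single-label JSON Lines record."""
    record = {"imageGcsUri": image_uri}
    # Optional fields are left out entirely when no value is given.
    if label is not None:
        record["classificationAnnotation"] = {"displayName": label}
    if ml_use is not None:
        record["dataItemResourceLabels"] = {ML_USE_KEY: ml_use}
    # json.dumps emits no newlines by default, so each record stays on one line.
    return json.dumps(record)


# Illustrative rows: (Cloud Storage URI, label, ML use).
rows = [
    ("gs://bucket/filename1.jpeg", "daisy", "test"),
    ("gs://bucket/filename2.gif", "dandelion", "training"),
    ("gs://bucket/filename3.png", None, None),
]
for row in rows:
    print(to_jsonl_line(*row))
```

Redirecting this output to a `.jsonl` file gives one record per line, matching the layout of the example above.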

CSV

CSV format:

[ML_USE],GCS_FILE_PATH,[LABEL]

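List of columns

ML_USE (Optional) - Used to split the data when training a model. Use training, test, or validation. For more information about manual data splitting, see About data splits for AutoML models.
GCS_FILE_PATH - This field contains the Cloud Storage URI for the image. Cloud Storage URIs are case-sensitive.
LABEL (Optional) - Labels must start with a letter and contain only letters, numbers, and underscores.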

Example CSV - `image_classification_single_label.csv`:

test,gs://bucket/filename1.jpeg,daisy
training,gs://bucket/filename2.gif,dandelion
gs://bucket/filename3.png
gs://bucket/filename4.bmp,sunflowers
validation,gs://bucket/filename5.tiff,tulips
...
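
Because both ML_USE and LABEL are optional, a row can have one, two, or three fields. The sketch below shows one way to tell the variants apart by locating the `gs://` field. The helper `parse_row` is hypothetical, and two choices in it are assumptions: labels are limited to ASCII letters, and ML_USE values are matched in lowercase as in the example above.

```python
import csv
import io
import re

# Assumption: "letters" means ASCII letters.
LABEL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
ML_USE_VALUES = ("training", "test", "validation")


def parse_row(fields):
    """Return (ml_use, uri, label) for one single-label CSV row."""
    fields = [field.strip() for field in fields]
    if not fields:
        raise ValueError("empty row")
    ml_use = None
    # A first field that is not a Cloud Storage URI must be ML_USE.
    if not fields[0].startswith("gs://"):
        ml_use, fields = fields[0], fields[1:]
        if ml_use not in ML_USE_VALUES:
            raise ValueError(f"unknown ML_USE value: {ml_use!r}")
    if not fields or not fields[0].startswith("gs://"):
        raise ValueError("missing Cloud Storage URI")
    uri, labels = fields[0], fields[1:]
    if len(labels) > 1:
        raise ValueError("a single-label row takes at most one label")
    label = labels[0] if labels else None
    if label is not None and not LABEL_PATTERN.fullmatch(label):
        raise ValueError(f"invalid label: {label!r}")
    return ml_use, uri, label


SAMPLE = """test,gs://bucket/filename1.jpeg,daisy
gs://bucket/filename3.png
gs://bucket/filename4.bmp,sunflowers
"""
rows = [parse_row(row) for row in csv.reader(io.StringIO(SAMPLE)) if row]
for row in rows:
    print(row)
```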

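Multi-label classification

Data requirements

Training data: The following image formats are supported when training your model. After the Vertex AI API preprocesses these imported images, they serve as the data used to train the model. The maximum file size is 30MB.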

JPEG
GIF
PNG
BMP
ICO

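Prediction data: The following image formats are supported when requesting a prediction from (querying) your model. The maximum file size is 1.5MB.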

JPEG
GIF
PNG
WEBP
BMP
TIFF
ICO

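Best practices for image data used to train AutoML models

The following best practices apply to datasets that train models using AutoML.

AutoML models are optimized for photographs of objects in the real world.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry, low-resolution images (such as images from a security camera), your training data should consist of blurry, low-resolution images. In general, also consider providing multiple angles, resolutions, and backgrounds for your training images.
Vertex AI models generally can't predict labels that humans can't assign. If a trained person can't assign a label after looking at an image for 1-2 seconds, the model likely can't be trained to do it either.
We recommend about 1000 training images per label. Each label needs at least 10 training images. In general, it takes more examples per label to train models that assign multiple labels per image, and the resulting scores are harder to interpret.
The model works best when the most common label has at most 100 times as many images as the least common label. We recommend removing labels with very low frequency.
Consider including a None_of_the_above label and images that don't match any of your defined labels. For example, for a flower dataset, include images of flowers outside your labeled varieties and label them as None_of_the_above.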

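YAML schema file

Use the following publicly accessible schema file to import multi-label image classification annotations. This schema file dictates the format of the data input files. The file's structure follows the OpenAPI schema.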

gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_multi_label_io_format_1.0.0.yaml

Full schema file

title: ImageClassificationMultiLabel
description: >
 Import and export format for importing/exporting images together with
 multi-label classification annotations. Can be used in
 Dataset.import_schema_uri field.
type: object
required:
- imageGcsUri
properties:
 imageGcsUri:
   type: string
   description: >
     A Cloud Storage URI pointing to an image. Up to 30MB in size.
     Supported file mime types: `image/jpeg`, `image/gif`, `image/png`,
     `image/webp`, `image/bmp`, `image/tiff`, `image/vnd.microsoft.icon`.
 classificationAnnotations:
   type: array
   description: Multiple classification Annotations on the image.
   items:
     type: object
     description: Classification annotation.
     properties:
       displayName:
         type: string
         description: >
           It will be imported as/exported from AnnotationSpec's display name,
           i.e. the name of the label/class.
       annotationResourceLabels:
         description: Resource labels on the Annotation.
         type: object
         additionalProperties:
           type: string
 dataItemResourceLabels:
   description: Resource labels on the DataItem.
   type: object
   additionalProperties:
     type: string

Input files

JSON Lines

JSON on each line:


{
  "imageGcsUri": "gs://bucket/filename.ext",
  "classificationAnnotations": [
    {
      "displayName": "LABEL1",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "flower_type"
      }
    },
    {
      "displayName": "LABEL2",
      "annotationResourceLabels": {
        "aiplatform.googleapis.com/annotation_set_name":"displayName",
        "label_type": "image_shot_type"
      }
    }
  ],
  "dataItemResourceLabels": {
    "aiplatform.googleapis.com/ml_use": "training/test/validation"
  }
}

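Field notes:

imageGcsUri - The only required field.
annotationResourceLabels - Can contain any number of key-value string pairs. The only system-reserved key-value pair is the following:
- "aiplatform.googleapis.com/annotation_set_name" : "value"
Where value is one of the display names of the existing annotation sets in the dataset.
dataItemResourceLabels - Can contain any number of key-value string pairs. The only system-reserved key-value pair is the following, which specifies the machine learning use set of the data item:
- "aiplatform.googleapis.com/ml_use" : "training/test/validation"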

Example JSON Lines - `image_classification_multi_label.jsonl`:


{"imageGcsUri": "gs://bucket/filename1.jpeg",  "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
{"imageGcsUri": "gs://bucket/filename2.gif",  "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename3.png",  "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename4.bmp",  "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"imageGcsUri": "gs://bucket/filename5.tiff",  "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}}
...
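
To review an existing multi-label file, each line can be read back into its URI, label list, and ML use. This is a minimal sketch with the standard `json` module; `read_multi_label` is a hypothetical helper and the two sample lines are shortened from the example above.

```python
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def read_multi_label(lines):
    """Yield (uri, labels, ml_use) for each non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # classificationAnnotations is an array with one entry per label.
        labels = [
            annotation["displayName"]
            for annotation in record.get("classificationAnnotations", [])
            if "displayName" in annotation
        ]
        ml_use = record.get("dataItemResourceLabels", {}).get(ML_USE_KEY)
        yield record["imageGcsUri"], labels, ml_use


SAMPLE = [
    '{"imageGcsUri": "gs://bucket/filename1.jpeg", '
    '"classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], '
    '"dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}',
    '{"imageGcsUri": "gs://bucket/filename3.png"}',
]
records = list(read_multi_label(SAMPLE))
for record in records:
    print(record)
```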

CSV

CSV format:

[ML_USE],GCS_FILE_PATH,[LABEL_1,LABEL_2,...,LABEL_n]

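List of columns

ML_USE (Optional) - Used to split the data when training a model. Use training, test, or validation. For more information about manual data splitting, see About data splits for AutoML models.
GCS_FILE_PATH - This field contains the Cloud Storage URI for the image. Cloud Storage URIs are case-sensitive.
LABEL (Optional) - Labels must start with a letter and contain only letters, numbers, and underscores.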

Example CSV - `image_classification_multi_label.csv`:

test,gs://bucket/filename1.jpeg,daisy,full_shot
training,gs://bucket/filename2.gif,dandelion,medium_shot
gs://bucket/filename3.png
gs://bucket/filename4.bmp,sunflowers,closeup
validation,gs://bucket/filename5.tiff,tulips,extreme_closeup
...
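
The CSV and JSON Lines layouts carry the same information, so a multi-label CSV row can be rewritten as a JSON Lines record. The sketch below assumes that each label column maps to one `classificationAnnotations` entry and that ML_USE maps to the reserved `ml_use` resource label. That mapping is inferred from the two formats on this page, and `csv_row_to_record` is a hypothetical helper.

```python
import csv
import io
import json

ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def csv_row_to_record(fields):
    """Convert one multi-label CSV row into a JSON Lines record (a dict)."""
    fields = [field.strip() for field in fields if field.strip()]
    ml_use = None
    # A first field that is not a Cloud Storage URI is taken as ML_USE.
    if fields and not fields[0].startswith("gs://"):
        ml_use, fields = fields[0], fields[1:]
    if not fields or not fields[0].startswith("gs://"):
        raise ValueError("row has no Cloud Storage URI")
    record = {"imageGcsUri": fields[0]}
    # Every remaining field is one label on the image.
    if fields[1:]:
        record["classificationAnnotations"] = [
            {"displayName": label} for label in fields[1:]
        ]
    if ml_use is not None:
        record["dataItemResourceLabels"] = {ML_USE_KEY: ml_use}
    return record


SAMPLE = "test,gs://bucket/filename1.jpeg,daisy,full_shot\ngs://bucket/filename3.png\n"
for row in csv.reader(io.StringIO(SAMPLE)):
    print(json.dumps(csv_row_to_record(row)))
```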
