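Prepare video training data for object tracking

This page describes how to prepare video training data for use in a Vertex AI dataset to train a video object tracking model.

The following sections describe the data requirements, the schema file, and the format of the data import files (JSONL and CSV) that the schema defines.

Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).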

Data requirements

The following requirements apply to datasets used to train AutoML or custom-trained models.

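Vertex AI supports the following video formats for training a model or requesting a prediction (annotating a video).
- .MOV
- .MPEG4
- .MP4
- .AVI
To view video content in the web console or to annotate a video, the video must be in a format that your browser supports natively. Because not all browsers handle .MOV or .AVI content natively, we recommend the .MPEG4 or .MP4 video format.
The maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container are not supported.
Each dataset can contain at most 1,000 labels.
You can assign "ML_USE" labels to the videos in the import files. At training time, you can choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For video object tracking, note the following:
- Each dataset can contain at most 150,000 labeled video frames.
- Each dataset can contain at most 1,000,000 annotated bounding boxes in total.
- Each annotation set can contain at most 1,000 labels.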
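
The numeric limits above can be checked locally before you import anything. The following is a minimal sketch, assuming the annotations are already in memory as `(video_uri, label, time_offset_seconds)` tuples with one tuple per bounding box. The function name and the tuple layout are hypothetical and are not part of any Vertex AI API.

```python
from typing import Iterable, List, Tuple

# Limits quoted from the data requirements above.
MAX_LABELED_FRAMES = 150_000
MAX_BOUNDING_BOXES = 1_000_000
MAX_LABELS = 1_000
SUPPORTED_EXTENSIONS = (".mov", ".mpeg4", ".mp4", ".avi")

# Hypothetical in-memory layout: one tuple per bounding box.
Annotation = Tuple[str, str, float]  # (video_uri, label, time_offset_seconds)


def check_dataset_limits(annotations: Iterable[Annotation]) -> List[str]:
    """Return a list of problems; an empty list means no limit was exceeded."""
    annotations = list(annotations)
    problems = []

    # A labeled frame is identified by its video and its time offset.
    frames = {(uri, offset) for uri, _, offset in annotations}
    labels = {label for _, label, _ in annotations}

    if len(frames) > MAX_LABELED_FRAMES:
        problems.append(f"{len(frames)} labeled frames exceed {MAX_LABELED_FRAMES}")
    if len(annotations) > MAX_BOUNDING_BOXES:
        problems.append(f"{len(annotations)} bounding boxes exceed {MAX_BOUNDING_BOXES}")
    if len(labels) > MAX_LABELS:
        problems.append(f"{len(labels)} labels exceed {MAX_LABELS}")

    # Flag videos whose file extension is not in the supported list.
    for uri in sorted({uri for uri, _, _ in annotations}):
        if not uri.lower().endswith(SUPPORTED_EXTENSIONS):
            problems.append(f"unsupported video format: {uri}")
    return problems
```

The check covers only the counts and file extensions listed above. File size, duration, and container timestamps would have to be inspected separately.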

Best practices for video data used to train AutoML models

The following practices apply to datasets used to train AutoML models.

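The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry, low-resolution videos (such as footage from a security camera), your training data should consist of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
Vertex AI models generally can't predict labels that humans can't assign. If a person cannot be trained to assign a label after watching a video for 1-2 seconds, the model likely cannot be trained to do so either.
The model works best when the most common label has at most 100 times as many videos as the least common label. We recommend removing low-frequency labels. For object tracking:
- The minimum bounding box size is 10 x 10 pixels.
- For video frame resolutions much larger than 1024 x 1024 pixels, some image quality may be lost during the frame normalization process used by AutoML object tracking.
- Each unique label must be present in at least 3 distinct video frames. In addition, each label must have at least 10 annotations.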
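
The object tracking recommendations above can be turned into simple pre-import checks. The sketch below assumes bounding boxes are given in relative coordinates together with the frame size in pixels, and that annotations are `(video_uri, label, time_offset_seconds)` tuples. All function names here are hypothetical helpers, not Vertex AI functionality.

```python
from collections import Counter, defaultdict

# Thresholds quoted from the best practices above.
MIN_BOX_PIXELS = 10
MIN_FRAMES_PER_LABEL = 3
MIN_ANNOTATIONS_PER_LABEL = 10


def box_is_large_enough(x_min, y_min, x_max, y_max, frame_width, frame_height):
    """Convert a relative box to pixels and compare it with the 10 x 10 minimum."""
    width_px = (x_max - x_min) * frame_width
    height_px = (y_max - y_min) * frame_height
    return width_px >= MIN_BOX_PIXELS and height_px >= MIN_BOX_PIXELS


def underrepresented_labels(annotations):
    """Return labels with fewer than 3 distinct frames or fewer than 10 annotations."""
    frames = defaultdict(set)
    counts = Counter()
    for uri, label, offset in annotations:
        frames[label].add((uri, offset))
        counts[label] += 1
    return sorted(
        label
        for label in counts
        if len(frames[label]) < MIN_FRAMES_PER_LABEL
        or counts[label] < MIN_ANNOTATIONS_PER_LABEL
    )


def videos_per_label_ratio(annotations):
    """Ratio of video counts between the most and least common labels."""
    videos = defaultdict(set)
    for uri, label, _ in annotations:
        videos[label].add(uri)
    sizes = [len(uris) for uris in videos.values()]
    if not sizes:
        return 0.0
    return max(sizes) / min(sizes)
```

A ratio above 100 from `videos_per_label_ratio` suggests that the rarest labels are candidates for removal, in line with the recommendation above.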

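Schema file

Use the following publicly accessible schema file when you create the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI schema.

Object tracking schema file:

gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

Full schema file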



title: VideoObjectTracking
version: 1.0.0
description: >
  Import and export format for importing/exporting videos together with
  temporal bounding box annotations.
type: object
required:
- videoGcsUri
properties:
  videoGcsUri:
    type: string
    description: >
      A Cloud Storage URI pointing to a video. Up to 50 GB in size and
      up to 3 hours in duration. Supported file mime types: `video/mp4`,
      `video/avi`, `video/quicktime`.
  TemporalBoundingBoxAnnotations:
    type: array
    description: Multiple temporal bounding box annotations. Each on a frame of the video.
    items:
      type: object
      description: >
        Temporal bounding box annotation on video. `xMin`, `xMax`, `yMin`, and
        `yMax` are relative to the video frame size, and the point 0,0 is in the
        top left of the frame.
      properties:
        displayName:
          type: string
          description: >
            It will be imported as/exported from AnnotationSpec's display name,
            i.e., the name of the label/class.
        xMin:
          description: The leftmost coordinate of the bounding box.
          type: number
          format: double
        xMax:
          description: The rightmost coordinate of the bounding box.
          type: number
          format: double
        yMin:
          description: The topmost coordinate of the bounding box.
          type: number
          format: double
        yMax:
          description: The bottommost coordinate of the bounding box.
          type: number
          format: double
        timeOffset:
          type: string
          description: >
            A time offset of a video in which the object has been detected.
            Expressed as a number of seconds as measured from the
            start of the video, with fractions up to a microsecond precision, and
            with "s" appended at the end.
        instanceId:
          type: number
          format: integer
          description: >
            The instance of the object, expressed as a positive integer. Used to
            tell apart objects of the same type when multiple are present on a
            single video.
        annotationResourceLabels:
          description: Resource labels on the Annotation.
          type: object
          additionalProperties:
            type: string
  dataItemResourceLabels:
    description: Resource labels on the DataItem.
    type: object
    additionalProperties:
      type: string

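Input files

The format of your training data for video object tracking is as follows.

To import your data, create either a JSONL or a CSV file.

JSONL

JSON on each line:
For details, see the object tracking YAML file.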



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"TemporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding_box",
		"xMax": "rightmost_coordinate_of_the_bounding_box",
		"yMin": "topmost_coordinate_of_the_bounding_box",
		"yMax": "bottommost_coordinate_of_the_bounding_box",
		"timeOffset": "timeframe_object-detected",
		"instanceId": "instance_of_object",
		"annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSONL - video object tracking:



{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...
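
Because each JSONL line must be a complete JSON object on a single line, it is usually safer to generate the file with a JSON library than by hand. The following is a minimal sketch using Python's standard `json` module. The helper names and the input dictionary layout are hypothetical. The field names follow the schema file (`TemporalBoundingBoxAnnotations`, `timeOffset`, `instanceId`) and coordinates are written as numbers, as the schema declares them; the sample lines above spell some fields differently, and which spelling the importer accepts is not verified here.

```python
import json


def format_time_offset(seconds: float) -> str:
    """Format seconds with microsecond precision and a trailing "s", per the schema."""
    return f"{seconds:.6f}s"


def make_jsonl_line(video_gcs_uri, boxes, ml_use=None):
    """Build one JSONL line for a video.

    boxes: list of dicts with the hypothetical keys label, x_min, y_min,
    x_max, y_max, time_offset, and optionally instance_id.
    """
    annotations = []
    for box in boxes:
        annotation = {
            "displayName": box["label"],
            "xMin": box["x_min"],
            "xMax": box["x_max"],
            "yMin": box["y_min"],
            "yMax": box["y_max"],
            "timeOffset": format_time_offset(box["time_offset"]),
        }
        if box.get("instance_id") is not None:
            annotation["instanceId"] = box["instance_id"]
        annotations.append(annotation)

    record = {
        "videoGcsUri": video_gcs_uri,
        "TemporalBoundingBoxAnnotations": annotations,
    }
    if ml_use is not None:
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    # json.dumps emits no newlines by default, so the result is a single line.
    return json.dumps(record)
```

Writing the file is then a matter of calling `make_jsonl_line` once per video and joining the results with newline characters.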

CSV

Format of a row in the CSV file:

[ML_USE,]VIDEO_URI,LABEL,[INSTANCE_ID],TIME_OFFSET,BOUNDING_BOX

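List of columns

ML_USE (Optional). Used to split the data when training a model. Use TRAINING or TEST.
VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
LABEL. Labels must start with a letter and contain only letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows to the CSV file, each identifying the same video segment with a different label.
INSTANCE_ID (Optional). An instance ID that identifies an object instance across the video frames of a video. If provided, AutoML object tracking uses it for object tracking tuning, training, and evaluation. Bounding boxes of the same object instance that appear in different video frames are labeled with the same instance ID. The instance ID is unique only within each video, not across the dataset. For example, if two objects from two different videos have the same instance ID, that does not mean they are the same object instance.
TIME_OFFSET. The video frame, expressed as a time offset from the beginning of the video. The time offset is a floating-point number in seconds.
BOUNDING_BOX. A bounding box for an object in the video frame. Specifying a bounding box involves more than one column.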

A. x_relative_min,y_relative_min
B. x_relative_max,y_relative_min
C. x_relative_max,y_relative_max
D. x_relative_min,y_relative_max

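Each vertex is specified by x, y coordinate values. The coordinates must be floating-point numbers in the range 0 to 1, where 0 represents the minimum x or y value and 1 represents the maximum x or y value.
For example, (0,0) represents the top-left corner and (1,1) represents the bottom-right corner; a bounding box for the entire image is expressed as (0,0,,,1,1,,) or (0,0,1,0,1,1,0,1).
AutoML object tracking does not require a specific vertex ordering. Additionally, if the 4 specified vertices don't form a rectangle parallel to the image edges, Vertex AI specifies vertices that do form such a rectangle.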
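
If you prefer to supply boxes that are already axis-aligned rather than rely on that adjustment, you can normalize the vertices yourself before writing the file. The sketch below takes the smallest axis-aligned rectangle that encloses the given vertices. This is one plausible normalization chosen for illustration; how Vertex AI derives its rectangle is not described here, and the function name is hypothetical.

```python
def enclosing_rectangle(vertices):
    """Return (x_min, y_min, x_max, y_max) of the axis-aligned box around the vertices.

    vertices: list of (x, y) pairs in relative coordinates, in any order.
    """
    if not vertices:
        raise ValueError("at least one vertex is required")
    for x, y in vertices:
        # Relative coordinates must stay within the frame.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError(f"coordinate out of the 0 to 1 range: ({x}, {y})")
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)
```

Because only the minimum and maximum of each axis are used, the result does not depend on the order in which the vertices are listed.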
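You can specify the bounding box for an object in one of two ways:
1. Specify two vertices, each consisting of a pair of x,y coordinates, if they are diagonally opposite points of the rectangle:
  A. x_relative_min,y_relative_min
  C. x_relative_max,y_relative_max
  as shown in this example:
  x_relative_min,y_relative_min,,,x_relative_max,y_relative_max,,
2. Specify all four vertices, as shown here:
  x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max
  If the four specified vertices don't form a rectangle parallel to the image edges, Vertex AI specifies vertices that do form such a rectangle.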
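
The two bounding box forms above differ only in which of the eight coordinate columns are left empty. The following sketch builds a CSV row in either form with Python's standard `csv` module. The function names and parameters are hypothetical, and the column order follows the row format given earlier.

```python
import csv
import io


def bounding_box_columns(x_min, y_min, x_max, y_max, all_vertices=False):
    """Return the eight bounding box columns in vertex order A, B, C, D."""
    if all_vertices:
        return [x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max]
    # Two-vertex form: only A and C are filled in; B and D stay empty.
    return [x_min, y_min, "", "", x_max, y_max, "", ""]


def make_csv_row(video_uri, label, time_offset, box,
                 instance_id=None, ml_use=None, all_vertices=False):
    """Build one CSV row; box is (x_min, y_min, x_max, y_max) in relative coordinates."""
    fields = []
    if ml_use is not None:
        fields.append(ml_use)  # optional leading ML_USE column
    fields += [video_uri, label,
               "" if instance_id is None else instance_id,
               time_offset]
    fields += bounding_box_columns(*box, all_vertices=all_vertices)

    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")
```

Using `csv.writer` rather than joining strings by hand keeps the row valid even if a URI or label contains a character that needs quoting.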

Examples of rows in dataset files

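The following rows demonstrate how to specify data in a dataset. The example includes a path to a video on Cloud Storage, a label for the object, a time offset at which to begin tracking, and two diagonal vertices. The columns are, in order:

VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max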

gs://folder/video1.avi,car,,12.90,0.8,0.2,,,0.9,0.3,,
gs://folder/video1.avi,bike,,12.50,0.45,0.45,,,0.55,0.55,,
where, for the first row,

VIDEO_URI is gs://folder/video1.avi,
LABEL is car,
INSTANCE_ID is not specified,
TIME_OFFSET is 12.90,
x_relative_min,y_relative_min are 0.8,0.2,
x_relative_max,y_relative_min are not specified,
x_relative_max,y_relative_max are 0.9,0.3,
x_relative_min,y_relative_max are not specified

As stated previously, you can also specify a bounding box by providing all four vertices, as shown in the following examples.

gs://folder/video1.avi,car,,12.10,0.8,0.8,0.9,0.8,0.9,0.9,0.8,0.9
gs://folder/video1.avi,car,,12.90,0.4,0.8,0.5,0.8,0.5,0.9,0.4,0.9
gs://folder/video1.avi,car,,12.10,0.4,0.2,0.5,0.2,0.5,0.3,0.4,0.3
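
To sanity-check a file before import, it can help to read the rows back and confirm that each one yields a well-formed box. The following is a minimal sketch of such a reader, assuming annotated rows only (it rejects rows without bounding box columns). The function names and the returned dictionary layout are hypothetical.

```python
import csv

ML_USE_VALUES = {"TRAINING", "TEST"}


def parse_row(fields):
    """Parse one annotated CSV row into a dictionary."""
    fields = list(fields)
    ml_use = None
    # The ML_USE column is optional; detect it by its allowed values.
    if fields and fields[0] in ML_USE_VALUES:
        ml_use = fields.pop(0)
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns after ML_USE, got {len(fields)}")

    video_uri, label, instance_id, time_offset = fields[:4]
    coords = fields[4:]
    # Collect the vertices that are present; empty pairs are skipped.
    points = [
        (float(coords[i]), float(coords[i + 1]))
        for i in range(0, 8, 2)
        if coords[i] != "" and coords[i + 1] != ""
    ]
    if len(points) not in (2, 4):
        raise ValueError("a bounding box needs two or four vertices")

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "ml_use": ml_use,
        "video_uri": video_uri,
        "label": label,
        "instance_id": int(instance_id) if instance_id else None,
        "time_offset": float(time_offset),
        "box": (min(xs), min(ys), max(xs), max(ys)),
    }


def parse_csv(text):
    """Parse every non-empty row of a CSV document."""
    return [parse_row(row) for row in csv.reader(text.splitlines()) if row]
```

Both the two-vertex and the four-vertex rows shown above reduce to the same `(x_min, y_min, x_max, y_max)` tuple, which makes the two forms easy to compare.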

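Example CSV - no labels:

You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by 11 commas, as shown in the following example.

Example without ml_use assigned: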

  gs://folder/video1.avi
  ...

Example with ml_use assigned:

  TRAINING,gs://folder/video1.avi
  TEST,gs://folder/video2.avi
  ...

Create a dataset