The following sections provide information about data requirements, schema files, and the format of the data import files (JSONL & CSV) that are defined by the schema.
Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).
Data requirements
The following requirements apply to datasets used to train AutoML or custom-trained models.
Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video).
- .MOV
- .MPEG4
- .MP4
- .AVI
To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Because not all browsers handle .MOV or .AVI content natively, we recommend using the .MPEG4 or .MP4 video format.
Maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container aren't supported.
The maximum number of labels in each dataset is limited to 1,000.
You may assign "ML_USE" labels to the videos in the import files. At training time, you may choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For video object tracking, note the following:
- The maximum number of labeled video frames in each dataset is limited to 150,000.
- The maximum number of total annotated bounding boxes in each dataset is limited to 1,000,000.
- The maximum number of labels in each annotation set is limited to 1,000.
Best practices for video data used to train AutoML models
The following practices apply to datasets used to train AutoML models.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.
The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low frequency labels. For object tracking:
- Minimum bounding box size is 10 px by 10 px.
- For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality can be lost during the frame normalization process used by AutoML object tracking.
- Each unique label must be present in at least three distinct video frames. In addition, each label must also have a minimum of ten annotations.
Schema files
Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.
Object tracking schema file:
gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml
Full schema file
title: VideoObjectTracking
version: 1.0.0
description: >
  Import and export format for importing/exporting videos together with
  temporal bounding box annotations.
type: object
required:
- videoGcsUri
properties:
  videoGcsUri:
    type: string
    description: >
      A Cloud Storage URI pointing to a video. Up to 50 GB in size and
      up to 3 hours in duration.
      Supported file mime types: `video/mp4`, `video/avi`, `video/quicktime`.
  TemporalBoundingBoxAnnotations:
    type: array
    description: >
      Multiple temporal bounding box annotations. Each on a frame of the video.
    items:
      type: object
      description: >
        Temporal bounding box annotation on video. `xMin`, `xMax`, `yMin`, and
        `yMax` are relative to the video frame size, and the point 0,0 is in the
        top left of the frame.
      properties:
        displayName:
          type: string
          description: >
            It will be imported as/exported from AnnotationSpec's display name,
            i.e., the name of the label/class.
        xMin:
          description: The leftmost coordinate of the bounding box.
          type: number
          format: double
        xMax:
          description: The rightmost coordinate of the bounding box.
          type: number
          format: double
        yMin:
          description: The topmost coordinate of the bounding box.
          type: number
          format: double
        yMax:
          description: The bottommost coordinate of the bounding box.
          type: number
          format: double
        timeOffset:
          type: string
          description: >
            A time offset of a video in which the object has been detected.
            Expressed as a number of seconds as measured from the start of the
            video, with fractions up to a microsecond precision, and with "s"
            appended at the end.
        instanceId:
          type: number
          format: integer
          description: >
            The instance of the object, expressed as a positive integer. Used to
            tell apart objects of the same type when multiple are present on a
            single video.
        annotationResourceLabels:
          description: Resource labels on the Annotation.
          type: object
          additionalProperties:
            type: string
  dataItemResourceLabels:
    description: Resource labels on the DataItem.
    type: object
    additionalProperties:
      type: string
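You can check your JSONL lines against this schema locally before importing. The following is a minimal sketch rather than an official tool: it assumes you have already downloaded the schema file with gsutil, that your import file uses the hypothetical name import_file.jsonl, and that the pyyaml and jsonschema packages are installed. The OpenAPI schema above is close enough to JSON Schema for basic type and required-field checks.

# Minimal validation sketch. Assumes the schema was downloaded locally, e.g.
#   gsutil cp gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml .
# and a hypothetical local import file named import_file.jsonl.
import json

import yaml                      # pip install pyyaml
from jsonschema import validate  # pip install jsonschema

with open("object_tracking_io_format_1.0.0.yaml") as f:
    schema = yaml.safe_load(f)

with open("import_file.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        # Raises jsonschema.ValidationError if a required field is missing
        # or a field has the wrong type.
        validate(instance=record, schema=schema)
        print(f"line {line_number}: OK ({record['videoGcsUri']})")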
Input files
The format of your training data for video object tracking is as follows.
To import your data, create either a JSONL or CSV file.
JSONL
JSON on each line:
See Object tracking YAML file for details.
{ "videoGcsUri": "gs://bucket/filename.ext", "TemporalBoundingBoxAnnotations": [{ "displayName": "LABEL", "xMin": "leftmost_coordinate_of_the_bounding box", "xMax": "rightmost_coordinate_of_the_bounding box", "yMin": "topmost_coordinate_of_the_bounding box", "yMax": "bottommost_coordinate_of_the_bounding box", "timeOffset": "timeframe_object-detected" "instanceId": "instance_of_object "annotationResourceLabels": "resource_labels" }], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "train|test" } }
Example JSONL - Video object tracking:
{"videoGcsUri": "gs://demo-data/video1.mp4", "TemporalBoundingBoxAnnotations": [{"displayName": "horse", "instanceId": "-1", "timeOffset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "TemporalBoundingBoxAnnotations": [{"displayName": "horse", "instanceId": "-1", "timeOffset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...
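If your annotations live in another format, a short script can emit lines in this shape. The following is a minimal sketch, not part of Vertex AI: the video URIs, label, and box values are hypothetical, the output file name import_file.jsonl is an assumption, and the coordinates are written as relative values as the schema above defines.

# Minimal sketch: write one JSONL line per video with a single temporal
# bounding box annotation. All URIs, labels, and box values are hypothetical.
import json

annotations = [
    # (video_uri, label, time_offset_seconds, x_min, y_min, x_max, y_max, ml_use)
    ("gs://demo-data/video1.mp4", "horse", 4.0, 0.668912, 0.560642, 1.0, 1.0, "training"),
    ("gs://demo-data/video2.mp4", "horse", 71.0, 0.679056, 0.070957, 0.801716, 0.290358, "test"),
]

with open("import_file.jsonl", "w") as out:
    for uri, label, offset, x_min, y_min, x_max, y_max, ml_use in annotations:
        record = {
            "videoGcsUri": uri,
            "TemporalBoundingBoxAnnotations": [{
                "displayName": label,
                "timeOffset": f"{offset:.6f}s",  # seconds with an "s" suffix
                "xMin": x_min,                   # relative coordinates in [0, 1]
                "yMin": y_min,
                "xMax": x_max,
                "yMax": y_max,
            }],
            "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": ml_use},
        }
        out.write(json.dumps(record) + "\n")     # one JSON object per line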
CSV
Format of a row in the CSV file:
[ML_USE,]VIDEO_URI,LABEL,[INSTANCE_ID],TIME_OFFSET,BOUNDING_BOX
List of columns
ML_USE
(Optional). For data split purposes when training a model. Use TRAINING or TEST.
VIDEO_URI
This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
LABEL
Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
INSTANCE_ID
(Optional). An instance ID that identifies the object instance across video frames in a video. If provided, AutoML object tracking uses it for object tracking tuning, training, and evaluation. The bounding boxes of the same object instance present in different video frames are labeled with the same instance ID. The instance ID is unique only within each video, not across the dataset. For example, if two objects from two different videos have the same instance ID, it does not mean they are the same object instance.
TIME_OFFSET
The video frame that indicates the duration offset from the beginning of the video. The time offset is a floating-point number and the units are seconds.
BOUNDING_BOX
A bounding box for an object in the video frame. Specifying a bounding box involves more than one column.
A. x_relative_min, y_relative_min
B. x_relative_max, y_relative_min
C. x_relative_max, y_relative_max
D. x_relative_min, y_relative_max
Each vertex is specified by x, y coordinate values. The coordinate values must be a float in the range 0 to 1, where 0 represents the minimum x or y value, and 1 represents the greatest x or y value.
For example, (0,0) represents the top-left corner, and (1,1) represents the bottom right corner; a bounding box for the entire image is expressed as (0,0,,,1,1,,), or (0,0,1,0,1,1,0,1).
AutoML object tracking does not require a specific vertex ordering. Also, if four specified vertices don't form a rectangle parallel to image edges, Vertex AI specifies vertices that do form such a rectangle.
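If your existing annotations are in pixel coordinates, divide by the frame dimensions to get the relative values. The following is a minimal sketch with a hypothetical frame size and box.

# Convert a pixel-space box to the relative 0-1 coordinates used in import files.
# Frame size and pixel values are hypothetical.
frame_width, frame_height = 1920, 1080
x_min_px, y_min_px, x_max_px, y_max_px = 1284, 216, 1728, 324

x_relative_min = x_min_px / frame_width    # 0 = left edge, 1 = right edge
x_relative_max = x_max_px / frame_width
y_relative_min = y_min_px / frame_height   # 0 = top edge, 1 = bottom edge
y_relative_max = y_max_px / frame_height

print(x_relative_min, y_relative_min, x_relative_max, y_relative_max)
# 0.66875 0.2 0.9 0.3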
The bounding box for an object can be specified in one of two ways:
- Two vertices (a set of x,y coordinates) that are diagonally opposite points of the rectangle:
  A. x_relative_min, y_relative_min
  C. x_relative_max, y_relative_max
  as shown in this example:
  x_relative_min,y_relative_min,,,x_relative_max,y_relative_max,,
- All four vertices specified as shown in:
x_relative_min,y_relative_min, x_relative_max,y_relative_min, x_relative_max,y_relative_max, x_relative_min,y_relative_max,
If the four specified vertices don't form a rectangle parallel to image edges, Vertex AI specifies vertices that do form such a rectangle.
Examples of rows in dataset files
The following rows demonstrate how to specify data in a dataset. Each row includes a path to a video on Cloud Storage, a label for the object, a time offset to begin tracking, and two diagonal vertices.
VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max
gs://folder/video1.avi,car,,12.90,0.8,0.2,,,0.9,0.3,,
gs://folder/video1.avi,bike,,12.50,0.45,0.45,,,0.55,0.55,,
where:
- VIDEO_URI is gs://folder/video1.avi,
- LABEL is car,
- INSTANCE_ID is not specified,
- TIME_OFFSET is 12.90,
- x_relative_min,y_relative_min are 0.8,0.2,
- x_relative_max,y_relative_min are not specified,
- x_relative_max,y_relative_max are 0.9,0.3,
- x_relative_min,y_relative_max are not specified.
As stated previously, you can also specify your bounding boxes by providing all four vertices, as shown in the following examples.
gs://folder/video1.avi,car,,12.10,0.8,0.8,0.9,0.8,0.9,0.9,0.8,0.9
gs://folder/video1.avi,car,,12.90,0.4,0.8,0.5,0.8,0.5,0.9,0.4,0.9
gs://folder/video1.avi,car,,12.10,0.4,0.2,0.5,0.2,0.5,0.3,0.4,0.3
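A short script can also produce rows in either form. The following is a minimal sketch that uses the two-vertex form; the ML_USE values, video URIs, labels, and boxes are hypothetical, as is the output file name import_file.csv.

# Minimal sketch: write object tracking rows in the two-vertex form
# [ML_USE,]VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_min,y_min,,,x_max,y_max,,
# All values below are hypothetical.
import csv

rows = [
    # (ml_use, video_uri, label, instance_id, time_offset, x_min, y_min, x_max, y_max)
    ("TRAINING", "gs://folder/video1.avi", "car", "", 12.90, 0.8, 0.2, 0.9, 0.3),
    ("TEST", "gs://folder/video1.avi", "bike", "", 12.50, 0.45, 0.45, 0.55, 0.55),
]

with open("import_file.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for ml_use, uri, label, instance_id, offset, x_min, y_min, x_max, y_max in rows:
        # Leave the B and D vertex columns empty when only the two diagonal
        # vertices (A and C) are given.
        writer.writerow([ml_use, uri, label, instance_id, offset,
                         x_min, y_min, "", "", x_max, y_max, "", ""])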
Example CSV - no labels:
You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by eleven commas, as shown in the following example.
Example without assigned ml_use:
gs://folder/video1.avi
...
Example with ml_use assigned:
TRAINING,gs://folder/video1.avi
TEST,gs://folder/video2.avi
...
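After the import file is in Cloud Storage, you can create a dataset from it with the Vertex AI SDK for Python. The following is a minimal sketch rather than the only supported path: the project, region, display name, and bucket path are hypothetical, and it assumes the google-cloud-aiplatform package and its object tracking import schema constant.

# Minimal sketch: create a video dataset and import a prepared JSONL file.
# Project, location, display name, and Cloud Storage path are hypothetical.
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.VideoDataset.create(
    display_name="object-tracking-demo",
    gcs_source="gs://my-bucket/import_file.jsonl",
    import_schema_uri=aiplatform.schema.dataset.ioformat.video.object_tracking,
)
print(dataset.resource_name)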