This page describes how to prepare video training data for use in a Vertex AI dataset to train a video action recognition model.
The following sections provide information about data requirements, schema files, and the format of the data import files (JSONL & CSV) that are defined by the schema.
Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).
Data requirements
The following requirements apply to datasets used to train AutoML or custom-trained models.
Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video).
- .MOV
- .MPEG4
- .MP4
- .AVI
To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Since not all browsers handle .MOV or .AVI content natively, the recommendation is to use either .MPEG4 or .MP4 video format.
Maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container aren't supported.
The maximum number of labels in each dataset is limited to 1,000.
You can assign "ML_USE" labels to the videos in the import files. At training time, you can choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For action recognition, note a limitation of the VAR labeling console: if you use the labeling tool to label actions in a video, you must label all of the actions in that video.
Best practices for video data used to train AutoML models
The following practices apply to datasets used to train AutoML models.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.
The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low-frequency labels. For action recognition, note the following (a quick way to check your label counts is sketched after this list):
- 100 or more training video frames per label are recommended.
- For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality may be lost during the frame normalization process used by Vertex AI.
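To check how your labels are distributed before importing, you can count the distinct videos per label in your JSONL import file (the format is described under Input files below). The following is a minimal sketch, assuming a local file hypothetically named annotations.jsonl in that format; the file name and the threshold check are illustrative, not part of Vertex AI.

import json
from collections import defaultdict

# Count the distinct videos that carry each action label in a JSONL import file.
videos_per_label = defaultdict(set)
with open("annotations.jsonl") as f:  # hypothetical local copy of the import file
    for line in f:
        record = json.loads(line)
        for annotation in record.get("timeSegmentAnnotations", []):
            videos_per_label[annotation["displayName"]].add(record["videoGcsUri"])

counts = {label: len(videos) for label, videos in videos_per_label.items()}
for label, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{label}: {count} videos")

# Flag labels that are more than 100x rarer than the most common label.
if counts:
    most_common = max(counts.values())
    rare = [label for label, count in counts.items() if most_common > 100 * count]
    print("Labels to consider removing or augmenting:", rare)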
Schema files
Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.
Action Recognition schema file:
gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml
Full schema file
title: VideoActionRecognition
description: >
  Import and export format for importing/exporting videos together with
  action recognition annotations with time segment. Can be used in
  Dataset.import_schema_uri field.
type: object
required:
- videoGcsUri
properties:
  videoGcsUri:
    type: string
    description: >
      A Cloud Storage URI pointing to a video. Up to 50 GB in size and
      up to 3 hours in duration. Supported file mime types: `video/mp4`,
      `video/avi`, `video/quicktime`.
  timeSegments:
    type: array
    description: Multiple fully-labeled segments.
    items:
      type: object
      description: A time period inside the video.
      properties:
        startTime:
          type: string
          description: >
            The start of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision.
          default: 0s
        endTime:
          type: string
          description: >
            The end of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision, and "Infinity"
            is allowed, which corresponds to the end of the video.
          default: Infinity
  timeSegmentAnnotations:
    type: array
    description: >
      Multiple action recognition annotations. Each on a time segment of the
      video.
    items:
      type: object
      description: Annotation with a time segment on media (e.g., video).
      properties:
        displayName:
          type: string
          description: >
            It will be imported as/exported from AnnotationSpec's display name.
        startTime:
          type: string
          description: >
            The start of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision.
          default: 0s
        endTime:
          type: string
          description: >
            The end of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision, and "Infinity"
            is allowed, which means the end of the video.
          default: Infinity
        annotationResourceLabels:
          description: Resource labels on the Annotation.
          type: object
          additionalProperties:
            type: string
  dataItemResourceLabels:
    description: >
      Resource labels on the DataItem. Overrides values set in
      ImportDataConfig at import time. Can set a user-defined label or the
      predefined `aiplatform.googleapis.com/ml_use` label, which is used to
      determine the data split and can be set to `training` and `test`.
    type: object
    additionalProperties:
      type: string
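Once your import file is in Cloud Storage, you can reference this schema when creating the dataset. The following is a minimal sketch using the google-cloud-aiplatform Python SDK; the project, region, display name, and file paths are hypothetical placeholders to replace with your own values.

from google.cloud import aiplatform

# Hypothetical project and region.
aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.VideoDataset.create(
    display_name="action-recognition-videos",            # hypothetical name
    gcs_source="gs://my-bucket/import/annotations.jsonl", # hypothetical import file
    # Points at the action recognition schema shown above.
    import_schema_uri=aiplatform.schema.dataset.ioformat.video.action_recognition,
)
print(dataset.resource_name)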
Input files
The format of your training data for video action recognition is as follows.
To import your data, create either a JSONL or CSV file.
JSONL
JSON on each line:
See the action recognition schema file above for details.
Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, which corresponds to the key frame of the action.
{ "videoGcsUri': "gs://bucket/filename.ext", "timeSegments": [{ "startTime": "start_time_of_fully_annotated_segment", "endTime": "end_time_of_segment"}], "timeSegmentAnnotations": [{ "displayName": "LABEL", "startTime": "start_time_of_segment", "endTime": "end_time_of_segment" }], "dataItemResourceLabels": { "ml_use": "train|test" } }
Example JSONL - Video action recognition:
{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}} {"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}} ...
CSV

Each row has to be one of the following:

VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, , , LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END
VIDEO_URI, , , LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END

List of columns
- ML_USE (optional): the TRAINING or TEST specification, used to split the data when training a model.
- TIME_SEGMENT_START and TIME_SEGMENT_END: the start and end time of the fully labeled video segment, for example 0.09845,1.3600555, where the first value (0.09845) is the start time and the second value (1.3600555) is the end time of the video segment that you want labeled. To use the entire content of the video, specify a start time of 0 and an end time of the full length of the video, or "inf". For example, 0,inf.

Some examples

Label two actions at different times:
gs://folder/video1.avi,kick,12.90,,
gs://folder/video1.avi,catch,19.65,,
There's no action of interest within the two time ranges (note: a fully labeled segment can contain no actions):
gs://folder/video1.avi,,,10.0,20.0
gs://folder/video1.avi,,,25.0,40.0
Your training data must have at least one label and one fully labeled segment.
You do not need to specify validation data to verify the results of your trained model. Vertex AI automatically divides the rows identified for training into training and validation data, using 80% for training and 20% for validation.
Save the contents as a CSV file in your Cloud Storage bucket.
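As one way to produce and stage that file, the following is a minimal sketch using Python's csv module and the google-cloud-storage client library; the video URIs, labels, bucket, and object path are hypothetical placeholders.

import csv
from google.cloud import storage

# Hypothetical rows in the VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL,
# ANNOTATION_FRAME_TIMESTAMP format shown above.
rows = [
    ["gs://my-bucket/videos/video1.avi", "0", "inf", "kick", "12.90"],
    ["gs://my-bucket/videos/video1.avi", "0", "inf", "catch", "19.65"],
]

with open("annotations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Upload the CSV to your Cloud Storage bucket so it can be imported into the dataset.
client = storage.Client()  # uses your default credentials and project
client.bucket("my-bucket").blob("import/annotations.csv").upload_from_filename("annotations.csv")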