The following sections provide information about data requirements, schema files, and the format of the data import files (JSONL & CSV) that are defined by the schema.
Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).
Data requirements
The following requirements apply to datasets used to train AutoML or custom-trained models.
Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video).
- .MOV
- .MPEG4
- .MP4
- .AVI
To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Since not all browsers handle .MOV or .AVI content natively, the recommendation is to use either .MPEG4 or .MP4 video format.
The maximum file size is 50 GB, and the maximum duration is 3 hours. Individual video files with malformed or empty timestamps in the container aren't supported.
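The format, size, and duration limits above can be checked before upload. The following is a minimal sketch, not part of Vertex AI: the function name `check_video` is hypothetical, the caller is assumed to already know the file size and duration, and 1 GB is assumed to mean 2**30 bytes.

```python
import os

# Limits copied from the requirements above.
SUPPORTED_EXTENSIONS = {".mov", ".mpeg4", ".mp4", ".avi"}
BROWSER_FRIENDLY_EXTENSIONS = {".mpeg4", ".mp4"}
MAX_SIZE_BYTES = 50 * 1024**3          # assumption: 1 GB = 2**30 bytes
MAX_DURATION_SECONDS = 3 * 60 * 60     # 3 hours


def check_video(name, size_bytes, duration_seconds):
    """Return a list of notes about a video; an empty list means none found."""
    notes = []
    ext = os.path.splitext(name)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        notes.append(f"unsupported format: {ext or '(none)'}")
    elif ext not in BROWSER_FRIENDLY_EXTENSIONS:
        # Supported for training, but the browser may not play it.
        notes.append(f"{ext} may not play in the console; prefer .mp4 or .mpeg4")
    if size_bytes > MAX_SIZE_BYTES:
        notes.append("file is larger than 50 GB")
    if duration_seconds > MAX_DURATION_SECONDS:
        notes.append("video is longer than 3 hours")
    return notes
```

This check doesn't inspect container timestamps, so a file that passes can still be rejected for malformed or empty timestamps.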
The maximum number of labels in each dataset is limited to 1,000.
You may assign "ML_USE" labels to the videos in the import files. At training time, you may choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets.
For video classification, note the following:
- At least two different classes are required for model training. For example, "news" and "MTV", or "game" and "others".
- Consider including a "None_of_the_above" class and video segments that do not match any of your defined classes.
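The label-count rules above (at most 1,000 labels per dataset, at least two classes for training) can be sketched as a pre-import check. The helper name `check_label_set` is hypothetical and exists only for illustration.

```python
MAX_LABELS = 1000   # maximum number of labels per dataset
MIN_CLASSES = 2     # minimum number of classes needed for training


def check_label_set(labels):
    """Return a list of problems with the set of distinct labels."""
    distinct = set(labels)
    problems = []
    if len(distinct) < MIN_CLASSES:
        problems.append(
            f"found {len(distinct)} class(es); at least {MIN_CLASSES} are required"
        )
    if len(distinct) > MAX_LABELS:
        problems.append(
            f"found {len(distinct)} labels; the limit is {MAX_LABELS}"
        )
    return problems
```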
Best practices for video data used to train AutoML models
The following practices apply to datasets used to train AutoML models.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.
The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low-frequency labels. For video classification, the recommended number of training videos per label is about 1,000. The minimum per label is 10, or 50 for advanced models. In general, it takes more examples per label to train models with multiple labels per video, and the resulting scores are harder to interpret.
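One way to check the balance guidance above is to count videos per label and compare the most and least common labels. This is a sketch under assumptions: the input is a list with one collection of labels per video, and `label_balance_report` is a hypothetical helper, not a Vertex AI function.

```python
from collections import Counter


def label_balance_report(video_labels, min_per_label=10, max_ratio=100):
    """Summarize how many videos carry each label.

    video_labels: one iterable of label names per video.
    """
    counts = Counter()
    for labels in video_labels:
        counts.update(set(labels))  # count each video once per label
    if not counts:
        return {"counts": {}, "too_few": [], "ratio": None, "balanced": True}
    ratio = max(counts.values()) / min(counts.values())
    too_few = sorted(name for name, n in counts.items() if n < min_per_label)
    return {
        "counts": dict(counts),
        "too_few": too_few,           # candidates to remove or collect more data for
        "ratio": ratio,               # most common / least common
        "balanced": ratio <= max_ratio,
    }
```

Pass `min_per_label=50` to apply the higher minimum quoted for advanced models.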
Schema files
Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.
Video classification schema file:
gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml
Full schema file
title: VideoClassification
description: >
  Import and export format for importing/exporting videos together with
  classification annotations with time segment. Can be used in
  Dataset.import_schema_uri field.
type: object
required:
- videoGcsUri
properties:
  videoGcsUri:
    type: string
    description: >
      A Cloud Storage URI pointing to a video. Up to 50 GB in size and
      up to 3 hours in duration. Supported file mime types: `video/mp4`,
      `video/avi`, `video/quicktime`.
  timeSegmentAnnotations:
    type: array
    description: >
      Multiple classification annotations. Each on a time segment of the video.
    items:
      type: object
      description: Annotation with a time segment on media (e.g., video).
      properties:
        displayName:
          type: string
          description: >
            It will be imported as/exported from AnnotationSpec's display name.
        startTime:
          type: string
          description: >
            The start of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision.
          default: 0s
        endTime:
          type: string
          description: >
            The end of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision, and "Infinity"
            is allowed, which corresponds to the end of the video.
          default: Infinity
        annotationResourceLabels:
          description: Resource labels on the Annotation.
          type: object
          additionalProperties:
            type: string
  dataItemResourceLabels:
    description: Resource labels on the DataItem.
    type: object
    additionalProperties:
      type: string
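The startTime and endTime strings described in the schema can be parsed as follows. This is an illustrative sketch: the function names are hypothetical, and the regular expression encodes one reading of the schema text (non-negative seconds, an optional fraction of up to six digits, a trailing "s", and the bare word "Infinity").

```python
import math
import re

# Assumption: digits, optional fraction up to microsecond precision, then "s".
_TIME_RE = re.compile(r"\d+(?:\.\d{1,6})?s")


def parse_time_offset(value):
    """Convert a schema time string such as "1.5s" or "Infinity" to seconds."""
    if value == "Infinity":
        return math.inf
    if _TIME_RE.fullmatch(value) is None:
        raise ValueError(f"not a valid time offset: {value!r}")
    return float(value[:-1])  # drop the trailing "s"


def segment_bounds(annotation):
    """Return (start, end) in seconds, applying the schema defaults."""
    start = parse_time_offset(annotation.get("startTime", "0s"))
    end = parse_time_offset(annotation.get("endTime", "Infinity"))
    return start, end
```

An annotation that omits both fields covers the whole video, because the defaults are `0s` and `Infinity`.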
Input files
The format of your training data for video classification is as follows.
To import your data, create either a JSONL or CSV file.
JSONL
JSON on each line:
See Classification schema (global) file for details.
{ "videoGcsUri": "gs://bucket/filename.ext", "timeSegmentAnnotations": [{ "displayName": "LABEL", "startTime": "start_time_of_segment", "endTime": "end_time_of_segment" }], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test" } }
Example JSONL - Video classification:
{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...
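Records in this shape can be produced with Python's standard `json` module, which writes each object on a single line by default. The helper names `make_record` and `to_jsonl` are hypothetical; the field names come from the schema above.

```python
import json


def make_record(video_uri, segments, ml_use=None):
    """Build one import record.

    segments: iterable of (label, start_seconds, end_seconds) tuples.
    ml_use: optional value for the ml_use resource label.
    """
    record = {
        "videoGcsUri": video_uri,
        "timeSegmentAnnotations": [
            {"displayName": label, "startTime": f"{start}s", "endTime": f"{end}s"}
            for label, start, end in segments
        ],
    }
    if ml_use is not None:
        record["dataItemResourceLabels"] = {
            "aiplatform.googleapis.com/ml_use": ml_use
        }
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


if __name__ == "__main__":
    print(to_jsonl([
        make_record("gs://demo/video1.mp4", [("cartwheel", 1.0, 12.0)], "training"),
        make_record("gs://demo/video2.mp4", [("swing", 4.0, 9.0)], "test"),
    ]), end="")
```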
CSV
Format of a row in the CSV:
[ML_USE,]VIDEO_URI,LABEL,START,END
List of columns
- ML_USE (Optional). For data split purposes when training a model. Use TRAINING or TEST.
- VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
- LABEL. Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
- START,END. These two columns, START and END, respectively, identify the start and end time of the video segment to analyze, in seconds. The start time must be less than the end time. Both values must be non-negative and within the time range of the video. For example, 0.09845,1.36005. To use the entire content of the video, specify a start time of 0 and an end time of the full-length of the video or "inf". For example, 0,inf.
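The column rules above can be sketched as a row validator. This is an illustration, not the importer's actual logic: `parse_row` and `parse_csv` are hypothetical names, the `gs://` prefix check is an added assumption, and the check that times fall within the video's length is omitted because it needs the video itself.

```python
import csv
import io
import math
import re

LABEL_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # letter first, then letters/digits/_
ML_USE_VALUES = {"TRAINING", "TEST"}


def parse_row(fields):
    """Validate one CSV row of the form [ML_USE,]VIDEO_URI,LABEL,START,END."""
    fields = [f.strip() for f in fields]
    if len(fields) == 5:
        ml_use, uri, label, start, end = fields
        if ml_use not in ML_USE_VALUES:
            raise ValueError(f"ML_USE must be TRAINING or TEST, got {ml_use!r}")
    elif len(fields) == 4:
        ml_use = None
        uri, label, start, end = fields
    else:
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    if not uri.startswith("gs://"):  # assumption: URIs use the gs:// scheme
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    row = {"ml_use": ml_use, "video_uri": uri, "label": None, "start": None, "end": None}
    if not (label or start or end):
        return row  # unlabeled video, to be annotated later
    if LABEL_RE.fullmatch(label) is None:
        raise ValueError(f"invalid label: {label!r}")
    start_s = float(start)
    end_s = math.inf if end == "inf" else float(end)
    if start_s < 0 or not start_s < end_s:
        raise ValueError(f"invalid segment: {start},{end}")
    row.update(label=label, start=start_s, end=end_s)
    return row


def parse_csv(text):
    """Parse and validate every non-empty row of an import CSV."""
    return [parse_row(fields) for fields in csv.reader(io.StringIO(text)) if fields]
```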
Example CSV - Classification using single label
Single-label on the same video segment:
TRAINING,gs://YOUR_VIDEO_PATH/vehicle.mp4,mustang,0,5.4
...
Example CSV - multiple labels:
Multi-label on the same video segment:
gs://YOUR_VIDEO_PATH/vehicle.mp4,fiesta,0,8.285
gs://YOUR_VIDEO_PATH/vehicle.mp4,ranger,0,8.285
gs://YOUR_VIDEO_PATH/vehicle.mp4,explorer,0,8.285
...
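Multi-label rows like these repeat the same video segment once per label, which is easy to generate with the standard `csv` module. The helper `multi_label_rows` is a hypothetical name used only for this sketch.

```python
import csv
import io


def multi_label_rows(video_uri, labels, start, end, ml_use=None):
    """Return CSV text with one row per label for the same video segment."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label in labels:
        row = [video_uri, label, start, end]
        if ml_use is not None:
            row.insert(0, ml_use)  # optional leading ML_USE column
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(multi_label_rows(
        "gs://YOUR_VIDEO_PATH/vehicle.mp4",
        ["fiesta", "ranger", "explorer"],
        0,
        8.285,
    ), end="")
```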
Example CSV - no labels:
You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by three commas, as shown in the following example.
gs://YOUR_VIDEO_PATH/vehicle.mp4,,,
...
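An import file for unlabeled videos can be generated from a list of URIs by leaving the LABEL, START, and END columns empty. The following is a small sketch; `unlabeled_rows` is a hypothetical helper name.

```python
import csv
import io


def unlabeled_rows(video_uris):
    """Return CSV text with one label-free row per video URI."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for uri in video_uris:
        # Empty LABEL, START, and END columns produce the three trailing commas.
        writer.writerow([uri, "", "", ""])
    return buffer.getvalue()
```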
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-01-09 UTC.