This page describes how to prepare video training data for use in a Vertex AI dataset to train a video action recognition model.
The following sections provide information about data requirements, schema files, and the format of the data import files (JSONL & CSV) that are defined by the schema.
Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).
Data requirements
The following requirements apply to datasets used to train AutoML or custom-trained models.
Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video).
- .MOV
- .MPEG4
- .MP4
- .AVI
To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Since not all browsers handle .MOV or .AVI content natively, the recommendation is to use either .MPEG4 or .MP4 video format.
Maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container aren't supported.
The maximum number of labels in each dataset is limited to 1,000.
You can assign "ML_USE" labels to the videos in the import files. At training time, you can choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For action recognition, note a limitation of the VAR labeling console: if you use the labeling tool to label actions in a video, you must label all of the actions in that video.
Best practices for video data used to train AutoML models
The following practices apply to datasets used to train AutoML models.
The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.
The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low-frequency labels. For action recognition, note the following (a quick way to check your label counts is sketched after this list):
- 100 or more training video frames per label are recommended.
- For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality may be lost during the frame normalization process used by Vertex AI.
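To check how your labels are distributed before importing, you can count the distinct videos per label in your JSONL import file (the format is described under Input files below). The following is a minimal sketch, assuming a local file hypothetically named annotations.jsonl in that format; the file name and the threshold check are illustrative, not part of Vertex AI.

import json
from collections import defaultdict

# Count the distinct videos that carry each action label in a JSONL import file.
videos_per_label = defaultdict(set)
with open("annotations.jsonl") as f:  # hypothetical local copy of the import file
    for line in f:
        record = json.loads(line)
        for annotation in record.get("timeSegmentAnnotations", []):
            videos_per_label[annotation["displayName"]].add(record["videoGcsUri"])

counts = {label: len(videos) for label, videos in videos_per_label.items()}
for label, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{label}: {count} videos")

# Flag labels that are more than 100x rarer than the most common label.
if counts:
    most_common = max(counts.values())
    rare = [label for label, count in counts.items() if most_common > 100 * count]
    print("Labels to consider removing or augmenting:", rare)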
Schema files
Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.
Action Recognition schema file:
gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml
Full schema file
title: VideoActionRecognition
description: >
  Import and export format for importing/exporting videos together with
  action recognition annotations with time segment. Can be used in
  Dataset.import_schema_uri field.
type: object
required:
- videoGcsUri
properties:
  videoGcsUri:
    type: string
    description: >
      A Cloud Storage URI pointing to a video. Up to 50 GB in size and
      up to 3 hours in duration. Supported file mime types: `video/mp4`,
      `video/avi`, `video/quicktime`.
  timeSegments:
    type: array
    description: Multiple fully-labeled segments.
    items:
      type: object
      description: A time period inside the video.
      properties:
        startTime:
          type: string
          description: >
            The start of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision.
          default: 0s
        endTime:
          type: string
          description: >
            The end of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision, and "Infinity"
            is allowed, which corresponds to the end of the video.
          default: Infinity
  timeSegmentAnnotations:
    type: array
    description: >
      Multiple action recognition annotations. Each on a time segment of the
      video.
    items:
      type: object
      description: Annotation with a time segment on media (e.g., video).
      properties:
        displayName:
          type: string
          description: >
            It will be imported as/exported from AnnotationSpec's display name.
        startTime:
          type: string
          description: >
            The start of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision.
          default: 0s
        endTime:
          type: string
          description: >
            The end of the time segment. Expressed as a number of seconds as
            measured from the start of the video, with "s" appended at the end.
            Fractions are allowed, up to a microsecond precision, and "Infinity"
            is allowed, which means the end of the video.
          default: Infinity
        annotationResourceLabels:
          description: Resource labels on the Annotation.
          type: object
          additionalProperties:
            type: string
  dataItemResourceLabels:
    description: >
      Resource labels on the DataItem. Overrides values set in
      ImportDataConfig at import time. Can set a user-defined label or the
      predefined `aiplatform.googleapis.com/ml_use` label, which is used to
      determine the data split and can be set to `training` and `test`.
    type: object
    additionalProperties:
      type: string
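Once your import file is in Cloud Storage, you can reference this schema when creating the dataset. The following is a minimal sketch using the google-cloud-aiplatform Python SDK; the project, region, display name, and file paths are hypothetical placeholders to replace with your own values.

from google.cloud import aiplatform

# Hypothetical project and region.
aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.VideoDataset.create(
    display_name="action-recognition-videos",            # hypothetical name
    gcs_source="gs://my-bucket/import/annotations.jsonl", # hypothetical import file
    # Points at the action recognition schema shown above.
    import_schema_uri=aiplatform.schema.dataset.ioformat.video.action_recognition,
)
print(dataset.resource_name)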
Input files
The format of your training data for video action recognition is as follows.
To import your data, create either a JSONL or CSV file.
JSONL
JSON on each line:
See the action recognition schema file above for details.
Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, which corresponds to the key frame of the action.
{ "videoGcsUri': "gs://bucket/filename.ext", "timeSegments": [{ "startTime": "start_time_of_fully_annotated_segment", "endTime": "end_time_of_segment"}], "timeSegmentAnnotations": [{ "displayName": "LABEL", "startTime": "start_time_of_segment", "endTime": "end_time_of_segment" }], "dataItemResourceLabels": { "ml_use": "train|test" } }
Example JSONL - Video action recognition:
{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}} {"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}} ...
CSV

Each row has to be one of the following:

VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, , , LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END
VIDEO_URI, , , LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END

List of columns
- ML_USE (optional): the TRAINING or TEST specification, used to split the data when training a model.
- TIME_SEGMENT_START and TIME_SEGMENT_END: the start and end time of the fully labeled video segment, for example 0.09845,1.3600555, where the first value (0.09845) is the start time and the second value (1.3600555) is the end time of the video segment that you want labeled. To use the entire content of the video, specify a start time of 0 and an end time of the full length of the video, or "inf". For example, 0,inf.

Some examples

Label two actions at different times:
gs://folder/video1.avi,kick,12.90,,
gs://folder/video1.avi,catch,19.65,,
There's no action of interest within the two time ranges (note: a fully labeled segment can contain no actions):
gs://folder/video1.avi,,,10.0,20.0
gs://folder/video1.avi,,,25.0,40.0
Your training data must have at least one label and one fully labeled segment.
You do not need to specify validation data to verify the results of your trained model. Vertex AI automatically divides the rows identified for training into training and validation data, using 80% for training and 20% for validation.
Save the contents as a CSV file in your Cloud Storage bucket.
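As one way to produce and stage that file, the following is a minimal sketch using Python's csv module and the google-cloud-storage client library; the video URIs, labels, bucket, and object path are hypothetical placeholders.

import csv
from google.cloud import storage

# Hypothetical rows in the VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL,
# ANNOTATION_FRAME_TIMESTAMP format shown above.
rows = [
    ["gs://my-bucket/videos/video1.avi", "0", "inf", "kick", "12.90"],
    ["gs://my-bucket/videos/video1.avi", "0", "inf", "catch", "19.65"],
]

with open("annotations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Upload the CSV to your Cloud Storage bucket so it can be imported into the dataset.
client = storage.Client()  # uses your default credentials and project
client.bucket("my-bucket").blob("import/annotations.csv").upload_from_filename("annotations.csv")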