Prepare video data

This page describes how to prepare video data for use in a Vertex AI dataset. The format of video input data depends on the objective. The following video objectives are currently supported:

  • Action recognition
  • Classification
  • Object tracking

The following sections provide information about data requirements, schema files, and the format of the data import files (JSONL & CSV) that are defined by the schema.

Alternatively, you can import videos that have not been annotated and annotate them later using the console (see Labeling using the Google Cloud Console).

Data requirements

The following requirements apply to datasets used to train AutoML or custom-trained models.

  • Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video).

    • .MOV
    • .MPEG4
    • .MP4
    • .AVI

    The maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container aren't supported.

  • The maximum number of labels in each dataset is currently limited to 1,000.

  • You may assign "ML_USE" labels to the videos in the import files. At training time, you may choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets.

    Action Recognition

    • Limitation: If you use the VAR labeling console to label actions in a video, you must label all of the actions in that video.

    Classification

    • At least two different classes are required for model training. For example, "news" and "MTV", or "game" and "others".
    • Consider including a "None_of_the_above" class and video segments that do not match any of your defined classes.

    Object tracking

    • The maximum number of labeled video frames in each dataset is currently limited to 150,000.
    • The maximum number of total annotated bounding boxes in each dataset is currently limited to 1,000,000.
    • The maximum number of labels in each annotation set is currently limited to 1,000.

Best practices for video data used to train AutoML models

The following practices apply to datasets used to train AutoML models.

  • The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.

  • Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.

  • The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing very low frequency labels (see the sketch after this list for a quick way to check).

    Action Recognition

    • 100 or more training video frames per label are recommended.
    • For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality may be lost during the frame normalization process used by Vertex AI.

    Classification

    We recommend about 1,000 training videos per label. The minimum per label is 10, or 50 for advanced models. In general, it takes more examples per label to train models with multiple labels per video, and the resulting scores are harder to interpret.

    Object tracking

    • Minimum bounding box size is 10px by 10px.
    • For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality can be lost during the frame normalization process used by AutoML object tracking.
    • Each unique label must be present in at least 3 distinct video frames. In addition, each label must also have a minimum of 10 annotations.
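
The 1:100 guideline above can be checked before you import your data. The following is a minimal sketch, assuming a JSONL import file of the kind described later on this page; the file name is a placeholder and not part of any Vertex AI API.

import json
from collections import defaultdict

# Placeholder name for a JSONL import file like those described later on this page.
IMPORT_FILE = "annotations.jsonl"

# Count the distinct videos that carry each label.
videos_per_label = defaultdict(set)
with open(IMPORT_FILE) as f:
    for line in f:
        record = json.loads(line)
        for annotation in record.get("timeSegmentAnnotations", []):
            videos_per_label[annotation["displayName"]].add(record["videoGcsUri"])

max_count = max(len(videos) for videos in videos_per_label.values())
for label, videos in sorted(videos_per_label.items()):
    # Flag labels that are more than 100 times rarer than the most common label.
    if len(videos) * 100 < max_count:
        print(f"Low-frequency label '{label}': {len(videos)} videos vs {max_count} for the most common label")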

Schema files

  • Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The file's structure follows the OpenAPI schema.

    Action Recognition

    gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml

    
    
    title: VideoActionRecognition
    description: >
      Import and export format for importing/exporting videos together with
      action recognition annotations with time segment. Can be used in
      Dataset.import_schema_uri field.
    type: object
    required:
    - videoGcsUri
    properties:
      videoGcsUri:
        type: string
        description: >
          A Cloud Storage URI pointing to a video. Up to 50 GB in size and
          up to 3 hours in duration. Supported file mime types: `video/mp4`,
          `video/avi`, `video/quicktime`.
      timeSegments:
        type: array
        description: Multiple fully-labeled segments.
        items:
          type: object
          description: A time period inside the video.
          properties:
            startTime:
              type: string
              description: >
                The start of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision.
              default: 0s
            endTime:
              type: string
              description: >
                The end of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision, and "Infinity"
                is allowed, which corresponds to the end of the video.
              default: Infinity
      timeSegmentAnnotations:
        type: array
        description: >
          Multiple action recognition annotations. Each on a time segment of the video.
        items:
          type: object
          description: Annotation with a time segment on media (e.g. video).
          properties:
            displayName:
              type: string
              description: >
                It will be imported as/exported from AnnotationSpec's display name.
            startTime:
              type: string
              description: >
                The start of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision.
              default: 0s
            endTime:
              type: string
              description: >
                The end of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision, and "Infinity"
                is allowed, which means the end of the video.
              default: Infinity
            annotationResourceLabels:
              description: Resource labels on the Annotation.
              type: object
              additionalProperties:
                type: string
      dataItemResourceLabels:
        description: Resource labels on the DataItem. Overrides values set in
          ImportDataConfig at import time. Can set a user-defined label
          or the predefined `aiplatform.googleapis.com/ml_use` label, which is
          used to determine the data split and can be set to `training`, `test`,
          and `validation`.
        type: object
        additionalProperties:
          type: string

    Classification

    gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml

    
    
    title: VideoClassification
    description: >
      Import and export format for importing/exporting videos together with
      classification annotations with time segment. Can be used in
      Dataset.import_schema_uri field.
    type: object
    required:
    - videoGcsUri
    properties:
      videoGcsUri:
        type: string
        description: >
          A Cloud Storage URI pointing to a video. Up to 50 GB in size and
          up to 3 hours in duration. Supported file mime types: `video/mp4`,
          `video/avi`, `video/quicktime`.
      timeSegmentAnnotations:
        type: array
        description: >
          Multiple classification annotations. Each on a time segment of the video.
        items:
          type: object
          description: Annotation with a time segment on media (e.g., video).
          properties:
            displayName:
              type: string
              description: >
                It will be imported as/exported from AnnotationSpec's display name.
            startTime:
              type: string
              description: >
                The start of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision.
              default: 0s
            endTime:
              type: string
              description: >
                The end of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision, and "Infinity"
                is allowed, which corresponds to the end of the video.
              default: Infinity
            annotationResourceLabels:
              description: Resource labels on the Annotation.
              type: object
              additionalProperties:
                type: string
      dataItemResourceLabels:
        description: Resource labels on the DataItem.
        type: object
        additionalProperties:
          type: string
    

    Object Tracking

    gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml

    
    
    title: VideoObjectTracking
    version: 1.0.0
    description: >
      Import and export format for importing/exporting videos together with
      temporal bounding box annotations.
    type: object
    required:
    - videoGcsUri
    properties:
      videoGcsUri:
        type: string
        description: >
          A Cloud Storage URI pointing to a video. Up to 50 GB in size and
          up to 3 hours in duration. Supported file mime types: `video/mp4`,
          `video/avi`, `video/quicktime`.
      TemporalBoundingBoxAnnotations:
        type: array
        description: Multiple temporal bounding box annotations. Each on a frame of the video.
        items:
          type: object
          description: >
            Temporal bounding box annotation on video. `xMin`, `xMax`, `yMin`, and
            `yMax` are relative to the video frame size, and the point 0,0 is in the
            top left of the frame.
          properties:
            displayName:
              type: string
              description: >
                It will be imported as/exported from AnnotationSpec's display name,
                i.e., the name of the label/class.
            xMin:
              description: The leftmost coordinate of the bounding box.
              type: number
              format: double
            xMax:
              description: The rightmost coordinate of the bounding box.
              type: number
              format: double
            yMin:
              description: The topmost coordinate of the bounding box.
              type: number
              format: double
            yMax:
              description: The bottommost coordinate of the bounding box.
              type: number
              format: double
            timeOffset:
              type: string
              description: >
                A time offset of a video in which the object has been detected.
                Expressed as a number of seconds as measured from the
                start of the video, with fractions up to a microsecond precision, and
                with "s" appended at the end.
            instanceId:
              type: number
              format: integer
              description: >
                The instance of the object, expressed as a positive integer. Used to
                tell apart objects of the same type when multiple are present on a
                single video.
            annotationResourceLabels:
              description: Resource labels on the Annotation.
              type: object
              additionalProperties:
                type: string
      dataItemResourceLabels:
        description: Resource labels on the DataItem.
        type: object
        additionalProperties:
          type: string
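
After you create an import file that conforms to one of these schemas, you can pass the matching schema URI when creating a dataset and importing data. The following is a minimal sketch using the Vertex AI SDK for Python; the project ID, region, Cloud Storage path, and display name are placeholders, and the classification schema URI is shown, so substitute the action recognition or object tracking URI as needed.

from google.cloud import aiplatform

# Placeholder project and region.
aiplatform.init(project="your-project-id", location="us-central1")

dataset = aiplatform.VideoDataset.create(
    display_name="my-video-dataset",
    # Placeholder path to the JSONL import file in Cloud Storage.
    gcs_source="gs://your-bucket/annotations.jsonl",
    # Schema file that dictates the format of the import file.
    import_schema_uri="gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml",
)
print(dataset.resource_name)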

Input files

The format of your training data depends on your objective. Video has three possible objectives:

  • Action recognition
  • Classification
  • Object tracking

Action recognition

Create either a JSONL or CSV file to import your data.

JSONL

JSON on each line:
See the action recognition YAML schema file for details.
Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, in which case they correspond to the key frame of the action.



{
  "videoGcsUri': "gs://bucket/filename.ext",
  "timeSegments": [{
    "startTime": "start_time_of_fully_annotated_segment",
    "endTime": "end_time_of_segment"}],
  "timeSegmentAnnotations": [{
    "displayName": "LABEL",
    "startTime": "start_time_of_segment",
    "endTime": "end_time_of_segment"
  }],
  "dataItemResourceLabels": {
    "ml_use": "train|test"
  }
}

Example JSONL - Video action recognition:



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}}
...
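
If you generate the import file programmatically, note that startTime and endTime are strings of seconds measured from the start of the video, with an "s" suffix. The following is a minimal sketch that writes lines in the format above; the annotation values and output file name are illustrative.

import json

# Illustrative annotations: (video URI, label, start seconds, end seconds, ml_use).
annotations = [
    ("gs://demo/video1.mp4", "cartwheel", 1.0, 12.0, "training"),
    ("gs://demo/video2.mp4", "swing", 4.0, 9.0, "test"),
]

with open("action_recognition.jsonl", "w") as f:
    for uri, label, start, end, ml_use in annotations:
        record = {
            "videoGcsUri": uri,
            "timeSegmentAnnotations": [{
                "displayName": label,
                # Seconds expressed as strings with an "s" suffix.
                "startTime": f"{start}s",
                "endTime": f"{end}s",
            }],
            "dataItemResourceLabels": {"ml_use": ml_use},
        }
        f.write(json.dumps(record) + "\n")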

CSV

List of columns
  1. Validation data (Optional). A TRAINING or TEST specification for the data split.
  2. The content to be categorized or annotated. This field contains Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
  3. A label that identifies how the video is categorized. Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
  4. Start and end time of the video segment. These two comma-separated fields identify the start and end time of the video segment to analyze, in seconds. The start time must be less than the end time. Both values must be non-negative and within the time range of the video. For example, in 0.09845,1.3600555, the first value (0.09845) is the start time and the second value (1.3600555) is the end time of the segment that you want labeled. To use the entire content of the video, specify a start time of 0 and an end time of the full length of the video or "inf". For example, 0,inf.
  5. Annotation. An annotation is a label with either a frame timestamp or a time segment.

Each row has to be one of the following:

VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, , , LABEL, ANNOTATION_FRAME_TIMESTAMP
VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END
VIDEO_URI, , , LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END

Some examples

Label two actions at different times:

  gs://folder/video1.avi,kick,12.90,,
  gs://folder/video1.avi,catch,19.65,,

There is no action of interest within the two time ranges. Note: the last row means that a fully labeled segment can contain no actions.

  gs://folder/video1.avi,,,10.0,20.0
  gs://folder/video1.avi,,,25.0,40.0

Your training data must have at least 1 label and 1 fully labeled segment.


You do not need to specify validation data to verify the results of your trained model. Vertex AI automatically divides the rows identified for training into training and validation data: 80% for training and 20% for validation.

Save the contents as a CSV file in your Cloud Storage bucket.

Classification

Create either a JSONL or CSV file to import your data.

JSONL

JSON on each line:
See Classification schema (global) file for details.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"timeSegmentAnnotations": [{
		"displayName": "LABEL",
		"startTime": "start_time_of_segment",
		"endTime": "end_time_of_segment"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSONL - Video classification:



{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...

CSV

Format of a row in the CSV:

[ML_USE,]VIDEO_URI,LABEL,START,END

List of columns

  1. ML_USE (Optional). For data split purposes when training a model. Use TRAINING or TEST.
  2. VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
  3. LABEL. Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
  4. START,END. These two columns identify the start and end time of the video segment to analyze, in seconds. The start time must be less than the end time. Both values must be non-negative and within the time range of the video. For example, 0.09845,1.36005. To use the entire content of the video, specify a start time of 0 and an end time of the full length of the video or "inf". For example, 0,inf.

Example CSV - Classification using single label

Single-label on the same video segment:

  TRAINING,gs://YOUR_VIDEO_PATH/vehicle.mp4,mustang,0,5.4
  ...
  

Example CSV - multiple labels:

Multi-label on the same video segment:

  gs://YOUR_VIDEO_PATH/vehicle.mp4,fiesta,0,8.285
  gs://YOUR_VIDEO_PATH/vehicle.mp4,ranger,0,8.285
  gs://YOUR_VIDEO_PATH/vehicle.mp4,explorer,0,8.285
  ...
  

Example CSV - no labels:

You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud Console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by three commas, as shown in the following example.

  gs://YOUR_VIDEO_PATH/vehicle.mp4
  ...
  

Object tracking

Create either a JSONL or CSV file to import your data.

JSONL

JSON on each line:
See the object tracking YAML schema file for details.



{
	"videoGcsUri": "gs://bucket/filename.ext",
	"TemporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding_box",
		"xMax": "rightmost_coordinate_of_the_bounding_box",
		"yMin": "topmost_coordinate_of_the_bounding_box",
		"yMax": "bottommost_coordinate_of_the_bounding_box",
		"timeOffset": "timeframe_object_detected",
		"instanceId": "instance_of_object",
		"annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "train|test"
	}
}

Example JSONL - Video object tracking:



{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...
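
The following is a minimal sketch for writing a line like the ones above, using the key names from the JSON template earlier in this section; per the schema, the coordinates and instanceId are numbers, and the video URI, values, and output file name are illustrative.

import json

# Illustrative record; coordinates are relative to the frame size (0 to 1).
records = [
    {
        "videoGcsUri": "gs://demo-data/video1.mp4",
        "TemporalBoundingBoxAnnotations": [{
            "displayName": "horse",
            "xMin": 0.668912,
            "yMin": 0.560642,
            "xMax": 1.0,
            "yMax": 1.0,
            # Offset from the start of the video, in seconds with an "s" suffix.
            "timeOffset": "4.000000s",
            "instanceId": -1,
        }],
        "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"},
    },
]

with open("object_tracking.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")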

CSV

Format of a row in the CSV file:

[ML_USE,]VIDEO_URI,LABEL,[INSTANCE_ID],TIME_OFFSET,BOUNDING_BOX

List of columns

  • ML_USE (Optional). For data split purposes when training a model. Use TRAINING or TEST.
  • VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
  • LABEL. Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
  • INSTANCE_ID (Optional). An instance ID that identifies the object instance across video frames in a video. If provided, AutoML object tracking uses it for object tracking tuning, training, and evaluation. The bounding boxes of the same object instance present in different video frames are labeled with the same instance ID. The instance ID is unique only within each video, not across the dataset. For example, if two objects from two different videos have the same instance ID, it does not mean they are the same object instance.
  • TIME_OFFSET. The video frame that indicates the duration offset from the beginning of the video. The time offset is a floating-point number, in seconds.
  • BOUNDING_BOX. A bounding box for an object in the video frame. Specifying a bounding box involves more than one column. Each box has four vertices:
    A. x_relative_min,y_relative_min
    B. x_relative_max,y_relative_min
    C. x_relative_max,y_relative_max
    D. x_relative_min,y_relative_max

    Each vertex is specified by x,y coordinate values. The coordinate values must be floats in the 0 to 1 range, where 0 represents the minimum x or y value, and 1 represents the greatest x or y value.
    For example, (0,0) represents the top-left corner, and (1,1) represents the bottom-right corner; a bounding box for the entire image is expressed as (0,0,,,1,1,,) or (0,0,1,0,1,1,0,1).
    AutoML object tracking does not require a specific vertex ordering. Additionally, if the four specified vertices don't form a rectangle parallel to the image edges, Vertex AI specifies vertices that do form such a rectangle.
    The bounding box for an object can be specified in one of two ways (see the sketch after this list):
    1. Two vertices (each an x,y coordinate pair) that are diagonally opposite corners of the rectangle:
      A. x_relative_min,y_relative_min
      C. x_relative_max,y_relative_max
      as shown in this example:
      x_relative_min,y_relative_min,,,x_relative_max,y_relative_max,,
    2. All four vertices, as shown in:
      x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max
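
The following is a minimal sketch of the two-vertex form, converting pixel coordinates to the relative values used in the bounding box columns; the video URI, frame dimensions, and box values are illustrative.

def bounding_box_row(video_uri, label, time_offset, x_min_px, y_min_px,
                     x_max_px, y_max_px, frame_width, frame_height,
                     instance_id=""):
    """Builds a CSV row using the two-vertex (diagonal corners) form."""
    # Normalize pixel coordinates to the 0-1 range relative to the frame size.
    x_min = x_min_px / frame_width
    y_min = y_min_px / frame_height
    x_max = x_max_px / frame_width
    y_max = y_max_px / frame_height
    # Columns: VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_min,y_min,,,x_max,y_max,,
    return (f"{video_uri},{label},{instance_id},{time_offset},"
            f"{x_min},{y_min},,,{x_max},{y_max},,")

# Illustrative example: a 1920x1080 frame with a box from (1536,216) to (1728,324).
print(bounding_box_row("gs://folder/video1.avi", "car", 12.90,
                       1536, 216, 1728, 324, 1920, 1080))
# Prints: gs://folder/video1.avi,car,,12.9,0.8,0.2,,,0.9,0.3,,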

Examples of rows in dataset files

The following rows demonstrate how to specify data in a dataset file. Each example includes a path to a video on Cloud Storage, a label for the object, a time offset to begin tracking, and two diagonal vertices. The columns are:

VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max

gs://folder/video1.avi,car,,12.90,0.8,0.2,,,0.9,0.3,,
gs://folder/video1.avi,bike,,12.50,0.45,0.45,,,0.55,0.55,,
where, in the first row:

  • VIDEO_URI is gs://folder/video1.avi,
  • LABEL is car,
  • INSTANCE_ID is not specified,
  • TIME_OFFSET is 12.90,
  • x_relative_min,y_relative_min are 0.8,0.2,
  • x_relative_max,y_relative_min are not specified,
  • x_relative_max,y_relative_max are 0.9,0.3,
  • x_relative_min,y_relative_max are not specified

As stated previously, you can also specify your bounding boxes by providing all four vertices, as shown in the following examples.

gs://folder/video1.avi,car,,12.10,0.8,0.8,0.9,0.8,0.9,0.9,0.8,0.9
gs://folder/video1.avi,car,,12.90,0.4,0.8,0.5,0.8,0.5,0.9,0.4,0.9
gs://folder/video1.avi,car,,12.10,0.4,0.2,0.5,0.2,0.5,0.3,0.4,0.3

Example CSV - no labels:

You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud Console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by eleven commas, as shown in the following example.

Example without assigned ml_use:

  gs://folder/video1.avi
  ...
  

Example with ml_use assigned:

  TRAINING,gs://folder/video1.avi
  TEST,gs://folder/video2.avi
  ...
  

What's next