Preparing your training data

This page describes how to prepare your data for training an AutoML Video Object Tracking model.

Preparing your videos

This page describes how to prepare your training and test data so that AutoML Video Intelligence Object Tracking can create a custom video annotation model for you.

  • AutoML Video Intelligence Object Tracking supports the following video formats for training your model or requesting a prediction:

    • .MOV
    • .MPEG4
    • .MP4
    • .AVI
  • Training videos can be a maximum of 50 GB in size and up to 3 hours in duration. Individual video files with malformed or empty time offsets in the container aren't supported.

  • The training data should be as close as possible to the data on which you want to make predictions. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.

  • AutoML Video Object Tracking models generally can't predict labels that humans can't assign. For example, if a human can't assign a label after looking at the video for 1-2 seconds, the model likely can't be trained to do it either.

  • You should provide at least 100 training video frames per label, and in each frame, all objects of the labels of interest should be labeled. The minimum number of bounding boxes per label is 10.

  • Minimum bounding box size is 10px by 10px.

  • For video frames with a resolution much larger than 1024 by 1024 pixels, some image quality can be lost during the frame normalization process that AutoML Video Object Tracking uses.

  • The models work best when there are at most 100 times more frames for the most common label than for the least common label. You might consider removing very low frequency labels from your datasets.

  • Each unique label must be present in at least 3 distinct video frames. In addition, each label must also have a minimum of 10 annotations.

  • The maximum number of labeled video frames in each dataset is currently limited to 150,000.

  • The maximum number of total annotated bounding boxes in each dataset is currently limited to 1,000,000.

  • The maximum number of labels in each dataset is currently limited to 1,000.

  • Your training data must have at least 1 label.
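The numeric limits above lend themselves to a quick pre-upload check. The following is a minimal sketch, assuming a simplified in-memory representation of one (video_uri, label, time_offset) tuple per bounding box; the function name, messages, and data structure are illustrative and not part of any AutoML API:

```python
from collections import Counter

# Limits and recommendations taken from the list above.
MIN_BOXES_PER_LABEL = 10
MIN_DISTINCT_FRAMES_PER_LABEL = 3
RECOMMENDED_FRAMES_PER_LABEL = 100
MAX_LABELED_FRAMES = 150_000
MAX_BOXES = 1_000_000
MAX_LABELS = 1_000
MAX_FREQUENCY_RATIO = 100  # most common label vs. least common label

def check_dataset(annotations):
    """annotations: one (video_uri, label, time_offset) tuple per bounding box."""
    if not annotations:
        return ["dataset has no labels"]
    problems = []
    boxes_per_label = Counter(label for _, label, _ in annotations)
    frames_per_label = {
        label: len({(uri, t) for uri, lab, t in annotations if lab == label})
        for label in boxes_per_label
    }
    if len(boxes_per_label) > MAX_LABELS:
        problems.append(f"too many labels ({len(boxes_per_label)})")
    if len(annotations) > MAX_BOXES:
        problems.append("too many bounding boxes")
    if len({(uri, t) for uri, _, t in annotations}) > MAX_LABELED_FRAMES:
        problems.append("too many labeled frames")
    for label, boxes in boxes_per_label.items():
        if boxes < MIN_BOXES_PER_LABEL:
            problems.append(f"label {label!r} has only {boxes} boxes")
        if frames_per_label[label] < MIN_DISTINCT_FRAMES_PER_LABEL:
            problems.append(f"label {label!r} is in too few distinct frames")
        elif frames_per_label[label] < RECOMMENDED_FRAMES_PER_LABEL:
            problems.append(f"label {label!r} is below the recommended 100 frames")
    if max(frames_per_label.values()) > MAX_FREQUENCY_RATIO * min(frames_per_label.values()):
        problems.append("label frequencies are too imbalanced")
    return problems
```

Running such a check before uploading saves a failed import round-trip when a label is underrepresented or a dataset-wide cap is exceeded.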

Training, validation, and test datasets

The data in a dataset is divided into three datasets when training a model: a training dataset, a validation dataset (optional), and a test dataset.

  • A training dataset is used to build a model. While searching for patterns in the training data, multiple algorithms and parameters are attempted.
  • As patterns are identified, the validation dataset is used to test the algorithms and patterns. The best performing algorithms and patterns are chosen from those identified during the training stage.
  • After the best performing algorithms and patterns have been identified, they are tested for error rate, quality, and accuracy using the test dataset.

Both a validation and a test dataset are used in order to avoid bias in the model. During the validation stage, optimal model parameters are used, which can result in biased metrics. Using the test dataset to assess the quality of the model after the validation stage provides an unbiased assessment of the quality of the model.

To identify your training and test dataset, use CSV files.

Create CSV files with video URIs and labels

Once your files have been uploaded to Cloud Storage, you can create CSV files that list all of your training data and the category labels for that data. The CSV files can have any filenames, must be UTF-8 encoded, and must end with a .csv extension.

You can use the following CSV files for training and verifying your model:

Model training file list

Contains paths to the training and test CSV files.

This file is used to identify the locations of separate CSV files that describe your training and testing data.

Here are some examples of the contents of the file list CSV file (the bucket and file names below are placeholders):

Example 1:

    TRAIN,gs://my-storage-bucket/train.csv
    TEST,gs://my-storage-bucket/test.csv

Example 2:

    UNASSIGNED,gs://my-storage-bucket/unassigned.csv

Training data

Used to train the model. Contains URIs to video files, the label identifying the object category, the instance id that identifies the object instance across video frames in a video (optional), the time offset of the labeled video frame, and the object bounding box coordinates.

If you specify a training data CSV file, you must also specify a test or an unassigned data CSV file.

Test data

Used for testing the model during the training phase. Contains the same fields as in the training data.

If you specify a test data CSV file, you must also specify a training or an unassigned data CSV file.

Unassigned data

Used for both training and testing the model. Contains the same fields as in the training data. Rows in the unassigned file are automatically divided into training and testing data, typically 80% for training and 20% for testing.

You can specify only an unassigned data CSV file without training and testing data CSV files. You can also specify only the training and testing data CSV files without an unassigned data CSV file.
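As a sketch of how these files fit together, the following generates a small training file, a test file, and the file list locally; every bucket name, file name, label, and coordinate is a placeholder, and the rows follow the column order described below (URI, label, instance ID, time offset, then eight coordinate slots):

```python
import csv

# Placeholder data rows: URI, label, instance ID, time offset,
# then eight coordinate slots (two-vertex form leaves four slots empty).
train_rows = [["gs://my-storage-bucket/video1.mp4", "car", "", "12.90",
               "0.8", "0.2", "", "", "0.9", "0.3", "", ""]]
test_rows = [["gs://my-storage-bucket/video2.mp4", "car", "", "3.50",
              "0.1", "0.1", "", "", "0.3", "0.4", "", ""]]

for name, rows in [("train.csv", train_rows), ("test.csv", test_rows)]:
    with open(name, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# The file list pairs each data file's role with its Cloud Storage URI
# (the gs:// paths are where you would upload the files above).
with open("file_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["TRAIN", "gs://my-storage-bucket/train.csv"])
    writer.writerow(["TEST", "gs://my-storage-bucket/test.csv"])
```

Writing the files with the csv module and `encoding="utf-8"` also guards against two of the formatting pitfalls covered later: non-UTF-8 encodings and stray quoting.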

The training, test, and unassigned files must have one row per object bounding box in the set you are uploading, with these columns in each row:

  1. The content to be categorized or annotated. This field contains a Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.

  2. A label that identifies how the object is categorized. Labels must start with a letter and only contain letters, numbers, and underscores. AutoML Video Object Tracking also allows you to use labels with white spaces.

  3. An instance ID that identifies the object instance across video frames in a video (optional). If provided, AutoML Video Object Tracking uses it for object-tracking tuning, training, and evaluation. Bounding boxes of the same object instance in different video frames are labeled with the same instance ID. An instance ID is unique only within a single video, not across the dataset; for example, if two objects from two different videos have the same instance ID, they are not necessarily the same object instance.

  4. The time offset of the video frame, which indicates the duration offset from the beginning of the video. The time offset is a floating-point number measured in seconds.

  5. A bounding box for an object in the video frame. The bounding box for an object can be specified in two ways:

    • Using 2 vertices, each a set of x,y coordinates, that are diagonally opposite corners of the rectangle, as shown in this example:

    x_relative_min,y_relative_min,,,x_relative_max,y_relative_max,,
    • Using all 4 vertices:

    Each vertex is specified by x, y coordinate values. These coordinates must be a float in the 0 to 1 range, where 0 represents the minimum x or y value, and 1 represents the greatest x or y value.

    For example, (0,0) represents the top left corner, and (1,1) represents the bottom right corner; a bounding box for the entire image is expressed as (0,0,,,1,1,,), or (0,0,1,0,1,1,0,1).

    The AutoML Video Object Tracking API does not require a specific vertex ordering. Additionally, if the four specified vertices don't form a rectangle parallel to the image edges, the API specifies vertices that do form such a rectangle.
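To reduce comma-counting mistakes when hand-building rows, a row can be assembled programmatically. This sketch assumes the column order described above (URI, label, optional instance ID, time offset, then eight coordinate slots); the helper names are illustrative, not part of the AutoML API:

```python
def _clamp(value):
    """Keep a relative coordinate inside the required 0.0-1.0 range."""
    return min(max(float(value), 0.0), 1.0)

def csv_row_two_vertices(video_uri, label, time_offset,
                         x_min, y_min, x_max, y_max, instance_id=""):
    """Build a row using two diagonally opposite vertices; the four
    unused coordinate slots stay empty."""
    coords = [_clamp(x_min), _clamp(y_min), "", "",
              _clamp(x_max), _clamp(y_max), "", ""]
    fields = [video_uri, label, instance_id, time_offset, *coords]
    return ",".join(str(f) for f in fields)

def csv_row_four_vertices(video_uri, label, time_offset,
                          vertices, instance_id=""):
    """Build a row from four (x, y) vertex pairs in relative coordinates."""
    coords = [_clamp(c) for xy in vertices for c in xy]
    fields = [video_uri, label, instance_id, time_offset, *coords]
    return ",".join(str(f) for f in fields)
```

For example, `csv_row_two_vertices("gs://my-storage-bucket/video1.avi", "car", 12.9, 0.8, 0.2, 0.9, 0.3)` produces a row in the two-vertex form shown above.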

Examples of CSV dataset files

The following rows demonstrate how to specify data in a dataset. Each row includes a path to a video on Cloud Storage, a label for the object, a time offset for the labeled frame, and two diagonally opposite vertices (the paths, labels, and values below are placeholders):

    gs://my-storage-bucket/video1.avi,car,,12.90,0.8,0.2,,,0.9,0.3,,
    gs://my-storage-bucket/video1.avi,bike,,12.50,0.45,0.45,,,0.55,0.55,,
As stated previously, you can also specify your bounding boxes by providing all four vertices, as shown in the following examples (again with placeholder values):

    gs://my-storage-bucket/video1.avi,car,,12.10,0.8,0.8,0.9,0.8,0.9,0.9,0.8,0.9
    gs://my-storage-bucket/video1.avi,car,,12.90,0.4,0.8,0.5,0.8,0.5,0.9,0.4,0.9
You do not need to specify validation data to verify the results of your trained model. AutoML Video Object Tracking automatically divides the rows identified for training into training and validation data, where 80% is used for training and 20% for validation.
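If you instead prepare explicit training and test files, it is usually safer to split by video rather than by row, so that near-identical frames from one video don't end up on both sides of the split. A minimal sketch (the function name, the seed, and the per-video grouping strategy are choices of this example, not AutoML behavior):

```python
import random

def split_by_video(rows, train_fraction=0.8, seed=7):
    """Split CSV rows ~80/20, keeping all rows of one video together
    so that near-duplicate frames never appear in both partitions."""
    videos = sorted({row.split(",", 1)[0] for row in rows})
    random.Random(seed).shuffle(videos)
    cut = max(1, int(len(videos) * train_fraction))
    train_videos = set(videos[:cut])
    train = [r for r in rows if r.split(",", 1)[0] in train_videos]
    test = [r for r in rows if r.split(",", 1)[0] not in train_videos]
    return train, test
```

Grouping by the URI in the first column keeps evaluation honest: consecutive frames of one video are highly correlated, so a row-level split would leak training content into the test set.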

Troubleshooting CSV dataset issues

If you have issues specifying your dataset using a CSV file, check the CSV file for the following list of common errors:

  • Using Unicode characters in labels. For example, Japanese characters are not supported.
  • Using spaces and non-alphanumeric characters in labels
  • Empty lines
  • Empty columns (lines with two successive commas)
  • Incorrect capitalization of Cloud Storage video paths
  • Incorrect access control configured for your video files. Your service account should have at least read access, or the files must be publicly readable.
  • References to non-video files (such as PDF or PSD files). Likewise, files that are not video files but that have been renamed with a video extension will cause an error.
  • A video URI that points to a bucket outside the current project. Only videos in the project's bucket can be accessed.
  • Non-CSV-formatted files.
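Several of these problems can be caught locally before upload. The following rough linter, intended for the training/test/unassigned data CSVs, mirrors a subset of the checklist (the label pattern follows the label rules stated earlier, with spaces allowed); the function and its messages are illustrative, not an official tool:

```python
import csv
import re

# Letters, digits, underscores, and spaces; must start with a letter.
LABEL_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_ ]*$")
VIDEO_EXTENSIONS = (".mov", ".mpeg4", ".mp4", ".avi")

def lint_dataset_csv(path):
    """Return a list of human-readable problems found in a data CSV."""
    problems = []
    with open(path, encoding="utf-8") as f:  # raises if the file isn't UTF-8
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                problems.append(f"line {lineno}: empty line")
                continue
            fields = next(csv.reader([line]))
            uri = fields[0]
            label = fields[1] if len(fields) > 1 else ""
            if not uri.startswith("gs://"):
                problems.append(f"line {lineno}: not a Cloud Storage URI")
            elif not uri.lower().endswith(VIDEO_EXTENSIONS):
                problems.append(f"line {lineno}: not a supported video extension")
            if label and not LABEL_RE.match(label):
                problems.append(f"line {lineno}: invalid label {label!r}")
    return problems
```

Note that a local check like this can't detect a non-video file renamed with a video extension, wrong capitalization of an existing object, or access-control problems; those require reading the objects in Cloud Storage.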