Best practices

This page describes best practices around preparing your data, evaluating your models, and improving model performance.

Preparing your data

  • The data that you use for training should be as close as possible to the data that you want to make predictions upon. For example, if your use case involves blurry and low-resolution videos (such as security camera footage), your training data should include blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.
  • Your training data should meet certain minimum requirements (a quick validation sketch follows this list):

    • The minimum number of bounding boxes per label is 10.
    • Your labels must be valid strings (no commas).
    • All video frames must have valid timestamps.
    • All video URIs in your CSV must be stored in an accessible Cloud Storage bucket.
  • The more training and test data you have, the better. The more powerful the model, the more data-hungry it becomes.

  • The amount of data needed to train a good model depends on different factors:

    • The number of classes. The more unique classes you have, the more samples per class you need.

    • You should have about 100 training video frames per label. In each frame, all objects of the labels of interest should be labeled.

    • Complexity or diversity of classes. Neural networks might quickly distinguish between cats and birds, but would need a lot more samples to correctly classify 30 different species of birds.

    • For video frame resolutions larger than 1024 by 1024 pixels, some image quality might be lost during the frame normalization process.

  • Avoid training a model with highly imbalanced data. In many cases, the number of samples per class is not equal. Small differences are usually not a problem, but a large imbalance (for example, some classes appearing more than 10 times more often than others) becomes one.
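
The following is a minimal sketch of how you might sanity-check an annotation CSV against the requirements above. It assumes a simplified layout in which the first three columns are the video URI, the label, and a timestamp in seconds, and uses the placeholder file name annotations.csv; adapt the column indexes and file name to the actual AutoML Video Object Tracking CSV format described in the data preparation guide.

```python
import csv
from collections import Counter

# Minimal sanity check for an annotation CSV against the requirements above.
# Assumes a simplified layout: video URI, label, timestamp (seconds), ...
MIN_BOXES_PER_LABEL = 10

def check_annotations(csv_path):
    boxes_per_label = Counter()
    problems = []
    with open(csv_path, newline="") as f:
        for row_num, row in enumerate(csv.reader(f), start=1):
            video_uri, label, timestamp = row[0], row[1], row[2]
            if not video_uri.startswith("gs://"):
                problems.append(f"row {row_num}: not a Cloud Storage URI")
            if not label or "," in label:
                problems.append(f"row {row_num}: empty label or label with a comma")
            try:
                if float(timestamp) < 0:
                    raise ValueError
            except ValueError:
                problems.append(f"row {row_num}: invalid timestamp {timestamp!r}")
            boxes_per_label[label] += 1
    for label, count in boxes_per_label.items():
        if count < MIN_BOXES_PER_LABEL:
            problems.append(f"label {label!r} has only {count} bounding boxes")
    return problems, boxes_per_label

# "annotations.csv" is a placeholder file name.
problems, distribution = check_annotations("annotations.csv")
print("label distribution:", dict(distribution))  # also useful to spot imbalance
print("\n".join(problems) if problems else "no issues found")
```

Printing the label distribution alongside the checks also makes a large class imbalance easy to spot before you start training.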

To learn more, see the information on preparing your data.

Splitting your data

In machine learning, you usually divide your datasets into three separate subsets: a training dataset, a validation dataset, and a test dataset. A training dataset is used to build a model. The model tries multiple algorithms and parameters while searching for patterns in the training data. As the model identifies patterns, it uses the validation dataset to test the algorithms and patterns. The best performing algorithms and patterns are chosen from those identified during the training stage.

After the best performing algorithms and patterns have been identified, they are tested for error rate, quality, and accuracy using the test dataset. You should have a separate test dataset that you can use to test your model independently.

Both a validation and a test dataset are used in order to avoid bias in the model. During the validation stage, optimal model parameters are used, which can result in biased metrics. Using the test dataset to assess the quality of the model after the validation stage provides an unbiased assessment of the quality of the model.
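
The following minimal sketch illustrates the three-way split described above, using an illustrative 80/10/10 ratio on generic labeled samples. For video data, the split should happen at the video level, as explained in the list that follows.

```python
import random

# Minimal sketch of a train/validation/test split with an illustrative
# 80/10/10 ratio; `samples` stands in for your labeled examples.
def split_dataset(samples, train_frac=0.8, validation_frac=0.1, seed=42):
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]                             # used to fit the model
    validation = shuffled[n_train:n_train + n_validation]  # used to choose among candidates
    test = shuffled[n_train + n_validation:]               # held out for the final, unbiased check
    return train, validation, test

train, validation, test = split_dataset(range(1000))
print(len(train), len(validation), len(test))  # 800 100 100
```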

Use the following best practices when splitting your data:

  • We recommend that all datasets (also called dataset splits) represent the same population and contain similar videos with a similar distribution of labels.

    When you provide your data, AutoML Video Object Tracking can automatically split it into training, validation, and test datasets. You can also assign the split labels yourself. Note that when AutoML Video Object Tracking generates the training, validation, and test splits from the CSV file, it operates at the video level: all labels from one video can fall into only one of the three datasets. Having only a few videos is therefore not recommended if the videos differ from one another or if the label distribution is not the same across them. A video-level split sketch follows this list.

    For example, if you have only 3 videos containing thousands of annotated video segments, but some classes appear only in individual videos, the model might not train on some labels and can therefore miss them during prediction.

  • Avoid data leakage. Data leakage happens when the algorithm is able to use information during model training that it should not use and that will not be available during future predictions. This can lead to overly optimistic results on the training, validation, and test datasets, but the model might not perform as well on future unseen data.

    Some leakage examples include: a bias based upon camera viewing angle or light conditions (morning/evening); a bias towards videos that have commentators versus those that don't; a bias based upon videos with certain labels that come from specific regions, language groups, commentators, or that include the same logo.

    To avoid data leakage, do the following:

    • Have a well-diversified set of videos and video frame samples.
    • Review the videos to make sure there are no hidden hints (for example, videos with positive samples were taken during the afternoon, while videos with negative samples were taken during the morning).
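
The following minimal sketch shows one way to split annotation rows at the video level, assuming rows shaped as (video_uri, label, ...). It is only an illustration of grouping by video so that all annotations from a given video land in a single split, not the AutoML implementation.

```python
import random
from collections import defaultdict

# Minimal sketch of a video-level split. Grouping by video URI keeps every
# annotation from a given video in a single split, which avoids leaking the
# same video into both training and evaluation.
def split_by_video(rows, fractions=(0.8, 0.1, 0.1), seed=7):
    by_video = defaultdict(list)
    for row in rows:
        by_video[row[0]].append(row)        # row[0] is the video URI
    videos = sorted(by_video)
    random.Random(seed).shuffle(videos)
    n_train = int(len(videos) * fractions[0])
    n_validation = int(len(videos) * fractions[1])
    split_videos = {
        "train": videos[:n_train],
        "validation": videos[n_train:n_train + n_validation],
        "test": videos[n_train + n_validation:],
    }
    return {name: [row for v in vids for row in by_video[v]]
            for name, vids in split_videos.items()}

# 20 hypothetical videos with 10 annotation rows each.
rows = [(f"gs://my-bucket/video_{i % 20}.mp4", "car") for i in range(200)]
splits = split_by_video(rows)
print({name: len(split_rows) for name, split_rows in splits.items()})
```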

To learn more, see the information on preparing your data.

Example data sources

For examples of data sources, see the following publicly available video detection datasets:

  • YouTube-BB: Video Object Detection Dataset (5.6 million bounding boxes, 240,000 videos, 23 object types)
  • ImageNet-VID: Imagenet Video Object Detection Challenge (30 basic-level categories)

Training your model

The same data can be used to train different models and to generate different prediction types, depending on what you need. Also, training the same model on the same data might lead to slightly different results: neural network training involves randomized operations, so training with the same inputs is not guaranteed to produce exactly the same model, and the predictions might differ slightly.

To learn more, see the information on managing models.

Evaluating your model

After your model finishes training, you can evaluate its performance on the validation and test datasets, or on your own new data.

Common object tracking evaluation concepts and metrics include:

  • Intersection over union (IoU) measures the overlap between two bounding boxes, usually a ground-truth box and a predicted box. You can use it to measure how much your predicted bounding box overlaps with the ground truth (a small IoU sketch follows this list).
  • Area Under the Precision/Recall Curve (AuPRC), also known as average precision (AP), is the integral of precision over the range of recall values. It is best interpreted for binary problems.
  • Mean average precision (mAP or MAP) can be considered as the mean of average precision (AP) metrics over multiple classes or labels. Sometimes, mAP and AP are used interchangeably.
  • For binary and multi-class problems, you can also examine precision and recall at various confidence score thresholds independently.
  • If there are not too many multi-class labels, you can examine the confusion matrix, which shows which labels the model confused with one another.
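
The following minimal sketch computes IoU for two axis-aligned boxes given as (x_min, y_min, x_max, y_max); the boxes and coordinates are illustrative.

```python
# Minimal sketch of intersection over union (IoU) for two axis-aligned boxes
# given as (x_min, y_min, x_max, y_max) in the same coordinate system.
def iou(box_a, box_b):
    # Intersection rectangle (zero area if the boxes don't overlap).
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    intersection = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# A prediction shifted half a box width from the ground truth scores ~0.33.
print(iou((0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.5, 1.0)))
```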

Not all metrics can be used for every video object tracking problem. For example, the intuitive understanding of precision and recall from an object tracking problem becomes more ambiguous when there are multiple classes (multi-class problems) or multiple valid labels per sample (multi-label problems). Especially for the latter case, there is no single, widely adopted metric. However, you might consider the mean average precision metric for evaluation decisions.

However, keep in mind that evaluating model performance based on a single number or metric is an oversimplification. Consider looking at several metrics, as well as at the precision-recall curves for each object class.

Here are some other helpful tips when evaluating your model:

  • Be mindful of the distribution of labels in your training and testing datasets. If the datasets are imbalanced, high accuracy metrics might be misleading. Because by default every sample has the same weight during evaluation, a more frequent label has more influence. For example, if there are 10 times more positive labels than negative ones and the network simply assigns all samples to the positive label, you still achieve an accuracy of about 91%, but that doesn't mean the trained model is useful.

  • You can also analyze ground truth and prediction labels yourself, for example in a Python script using scikit-learn. There, you can look into different ways to weight the labels during evaluation. Common approaches include macro averaging (metrics are computed per class and then averaged), weighted averaging (metrics are computed per class and then averaged with weights based on the frequency of each class), and micro averaging (each sample has the same weight, independent of any potential imbalance). The sketch after this list shows these averaging modes on an imbalanced example.

  • Debugging a model is more about debugging the data than the model itself. If at any point your model starts acting in an unexpected manner as you're evaluating its performance before and after pushing to production, you should return and check your data to see where it might be improved.

  • Sometimes a test video has a few scene cuts, where a new scene appears with little contextual association to the previous one. For example, in a live soccer broadcast, the camera view switches from a top-down view to a side view. In such a scenario, it usually takes 2-3 frames for the model to catch up with the change.
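
The following minimal sketch, which assumes scikit-learn is installed and uses made-up labels, demonstrates the accuracy trap and the macro, weighted, and micro averaging modes described above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up, deliberately imbalanced labels: 10 times more positives than
# negatives, and a degenerate model that predicts "positive" for every sample.
y_true = [1] * 100 + [0] * 10
y_pred = [1] * 110

print("accuracy:", accuracy_score(y_true, y_pred))  # ~0.91 despite a useless model
print("macro precision:",
      precision_score(y_true, y_pred, average="macro", zero_division=0))
print("weighted recall:", recall_score(y_true, y_pred, average="weighted"))
print("micro recall:", recall_score(y_true, y_pred, average="micro"))
```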

To learn more, see the information on evaluating your model.

Testing your model

AutoML Video Object Tracking uses 20% of your data automatically—or, if you chose your data split yourself, whatever percentage you opted to use—to test the model.

Improving model performance

After you review your initial model performance, if you want to continue improving it, you can try a few different approaches:

  • Increase the number of labeled samples (especially for under-represented classes).
  • Label more meaningful frames:
    • Select frames to label in which multiple objects appear, rather than frames with a single object or no objects.
    • Select frames that contain more moving objects. This can provide more temporal information for the model to learn during training.
    • Make sure all the selected frames are fully labeled. For example, if training a vehicle detection model, you should label all of the vehicles that can be visually observed in a frame.
    • Do not select frames at the very beginning of a video. The algorithm can rewind to fetch earlier frames during training to capture motion context, and that information is lost if there are few or no frames before the selected ones.
  • Closely examine the areas where your model performs poorly:

    • Is a class too broad, so that it would make sense to split it into two or more classes?
    • Or are some classes too specific, so that they could be merged without affecting the final goal of the project?
    • Consider labeling more samples, particularly for the classes that perform relatively poorly.
  • Reduce data imbalance. Either add more samples or try reducing the number of samples of the high-frequency classes, particularly when the imbalance is large, for example 1-to-100 or more. A small downsampling sketch follows this list.

  • Check carefully and try to avoid any potential data leakage.

  • Drop less important classes to concentrate on fewer critical ones.

  • Review other options available to you on the Support page.
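
The following minimal sketch shows one way to downsample an over-represented class, as suggested in the imbalance item above. The (sample, label) shape and the cap value are illustrative; tune them to your actual label distribution.

```python
import random
from collections import defaultdict

# Minimal sketch of downsampling over-represented classes: caps every class
# at `max_per_class` randomly chosen samples.
def downsample(samples, max_per_class=1000, seed=0):
    by_label = defaultdict(list)
    for sample, label in samples:
        by_label[label].append((sample, label))
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        balanced.extend(items)
    rng.shuffle(balanced)
    return balanced

# Hypothetical 100-to-1 imbalance between two classes.
samples = [(i, "car") for i in range(10000)] + [(i, "bicycle") for i in range(100)]
print(len(downsample(samples, max_per_class=500)))  # 500 cars + 100 bicycles = 600
```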