Case study: Automating metadata extraction for video commercials

By Quantiphi, Inc.

This article outlines a webservice designed by Quantiphi to extract brand-specific information from proprietary video commercials. The solution combines the Cloud Vision API on Google Cloud with open source natural-language processing toolkits, and it allows internal applications to augment the manual metadata generation process by pairing computer vision with expert human judgment.

By combining image processing and fuzzy algorithms with expert human validation, the webservice augments manual metadata generation capabilities. This collaborative approach makes the overall process faster and more cost-efficient by increasing the number of extracted metadata fields and reducing inconsistencies in annotations. Moreover, because the algorithms learn from expert human feedback, they perform better over time.


Media penetrates every aspect of life through television and digital streaming platforms. Multinational corporations invest a large portion of their marketing budget to advertise their products and promote their brands at a global scale. Assessing the impact of television commercials based on parameters such as brand visibility and its corresponding revenue impact, for instance, requires careful human annotation of brand-specific information that appears in television broadcasts across multiple TV networks. For large corporations, this annotation is an expensive operation whose throughput scales linearly with the amount of human labor.

By automating a large part of the effort and combining it with human intelligence, Quantiphi's system lets users validate and extract contextually richer information. This automation helps organizations optimize labor costs and improve the efficiency of the overall metadata management process.

Solution architecture

The solution framework built by Quantiphi is a semi-automated two-step pipeline. In the first step, the Vision API and text processing capabilities are used to extract brand-specific information from video frames. This is followed by running a full-text search to retrieve the closest matching entry from a proprietary historical database of annotated video ads. For all low-confidence estimates, human annotators validate the information and feed it back into the database.

This collaborative, human-in-the-loop approach improves the efficiency of the process and reduces inconsistencies. Figure 1 illustrates this semi-automated approach:

Figure 1. Automated metadata extraction pipeline uses the Vision API and human input

The system consumes a video feed and feeds it into the pipeline. The extraction pipeline consists of the following major components:

  1. Frame clipping: Relevant frames from the video are identified and clipped.
  2. Extraction algorithms:
     - Logo detection: The Vision API infers the brand name from the logo.
     - OCR: The Vision API makes the text within images searchable.
  3. Full-text search engine: The closest matching ads are looked up and retrieved.
  4. Natural language processing (NLP) heuristics: Brand-specific metadata fields are ranked and classified based on similarity.

Low-confidence matches are flagged for manual review before the validated annotations are re-ingested to continue fine-tuning the extraction algorithms.

The following sections provide more detail about these components.

Frame clipping

Video commercials are structured to convey the maximum amount of information in a limited duration of time. As illustrated in Figure 2, frames toward the end of video commercials have a higher probability of containing brand-specific information.

Figure 2. A video ad visualized as a temporal sequence of frames. Frames towards the end have a higher probability of containing brand-specific information

Often, in those clipped frames, brands can be identified by their tagline, logo, or name. Quantiphi uses a pre-identified sampling strategy to clip informative frames from the end of video commercials.
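As a rough illustration, the end-weighted sampling described above could be sketched as follows. The frame rate, clip window, and sampling stride here are hypothetical parameters for illustration, not Quantiphi's actual values:

```python
def clip_end_frames(total_frames, fps=30, window_seconds=5, stride=10):
    """Return indices of frames sampled from the last `window_seconds`
    of a video, stepping every `stride` frames.

    Frames near the end of a commercial are most likely to show the
    brand logo, tagline, or name, so only that window is sampled.
    """
    window = min(window_seconds * fps, total_frames)
    start = total_frames - window
    return list(range(start, total_frames, stride))

# Example: a 30-second ad at 30 fps -> sample from the final 5 seconds.
indices = clip_end_frames(total_frames=900)
```

A production strategy would likely also consider scene changes and frame quality, but the principle of biasing sampling toward the end of the ad is the same.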

Metadata extraction and natural language processing

Relevant frames are clipped and passed to the MetaExtract webservice, which serves predictions. The algorithms combine the logo recognition and OCR features of the Vision API. The results are aggregated into a query, and a full-text search is performed over the historical metadata database. Finally, the retrieved fields go through an NLP engine to return high-confidence predictions for the brand name, product name, and brand tagline.
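The final ranking step can be approximated with a simple fuzzy string-similarity heuristic. The sketch below uses Python's standard-library difflib as a stand-in for the proprietary NLP engine; the query text and candidate values are illustrative:

```python
from difflib import SequenceMatcher

def rank_candidates(query, candidates):
    """Rank candidate metadata values by fuzzy similarity to the
    aggregated logo/OCR text extracted from the video frames."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)

# Hypothetical candidates retrieved from the historical ads database.
matches = rank_candidates("acme motor co",
                          ["Acme Motors", "Apex Mortgage", "Acme Motor Company"])
```

The highest-scoring candidate would be surfaced as the prediction, with low-scoring results routed to human annotators for validation.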

Figure 3 shows a Python code snippet that includes a sample JSON request sent to the Vision API to invoke logo recognition and OCR over video frames.

Figure 3. Code snippet showing call to the Vision API
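In outline, a request like the one in Figure 3 can be built as shown below. This sketch only constructs the JSON payload for the Vision API's images:annotate endpoint with the LOGO_DETECTION and TEXT_DETECTION features; authentication and the HTTP call are omitted, and the frame bytes are placeholders:

```python
import base64
import json

def build_vision_request(frame_bytes, max_results=5):
    """Build an images:annotate request asking the Vision API for
    logo detection and OCR (text detection) on a single video frame."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(frame_bytes).decode("utf-8")},
            "features": [
                {"type": "LOGO_DETECTION", "maxResults": max_results},
                {"type": "TEXT_DETECTION", "maxResults": max_results},
            ],
        }]
    }

# Placeholder bytes stand in for an actual clipped video frame.
payload = json.dumps(build_vision_request(b"<frame bytes>"))
```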

Predictions prefill the tagging tool used by human annotators, allowing them to focus on mining novel metadata fields rather than repeating work. The updated output is then fed back into the primary database. Instances where human annotators override the decisions of the extraction algorithms are automatically logged in the backend and updated in Elasticsearch. This feedback loop incorporates human expertise and helps the system generalize to newer advertisements.
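A minimal sketch of the override logging might look like the following. The document fields are assumptions about what such a record could contain, and the Elasticsearch indexing call itself is replaced with a plain document builder:

```python
from datetime import datetime, timezone

def build_override_doc(ad_id, field, predicted, corrected, annotator):
    """Build the record logged when a human annotator overrides a
    predicted metadata field; indexing it (e.g., in Elasticsearch)
    lets the search and ranking layers learn from the correction."""
    return {
        "ad_id": ad_id,
        "field": field,
        "predicted_value": predicted,
        "corrected_value": corrected,
        "annotator": annotator,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical override: the algorithm predicted "Acme", the
# annotator corrected it to "Acme Motors".
doc = build_override_doc("ad-001", "brand_name", "Acme", "Acme Motors", "annotator-7")
```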


Quantiphi built the system as a microservice with a Flask frontend. It exposes a REST endpoint that other internal applications call to leverage its AI capabilities. At ingestion time, frames from each video are sampled and passed to the microservice, and the metadata predictions are used to prepopulate the tagging tool.

What's next