SCMP: Leverages AI through Project Dali to classify millions of images

About South China Morning Post

Founded in 1903, South China Morning Post is Hong Kong's journal of record and has teams in Asia and the United States. The organization develops news content 24 hours a day, 7 days a week, and aims to "lead the global conversation about China".

Industries: Media & Entertainment
Location: Hong Kong

With Google Cloud solutions, South China Morning Post tags about four million images to improve internal search capabilities, elevate search engine rankings, and gain traction across content and syndication partners.

Google Cloud Results

  • Improves search engine rankings and gains traction across content and syndication partners
  • Eliminates laborious manual tagging of images
  • Streamlines location and identification of images for news teams

Classifies millions of images for easy access

Founded in 1903, South China Morning Post publishes its eponymous newspaper, which serves as Hong Kong's journal of record, with a Monday-through-Saturday circulation of more than 100,000 and a Sunday edition circulation of about 80,000. The business runs teams in Asia and the United States and develops news content 24 hours a day, 7 days a week. South China Morning Post aims to "lead the global conversation about China".

South China Morning Post also publishes magazines such as Cosmopolitan, ELLE, Esquire, and Harper's BAZAAR, and operates the jobs website cpjobs.com.

Extensive media disruption

New technologies and digital business models are disrupting the media industry worldwide, and South China Morning Post is no exception. The organization is using new technologies to create and deliver content to readers 24 hours a day, 7 days a week via its scmp.com website, smartphone and tablet applications, and social media and messaging platforms.

Over its 115-year history, South China Morning Post has built a repository of millions of photographs taken by its own photographers and news wire services. This repository grows by hundreds of new photographs weekly. All these images – most of which are untagged – reside in a content management system with only limited search and discovery features. "Our content resource team has to make a tremendous effort to tag images manually," explains Korey Lee, Head of Data, South China Morning Post. "At the moment, it is very inefficient for our editors to search for the people and scenery represented on images."

Tagging all images manually would be prohibitively expensive, prone to human error and subjective decision-making, and would require thousands of hours of employee time. The exercise would also have to continue indefinitely as photographers added new images.

AI opens up new opportunities

Fortunately, artificial intelligence and machine learning opened up opportunities to automate image tagging and classification workflows at scale. The business set up an initiative called Project Dali and reviewed image recognition approaches to determine whether a solution developed in-house or a vendor product would best meet its needs.

"Understanding image recognition and careful project scoping was key to avoiding pitfalls," explains Lee. "We had to make sure users needed the features we were building and, most importantly, create a tool that could scale and evolve with our business."

The business reviewed which functions were most affected by the lack of image tags and other metadata, enabling it to pinpoint the teams and individuals who could provide user requirements.

"Our newsroom had difficulty retrieving relevant images when searching through our content management system and problems recognizing faces, places, and other entities on some images," says Lee. "Furthermore, our content resources team could not feasibly tag all the images it processed on a daily basis at the level of detail required. Finally our marketing team believed discoverability of images with strong tagging could help search engine optimization. Our images could rise in search rankings, and gain traction across other content and syndication partners."

Facial recognition and object detection

Following discussions with its newsroom and a comprehensive review of its image archives, South China Morning Post elected to focus initially on facial recognition and object detection. The Project Dali team opted for open source tools and in-house development to gain a proper understanding of both areas.

"This choice empowered us by giving us full control and autonomy over the design of the solution and the training of the model," says Lee. "Our initial goal was to train a facial recognition model that could identify faces from a custom dataset of about 100 people."

The data team at South China Morning Post reviewed several open source deep learning libraries and opted for TensorFlow, given the many examples and resources available that addressed its use cases. The team built a prototype, based on the FaceNet facial recognition model, capable of recognizing hundreds of celebrities, heads of state, and persons of interest. Because deep learning models need quality data to accurately predict the identity of a photographed person, the organization needed at least 25 images for each face and preferably 100. The data team then drew on an online library of images and filtered out the poorer-quality or irrelevant images itself.
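
The per-face thresholds described above could be checked with a short script before training. The following is a minimal sketch, not the Project Dali code: the function name `audit_dataset`, the category names, and the idea of passing one label per cleaned image are all assumptions made for illustration. Only the figures 25 and 100 come from the article.

```python
from collections import Counter

MIN_IMAGES_PER_FACE = 25         # minimum stated in the article
PREFERRED_IMAGES_PER_FACE = 100  # preferred figure stated in the article


def audit_dataset(labels):
    """Group identities by how many usable images each one has.

    `labels` holds one entry per image that survived manual filtering,
    e.g. ["person_a", "person_a", "person_b", ...] (hypothetical names).
    """
    counts = Counter(labels)
    report = {"ready": [], "usable": [], "insufficient": []}
    for name, n in sorted(counts.items()):
        if n >= PREFERRED_IMAGES_PER_FACE:
            report["ready"].append(name)         # meets the preferred count
        elif n >= MIN_IMAGES_PER_FACE:
            report["usable"].append(name)        # meets only the minimum
        else:
            report["insufficient"].append(name)  # needs more images
    return report
```

Identities that land in the "insufficient" group would need more source images collected before they are included in training.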

Why was Google Cloud Platform the best fit?

The South China Morning Post needed to deploy a cloud-based infrastructure to run the prototype and quickly determined Google Cloud Platform – particularly Google Kubernetes Engine – best met its requirements. "Our data team is already using BigQuery for analytics, so Google Cloud Platform was a natural fit for our project," says Lee. "In particular, Google Kubernetes Engine gave us the flexibility to deliver prototypes quickly and efficiently, so it was very helpful in allowing us to bring together all the pieces of our solution."

The business is now using Google Kubernetes Engine, Container Registry, Cloud Storage, and Cloud SQL to run the prototype. The publisher is running a face recognition model and is building additional models for places, logos, and optical character recognition.

The data team also uses a Google Kubernetes Engine cluster with a pool of GPU (graphics processing unit) virtual machines to train the model and serve it via a REST API.

Over the last quarter of 2018, South China Morning Post finalized the prototype and started training the face model on 165 faces from a Google Images dataset, while also deploying a user interface for uploading images and testing predictions.

"We plan to build on our success by creating a database on Cloud SQL to store the tags, as well as workflows and user interfaces that enable people to correct inaccurate predictions," says Lee. "Another important piece of work relies upon the ability to detect unknown faces, cluster the images showing the same unknown face together, and notify the team to connect to the user interface to input this new face name in the database."

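The planned unknown-face workflow can be illustrated with face embeddings, the fixed-length vectors that FaceNet-style models produce for each face. This is a rough sketch under stated assumptions rather than the team's design: the function names, the Euclidean distance threshold, and the greedy centroid clustering are all hypothetical choices for illustration, and a production system would likely use a dedicated clustering library.

```python
import math


def nearest_known(embedding, known, threshold):
    """Return the closest known name, or None if no known face is near enough.

    `known` maps a name to a reference embedding (hypothetical structure).
    """
    best_name, best_dist = None, float("inf")
    for name, reference in known.items():
        d = math.dist(embedding, reference)  # Euclidean distance
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None


def cluster_unknown(embeddings, threshold):
    """Greedily group unknown embeddings that lie close to each other.

    Returns a list of clusters, each a list of indices into `embeddings`.
    """
    clusters, centroids = [], []
    for i, emb in enumerate(embeddings):
        for c, centroid in enumerate(centroids):
            if math.dist(emb, centroid) <= threshold:
                clusters[c].append(i)
                n = len(clusters[c])
                # Update the running mean of the cluster.
                centroids[c] = [(x * (n - 1) + y) / n for x, y in zip(centroid, emb)]
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([i])
            centroids.append(list(emb))
    return clusters
```

Each resulting cluster would correspond to one candidate unknown person, which a team member could then name through the user interface.
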
Project Dali is seen within South China Morning Post as an example of technology doing what it should – making human activities less tedious and more effective. Thanks to Google Kubernetes Engine, Container Registry, Cloud Storage and Cloud SQL, South China Morning Post has gained additional time to focus on refining stories and delivering even higher quality content for readers.

While Project Dali is still in its initial stages, South China Morning Post is confident it will enhance efficiency in its newsroom. As the brand continues its transformation from Hong Kong's legacy newspaper of record to a digital-first global news leader, Project Dali and similar efforts facilitate the use of data to streamline processes and enable its teams to better lead the global conversation on China.
