Picture what the cloud can do: How the New York Times is using Google Cloud to find untold stories in millions of archived photos
Technical Director, Cloud Office of the CTO
Google Cloud has teamed up with The New York Times to help them digitize their vast photo collection. It’s making use of numerous tools within Google Cloud Platform that allow them to securely store their images, provide them with a better interface for finding photos, and find new insights even from the data locked on the backs of images.
For over 100 years, The Times has archived approximately five to seven million of its old photos in hundreds of file cabinets three stories below street level near their Times Square offices in a location called the “morgue.” Many of the photos have been stored in folders and not seen in years. Although a card catalog provides an overview of the archive’s contents, there are many details in the photos that are not captured in an indexed form.
Photo by Earl Wilson for The New York Times
Preserving visual history
The morgue contains photos from as far back as the late 19th century, and many of its contents have tremendous historical value—some that are not stored anywhere else in the world. In 2015, a broken pipe flooded the archival library, putting the entire collection at risk. Luckily, only minor damage was done, but the event raised the question: How can some of the company’s most precious physical assets be safely stored?
“The morgue is a treasure trove of perishable documents that are a priceless chronicle of not just The Times’s history, but of nearly more than a century of global events that have shaped our modern world,” said Nick Rockwell, chief technology officer, The New York Times.
It’s not only the photos’ imagery that contains valuable information. In many cases the back of the photos include the time when and the place where the photo was taken. Adds Rockwell: “Staff members across the photo department and on the business side have been exploring possible avenues for digitizing the morgue’s photos for years. But as recently as last year, the idea of a digitized archive still seemed out of reach.”
To preserve this priceless history, and to give The Times the ability enhance its reporting with even more visual storytelling and historical context, The Times is digitizing its archive, using Cloud Storage to store high-resolution scans of all of the images in the morgue.
Cloud Storage is our durable system for storing objects, and it provides customers like The Times with automatic life-cycle management, storage in geographically distinct regions, and an easy-to-use management interface and API.
Creating an asset management system
Simply storing high-resolution images is not enough to create a system that photo editors can easily use. A working asset management system must allow the users to be able to browse and search for photos easily. The Times built a processing pipeline that stores and processes the photos and will use cloud technology to process and recognize text, handwriting and other details that can be found in the images.
Here’s how it works. Once an image is ingested into Cloud Storage, The Times uses Cloud Pub/Sub to kick off the processing pipeline to accomplish several tasks. Images are resized through services running on Google Kubernetes Engine (GKE) and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.
Cloud Pub/Sub helped The New York Times create its processing pipeline without having to build complex APIs or business process systems. It’s a fully-managed solution, so there’s no time spent maintaining the underlying infrastructure.
In order to resize the images and modify image metadata, The Times uses “ImageMagick” and “ExifTool”, open-source command-line programs. They added ImageMagick and exiftool wrapped with Go services to Docker images in order to run them on GKE in a horizontally-scalable manner with minimal administrative effort. Adding more capacity to process more images is trivial, and The Times can stop or start its Kubernetes cluster when the service is not needed. The images are also stored in Cloud Storage multi-region buckets for availability in multiple locations.
The final piece of the archive is tracking both images and their metadata as they move through The Times’s systems. Cloud SQL is a great choice. For their developers, Cloud SQL provides a standard PostgreSQL instance—as a fully managed service, removing the need to install new versions, apply security patches, or set up complex replication configurations. Cloud SQL provides a simple and easy way for engineers to use a standard SQL solution.
Machine learning for additional insights
Storing the images is only one half of the story. To make an archive like The Times’ morgue even more accessible and useful, it’s beneficial to leverage additional GCP features. In the case of The Times, one of the bigger challenges in scanning their photo archive has been adding data regarding the contents of the images. The Cloud Vision API can help fill that gap.
Let’s take a look at this photo of the old Penn Station from The Times as an example. Here, we are showing you the front and the back of the photo:
It’s a beautiful black and white photo, but without additional context, it’s not clear from the front of the photo what it contains. The back of the photos contains a wealth of useful information, and the Cloud Vision API can help us process, store, and read it. When we submit the back of the image to the API with no additional processing, we can see that the Cloud Vision API detects the following text:
NOV 27 1985
JUL 28 1992
Clock hanging above an entrance to the main concourse of Pennsylvania Station in 1942, and, right, exterior of the station before it was demolished in 1963.
PUBLISHED IN NYC
RESORT APR 30 ‘72
The New York Time THE WAY IT WAS - Crowded Penn Station in 1942, an era “when only the brave flew - to Washington, Miami and assorted way stations.”
Penn Station’s Good Old Days | A Buff’s Journey into Nostalgia
( OCT 3194
PHOTOGRAPH BY The New York Times Crowds, top, streaming into the old Pennsylvania Station in New Yorker collegamalan for City in 1942. The former glowegoyercaptouwd a powstation at what is now the General Postadigesikha designay the firm of Hellmuth, Obata & Kassalariare accepted and financed.
Pub NYT Sun 5/2/93 Metro
THURSDAY EARLY RUN o cos x ET RESORT
EB 11 1988
RECEIVED DEC 25 1942 + ART DEPT. FILES
The New York Times Business at rail terminals is reflected in the hotels
OUTWARD BOUND FOR THE CHRISTMAS HOLIDAYS The scene in Pennsylvania Station yesterday afternoor afternoothe New York Times (Greenhaus)
This is the actual output from our Cloud Vision API with no additional preprocessing of the image. Of course, the digital text transcription isn’t perfect, but it’s faster and more cost effective than alternatives for processing millions of images.
Bringing the past into the future
This is only the start of what’s possible for companies with physical archives. They can use the Vision API to identify objects, places and images. For example, if we pass the black and white photo above through the Cloud Vision API with Logo Detection, we can see that Pennsylvania Station is recognized. Furthermore, AutoML can be used to better identify images in collections using a corpus of already captioned images.
The Cloud Natural Language API could be used to add additional semantic information to recognized text. For example, if we pass the text “The New York Time THE WAY IT WAS - Crowded Penn Station in 1942, an era when only the brave flew - to Washington, Miami and assorted way stations.” through the Cloud Natural Language API, it correctly identifies “Penn Station,” “Washington,” and “Miami” as locations, and classifies the entire sentence into the category “travel” and the subcategory “bus & rail.”
Helping The New York Times transform its photo archive fits perfectly with Google’s mission to organize the world’s information and make it universally accessible and useful. We hope that by sharing what we did, we can inspire more organizations—not just publishers—to look to the cloud, and tools like Cloud Vision API, Cloud Storage, Cloud Pub/Sub, and Cloud SQL, to preserve and share their rich history.
Visit our site to read more about AI and machine learning on Google Cloud.