Folha celebrates its 100th anniversary by indexing 2.5 million historic photos in the cloud
About Grupo Folha
Grupo Folha is one of the largest media conglomerates in Brazil. It controls Folha de S. Paulo’s newspaper and website, the Datafolha research institute, Folhapress news agency, CTG-F printing plant, logistics and distribution companies Transfolha and SPDL, and the printing company FolhaGráfica.
About Assetway
Assetway offers a platform using Google Cloud’s AI technologies to manage digital assets in the cloud.
In partnership with the Google News Initiative and Assetway, the project made it easier to search for images in a 100-year-old archive, and increased the safety of files now stored in Google Cloud.
Results
- 2.5 million images, 26,000 cartoons and 350 million words indexed
- The project evolved from 6,000 pictures processed per month to 200,000 per day
- Easy access to the archive, optimizing time, enhancing stories, and improving team productivity
- Potential revenue increase through the commercial use of images that were previously hard to access
- Safer option for preserving journalistic photographs that are a century old
- Migrated 10 TB of data files
If journalism helps build a portrait of history, Folha de S. Paulo has played a leading role in recording the events of the last 100 years in Brazil. Founded in 1921, it is one of the largest newspapers in the country, with a rich archive of materials, ranging from everyday stories from São Paulo to events that disrupted Brazil and the world. Its photographic archive alone has more than 27 million photographs.
That physical material, looked after by the company’s Database team, took up an entire floor in one of Grupo Folha's buildings. After a series of restructuring processes in the department in 2010, the team realized new measures would be needed to keep these archives safe. They also contained pictures from the now-defunct newspapers Última Hora and Notícias Populares. After all, these materials document every event the newspapers had reported since they were established: in other words, a 100-year history in images.
A 15-strong team working in two shifts began digitizing part of that archive to Grupo Folha’s on-premises server. The team set priorities for the 100,000 file folders based on the needs of the editorial team and Folhapress, the group’s news agency. It was four years of hard work, treating images, digitizing them in high-speed scanners, renaming files, and reorganizing physical material.
The task was made even more onerous by each image’s peculiarities, as there were pictures in various formats and notes written on the back that also needed to be digitized. “It was a real assembly line. We began by cleaning a picture and, by the end of the day, it was already in its file box ready to be stored,” says Jair dos Santos, who coordinated Folha’s Database Digitization Project.
They processed an average of 6,000 images per day, totaling 2.5 million photographs and 26,000 cartoons in 10 TB of information. As digitization progressed, however, new issues emerged.
Besides extending processing times, the on-premises server did not allow file indexing. Searching for an image could take hours, not just because of the slow loading times for high-resolution pictures, but also because it was often necessary to search in several dozen folders, which did not always suit the editorial staff’s urgent deadlines.
Case in point: searching for pictures of carnival director Joãosinho Trinta. There was a specific folder with portraits of this artist, but also more options in the folders of every samba school where he had worked in Rio de Janeiro. Without a sorting and searching mechanism using keywords, some of the stories ended up not being told as finding specific photos was too hard.
“A lot of stories did not get published or the staff gave up and didn’t print a picture because they couldn’t find it or, if they found the folder, it took too long to find the image. We had this two-million-plus photo behemoth but couldn’t use it, because it had not been indexed,” the coordinator explains.
But the greatest challenge was the massive volume of files amassed throughout Folha’s history. Even with the digitization team’s steady work rate, they estimated it would take 35 years to digitize the entire archive. “It was so overwhelming. Even with a bigger budget, it would have taken years and years to complete,” says Juliana Laurino, Administrative Manager of the editorial staff and General Manager of Folhapress and the database.
A new process for automatic indexation
Everything changed in 2020 through a partnership with the Google News Initiative (GNI), a global program that helps foster innovation and digital sustainability in news journalism. Through the GNI, Google creates products and partnerships, offers training, and designs programs to help news outlets develop their business in the digital channel. “We work alongside partners to identify the most important challenges currently facing journalism and seek to solve them using technology,” says Erica Noda, Manager of Google Brazil’s partnerships team.
Inspired by previous joint initiatives between Grupo Folha and Google and by the program’s case study with The New York Times, the Database team created a partnership project and successfully pitched it to the GNI office in Brazil. The GNI team embraced the idea and brought in another key partner: Assetway, a company offering a platform based on Google Cloud technologies for managing digital assets in the cloud.
The synergy between Folha’s Database Digitization Project and the GNI’s role in the region was clear from the start. “In Folha’s case, we heard of their editorial staff’s difficulties in using their archive to find pictures and check facts, when their work is dynamic and demands a fast turnaround. Technology makes access easier and more democratic, and saves data more safely,” says Erica.
“We started talking with Google when our spirits were really low and our projections had been scrapped. Discovering it could be done and that all the material would finally be indexed and available to the editorial staff was very encouraging.”
– Juliana Laurino, Administrative Manager of the Editorial Staff, General Manager of Folhapress and Grupo Folha’s database
The project consisted of deploying the Assetway Media Center platform at Grupo Folha to migrate digitized images to a cloud environment and, most importantly, to make it possible to sort and search images quickly and accurately.
The new platform deployment process took around a year, most of which was spent by Assetway’s team analyzing the archive and adapting the system. “We didn’t just build a system and deliver it. The kernel of this project was a continuous evolution process, and we really appreciate user feedback. We consulted with Jair often and held meetings with a few key users from Folha’s departments to gather their impressions on the system,” explains Thiago Souza, Product Manager at Assetway.
Before starting the migration, manual adjustments were needed so the indexation mechanism could recognize the information in each file more accurately, since there were inaccuracies and divergences in the names and texts written on the back of the pictures. This task, dubbed hygienization, was carried out by Folha’s Database team, which created a standardization process and implemented a taxonomic structure to facilitate automatic indexation.
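The kind of name standardization described above can be sketched in a few lines of Python. This is only an illustration, not Folha's actual process: the variant table and the function names `normalize_key` and `standardize` are hypothetical, and the real taxonomy is not described in this article.

```python
import unicodedata

# Hypothetical table mapping spelling variants to one canonical label.
CANONICAL_NAMES = {
    "joaosinho trinta": "Joãosinho Trinta",
    "joaozinho trinta": "Joãosinho Trinta",
}


def normalize_key(raw):
    """Lower-case the text, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def standardize(raw, canonical=CANONICAL_NAMES):
    """Return the canonical label, or the whitespace-trimmed input if unknown."""
    return canonical.get(normalize_key(raw), " ".join(raw.split()))
```

Comparing on an accent-free, lower-cased key means that differently typed or handwritten variants of one name resolve to the same label before indexing, while unknown names pass through unchanged for later review.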
The files were then migrated to Assetway Media Center. Built entirely on Google Cloud infrastructure, the platform follows a microservices model and runs on Google Kubernetes Engine. Files are stored in Cloud Storage, then imported and processed through Pub/Sub to make them searchable. Pub/Sub also organizes the task sequence for every file. All resources are monitored 24/7 using Cloud Monitoring and Cloud Logging. The result is an architecture well suited to a large, complex project.
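To make the idea of message-driven task sequencing concrete, here is a minimal in-memory sketch. It does not use Pub/Sub itself: a `deque` stands in for a topic, and the stage names in `STAGES` are assumptions, since the article only says that files are imported, processed, and made searchable.

```python
from collections import deque

# Hypothetical stage names, run in this order for every file.
STAGES = ["import", "extract_metadata", "index"]


def run_pipeline(file_names, handlers):
    """Push each file through STAGES in order using a message queue.

    Each message names a file and the stage to run next. When a stage
    finishes, a follow-up message for the next stage is published.
    """
    queue = deque({"file": name, "stage": STAGES[0]} for name in file_names)
    history = {name: [] for name in file_names}
    while queue:
        message = queue.popleft()
        stage = message["stage"]
        handlers[stage](message["file"])  # do the work for this stage
        history[message["file"]].append(stage)
        next_position = STAGES.index(stage) + 1
        if next_position < len(STAGES):
            queue.append({"file": message["file"], "stage": STAGES[next_position]})
    return history
```

In a real deployment each stage would typically be a separate service subscribed to its own topic, so that slow steps can be scaled independently. The sketch only shows how chained messages keep the per-file order of tasks.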
Accurate search using AI
Another important component of the architecture is the Cloud Vision API, which handles the system’s AI processing. This technology is essential for the platform to deliver quick and smart search functionality, as it can detect text and objects in the images and create metadata relevant for indexing. In the particular case of Folha’s archive, recognizing handwritten text on the back of the pictures using OCR (optical character recognition) was a must.
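As a rough sketch of how detected text and objects might become search metadata, the Python below calls the Cloud Vision client library and folds the results into a keyword list. It is not Assetway's implementation: the function names and the metadata layout are assumptions, and `annotate_with_vision` needs the `google-cloud-vision` package and valid credentials, so it is not exercised here.

```python
import re


def build_metadata(ocr_text, labels):
    """Turn OCR text and object labels into lower-cased search keywords."""
    words = re.findall(r"\w+", ocr_text.lower())
    keywords = sorted(set(words) | {label.lower() for label in labels})
    return {"ocr_text": ocr_text.strip(), "keywords": keywords}


def annotate_with_vision(image_bytes):
    """Run OCR and label detection on one image, then build its metadata."""
    # Imported lazily so build_metadata can be used without the library.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=image_bytes)
    # Document text detection is the OCR mode intended for dense or handwritten text.
    text_response = client.document_text_detection(image=image)
    label_response = client.label_detection(image=image)
    labels = [item.description for item in label_response.label_annotations]
    return build_metadata(text_response.full_text_annotation.text, labels)
```

Keeping the metadata step separate from the API call makes it possible to check the keyword logic offline, for example with `build_metadata("Carnaval no Rio, 1985", ["Crowd"])`.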
“Our choice of Google Cloud has a lot to do with accuracy, because there are a lot of AI models that are only good for English. When we want to detect handwritten text in Portuguese using OCR, Google’s solutions are the most cutting-edge. They recognize details very well.”
– Thiago Souza, Product Manager, Assetway
By the end of the process, 350 million image-related words had been indexed, making searches far easier and more accurate. Using cloud infrastructure to store pictures also streamlined processing, which went from 6,000 files per month to over 200,000 per day. Images that were previously lost in the mass of folders can now be found in seconds.
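One common way to make indexed words searchable in seconds is an inverted index, which maps each word to the images whose text contains it. The sketch below is a generic illustration of that technique, not a description of how Assetway Media Center works internally. The image ids and captions in the usage note are made up.

```python
from collections import defaultdict


def build_index(captions):
    """Map each lower-cased word to the set of image ids whose text contains it."""
    index = defaultdict(set)
    for image_id, text in captions.items():
        for word in text.lower().split():
            index[word].add(image_id)
    return index


def search(index, query):
    """Return the ids of images whose text contains every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so intersecting does not modify the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

With captions such as `{"portraits/001": "Joãosinho Trinta retrato", "samba/417": "Joãosinho Trinta desfile"}`, a single query for `"joãosinho trinta"` returns both images regardless of which folder they came from, which is the behavior the folder-based archive could not offer.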
The new system was released gradually. First, it was tested with some editorial staff members who worked with images and made an initial assessment. Next, as adjustments were made, it was released to more users. For Grupo Folha, the platform increased its journalists’ productivity and enhanced their stories, since they can now locate a large array of pictures quickly. Folhapress will also be able to increase revenue by exploiting these materials commercially.
A case in point was the special content created for Folha’s centennial, which included articles and a book collection with historic pictures published in the newspaper. Creating this content from late 2020 to early 2021 for the paper’s centennial in February involved an extensive use of image searches on the platform.
“Many images printed in the articles and the centennial’s book collection wouldn’t be there if it wasn’t for Assetway’s program. It would have taken us too long to find certain things, because one subject could appear in up to ten different folders. On the platform, all we have to do is type.”
– Jair dos Santos, Coordinator, Database Digitization Project, Folha
Safety and security for a 100-year-old archive
Safety concerns for physical archives are widespread among news outlets. Grupo Folha nearly lost a part of its negatives when a slab in the room where they were stored fell during a period of heavy rainfall in 2016. Frequent handling of such old material itself risks damaging it. Keeping digitized versions in the cloud contributes to the content’s safety and prevents its loss, even in the case of accidents.
The risk of deletion of digital files by accident or due to on-premises server damage is also reduced, as they are stored in Google Cloud, with various protection and data encryption resources. Assetway Media Center also provides a detailed permission level to set up various access restrictions for users.
“We heard stories from other companies that had lost most of their archives to fires, flooding, and sometimes they literally lost files, i.e., they were in a folder and then they disappeared. That doesn’t happen in a highly professional storage environment such as Google Cloud’s,” explains Assetway’s Product Manager.
With 2.5 million images protected, Grupo Folha now plans to expand the project to the rest of the archive and then digitize and index the full collection of print editions, page by page. It is an excellent way to celebrate Folha’s first 100 years, and it may become a beacon for the entire Brazilian print news market.
“I’m sure our example will really help to preserve these stories, which are not just corporate stories. It’s not the story of Folha, or newspapers, it’s the story of Brazil, of society,” Juliana concludes.