The evolution of data architecture at The New York Times
Ed Podojil
Executive Director, Data Products, The New York Times
Like virtually every business across the globe, The New York Times had to adapt quickly to the challenges of the coronavirus pandemic last year. Fortunately, our data platform on Google Cloud positioned us to operate efficiently in the new normal.
How we use data
We run an end-to-end data platform. On one side, we work closely with our product teams to collect the data they're interested in, such as which articles people are reading and how long they're staying on the site. We regularly measure our audience to understand our user segments and how they arrive on the site or use our apps. We then provide that data to analysts for end-to-end analytics.
On the other side, the newsroom is also focused on our audience, so we build tools to help them understand how Google Search or different social promotions play a role in a person's decision to read The New York Times, and to get a better sense of readers' behavior on our pages. With this data, the newsroom can make decisions about what to display on our homepage or send in push notifications.
Ultimately, we’re interested in behavioral analytics—how people engage with our site and our apps. We want to understand different behavioral patterns, and which factors or features will encourage users to register and subscribe with us.
We also use data to create or curate personalization preferences, to ensure we're delivering fresh content to our users, or content that they may not normally have read. Likewise, our data feeds our targeting system, so that we can send the right messaging about our various subscription packages to the right users.
Choosing to migrate to Google Cloud
When I came to The New York Times over five years ago, our data architecture was not working for us. Our infrastructure was gathering data that was increasingly hard for analysts to crunch on a daily basis, and we kept hitting snags with how that data streamed into our environment. Back then, we'd run a query and go grab some coffee, hoping that the query would finish, and return the right data, by the time we came back to our desks. Sometimes it would, sometimes it wouldn't.
We realized that Hadoop was definitely not going to be the on-premises solution for us, and that's when we started talking with the Google Cloud team. We began our digital transformation with a migration to BigQuery, Google's fully managed, serverless data warehouse. We were under a pretty aggressive migration timeline, focusing first on moving over analytics. We made sure our analysts got a top-of-the-line system that treated them the way they would want to treat the data.
One prominent requirement in our choice of data architecture was enabling analysts to work as quickly as they needed to deliver high-quality work for their business partners. For our analysts, the transition to BigQuery was night and day. I still remember when my manager ran his very first query on BigQuery and was ready to go grab his coffee, but the query finished by the time he got up from his chair. Our analysts talk about that to this day.
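To give a concrete flavor of that workflow, here is a minimal sketch of an ad hoc query run through the BigQuery Python client. The project, dataset, table, and column names are hypothetical placeholders, not our actual schema.

```python
# A minimal sketch of an ad hoc analyst query using the BigQuery Python
# client. Project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # picks up application-default credentials

query = """
    SELECT page_section, COUNT(DISTINCT user_id) AS readers
    FROM `my-project.analytics.page_events`  -- hypothetical table
    WHERE event_date = CURRENT_DATE()
    GROUP BY page_section
    ORDER BY readers DESC
"""

# client.query() starts the job; result() blocks until it finishes.
for row in client.query(query).result():
    print(f"{row.page_section}: {row.readers} readers")
```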
While we were doing the BigQuery transition, we did have concerns about our other systems not scaling correctly. Two years ago, we weren't sure we'd be able to scale up to the audience we expected on that election day. We band-aided a solution back then, but we knew we only had two more years to figure out a real, dependable one.
During that time, we moved our streaming pipeline over to Google Cloud, primarily using App Engine, which gave us a flexible environment that could scale quickly as requirements changed. Dataflow and Pub/Sub also played significant roles in managing the data. In Q4 of 2020 we recorded our most significant traffic ever, with 273 million global readers, and four straight days of higher traffic than any previous election week. We were proud to see that there was no data loss.
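We haven't published our pipeline code, but the shape of a Pub/Sub-to-BigQuery streaming job is easy to sketch with Apache Beam, the SDK that Dataflow executes. The topic, table, and event fields below are hypothetical stand-ins, not our actual pipeline.

```python
# A sketch of a Pub/Sub-to-BigQuery streaming job in Apache Beam, the SDK
# behind Dataflow. Topic, table, and event fields are hypothetical; this
# is not The New York Times' actual pipeline code.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner to deploy

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/page-events")
        | "ParseJSON" >> beam.Map(json.loads)  # bytes -> dict per event
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.page_events",
            # The destination table is assumed to already exist.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```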
A couple of years ago, on our legacy system, I was up until three in the morning one night trying to keep data flowing for the newsroom. This year, for election night, I relaxed and ate a pint of ice cream, because I could manage our data environment much more easily, which let us set and meet higher expectations for data ingestion, analysis, and insight among our partners in the newsroom.
How COVID-19 changed our 2020 roadmap
The coronavirus pandemic definitely wasn't on my team's roadmap for 2020, and it's important to mention here that The New York Times is not fundamentally a data company. Our job is to get the news out to our users every single day in print, in our apps, and on the site. Our newsroom never expected to need to build out a giant coronavirus database to enrich the news it shares every day.
Our newsroom moves quickly, and our engineers have built one of the most comprehensive datasets on COVID-19 in the U.S. With Google, The New York Times decided to make our data publicly available in BigQuery as part of Google's COVID-19 public datasets. Check out this webinar for more details on the evolution of our architecture.
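Because the dataset is public, anyone with a Google Cloud project can query it. The sketch below uses the table and column names from the bigquery-public-data.covid19_nyt dataset as of this writing; check the current schema before relying on them.

```python
# Querying the public COVID-19 dataset from Python. Table and column names
# reflect bigquery-public-data.covid19_nyt as of this writing; verify the
# current schema before relying on them.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT date, state_name, confirmed_cases, deaths
    FROM `bigquery-public-data.covid19_nyt.us_states`
    WHERE state_name = 'New York'
    ORDER BY date DESC
    LIMIT 7
"""

for row in client.query(query).result():
    print(row.date, row.confirmed_cases, row.deaths)
```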
Flexible approach
We have many different teams that work within Google Cloud, and they've been able to pick from the range of available services and tailor project requirements with those tools in mind.
One challenge we think about with the data platform at The New York Times is determining the priorities of what we build. Our ability to engage with product teams at Google through the Data Analytics Customer Council gives us visibility into the BigQuery and data analytics roadmaps, and plays a significant role in determining where we focus our own development. For example, we've built tools like our Data Reporting API, which reads data directly from BigQuery, in order to take advantage of tools like BigQuery BI Engine. This approach encourages our analysts to be better stewards of their dimensions and metrics without having to build caching mechanisms for their data. Getting that kind of clarity helps us plan how to build The New York Times in the new normal and beyond.
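The Data Reporting API itself isn't public, but the pattern it describes, a thin service that queries BigQuery directly so BI Engine can accelerate the repeated, hot queries, can be sketched roughly. The Flask endpoint, table, and fields below are hypothetical illustrations, not the actual API.

```python
# A hypothetical sketch of the pattern above: a thin reporting endpoint
# that reads directly from BigQuery, so BI Engine (with a reservation
# covering the queried tables) can accelerate repeated queries.
# Illustrative only; this is not the actual Data Reporting API.
from flask import Flask, jsonify, request
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client()

@app.route("/metrics/readers-by-section")
def readers_by_section():
    # A stable, parameterized query shape is easier for BI Engine to
    # accelerate than freshly templated SQL strings.
    query = """
        SELECT page_section, COUNT(DISTINCT user_id) AS readers
        FROM `my-project.analytics.page_events`  -- hypothetical table
        WHERE event_date = @day
        GROUP BY page_section
        ORDER BY readers DESC
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("day", "DATE", request.args["day"])
    ])
    rows = client.query(query, job_config=job_config).result()
    return jsonify([dict(row) for row in rows])
```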
If you're interested in learning more about the data teams at The New York Times, take a look at our open tech roles here, and you'll find many interesting articles on the NYT data blog.