Goodbye Hadoop. Building a streaming data processing pipeline on Google Cloud
Platform Engineer, Qubit
Product Manager, Qubit
Editor’s note: Today we hear from Qubit, whose personalization technology helps leading brands in retail, travel, and online gaming curate their customers’ online experiences. Qubit recently moved its data ingestion platform from Hadoop and MapReduce, to a stack consisting of fully managed Google Cloud tools like Cloud Pub/Sub, Cloud Dataflow and BigQuery. Here’s their story.
At Qubit, we firmly believe that the most effective way to increase customer loyalty and lifetime value is through personalization. Our customers use our data activation technology to personalize customer experiences and to help them get closer to their customers by using real-time data that we collect from browsers and mobile devices.
Collecting e-commerce clickstream data to the tune of hundreds of thousands of events per second results in absolutely massive datasets. Traditional systems can take days to deliver insight from data at this scale. Relatively modern tools like Hadoop can help reduce this time tremendously— from days to hours. But to really get the most out of this data, and generate for example real-time recommendations, we needed to be able to get live insights as the data comes in. That means scaling the underlying compute, storage, and processing infrastructure quickly and transparently. We'll walk you through how building a data collection pipeline using serverless architecture on Google Cloud let us do just that.
Our data collection and processing infrastructure is built entirely on Google Cloud Platform (GCP) managed services (Cloud Dataflow, PubSub, and BigQuery). It streams, processes, and stores more than 120,000 events per second (during peak traffic) in BigQuery, with a very low end-to-end latency (sub-second). We then make that data, often hundreds of terabytes per day, available and actionable through an app that plugs into our ingestion infrastructure—all of this without ever provisioning a single VM.
In our early days, we built most of our data processing infrastructure ourselves, from the ground up. It was split across two data centers with 300 machines deployed in each. We collected and stored the data in storage buckets, dumped it into Hadoop clusters and then waited for massive MapReduce jobs on the data to finish. This meant spinning up hundreds of machines overnight, which was a problem because many of our customers expected near real-time access in order to power experiences to their in-session visitors. And then there’s the sheer scale of our operation: at the end of two years, we had stored 4PB of data.
Auto-scaling was pretty uncommon back then. And provisioning and scaling machines and capacity is tricky, to say the least, eating up valuable man-hours and causing a lot of pain. Scaling up to handle an influx of traffic, for instance during a big game or Black Friday, was an ordeal. Requesting additional machines for our pool had to be done at least two months in advance and it took another two weeks for our infrastructure team to provision them. Once the peak traffic period was over, the scaled up machines sat idle, incurring costs. A team of four full-time engineers needed to be on-hand to keep the system operational.
In a second phase, we switched to stream processing on Apache Storm to overcome the latency inherent in batch processing. We started running Hive jobs on our datasets so that they could be accessible to our analytics layer and integrated business intelligence tools like Tableau and Looker.
With the new streaming pipeline, it now took hours instead of days for the data to become available for analytics. But while this was an incredible improvement, it still wasn’t good enough to drive real-time personalization solutions like recommendations and social proof. This launched our venture into GCP.
Fast forward to today, and our pipeline is built entirely on Google Cloud managed services, namely Cloud Dataflow as the data processing engine, and BigQuery as the data warehouse. Events flow through our pipeline in three stages: validation, enrichment, and ingestion.
Every incoming event is first validated against our schema, enriched with metadata like geo location, currency conversion, etc, and finally written to partitioned tables in BigQuery. The data then flows through Cloud Pub/Sub topics between dataflows within a Protobuf envelope. Finally, an additional dataflow reads failed events emitted by the three main dataflows from a dead letter Pub/Sub topic and stores them in BigQuery for reporting and monitoring. All this happens in less than a few seconds.
Getting to this point took quite a bit of experimentation, though. As an early adopter of Google Cloud, we were one of the first businesses to test Cloud Bigtable, and actually built Stash, our in-house key/value datastore, on top of it. In parallel, we had also been building a similar version with HBase, but the ease of use and deployment, manageability, and flexibility we experienced with Bigtable convinced us of the power of using serverless architecture, setting us on our journey to build our new infrastructure on managed services.
As with many other Google Cloud services, we were fortunate to get early access to Cloud Dataflow, and started scoping out the idea of building a new streaming pipeline with it. It took our developers only a week to build a functional end-to-end prototype for the new pipeline. This ease of prototyping and validation cemented our decision to use it for a new streaming pipeline, since it allowed us to rapidly iterate ideas.
We briefly experimented with building a hybrid platform, using GCP for the main data ingestion pipeline and using another popular cloud provider for data warehousing. However, we quickly settled on using BigQuery as our exclusive data warehouse. It scales effortlessly and on-demand, lets you run powerful analytical queries in familiar SQL, and supports streaming writes, making it a perfect fit for our use case. At the time of writing this article, our continuously growing data storage in BigQuery is at a 5PB mark and we run queries processing over 10PB of data every month.
Our data storage architecture requires that the events be routed to different BigQuery datasets based on client identifiers baked into the events. This, however, was not supported by the early versions of the Dataflow SDK, so we wrote some custom code for the BigQuery streaming write transforms to connect our data pipeline to BigQuery. Routing data to multiple BigQuery tables has since been added as a feature in the new Beam SDKs.
Within a couple of months, we had written a production-ready data pipeline and ingestion infrastructure. Along the way, we marveled at the ease of development, management, and maintainability that this serverless architecture offered, and observed some remarkable engineering and business-level optimizations. For example, we reduced our engineering operational cost by approximately half. We no longer needed to pay for idle machines, nor manually provision new ones as traffic increases, as everything is configured to autoscale based on incoming traffic. We now pay for what we use. We have dealt with massive traffic spikes (10-25X) during major retail and sporting events like Black Friday, Cyber Monday, Boxing Day etc, for three years in a row without any hiccups.
Google’s fully managed services allow us to save on the massive engineering efforts required to scale and maintain our infrastructure, which is a huge win for our infrastructure team. Reduced management efforts mean our SREs have more time to build useful automation and deployment tools.
It also means that our product teams can get proof-of concepts out the door faster, enabling them to validate ideas quickly, reject the ones that don’t work, and rapidly iterate over the successful ones.
This serverless architecture has helped us build a federated data model fed by a central Cloud Pub/Sub firehose that serves all our teams internally, thus eliminating data silos. BigQuery serves as a single source of truth for all our teams and the data infrastructure that we built on Google Cloud powers our app and drives all our client-facing products.
Our partnership with Google is fundamental to our success—underpinning our own technology stack, it ensures that every customer interaction on web, mobile, or in-app is smarter, sharper, and more personal. Serverless architecture has helped us build a powerful ingestion infrastructure that forms the basis of our personalization platform. In upcoming posts, we’ll look into some tools we developed to work with the managed services, for example an open source tool to launch dataflows and our in-house Pub/Sub event router. We’ll also look at how we monitor our platform. Finally, we’ll deep dive into some of the personalization solutions that we built that leverage serverless architecture, like our recommendation engine. In the interim, feel free to reach out to us with comments and questions.