Data Analytics

Burst data lake processing to Dataproc using on-prem Hadoop data

May 21, 2020

Adit Madan

Lead Engineer at Alluxio

Many companies have data stored in a Hadoop Distributed File System (HDFS) cluster in their on-premises environment. As the amount of stored data grows, and the number of workloads coming in from analytics frameworks like Apache Spark, Presto, Apache Hive, and more grow, this type of fixed on-premises infrastructure becomes costly and causes latency in data processing jobs. One method to tackle this problem is to use Alluxio, an open source data orchestration platform for analytics and AI applications, to “burst” workloads to Dataproc, Google Cloud’s managed service for Hadoop, Spark, and other clusters. You can enable high performance for data analytics across a hybrid environment by mounting on-premises data sources into Alluxio and save on costs in your private data center without copying data.

Zero-copy hybrid bursting to Dataproc

Alluxio integrates with both your private computing environment on-premises and with Google Cloud. Workloads can burst to Google Cloud on-demand without moving data between computing environments first. This allows for high data analytics performance with low I/O overhead.

For example, you may use a compute framework like Spark or Presto for your on-prem data stored in HDFS. By bursting that data to Google Cloud with Alluxio, you’re able run the related analytics in the cloud on-demand, without the time and resource-intensive process of migrating large amounts of data to the cloud. Even then, after data is transferred to Cloud Storage, it is hard to access near real-time data as on-prem data pipelines and source systems change and evolve. With a data orchestration platform, you can selectively migrate data on-demand and seamlessly make updated data accessible in the cloud without the need to persist all data to Cloud Storage.

We hear from customers that they find this especially helpful for enterprises in financial service, healthcare, and retail as they want to store and mask sensitive data on-prem and burst masked data and machine learning or ETL jobs to Google Cloud.

A typical architecture may look something like this:

https://storage.googleapis.com/gweb-cloudblog-publish/images/typical_architecture.max-1000x1000.jpg

How a large national retailer uses burst processing

A leading retailer that balances physical and digital stores moved to a zero-copy hybrid solution powered with Alluxio and Google Cloud as their on-premises data center became increasingly overloaded with the amount of data coming in. They wanted to take advantage of Google Cloud’s scalability, but couldn’t persist data into the cloud due to security concerns.

So they added Alluxio to burst their Presto workloads to Google Cloud to query large datasets that before wouldn’t have been able to move to the cloud. Before Alluxio, they manually copied large amounts of data at times three to four more than required for the jobs to run. Once those jobs completed, they had to delete that data, which made use of compute in Google Cloud very inefficient. By moving to Alluxio, they saw a vast improvement in query performance overall and were able to take advantage of the scalability and flexibility of Google Cloud. Here’s how their architecture looks:

https://storage.googleapis.com/gweb-cloudblog-publish/images/Alluxio_to_burst_1.max-1000x1000.jpg

Using a zero-copy hybrid burst approach for better costs and performance

We see some common challenges that users face with on-prem Hadoop or other big data workloads:

Hadoop clusters are running beyond 100% CPU capacity
Hadoop clusters can’t be expanded due to high load on the master NameNode
Compute capacity is not enough to offer the desired service-level agreement (SLA) to ad-hoc business users, and there’s no separation between SLA and non-SLA workloads
Infrequent and bursty analytics, such as a compute-intensive job for generating monthly compliance reports, compete for scarce resources
Cost containment is a problem at scale
Operational costs of self-maintained infrastructure adds to the total cost of ownership, with indirect costs related to salaries of the IT staff

Using the cloud to meet these challenges allows for independent scaling and on-demand provisioning of both compute and storage, the flexibility to use multiple compute engines (the right tool for the job), and reduced overload on existing infrastructure by moving ephemeral workloads.

To learn more about the zero-copy hybrid burst solution, register for the upcoming tech talk hosted by Alluxio and the Google Cloud Dataproc team on Thursday, May 28.

Posted in