Data Analytics

Migrating Apache Hadoop clusters to Google Cloud

May 27, 2020

Robert Saxby

Product Manager

Apache Hadoop and the big data ecosystem around it has served businesses well for years, offering a way to tackle big data problems and build actionable analytics. As these on-prem deployments of Hadoop and Apache Spark, Presto, and more moved out of experiments and into thousand-node clusters, cost, performance, and governance challenges emerged. While these challenges grew on-prem, Google Cloud emerged as a solution for many Hadoop admins looking to decouple compute from storage to increase performance while only paying for the resources they use. Managing costs and meeting data and analytics SLAs, while still providing secure and governed access to open source innovation, became a balancing act that the public cloud could solve without large upfront machine costs.

How to think about on-prem Hadoop migration costs

There is no one right way to estimate Hadoop migration costs. Some folks will look at their footprint on-prem today, then try to compare directly with the cloud, byte for byte and CPU cycle for CPU cycle. There is nothing wrong with this approach, and when you consider opex and capex, and discounts such as those for sustained compute usage, the cost case will start to look pretty compelling. Cloud economics work!

But what about taking a workload-centric approach? When you run your cloud-based Hadoop and Spark proof of concepts, consider the specific workload by measuring the units billed to run just that workload. Spoiler: it is quite easy when you spin up a cluster, run the data pipeline, and then tear down the cluster after you are finished.

Now, consider making a change to that workload. For example, use a later version of Spark and then redeploy. This is a seemingly easy task—but how would you accomplish it today on your on-prem cluster, and what would have been the cost to plan and implement such a change? These are all things to consider when you are building a TCO analysis of migrating your entire, or just a piece of, your on-prem Hadoop cluster.

Where to begin your on-prem Hadoop migration

It’s important to note that you are not migrating a cluster, but rather the user and workloads, from a place where you shoulder the burden of maintaining and operating a cluster to a place where you share that responsibility with Google. Starting with these users and workloads allows you to build a better, more agile experience.

Consider the data engineer who wants to update their pipeline to use the latest Spark APIs. When you migrate their code, you can choose to run it on its own ephemeral cluster—you are not forced to update the code for all your other workloads. They can run on their own cluster(s) and continue to leverage the previous version of the Spark APIs.

Or for the data analyst who may need additional resources to run their Hive query in time to meet a reporting deadline, you can choose to enable autoscaling. Or for a data scientist who has been wanting to decrease their ML training job duration, you can provide them with a familiar notebook interface and spin up a cluster as needed with GPUs attached.

These benefits all sound great, but there is hard work involved in migrating workloads and users. Where should you start?

In the Migrating Data Processing Hadoop Workloads to Google Cloud blog post, we start the journey by helping data admins, architects, and engineers consider, plan, and run a data processing job. Spoilers: You can precisely select which APIs and versions are available for any specific workload. You can size and scale your cluster as needed to meet the workload’s requirements.

Once you’re storing and processing your data in Google Cloud, you’ll want to think about enabling your analysis and exploration tools, wherever they are running, to work with your data. The work required here is all about proxies, networking, and security—but don’t worry, this is well-trodden ground. In Migrating Hadoop clusters to GCP - Visualization Security: Part I - Architecture, we’ll help your architects and admins to enable your analysts.

For data science workloads and users, we have recently released Dataproc Hub, which enables your data scientists and IT admins to access on-demand clusters tailored to their specific data science needs as securely as possible.

The Apache Hadoop ecosystem offers some of the best data processing and analytical capabilities out there. A successful migration is one in which we have unleashed them for your users and workloads; one in which the workload defines a cluster, and not the other way around. Get in touch with your Google Cloud contact and we’ll make your Hadoop migration a success together.

Posted in