Modernize Apache Spark with Cloud Dataproc on Kubernetes

September 10, 2019

Christopher Crosbie

Product Manager, Data Analytics

James Malone

Product Manager, Google Cloud

Google Cloud Dataproc provides open source data and analytics processing for data engineers and data scientists who need to process data and train models faster at scale. However, as enterprise infrastructure becomes increasingly hybrid, machines can sit idle, single-workload clusters continue to sprawl, and open source software and libraries become outdated and incompatible with your stack. It’s critical that Cloud Dataproc continues to empower data professionals to focus more on workloads than infrastructure by combining the best of cloud and open source. To support this vision, we’re happy to announce alpha availability of Cloud Dataproc for Kubernetes. With this announcement, we are bringing enterprise-grade support, management, and security to Apache Spark jobs running on GKE clusters. (Get all the technical details here.)

“Enterprises are increasingly looking for products and services that support data processing across multiple locations and platforms,” said Matt Aslett, Research Vice President at 451 Research. “The launch of Cloud Dataproc on Kubernetes is significant in that it provides customers with a single control plane for deploying and managing Apache Spark jobs on Google Kubernetes Engine in both public cloud and on-premises environments.”

This is the first step in a larger journey to a container-first world. While Apache Spark is the first open source processing engine we will bring to Cloud Dataproc on Kubernetes, it won’t be the last. Kubernetes has flipped the big data and machine learning open source software (OSS) world on its head, since it gives data scientists and data engineers a way to unify resource management, isolate jobs, and build resilient infrastructures across any environment.

Deploy unified resource management

With this alpha announcement, big data professionals no longer need to juggle two separate cluster management interfaces to manage open source components running on Kubernetes and YARN. Using Cloud Dataproc’s new capabilities, you’ll get one central view that spans both cluster management systems. Supporting both YARN and Kubernetes gives your enterprise the flexibility to modernize certain hybrid workloads while continuing to monitor YARN-based workloads.

Isolate OSS jobs to accelerate the analytics life cycle

Containerizing and isolating OSS jobs on Kubernetes will allow data professionals to move faster and remove the version and library dependencies associated with traditional big data technologies. You can move models and new ETL pipelines from dev to production without having to worry about compatibility. Building on a new agile infrastructure like Kubernetes will make OSS easier and faster to upgrade.
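To make the isolation idea concrete, here is a minimal sketch of how a job submitted to open source Spark's native Kubernetes scheduler pins its runtime to a container image. It illustrates the general mechanism only; it is not the Cloud Dataproc interface, which this post does not describe. The helper `build_spark_submit_args`, the API server address, the image names, the namespace, the class name, and the jar path are all hypothetical. The `spark-submit` flags and `spark.kubernetes.*` properties are the ones documented for Spark on Kubernetes.

```python
from typing import List


def build_spark_submit_args(
    api_server: str,
    image: str,
    namespace: str,
    main_class: str,
    app_jar: str,
    executors: int = 2,
) -> List[str]:
    """Build a spark-submit command line for Spark's Kubernetes scheduler.

    Hypothetical helper for illustration; it only assembles arguments
    and does not launch anything.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule on a Kubernetes API server.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver also runs in a pod.
        "--deploy-mode", "cluster",
        "--class", main_class,
        # The image carries the Spark version and every library the job needs.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.executor.instances={executors}",
        # local:// points at a jar already baked into the image.
        app_jar,
    ]


# Assumed values: the same job in dev and in production.
common = dict(
    api_server="https://k8s.example.internal:6443",
    namespace="analytics",
    main_class="com.example.etl.DailyLoad",
    app_jar="local:///opt/app/etl.jar",
)
dev_args = build_spark_submit_args(image="gcr.io/example-project/spark-etl:dev", **common)
prod_args = build_spark_submit_args(image="gcr.io/example-project/spark-etl:1.4.2", **common)

# Collect the arguments that differ between the two submissions.
changed = [(d, p) for d, p in zip(dev_args, prod_args) if d != p]
```

In this sketch the only argument that changes between dev and production is the image tag, because the dependencies travel inside the image rather than living on a shared cluster.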

Build resilient infrastructure

Deploying Spark jobs on a self-healing GKE environment will help mission-critical ETL and machine learning jobs run smoothly. Data scientists and data engineers don’t have to worry about sizing and building clusters, editing Dockerfiles, or messing around with Kubernetes networking configurations. It just works. With leading support from the team that built Kubernetes, enterprises have access to the expertise they need to close any Kubernetes skills gap on their team.

Open source has always been a core pillar of Google Cloud’s data and analytics strategy. As we continue to work with the community to set industry standards, we continue to integrate those standards into our services so organizations around the world can unlock the value of data faster.

Moving Cloud Dataproc to Kubernetes involved changes to both Cloud Dataproc and the open source ecosystem that we run as a managed service. We will continue to work with other open source communities, like Apache Flink, to enable Cloud Dataproc on Kubernetes capabilities for more open source projects. This alpha announcement, which brings enterprise-grade support, management, and security to Apache Spark jobs on Kubernetes, is the first of many as we aim to simplify infrastructure complexities for data scientists and data engineers around the world.

Email us for more information and to join the alpha program. Also, be sure to check out the tech deep dive on this alpha.
