Cloud Dataproc Spark Jobs on GKE: How to get started
Christopher Crosbie
Product Manager, Data Analytics
Patrick Clay
Software Engineer, Google Cloud
Cloud Dataproc is Google Cloud’s fully managed Apache Hadoop and Spark service. The mission of Cloud Dataproc has always been to make it simple and intuitive for developers and data scientists to apply their existing tools, algorithms, and programming languages to cloud-scale datasets. Its flexibility means you can keep using the skills and techniques you already know to explore data of any size. We hear from enterprises and SaaS companies around the world that they’re using Cloud Dataproc for data processing and analytics.
Cloud Dataproc now offers alpha access to Spark jobs on Google Kubernetes Engine (GKE). (Find out more about the program here.) This means you can take advantage of the latest approaches in machine learning and big data analysis (Apache Spark and Google Cloud) together with the state-of-the-art cloud management capabilities that developers and data scientists have come to rely upon with Kubernetes and GKE. Using these tools together can bring you flexibility, auto-healing jobs, and a unified infrastructure, so you can focus on workloads, not maintaining infrastructure. Email us for more information and to join the alpha program.
Let’s take a look at Cloud Dataproc in its current form and what the new GKE alpha offers.
Cloud Dataproc now: Cloud-native Apache Spark
Cloud Dataproc has democratized big data and analytics processing for thousands of customers, offering the ability to spin up a fully loaded and configured Apache Spark cluster in minutes. With Cloud Dataproc, features such as Component Gateway enable secure access to notebooks with zero setup or installation, letting you immediately start exploring data of any size. These notebooks, combined with Cloud Dataproc Autoscaling, make it possible to run ML training or process data of various sizes without ever having to leave your notebook or worry about how the job will get done. The underlying Cloud Dataproc cluster simply adjusts compute resources as needed, within predefined limits.
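As a minimal sketch of how those predefined limits are expressed (the policy values, cluster name, and region below are illustrative placeholders), you can import an autoscaling policy and attach it to a cluster that also enables Component Gateway and Jupyter:

# Define scaling limits in an autoscaling policy (values are illustrative).
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import my-autoscaling-policy \
  --source=autoscaling-policy.yaml --region=us-central1

# Attach the policy so the cluster adjusts compute resources within those
# limits; Component Gateway and Jupyter give zero-setup notebook access.
gcloud dataproc clusters create my-notebook-cluster \
  --region=us-central1 \
  --autoscaling-policy=my-autoscaling-policy \
  --enable-component-gateway \
  --optional-components=JUPYTER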
Once your ML model or data engineering job is ready for production, or for use in an automated or recurring way, you can use the Cloud Dataproc Jobs API to submit a job to an existing Cloud Dataproc cluster with a jobs.submit call over HTTP, with the gcloud command-line tool, or in the Google Cloud Platform Console itself. Submitting your Spark code through the Jobs API ensures the jobs are logged and monitored, in addition to being managed across the cluster. It also makes it easy to separate the permissions of who can submit jobs to a cluster from who can access the cluster itself, without needing a gateway node or an application like Livy.
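For example, a jobs.submit call over HTTP looks roughly like the following sketch; the project, region, cluster name, and job details are placeholders, and the SparkPi example jar referenced here ships on Cloud Dataproc images.

# Submit a Spark job to an existing cluster through the Dataproc Jobs API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/jobs:submit \
  -d '{
    "job": {
      "placement": { "clusterName": "my-cluster" },
      "sparkJob": {
        "mainClass": "org.apache.spark.examples.SparkPi",
        "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"]
      }
    }
  }'

The equivalent gcloud command, gcloud dataproc jobs submit spark, appears later in this post.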
Cloud Dataproc next: Extending the Jobs API with GKE
The Cloud Dataproc Jobs API has been a perfect match for companies that prefer to wrap their job automation and extract, transform, load (ETL) processing in custom tooling such as Spotify’s Spydra or Cloud Dataproc’s Workflow Templates.
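As a sketch of the Workflow Templates approach (all names, classes, and paths below are placeholders), a template can carry both an ephemeral managed cluster definition and the jobs to run on it, then be instantiated on demand:

# Create a template with a managed (ephemeral) cluster and a Spark step.
gcloud dataproc workflow-templates create my-etl-template --region=us-central1
gcloud dataproc workflow-templates set-managed-cluster my-etl-template \
  --region=us-central1 --cluster-name=ephemeral-etl-cluster --num-workers=2
gcloud dataproc workflow-templates add-job spark \
  --workflow-template=my-etl-template --region=us-central1 \
  --step-id=transform --class=com.example.Transform \
  --jars=gs://my-bucket/jars/etl.jar

# Instantiate the template; the cluster is created, the job runs, and the
# cluster is deleted when the workflow finishes.
gcloud dataproc workflow-templates instantiate my-etl-template --region=us-central1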
However, developers and data scientists who have embraced containerization and the cloud management capabilities of Kubernetes have started to demand more from their big data processing services. To automate a Spark job today, you either need to keep running the cluster that created the job (expensive, and it forgoes the pay-as-you-need capability of the cloud), or you need to carefully track how to re-create that same cluster environment in the cloud, which can become a complicated mixture of configurations, initialization scripts, conda environments, and library/package management scripts. This process can be especially cumbersome in multi-tenant environments, where various software packages, configurations, and OS updates may conflict.
With Cloud Dataproc on Kubernetes, you can eliminate the need for multiple types of clusters running different sets of software, and the complexity that comes with maintaining them. By extending the Cloud Dataproc Jobs API to GKE, you can package all of a job’s dependencies into a single Docker container, which lets you integrate Spark jobs directly into the rest of your software development pipelines.
Additionally, by extending the Cloud Dataproc Jobs API for GKE, administrators have a unified management system where they can tap into their Kubernetes knowledge. You can avoid having a silo of Spark applications that need to be managed in standalone virtual machines or in Apache Hadoop YARN.
Kubernetes: Yet another resource negotiator?
Apache Hadoop YARN (introduced in 2012) is a resource negotiator commonly found in on-premises and cloud Spark platforms. YARN provides the core capability of scheduling compute resources in Cloud Dataproc clusters that are based on Compute Engine. By extending the Jobs API in Cloud Dataproc with GKE, you can choose to replace YARN management with Kubernetes. There are some key advantages to using Kubernetes over YARN:
1. Flexibility. Embedding a consistent set of software libraries alongside your Spark code gives production jobs greater flexibility. Containerizing Spark jobs isolates dependencies and resources at the job level instead of the cluster level. This flexibility gives you more predictable workload cycles and makes it easier to target your troubleshooting when something does go wrong.
2. Auto-healing. Kubernetes provides declarative configuration for your Spark jobs: you declare at the start of the job the resources required to process it. If Kubernetes resources such as executors become unhealthy, Kubernetes automatically restores them, and your job continues to run with the resources you declared at the outset (see the sketch after this list).
3. Unified infrastructure. At Google, we have used a system called Borg to unify all of our processing, whether it’s a data analytics workload, a website, or anything else. Borg’s architecture served as the basis for Kubernetes, which you can use to remove the need for a big data (YARN) silo.
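To make the declarative model concrete, here is a sketch of the kind of specification accepted by the open source Spark Operator discussed later in this post; the image, names, and file paths are placeholders, and when you go through the Cloud Dataproc Jobs API an equivalent specification is generated for you.

kubectl apply -f - <<'EOF'
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: gcr.io/my-project/my-spark-job:v1   # placeholder container image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///usr/lib/spark/examples/jars/spark-examples.jar
  sparkVersion: "2.4.3"
  restartPolicy:
    type: OnFailure            # unhealthy drivers/executors are restored automatically
    onFailureRetries: 3
    onFailureRetryInterval: 10
  driver:
    cores: 1
    memory: "2g"
  executor:
    instances: 3               # resources declared up front, per the auto-healing point above
    cores: 1
    memory: "2g"
EOF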
By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes. At Google, having a single cluster manager system has led to more efficient use of resources and provided a unified logging and management framework. This same capability is now available to your organization.
Kubernetes is not just “yet another resource negotiator” for big data processing. It’s an entirely new way of approaching big data that can greatly improve the reliability and management of your data and analytics workloads.
Spark jobs on GKE in action
Let’s walk through what is involved in submitting an Apache Spark job to Cloud Dataproc on GKE during the alpha phase.
Step 0: Register your GKE cluster with Cloud Dataproc
Before you can execute Cloud Dataproc jobs on GKE, you must first register your GKE cluster with Cloud Dataproc. During the alpha, registration is completed with a Helm installation. Once the GKE cluster is registered, you can see it listed alongside the rest of your Cloud Dataproc clusters by running a cluster list command like the one sketched below.
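The exact registration steps come from the alpha onboarding materials; the following is a minimal sketch that assumes a hypothetical chart directory and placeholder project, cluster, and region values.

# Register the GKE cluster with Cloud Dataproc (alpha). The chart path and
# values below are placeholders; use the chart and invocation supplied with
# your alpha onboarding materials.
helm install dataproc-gke ./dataproc-on-gke-chart \
  --set project=my-project \
  --set gkeCluster=my-gke-cluster \
  --set region=us-central1

# Once registered, the GKE-backed cluster appears alongside your
# Compute Engine-based Dataproc clusters:
gcloud dataproc clusters list --region=us-central1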
Step 1: Define your Cloud Dataproc Docker container
Cloud Dataproc offers Docker images that match the bundle of software provided on the Cloud Dataproc image version list. The alpha offering contains an image based on Debian 9 Stretch that mirrors the same Spark 2.4.3 package as Cloud Dataproc 1.4. This makes it seamless to port Spark code between Cloud Dataproc running on Compute Engine and Cloud Dataproc jobs on GKE.
This Docker container not only encapsulates Cloud Dataproc’s agent for job management but also builds on Google Cloud’s Spark Operator for Kubernetes (in beta). This fully open source operator provides many of the integrations between Kubernetes and the rest of the Google Cloud Platform, including:
Integration with BigQuery, Google’s serverless data warehouse
Google Cloud Storage as a replacement for HDFS
Logs shipped to Stackdriver Logging
Access to sparkctl, a command-line tool that simplifies managing client-local application dependencies in a Kubernetes environment.
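For example, a minimal sparkctl sketch (the application name, spec file, and bucket are placeholders) uploads client-local dependencies and submits the application in one step:

# Upload local dependencies referenced in the spec to a GCS staging
# location and create the SparkApplication.
sparkctl create my-spark-app.yaml --upload-to gs://my-bucket/spark-staging

# Follow the status and driver logs of the submitted application.
sparkctl status my-spark-app
sparkctl log my-spark-app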
This Cloud Dataproc Docker container can be customized to include all the packages and configurations needed for your Spark job.
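As a sketch, customization might look like the following; the base image URI is a placeholder for the image published for the alpha, and the packages and configuration file are examples of job-specific additions (this assumes the base image provides pip).

# Extend the Dataproc-provided image with job-specific dependencies.
cat > Dockerfile <<'EOF'
# Placeholder for the Cloud Dataproc 1.4 (Spark 2.4.3) base image published for the alpha.
FROM gcr.io/cloud-dataproc/dataproc-image:1.4

# Example job-specific Python packages and Spark configuration.
RUN pip install pandas==0.25.3 scikit-learn==0.21.3
COPY spark-defaults.conf /etc/spark/conf/spark-defaults.conf
EOF

docker build -t gcr.io/my-project/my-spark-job:v1 .
docker push gcr.io/my-project/my-spark-job:v1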
Step 2: Submit your job
Once the Docker container is ready, you can submit a Cloud Dataproc job to the GKE cluster. You can follow the same instructions that you would use for any Cloud Dataproc Spark job.
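For example, the following submits the stock SparkPi example to the GKE-backed cluster registered in Step 0; the cluster name and region are placeholders.

# Same Jobs API surface as a Compute Engine-based cluster; only the target
# cluster differs.
gcloud dataproc jobs submit spark \
  --cluster=my-gke-backed-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000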
Extending Cloud Dataproc with your own container
Running the above job gives you, on Kubernetes, a software environment that mirrors the one found on Cloud Dataproc. With the GKE option, however, there is an extra benefit: you can specify a container image associated with the job. This container property provides a reliable pairing of your job code with the software configurations it needs.
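Here is a sketch of pairing the job with a specific image via a job property. The property shown is the upstream Spark-on-Kubernetes setting; confirm the exact property name the alpha expects against your onboarding materials. The image, class, and jar location are placeholders.

# Submit the job with the custom container image built in Step 1.
gcloud dataproc jobs submit spark \
  --cluster=my-gke-backed-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/jars/my-job.jar \
  --properties=spark.kubernetes.container.image=gcr.io/my-project/my-spark-job:v1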
Starting your Cloud Dataproc on Kubernetes testing
At Google Cloud, we work with thousands of customers who have migrated production workloads to Kubernetes and reaped the benefits described in this post. However, it’s important to note that while Cloud Dataproc is a generally available service used to run a variety of mission-critical applications across enterprises, the Cloud Dataproc on GKE feature is in alpha and still under active development.
Kubernetes support in the latest stable version of Spark is still considered an experimental feature. In future versions, there may be behavior changes around configuration, container images, and entry points. The Google Cloud Spark Operator that is core to this Cloud Dataproc offering is also a beta application and subject to the same stipulations.
So far, we’ve been very impressed and excited by the preliminary adoption and the new workloads customers have opened up by running their Spark processing on Kubernetes. We’re looking forward to taking this journey to production together with our customers and invite you to join our alpha program.
Email us for more information and to join the alpha program.