Fastest track to Apache Hadoop and Spark success: using job-scoped clusters on cloud-native architecture

A combination of rapid startup time, per-minute billing and cloud-native architecture is transformative for operators. 

It’s 2017, and running Apache Hadoop (or Apache Spark) is still too hard, whether on-premises or in the cloud. Ironically, the main source of this difficulty is one of Hadoop’s strengths: the ability to run multitenant workloads. When concurrency is limited to one user (which in most enterprises happens rarely, if ever), that first Terasort on your shiny, new idle cluster runs relatively fast. Under real-world load, with dozens of analysts running brute-force, join-heavy jobs that create massive resource contention, your cluster could grind to a halt.

Using YARN for resource management is helpful for mitigating this contention, but at the cost of increased complexity: YARN makes it difficult to understand the resource utilization of each job, diagnose performance bottlenecks or do job accounting and resource chargebacks. (Furthermore, even if YARN helps you improve utilization, you’ll still pay for resources that you don’t use.) Thus the all-too-common solution for any and all issues is to ask for more nodes.

Google’s recommended solution is to run an ephemeral job-scoped cluster, giving each job its own dedicated cluster for the duration of said job. In essence, instead of running one large homogeneous cluster, you operate many smaller ephemeral clusters, with each one coming online to execute that single job and then deleting itself once the job is done and results are saved. The benefits of this approach are transformative:

  • You’re essentially running each job in its own cluster with a concurrency of one, yet you’re not limited in how many clusters and jobs you can run in parallel.
  • You needn’t get entangled in YARN complexities for managing/allocating resources when multitenancy is involved.
  • You can point multiple clusters to the same dataset (by replacing HDFS with Cloud Storage), with all potential conflicts resolved automatically for you.
  • Monitoring for runaway queries that compromise overall cluster health is far, far less critical to your environment.
  • You get a clear audit trail, clear per-job resource allocation, transparent per-job accounting and contention-free job execution.
  • You have the agility and freedom to experiment with different job configurations (for example, to double the resources for a specific job to see if the job completes in half the time).

If this solution sounds attractive, you’ll likely appreciate the fact that users of Cloud Dataproc (Google Cloud Platform’s fully-managed Hadoop/Spark/Flink service) are already doing the above. Instead of deploying a huge, homogeneous cluster and then filling it up with jobs, these customers think of cluster deployment as just a step in executing a specific job, and furthermore, can customize (scope) the cluster for that job.

So, how does Cloud Dataproc (and Google Cloud Platform (GCP) generally) make this “job-scoping” approach possible?

Job-scoping in action

Simply put, two qualities facilitate this new approach to cluster management when combined with the well-known characteristics of cloud-native architectures (separation of storage from compute, plentiful high-throughput object storage and fast-scaling compute):

  • Rapid start-up time: Thanks to Google Cloud’s hyper-efficient infrastructure, Cloud Dataproc clusters spin up in as little as 45-60 seconds (but conservatively, assume 90 seconds).
  • Per-minute billing: Per-minute billing (10-min minimum) allows you to start/stop clusters on a dime and pay only for the resources you use.

This approach is simple in practice. Let’s say you have data living in Cloud Storage and you write a PySpark job that’s hungry for said data. Instead of submitting the job to a “generic” homogeneous cluster, you ask Cloud Dataproc to deploy a cluster just for this single job. In well under 2 minutes, your cluster is live and the job is running. Once the job is completed, output is saved to persistent storage (Cloud Storage, BigQuery, Bigtable, or others, depending on use case) and the cluster can be shut down. (Note: due to the initial cluster bootup time, this approach is not advisable for latency-sensitive ad-hoc SQL workloads, in which case BigQuery is a better fit.)
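
The create, run, delete lifecycle described above can be sketched as a small Python helper that builds the three gcloud invocations and runs them in order. This is a minimal sketch rather than a recommended implementation: the function names, cluster name, region and bucket path are hypothetical, and the `runner` argument is an assumption that lets you pass either a real executor (for example `lambda cmd: subprocess.run(cmd, check=True)`) or a stand-in for a dry run.

```python
import shlex


def job_scoped_commands(cluster, region, job_uri, num_workers=2):
    """Build the gcloud invocations for one job-scoped cluster.

    All argument values are supplied by the caller; nothing here is
    specific to any real project.
    """
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", f"--num-workers={num_workers}"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
              f"--cluster={cluster}", f"--region={region}"]
    # --quiet skips the interactive confirmation prompt on delete.
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    return [create, submit, delete]


def run_job_scoped(commands, runner):
    """Create the cluster, run the job, and always delete the cluster."""
    create, submit, delete = commands
    runner(create)
    try:
        runner(submit)
    finally:
        # Tear the cluster down even if the job fails, so it stops billing.
        runner(delete)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # The cluster name and gs:// path below are placeholders.
    cmds = job_scoped_commands(
        "wordcount-job", "us-central1",
        "gs://example-bucket/jobs/wordcount.py", num_workers=4)
    run_job_scoped(cmds, lambda cmd: print(shlex.join(cmd)))
```

Putting the delete step in a `finally` block is one way to make sure a failed job does not leave an idle cluster running.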

In other words, in exchange for a roughly 1-minute initial tax on job processing, you can tear down that one-size-fits-all Hadoop cluster in favor of multiple ephemeral job-scoped clusters, all pointing toward the same storage. With per-minute billing, you needn’t bin-pack your workloads to round up to the nearest hour, and with rapid cluster bootup, your patience is hardly tested. The end result is that Cloud Dataproc users think of their jobs as chunks of CPU/RAM minutes, and can tinker with that math as desired to arrive at the best possible configuration. (As added flexibility in this process, customizable machine types let you create your own VM shapes via tailored CPU/RAM combinations, preemptible VMs provide a flat and predictable discount from VM prices and sustained use discounts can apply if you end up needing a long-lived cluster.)
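
To make the "chunks of CPU/RAM minutes" arithmetic concrete, here is a small sketch that compares per-minute billing (with the 10-minute minimum mentioned above) against rounding up to the nearest hour. The function names are hypothetical, and the example assumes ideal scaling, where doubling the nodes halves the runtime; real jobs rarely scale that cleanly.

```python
import math


def billed_minutes(runtime_minutes, minimum=10):
    """Per-minute billing with a minimum charge (10 minutes in the post)."""
    return max(minimum, math.ceil(runtime_minutes))


def hourly_billed_minutes(runtime_minutes):
    """Billing that rounds every run up to the next full hour."""
    return 60 * math.ceil(runtime_minutes / 60)


def node_minutes(nodes, runtime_minutes, billing=billed_minutes):
    """Total billed node-minutes for a cluster of `nodes` machines."""
    return nodes * billing(runtime_minutes)


if __name__ == "__main__":
    # Assumed scenario: 10 nodes for 40 minutes vs. 20 nodes for 20 minutes.
    for nodes, runtime in [(10, 40), (20, 20)]:
        print(nodes, "nodes:",
              node_minutes(nodes, runtime), "node-minutes per-minute,",
              node_minutes(nodes, runtime, hourly_billed_minutes),
              "node-minutes hourly")
```

Under these assumptions, both configurations cost 400 node-minutes with per-minute billing, so finishing in half the time is free. With hourly rounding, the same two runs would be billed as 600 and 1,200 node-minutes respectively, which is why bin-packing mattered under the older model.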


Next steps

These new operational opportunities are inspiring the ecosystem to build or integrate tooling that helps take advantage of them. Our friends at Spotify recently open-sourced the Spydra framework for Hadoop, which they use to automate the management of job-scoped Hadoop clusters and jobs, both on-premises and in Cloud Dataproc. The efforts by the Apache Airflow community to build deep integration with Cloud Dataproc and other GCP services for workflow scheduling are a similarly positive sign that the GCP ecosystem is thriving and growing, which is great news for customers and partners.

We hope this post will also inspire you to explore whether the job-scoping approach can work in your shop. The best place to start is with a free trial!