Fastest track to Apache Hadoop and Spark success: using job-scoped clusters on cloud-native architecture
A combination of rapid startup time, per-minute billing and cloud-native architecture is transformative for operators.
It’s 2017, and running Apache Hadoop (or Apache Spark) is still too hard, whether on-premises or in the cloud. Ironically, the main source of this difficulty is one of Hadoop’s strengths: the ability to run multitenant workloads. When concurrency is limited to one user (which in most enterprises happens rarely, if ever), that first TeraSort on your shiny, new idle cluster runs relatively fast. Under real-world load, with dozens of analysts running brute-force, join-heavy jobs that create massive resource contention, your cluster could grind to a halt.
Using YARN for resource management is helpful for mitigating this contention, but at the cost of increased complexity: YARN makes it difficult to understand the resource utilization of each job, diagnose performance bottlenecks or do job accounting and resource chargebacks. (Furthermore, even if YARN helps you improve utilization, you’ll still pay for resources that you don’t use.) Thus the all-too-common solution for any and all issues is to ask for more nodes.
Google’s recommended solution is to run an ephemeral job-scoped cluster, giving each job its own dedicated cluster for the duration of said job. In essence, instead of running one large homogeneous cluster, you operate many smaller ephemeral clusters, with each one coming online to execute that single job and then deleting itself once the job is done and results are saved. The benefits of this approach are transformative:
- You’re essentially running each job in its own cluster with concurrency of one, yet you're not limited in how many clusters and jobs you can run in parallel.
- You needn’t get entangled in YARN complexities for managing/allocating resources when multitenancy is involved.
- You can point multiple clusters to the same dataset (by replacing HDFS with Cloud Storage), with all potential conflicts resolved automatically for you.
- Monitoring for runaway queries that compromise overall cluster health is far, far less critical to your environment.
- You get a clear audit trail, clear per-job resource allocation, transparent per-job accounting and contention-free job execution.
- You have the agility and freedom to experiment with different job configurations (for example, to double the resources for a specific job to see if the job completes in half the time).
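The lifecycle described above — create a dedicated cluster, run the job against Cloud Storage, then delete the cluster — can be sketched with the gcloud CLI. This is an illustrative outline only: the project region, worker counts, machine type, and the `gs://my-bucket/...` paths and job class are placeholders, not values from this post.

```shell
#!/usr/bin/env bash
# Sketch of a job-scoped (ephemeral) Cloud Dataproc cluster lifecycle.
# All names, paths, and sizes below are placeholders.
set -euo pipefail

CLUSTER="ephemeral-job-$(date +%s)"   # unique name per job
REGION="us-central1"

# 1. Create a cluster sized for this one job.
gcloud dataproc clusters create "$CLUSTER" \
    --region "$REGION" \
    --num-workers 4 \
    --worker-machine-type n1-standard-8

# 2. Submit the job. Input and output live in Cloud Storage,
#    not cluster-local HDFS, so other clusters can share the data.
gcloud dataproc jobs submit spark \
    --cluster "$CLUSTER" \
    --region "$REGION" \
    --class com.example.MyJob \
    --jars gs://my-bucket/jars/my-job.jar \
    -- gs://my-bucket/input/ gs://my-bucket/output/

# 3. Delete the cluster; results persist in Cloud Storage.
gcloud dataproc clusters delete "$CLUSTER" --region "$REGION" --quiet
```

Because the cluster name is generated per job, any number of such scripts can run in parallel without contending for the same resources.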
Job-scoping in action

Simply put, the following two qualities — combined with the well-known characteristics of cloud-native architectures (separation of storage from compute, availability of lots of high-throughput object storage and fast-scaling compute) — facilitate this new approach to cluster management:
- Rapid start-up time: Thanks to Google Cloud’s hyper-efficient infrastructure, Cloud Dataproc clusters spin up in as little as 45-60 seconds (though conservatively, assume 90 seconds).
- Per-minute billing: With a 10-minute minimum, per-minute billing allows you to start/stop clusters on a dime and pay only for the resources you use.
In other words, in exchange for a roughly 1-minute initial tax on job processing, you can tear down that one-size-fits-all Hadoop cluster in favor of multiple ephemeral job-scoped clusters, all pointing toward the same storage. With per-minute billing, you needn’t bin-pack your workloads to round up to the nearest hour, and with rapid cluster bootup, your patience is hardly tested. The end result is that Cloud Dataproc users think of their jobs as chunks of CPU/RAM minutes, and can tinker with that math as desired to arrive at the best possible configuration. (As added flexibility in this process, customizable machine types let you create your own VM shapes via tailored CPU/RAM combinations, preemptible VMs provide a flat and predictable discount from VM prices and sustained use discounts can apply if you end up needing a long-lived cluster.)
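The “chunks of CPU/RAM minutes” framing lends itself to simple arithmetic. The sketch below compares a baseline job against a doubled-resources run under per-minute billing with a 10-minute minimum; the per-vCPU price and job sizes are made-up illustrative numbers, not actual Cloud Dataproc rates.

```python
# Per-minute billing math for job-scoped clusters.
# The price and job parameters are illustrative assumptions only.

PRICE_PER_VCPU_MINUTE = 0.0008  # assumed $/vCPU-minute, not a real rate

def job_cost(vcpus, runtime_min, startup_min=1.5, minimum_min=10):
    """Cost of one job-scoped cluster: billed minutes x vCPUs x rate.

    Billed time includes the ~1-minute cluster-startup "tax" and is
    subject to the 10-minute billing minimum.
    """
    billed = max(runtime_min + startup_min, minimum_min)
    return vcpus * billed * PRICE_PER_VCPU_MINUTE

# Baseline: 32 vCPUs for a 60-minute job.
baseline = job_cost(vcpus=32, runtime_min=60)

# Doubled resources: if the job scales well and finishes in half the
# time, total vCPU-minutes (and so cost) stay roughly the same.
doubled = job_cost(vcpus=64, runtime_min=30)

print(f"baseline: ${baseline:.2f}, doubled: ${doubled:.2f}")
```

The doubled run costs nearly the same but returns results in half the wall-clock time, which is exactly the kind of trade-off the per-job accounting makes easy to evaluate. The 10-minute minimum only matters for very short jobs, where it slightly favors fewer, larger clusters.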