Understanding cloud pricing part 6: Big data processing engines
Peter-Mark Verwoerd
Platform Solutions Architect, Google Cloud
Karan Bhatia
Google Cloud Platform Solutions Architect
Our goal in this series is to help our customers understand how workloads are priced in the cloud. We’ve previously written about Virtual Machines, Local SSD, Data Warehouses (part 1, part 2) and NoSQL databases (part 1, part 2). In this post we continue the series with an examination of the pricing of big data processing engines, specifically managed deployments of Hadoop running on cloud infrastructure.
The growth of big data has been predicated, in part, on the price of commodity storage, which has fallen from $10/GB in 2000 to less than $.10/GB in 2010[1] to $.01/GB today. At the same time, commodity compute costs have decreased exponentially, enabling new applications that store and process vast amounts of data to build insights into everything from Monte Carlo analysis of financial portfolios to personalized medicine to predictive analytics.
Google pioneered the development of a new generation of software tools that enabled these applications by leveraging generic commodity compute and storage systems and by developing new parallel programming paradigms that scale to thousands of nodes. Google’s GFS (2003), for example, provided a scalable distributed file system for large distributed data-intensive applications; Google’s MapReduce (2004) provided a parallel programming framework for processing ever-increasing unstructured datasets across large clusters of commodity servers; and Google’s Bigtable (2006) provided a distributed structured storage system designed to scale to petabytes of data across thousands of commodity servers.
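The MapReduce paradigm itself is easy to see in miniature. The following Python sketch mimics the word-count example from the MapReduce paper, with `map_fn` and `reduce_fn` standing in for user-supplied functions and an in-memory dictionary standing in for the distributed shuffle; it illustrates the programming model only, not any production framework.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: combine all counts emitted for a single key.
    return word, sum(counts)

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key, as the framework
    # would do across the nodes of a cluster.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

print(map_reduce(["big data big insights", "big clusters"]))
# {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}
```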
Based largely on the publications describing Google’s internal systems, the open source community developed the Hadoop processing engine and its associated big data software stack. The first public version of Hadoop became available in 2007, and it has remained the center of gravity of the open source big data ecosystem ever since. Today, Hadoop is an official Apache open source project, available in both free and commercially supported versions, on-premises and in the cloud. The big data software stack includes NoSQL databases, workflow monitors, columnar storage, batch and streaming support and much more. And it continues to evolve at a rapid rate.
This rapid evolution and relative lack of maturity mean that building and running a production Hadoop cluster reliably and efficiently is a complex endeavor. Over the past decade, cloud computing has become the de facto standard for running big data workloads. Managed Hadoop offerings (Google Cloud Platform’s Cloud Dataproc, Amazon Web Services’ Elastic MapReduce and Microsoft Azure’s HDInsight) all offer near-instant deployment, deep integration with cloud storage systems and a pay-only-for-what-you-use pricing model. In this post, we examine the features and pricing of these three managed Hadoop services.
Overview of managed Hadoop offerings
Cloud Dataproc from Google Cloud Platform, Elastic MapReduce from Amazon Web Services and HDInsight from Microsoft Azure each provide a managed Hadoop environment with automatic provisioning and configuration, simple job management, sophisticated monitoring and flexible pricing. Table 1 summarizes the main features of each as of March 2016.

Table 1: Overview of Hadoop Processing Engines
| Feature | Cloud Dataproc[2] | Elastic MapReduce[3] | HDInsight[4] |
| --- | --- | --- | --- |
| Hadoop distribution | From Apache source | From Apache source | Hortonworks |
| Hadoop version (current) | Hadoop 2.7.2 | Hadoop 2.7.1 | Hadoop 2.7.1 |
| Apache Pig | 0.15.0 | 0.14.0 | 0.14.0 |
| Apache Hive | 1.2.1 | 1.0 | 0.14.0 |
| Apache Spark | 1.6.0 | 1.6.0 | 1.3.1 |
| Apache Storm | -na- | -na- | 0.9.3 |
| Apache HBase | -na-[5] | 0.94.18[6] | 0.98.4 |
| # of supported instance types | thousands[7] | 37 | 17 |
| Live resizable | yes | yes | scale up only |
| Fault tolerant | no | no | yes[8] |
| Automatic configuration | yes | yes | yes |
| User customizable | Initialization actions | Bootstrap actions | Script actions |
| Preemptible discounts | yes | yes | no |
| Long-term discounts | Sustained Use Discounts (no commitment required) | Reserved Pricing (requires commitment) | no |
| Pipeline execution | Apache Oozie | AWS Data Pipeline | Apache Oozie |
Platform
Each of the three systems provides a deployment of the same open source Hadoop software. EMR and Cloud Dataproc start with the distribution from the Apache repository and make changes or additions to support provisioning on their respective cloud platforms. HDInsight uses the Hortonworks Data Platform (HDP) as its base distribution, which is itself based on the Apache version and provides additional packaging and support options. The programmatic and runtime differences between the three systems are mostly due to version differences in the source packages. For example, HDInsight uses an older version of Spark that may cause incompatibilities with Spark applications written for a later version.

Each of the systems leverages the underlying cloud computing platform for provisioning and management. HDInsight supports deployment on 17 different compute instance types varying in price and system properties such as cores, RAM and disk technology. EMR, likewise, supports 37 different underlying instance types. Cloud Dataproc supports 19 predefined instance types and, in addition, provides users with near-infinite variation through support for custom instance types. Custom instance types allow users to define cluster characteristics that exactly match the application’s requirements, optimizing both performance and cost.
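To see why this matters for cost, consider a CPU-bound job that never uses the extra memory of the nearest predefined shape. The sketch below compares a predefined 4-vCPU/15GB shape against a custom 4-vCPU/8GB shape; the per-vCPU and per-GB hourly rates are illustrative placeholders, not published prices.

```python
# Illustrative hourly rates -- placeholders, not published prices.
VCPU_RATE = 0.033   # assumed $/vCPU-hour
RAM_RATE = 0.0045   # assumed $/GB-hour

def hourly_cost(vcpus, ram_gb):
    # Custom machine types bill vCPUs and RAM independently.
    return vcpus * VCPU_RATE + ram_gb * RAM_RATE

# A predefined 4-vCPU/15GB shape vs. a custom 4-vCPU/8GB shape
# sized for a CPU-bound job that never touches the extra memory.
predefined = hourly_cost(4, 15)
custom = hourly_cost(4, 8)
print(f"predefined: ${predefined:.4f}/hr  custom: ${custom:.4f}/hr")
print(f"savings: {100 * (1 - custom / predefined):.1f}%")
```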
All three systems support some form of dynamic scaling after cluster creation. Scaling allows users to add or remove nodes from the system as the need arises. For example, a steady-state cluster may need to be scaled up temporarily if a large number of jobs are submitted to the cluster. When the backlog is resolved, the cluster can be scaled back down to the steady-state size. Cloud Dataproc and EMR support scaling clusters both up and down as needed; HDInsight supports scaling up, but scaling down requires jobs to be restarted[9].
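As a toy illustration of this scale-up/scale-down pattern (not any provider’s actual autoscaling logic), a resize policy driven by the job backlog might look like the following sketch.

```python
def target_worker_count(pending_jobs, jobs_per_worker=2,
                        min_workers=20, max_workers=100):
    # Toy policy: grow the cluster when the backlog outpaces capacity,
    # then shrink back toward the steady-state size once it clears.
    needed = -(-pending_jobs // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

print(target_worker_count(pending_jobs=90))  # 45: scale up for the backlog
print(target_worker_count(pending_jobs=0))   # 20: back to steady state
```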
Cluster pricing
Cloud Dataproc and EMR are priced as an additional charge over and above the cost of the underlying compute services. EMR pricing[10] depends on the instance type but is roughly 25% of the hourly EC2 instance charge. The Cloud Dataproc add-on charge is priced[11] at $.01 per vCPU of the underlying instance per hour. HDInsight charges[12] include the price of the underlying instances and vary based on the particular instances used.

EMR charges by the hour of use, rounded up to the nearest whole hour; Cloud Dataproc and HDInsight charge by the minute. If a 2-hour workload exceeds the 2-hour mark by even a minute, EMR will charge for 3 hours of work. Cloud Dataproc and HDInsight round usage to the nearest minute, reducing the overall cost of the job and significantly simplifying application lifecycle management for customers.
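The effect of billing granularity is simple arithmetic, as the sketch below shows for both the 2-hour example above and the 5-hour workload analyzed later in this post (the hourly rate is a placeholder).

```python
import math

RATE_PER_HOUR = 1.00  # placeholder hourly rate for the whole cluster

def billed_hourly(runtime_minutes):
    # Per-hour billing: partial hours round up to the next whole hour.
    return math.ceil(runtime_minutes / 60) * RATE_PER_HOUR

def billed_per_minute(runtime_minutes):
    # Per-minute billing: pay only for the minutes actually used.
    return runtime_minutes / 60 * RATE_PER_HOUR

# A 2-hour job that overruns by one minute:
print(billed_hourly(121))      # 3.0   -- a full third hour is billed
print(billed_per_minute(121))  # ~2.02 -- only 121 minutes are billed

# The same one-minute overrun on a 5-hour run costs 20% more:
print(billed_hourly(301) / billed_hourly(300) - 1)  # 0.2
```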
Both Cloud Dataproc and EMR offer additional pricing options, including preemptible discounts (EMR calls this “spot pricing”) and long-term use discounts. With EMR’s spot pricing, users place a “bid” on the underlying instances and AWS performs a periodic auction for the resources. Depending on the capacity of the spot fleet and the global set of bids for those resources, users with the highest bid prices obtain the resources. If a user loses the auction, the resources may be reclaimed without notice. Users must manage the auction and bids themselves, but doing so can reduce the cost of compute instances by 50-80% off the on-demand price[13].
Similar to AWS’s spot pricing, Google Cloud Platform’s Preemptible VMs provide resources at a significant discount with the possibility that the instances will be preempted (reclaimed) without notice. Unlike spot, however, there’s no auction to manage, and the price discount is fixed at 80%.
Cluster execution models
All three systems support both transient and persistent cluster models. Persistent clusters, as the name implies, are started up and run continuously, 24/7, for a long period of time. In this case, jobs are constantly being submitted into the system, either due to user demand or continuous data ingest.

In contrast, transient clusters are created to run a particular task, after which the cluster is destroyed. Typical use cases are ETL jobs that process data in batch mode. For example, log data may be aggregated in an object storage system, processed nightly and added to a data mart or data warehouse. To support this model, the data needs to be streamed into the cluster from reliable storage. Cloud Dataproc uses a Google Cloud Storage (GCS) connector to stream data to and from the cluster. Similarly, EMR supports streaming from AWS S3, and HDInsight supports Azure Blob Storage.
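On Cloud Dataproc, for instance, the GCS connector exposes Cloud Storage to Hadoop and Spark through gs:// paths, so a transient ETL job can read from and write back to object storage directly. The PySpark sketch below shows the shape of such a nightly job; the bucket names and log format are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="nightly-log-etl")

# The GCS connector lets Spark address Cloud Storage directly via
# gs:// URIs, with no separate HDFS staging step.
logs = sc.textFile("gs://example-bucket/logs/2016-02-01/*")

# Toy aggregation: count log lines per status code, assuming a
# space-delimited format with the status code in the third field.
counts = (logs.map(lambda line: (line.split()[2], 1))
              .reduceByKey(lambda a, b: a + b))

# Results go straight back to object storage before the transient
# cluster is torn down.
counts.saveAsTextFile("gs://example-bucket/output/status-counts")
```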
Comparing pricing for typical workloads
In order to compare pricing, we assume a transient cluster model where the cluster is created to process a particular volume of data initially located on the associated object storage system. After the cluster processes the data, the results are pushed back to object storage and the cluster is terminated. The total cost we consider is the monthly cost for processing and storage (bandwidth, API calls, etc. are not considered). We analyze both standard pricing and preemptible discount pricing; we don't consider negotiated or long-term discounts in this comparison.

We assume an average input data volume of 50TB, as that is typical for log processing on enterprise workloads, and we compare clusters containing 20 worker nodes, each with 4 cores and 15GB of RAM. For Cloud Dataproc, this requirement is satisfied with a 21-node cluster (1 master and 20 worker nodes) of the n1-standard-4 instance type, which has 4 virtual cores, 15GB of RAM and 80GB of SSD disk per node. We compare with a 21-node EMR cluster (1 head node and 20 workers) of the m3.xlarge instance type, which has 4 cores, 15GB of RAM and 2x40GB of locally attached SSD storage.
Finally, we include an HDInsight cluster of D3v2 instances (20 workers, 2 head nodes and 3 zookeeper nodes), each with 4 cores, 14GB of RAM and 200GB of disk. We assume that the cluster has a runtime of 5 hours and runs on a daily basis: that is, every day the cluster is created to process 50TB of data and, 5 hours later, the cluster is terminated. This model is common for transient clusters. Our cost metric is then the total monthly cost for computation and storage.
Table 2 shows the total monthly cost for our application using each of the three processing engines, with standard pricing and with preemptible discounts. We calculate standard pricing costs using the associated online pricing calculators. For preemptible discounts, we estimate costs using the method described below. HDInsight offers no preemptible discounts.
Table 2: Costs of processing 50TB of data for 5 hours daily, as of February 2016

| Pricing model | Cloud Dataproc | EMR | HDInsight |
| --- | --- | --- | --- |
| Standard pricing | $2,154.40 | $2,613.27 | $3,261.33 |
| Preemptible discount | $1,934.43 | $1,936.17[14] | -na- |
Standard pricing
Standard pricing results in costs of $2,154.40 per month for Cloud Dataproc, $2,613.27 per month for EMR (21% more than Cloud Dataproc) and $3,261.33 per month for HDInsight (51% more than Cloud Dataproc).

In real-world use, Cloud Dataproc and HDInsight are likely to be less costly than the table indicates, since they use per-minute billing whereas EMR rounds up to the next hour. While this exercise assumes the application runs for exactly 5 hours, it's generally difficult to estimate application runtimes so precisely. Going over by even 1 minute in this scenario incurs an additional hour of billed time, resulting in a 20% higher compute cost for the run.
Preemptible discount
Both Google Cloud Platform’s Preemptible VM and AWS’s spot models provide a significant price reduction to users who are willing to have their nodes reclaimed without notice. If that happens, the cluster may be terminated and the work completed to that point may be lost. However, in those cases a new cluster can be restarted to process the dataset from object storage, so no data is lost. For time-critical or very long-running clusters, this approach is generally not recommended.

HDInsight does not provide preemptible discounts, whereas Google Cloud Platform and AWS offer significant discounts in exchange for this flexibility. Google Cloud Platform offers a fixed 80% reduction off the standard price for Preemptible VMs, while AWS uses a market mechanism in which users bid for unused instances and an auction determines the market price. Because the supply of and demand for these instances varies, the spot market price can vary quite substantially. In this comparison, we calculated the mean price for the m3.xlarge instance type in the us-east-1 region over a 7-day window. The mean price was $.065/hour, a 75% discount off the on-demand price of $.266. However, there were price spikes during that time of up to $5.32/hour, over 20 times the on-demand price. It's left to users of the spot market to manage their bids and allocation strategy to minimize their costs. The cost of the cluster is the spot price of the underlying EC2 instances plus the EMR uplift; the EMR uplift is fixed and not dependent on the spot market.
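The spot price history used for this calculation is available through the EC2 API. A sketch along the following lines reproduces the 7-day mean (boto3's describe_spot_price_history pages its results; note this is a simple unweighted mean of the recorded price changes).

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Pull one week of spot price history for Linux m3.xlarge instances.
end = datetime.utcnow()
paginator = ec2.get_paginator("describe_spot_price_history")
prices = []
for page in paginator.paginate(InstanceTypes=["m3.xlarge"],
                               ProductDescriptions=["Linux/UNIX"],
                               StartTime=end - timedelta(days=7),
                               EndTime=end):
    prices.extend(float(p["SpotPrice"]) for p in page["SpotPriceHistory"])

print(f"mean spot price over 7 days: ${sum(prices) / len(prices):.3f}/hour")
```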
Using the average spot price of $.065/hour for the EMR cluster, the monthly cost is $1,936.17, roughly equivalent to the Cloud Dataproc cost of $1,934.43; however, Cloud Dataproc provides this discount without the need to explicitly manage bids and price spikes.
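The arithmetic behind the EMR figure follows the formula given in footnote 14, and is easy to check:

```python
# Inputs from footnote 14: mean spot price, fixed EMR uplift for
# m3.xlarge, 21 nodes running 5 hours/day for 30 days, plus the
# monthly S3 cost of storing the 50TB input dataset.
spot_price = 0.065      # $/hour, 7-day mean for m3.xlarge in us-east-1
emr_uplift = 0.07       # $/hour, fixed EMR charge per m3.xlarge node
nodes = 21
hours_per_month = 5 * 30
s3_storage = 1510.92    # $/month

total = (spot_price + emr_uplift) * nodes * hours_per_month + s3_storage
print(f"EMR monthly cost at spot prices: ${total:,.2f}")  # $1,936.17
```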
Summary
Cloud platforms provide an easy and cost-effective method of building and deploying Hadoop clusters based on currently accepted best practices within the Hadoop community. Deploying in the cloud also provides integration with large object storage systems, data warehousing systems and workflow management tools. All three of the systems we discussed here provide essentially the same Hadoop capabilities based on open source Hadoop, yet the costs and user experiences vary greatly among them.

As described in detail above, Google’s Cloud Dataproc not only provides the most cost-effective solution (up to 17% less than EMR and 34% less than HDInsight), it does so while providing flexibility and simplified usage:
- Cloud Dataproc provides custom instance types that allow users to specify CPU, RAM and local storage independently to match the requirements of the application.
- Cloud Dataproc provides significant preemptible discounts without the need to manage a complex auction process and highly variable price spikes.
- Cloud Dataproc provides per-minute billing, so users don't have to actively manage cluster runtimes: they pay for what they use without worrying about exceeding the hour threshold.
Related Content
Google Cloud Dataproc managed Spark and Hadoop service now GA, Feb 22 2016
Comparing the Dataflow/Beam and Spark Programming Models, Feb 3 2016
Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform, May 6 2015
Google BigQuery Benchmark, June 9 2015
[1] http://archive.wired.com/magazine/wired-20th-anniversary/
[2] As provided by Cloud Dataproc version 1.0.0 released 2/22/2016, https://cloud.google.com/dataproc/dataproc-versions
[3] AWS Elastic MapReduce AMI 4.3 released 1/27/2016, http://aws.amazon.com/documentation/elastic-mapreduce/
[4] Microsoft Azure HDInsight v3.2 using Hortonworks Data Platform (HDP) 2.2 released 12/03/2015, https://azure.microsoft.com/en-us/documentation/articles/hdinsight-release-notes/
[5] HBase support through the Google Bigtable service
[6] As of AMI 3.1.0 and later of Elastic MapReduce
[7] Cloud Dataproc supports 19 predefined instance types and new custom instance types that match your workloads, https://cloud.google.com/dataproc/concepts/custom-machine-types
[8] Dual head nodes and 3 zookeeper nodes provide failover
[9] https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-linux-information/#scaling
[10] https://aws.amazon.com/elasticmapreduce/pricing/
[11] https://cloud.google.com/dataproc/pricing
[12] https://azure.microsoft.com/en-us/pricing/details/hdinsight/
[13] http://ec2price.com/
14 Using average spot market price in us-east-1 for m3.xlarge instance type from Jan 1, 2016 to Jan 7 2016.Data obtained using the AWS spot pricing API. The average spot price during that time was $.065. Total cost = (Spot Price + EMR Uplift) * number of nodes * number of hours used + AWS S3 storage costs, or ($.065 + $.07) * 21 nodes * 150 hours + $1510.92. 2016