
Leave manual cluster resizing behind with Cloud Dataproc’s autoscaling

November 15, 2019
Karthik Palaniappan

Software Engineer for Cloud Dataproc

Building real-time, interactive data products with open source data and analytics processing technology is not a trivial task: it involves constantly balancing cluster costs against service-level agreements (SLAs). Whether you're using Apache Hadoop and Spark to build a customer-facing web application or a real-time interactive dashboard for your product team, handling heavy spikes in traffic on the data and analytics side is extremely difficult.

We’re pleased to announce that Cloud Dataproc’s new autoscaling capabilities are now generally available. Autoscaling removes the need for complex capacity planning, which too often ends in either missed SLAs or resources sitting idle.

How can autoscaling help your team?
These new capabilities can help a range of teams, whether they’re data engineers building complex ETL pipelines, data analysts running ad hoc SQL queries, or data scientists training a new model. Cloud Dataproc’s autoscaling capabilities let cluster admins build ephemeral or long-standing clusters in 90 seconds and apply an autoscaling policy to the cluster, minimizing costs and maximizing the user experience without manual intervention.
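
For example, once a policy exists (a sample policy file appears later in this post), attaching it to a new cluster is a single flag at creation time. This is a minimal sketch using the gcloud CLI; the cluster name, region, and policy ID are placeholders:

# Create a cluster governed by an existing autoscaling policy.
# "etl-cluster", "us-central1", and "etl-policy" are placeholder values.
gcloud dataproc clusters create etl-cluster \
    --region=us-central1 \
    --autoscaling-policy=etl-policy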

Whether you’re part of the team at a technology company building a SaaS application, a telecommunications company analyzing network traffic, or a retailer monitoring clickstream data during the holidays, you no longer have to worry about right-sizing clusters. 

Here’s a look at some common use cases:

[Image: common Cloud Dataproc autoscaling use cases (https://storage.googleapis.com/gweb-cloudblog-publish/images/autoscaling.max-900x900.png)]

Core Cloud Dataproc autoscaling capabilities include:

  • Right-sizing your cluster: Estimating the "right" number of cluster workers (nodes) for a workload is difficult, and a single cluster size for an entire pipeline is often not ideal. With autoscaling, you no longer have to right-size your cluster manually.

  • One autoscaling policy, multiple clusters: An autoscaling policy is a reusable configuration that describes how clusters using it should scale. It defines scaling boundaries, frequency, and aggressiveness, giving you fine-grained control over cluster resources throughout the cluster’s lifetime. A sample policy file appears after this list.

  • Budget optimization: Scale clusters in and out while setting limits in the autoscaling policy to make sure you don’t exceed your budget.

  • YARN integration: Autoscaling policies integrate with YARN automatically to trigger VM scaling when needed, so you have one central resource management system for all of your Cloud Dataproc jobs.

  • Monitor autoscaling jobs: Integrate with Stackdriver Monitoring to view metrics from autoscaling clusters, view the number of Node Managers in your cluster, and understand why autoscaling did or did not scale your cluster. Use Stackdriver Logging to view autoscaler decisions; an example query appears at the end of this post.

  • Multi-region support: Deploy autoscaling clusters in any region where Cloud Dataproc is available.
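
To make the policy bullet above concrete, here is what a minimal policy file and import might look like. The field names follow the Cloud Dataproc autoscaling policy schema; the specific values are illustrative placeholders, not tuning recommendations:

# Define the policy: boundaries, frequency, and aggressiveness.
# All values below are placeholders, not recommendations.
cat > etl-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2                    # lower scaling boundary on primary workers
  maxInstances: 20                   # upper boundary, which doubles as a budget cap
basicAlgorithm:
  cooldownPeriod: 2m                 # frequency: how often scaling decisions are evaluated
  yarnConfig:
    scaleUpFactor: 0.5               # aggressiveness: fraction of pending YARN memory to add
    scaleDownFactor: 1.0             # fraction of idle YARN memory to remove
    gracefulDecommissionTimeout: 1h  # let running jobs finish before removing a node
EOF

# Register the policy once; any number of clusters can then reference it.
gcloud dataproc autoscaling-policies import etl-policy \
    --source=etl-policy.yaml \
    --region=us-central1

A lower scaleUpFactor smooths out spiky demand at the cost of slower ramp-up, while a value of 1.0 reacts to pending work as fast as possible.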

Check out our documentation to access everything you need to get started with Cloud Dataproc autoscaling. Autoscaling is supported through the v1 API on cluster image versions 1.0.99+, 1.1.90+, 1.2.22+, 1.3.0+, and 1.4.0+.
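
If you want to inspect autoscaler decisions from the command line, a Stackdriver Logging query along these lines should work. The logName filter below is our assumption about how Dataproc autoscaler records are named at the time of writing; verify it against the current documentation for your image version:

# List recent autoscaler decisions for Cloud Dataproc clusters in the project.
# The logName value is an assumption; check the docs for your image version.
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND logName:"dataproc.googleapis.com%2Fautoscaler"' \
    --limit=20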
