
Control and granularity with Spark and Hadoop on Cloud Dataproc

It’s always rewarding when customer feedback leads to new feature ideas, and that’s exactly the case with the latest release of Google Cloud Dataproc.

Cloud Dataproc’s primary goal is to make it simpler, faster, and more economical to run clusters in the Apache Spark and Apache Hadoop ecosystem. We’ve heard from our customers that they appreciate how quickly they can create clusters and how per-minute billing helps them save money. But they also told us they wanted more control and granularity over those clusters: finer-grained cluster permissions, for example, and safeguards against forgotten clusters left running. These requests led to many of the new features in the latest Cloud Dataproc release, including granular IAM, scheduled deletion, and per-second billing.

Granular IAM

Cloud Dataproc clusters support Identity and Access Management (IAM) to control who can do what with the service. Previously, Cloud Dataproc supported IAM settings, such as permissions for who can create or delete clusters, only at the project level. While this approach contributes to security, it falls short for customers who need more granular permissions, such as restricting a cluster to a single group. To address these needs, the latest release of Cloud Dataproc includes beta support for granular IAM, which lets you set Cloud Dataproc roles and their corresponding permissions on a specific cluster.

The granular IAM documentation includes an example of applying these per-cluster permissions through the Google Cloud SDK (gcloud) or the Cloud Dataproc API.
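
As a rough sketch of what that looks like in practice (the cluster name, region, and group below are placeholders, and the per-cluster policy commands shipped in the beta gcloud track), you read a single cluster’s policy, add a binding, and write it back to that cluster only:

    # Download the current IAM policy for one specific cluster (not the whole project).
    gcloud beta dataproc clusters get-iam-policy my-cluster \
        --region=us-central1 --format=json > policy.json

    # Edit policy.json to add a binding, for example granting one group
    # the Dataproc Editor role on this cluster only:
    #   {"role": "roles/dataproc.editor", "members": ["group:data-team@example.com"]}

    # Write the updated policy back to that single cluster.
    gcloud beta dataproc clusters set-iam-policy my-cluster policy.json \
        --region=us-central1

Because the policy is attached to the cluster rather than the project, members of that group can manage this cluster without gaining the same rights on every other cluster in the project.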

Scheduled deletion

Cloud Dataproc clusters are fast to create, scale, and delete, but in the past the deletion step always had to be triggered manually. One of our most frequent customer requests was the ability to automatically delete clusters that are no longer needed. We agreed, and in our September 6th release we launched a new beta feature called scheduled deletion.

Rather than relying on a single criterion to decide when a cluster should be deleted, we provide three to cover a wide range of use cases. With scheduled deletion, you can set a cluster to delete itself automatically under any of the following circumstances:

  • Max idle time — Timeout based on the last job sent to the cluster via the Cloud Dataproc Jobs functionality. For example, you can use this to set a cluster to delete if no one submits a job via the gcloud dataproc jobs submit command in a certain period of time.
  • Expiration time — Specific date, in ISO 8601 datetime format, at which point the cluster will delete itself.
  • Maximum age — A maximum age in seconds, minutes, hours, or days the cluster should run after creation, at which point the cluster will delete itself.

The scheduled deletion documentation has more information on how to use this feature, including formatting details and notes on which settings can or cannot be used together.
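
As a sketch of the gcloud side (the cluster name and durations are placeholders; these flags shipped in the beta track), a self-cleaning cluster can be created in a single command:

    # Create a cluster that deletes itself after 2 hours with no submitted jobs,
    # or 8 hours after creation, whichever comes first.
    gcloud beta dataproc clusters create my-cluster \
        --max-idle=2h \
        --max-age=8h

To pin deletion to a specific date instead of a duration, --expiration-time takes an ISO 8601 timestamp in place of --max-age.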

Per-second billing

As announced by the Google Compute Engine team, Cloud Dataproc is moving from a per-minute to a per-second billing model, with a one-minute minimum. Like Compute Engine, we want our billing to be customer-centric.
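
To illustrate the arithmetic: a cluster that runs for 10 minutes and 30 seconds is billed for 630 seconds of usage instead of being rounded up to 11 minutes, while a cluster that runs for only 45 seconds is still billed for the 60-second minimum.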

This change is effective today for both Compute Engine used in Cloud Dataproc clusters and for the Cloud Dataproc premium charge. For more information, see the Cloud Dataproc pricing documentation.

The Cloud Dataproc team will continue to look for ways to make using Spark and Hadoop faster, easier, and more cost-effective. We look forward to more feedback from our customers, since it’s an important component of our vision. If you want to try these features on Cloud Dataproc, you can sign up today and get $300 in free credits.