
Cloud Dataproc is now even faster and easier to use for running Apache Spark and Apache Hadoop

July 26, 2017
James Malone

Product Manager, Google Cloud

Since its initial release, Cloud Dataproc has given users a faster, easier and more cost-effective way to run Apache Spark, Apache Hadoop and other components in the open source data processing ecosystem, bringing down the traditional barriers to success with those platforms. With 90-second cluster spin-up times, per-minute billing and fully managed infrastructure, Cloud Dataproc helps you rethink how you do operations.

Over the past few weeks, we’ve shipped several releases, including component updates, new features, fixes and API changes. Collectively, these releases bring more performance, ease of use and efficiency to the user experience, and give you access to the latest innovations from the open source community.

Cloud Dataproc 1.2

Cloud Dataproc uses image versions to bundle specific releases of Apache Spark, Apache Hadoop, Apache Hive and other open source components. Cloud Dataproc image version 1.2 contains a number of important updates and changes (especially related to performance), including:

Software component updates

  • Apache Spark has been updated to 2.2.0 (the current upstream release).
  • Apache Hadoop has been updated to 2.8.0 (the current upstream release).
Environment configuration changes

  • The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should use the CPU more efficiently for SSL operations and, in many cases, should result in better read and write performance between Cloud Dataproc and Cloud Storage.
  • The block size reported for Cloud Storage is now 128 MB.
  • Memory configurations for both Hadoop and Spark have been adjusted to improve performance and stability.
YARN changes

  • A number of YARN configuration changes are also included; see the image version list below for details.

You can review the Cloud Dataproc image version list for more information about all of the changes in Cloud Dataproc 1.2.
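
As a quick sketch of trying out the new image (the cluster name below is a placeholder), you can pin a cluster to image version 1.2 at creation time with the gcloud command-line tool, then optionally confirm the Spark installation by running the SparkPi example that ships on the image:

    # Create a cluster pinned to the Cloud Dataproc 1.2 image
    gcloud dataproc clusters create my-cluster --image-version 1.2

    # Optional: verify Spark by running the bundled SparkPi example
    gcloud dataproc jobs submit spark --cluster my-cluster \
        --class org.apache.spark.examples.SparkPi \
        --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000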

Regional endpoints

Regional endpoints are now generally available for Cloud Dataproc. Prior to regional endpoints, Cloud Dataproc had a single region: global. With the new regional endpoints, you can specify a Compute Engine region instead. You may want to use regional endpoints in a few cases, including:

  • To improve regional isolation and protection
  • To improve regional performance compared to the default "global" namespace
You can use regional endpoints with the Google Cloud Platform Console, the Google Cloud SDK (gcloud dataproc) and the Cloud Dataproc REST API.
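
For example, here's a minimal sketch of creating a cluster through the us-east1 regional endpoint (the cluster name and zone are placeholders). With the gcloud command-line tool you pass the region as a flag; with the REST API, the region appears in the resource path:

    # Create a cluster using the us-east1 regional endpoint
    gcloud dataproc clusters create my-cluster \
        --region us-east1 \
        --zone us-east1-b

    # The equivalent REST call targets the region in the URL path:
    # POST https://dataproc.googleapis.com/v1/projects/[PROJECT_ID]/regions/us-east1/clusters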

Autozone (Beta)

Cloud Dataproc can now automatically select a zone for you when you create a cluster. This feature is useful when you want to create a cluster in a specific region but have no preference for the zone within that region where the cluster is created. Cloud Dataproc autozone is currently available in beta through the Cloud SDK or the Cloud Dataproc REST API.
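
As a minimal sketch with the gcloud command-line tool (the cluster name is a placeholder), passing an empty string for the zone tells Cloud Dataproc to pick one for you within the region you chose:

    # Beta: let Cloud Dataproc select a zone within us-central1 automatically
    gcloud beta dataproc clusters create my-cluster \
        --region us-central1 \
        --zone ""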

Next steps

We hope these latest updates improve your Cloud Dataproc experience. To stay up to date on future Cloud Dataproc releases, you can subscribe to the Cloud Dataproc release XML feed or view the release notes. If you have any feedback about these or other Cloud Dataproc releases, please don’t hesitate to contact us via email or Stack Overflow (google-cloud-dataproc tag).
