
Cloud Dataproc is now even faster and easier to use for running Apache Spark and Apache Hadoop

Wednesday, July 26, 2017

By James Malone, Product Manager

Since its initial release, Cloud Dataproc has given users a faster, easier and more cost-effective way to run Apache Spark, Apache Hadoop and other components in the open source data processing ecosystem, bringing down the traditional barriers to success with those platforms. With 90-second cluster spin-up times, per-minute billing and fully managed infrastructure, Cloud Dataproc helps you rethink how you do operations.

Over the past few weeks, we’ve made several releases that include component updates, new features, fixes and API changes. Collectively, these releases add yet more performance, ease of use and efficiency to the user experience, and give you access to the latest innovations from the open source community.

Cloud Dataproc 1.2

Cloud Dataproc uses image versions to bundle software components such as the various releases of Spark, Hadoop, Apache Hive and other projects. Cloud Dataproc image 1.2 contains a number of important updates and changes (especially related to performance), including:

Software component updates

  • Apache Spark has been updated to 2.2.0 (upstream current).
  • Apache Hadoop has been updated to 2.8.0 (upstream current).

Environment configuration changes

  • The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should use the CPU more efficiently for SSL operations and, in many cases, should result in better read and write performance between Cloud Dataproc and Cloud Storage.
  • The reported block size for Cloud Storage is now 128MB.
  • The default memory configurations for both Hadoop and Spark have been adjusted to improve performance and stability.

YARN changes

You can review the Cloud Dataproc image version list for details about all of the changes in Cloud Dataproc 1.2, including the YARN changes.
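
To try these changes, you can pin the new image at cluster creation time and, if needed, override individual defaults through cluster properties. Here's a minimal sketch using the Cloud SDK (the cluster name and property values are illustrative, not recommendations):

  # Create a cluster on the Cloud Dataproc 1.2 image
  gcloud dataproc clusters create my-cluster --image-version 1.2

  # Optionally override defaults at creation time, for example the Cloud
  # Storage connector block size (128MB shown here, in bytes) and the
  # Spark executor memory
  gcloud dataproc clusters create my-cluster \
      --image-version 1.2 \
      --properties 'core:fs.gs.block.size=134217728,spark:spark.executor.memory=4g'

The core: and spark: prefixes route each property to the matching configuration file (core-site.xml and spark-defaults.conf, respectively).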

Regional endpoints

Regional endpoints are now generally available for Cloud Dataproc. Prior to regional endpoints, Cloud Dataproc had one region: global. With the new regional endpoints, you can specify a Compute Engine region instead. You may want regional endpoints in a few cases, including:

  • To improve regional isolation and protection
  • To improve regional performance compared to the default "global" namespace

You can use regional endpoints with the Google Developers Console, the Google Cloud SDK (gcloud dataproc), and the Cloud Dataproc REST API.
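
For example, with the Cloud SDK you can direct requests at a specific regional endpoint by passing the --region flag (the cluster name and region here are illustrative):

  # Create a cluster through the europe-west1 regional endpoint
  # rather than the default "global" endpoint
  gcloud dataproc clusters create my-cluster --region europe-west1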

Autozone (Beta)

Cloud Dataproc can now automatically select a zone when you create a cluster. This feature is useful if you want to create a cluster in a specific region but have no preference for which zone within that region the cluster is placed. Cloud Dataproc autozone is currently available in beta through the Cloud SDK or the Cloud Dataproc REST API.
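
Here's a minimal sketch with the Cloud SDK (the cluster name and region are illustrative; while autozone remains in beta, the command may need to run under the gcloud beta group depending on your SDK version):

  # Omit the --zone flag and let Cloud Dataproc pick a suitable zone
  # within us-central1 automatically
  gcloud beta dataproc clusters create my-cluster --region us-central1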

Next steps

We hope these latest updates improve your Cloud Dataproc experience. To stay up to date on future Cloud Dataproc releases, you can subscribe to the Cloud Dataproc release XML feed or view the release notes. If you have any feedback about these or other Cloud Dataproc releases, please don’t hesitate to contact us via email or Stack Overflow (google-cloud-dataproc tag).
