Cloud Dataproc is now even faster and easier to use for running Apache Spark and Apache Hadoop
Product Manager, Google Cloud
Since its initial release, Cloud Dataproc has given users a faster, easier and more cost-effective way to run Apache Spark, Apache Hadoop and other components in the open source data processing ecosystem, bringing down the traditional barriers to success with those platforms. With 90-second cluster spin-up time, per-minute billing and fully managed infrastructure, Cloud Dataproc helps you rethink how you do operations.
Over the past few weeks, we’ve done several releases — including component updates, new features, fixes and API changes — which collectively bring more performance, ease of use and efficiency to the user experience, as well as access to the latest innovations from the open source community.
Cloud Dataproc 1.2
Cloud Dataproc uses image versions to bundle software components such as the various releases of Spark, Hadoop, Apache Hive and other projects. Cloud Dataproc image 1.2 contains a number of important updates and changes (especially related to performance), including:
Software component updates
- Apache Spark has been updated to 2.2.0 (upstream current).
- Apache Hadoop has been updated to 2.8.0 (upstream current).
- The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.
- The reported block size for Cloud Storage is now 128MB.
- Memory configurations for Hadoop and Spark have both been adjusted to improve performance and stability.
- YARN fair ordering is now enabled within the capacity scheduler.
- YARN graceful decommissioning is now supported if you enable it on your cluster.
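To try these component updates, you can pin a cluster to the 1.2 image at creation time, and opt into graceful decommissioning when scaling down. The commands below are a sketch using a hypothetical cluster name (`example-cluster`); adjust the name, project and sizing for your environment.

```shell
# Hypothetical cluster name; --image-version=1.2 selects the Dataproc 1.2
# image bundle (Spark 2.2.0, Hadoop 2.8.0, the updated Cloud Storage
# connector SSL provider, etc.).
gcloud dataproc clusters create example-cluster \
    --image-version=1.2

# When shrinking the cluster, give running YARN work time to finish
# (graceful decommissioning) before worker nodes are removed.
gcloud dataproc clusters update example-cluster \
    --num-workers=2 \
    --graceful-decommission-timeout=1h
```

The timeout value is the maximum time Dataproc will wait for in-flight YARN applications before forcibly removing a node; pick a value that matches the duration of your typical jobs.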
Regional endpoints
Regional endpoints are now generally available for Cloud Dataproc. Prior to regional endpoints, Cloud Dataproc had one region: global. With the new regional endpoints, you can specify a Compute Engine region instead. You may want regional endpoints in a few cases, including:
- To improve regional isolation and protection
- To improve regional performance compared to the default "global" namespace
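Selecting a regional endpoint is a matter of passing a Compute Engine region instead of the default `global` value. A minimal sketch, again with a hypothetical cluster name and an example region:

```shell
# --region routes the request through that region's Dataproc endpoint
# rather than the default "global" namespace; the zone must belong to
# the chosen region.
gcloud dataproc clusters create example-cluster \
    --region=us-east1 \
    --zone=us-east1-b
```

Subsequent commands that operate on the cluster (submitting jobs, updating, deleting) should pass the same `--region` so they reach the endpoint where the cluster lives.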