Release Notes

These release notes apply to the core Cloud Dataproc service. You can periodically check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

See the Cloud Dataproc version list for a list of current and past software components supported by the software images used for Cloud Dataproc virtual machines.

Subscribe to the Cloud Dataproc release notes.

Important cross-update notes

  • In the future, we will be migrating Cloud Dataproc material (such as initialization actions and documentation) from several GitHub repositories into this consolidated repository. This should make it easier to find all Cloud Dataproc-related materials in GitHub. During the migration, and for a period of time afterward, content will be available in both locations.

October 11, 2017

  • The fluentd configuration on Cloud Dataproc clusters has been updated to concatenate multi-line error messages. This should make error messages easier to locate.
  • Clusters created with Cloud Dataproc Workflows can now use auto-zone placement.
  • Starting with this release, sub-minor releases to the Cloud Dataproc images will be mentioned in the release notes.
  • New sub-minor versions of the Cloud Dataproc images – 1.0.53, 1.1.44, and 1.2.8.
  • Fixed a bug reading ORC files in Hive 2.1 on Cloud Dataproc 1.1. To fix this issue, HIVE-17448 has been patched into Hive 2.1.
  • Fixed an issue where Spark memoryOverhead was improperly set for clusters with high-memory master machines and low-memory workers. The memoryOverhead will now be appropriately set for these types of clusters.
  • The Cloud Dataproc agent has improved logic to start jobs in the order in which they were submitted.
  • The HUE initialization action has been fixed to work with Cloud Dataproc 1.2.
  • Fixed a bug where initialization action failures were not properly reported.

October 4, 2017

  • Cloud Dataproc Workflow Templates (Beta) – This new Cloud Dataproc resource allows jobs to be composed into a graph that executes on an ephemeral or an existing cluster. The template can create a cluster, run jobs, and then delete the cluster when the workflow is finished. Graph progress can be monitored by polling a single operation. See the Workflow Templates overview for more information.
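
A workflow template is assembled from a cluster definition and one or more jobs, then instantiated. As a rough sketch (template, cluster, step, and job values here are illustrative, not from this release), using the beta command group:

gcloud beta dataproc workflow-templates create my-workflow
gcloud beta dataproc workflow-templates set-managed-cluster my-workflow \
--cluster-name my-ephemeral-cluster
gcloud beta dataproc workflow-templates add-job spark \
--workflow-template my-workflow --step-id compute-pi \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar
gcloud beta dataproc workflow-templates instantiate my-workflow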

September 27, 2017

  • Cloud Dataproc Granular IAM (Beta) – Now you can set IAM roles and their corresponding permissions on a per-cluster basis. This provides a mechanism to apply different IAM settings to different Cloud Dataproc clusters. See the Cloud Dataproc IAM documentation for more information.
  • Fixed a bug which prevented Apache Pig and Apache Tez from working together in Cloud Dataproc 1.2. This fix was applied to Cloud Dataproc 1.1 in a previous release.
  • Fixed a bug involving Hive schema validation. This fix specifically addresses HIVE-17448 and HIVE-12274.

September 19, 2017

  • New Subminor Image Versions – The latest subminor image versions for 1.0, 1.1, and 1.2 now map to 1.0.51, 1.1.42, and 1.2.6, respectively.

September 6, 2017

  • Cluster Scheduled Deletion (Beta) – Cloud Dataproc clusters can now be created with a scheduled deletion policy. Clusters can be scheduled for deletion after a specified duration, at a specified time, or after a specified period of inactivity. See Cluster Scheduled Deletion for more information.
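
For example (cluster name illustrative), a cluster that is deleted after 30 minutes of idleness, or after two hours at the latest, might be created with the beta command group:

gcloud beta dataproc clusters create my-cluster \
--max-idle 30m --max-age 2h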

September 5, 2017

Cloud Dataproc is now available in the southamerica-east1 region (São Paulo, Brazil).

August 18, 2017

  • New Subminor Image Versions – The latest subminor image versions for 1.0, 1.1, and 1.2 now map to 1.0.49, 1.1.40, and 1.2.4, respectively.
  • All Cloud Dataproc clusters now have a goog-dataproc-cluster-name label that is propagated to underlying Google Compute Engine resources and can be used to determine combined Cloud Dataproc related costs in exported billing data.
  • PySpark drivers are now launched under a changed process group ID to allow the Cloud Dataproc agent to correctly clean up misbehaving or cancelled jobs.
  • Fixed a bug where updating cluster labels and the number of secondary workers in a single update resulted in a stuck update operation and an undeletable cluster.

August 8, 2017

Beginning today, Cloud Dataproc 1.2 will be the default version for new clusters. To use older versions of Cloud Dataproc, you will need to manually select the version on cluster creation.

August 4, 2017

Graceful decommissioning – Cloud Dataproc clusters running Cloud Dataproc 1.2 or later now support graceful YARN decommissioning, which enables the removal of nodes from the cluster without interrupting jobs in progress. A user-specified timeout determines how long to wait for jobs in progress to finish before nodes are forcefully removed. The Cloud Dataproc scaling documentation contains details on how to enable graceful decommissioning.
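
As a sketch (cluster name and values illustrative; depending on your Cloud SDK version this may require the beta command group), scaling a cluster down while giving running jobs up to one hour to finish:

gcloud dataproc clusters update my-cluster \
--num-workers 2 --graceful-decommission-timeout 1h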

Apache Hadoop on Cloud Dataproc 1.2 has been updated to version 2.8.1.

August 1, 2017

Cloud Dataproc is now available in the europe-west3 region (Frankfurt, Germany).

July 21, 2017

  • Cloud Dataproc 1.2 – A new image version for Cloud Dataproc is now generally available: 1.2. It will become the default image version for new clusters starting in 2 weeks. See the Cloud Dataproc version list for more information. Some important changes included in this new image version:
    • Apache Spark has been updated to version 2.2.0.
    • Apache Hadoop has been updated to version 2.8.0.
    • The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.
    • The reported block size for Cloud Storage is now 128 MB.
    • Memory configuration for Hadoop and Spark has been adjusted to improve performance and stability.
    • HDFS daemons no longer use ephemeral ports in accordance with new port assignments outlined in HDFS-9427. This eliminates certain rare race conditions that could cause occasional daemon startup failures.
    • YARN Capacity Scheduler fair ordering from YARN-3319 is now enabled by default.

Starting with the Cloud Dataproc 1.2 release, the ALPN boot jars are no longer provided on the Cloud Dataproc image. To avoid Spark job breakage, upgrade Cloud Bigtable client versions, and bundle boringssl-static with Cloud Dataproc jobs. Our initialization action repository contains initialization actions to revert to the previous (deprecated) behavior of including the jetty-alpn boot jar. This change should only impact you if you use Cloud Bigtable or other Java gRPC clients from Cloud Dataproc.

July 11, 2017

  • Spark 2.2.0 in Preview – The Cloud Dataproc preview image has been updated to Spark 2.2.0.

June 28, 2017

  • Regional endpoints generally available – Regional endpoints for Cloud Dataproc are now generally available.
  • Auto Zone (Beta) – When you create a new cluster, as an alternative to choosing a zone you can use the Cloud Dataproc Auto Zone feature to let Cloud Dataproc select a zone within your selected region for the placement of the cluster. See the sketch after this list.
  • Conscrypt for Cloud Storage connector – The default security (SSL) provider used by the Cloud Storage connector on the Cloud Dataproc preview image has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.
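
As a minimal sketch of Auto Zone placement (cluster name illustrative), specify a region and omit the zone:

gcloud dataproc clusters create my-cluster --region us-central1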

June 26, 2017

  • The v1alpha1 and v1beta1 Cloud Dataproc APIs are now deprecated and cannot be used. Instead, you should use the current v1 API.

June 20, 2017

Cloud Dataproc is now available in the australia-southeast1 region (Sydney).

June 6, 2017

Cloud Dataproc is now available in the europe-west2 region (London).

April 28, 2017

Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.1 and the BigQuery connector has been upgraded to bigquery-connector-0.10.2. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.

April 12, 2017

Apache Hive on Cloud Dataproc 1.1 has been updated to version 2.1.1.

April 7, 2017

Cloud Dataproc worker IAM role – A new Cloud Dataproc IAM role called Dataproc Worker (roles/dataproc.worker) has been added. This role is intended specifically for use with service accounts.
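
For example (project and service account names illustrative), the role can be granted to a service account with:

gcloud projects add-iam-policy-binding my-project \
--member serviceAccount:my-sa@my-project.iam.gserviceaccount.com \
--role roles/dataproc.worker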

The Conscrypt security provider has been temporarily changed from the default to an optional security provider. This change was made due to incompatibilities with some workloads. The Conscrypt provider will be re-enabled as the default with the release of Cloud Dataproc 1.2 in the future. In the meantime, you can re-enable the Conscrypt provider when creating a cluster by specifying this Cloud Dataproc property:

--properties dataproc:dataproc.conscrypt.provider.enable=true
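
For example, as a full cluster-creation command (cluster name illustrative):

gcloud dataproc clusters create my-cluster \
--properties dataproc:dataproc.conscrypt.provider.enable=true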

March 30, 2017

Conscrypt for Cloud Storage connector – The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.

Updates to user labels applied to Cloud Dataproc clusters will now be applied to managed instance group templates. Since preemptible virtual machines are included in a managed instance group, label updates are now applied to preemptible VMs.

March 17, 2017

As mentioned in the February 9th release notes, Cloud Audit Logs for Cloud Dataproc are no longer emitted to the dataproc_cluster resource type. Starting with this release, Cloud Audit Logs are emitted to the new cloud_dataproc_cluster resource type.

The gcloud command-line tool now requires a double dash (--) between gcloud-specific arguments and the arguments passed to the submitted job. For example, if you used this command in the past:

gcloud dataproc jobs submit spark --cluster example-cluster \
--class sample_class --jars jar_file 1000

the new format requires a double dash, with a space before and after it:

gcloud dataproc jobs submit spark --cluster example-cluster \
--class sample_class --jars jar_file -- 1000

March 1, 2017

  • Restartable jobs (Beta) – Cloud Dataproc jobs now have an optional setting to restart jobs that have failed. When you set a job to restart, you specify the maximum number of retries per hour. Restartable jobs allow you to mitigate common types of job failure and are especially useful for long-running and streaming jobs (see the sketch after this list).
  • Single node clusters (Beta) – Single node clusters are Cloud Dataproc clusters with only one node, which acts as both the master and worker for your Cloud Dataproc cluster. Single node clusters are useful for a number of activities, including development, education, and lightweight data science.
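
A sketch of both features (cluster name, job file, and retry count illustrative; these options were in the beta command group at the time of this release):

# Create a single node cluster (one VM acting as master and worker).
gcloud beta dataproc clusters create my-sandbox --single-node

# Submit a job that may be restarted up to 5 times per hour on failure.
gcloud beta dataproc jobs submit pyspark my_job.py \
--cluster my-sandbox --max-failures-per-hour 5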

February 9, 2017

  • Cloud Dataproc Stackdriver Logging Changes
    • With new images, cluster logs are now exported to Stackdriver as resource type cloud_dataproc_cluster (was previously dataproc_cluster).
    • Cloud Audit logs will be emitted to both cloud_dataproc_cluster and dataproc_cluster (deprecated) until the March 9th release.
    • Stackdriver logs for new images are indexed first by cluster name and then cluster UUID to assist in filtering logs by cluster name or cluster instance.
  • Cloud Dataproc Stackdriver Monitoring Changes
  • Cloud Dataproc User Labels Changes

January 19, 2017

  • Cloud Dataproc 1.2 preview – The preview image has been updated to reflect the planned Cloud Dataproc 1.2 release. This image includes Apache Spark 2.1 and Apache Hadoop 2.8-SNAPSHOT. The preview image provides access to Hadoop 2.8 release candidates now, and will provide access to Hadoop 2.8 in Cloud Dataproc 1.2 once Hadoop 2.8 is formally released.

January 5, 2017

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.0 and the BigQuery connector has been upgraded to bigquery-connector-0.10.1. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.
  • The diagnose command has been updated to include the jstack output of the agent and spawned drivers.

December 16, 2016

  • Google Stackdriver Agent Installed – The Stackdriver monitoring agent is now installed by default on Cloud Dataproc clusters. The Cloud Dataproc Stackdriver monitoring documentation has information on how to use Stackdriver monitoring with Cloud Dataproc. The agent for monitoring and logging can be enabled and disabled by adjusting the cluster properties when you create a cluster.
  • Cloud Dataproc 1.1.15 and 1.0.24 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.

December 7, 2016

  • Starting with this release, the Google Cloud Dataproc API must be enabled for your project for Cloud Dataproc to function properly. You can use the API dashboard to enable the Cloud Dataproc API. Existing projects with the Cloud Dataproc API enabled will not be impacted.
  • Cloud Dataproc 1.1.14 and 1.0.23 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.
  • Increased the number of situations in which Cloud Dataproc services are automatically restarted by systemd on clusters in the event of unexpected or unhealthy behavior.

November 29, 2016

  • Custom service account support – When creating a Cloud Dataproc cluster, you can now specify a user-managed (non-default) service account. This service account will be used to run the Compute Engine virtual machines in your cluster. This enables much more fine-grained permissions across services for each individual cluster. See the service account documentation for more information, and the example after this list.
  • Cloud Dataproc 1.1.13 and 1.0.22 – The 1.1 image for Cloud Dataproc has been updated to include support for Apache Spark 2.0.2, Apache Zeppelin 0.6.2, and Apache Flink 1.1.3. The 1.1 and 1.0 images have also been updated with non-impacting bug fixes and enhancements. See the Cloud Dataproc version list for more information about Cloud Dataproc image versions.
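
For example (cluster and service account names illustrative; this may require the beta command group depending on your Cloud SDK version):

gcloud dataproc clusters create my-cluster \
--service-account my-cluster-sa@my-project.iam.gserviceaccount.com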

November 14, 2016

  • Fixed an issue where the --jars argument was missing from gcloud dataproc jobs submit pyspark command.

November 08, 2016

  • Google BigQuery connector upgrade – The BigQuery connector has been upgraded to bigquery-connector-0.10.1-SNAPSHOT. This version introduces the new IndirectBigQueryOutputFormat, which uses Hadoop output formats to write directly to a temporary Google Cloud Storage bucket, then issues a single BigQuery load job per Hadoop/Spark job at job-commit time. For more information, review the BigQuery change notes in the GitHub repository.

November 2, 2016

  • User Labels [BETA] – You can now apply user-specified key=value labels to Cloud Dataproc clusters and jobs. This allows you to group resources and related operations for later filtering and listing. As an example, you can use labels with clusters to break out Cloud Dataproc usage by groups or individuals. For more information, see the user labels documentation; a usage sketch follows this list.
  • Fixed an issue where failures during a cluster update caused the cluster to fail. Now, update failures return the cluster to Running state.
  • Fixed an issue where submitting a large number of jobs rapidly or over a long period of time caused a cluster to fail.
  • Increased the maximum number of concurrent jobs per cluster.
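
A sketch of applying labels at creation and then filtering by them (cluster name and label values illustrative):

gcloud beta dataproc clusters create my-cluster --labels env=test,team=data
gcloud beta dataproc clusters list --filter='labels.env = test'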

October 18, 2016

  • Fixed an issue where HiveServer2 was not healthy for up to 60 seconds after the cluster was deployed. Hive jobs should now successfully connect to the required HiveServer2 immediately after a cluster is deployed.

October 11, 2016

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.4 and the BigQuery connector has been upgraded to bigquery-connector-0.8.0. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.
  • dataproc.localssd.mount.enable – Added the new property dataproc.localssd.mount.enable that can be set at cluster deployment time to make Cloud Dataproc ignore local SSDs. If set to false, Cloud Dataproc will use the main persistent disks for HDFS and temporary Hadoop directories so that the local SSDs can be used separately for user-defined purposes. This property can be set by using the argument --properties dataproc:dataproc.localssd.mount.enable=false when creating a Cloud Dataproc cluster.
  • Fixed an issue where CPU quota validation for preemptible virtual machines was being performed against the non-preemptible CPU quota even when preemptible CPU quota was set.

October 07, 2016

  • Google Cloud Platform Console
    • Now, up to 8 local SSDs can be added to worker nodes. The previous limit was 4.
    • When looking at a cluster's details, the "Jobs" page now shows Stop and Delete buttons for every job in the list. Previously, the buttons were only visible on the row where the mouse hovered.
  • Optimized the listing of resources by state and cluster UUID. This should reduce several list operations from seconds to milliseconds.

September 29, 2016

  • Hadoop High Availability Mode [BETA] – Cloud Dataproc clusters can be created with high availability mode enabled. This is an optional feature when creating a cluster. In high availability mode, Cloud Dataproc clusters have three master nodes instead of one. This enables both HDFS High Availability and YARN High Availability to allow uninterrupted YARN and HDFS operations despite any single-node failures or reboots.

    Presently, this feature is available when creating clusters with the gcloud command-line tool or the Cloud Dataproc REST API. A future release will enable support for creating clusters with high availability in the Google Cloud Platform Console.

    See the high availability mode documentation for more information. A command sketch follows this list.

  • Optimized how jobs are listed based on state or cluster uuid. This should significantly decrease the time required to list jobs.
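
A minimal sketch of creating a high availability cluster (cluster name illustrative):

gcloud beta dataproc clusters create my-ha-cluster --num-masters 3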

September 22, 2016

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.3 and the BigQuery connector has been upgraded to bigquery-connector-0.7.9. For more information, review the change notes in the GitHub repository.
  • While Cloud Dataproc has been using Java 8 since its Beta launch in September 2015, there is now a hard dependency on Java 8 or higher.
  • The --preemptible-worker-boot-disk-size flag no longer requires that you specify 0 preemptible workers if you do not want to add preemptible machines when you create a cluster.

September 16, 2016

  • Preemptible boot disk sizes - The disk size for preemptible workers can now be set at cluster creation with the gcloud command-line tool using the --preemptible-worker-boot-disk-size flag, even when no preemptible workers are added to the cluster.

September 1, 2016

  • Identity & Access Management support [BETA] - Cloud Dataproc now has beta support for Google Cloud Identity and Access Management (IAM). Cloud Dataproc IAM permissions allow users to perform specific actions on Cloud Dataproc clusters, jobs, and operations. See Cloud Dataproc Permissions and IAM Roles for more information.
  • LZO support - Cloud Dataproc clusters now natively support the LZO data compression format.
  • Google Stackdriver logging toggle - It is now possible to disable Google Stackdriver logging on Cloud Dataproc clusters. To disable Stackdriver logging, use the command `--properties dataproc:dataproc.logging.stackdriver.enable=false` when creating a cluster with the `gcloud` command-line tool.
  • Cluster resource definitions for newly deployed clusters now display a fully resolved sub-minor image version (e.g., 1.0.11 instead of 1.0). This makes it easier to temporarily revert to an older sub-minor version. See the Cloud Dataproc versioning documentation for more information.
  • The message displayed after submitting a long-running operation in the Google Cloud Platform Console, such as creating or deleting a cluster, will now indicate the operation has been "submitted" rather than "has succeeded."

August 25, 2016

  • Cloud Dataproc 1.1 default – Cloud Dataproc 1.1 is now the default image version for new clusters.
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.2 and the BigQuery connector has been upgraded to bigquery-connector-0.7.8, which bring performance improvements. See the release notes for the gcs-connector and bigquery-connector for more information.
  • Apache Zeppelin 0.6.1 – The Apache Zeppelin package built for Cloud Dataproc and installable with this initialization action has been upgraded to version 0.6.1. This new version of Zeppelin brings support for Google BigQuery.
  • Fixed an issue where adding many (~200+) nodes to a cluster would cause some nodes to fail.
  • Fixed an issue where output from initialization actions that timed out was not copied to Google Cloud Storage.

August 16, 2016

The two Cloud Dataproc image versions released during the Cloud Dataproc beta, 0.1 and 0.2, will no longer receive updates. You can continue to use the beta images; however, no new updates, such as bug fixes and connector updates, will be applied to these two deprecated image versions.
Image versions released after Cloud Dataproc became generally available, starting with 1.0, will be subject to the Cloud Dataproc versioning policy.

August 8, 2016

Cloud Dataproc 1.1 – A new image version, Cloud Dataproc 1.1, has been released. Several components have been updated for this image version.

To create a cluster with the 1.1 image, you can use the gcloud command-line tool with the --image-version argument, such as gcloud dataproc clusters create --image-version 1.1.

Cloud SDK release 121.0.0 – Updated several gcloud dataproc arguments.
  • The --preemptible-worker-boot-disk-size argument has been promoted to general availability and can be used to adjust the persistent disk size (in GB) of preemptible workers.
  • The --master-boot-disk-size-gb and --worker-boot-disk-size-gb arguments have been removed. Use --master-boot-disk-size and --worker-boot-disk-size instead.

August 2, 2016

Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.1 and the BigQuery connector has been upgraded to bigquery-connector-0.7.7. See the release notes for the gcs-connector and bigquery-connector for more information.
Updated preview image – Several preview image components have been updated.
Fixed an issue in which the NFS-based Google Cloud Storage consistency cache would not get cleaned on long-running clusters with sustained high file creation rates (> ~1,000,000 files per hour for a sustained period of time).

July 19, 2016

New Features

  • Support for new us-west1 region - Cloud Dataproc is available from day one in the newly announced us-west1 region. As mentioned in the announcement, some West Coast users may see reductions in latency.
  • Apache Spark upgrade to 1.6.2 - Apache Spark on the 1.0 Cloud Dataproc image version has been upgraded from 1.6.1 to 1.6.2.
  • Cloud Storage and BigQuery connector upgrades - The Cloud Storage connector has been upgraded to gcs-connector-1.5.0 and the BigQuery connector has been upgraded to bigquery-connector-0.7.6. These new versions bring a number of new features and fixes.
    • Appendable output streams - GHFS (Google Hadoop File System) now contains an option to enable support for appendable output streams. You can enable this option by setting the fs.gs.outputstream.type property to SYNCABLE_COMPOSITE.
    • Auto retries on 429 errors - HTTP 429 (rate limit) errors from Google APIs will now be automatically retried with a backoff.
    • Cloud Storage performance - Improved the Cloud Storage connector read performance, especially for lots of small reads or lots of seeks. See the detailed change log for more information.

Bugfixes

  • Google Cloud Platform Console
    • The Google Cloud Platform Console now uses the Cloud Dataproc v1 API instead of the v1beta1 API. Clicking on the equivalent REST link will show the appropriate v1 API paths and resource names.
  • Fixed an issue where some HDFS nodes did not join a cluster because their domain name could not be resolved on first boot.

July 1, 2016

New Features

  • gcloud command-line tool
    • Added the flag --preemptible-worker-boot-disk-size which can be used to adjust the boot disk size of preemptible workers. This was added in the gcloud beta track.
    • The --*-boot-disk-size-gb flag is now deprecated in all tracks and has been replaced by the --*-boot-disk-size flag.

Bugfixes

  • Fixed a bug introduced in a June release that caused clusters to fail only after waiting for ~30 minutes. This occurred most frequently when initialization actions failed during cluster creation. Now clusters should fail within 1 minute of an initialization action failure.
  • Decreased job startup time for SparkSQL jobs with partitioned/nested directories by applying a patch for Spark (SPARK-9926)
  • Further optimized job startup time for any job with a lot of file inputs by applying a patch for Hadoop (HADOOP-12810)

May 4, 2016

New Features

  • Cloud SQL Initialization Action - Cloud Dataproc now has a Cloud SQL I/O and Hive Metastore initialization action. This initialization action installs a Google Cloud SQL proxy on every node in a Cloud Dataproc cluster. It also configures the cluster to store Apache Hive metadata on a given Cloud SQL instance.

April 29, 2016

Bugfixes

  • The staging directory for a Cloud Dataproc job is now automatically cleared when a job completes.
  • If a cluster fails to delete properly, it will now be transitioned to a FAILED state instead of remaining in a DELETING state.
  • Fixed an issue which prevented the Cloud Dataproc --properties flag from changing MapReduce properties.
  • Fixed a bug where an exception would be thrown when trying to set YARN log-aggregation to output to Cloud Storage (related to YARN-3269).

March 30, 2016

New Features

  • Spark 1.6.1 - The Cloud Dataproc image version 1.0 has been updated to include the Spark 1.6.1 maintenance release instead of Spark 1.6.0.
  • OSS upgrades - This release upgrades the Google Cloud Storage and Google BigQuery connectors to gcs-connector-1.4.5 and bigquery-connector-0.7.5, respectively.

Bugfixes

  • It is now possible to specify --num-preemptible-workers 0 through the gcloud command-line tool. Previously this would fail.
  • Fixed a validation issue which produced 500 HTTP errors when the response should have been 400 bad input or 200 OK.
  • Resolved a cache validation issue and re-enabled directory inference for the Cloud Storage connector (fs.gs.implicit.dir.infer.enable).
  • Adjusted Compute Engine migration settings for unexpected host failures: normal VMs will now automatically restart after migration and preemptible machines will not. Previously, all VMs were set to not automatically restart after a migration.
  • Addressed an issue where rapid job submission would result in a Too many pending operations on a resource error.

March 8, 2016

New Features

  • Subnetwork support - Cloud Dataproc now supports subnetworks through the gcloud command-line tool. You can now use the --subnet SUBNET flag to specify a subnetwork when you create a Cloud Dataproc cluster, as shown in the example below.
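
For example (cluster and subnetwork names illustrative):

gcloud dataproc clusters create my-cluster --subnet my-subnetwork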

Bugfixes

  • Added strict validation of full compute resource URIs. The following patterns are supported:
    • https://<authority>/compute/<version>/projects/...
    • compute/<version>/projects/...
    • projects/...
  • Fixed an issue where disk quota was not being checked when a cluster size was increased.

February 22, 2016

Cloud Dataproc is now generally available! For more information, see our announcement blog post.

New Features

  • Custom Compute Engine machine types - Cloud Dataproc clusters now support custom Compute Engine machine types for both master and worker nodes. This means you can create clusters with customized amounts of virtual CPUs and memory. For more information, please read the Dataproc documentation for custom machine types.
  • OSS upgrades - We have released Cloud Dataproc version 1.0. This release includes an upgrade to Apache Spark 1.6.0 and Apache Hadoop 2.7.2. This release also includes new versions of the Google Cloud Storage and Google BigQuery connectors.
  • v1 API - The v1 API for Cloud Dataproc is now live. This API includes support for regionality along with minor fixes and adjustments. The API is available in the APIs Explorer and also has a Maven artifact in Maven Central. For more information, review the REST API documentation.
  • Support for --jars for PySpark - Added support for using the --jars option for PySpark jobs.
  • API auto-enable - Enabling the Cloud Dataproc API now automatically enables required dependent APIs, such as Google Cloud Storage and Google Compute Engine.

Bugfixes

  • Resolved several issues that occasionally caused some clusters to hang while being scaled down.
  • Improved validation of certain types of malformed URLs, which previously failed during cluster deployment.

February 3, 2016

New Features

  • A new --image-version option has been added: preview
    • Unlike other numerical versions like 0.1 and 0.2, the preview version will contain newer Hadoop/Spark/Pig/Hive components targeted for potential release in the next stable Cloud Dataproc distribution version, and will change over time.
    • As of February 3, 2016, the preview version contains Spark 1.6.0, with the same Hadoop/Pig/Hive versions as Cloud Dataproc 0.2.
    • The preview option is being rolled out to the Google Cloud Platform Console gradually, so it may not be visible in your account for another week or so. For all users, the preview option is accessible by deploying clusters with the gcloud command-line tool.

Bugfixes

  • Improved reliability of the DeleteJob command.
  • Fixed a bug that caused jobs to remain in a RUNNING state after the job completed successfully.

January 27, 2016

New Features

  • Two new options have been added to the Cloud Dataproc gcloud command-line tool for adding tags and metadata to virtual machines used in Cloud Dataproc clusters. These tags and metadata will apply to both regular and preemptible instances.
    • The --tags option will add tags to the Google Compute Engine instances in a cluster. For example, using the argument --tags foo,bar,baz will add three tags to the virtual machine instances in the cluster.
    • The --metadata option will add metadata to the Compute Engine instances. For example, using --metadata 'meta1=value1,key1=value2' will add two key-value pairs of metadata. A combined example follows this list.
  • Support for heterogeneous clusters where the master node and worker nodes have different amounts of memory. Some memory settings were previously based on the master node which caused some problems as described in this Stack Overflow question. Cloud Dataproc now better supports clusters with master and worker nodes which use different machine types.
  • Google Cloud Platform Console
    • The Output tab for a job now includes a Line wrapping option to make it easier to view job output containing very long lines
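
Combining both options into a single cluster-creation command (cluster name illustrative; tag and metadata values taken from the examples above, using the beta command group available at the time):

gcloud beta dataproc clusters create my-cluster \
--tags foo,bar,baz --metadata 'meta1=value1,key1=value2'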

Bugfixes

  • Fixed two issues which would sometimes cause virtual machines to remain active after a cluster deletion request was submitted
  • The Spark maxExecutors setting is now set to 10000 to avoid the AppMaster failing on jobs with many tasks
  • Improved handling for aggressive job submission by making several changes to the Cloud Dataproc agent, including:
    • Limiting the number of concurrent jobs so they are proportional to the memory of the master node
    • Checking free memory before scheduling new jobs
    • Rate limiting how many jobs can be scheduled per cycle
  • Improved how HDFS capacity is calculated before commissioning or decommissioning nodes to prevent excessively long updates

January 21, 2016

New Features

  • The dataproc command in the Google Cloud SDK now includes a --properties option for adding or updating properties in some cluster configuration files, such as core-site.xml. Properties are mapped to configuration files by specifying a prefix, such as core:io.serializations. This option makes it possible to modify multiple properties and files when creating a cluster (see the sketch after this list). For more information, see the Cloud Dataproc documentation for the --properties option.
  • Google Cloud Platform Console
    • An option has been added to the “Create Clusters” form to enable the cloud-platform scope for a cluster. This lets you view and manage data across all Google Cloud Platform services from Cloud Dataproc clusters. You can find this option by expanding the Preemptible workers, bucket, network, version, initialization, & access options section at the bottom of the form.
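
For example (cluster name and property values illustrative), setting one core-site.xml property and one Spark property at creation time:

gcloud beta dataproc clusters create my-cluster \
--properties 'core:io.serializations=org.apache.hadoop.io.serializer.WritableSerialization,spark:spark.executor.memory=4g'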

Bugfixes

  • SparkR jobs no longer immediately fail with a “permission denied” error (Spark JIRA issue)
  • Configuring logging for Spark jobs with the --driver-logging-levels option no longer interferes with Java driver options
  • Google Cloud Platform Console
    • The error shown for improperly-formatted initialization actions now properly appears with information about the problem
    • Very long error messages now include a scrollbar so the Close button remains on-screen

January 7, 2016

Bugfixes

  • Fixed issue in Dataproc version 0.1 that caused zero-byte _SUCCESS and _FAILURE files for each job to be continually re-written to Cloud Storage.

December 16, 2015

New Features

  • Cloud Dataproc clusters now have vim, git, and bash-completion installed by default
  • The Cloud Dataproc API now has an official Maven artifact, Javadocs, and a downloadable .zip file
  • Google Cloud Platform Console
    • Properties can now be specified when submitting a job, and can be seen in the Configuration tab of a job
    • A Clone button has been added that allows you to easily copy all information about a job to a new job submission form
    • The left-side icons for Clusters and Jobs are now custom icons rather than generic ones
    • An Image version field has been added to the bottom of the create cluster form that allows you to select a specific Cloud Dataproc image version when creating a cluster
    • A VM Instances tab has been added on the cluster detail page, which you can use to display a list of all VMs in a cluster and easily SSH into the master node
    • An Initialization Actions field has been added to the bottom of the create cluster form, which allows you to specify initialization actions when creating a cluster
    • Paths to Google Cloud Storage buckets that are displayed in error messages are now clickable links.

Bugfixes

  • Forced distcp settings to match mapred-site.xml settings to provide additional fixes for the distcp command (see this related JIRA)
  • Ensured that workers created during an update do not join the cluster until after custom initialization actions are complete
  • Ensured that workers always disconnect from a cluster when the Cloud Dataproc agent is shut down
  • Fixed a race condition in the API frontend that occurred when validating a request and marking a cluster as updating
  • Enhanced validation checks for quota, Cloud Dataproc image, and initialization actions when updating clusters
  • Improved handling of jobs when the Cloud Dataproc agent is restarted
  • Google Cloud Platform Console
    • Allowed duplicate arguments when submitting a job
    • Replaced generic Failed to load message with details about the cause of an error when an error occurs that is not related to Cloud Dataproc
    • When a single jar file for a job is submitted, allowed it to be listed only in the Main class or jar field on the Submit a Job form, and no longer required it to also be listed in the Jar files field

November 18, 2015

Rollouts are staged over four days; a release should be fully deployed and available in your Cloud Dataproc clusters by the end of the fourth day after its announced release date.

New features

  • Version selection - With the release of Cloud Dataproc version 0.2, you can now select among different versions of Cloud Dataproc (see Cloud Dataproc Versioning for information on support for previous versions and the Cloud Dataproc Version List for a list of the software components supported in each version). You can select a Cloud Dataproc version when creating a cluster through the Cloud Dataproc API, the Cloud SDK (using the gcloud beta dataproc clusters create --image-version command), or through the Google Cloud Platform Console. Note that within four days of the release of a new version in a region, the new version will become the default version used to create new clusters in the region.
  • OSS upgrades - We have released Cloud Dataproc version 0.2. The new Spark component includes a number of bug fixes. The new Hive component enables use of the hive command, contains performance improvements, and has a new metastore.
  • Connector updates - We released updates to our BigQuery and Google Cloud Storage connectors (0.7.3 and 1.4.3, respectively). These updates fix a number of bugs, and the new versions are now included in Cloud Dataproc version 0.2.
  • Hive Metastore - We introduced a MySQL-based per-cluster persistent metastore, which is shared between Hive and SparkSQL. This also fixes the hive command.
  • More Native Libraries - Cloud Dataproc now includes native Snappy libraries. It also includes native BLAS, LAPACK and ARPACK libraries for Spark’s MLlib.
  • Clusters --diagnose command - The Cloud SDK now includes a --diagnose command for gathering logging and diagnostic information about your cluster. More details about this command are available in the Cloud Dataproc support documentation.
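
For example (cluster name illustrative):

gcloud beta dataproc clusters diagnose my-cluster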

Bugfixes

  • Fixed the ability to delete jobs that fast-failed before some cluster and staging directories were created
  • Fixed some remaining errors with vmem settings when using the distcp command
  • Fixed a rare bug in which underlying Compute Engine issues could lead to VM instances failing to be deleted after the Cloud Dataproc cluster had been successfully deleted
  • The hive command has been fixed
  • Fixed error reporting when updating the number of workers (standard and preemptible) in a cluster
  • Fixed some cases when Rate Limit Exceeded errors occurred
  • The maximum cluster name length is now correctly 55 instead of 56 characters
  • Google Cloud Platform Console
    • Cluster list now includes a Created column, and the cluster configuration tab now includes a Created field, showing the creation time of the cluster
    • In the cluster-create screen, cluster memory sizes greater than 999 GB are now displayed in TB
    • Fields that were missing from the PySpark and Hive job configuration tab (Additional Python Files and Jar Files) have been added
    • The option to add preemptible nodes when creating a cluster is now in the “expander” at the bottom of the form
    • Machine types with insufficient memory (less than 3.5 GB) are no longer displayed in the list of machine types (previously, selecting one of these small machine types would lead to an error from the backend)
    • The placeholder text in the Arguments field of the submit-job form has been corrected

Core service improvements

  • If set, a project's default zone setting is now used as the default value for the zone in the create-cluster form in the Cloud Platform Console.

Optimizations

  • Hive performance has been greatly increased, especially for partitioned tables with a large number of partitions
  • Multithreaded listStatus has now been enabled, which should speed up job startup time for FileInputFormats reading large numbers of files and directories in Google Cloud Storage

October 15, 2015

Bugfixes

  • Fixed a bug in which DataNodes failed to register with the NameNode on startup, resulting in less-than-expected HDFS capacity.
  • Prevented the submission of jobs in an Error state.
  • Fixed bug that prevented clusters from deleting cleanly in some situations.
  • Reduced HTTP 500 errors when deploying Cloud Dataproc clusters.
  • Corrected distcp out-of-memory errors with better cluster configuration.
  • Fixed a situation in which jobs failed to delete properly and were stuck in a Deleting state.

Core service improvements

  • Provided more detailed error information, surfacing specific 4xx errors where appropriate instead of generic HTTP 500 errors.
  • Added information on existing resources for Resource already exists errors.
  • Specific information is now provided instead of a generic error message for errors related to Google Cloud Storage.
  • Listing operations now support pagination.

Optimizations

  • Significantly improved YARN utilization for MapReduce jobs running directly against Cloud Storage.
  • Made adjustments to yarn.scheduler.capacity.maximum-am-resource-percent to enable better utilization and concurrent job support.
