Release Notes

These release notes apply to the core Cloud Dataproc service. You can periodically check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

See the Cloud Dataproc version list for a list of current and past software components supported by the software images used for Cloud Dataproc virtual machines.

Subscribe to the Cloud Dataproc release notes.

Important cross-update notes

In the future, Cloud Dataproc material (such as initialization actions and documentation) will be migrated from several GitHub repositories into a single consolidated repository. This migration should make it easier to find all Cloud Dataproc-related materials in GitHub. During the migration, and for a period of time afterward, content will be available in both locations.

August 10, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.90, 1.1.81, 1.2.45, 1.3.5
  • Set the maximum number of open files to 65535 for all systemd services.

August 3, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.89, 1.1.80, 1.2.44, 1.3.4
  • In HA clusters, hadoop.security.token.service.use_ip is now set to false.
  • Upgraded Hadoop to 2.8.4. (Dataproc 1.2)
  • Fixed issue in which Hive jobs would fail on 1.3 HA clusters.
  • Fixed the default value of mapreduce.jobhistory.recovery.store.fs.uri by setting it back to ${hadoop.tmp.dir}/mapred/history/recoverystore. It was set to hdfs:///mapred/history/recoverystore in error as part of the July 6 release.
  • Backported ZOOKEEPER-1576 into ZooKeeper 3.4.6 in Dataproc 1.2 and 1.3. This bug caused ZooKeeper connections to fail if any one of the servers failed to resolve.

July 31, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.3.3
  • Changes to 1.3 image only:
    • Disabled node blacklisting in Tez jobs (set tez.am.node-blacklisting.enabled=false). This affects all Hive jobs, which run on Tez by default.
    • Fixed issue breaking native Snappy compression in spark-shell (SPARK-24018) and Zeppelin.
    • Fixed issue where gsutil and gcloud do not work on cluster VMs when the ANACONDA optional component is selected.

July 18, 2018

July 13, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.88, 1.1.79, 1.2.43, 1.3.2
  • Cloud Dataproc now adds resource location to generated cloud audit logs.

July 10, 2018

  • Cloud Dataproc is now available in the us-west2 region (Los Angeles).

July 6, 2018

  • Announcing the alpha release of Cloud Dataproc Optional Components. This feature allows users to specify additional components to install when creating new Dataproc clusters.
  • New sub-minor versions of Cloud Dataproc images - 1.0.87, 1.1.78, 1.2.42, 1.3.1
  • Changes to 1.3 image only:
    • The per-driver Spark web UI has been re-enabled.
    • The HCatalog library is installed by default.
  • MapReduce job history server recovery is now enabled by default.
  • A race condition in HA cluster creation with the resolveip utility has been resolved.
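The Optional Components alpha announced above is exposed through the gcloud CLI; a hedged sketch of creating a cluster with extra components (the cluster name and component list are illustrative, and the alpha command surface may differ from later releases):

```shell
# Create a cluster with additional components installed at creation time.
gcloud beta dataproc clusters create example-cluster \
  --image-version 1.3 \
  --optional-components ANACONDA,ZEPPELIN
```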

June 29, 2018

  • Cloud Dataproc 1.3 - A new image version for Cloud Dataproc is now generally available.
    • Image version 1.3 will become the default image version for new clusters starting 07/30/2018. See the Cloud Dataproc Version List for more information.
    Image version 1.3 includes the following changes:
    • Apache Spark has been updated to version 2.3.0.
    • Apache Hadoop has been updated to version 2.9.0.
    • Apache Hive has been updated to version 2.3.2.
    • Hive runs on Apache Tez by default.
    • The YARN Timeline Server is enabled by default.
  • Announcing the general availability (GA) release of Cloud Dataproc Custom Images (was Beta). This feature allows users to create and save custom images with packages pre-installed. These custom images can then be used to create Cloud Dataproc clusters.
  • New sub-minor versions of Cloud Dataproc images - 1.0.86, 1.1.77, 1.2.41, 1.3.0
  • Changes to 1.3 image only:
    • Cloud Storage connector was upgraded to version 1.9.0 (see the change notes in the GitHub repository).
    • The NFS Kernel Server is no longer installed.

June 27, 2018

June 22, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.85, 1.1.76, 1.2.40
  • Upgraded the Cloud Storage and BigQuery connectors in 1.0.85, 1.1.76, 1.2.40 (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.7
    • BigQuery connector has been upgraded to version 0.10.8

June 15, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.84, 1.1.75, 1.2.39
  • Initialization action output is now available in Stackdriver under the google.dataproc.startup log.
  • Cloud Dataproc will refuse to create new clusters based on most images created before 2018-02-14. Customers do not need to change minor versions, but if they specify a subminor version that falls into this group, they will need to use a newer subminor version (e.g., 1.1.39 cannot be used for new clusters, but 1.1 and 1.1.73 are valid).
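For example, pinning only the minor version keeps cluster creation working as old subminor images are retired (cluster names are placeholders):

```shell
# Valid: resolves to the latest 1.1 subminor image.
gcloud dataproc clusters create example-cluster --image-version 1.1

# Valid: a recent, still-supported subminor version.
gcloud dataproc clusters create example-cluster --image-version 1.1.73

# Rejected: this subminor image predates 2018-02-14.
gcloud dataproc clusters create example-cluster --image-version 1.1.39
```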

June 11, 2018

  • Cloud Dataproc is now available in the europe-north1 region (Finland).

June 8, 2018

  • Google Cloud SDK 203.0.0 (2018-05-29)
    • Changes include:
      • Added gcloud beta dataproc workflow-templates instantiate-from-file to enable instantiation of workflow templates directly from a YAML file.
      • Added gcloud beta dataproc clusters create-from-file to enable creation of clusters directly from a YAML file.
    • See the Cloud SDK reference documentation for more information.
  • New sub-minor versions of Cloud Dataproc images - 1.0.83, 1.1.74, 1.2.38
  • Changed the JDBC connect string passed to beeline when submitting Hive jobs to high availability clusters through the Cloud Dataproc Jobs API. The new connect string takes advantage of HiveServer2 high availability.
  • WorkflowTemplates will now correctly report Job failures.
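The instantiate-from-file command above reads a workflow template definition directly from YAML; a minimal sketch (the file contents are illustrative, not a complete template):

```shell
# workflow.yaml describes a managed cluster and the jobs to run on it.
cat > workflow.yaml <<'EOF'
placement:
  managedCluster:
    clusterName: example-cluster
jobs:
  - stepId: example-step
    hadoopJob:
      mainClass: org.apache.hadoop.examples.WordCount
      args: ["gs://example-bucket/input/", "gs://example-bucket/output/"]
EOF

# Instantiate the workflow without first creating a stored WorkflowTemplate.
gcloud beta dataproc workflow-templates instantiate-from-file --file workflow.yaml
```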

May 28, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.82, 1.1.73, 1.2.37
  • Hive Server 2 now runs on all three masters in High Availability mode.
  • Preview image changes (Dataproc 1.3):
    • Now requires a 15 GB minimum boot disk size.
    • The NameNode Service RPC port has changed from 8040 to 8051.
    • The SPARK_HOME environment variable is now globally set.
  • ALPN boot jar removed from 1.2. This regression was introduced in 1.2.35.

May 21, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.81, 1.1.72, 1.2.36
  • Upgraded the Cloud Storage and BigQuery connectors in 1.0.81, 1.1.72, 1.2.36 (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.6
    • BigQuery connector has been upgraded to version 0.10.7
  • New version of Cloud Dataproc 1.3 preview image:
    • Removed the BigQuery connector from the image. Users should instead include the BigQuery connector with the jars for their jobs.
    • Cloud Dataproc 1.3 is not officially supported.
    • See the Cloud Dataproc version list for more information.
  • Hive Metastore is now configured to run on all three masters in High Availability mode.
  • Fixed an issue in which accelerator quota was being incorrectly validated, for example, cluster creation might fail with an "Insufficient 'NVIDIA_K80_GPUS' quota" error, even though quota was sufficient.

May 14, 2018

  • New Cloud Dataproc 1.3 image track available in preview.
    • Changes include:
      • Spark 2.3, Hadoop 2.9, Hive 2.3, Pig 0.17, Tez 0.9
      • Hive on Tez by default (no need for the Tez initialization action).
    • Cloud Dataproc 1.3 is not officially supported.
    • See Cloud Dataproc version list for more information.

May 4, 2018

  • Fixed issue in which preemptible workers were not getting cleaned up from node membership files once they had left the cluster.

April 27, 2018

April 20, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.77, 1.1.68, 1.2.32
  • Changed the Namenode HTTP port from 50070 to 9870 on HA clusters in preview image. WebHDFS, for example, is accessible at http://clustername-m-0:9870/webhdfs/v1/. This is consistent with standard and single node clusters in Dataproc 1.2+. Dataproc 1.0 and 1.1 clusters will continue to use port 50070 for all cluster modes.
  • Upgraded the Cloud Storage and BigQuery connectors (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.5
    • BigQuery connector has been upgraded to version 0.10.6
  • Fixed issue where cluster can go into ERROR state due to error when resizing a managed instance group.
  • Backported PIG-4967 and MAPREDUCE-6762 into Cloud Dataproc image version 1.2 to fix an occasional NullPointerException in Pig jobs.
  • Fixed an uncommon issue in which a Cloud Dataproc agent restart during a small window of a cluster downscale operation could cause problems decommissioning data nodes.

April 13, 2018

  • Fixed rare issue where Cloud Dataproc agent fails to initialize HDFS configuration and reports too few DataNodes reporting.
  • Fixed how Cloud Dataproc determines that HDFS decommission is complete.

April 6, 2018

March 30, 2018

March 23, 2018

March 22, 2018

March 16, 2018

March 9, 2018

  • Fixed an issue where ZooKeeper was not configured to periodically clean up its data directories.

March 5, 2018

  • Cloud Dataproc Custom Images - Beta. Users can now create and save custom images that have their packages pre-installed. The custom images can then be used to create Cloud Dataproc clusters.
  • New sub-minor versions of Cloud Dataproc images - 1.0.70, 1.1.61, 1.2.25.
  • An optional requestId field has been added to CreateCluster, UpdateCluster, DeleteCluster, and SubmitJob. The requestId field can be used to prevent the processing of duplicate requests (subsequent requests with a requestId that is the same as a previous requestId are ignored).
  • Increased the MapReduce and Spark History Server heap sizes when running on large master nodes.
  • Fixed an issue where initialization actions could fail to execute with the error "errno 26 Text file is busy".

February 23, 2018

February 16, 2018

  • New sub-minor versions of Cloud Dataproc images - 1.0.68, 1.1.59, 1.2.23.
  • Updating cluster labels now also updates labels on PDs attached to master and primary worker VMs.
  • Fixed an issue where cluster deletion could be slow if multiple delete cluster requests were in progress.
  • Fixed an issue where jobs became stuck if logging failed.
  • Fixed an issue where a cluster downsize operation failed when the dataproc agent incorrectly identified HDFS datanode decommission as stuck.
  • Fixed an issue where the dataproc agent incorrectly reported two YARN metrics.

February 9, 2018

  • New sub-minor versions of the Cloud Dataproc images - 1.0.67, 1.1.58, 1.2.22.
  • The High Availability Mode feature is now in public release (was Beta). Cloud Dataproc clusters can be created with high availability mode enabled. This is an optional feature when creating a cluster. In high availability mode, Cloud Dataproc clusters have three master nodes instead of one. This enables both HDFS High Availability and YARN High Availability to allow uninterrupted YARN and HDFS operations despite any single-node failures or reboots.

    This feature is available when creating clusters with the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See High Availability Mode for more information.

  • An Update Cluster operation now returns a DONE operation if the update request has no work to perform.
  • Fixed an issue where a workflow can get stuck due to a manually deleted cluster.

February 2, 2018

  • Added support for setting hadoop-env, mapred-env, spark-env, and yarn-env dataproc properties through new prefixes. NOTE: applies only to new subminor image versions.
  • Added a button to link to Stackdriver logs for a cluster on the Google Cloud Platform Console Cluster details page.
  • Fixed a Hadoop issue where an insufficient number of Datanodes were reporting.
  • Sped up commitJob on Cloud Storage for jobs with many last-stage (reduce) tasks.
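The new env prefixes above let per-daemon environment variables be set as cluster properties; a hedged sketch (the variables and values are illustrative, and apply only to new subminor image versions as noted):

```shell
# Set variables in hadoop-env.sh and spark-env.sh through the new
# hadoop-env: and spark-env: property prefixes.
gcloud dataproc clusters create example-cluster \
  --properties "hadoop-env:HADOOP_HEAPSIZE=4096,spark-env:SPARK_DAEMON_MEMORY=4g"
```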

January 10, 2018

  • New sub-minor versions of the Cloud Dataproc images - 1.0.63, 1.1.54, 1.2.18.
  • Automatic retry of commitJob (introduced in MAPREDUCE-5485) is now enabled by default; set mapreduce.fileoutputcommitter.failures.attempt to 1 to revert to the old behavior.
  • Applied patch for CVE-2017-5754 ("Meltdown") along with other security patches referenced in DSA-4082-1.
  • Local SSDs are now properly reformatted on boot after an ungraceful host migration; previously, ungraceful host migrations on nodes with local SSDs could cause workers to become defunct.
  • Improved reliability of High Availability cluster startup for cases where one or more masters' startup is delayed.
  • It is now possible to instantiate Dataproc workflows directly, without creating a WorkflowTemplate, by using the new InstantiateInline method.
  • Announcing the Beta release of Cloud Dataproc Persistent Solid State Drive (PD-SSD) Boot Disks, which allows you to create clusters that use PD-SSDs for the master and worker node boot disks.
  • Cloud Dataproc is now available in the northamerica-northeast1 region (Montréal, Canada).
  • Cloud Dataproc is now available in the europe-west4 region (Netherlands).
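The commitJob retry default mentioned above can be reverted per cluster at creation time; a hedged sketch (using the mapred: prefix for mapred-site properties, with the property name as given in this note):

```shell
# Revert to the old single-attempt commitJob behavior.
gcloud dataproc clusters create example-cluster \
  --properties "mapred:mapreduce.fileoutputcommitter.failures.attempt=1"
```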

December 20, 2017

  • The Google Cloud Dataproc Graceful Decommissioning feature is now in public release (was Beta). Graceful Decommissioning enables the removal of nodes from the cluster without interrupting jobs in progress. A user-specified timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes. This feature is available when updating clusters using the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See Graceful Decommissioning for more information.
  • The Single Node Clusters feature is now in public release (was Beta). Single node clusters are Cloud Dataproc clusters with only one node that acts as the master and worker. Single node clusters are useful for a number of activities, including development, education, and lightweight data science.

    This feature is available when creating clusters with the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See Single node clusters for more information.
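Graceful decommissioning is applied when downscaling a cluster; a sketch with the gcloud tool (cluster name, worker count, and timeout are placeholders):

```shell
# Shrink the cluster to 5 workers, waiting up to one hour for in-progress
# jobs to finish before nodes are forcefully removed.
gcloud dataproc clusters update example-cluster \
  --num-workers 5 \
  --graceful-decommission-timeout 1h
```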

December 8, 2017

  • The Restartable Jobs feature is now in public release (was Beta). Cloud Dataproc jobs have an optional setting to restart failed jobs. When you set a job to restart, you specify the maximum number of retries per hour (maximum is 10). Restartable jobs allow you to mitigate common types of job failure, and are especially useful for long-running and streaming jobs. This feature is available when submitting jobs using the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See Restartable jobs for more information.
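A hedged sketch of submitting a restartable job with the gcloud tool (the class and jar names are placeholders):

```shell
# Restart this job on failure, up to 5 times per hour (the maximum is 10).
gcloud dataproc jobs submit spark \
  --cluster example-cluster \
  --class org.example.StreamingJob \
  --jars gs://example-bucket/streaming-job.jar \
  --max-failures-per-hour 5
```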

November 17, 2017

  • New sub-minor versions of the Cloud Dataproc images - 1.0.58, 1.1.49, 1.2.13.
  • Added a new optimization that improves the performance of list operations for operations and jobs when tags are used.

November 10, 2017

  • New sub-minor versions of the Cloud Dataproc images - 1.0.57, 1.1.48, 1.2.12.
  • Apache Hadoop has been upgraded to 2.8.2 in the Cloud Dataproc 1.2 image.

November 1, 2017

  • When using workflow cluster selectors, if more than one cluster matches the specified label(s), Cloud Dataproc will select the cluster with the most free YARN memory. This change replaces the old behavior of choosing a random cluster with the matching label.
  • New sub-minor versions of the Cloud Dataproc images - 1.0.56, 1.1.47, 1.2.11.
  • HTTP 404 and 409 errors will now show the full resource name in order to provide more useful error messages.
  • Fixed a bug that prevented workflow templates from handling /locations/ in resource names.

October 31, 2017

Cloud Dataproc is now available in the asia-south1 region (Mumbai, India).

October 24, 2017

October 17, 2017

  • Fixed a bug where HTTP keep-alive was causing java.lang.NullPointerException: ssl == null errors when accessing Cloud Storage.
  • The Apache Oozie initialization action has been fixed to work with Cloud Dataproc 1.2.

October 11, 2017

  • fluentd on Cloud Dataproc clusters has been reconfigured to concatenate multi-line error messages. This should make error messages easier to locate.
  • Clusters created with Cloud Dataproc Workflows can now use auto-zone placement.
  • Starting with this release, sub-minor releases to the Cloud Dataproc images will be mentioned in the release notes.
  • New sub-minor versions of the Cloud Dataproc images - 1.0.53, 1.1.44, 1.2.8.
  • Fixed a bug reading ORC files in Hive 2.1 on Cloud Dataproc 1.1. To fix this issue, HIVE-17448 has been patched into Hive 2.1.
  • Fixed an issue where Spark memoryOverhead was improperly set for clusters with high-memory master machines and low-memory workers. The memoryOverhead will now be appropriately set for these types of clusters.
  • The Cloud Dataproc agent has improved logic to start jobs in the order in which they were submitted.
  • The HUE initialization action has been fixed to work with Cloud Dataproc 1.2.
  • Fixed a bug where initialization action failures were not properly reported.

October 4, 2017

  • Cloud Dataproc Workflow Templates (Beta) – This new Cloud Dataproc resource allows jobs to be composed into a graph to execute on an ephemeral or an existing cluster. The template can create a cluster, run jobs, then delete the cluster when the workflow is finished. Graph progress can be monitored by polling a single operation. See Workflow templates—Overview for more information.

September 27, 2017

  • Cloud Dataproc Granular IAM (Beta) – Now you can set IAM roles and their corresponding permissions on a per-cluster basis. This provides a mechanism for having different IAM settings for different Cloud Dataproc clusters. See the Cloud Dataproc IAM documentation for more information.
  • Fixed a bug which prevented Apache Pig and Apache Tez from working together in Cloud Dataproc 1.2. This fix was applied to Cloud Dataproc 1.1 in a previous release.
  • Fixed a bug involving Hive schema validation. This fix specifically addresses HIVE-17448 and HIVE-12274.

September 19, 2017

  • New Subminor Image Versions – The latest subminor image versions for 1.0, 1.1, and 1.2 now map to 1.0.51, 1.1.42, 1.2.6, respectively.

September 6, 2017

  • Cluster Scheduled Deletion (Beta) – Cloud Dataproc clusters can now be created with a scheduled deletion policy. Clusters can be scheduled for deletion after a specified duration, at a specified time, or after a specified period of inactivity. See Cluster Scheduled Deletion for more information.
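A hedged sketch of the scheduled deletion options with the gcloud tool (the durations are illustrative):

```shell
# Delete the cluster after 30 minutes of inactivity, or unconditionally
# after 8 hours, whichever comes first.
gcloud beta dataproc clusters create example-cluster \
  --max-idle 30m \
  --max-age 8h
```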

September 5, 2017

Cloud Dataproc is now available in the southamerica-east1 region (São Paulo, Brazil).

August 18, 2017

  • New Subminor Image Versions – The latest subminor image versions for 1.0, 1.1, and 1.2 now map to 1.0.49, 1.1.40, 1.2.4, respectively.
  • All Cloud Dataproc clusters now have a goog-dataproc-cluster-name label that is propagated to underlying Google Compute Engine resources and can be used to determine combined Cloud Dataproc related costs in exported billing data.
  • PySpark drivers are now launched under a changed process group ID to allow the Cloud Dataproc agent to correctly clean up misbehaving or cancelled jobs.
  • Fixed a bug where updating cluster labels and the number of secondary workers in a single update resulted in a stuck update operation and an undeletable cluster.

August 8, 2017

Beginning today, Cloud Dataproc 1.2 will be the default version for new clusters. To use older versions of Cloud Dataproc, you will need to manually select the version on cluster creation.

August 4, 2017

Graceful decommissioning – Cloud Dataproc clusters running Cloud Dataproc 1.2 or later now support graceful YARN decommissioning. Graceful decommissioning enables the removal of nodes from the cluster without interrupting jobs in progress. A user-specified timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes. The Cloud Dataproc scaling documentation contains details on how to enable graceful decommissioning.

Apache Hadoop on Cloud Dataproc 1.2 has been updated to version 2.8.1.

August 1, 2017

Cloud Dataproc is now available in the europe-west3 region (Frankfurt, Germany).

July 21, 2017

  • Cloud Dataproc 1.2 – A new image version for Cloud Dataproc is now generally available: 1.2. It will become the default image version for new clusters starting in 2 weeks. See the Cloud Dataproc version list for more information. Some important changes included in this new image version:
    • Apache Spark has been updated to version 2.2.0.
    • Apache Hadoop has been updated to version 2.8.0.
    • The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.
    • The reported block size for Cloud Storage is now 128MB.
    • Memory configuration for Hadoop and Spark have both been adjusted to improve performance and stability.
    • HDFS daemons no longer use ephemeral ports in accordance with new port assignments outlined in HDFS-9427. This eliminates certain rare race conditions that could cause occasional daemon startup failures.
    • YARN Capacity Scheduler fair ordering from YARN-3319 is now enabled by default.

Starting with the Cloud Dataproc 1.2 release, the ALPN boot jars are no longer provided on the Cloud Dataproc image. To avoid Spark job breakage, upgrade Cloud Bigtable client versions, and bundle boringssl-static with Cloud Dataproc jobs. Our initialization action repository contains initialization actions to revert to the previous (deprecated) behavior of including the jetty-alpn boot jar. This change should only impact you if you use Cloud Bigtable or other Java gRPC clients from Cloud Dataproc.

July 11, 2017

  • Spark 2.2.0 in Preview – The Cloud Dataproc preview image has been updated to Spark 2.2.0.

June 28, 2017

  • Regional endpoints generally available – Regional endpoints for Cloud Dataproc are now generally available.
  • Autozone (Beta) – When you create a new cluster, as an alternative to choosing a zone, you can use the Cloud Dataproc Auto Zone feature to let Cloud Dataproc select a zone within your selected region for the placement of the cluster.
  • Conscrypt for Cloud Storage connector – The default security (SSL) provider used by the Cloud Storage connector on the Cloud Dataproc preview image has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.
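The Auto Zone feature is used by omitting an explicit zone; a sketch (the region is illustrative):

```shell
# Pass an empty --zone and let Cloud Dataproc pick a zone in the region.
gcloud dataproc clusters create example-cluster \
  --region us-central1 \
  --zone ""
```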

June 26, 2017

  • The v1alpha1 and v1beta1 Cloud Dataproc APIs are now deprecated and cannot be used. Instead, you should use the current v1 API.

June 20, 2017

Cloud Dataproc is now available in the australia-southeast1 region (Sydney).

June 6, 2017

Cloud Dataproc is now available in the europe-west2 region (London).

April 28, 2017

Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.1 and the BigQuery connector has been upgraded to bigquery-connector-0.10.2. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.

April 21, 2017

April 12, 2017

Apache Hive on Cloud Dataproc 1.1 has been updated to version 2.1.1.

April 7, 2017

Cloud Dataproc worker IAM role – A new Cloud Dataproc IAM role called Dataproc Worker has been added. This role is intended specifically for use with service accounts.

The Conscrypt security provider has been temporarily changed from the default to an optional security provider. This change was made due to incompatibilities with some workloads. The Conscrypt provider will be re-enabled as the default with the release of Cloud Dataproc 1.2 in the future. In the meantime, you can re-enable the Conscrypt provider when creating a cluster by specifying this Cloud Dataproc property:

--properties dataproc:dataproc.conscrypt.provider.enable=true

March 30, 2017

Conscrypt for Cloud Storage connector – The default security (SSL) provider used by the Cloud Storage connector has been changed to one based on Conscrypt. This change should more efficiently utilize the CPU for SSL operations. In many cases, this change should result in better read and write performance between Cloud Dataproc and Cloud Storage.

Updates to user labels applied to Cloud Dataproc clusters will now be applied to managed instance group templates. Since preemptible virtual machines are included in a managed instance group, label updates are now applied to preemptible VMs.

March 17, 2017

As mentioned in the February 9th release notes, Cloud Audit Logs for Cloud Dataproc are no longer emitted to the dataproc_cluster resource type. Starting with this release, Cloud Audit Logs are emitted to the new cloud_dataproc_cluster resource type.

The gcloud command now requires a double dash (--) between gcloud-specific arguments and arguments to those commands. For example, if you used this command in the past:

gcloud dataproc jobs submit spark --cluster example-cluster \
--class sample_class --jars jar_file 1000
The new format for the command requires a double dash, with a space before and after it:

gcloud dataproc jobs submit spark --cluster example-cluster \
--class sample_class --jars jar_file -- 1000

March 7, 2017

March 1, 2017

  • Restartable jobs (Beta) – Cloud Dataproc jobs now have an optional setting to restart jobs that have failed. When you set a job to restart, you specify the maximum number of retries per hour. Restartable jobs allow you to mitigate common types of job failure and are especially useful for long-running and streaming jobs.
  • Single node clusters (Beta) – Single node clusters are Cloud Dataproc clusters with only one node that acts as both the master and worker. Single node clusters are useful for a number of activities, including development, education, and lightweight data science.

February 9, 2017

  • Cloud Dataproc Stackdriver Logging Changes
    • With new images, cluster logs are now exported to Stackdriver as resource type cloud_dataproc_cluster (was previously dataproc_cluster).
    • Cloud Audit logs will be emitted to both cloud_dataproc_cluster and dataproc_cluster (deprecated) until the March 9th release.
    • Stackdriver logs for new images are indexed first by cluster name and then cluster UUID to assist in filtering logs by cluster name or cluster instance.
  • Cloud Dataproc Stackdriver Monitoring Changes
  • Cloud Dataproc User Labels Changes

January 19, 2017

  • Cloud Dataproc 1.2 preview – The preview image has been updated to reflect the planned Cloud Dataproc 1.2 release. This image includes Apache Spark 2.1 and Apache Hadoop 2.8-SNAPSHOT. The preview image provides access to Hadoop 2.8 release candidates now, and will provide access to Hadoop 2.8 in Cloud Dataproc 1.2 once Hadoop 2.8 is formally released.

January 5, 2017

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.0 and the BigQuery connector has been upgraded to bigquery-connector-0.10.1. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.
  • The diagnose command has been updated to include the jstack output of the agent and spawned drivers.

December 16, 2016

  • Google Stackdriver Agent Installed – The Stackdriver monitoring agent is now installed by default on Cloud Dataproc clusters. The Cloud Dataproc Stackdriver monitoring documentation has information on how to use Stackdriver monitoring with Cloud Dataproc. The agent for monitoring and logging can be enabled and disabled by adjusting the cluster properties when you create a cluster.
  • Cloud Dataproc 1.1.15 and 1.0.24 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.

December 7, 2016

  • Starting with this release, the Google Cloud Dataproc API must be enabled for your project for Cloud Dataproc to function properly. You can use the Google Cloud Platform Console to enable the Cloud Dataproc API. Existing projects with the Cloud Dataproc API enabled will not be impacted.
  • Cloud Dataproc 1.1.14 and 1.0.23 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.
  • Increased the number of situations in which Cloud Dataproc services are automatically restarted by systemd on clusters in the event of unexpected or unhealthy behavior.

November 29, 2016

  • Custom service account support – When creating a Cloud Dataproc cluster, you can now specify a user-managed (non-default) service account. This service account will be used to run the Compute Engine virtual machines in your cluster. This enables much more fine-grained permissions across services for each individual cluster. See the service account documentation for more information.
  • Cloud Dataproc 1.1.13 and 1.0.22 – The 1.1 image for Cloud Dataproc has been updated to include support for Apache Spark 2.0.2, Apache Zeppelin 0.6.2, and Apache Flink 1.1.3. The 1.1 and 1.0 images have also been updated with non-impacting bug fixes and enhancements. See the Cloud Dataproc version list for more information about Cloud Dataproc image versions.
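A hedged sketch of creating a cluster under a user-managed service account (the account name and project are placeholders):

```shell
# Run the cluster's Compute Engine VMs as a user-managed service account
# instead of the Compute Engine default service account.
gcloud dataproc clusters create example-cluster \
  --service-account dataproc-worker@example-project.iam.gserviceaccount.com
```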

November 14, 2016

  • Fixed an issue where the --jars argument was missing from gcloud dataproc jobs submit pyspark command.

November 8, 2016

  • Google BigQuery connector upgrade – The BigQuery connector has been upgraded to bigquery-connector-0.10.1-SNAPSHOT. This version introduces the new IndirectBigQueryOutputFormat that uses Hadoop output formats to write directly to a temporary Cloud Storage bucket, and issues a single BigQuery load job per Hadoop/Spark job at job-commit time. For more information, review the BigQuery change notes in the GitHub repository.

November 7, 2016

November 2, 2016

  • User Labels [BETA] – You can now apply user-specified key=value labels to Cloud Dataproc clusters and jobs. This allows you to group resources and related operations for later filtering and listing. As an example, you can use labels with clusters to break out Cloud Dataproc usage by groups or individuals. For more information see the user labels documentation.
  • Fixed an issue where failures during a cluster update caused the cluster to fail. Now, update failures return the cluster to Running state.
  • Fixed an issue where submitting a large number of jobs rapidly or over a long period of time caused a cluster to fail.
  • Increased the maximum number of concurrent jobs per cluster.
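A hedged sketch of creating a labeled cluster and then filtering by label (the label keys and values are illustrative):

```shell
# Create a cluster with labels, then list only clusters that carry them.
gcloud beta dataproc clusters create example-cluster \
  --labels team=research,env=dev

gcloud beta dataproc clusters list --filter "labels.team=research"
```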

October 18, 2016

  • Fixed an issue where HiveServer2 was not healthy for up to 60 seconds after the cluster was deployed. Hive jobs should now successfully connect to the required HiveServer2 immediately after a cluster is deployed.

October 11, 2016

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.4 and the BigQuery connector has been upgraded to bigquery-connector-0.8.0. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.
  • dataproc.localssd.mount.enable – Added the new property dataproc.localssd.mount.enable that can be set at cluster deployment time to make Cloud Dataproc ignore local SSDs. If set, Cloud Dataproc will use the main persistent disks for HDFS and temporary Hadoop directories so the local SSDs can be used separately for user-defined purposes. This property can be set by using the argument --properties dataproc:dataproc.localssd.mount.enable=false when creating a Cloud Dataproc cluster.
  • Fixed an issue where CPU quota validation for preemptible virtual machines was being performed against the non-preemptible CPU quota even when preemptible CPU quota was set.
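A sketch of how the new property might be combined with local SSDs at cluster creation (the cluster name is illustrative; the --num-worker-local-ssds flag is a standard gcloud option, assumed here rather than stated in this release note):

```shell
# Attach one local SSD per worker, but keep HDFS and temporary Hadoop
# directories on the persistent disks so the SSD is free for other uses.
gcloud dataproc clusters create ssd-cluster \
    --num-worker-local-ssds 1 \
    --properties dataproc:dataproc.localssd.mount.enable=false
```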

October 7, 2016

  • Google Cloud Platform Console
    • Now, up to 8 local SSDs can be added to worker nodes. The previous limit was 4.
    • When looking at a cluster's details, the "Jobs" page now shows Stop and Delete buttons for every job in the list. Previously, the buttons were only visible on the row where the mouse hovered.
  • Optimized the listing of resources by state and cluster UUID. This should reduce several list operations from seconds to milliseconds.

September 29, 2016

  • Hadoop High Availability Mode [BETA] – Cloud Dataproc clusters can be created with high availability mode enabled. This is an optional feature when creating a cluster. In high availability mode, Cloud Dataproc clusters have three master nodes instead of one. This enables both HDFS High Availability and YARN High Availability to allow uninterrupted YARN and HDFS operations despite any single-node failures or reboots.

    Presently, this feature is available when creating clusters with the gcloud command-line tool or the Cloud Dataproc REST API. A future release will enable support for creating clusters with high availability in the Google Cloud Platform Console.

    See the high availability mode documentation for more information.

  • Optimized how jobs are listed based on state or cluster UUID. This should significantly decrease the time required to list jobs.
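The high availability mode described above could be requested from the gcloud command-line tool along these lines (the cluster name is illustrative, and the --num-masters flag is the current gcloud spelling, assumed here since this release note does not name the flag):

```shell
# Create a cluster in high availability mode with three master nodes.
gcloud dataproc clusters create ha-cluster --num-masters 3
```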

September 22, 2016

  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.3 and the BigQuery connector has been upgraded to bigquery-connector-0.7.9. For more information, review the change notes in the GitHub repository.
  • While Cloud Dataproc has been using Java 8 since its Beta launch in September 2015, there is now a hard dependency on Java 8 or higher.
  • The --preemptible-worker-boot-disk-size flag no longer requires you to specify 0 preemptible workers when you do not want to add preemptible machines at cluster creation.

September 16, 2016

  • Preemptible boot disk sizes - The disk size for preemptible workers can now be set at cluster creation with the gcloud command-line tool using the --preemptible-worker-boot-disk-size flag, even when no preemptible workers are added to the cluster.

September 1, 2016

  • Identity & Access Management support [BETA] - Cloud Dataproc now has beta support for Google Cloud Identity and Access Management (IAM). Cloud Dataproc IAM permissions allow users to perform specific actions on Cloud Dataproc clusters, jobs, and operations. See Cloud Dataproc Permissions and IAM Roles for more information.
  • LZO support - Cloud Dataproc clusters now natively support the LZO data compression format.
  • Google Stackdriver logging toggle - It is now possible to disable Google Stackdriver logging on Cloud Dataproc clusters. To disable Stackdriver logging, use the command `--properties dataproc:dataproc.logging.stackdriver.enable=false` when creating a cluster with the `gcloud` command-line tool.
  • Cluster resource definitions for newly deployed clusters now display a fully resolved sub-minor image version (for example, 1.0.11 instead of 1.0). This makes it easier to temporarily revert to an older sub-minor version. See the Cloud Dataproc versioning documentation for more information.
  • The message displayed after submitting a long-running operation in the Google Cloud Platform Console, such as creating or deleting a cluster, will now indicate the operation has been "submitted" rather than "has succeeded."

August 25, 2016

  • Cloud Dataproc 1.1 default – Cloud Dataproc 1.1 is now the default image version for new clusters.
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.2 and the BigQuery connector has been upgraded to bigquery-connector-0.7.8, which bring performance improvements. See the release notes for the gcs-connector and bigquery-connector for more information.
  • Apache Zeppelin 0.6.1 – The Apache Zeppelin package built for Cloud Dataproc and installable with this initialization action has been upgraded to version 0.6.1. This new version of Zeppelin brings support for Google BigQuery.
  • Fixed an issue where adding many (~200+) nodes to a cluster would cause some nodes to fail.
  • Fixed an issue where output from initialization actions that timed out was not copied to Cloud Storage.

August 16, 2016

The two Cloud Dataproc image versions released during the Cloud Dataproc beta, 0.1 and 0.2, will no longer receive updates. You can continue to use the beta images; however, no new updates, such as bug fixes and connector updates, will be applied to these two deprecated image versions.
Image versions released after Cloud Dataproc became generally available, starting with 1.0, will be subject to the Cloud Dataproc versioning policy.

August 8, 2016

Cloud Dataproc 1.1 – A new image version, Cloud Dataproc 1.1, has been released. Several components have been updated for this image version including:

To create a cluster with the 1.1 image, you can use the gcloud command-line tool with the --image-version argument, such as gcloud dataproc clusters create --image-version 1.1.

Cloud SDK release 121.0.0 – Updated several gcloud dataproc arguments.
  • The --preemptible-worker-boot-disk-size argument has been promoted to general availability and can be used to adjust the persistent disk size (in GB) of preemptible workers.
  • The --master-boot-disk-size-gb and --worker-boot-disk-size-gb arguments have been removed. Use --master-boot-disk-size and --worker-boot-disk-size instead.

August 2, 2016

Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.1 and the BigQuery connector has been upgraded to bigquery-connector-0.7.7. See the release notes for the gcs-connector and bigquery-connector for more information.
Updated preview image – Several preview image components have been updated, including:
Fixed an issue in which the NFS-based Cloud Storage consistency cache would not get cleaned on long-running clusters with sustained high file creation rates (> ~1,000,000 files per hour for a sustained period of time).

July 19, 2016

New Features

  • Support for new us-west1 region - Cloud Dataproc is available from day one in the newly announced us-west1 region. As mentioned in the announcement, some West Coast users may see reductions in latency.
  • Apache Spark upgrade to 1.6.2 - Apache Spark on the 1.0 Cloud Dataproc image version has been upgraded from 1.6.1 to 1.6.2.
  • Cloud Storage and BigQuery connector upgrades - The Cloud Storage connector has been upgraded to gcs-connector-1.5.0 and the BigQuery connector has been upgraded to bigquery-connector-0.7.6. These new versions bring a number of new features and fixes.
    • Appendable output streams - GHFS (Google Hadoop File System) now contains an option to enable support for appendable output streams. You can enable this option by setting the fs.gs.outputstream.type property to SYNCABLE_COMPOSITE.
    • Auto retries on 429 errors - HTTP 429 (rate limit) errors from Google APIs will now be automatically retried with a backoff.
    • Cloud Storage performance - Improved the Cloud Storage connector read performance, especially for lots of small reads or lots of seeks. See the detailed change log for more information.
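The appendable output stream option above could be enabled cluster-wide at creation time via the --properties flag, where the core: prefix targets core-site.xml (the cluster name is illustrative; this command form is an assumption based on the --properties mechanism described elsewhere in these notes):

```shell
# Enable appendable (syncable) output streams in the Cloud Storage connector
# for the whole cluster; core: maps the property into core-site.xml.
gcloud dataproc clusters create streams-cluster \
    --properties core:fs.gs.outputstream.type=SYNCABLE_COMPOSITE
```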

Bugfixes

  • Google Cloud Platform Console
    • The Google Cloud Platform Console now uses the Cloud Dataproc v1 instead of the v1beta1 API. Clicking on the equivalent REST link will show the appropriate v1 API paths and resource names.
  • Fixed an issue where some HDFS nodes did not join a cluster because their domain name could not be resolved on first boot.

July 1, 2016

New Features

  • gcloud command-line tool
    • Added the flag --preemptible-worker-boot-disk-size which can be used to adjust the boot disk size of preemptible workers. This was added in the gcloud beta track.
    • The --*-boot-disk-size-gb flag is now deprecated in all tracks and has been replaced by the --*-boot-disk-size flag.

Bugfixes

  • Fixed a bug introduced in a June release that caused clusters to fail only after waiting for ~30 minutes. This occurred most frequently when initialization actions failed during cluster creation. Now clusters should fail within 1 minute of an initialization action failure.
  • Decreased job startup time for SparkSQL jobs with partitioned/nested directories by applying a patch for Spark (SPARK-9926)
  • Further optimized job startup time for any job with a lot of file inputs by applying a patch for Hadoop (HADOOP-12810)

June 10, 2016

New Features

May 4, 2016

New Features

  • Cloud SQL Initialization Action - Cloud Dataproc now has a Cloud SQL I/O and Hive Metastore initialization action. This initialization action installs a Google Cloud SQL proxy on every node in a Cloud Dataproc cluster. It also configures the cluster to store Apache Hive metadata on a given Cloud SQL instance.

April 29, 2016

Bugfixes

  • The staging directory for a Cloud Dataproc job is now automatically cleared when a job completes.
  • If a cluster fails to delete properly, it will now be transitioned to a FAILED state instead of remaining in a DELETING state.
  • Fixed an issue which prevented the Cloud Dataproc --properties flag from changing MapReduce properties.
  • Fixed a bug where an exception would be thrown when trying to set YARN log-aggregation to output to Cloud Storage (related to YARN-3269).

March 30, 2016

New Features

  • Spark 1.6.1 - The Cloud Dataproc image version 1.0 has been updated to include the Spark 1.6.1 maintenance release instead of Spark 1.6.0.
  • OSS upgrades - This release upgrades the Cloud Storage and Google BigQuery connectors to gcs-connector-1.4.5 and bigquery-connector-0.7.5, respectively.

Bugfixes

  • It is now possible to specify --num-preemptible-workers 0 through the gcloud command-line tool. Previously this would fail.
  • Fixed a validation issue which produced 500 HTTP errors when the response should have been 400 bad input or 200 OK.
  • Resolved a cache validation issue and re-enabled directory inference for the Cloud Storage connector (fs.gs.implicit.dir.infer.enable).
  • Adjusted Compute Engine migration settings due to unexpected host failures - normal VMs will automatically restart after migration and preemptible machines will not. Previously all VMs were set to not automatically restart after a migration.
  • Addressed an issue where rapid job submission would result in a Too many pending operations on a resource error.

March 8, 2016

New Features

  • Subnetwork support - Cloud Dataproc now supports subnetworks through the gcloud command-line tool. You can now use the --subnet SUBNET command to specify a subnetwork when you create a Cloud Dataproc cluster.

Bugfixes

  • Added strict validation of full compute resource URIs. The following patterns are supported:
    • https://<authority>/compute/<version>/projects/...
    • compute/<version>/projects/...
    • projects/...
  • Fixed an issue where disk quota was not being checked when a cluster size was increased.

February 22, 2016

Cloud Dataproc is now generally available! For more information, see our announcement blog post.

New Features

  • Custom Compute Engine machine types - Cloud Dataproc clusters now support custom Compute Engine machine types for both master and worker nodes. This means you can create clusters with customized amounts of virtual CPUs and memory. For more information, please read the Dataproc documentation for custom machine types.
  • OSS upgrades - We have released Cloud Dataproc version 1.0. This release includes an upgrade to Apache Spark 1.6.0 and Apache Hadoop 2.7.2. This release also includes new versions of the Cloud Storage and Google BigQuery connectors.
  • v1 API - The v1 API for Cloud Dataproc is now live. This API includes support for regionality along with minor fixes and adjustments. The API is available in the APIs Explorer and also has a Maven artifact in Maven Central. For more information, review the REST API documentation.
  • Support for --jars for PySpark - Added support for using the --jars option for PySpark jobs.
  • API auto-enable - Enabling the Cloud Dataproc API now automatically enables required dependent APIs, such as Cloud Storage and Google Compute Engine.

Bugfixes

  • Resolved several issues that occasionally caused some clusters to hang while being scaled down.
  • Improved validation of certain types of malformed URLs, which previously failed during cluster deployment.

February 3, 2016

New Features

  • A new --image-version option has been added: preview
    • Unlike other numerical versions like 0.1 and 0.2, the preview version will contain newer Hadoop/Spark/Pig/Hive components targeted for potential release in the next stable Cloud Dataproc distribution version, and will change over time.
    • As of February 3, 2016, the preview version contains Spark 1.6.0, with the same Hadoop/Pig/Hive versions as Cloud Dataproc 0.2.
    • The preview option is being rolled out to the Google Cloud Platform Console gradually, so it may not be visible in your account for another week or so. For all users, the preview option is accessible by deploying clusters with the gcloud command-line tool.

Bugfixes

  • Improved reliability of the DeleteJob command.
  • Fixed a bug that caused jobs to remain in a RUNNING state after the job completed successfully.

January 27, 2016

New Features

  • Two new options have been added to the Cloud Dataproc gcloud command-line tool for adding tags and metadata to virtual machines used in Cloud Dataproc clusters. These tags and metadata will apply to both regular and preemptible instances.
    • The --tags option will add tags to the Google Compute Engine instances in a cluster. For example, using the argument --tags foo,bar,baz will add three tags to the virtual machine instances in the cluster.
    • The --metadata option will add metadata to the compute engine instances. For example, using --metadata 'meta1=value1,key1=value2' will add two key-value pairs of metadata.
  • Support for heterogeneous clusters where the master node and worker nodes have different amounts of memory. Some memory settings were previously based on the master node which caused some problems as described in this Stack Overflow question. Cloud Dataproc now better supports clusters with master and worker nodes which use different machine types.
  • Google Cloud Platform Console
    • The Output tab for a job now includes a Line wrapping option to make it easier to view job output containing very long lines
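The --tags and --metadata options described above could be combined in a single cluster creation, along these lines (the cluster name, tags, and metadata values are illustrative):

```shell
# Create a cluster whose regular and preemptible VMs all carry
# network tags and custom metadata key-value pairs.
gcloud dataproc clusters create tagged-cluster \
    --tags foo,bar,baz \
    --metadata 'meta1=value1,meta2=value2'
```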

Bugfixes

  • Fixed two issues which would sometimes cause virtual machines to remain active after a cluster deletion request was submitted
  • The Spark maxExecutors setting is now set to 10000 to avoid the AppMaster failing on jobs with many tasks
  • Improved handling for aggressive job submission by making several changes to the Cloud Dataproc agent, including:
    • Limiting the number of concurrent jobs so they are proportional to the memory of the master node
    • Checking free memory before scheduling new jobs
    • Rate limiting how many jobs can be scheduled per cycle
  • Improved how HDFS capacity is calculated before commissioning or decommissioning nodes to prevent excessively long updates

January 21, 2016

New Features

  • The dataproc command in the Google Cloud SDK now includes a --properties option for adding or updating properties in some cluster configuration files, such as core-site.xml. Properties are mapped to configuration files by specifying a prefix, such as core:io.serializations. This command makes it possible to modify multiple properties and files when creating a cluster. For more information, see the Cloud Dataproc documentation for the --properties command.
  • Google Cloud Platform Console
    • An option has been added to the “Create Clusters” form to enable the cloud-platform scope for a cluster. This lets you view and manage data across all Google Cloud Platform services from Cloud Dataproc clusters. You can find this option by expanding the Preemptible workers, bucket, network, version, initialization, & access options section at the bottom of the form.
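The --properties prefix mapping described above might look like the following in practice, targeting two configuration files in one command (the cluster name and property values are illustrative):

```shell
# core: maps into core-site.xml; spark: maps into spark-defaults.conf.
gcloud dataproc clusters create tuned-cluster \
    --properties 'core:io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec,spark:spark.executor.memory=4g'
```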

Bugfixes

  • SparkR jobs no longer immediately fail with a “permission denied” error (Spark JIRA issue)
  • Configuring logging for Spark jobs with the --driver-logging-levels option no longer interferes with Java driver options
  • Google Cloud Platform Console
    • The error shown for improperly-formatted initialization actions now properly appears with information about the problem
    • Very long error messages now include a scrollbar so the Close button remains on-screen

January 7, 2016

Bugfixes
  • Fixed issue in Dataproc version 0.1 that caused zero-byte _SUCCESS and _FAILURE files for each job to be continually re-written to Cloud Storage.

December 16, 2015

New Features

  • Cloud Dataproc clusters now have vim, git, and bash-completion installed by default
  • The Cloud Dataproc API now has an official Maven artifact, Javadocs, and a downloadable .zip file
  • Google Cloud Platform Console
    • Properties can now be specified when submitting a job, and can be seen in the Configuration tab of a job
    • A Clone button has been added that allows you to easily copy all information about a job to a new job submission form
    • The left-side icons for Clusters and Jobs are now custom icons rather than generic ones
    • An Image version field has been added to the bottom of the create cluster form that allows you to select a specific Cloud Dataproc image version when creating a cluster
    • A VM Instances tab has been added on the cluster detail page, which you can use to display a list of all VMs in a cluster and easily SSH into the master node
    • An Initialization Actions field has been added to the bottom of the create cluster form, which allows you to specify initialization actions when creating a cluster
    • Paths to Cloud Storage buckets that are displayed in error messages are now clickable links.

Bugfixes

  • Forced distcp settings to match mapred-site.xml settings to provide additional fixes for the distcp command (see this related JIRA)
  • Ensured that workers created during an update do not join the cluster until after custom initialization actions are complete
  • Ensured that workers always disconnect from a cluster when the Cloud Dataproc agent is shut down
  • Fixed a race condition in the API frontend that occurred when validating a request and marking the cluster as updating
  • Enhanced validation checks for quota, Cloud Dataproc image, and initialization actions when updating clusters
  • Improved handling of jobs when the Cloud Dataproc agent is restarted
  • Google Cloud Platform Console
    • Allowed duplicate arguments when submitting a job
    • Replaced generic Failed to load message with details about the cause of an error when an error occurs that is not related to Cloud Dataproc
    • When a single jar file for a job is submitted, allowed it to be listed only in the Main class or jar field on the Submit a Job form, and no longer required it to also be listed in the Jar files field

November 18, 2015

Rollouts are staged to occur over four days, and should be deployed or be available for use in your Cloud Dataproc clusters by the end of the fourth day from the version's announced release date.

New features

  • Version selection - With the release of Cloud Dataproc version 0.2, you can now select among different versions of Cloud Dataproc (see Cloud Dataproc Versioning for information on support for previous versions and the Cloud Dataproc Version List for a list of the software components supported in each version). You can select a Cloud Dataproc version when creating a cluster through the Cloud Dataproc API, Cloud SDK (using the gcloud beta dataproc clusters create --image-version command) or through the Google Cloud Platform Console. Note that within four days of the release of a new version in a region, the new version will become the default version used to create new clusters in the region.
  • OSS upgrades - We have released Cloud Dataproc version 0.2. The new Spark component includes a number of bug fixes. The new Hive component enables use of the hive command, contains performance improvements, and has a new metastore.
  • Connector updates - We released updates to our BigQuery and Google Cloud Storage connectors (0.7.3 and 1.4.3, respectively.) These connectors fix a number of bugs and the new versions are now included in Cloud Dataproc version 0.2.
  • Hive Metastore - We introduced a MySQL-based per-cluster persistent metastore, which is shared between Hive and SparkSQL. This also fixes the hive command.
  • More Native Libraries - Cloud Dataproc now includes native Snappy libraries. It also includes native BLAS, LAPACK and ARPACK libraries for Spark’s MLlib.
  • Clusters --diagnose command - The Cloud SDK now includes a --diagnose command for gathering logging and diagnostic information about your cluster. More details about this command are available in the Cloud Dataproc support documentation.
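The --diagnose command mentioned above might be invoked as follows (the cluster name is illustrative; in current gcloud releases the command form is `clusters diagnose`):

```shell
# Gather logs and diagnostic information from a running cluster; the command
# reports the Cloud Storage location of the resulting archive.
gcloud dataproc clusters diagnose my-cluster
```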

Bugfixes

  • Fixed the ability to delete jobs that fast-failed before some cluster and staging directories were created
  • Fixed some remaining errors with vmem settings when using the distcp command
  • Fixed a rare bug in which underlying Compute Engine issues could lead to VM instances failing to be deleted after the Cloud Dataproc cluster had been successfully deleted
  • The hive command has been fixed
  • Fixed error reporting when updating the number of workers (standard and preemptible) in a cluster
  • Fixed some cases when Rate Limit Exceeded errors occurred
  • The maximum cluster name length is now correctly 55 instead of 56 characters
  • Google Cloud Platform Console
    • Cluster list now includes a Created column, and the cluster configuration tab now includes a Created field, showing the creation time of the cluster
    • In the cluster-create screen, cluster memory sizes greater than 999 GB are now displayed in TB
    • Fields that were missing from the PySpark and Hive job configuration tab (Additional Python Files and Jar Files) have been added
    • The option to add preemptible nodes when creating a cluster is now in the “expander” at the bottom of the form
    • Machine types with insufficient memory (less than 3.5 GB) are no longer displayed in the list of machine types (previously, selecting one of these small machine types would lead to an error from the backend)
    • The placeholder text in the Arguments field of the submit-job form has been corrected

Core service improvements

  • If set, a project's default zone setting is now used as the default value for the zone in the create-cluster form in the GCP Console.

Optimizations

  • Hive performance has been greatly increased, especially for partitioned tables with a large number of partitions
  • Multithreaded listStatus has now been enabled, which should speed up job startup time for FileInputFormats reading large numbers of files and directories in Cloud Storage

October 23, 2015

New Features

  • Google Cloud Platform Console

October 15, 2015

Bugfixes

  • Fixed a bug in which DataNodes failed to register with the NameNode on startup, resulting in less-than-expected HDFS capacity.
  • Prevented the submission of jobs to clusters in an Error state.
  • Fixed bug that prevented clusters from deleting cleanly in some situations.
  • Reduced HTTP 500 errors when deploying Cloud Dataproc clusters.
  • Corrected distcp out-of-memory errors with better cluster configuration.
  • Fixed a situation in which jobs failed to delete properly and were stuck in a Deleting state.

Core service improvements

  • Provided more detail about HTTP 500 errors instead of showing 4xx errors.
  • Added information on existing resources for Resource already exists errors.
  • Specific information is now provided instead of a generic error message for errors related to Cloud Storage.
  • Listing operations now support pagination.

Optimizations

  • Significantly improved YARN utilization for MapReduce jobs running directly against Cloud Storage.
  • Made adjustments to yarn.scheduler.capacity.maximum-am-resource-percent to enable better utilization and concurrent job support.