This page documents production updates to Cloud Data Fusion. Check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.
You can see the latest product updates for all of Google Cloud on the Google Cloud release notes page.
To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly:
October 27, 2020
You can now specify which Cloud Data Fusion version to use when you create an instance.
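As a rough illustration, the version can be supplied in the body of an instance-creation request. The sketch below only builds the request URL and JSON body; it does not send anything. It assumes the Cloud Data Fusion REST API's instance-creation method and the `type` and `version` fields of the Instance resource. The API root, project, region, instance name, and version string are placeholder assumptions, and the API version available at the time of this release may have differed.

```python
import json
from urllib.parse import urlencode

# Assumed REST endpoint root; it is not stated in these release notes.
API_ROOT = "https://datafusion.googleapis.com/v1"


def build_create_instance_request(project, location, instance_id, version, edition="BASIC"):
    """Return (url, json_body) for a hypothetical instance-creation call."""
    parent = f"projects/{project}/locations/{location}"
    query = urlencode({"instanceId": instance_id})
    url = f"{API_ROOT}/{parent}/instances?{query}"
    # "version" pins the Cloud Data Fusion version for the new instance.
    body = {"type": edition, "version": version}
    return url, json.dumps(body, sort_keys=True)


# Placeholder values, for illustration only.
url, body = build_create_instance_request("my-project", "us-central1", "my-instance", "6.1.4")
print(url)
print(body)
```

Sending the request would additionally require an authenticated HTTP client, which is left out here on purpose.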
You can now specify the service account to use for running your Cloud Data Fusion pipeline on Dataproc:
October 21, 2020
In Cloud Data Fusion versions before 6.2, there is a known issue where pipelines get stuck during execution. Stopping the pipeline results in the following error: "Malformed reply from SOCKS server". To fix this, delete the Dataproc cluster, and then update the memory settings in the compute profile.
September 30, 2020
This release is in parallel with the CDAP 6.2.2 release.
Cloud Data Fusion now supports autoscaling Dataproc clusters.
Cloud Data Fusion now displays the number of pending preview runs, if any, before the current run. In the Studio, the number of pending runs is displayed under the timer.
Improved performance for skewed joins by adding a Distribution option to the Joiner plugin settings.
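These notes do not describe how the Distribution setting works internally. A common technique for skewed joins is key salting: rows on the skewed side get a random salt in a fixed range, and rows on the other side are replicated once per salt value, so a hot key is spread across several partitions. The sketch below illustrates that general idea in plain Python. The function name, the `num_salts` parameter, and the sample data are hypothetical, and this is not the Joiner plugin's implementation.

```python
import random
from collections import defaultdict


def salted_join(skewed, small, num_salts, seed=0):
    """Inner-join two lists of (key, value) pairs using key salting."""
    rng = random.Random(seed)

    # Skewed side: attach one random salt per row, spreading a hot key
    # over up to num_salts distinct composite keys.
    salted_left = [((k, rng.randrange(num_salts)), v) for k, v in skewed]

    # Other side: replicate every row once per salt value so each composite
    # key on the skewed side still finds its match.
    index = defaultdict(list)
    for k, v in small:
        for salt in range(num_salts):
            index[(k, salt)].append(v)

    # Join on the composite (key, salt) and drop the salt from the output.
    joined = []
    for (k, salt), left_value in salted_left:
        for right_value in index.get((k, salt), []):
            joined.append((k, left_value, right_value))
    return joined


# Hypothetical data: "hot" dominates the left side.
orders = [("hot", i) for i in range(6)] + [("cold", 99)]
customers = [("hot", "A"), ("cold", "B")]

result = salted_join(orders, customers, num_salts=3)
plain = [(k, lv, rv) for k, lv in orders for k2, rv in customers if k == k2]
print(sorted(result) == sorted(plain))
```

The final comparison checks that salting changes only how rows are grouped, not which rows the join produces. The cost is that the smaller side is replicated `num_salts` times.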
August 24, 2020
Cloud Data Fusion 6.1.4 provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes.
This release is in parallel with the CDAP 6.1.4 release.
You can now create autoscaling Dataproc clusters.
You can now use the schema support feature in the UI to edit the precision and scale fields.
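For context, precision is the total number of digits in a decimal value and scale is the number of digits to the right of the decimal point. The sketch below uses Python's standard `decimal` module to check whether a value fits a given precision and scale. The helper name is hypothetical, and the check is only an illustration, not the validation that Cloud Data Fusion performs.

```python
from decimal import Decimal


def fits(value, precision, scale):
    """Return True if value, as written, fits the given precision and scale."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    if not isinstance(exponent, int):  # NaN or Infinity
        return False
    # Digits after the decimal point (trailing zeros count as written).
    frac_digits = max(-exponent, 0)
    # Digits before the decimal point; zero needs none.
    int_digits = 0 if not any(digits) else max(len(digits) + exponent, 0)
    return frac_digits <= scale and int_digits <= precision - scale


print(fits("123.45", precision=5, scale=2))
print(fits("123.45", precision=4, scale=2))
print(fits("0.001", precision=5, scale=2))
```

With precision 5 and scale 2, a value may have at most three digits before the decimal point and two after it, so the first call returns True and the other two return False.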
Cloud Data Fusion now uses a disk-only auto-caching strategy, which improves memory performance in pipelines.
Cloud Data Fusion previews now support up to 50 concurrent runs.
Cloud Data Fusion now supports large and deeply nested schemas (>5K fields with 20+ levels of nesting).
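To gauge whether a schema falls into this range, one can count its fields and its nesting depth. The sketch below walks an Avro-style JSON schema, assuming a record/fields layout of the kind CDAP schemas use. The function name and the sample schema are hypothetical, and only records, arrays, maps, and unions are handled.

```python
def schema_stats(schema, depth=0):
    """Return (total_fields, max_record_depth) for an Avro-style schema."""
    if isinstance(schema, list):  # union: inspect every branch
        results = [schema_stats(branch, depth) for branch in schema]
        total = sum(r[0] for r in results)
        deepest = max((r[1] for r in results), default=depth)
        return total, deepest
    if not isinstance(schema, dict):  # primitive type name such as "string"
        return 0, depth

    kind = schema.get("type")
    if kind == "record":
        total, deepest = 0, depth + 1
        for field in schema.get("fields", []):
            nested_total, nested_depth = schema_stats(field["type"], depth + 1)
            total += 1 + nested_total
            deepest = max(deepest, nested_depth)
        return total, deepest
    if kind == "array":
        return schema_stats(schema["items"], depth)
    if kind == "map":
        return schema_stats(schema["values"], depth)
    if isinstance(kind, (dict, list)):  # wrapped type definition
        return schema_stats(kind, depth)
    return 0, depth  # annotated primitive, e.g. a logical type


# Hypothetical two-level schema.
sample = {
    "type": "record",
    "name": "order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "customer", "type": ["null", {
            "type": "record",
            "name": "customer",
            "fields": [{"name": "email", "type": "string"}],
        }]},
    ],
}

fields, depth = schema_stats(sample)
print(fields, depth)  # prints: 3 2
```

Here the nested record's own field is counted along with the two top-level fields, and each record adds one level of depth.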
Fixed a bug where the setting for the number of executors in streaming pipelines was ignored.
Fixed a race condition where runtime monitoring failed when programs launched concurrently.
Fixed the preview page in table mode so that it shows multiple inputs and outputs with tabs.
Improved the stability of state transitions for starting pipelines in AppFabric when AppFabric restarts.
Fixed a bug where a metric incorrectly counted the number of records written in the Google Cloud Storage sink.
July 16, 2020
Cloud Data Fusion version 6.1.3 is now available. This version includes performance improvements and minor bug fixes.
- Improved performance of Joiner plugins, aggregators, program startup, and previews.
- Added support for custom images. You can select a custom Dataproc image by specifying the image URI.
- Added support for rendering large schemas (>1000 fields) in the pipelines UI.
- Added payload compression support to the messaging service.
April 22, 2020
Cloud Data Fusion version 6.1.2 is now available. This version includes several stability and performance improvements and new features.
- Added support for Field Level Lineage for Spark plugins and Streaming pipelines
- Added support for Spark 2.4
- Added an option to skip the header in files in delimited, CSV, TSV, and text formats
- Added an option for the database source to replace characters in field names
Reduced preview startup time by 60%. Also added a limit on the maximum number of concurrent preview runs (10 by default).
Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files.
Fixed a bug that caused zombie processes when using the Remote Hadoop Provisioner.
Fixed a bug that caused the DBSource plugin to fail in preview mode.
Fixed a race condition that caused a failure when running a Spark program.
January 10, 2020
Cloud Data Fusion version 6.1.1 is now available. This version includes several stability and performance improvements, as well as these new features:
- Azure Data Lake storage support in Wrangler
- Enabled Field Level Lineage (Beta)
- Data Loss Prevention plugin to identify, tokenize, or encrypt sensitive data at scale (Beta)
December 10, 2019
Cloud Data Fusion version 220.127.116.11 is now available. This version includes several stability and performance improvements.
November 21, 2019
Cloud Data Fusion is now generally available.
Added support for creating Cloud Data Fusion instances that use private IP addresses.
Added support for creating private Cloud Data Fusion instances and executing data pipelines in a VPC-SC environment.
Added support for using customer-managed encryption keys (CMEK) to encrypt resources that Cloud Data Fusion creates in Cloud Storage, BigQuery, and Pub/Sub.
Added reference documentation for creating and managing pipelines and datasets.
The Cloud Data Fusion UI is now available at a different URL in the format:
May 31, 2019
Renamed "Cloud Dataprep service" to "Wrangler service" in the System Admin page of the Cloud Data Fusion UI.
Added a version number field to the Cloud Data Fusion Instance details page in the GCP Console.
Fixed a bug that caused Cloud Data Fusion to launch Cloud Dataproc clusters in an incorrect project.
Added support for specifying a subnet for the Cloud Dataproc provisioner.
Fixed the Cloud Dataproc provisioner to handle networks that do not use automatic subnet creation.
April 10, 2019
Cloud Data Fusion is now publicly available in beta.