This page documents production updates to Cloud Data Fusion. Check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.
You can see the latest product updates for all of Google Cloud on the Google Cloud page, browse and filter all release notes in the Google Cloud Console, or you can programmatically access release notes in BigQuery.
To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly:
September 02, 2021
Features in 6.5.0:
Preview: Cloud Data Fusion now supports role-based access control (RBAC). This gives administrators fine-grained access control over what users can do at the namespace level.
Preview: Cloud Data Fusion now supports customer-managed encryption keys (CMEK), which provide user encryption control over the data written to Google internal resources in tenant projects, and data written by Cloud Data Fusion pipelines.
Preview: Cloud Data Fusion Instance Admins can now create, view, duplicate, delete, import, and export connections from the Pipeline Studio, Wrangler, or the Namespace Admin page. A connection stores sensitive data, such as user credentials and host information, needed to connect to data sources. For more information, see Managing connections.
Preview: Transformation pushdown is now available. It helps you efficiently design and execute ELT workloads by pushing join transformations down to BigQuery. It gives users who prefer ELT in BigQuery access to the same visual experience that ETL users get in Cloud Data Fusion, without needing to maintain complex SQL scripts. When you enable Transformation pushdown, Cloud Data Fusion executes Join operations in BigQuery (instead of Apache Spark). All other stages in a pipeline are executed using Spark. For pipelines that perform multiple complex joins, BigQuery can execute these join operations faster than Spark.
Preview: Dataproc cluster reuse is now available. It can be used to speed up pipeline run startup by reusing clusters from previous runs.
Changes in 6.5.0:
In version 6.5.0, Spark 3 is the new default engine for Cloud Data Fusion Preview and for pipelines that run on Dataproc clusters. After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version uses the latest Dataproc 2.0 image, which bundles Spark 3.1. For more information, see Upgrade notes for Spark 3.
Added support for labels in the Dataproc provisioner.
Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.
Added support to search for tables based on schema name when you select tables for a Replication job.
Added additional trace logging in the authorization flow for debugging.
Added support for the BIGNUMERIC data type for BigQuery targets in Replication.
Behavior change: MySQL, Oracle, Postgres, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.
Fixed in 6.5.0 preview version (for more information, see the CDAP release note):
Fixed an issue in Replication that caused jobs to fail if more than 1000 tables were selected for replication.
Fixed an issue that caused replication jobs to hang when there were too many Delete or DDL events.
Fixed an issue that caused Wrangler to ignore all columns other than the given column when parsing Excel files.
Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards-incompatible change where pipelines did not fail if there was an error and instead were marked as completed.
Improved resilience of TMS.
Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.
Known issue: You can create connections for Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, but the plugin properties do not include Use Connection. This means that you cannot reference a connection in a database source plugin. For more information, see Known issues: Database connections.
August 16, 2021
SQL Server source plugin version 1.5.5 is now available. This version fixes a NullPointerException bug that occurs in version 1.5.4. Versions 1.5.4 and above support the Datetime data type. In versions 1.5.3 and earlier, if you had a Datetime column in your SQL Server source, it mapped to the Timestamp data type. Upgrades to version 1.5.4 are backwards incompatible, but upgrades to version 1.5.5 are compatible. For more information, see Troubleshooting and the CDAP SQL Server Batch Source.
June 23, 2021
Preview: You can now replicate data continuously and in real time from operational data stores in Oracle into BigQuery using the Oracle (by Datastream) plugin. The plugin is available in Cloud Data Fusion version 6.4.0 or later.
June 16, 2021
The SAP accelerator for the order to cash process is now available. It provides sample pipelines that you can use to build your end-to-end order to cash process and analytics with Cloud Data Fusion, BigQuery, and Looker. The accelerator is a sample implementation of the SAP Table Batch Source plugin, which enables bulk data integration from SAP applications with Cloud Data Fusion. The accelerator is available in Cloud Data Fusion environments running version 6.3.0 and above.
May 27, 2021
In Cloud Data Fusion version 6.4.1, Replication supports the Datetime data type in BigQuery targets. You can now read and write to tables that contain Datetime fields.
Fixed in 6.4.1 (for more information, see the CDAP release note):
Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.
Fixed the Join Condition Type so that it is displayed in the Joiner plugin for pipelines that were upgraded from versions before 6.4.0.
Fixed Wrangler so that pipelines fail when there is an error. In Wrangler 6.2 and above, there was a backwards-incompatible change where pipelines did not fail if there was an error and were instead marked as complete.
Fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped ten times.
Fixed an issue that occurred when writing non-null values to a nullable field in BigQuery.
Fixed an issue in the BigQuery plugins to correctly delete temporary storage buckets.
Fixed an issue in the BigQuery sink that caused pipelines to fail when the input schema was not provided.
Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.
Fixed an issue that caused pipelines to fail when a Pub/Sub source Subscription field was a macro.
May 05, 2021
There is an issue in the BigQuery sink plugin version 0.17.0, which causes data pipelines to fail or give incorrect results. This issue is resolved in BigQuery sink plugin version 0.17.1. For more information, see the Cloud Data Fusion Troubleshooting page.
March 31, 2021
Features in 6.4.0:
GA: You can now ingest data from SAP tables with the SAP Table Batch Source plugin.
- BigQuery batch source
- BigQuery sink
- BigQuery multi table sink
- Bigtable batch source
- Bigtable sink
- Datastore batch source
- Datastore sink
- GCS file batch source
- GCS file sink
- GCS multi file sink
- Spanner batch source
- Spanner sink
- File source
- File sink
- Amazon S3 batch source
- Amazon S3 sink
- Database source
You can configure machine type, cluster properties, and idle TTL for the Dataproc provisioner. For the available settings, see the CDAP documentation.
Adding, editing, and deleting comments on draft data pipelines is now supported. For more information, see Adding comments to a data pipeline.
Advanced join conditions are now available in the Joiner plugin. You can specify an arbitrary SQL condition to join on. For more information, see Join Condition Type.
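To illustrate what an arbitrary SQL join condition allows beyond simple key equality, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in SQL engine. The table names, columns, and data are hypothetical, and this is not how the Joiner plugin is configured or executed; it only shows a range-based condition of the kind an equality-only join cannot express.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (assumption:
# this is an illustration, not the engine Cloud Data Fusion uses).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE tiers (tier TEXT, min_amount REAL, max_amount REAL);
    INSERT INTO orders VALUES (1, 25.0), (2, 150.0), (3, 999.0);
    INSERT INTO tiers VALUES
        ('small', 0, 100), ('medium', 100, 500), ('large', 500, 10000);
    """
)

# The join condition is a range check rather than a key equality.
rows = conn.execute(
    """
    SELECT o.order_id, t.tier
    FROM orders AS o
    JOIN tiers AS t
      ON o.amount >= t.min_amount AND o.amount < t.max_amount
    ORDER BY o.order_id
    """
).fetchall()

print(rows)
```

Each order is matched to the single tier whose half-open range contains its amount.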
A new post-action plugin is now available: GCS Done File Marker. To help you orchestrate downstream/dependent processes, this post-action plugin marks the end of a pipeline run by creating and storing an empty SUCCESS file in the given GCS bucket upon a pipeline completion, success, or failure.
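A downstream process might use such a marker file by polling for it before starting work. The sketch below is an assumption-laden illustration: it uses a local temporary directory as a stand-in for the GCS bucket, and the function name `wait_for_marker` and its parameters are hypothetical. A real consumer would check for the object with a Cloud Storage client instead of `Path.exists()`.

```python
import tempfile
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float = 5.0, poll_s: float = 0.1) -> bool:
    """Return True once the marker file exists, or False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_s)
    # One final check so a marker written right at the deadline is not missed.
    return marker.exists()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local stand-in for the marker object in the bucket.
    marker = Path(tmp) / "SUCCESS"
    missing = wait_for_marker(marker, timeout_s=0.3)  # no marker yet
    marker.touch()  # simulate the post-action writing the empty marker file
    found = wait_for_marker(marker, timeout_s=0.3)

print(missing, found)
```

Because the marker can be written on completion, success, or failure, depending on how the plugin is configured, a consumer that should run only after successful runs needs the post-action set up accordingly.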
Changed in version 6.4.0:
Behavior change: When you validate a plugin, macros get resolved with preferences. In previous releases, to validate a plugin's configuration, you had to change the pipeline to remove the macros.
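As a rough illustration of resolving macros from preferences before validation, the sketch below substitutes `${name}` placeholders in a plugin configuration from a dictionary of preferences. It is not Cloud Data Fusion's implementation: the function `resolve_macros`, the property names, and the preference keys are hypothetical, and nested macros and macro functions are not handled.

```python
import re
from typing import Dict

# Matches a simple ${name} macro; nested macros are deliberately not supported.
MACRO = re.compile(r"\$\{([^${}]+)\}")


def resolve_macros(value: str, preferences: Dict[str, str]) -> str:
    """Replace each ${name} with its preference value; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return preferences.get(name, match.group(0))

    return MACRO.sub(substitute, value)


# Hypothetical preferences and plugin properties.
preferences = {"input.bucket": "example-bucket", "run.date": "2021-03-31"}
config = {
    "path": "gs://${input.bucket}/data/${run.date}/*.csv",
    "format": "${file.format}",
}

resolved = {key: resolve_macros(val, preferences) for key, val in config.items()}
# Properties that still contain a macro cannot be fully validated.
unresolved = [key for key, val in resolved.items() if MACRO.search(val)]

print(resolved)
print(unresolved)
```

In this sketch, a property whose macro has no matching preference is left as-is and reported, so a validator could skip it rather than fail.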
Behavior change: Cloud Data Fusion now determines the schema dynamically at runtime instead of requiring arguments to be set. Multi-sink runtime argument requirements have been removed, which lets you add simple transformations in multi-source/multi-sink pipelines. In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, with the schema for each table.
You can now filter tables in the Multiple Database Tables Batch Source.
Multiple Database Batch Source and BigQuery multi-table sink have better error handling and let pipelines continue if one or more tables fail.
Cloud Data Fusion Replication changes:
- Renamed Replication pipelines to Replication jobs.
- The Customer-managed encryption key (CMEK) configuration property is now available for BigQuery targets in your Replication jobs.
- On the BigQuery Target properties page, renamed the Staging Bucket Location property to Location.
- Improved reliability by restarting Replication from the last known checkpoint.
You can now use files with ISO-8859, Windows and EBCDIC encoding types with Amazon S3, File and GCS File Reader batch source plugins.
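To show why the encoding setting matters when reading such files, here is a small sketch using Python's standard codecs. The specific codec names (`iso-8859-1`, `cp1252`, `cp037`) are assumptions chosen as one representative of each encoding family; they are not necessarily the names or variants offered by the plugins.

```python
text = "Café 100"

# One representative Python codec per encoding family (assumed mapping).
codecs_by_family = {
    "ISO-8859": "iso-8859-1",
    "Windows": "cp1252",
    "EBCDIC": "cp037",
}

# Encode the same text in each family, then decode it with the matching codec.
encoded = {family: text.encode(codec) for family, codec in codecs_by_family.items()}
decoded = {
    family: encoded[family].decode(codec)
    for family, codec in codecs_by_family.items()
}

# Reading EBCDIC bytes with the wrong codec yields different, unreadable text.
misread = encoded["EBCDIC"].decode("iso-8859-1")

for family in codecs_by_family:
    print(family, encoded[family].hex(), decoded[family])
```

The byte sequences differ between EBCDIC and the ASCII-based encodings even for plain letters and digits, which is why a source must be told which encoding a file uses.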
Cloud Data Fusion now supports running pipelines on a Hadoop cluster with Kerberos enabled.
Fixed in 6.4.0 (for more information, see the CDAP release note):
- Fixed Bigtable batch source plugin. In previous versions, pipelines that included the Bigtable source would fail.
- FTP batch source now works with empty File System Properties.
- Strings are now supported in Min/Max aggregate functions (used in both Group By and Pivot plugins).
- Fixed the Salesforce plugin to correctly parse the schema as an Avro schema, to ensure that all field names are accepted by Avro.
- Fixed an issue where data pipelines with a BigQuery sink failed with an INVALID_ARGUMENT exception if the range specified was a macro.
- Fixed a class conflict in the Kinesis Spark Streaming source plugin. You can now run pipelines with this source.
- Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException.
- Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record.
- Fixed advanced joins to recognize auto broadcast setting.
- Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts.
- Fixed Replication statistics to display on the dashboard for SQL Server.
- Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job.
- Schema name is now shown when selecting tables to replicate.
- Fixed Replication to correctly insert rows that were previously deleted by a replication job.
- Data pipelines running in a Spark 3-enabled Dataproc cluster no longer fail with a class not found exception.
- Fixed Replication with a SQL Server source to generate rows correctly in the BigQuery target table if the snapshot failed and restarted.
- Fixed an issue where a SQL Server replication job stopped processing data when the connection was reset by the SQL Server.
- Fixed an error in Replication wizard step to select tables, columns and events to replicate, where selecting no columns for a table caused the wizard to fetch all columns in a table.
- Using a macro for a password in a replication job no longer results in an error.
- Fixed logical type display for data pipeline preview runs.
- Fixed Dashboard API to return programs running but started before the startTime.
- Fixed deployed Replication jobs to show advanced configurations in the UI.
- Fixed data pipeline with Python Evaluator transformation to run without stack trace errors.
- Added loading indicator while fetching logs in Log Viewer.
- Fixed Pipeline preview so logical start time function doesn't display as a macro.
- Fixed fields with a list drop down menu in the Replication wizard to default to Select one.
- Added a message in Replication Assessment when there are tables that Cloud Data Fusion cannot access.
- Used error message when an invalid expression is added in Wrangler.
- Fixed RENAME directive in Wrangler so it is case sensitive.
- Fixed the Pipeline Operations UI to stop showing the loading icon forever when it gets an error from the backend.
- Fixed Wrangler to no longer generate invalid reference names.
- Fixed Wrangler to display logical types instead of Java types.
- Fixed pipelines from Wrangler to no longer generate incorrect for xml files.
- Added connection in Wrangler hard codes the name of the JDBC driver.
- Batch data pipelines with the Spark 2.2 engine and HDFS sinks no longer fail with a delegation token error.
FTP Batch Source (system plugin for data pipelines)
FTP Batch Source version 3.0.0 is backward compatible, except that it uses a different artifact. This was done to ensure that updates to the plugin can be delivered out-of-band from Cloud Data Fusion releases, through the Hub.
It is recommended that you use version 3.0.0 or later in your data pipelines.
March 24, 2021
Cloud Data Fusion version 6.3.1 is now available. This version fixes a race condition that results in intermittent failures in concurrent pipeline executions. This release is in parallel with the CDAP 6.3.1 release.
March 17, 2021
Preview: Cloud Data Fusion now supports Access Transparency. Access Transparency is a part of Google's long-term commitment to transparency and user trust. Access Transparency logs record the actions that Google personnel take when accessing customer content. For more information, see the Access Transparency overview.
February 22, 2021
Cloud Data Fusion Beta instances (versions 22.214.171.124 and lower that were created before November 21, 2019) will be turned down on March 1, 2021. Before that date, export your pipeline, delete the old instance to avoid billing impact, create a new instance, and import your pipeline into the new instance.
February 03, 2021
Preview: You can now replicate data continuously and in real time from operational data stores, such as SQL Server and MySQL, into BigQuery.
January 27, 2021
Cloud Data Fusion Beta instances (versions 126.96.36.199 and lower that were created before November 21, 2019) will be turned down on March 1, 2021. Instead, export your pipeline, create a new instance, and import your pipeline into the new instance. This note is incorrect; see entry for February 18, 2021.
January 21, 2021
Cloud Data Fusion 6.3.0 is now available.
In-place upgrades are now supported for minor and patch versions.
You can configure the default system compute profile in the Developer edition starting in Cloud Data Fusion version 6.3.0.
October 27, 2020
You can now specify which Cloud Data Fusion version to use when you create an instance.
You can now specify the service account to use for running your Cloud Data Fusion pipeline on Dataproc.
October 21, 2020
In Cloud Data Fusion versions before 6.2, there is a known issue where pipelines get stuck during execution. Stopping the pipeline results in the following error: "Malformed reply from SOCKS server". To fix this, delete the Dataproc cluster, and then update the memory settings in the compute profile.
September 30, 2020
This release is in parallel with the CDAP 6.2.2 release.
Cloud Data Fusion now supports autoscaling Dataproc clusters.
Cloud Data Fusion now displays the number of pending preview runs, if any, before the current run. In the Studio, the number of pending runs is displayed under the timer.
Improved performance for skewed joins by including Distribution in the Joiner plugin settings.
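The note does not describe how Distribution works internally. One common technique for skewed joins is key salting, sketched below under that assumption: each record on the skewed side gets a random salt, and the smaller side is replicated once per salt, so a single hot key is spread over several buckets. The function `salted_join`, its parameters, and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def salted_join(skewed, small, factor, seed=0):
    """Join two lists of (key, value) pairs using key salting.

    Each skewed record gets a random salt in [0, factor); each small-side
    record is replicated once per salt so every salted key still finds its match.
    Returns a mapping from (key, salt) bucket to the joined rows in that bucket.
    """
    rng = random.Random(seed)
    # Replicate the small side across all salts.
    lookup = defaultdict(list)
    for key, value in small:
        for salt in range(factor):
            lookup[(key, salt)].append(value)
    # Salt the skewed side and join within each (key, salt) bucket.
    buckets = defaultdict(list)
    for key, value in skewed:
        salt = rng.randrange(factor)
        for match in lookup.get((key, salt), []):
            buckets[(key, salt)].append((key, value, match))
    return buckets


# One "hot" key dominates the skewed input.
skewed = [("hot", i) for i in range(1000)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C")]

buckets = salted_join(skewed, small, factor=4)
joined = [row for rows in buckets.values() for row in rows]
hot_bucket_sizes = sorted(
    len(rows) for (key, _salt), rows in buckets.items() if key == "hot"
)

# Reference result from a plain nested-loop join, for comparison.
plain = [(k, v, m) for k, v in skewed for k2, m in small if k == k2]

print(len(joined), hot_bucket_sizes)
```

The joined rows are the same as in a plain join, but the work for the hot key is split across up to `factor` buckets, each of which could be processed by a separate task.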
August 24, 2020
Cloud Data Fusion 6.1.4 provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes.
This release is in parallel with the CDAP 6.1.4 release.
You can now create autoscaling Dataproc clusters.
You can now use the schema support feature in the UI to edit the precision and scale fields.
Cloud Data Fusion now has improved memory performance in pipelines by utilizing a disk-only auto-caching strategy.
Cloud Data Fusion previews now support up to 50 concurrent runs.
Cloud Data Fusion now supports large and deeply nested schemas (>5K fields with 20+ levels of nesting).
Fixed a bug where the setting for the number of executors in streaming pipelines was ignored.
Fixed a race condition where runtime monitoring failed when programs launched concurrently.
Fixed the preview page in table mode so that it shows multiple inputs and outputs with tabs.
Fixed the stability of state transitions for starting pipelines in AppFabric when AppFabric restarts.
Fixed a bug where a metric incorrectly counted the number of records written in the Google Cloud Storage sink.
July 16, 2020
Cloud Data Fusion version 6.1.3 is now available. This version includes performance improvements and minor bug fixes.
- Improved performance of Joiner plugins, aggregators, program startup, and previews.
- Added support for custom images. You can select a custom Dataproc image by specifying the image URI.
- Added support for rendering large schemas (>1000 fields) in the pipelines UI.
- Added payload compression support to the messaging service.
April 22, 2020
Cloud Data Fusion version 6.1.2 is now available. This version includes several stability and performance improvements and new features.
- Added support for Field Level Lineage for Spark plugins and Streaming pipelines
- Added support for Spark 2.4
- Added an option to skip header in the files in delimited, CSV, TSV, and text formats
- Added an option for database source to replace the characters in the field names
Reduced preview startup time by 60%. Also added a limit on the maximum number of concurrent preview runs (10 by default).
Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files.
Fixed a bug that caused zombie processes when using the Remote Hadoop Provisioner.
Fixed a bug that caused DBSource plugin to fail in preview mode.
Fixed a race condition that caused a failure when running a Spark program.
January 10, 2020
Cloud Data Fusion version 6.1.1 is now available. This version includes several stability and performance improvements, as well as these new features:
- Azure Data Lake storage support in Wrangler
- Enabled Field Level Lineage (Beta)
- Data Loss Prevention plugin to identify, tokenize, or encrypt sensitive data at scale (Beta)
December 10, 2019
Cloud Data Fusion version 188.8.131.52 is now available. This version includes several stability and performance improvements.
November 21, 2019
Cloud Data Fusion is now generally available.
Added support for creating Cloud Data Fusion instances that use private IP addresses.
Added support for creating private Cloud Data Fusion instances and executing data pipelines in a VPC-SC environment.
Added support to encrypt resources created in Cloud Storage, BigQuery, and Pub/Sub using Cloud Data Fusion with Customer Managed Encryption Keys.
Added reference documentation for creating and managing pipelines and datasets.
The Cloud Data Fusion UI is now available at a different URL in the format:
May 31, 2019
Renamed "Cloud Dataprep service" to "Wrangler service" in the System Admin page of the Cloud Data Fusion UI.
Added a version number field to the Cloud Data Fusion Instance details page in the GCP Console.
Fixed a bug that caused Cloud Data Fusion to launch Cloud Dataproc clusters in an incorrect project.
Added support for specifying a subnet for the Cloud Dataproc provisioner.
Fixed the Cloud Dataproc provisioner to handle networks that do not use automatic subnet creation.
April 10, 2019
Cloud Data Fusion is now publicly available in beta.