Cloud Data Fusion release notes

This page documents production updates to Cloud Data Fusion. Check this page for announcements about new or updated features, bug fixes, known issues, and deprecated functionality.

You can see the latest product updates for all of Google Cloud on the Google Cloud page, browse and filter all release notes in the Google Cloud console, or programmatically access release notes in BigQuery.

To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly: https://cloud.google.com/feeds/clouddatafusion-release-notes.xml

March 26, 2024

The Amazon Redshift batch source connector version 1.11.1 is available in Preview in Cloud Data Fusion 6.10.0 and later. This source lets you load batch data from your Redshift dataset to a destination, such as BigQuery.

The Amazon Redshift batch source connector version 1.10.6 is available in Preview in Cloud Data Fusion 6.9 versions. This source lets you load batch data from your Redshift dataset to a destination, such as BigQuery.

Cloud Data Fusion is available in the following regions:

  • asia-south2
  • me-central2

For more information, see Pricing.

March 14, 2024

Salesforce plugin version 1.6.3 is available in Cloud Data Fusion versions 6.8.0 and later. The release includes the following changes:

  • Fixed an issue in the Salesforce sink plugin causing an unsupported type datetime error for DateTime type fields in the input schema. In this version, the Salesforce sink plugin supports datetime and decimal logical types (PLUGIN-1749).
  • Fixed an issue in all Salesforce plugins causing a pipeline to fail when using an OAuth macro because the OAuth macro value didn't get passed to the plugin as intended. In this version, all Salesforce plugins support an OAuth macro (PLUGIN-1767).
  • If a failure occurs on the Salesforce sink side and the Error handling property is set to the Fail on error option, the Salesforce job is aborted, which stops newer batches from being added to the job due to the Spark retry settings in CDAP (PLUGIN-1768).
  • To make debugging easier, additional debug logs and batch results are now available in the logs.

March 12, 2024

The Cloud Data Fusion version 6.9.2.3 patch revision is generally available (GA). 6.9.2.3 includes the following fixes:

  • Skipped running MetadataConsumerSubscriberService when Dataplex Data Lineage Integration is disabled (CDAP-20947).

  • Fixed an issue causing runtime arguments of pipeline triggers to not propagate to downstream pipelines (CDAP-20943).

  • Fixed an issue causing pipelines to fail in starting state when the system worker service is intermittently unavailable (CDAP-20956).

  • Fixed an issue causing pipelines to fail in starting state when the Compute Engine metadata server is intermittently unavailable (CDAP-20955).

January 28, 2024

Cloud Data Fusion is available in the following regions:

  • europe-southwest1
  • europe-west8
  • europe-west9
  • europe-west12
  • me-central1
  • southamerica-west1

For more information, see Pricing.

Cloud Data Fusion supports patch revision upgrades in the Google Cloud console. For more information, see Manage patch revisions for instances.

January 27, 2024

Cloud Data Fusion lets you enable and disable Dataplex Lineage, as needed. When you create a new instance in Cloud Data Fusion version 6.8.0 and later, Dataplex Lineage is disabled by default. For more information, see View lineage in Dataplex.

January 16, 2024

Cloud Data Fusion version 6.10.0 is available in Preview. This release is in parallel with the CDAP 6.10.0 release.

Source control management using GitHub is generally available (GA) in Cloud Data Fusion version 6.10.0. With this feature, you can use GitHub to maintain version histories of your ETL and ELT pipelines.

To simplify the experience of synchronizing pipelines between Cloud Data Fusion and GitHub in bulk, pushing and pulling multiple pipelines is available in Preview.

In versions before 6.10.0, you can only establish connectivity between private Cloud Data Fusion instances and customer resources with VPC Peering. Starting in version 6.10.0, you can use Private Service Connect interfaces to establish connectivity using internal IP addresses. Private Service Connect interfaces are easier to set up, use fewer internal IP addresses, and don't require the use of proxies to reach customer resources in peered VPCs, the internet, or on-premises.

You can control access to Google Cloud resources at the Cloud Data Fusion namespace level using an IAM service account. Per Namespace Service Accounts are available in Preview in Cloud Data Fusion version 6.10.0. For more information, see Access control with namespace service accounts.

Changes in Cloud Data Fusion 6.10.0:

  • Role-based access control in Cloud Data Fusion supports using the Cloud Data Fusion API Service Agent service account while emitting lineage events in RBAC-enabled instances. For more information, see View lineage in Dataplex.
  • To create ephemeral clusters, Cloud Data Fusion uses the Dataproc 2.1 image by default. For more information about its limitations in Cloud Data Fusion, see Change the Dataproc image to version 2.1 (CDAP-20841).
  • Cloud Data Fusion uses the autoscaling Dataproc profile by default for ephemeral clusters.
  • In Source control management, GitHub PAT was removed from the Cloud Data Fusion web interface for repository configurations (CDAP-20904 and CDAP-20581).
  • Improved latency when BigQuery pushdown is enabled by fetching artifacts from a local cache (CDAP-20846).
  • The BigQuery sink supports flexible table names and column names (PLUGIN-1718).
  • BigQuery sinks support ingesting data to JSON data type fields (PLUGIN-1692).
  • In BigQuery sink jobs, you can add labels in the form of key-value pairs (PLUGIN-1705).
  • In BigQuery execute jobs, you can add labels in the form of key-value pairs (PLUGIN-1729).
  • Added capability to run concurrent tasks in task workers (CDAP-20868).
  • If a task is stuck, task workers are forcefully restarted (CDAP-20831).
  • The Cloud Storage Java Client is upgraded to version 2.3 and later (PLUGIN-1293).
  • Wrangler is schema-aware (CDAP-15361).
  • Added validation for incorrect credentials in the Amazon S3 source (PLUGIN-1694).

To support the Private Service Connect integration, the following permissions are added to the Cloud Data Fusion API Service Agent role:

  • compute.networkAttachments.get
  • compute.networkAttachments.update
  • compute.networkAttachments.list

Fixed in Cloud Data Fusion 6.10.0:

  • Fixed an issue in the Postgres DB plugin causing macros to be unsupported for database configuration (PLUGIN-1681).
  • Fixed an issue causing slowness in the API while fetching runs for all applications in a namespace (CDAP-20587).
  • Made the following fixes to Wrangler grammar (CDAP-20839):
    • The NUMERIC token type supports negative numbers.
    • The PROPERTIES token type supports one or more properties.
  • Fixed an issue causing columns that have all null values to be dropped in Wrangler (CDAP-20521).
  • Fixed an issue causing pipeline upgrades to not have the intended description (CDAP-20815).

Dataproc 1.5 isn't supported in Cloud Data Fusion version 6.10.0.

The Spark compute engine running on Scala 2.11 is not supported in Cloud Data Fusion version 6.10.0.

December 22, 2023

The Salesforce plugin version 1.6.2 is available in Cloud Data Fusion versions 6.8.0 and later. This version includes the following changes:

  • Fixed an issue in the Salesforce plugin causing the following error in some pipelines that run for more than 4 hours: java.lang.IllegalStateException: SSLException reading next record: javax.net.ssl.SSLException: Connection reset. The Connection timeout property was added to the Salesforce plugin properties in the web interface with a default value of 3600 seconds (PLUGIN-1719).

  • For accuracy, fixed schema handling for referenced object fields: child fields are explicitly marked as non-nullable, regardless of the schema values in the referenced object.

    In earlier versions, when retrieving schema information for fields in referenced objects, such as contact.account_lastmodifieddate, the schema inherited properties from the referenced object, causing incorrect non-nullable assumptions (PLUGIN-1720).

  • A retry mechanism was added in the Salesforce batch source and Multi-Source plugins for connection timeout issues (PLUGIN-1706).

December 14, 2023

The Cloud Data Fusion version 6.9.2.2 patch revision is generally available (GA). 6.9.2.2 includes the following fixes:

  • Increased the speed of the batch /runs API call for pipelines that are run thousands of times (CDAP-20587).
  • Fixed an issue causing draft pipelines to load incorrectly when you enable an accelerator that automatically installs plugins from a custom hub (CDAP-20628).
  • Fixed an issue causing the CDAP service IP to be cached forever (CDAP-20781).
  • Fixed an issue causing a KubeTwillRunnerService error on shutdown (CDAP-20792).
  • Fixed an issue causing pipelines and streaming jobs to fail after an AppFabric restart (CDAP-20797).
  • Fixed an issue causing a slowdown in deploying applications (CDAP-20820).
  • Fixed an issue causing refreshed OAuth tokens to not be logged as expected, making it difficult to identify the root cause of some issues (CDAP-20861).
  • Fixed an issue in replication where you couldn't select the Tink transformation in the web interface (CDAP-20804).

December 06, 2023

Cloud Data Fusion versions 6.5 and 6.6 are no longer supported. You should upgrade your instances to run in a supported version. For instructions, see Manage version upgrades for instances and pipelines.

November 14, 2023

You can apply a patch revision version when you create a new Cloud Data Fusion instance by adding the optional --patch_revision argument to the gcloud beta data-fusion instances create command. For more information, see Manage patch revisions for instances.

You can update the patch revision version of an instance by adding the optional --patch_revision argument to the gcloud beta data-fusion instances update command. For more information, see Manage patch revisions for instances.

October 30, 2023

The Cloud Data Fusion version 6.8.3.1 patch revision is generally available (GA). It fixes a regression that causes a pipeline to fail when using Dataproc secondary workers (CDAP-20807).

The Cloud Data Fusion version 6.9.2.1 patch revision is generally available (GA). It fixes a regression that causes a pipeline to fail when using Dataproc secondary workers (CDAP-20807).

October 23, 2023

Cloud Data Fusion supports patch revisions. These revisions apply bug fixes between major releases. For more information, see Versioning in Cloud Data Fusion.

The Cloud Data Fusion version 6.7.3.1 patch revision is generally available (GA). It introduces retries in the platform transaction layer to handle a PSQLException error thrown from broken database connections (CDAP-19949 and CDAP-20722).

October 18, 2023

The Cloud Data Fusion SAP SLT No RFC Replication plugin version 0.11.0 is available in the Hub in Cloud Data Fusion enterprise edition versions 6.8.0 and later. It differs from the existing SAP SLT Replication plugin in the following ways:

  • All data and metadata file formats are in JSON.
  • No SAP RFC inbound calls occur in the SAP SLT No RFC Replication plugin. Accessing schemas and data from the SAP system no longer requires an SAP connection. Metadata and data extraction are sourced from the Cloud Storage bucket.

September 07, 2023

Cloud Data Fusion version 6.9.2 is generally available (GA). This release is in parallel with the CDAP 6.9.2 release.

Changes in Cloud Data Fusion 6.9.2:

  • Cloud Data Fusion supports setting custom scopes when creating a Dataproc cluster (CDAP-19428).
  • You can set common metadata labels for Dataproc clusters and jobs using the Common Labels property in the Ephemeral Dataproc compute profile (CDAP-20698).
  • You can set labels for the Dataproc jobs using the Common Labels property in the Existing Dataproc compute profile (CDAP-20698).
  • You can set a pipeline runtime argument with the key system.profile.properties.labels and a value representing the labels in the following format: key1|value1;key2|value2. This setting overrides the common labels set in the compute profile for pipeline runs (CDAP-20698).
  • Cloud Data Fusion supports using Dataproc temp buckets in compute profiles (CDAP-20712).
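As an illustration of the key1|value1;key2|value2 label format described above, the following sketch parses such a value into a dictionary. The parse_labels helper is hypothetical and only demonstrates the documented format; it is not part of Cloud Data Fusion.

```python
# Hypothetical helper, for illustration only: parses the documented
# "key1|value1;key2|value2" label format into a dictionary.
def parse_labels(value: str) -> dict:
    labels = {}
    for pair in value.split(";"):
        key, _, val = pair.partition("|")
        labels[key] = val
    return labels

# Example runtime argument, as you might set it for a pipeline run
# (the label keys and values here are made up).
runtime_args = {"system.profile.properties.labels": "team|data-eng;env|prod"}
print(parse_labels(runtime_args["system.profile.properties.labels"]))
# {'team': 'data-eng', 'env': 'prod'}
```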

Fixed in Cloud Data Fusion 6.9.2:

  • Added a retry for Pub/Sub snapshot creation and deletion in a real-time pipeline with a Pub/Sub source when a retryable internal error is thrown (PLUGIN-1660).
  • Fixed a bug causing the Dynamic Spark plugins to fail when running on Dataproc 1.5 (CDAP-20674).
  • Fixed a discrepancy in warning and error counts reported between the pipeline summary tab and system logs (CDAP-20680).
  • Fixed a problem when, in rare cases, a cluster couldn't be found with Cluster Reuse (CDAP-20759).
  • Fixed a bug causing the JavaScript transform to fail on Dataproc 2.1 (CDAP-20778).

September 06, 2023

The SAP ODP plugin version 0.7.5 is available in Cloud Data Fusion versions 6.6.0 to 6.8.0. This release includes the following changes:

  • Fixed an issue causing duplicate records or loss of records due to package acknowledgement occurring too early.
  • Filters that you apply are viewable in the logs.

September 05, 2023

The SAP SuccessFactors Batch Source plugin version 1.2.1 is available in Cloud Data Fusion. The release includes the following changes:

  • Fixed an issue causing an inaccurate error message when the plugin's Associated Entity Name property is invalid.

  • Fixed an issue causing inaccurate validation errors when the Expand Fields property contains a macro.

August 09, 2023

In the SAP Table Batch Source plugin version 0.10.0, fixed an issue causing failed data pipeline runs when you clicked the Take a snapshot toggle because the FIELDS parameters weren't exported.

July 26, 2023

In the SAP SuccessFactors Batch Source plugin version 1.2.0, fixed an issue causing a null pointer exception when you provide a valid entity name that has more than one expand option separated by a '/'.

July 20, 2023

Workforce identity federation is generally available (GA) in Cloud Data Fusion.

The Cloud Data Fusion SAP ODP plugin supports extracting data through CDS views.

Apache Hadoop MapReduce is deprecated in Cloud Data Fusion versions 6.7.0 and later (CDAP-18913).

July 17, 2023

In the SAP Ariba plugin version 1.2.1, a Token Endpoint field, which takes an authentication URL, was added to the plugin properties. The plugin is available in Cloud Data Fusion versions 6.7 and later.

In Cloud Data Fusion versions 6.7 and later, SAP Ariba plugin version 1.2.1 fixes the issue causing the following error after entering authentication credentials: CDF_ARIBA_01501 - Failed to call given Ariba service. This issue occurs when the SAP Ariba sandbox exists in the following locations:

  • US
  • EU
  • UAE
  • KSA

July 13, 2023

The SAP OData plugin (version 0.9.1) is available in the Cloud Data Fusion SAP Hub (all versions) with the following changes:

  • Fixed an issue in the SAP OData batch source causing you not to receive a valid error message if the base URL provided is invalid.
  • A warning has been added to the log message when you provide a batch size that is larger than the maximum allowed batch size.

June 14, 2023

Cloud Data Fusion version 6.9.1 is in Preview. This release is in parallel with the CDAP 6.9.1 release.

Features in Cloud Data Fusion 6.9.1:

  • Cloud Data Fusion supports using Source Control Management to manage pipeline versions through GitHub repositories. Source Control Management is available in Preview (CDAP-20228).

  • Data Catalog Asset Lineage Integration is in GA in versions 6.8.0 and later. In version 6.9.1, it supports the Multiple Database Tables source and the BigQuery Multi Table sink.

  • Cloud Data Fusion supports editing deployed pipelines (CDAP-19425).

  • Cloud Data Fusion supports Window Aggregation operations in Transformation Pushdown to reduce the pipeline execution time by performing SQL operations in BigQuery instead of Spark (CDAP-19628).

  • Cloud Data Fusion supports specifying filters in SQL in Wrangler and the pushdown of SQL filters in Wrangler to BigQuery. In the Wrangler transformation, added support for specifying preconditions in SQL, and added support for transformation pushdown for SQL preconditions. For more information, see Wrangler Filter Pushdown (CDAP-20454).

  • Cloud Data Fusion supports Dataproc driver node groups. To use Dataproc driver node groups, when you create the Dataproc cluster, configure the following properties:

    • yarn:yarn.nodemanager.resource.memory.enforced=false
    • yarn:yarn.nodemanager.admin-env.SPARK_HOME=$SPARK_HOME
  • For the Multiple Database Tables Batch Source, added field-level lineage support (CDAP-20440).

  • Cloud Data Fusion version 6.9.1 supports the Dataproc image 2.1 compute engine, which runs in Java11. If you change the Dataproc image to 2.1, the JDBC drivers that the database plugins use in those instances must be compatible with Java11 (CDAP-20543).

  • Cloud Data Fusion supports the following improvements and changes for real time pipelines with a single Pub/Sub streaming source and no Windower plugins:

    • The Pub/Sub streaming source has built-in support for at-least-once processing.
    • Enabling Spark checkpointing isn't required. The Pub/Sub streaming source creates a Pub/Sub snapshot at the beginning of each batch and removes it at the end of each batch.
    • Pub/Sub snapshot creation has an associated cost. For more information, see Pub/Sub pricing.
    • Snapshot creation can be monitored using Cloud Audit Logs.

    For more information, see Read from a Pub/Sub streaming source (PLUGIN-1537).
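The snapshot-per-batch behavior described above can be sketched as follows. This is a simplified, hypothetical illustration of the at-least-once pattern, not the actual plugin implementation; the stub client and all names are stand-ins.

```python
# Simplified, hypothetical sketch of the snapshot-per-batch pattern:
# a snapshot is created at the start of each batch and deleted only
# after the batch is processed, so a failed batch can be replayed
# (at-least-once processing). PubSubStub stands in for a real client.
class PubSubStub:
    def __init__(self, messages):
        self.backlog = list(messages)
        self.snapshots = {}
        self.processed = []

    def create_snapshot(self, name):
        # Capture the current backlog so it can be replayed on failure.
        self.snapshots[name] = list(self.backlog)

    def delete_snapshot(self, name):
        del self.snapshots[name]


def run_batch(client, batch_id, size):
    snapshot = f"snap-{batch_id}"
    client.create_snapshot(snapshot)        # beginning of the batch
    batch, client.backlog = client.backlog[:size], client.backlog[size:]
    client.processed.extend(m.upper() for m in batch)  # pipeline stand-in
    client.delete_snapshot(snapshot)        # end of the batch
    return batch


client = PubSubStub(["a", "b", "c"])
run_batch(client, 1, size=2)
run_batch(client, 2, size=2)
print(client.processed)   # ['A', 'B', 'C']
print(client.snapshots)   # {} -> each snapshot removed after its batch
```

If processing failed mid-batch, the snapshot would still exist and the batch could be replayed, which is why messages may be processed more than once but never lost.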

Changes in Cloud Data Fusion 6.9.1:

  • Updated Cloud Data Fusion docker image dependencies to include fixes for security vulnerabilities.

  • Added the ability to configure Java options for a pipeline run by setting the system.program.jvm.opts runtime argument (CDAP-20381).

  • Replication pipelines generate logs for stats of events processed by source and target plugins at a fixed interval (CDAP-20140).

  • Streaming pipelines that use Spark checkpointing can use macros if the cdap.streaming.allow.source.macros runtime argument is set to true. Note that macro evaluation will only be performed for the first run in this case, then stored in the checkpoint. It will not be re-evaluated in later runs (CDAP-20455).

  • Improved performance of replication pipelines by caching schema objects for data events (CDAP-20488).

  • Added a launch mode setting to the Dataproc provisioners. When set to Client mode, the program launcher will run in the Dataproc job itself, and not as a separate YARN application. This reduces start-up time and cluster resources required, but may cause failures if the launcher needs more memory, such as when there's an action plugin that loads data into memory (CDAP-20500).

  • Removed duplicate backend calls when a program reads from the secure store (CDAP-20504).

  • Added support to upgrade Pipeline Post-run Action (Pipeline Alerts) plugins during the pipeline upgrade process (CDAP-20567).

  • Added Lifecycle microservices endpoint to delete a streaming application state for Kafka Consumer Streaming and Google Cloud Pub/Sub Streaming sources (CDAP-20466).

Fixed in Cloud Data Fusion 6.9.1:

  • For SQL Server replication sources, fixed an issue on the Review assessment page, where SQL Server DATETIME and DATETIME2 columns were shown as mapped to TIMESTAMP columns in BigQuery. This was a UI bug. The replication job mapped the data types to the BigQuery DATETIME type (CDAP-19389).

  • For replication jobs, fixed an issue where retries for transient errors from BigQuery might have resulted in data inconsistency (CDAP-20276).

  • Fixed an issue where a replication job got stuck in an infinite retry when it failed to process a DDL operation (CDAP-20301).

  • When you duplicate a pipeline, Cloud Data Fusion appends _copy to the pipeline name when it opens in the Pipeline Studio. In previous releases, Cloud Data Fusion appended _<v1, v2, v3> to the name (CDAP-20373).

  • Fixed the pipeline stage validation API to return unevaluated macro values to prevent secure macros from being returned (CDAP-20430).

  • Fixed an issue that sometimes caused pipelines to fail when running pipelines on Dataproc with the following error: Unsupported program type: Spark. The first time a pipeline that only contained actions ran on a newly created or upgraded instance, it succeeded. However, the next pipeline runs, which included sources or sinks, might have failed with this error (CDAP-20431).

  • Fixed an issue where the flow control running count metric (system.flowcontrol.running.count) might be stale if no new pipelines or replication jobs were started (CDAP-20458).

  • Fixed an issue where executor resource settings were not honored when app.pipeline.overwriteConfig was set (CDAP-20549).

  • In the Oracle batch source, the Oracle NUMBER data type defined without precision and scale was, by default, mapped to the Cloud Data Fusion string data type. If an Oracle sink used these fields to insert into a NUMBER data type field in the Oracle table, the pipeline failed due to the incompatibility between the string and NUMBER types. Now, the Oracle sink inserts these string types into NUMBER fields in the Oracle table (PLUGIN-1481).

  • For Oracle batch sources, fixed an issue that caused the pipeline to fail when there was a TIMESTAMP WITH LOCAL TIME ZONE column set to NULLABLE and the source had values that were NULL (PLUGIN-1494).

  • Fixed an issue where pipelines that had a Database batch source and an Oracle sink that used a connection object (using SYSDBA) to connect to an Oracle database failed to establish a connection to the Oracle database. This was due to a package conflict between the Database batch source and the Oracle sink plugins (PLUGIN-1503).

  • Fixed an issue where pipelines failed when the output schema was overridden in certain source plugins. This was because the output schema didn't match the order of the fields from the query. This happened when the pipeline included any of the following batch sources:

    • Database
    • Oracle
    • MySQL
    • SQL Server
    • PostgreSQL
    • DB2
    • MariaDB
    • Netezza
    • Cloud SQL PostgreSQL
    • Cloud SQL MySQL
    • Teradata

    Pipelines no longer fail when you override the output schema in these source plugins. Cloud Data Fusion uses the name of the field to match the schema of the field in the result set and the field in the output schema (PLUGIN-1512).

  • For BigQuery Pushdown, fixed an issue where, when Pushdown was enabled for an existing dataset, the BigQuery sink executed jobs in the location specified in the Pushdown configuration instead of the location of the BigQuery dataset. The configured location should only be used when creating resources. Now, if the dataset already exists, the location of the existing dataset is used (PLUGIN-1513).

  • For the Database sink, fixed an issue where the pipeline didn't fail if there was an error writing data to the database. Now, if there is an error writing data to the database, the pipeline fails and no data is written to the database (PLUGIN-1514).

  • Fixed an issue where the namespace creation flow checked the GET permission on a namespace that didn't exist yet (CDAP-18394).

  • Fixed an issue where Dataproc continued running a job when it couldn't communicate with the Cloud Data Fusion instance if the replication job or pipeline was deleted in Cloud Data Fusion (CDAP-20216).

  • Fixed an issue that caused pipelines with triggers with runtime arguments to fail after the instance was upgraded to Cloud Data Fusion 6.8.0 and 6.9.0 (CDAP-20568).

  • Fixed an issue where arguments set by actions and pipeline triggers didn't overwrite runtime arguments. To enable the fix, add the following runtime argument: system.skip.normal.macro.evaluation=true (CDAP-20597).

  • Fixed an issue that caused the Pipeline Studio page to show an incorrect count of triggers (CDAP-20655).

  • Fixed an issue that caused the Trigger's Payload Config to be missing in the UI for an upgraded instance (CDAP-20660).

  • Fixed a bug where the initial offset was not considered in the Kafka batch source (PLUGIN-1594).

With the introduction of editing deployed pipelines in Cloud Data Fusion 6.9.1, the behavior of some APIs has changed significantly. Due to these changes, some APIs are deprecated (CDAP-20030).

In Cloud Data Fusion 6.9.1, all datasets except FileSet and ExternalDataset are deprecated and will be removed in a future release. All the deprecated datasets use the Table dataset in some form, which only works for programs running with the native provisioner on very old Hadoop releases (CDAP-20667).

June 08, 2023

Cloud Data Fusion version 6.8.3 is generally available (GA). This release is in parallel with the CDAP 6.8.3 release.

Cloud Data Fusion 6.8.3 supports the ability to configure Java options for a pipeline run by setting the system.program.jvm.opts runtime argument (CDAP-20381).

Cloud Data Fusion 6.8.3 supports upgrading Pipeline Post-run Action (Pipeline Alerts) plugins during the pipeline upgrade process (CDAP-20567).

Fixed in 6.8.3:

  • Fixed an issue where the event publish feature did not work with RBAC-enabled instances (CDAP-20375).
  • Fixed an issue where executor resource settings were not honored when app.pipeline.overwriteConfig was set (CDAP-20549).
  • Fixed an issue that caused pipelines with triggers with runtime arguments to fail after the instance was upgraded to Cloud Data Fusion 6.8.0 and later (CDAP-20568).
  • Fixed an issue where arguments set by actions and pipeline triggers didn't overwrite runtime arguments. To enable the fix, add the following runtime argument: system.skip.normal.macro.evaluation=true (CDAP-20597).
  • Fixed an issue that caused the Studio page to show an incorrect count of triggers (CDAP-20655).
  • Fixed an issue that caused the Trigger's Payload Config to be missing in the UI for an upgraded instance (CDAP-20660).
  • Fixed an issue in the BigQuery Sink where the absence of an ordering key caused an exception (PLUGIN-1582).
  • Fixed an issue where initial offset was not considered in the Kafka Batch Source (PLUGIN-1594).

June 07, 2023

Zendesk plugins version 1.2.0 is available in the Cloud Data Fusion Hub. The following changes are included in version 1.2.0:

June 02, 2023

The SAP Ariba Batch Source plugin is generally available (GA). You can connect your data pipeline to an SAP Ariba Source and a BigQuery Sink with this plugin in Cloud Data Fusion versions 6.5.1 and later.

The SAP SuccessFactors Batch Source plugin is GA. You can connect your data pipeline to an SAP SuccessFactors Source and a BigQuery Sink with this plugin in Cloud Data Fusion versions 6.5.1 and later.

April 26, 2023

Cloud Data Fusion version 6.8.2 is generally available (GA). This release is in parallel with the CDAP 6.8.2 release.

Cloud Data Fusion version 6.8.2 fixes an issue in Cloud Data Fusion versions 6.8.0 and 6.8.1 that may cause the following error: Unsupported program type: Spark. The first time a pipeline that only contains actions runs on a newly created or upgraded instance, it succeeds. However, following pipeline runs that also include sources or sinks may fail with this error. For updated settings, see Troubleshooting.

March 29, 2023

In Cloud Data Fusion version 6.8.1, Dataproc clusters no longer require the following OAuth scope to function: https://www.googleapis.com/auth/cloud-platform.

March 22, 2023

In Cloud Data Fusion versions 6.8.0 and 6.8.1, there's a known issue that may cause the following error: Unsupported program type: Spark. The first time a pipeline that only contains actions runs on a newly created or upgraded instance, it succeeds. However, following pipeline runs that also include sources or sinks may fail with this error. For updated settings, see Troubleshooting.

March 20, 2023

Salesforce plugins version 1.4.4 is available in all supported Cloud Data Fusion versions with the following changes:

March 14, 2023

Oracle plugins version 1.8.6 is available in Cloud Data Fusion versions 6.7.1, 6.7.2, and 6.7.3. Oracle plugins version 1.9.2 is available in Cloud Data Fusion versions 6.8.0 and later. These plugin versions have the following changes:

For the Oracle Batch Source, fixed a backward compatibility issue. In plugin versions 1.8.3, 1.9.0, and earlier, Cloud Data Fusion maps the Oracle NUMBER data type with undefined precision and scale to CDAP decimal(38,0), which can cause data loss due to rounding errors. In plugin versions 1.8.4, 1.8.5, and 1.9.1, the Oracle NUMBER data type with undefined precision and scale maps to the CDAP string data type by default, which preserves all decimal digits. In versions 1.8.6 and 1.9.2, the Oracle NUMBER data type with undefined precision and scale is mapped to CDAP string by default, and you can edit the output schema to use the older mapping to the decimal(38,0) data type. For more information, see Oracle batch source plugin converts NUMBER to string (PLUGIN-1535).
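The difference between the two mappings can be seen with Python's decimal module. This is an illustration only; the sample value is made up, and the code is not how the plugin performs the conversion.

```python
# Illustrates why the older mapping to decimal(38,0) (scale 0) could
# lose data through rounding, while the newer string mapping preserves
# all decimal digits. The sample value is made up for this example.
from decimal import ROUND_HALF_UP, Decimal

raw = "123.456"  # an Oracle NUMBER value with undefined precision/scale

# Older default: coerce to a scale-0 decimal -> fractional digits lost.
as_decimal_38_0 = Decimal(raw).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Newer default: keep the value as a string -> all digits preserved.
as_string = raw

print(as_decimal_38_0)  # 123
print(as_string)        # 123.456
```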

March 08, 2023

SAP BW OHD, SAP ODP, SAP OData, SAP SLT, and SAP Table plugins version 0.8 is generally available (GA) in Cloud Data Fusion versions 6.8.0 and later.

February 28, 2023

Cloud Data Fusion version 6.8.1 is generally available (GA). This release is in parallel with the CDAP 6.8.1 release.

Changes in Cloud Data Fusion 6.8.1:

  • Cloud Data Fusion supports upgrades for real-time pipelines that have a Kafka Consumer Streaming source from version 6.8.0 to 6.8.1 after you upgrade your environment.

  • Starting in Cloud Data Fusion 6.8.1, Dataproc clusters require the following OAuth scope to function: https://www.googleapis.com/auth/cloud-platform. This scope is no longer required as of March 29, 2023.

  • March 17, 2023 release note addition: The Oracle Batch Source version 1.9.1 (which comes with Cloud Data Fusion 6.8.1) reads the Oracle NUMBER data type with undefined precision and scale as a string in Cloud Data Fusion. In previous versions, Cloud Data Fusion reads the Oracle NUMBER data type with undefined precision and scale as decimal(38,0), which could result in data loss. For more information, see Troubleshooting (PLUGIN-1119).

Fixed in 6.8.1:

  • Fixed an issue in instances with role-based access control where deployment and validations failed with a read timeout error after a short time.

  • Fixed an issue that caused replication assessment for Oracle by Datastream to fail when an instance had role-based access control.

  • Fixed an issue that allowed reading secure keys in the system namespace with only the Data Fusion Viewer role (datafusion.viewer) or Instance Accessor role (datafusion.accessor). For more information about predefined roles for role-based access control in Cloud Data Fusion, see the Role-based access control (RBAC) overview.

  • Fixed an issue where the Replication Source plugin's event reader was not stopped by the Delta worker when there were errors, which caused leakage of the plugin's resources (CDAP-20394).

  • Fixed an issue that occurred in certain upgrade scenarios, where pipelines didn't have the Use Connection property set and the UI didn't display a plugin's connection properties, such as Project ID and Service Account Type (CDAP-20392).

  • Fixed an issue that caused pipelines to fail if they used a connection property, such as the Service Account JSON property, that used a secure macro with JSON as the value (CDAP-20271).

  • For Oracle by Datastream Replication sources, fixed an issue where the Review Assessment page would freeze when a selected or manually-entered table didn't exist in the source database (CDAP-20257).

  • For Oracle by Datastream Replication sources, fixed an issue where the Select tables and transformations page failed to load the list of tables and displayed a deadline exceeded error when the source database contained a large number of tables (CDAP-20199).

  • Fixed an error in security-enabled instances that caused pipeline launch to fail and return a token expired error when evaluating secure macros in provisioner properties (CDAP-20146).

  • For MySQL Replication sources, fixed an issue that caused Replication jobs to fail during the initial snapshot, if the job included a runtime argument with the Debezium property, binary-handling-mode (CDAP-20121).

  • For Replication jobs, increased the retry duration for API calls that update state and offsets (CDAP-20028).

  • Fixed an issue that prevented upgrades for MySQL and SQL Server Replication jobs in version 6.6.0. Upgrades are supported from version 6.6.0 to 6.7.3 and 6.8.1 (CDAP-19622).

  • Fixed upgrades for Oracle by Datastream Replication jobs. You can upgrade Oracle by Datastream Replication jobs from Cloud Data Fusion versions 6.6.0 and 6.7 to version 6.8.1 (CDAP-20013).

Cloud Data Fusion version 6.7.3 is generally available (GA). This release is in parallel with the CDAP 6.7.3 release.

March 17, 2023 release note addition: The Oracle Batch Source version 1.8.5 (which comes with Cloud Data Fusion 6.7.3) reads the Oracle NUMBER data type with undefined precision and scale as a string in Cloud Data Fusion. In plugin versions 1.8.3 and earlier, Cloud Data Fusion reads the Oracle NUMBER data type with undefined precision and scale as decimal (38,0), which could result in data loss. For more information, see Troubleshooting (PLUGIN-1119).

Fixed in 6.7.3:

  • Fixed an issue that allowed reading secure keys in the system namespace with only the Data Fusion Viewer role (datafusion.viewer) or Instance Accessor role (datafusion.accessor). For more information about predefined roles for role-based access control in Cloud Data Fusion, see the Role-based access control (RBAC) overview.

  • Fixed an issue in the BigQuery Replication Target plugin that caused Replication jobs to fail when the BigQuery target table already existed. The new version of the plugin is automatically used in new Replication jobs (CDAP-19599).

  • Fixed an issue that prevented upgrades for MySQL and SQL Server Replication jobs in version 6.6.0. Upgrades are supported from version 6.6.0 to 6.7.3 and 6.8.1 (CDAP-19622).

  • Fixed an issue that prevented upgrades for Oracle by Datastream Replication jobs in version 6.6.0. Upgrades are supported from versions 6.6.0, 6.7.0, 6.7.1, and 6.7.2 to version 6.7.3 (CDAP-20013).

  • Fixed an issue that caused pipelines to fail if they used a connection property, such as the Service Account JSON property, which used a secure macro with JSON as the value (CDAP-20271).

  • Fixed an issue that occurred in certain upgrade scenarios, where pipelines didn't have the Use Connection property set and the UI didn't display a plugin's connection properties, such as Project ID and Service Account Type (CDAP-20392).

  • Fixed an issue where the Replication Source plugin's event reader was not stopped by the Delta worker when there were errors, which caused leakage of the plugin's resources (CDAP-20394).

  • Fixed an error in security-enabled instances that caused pipeline launch to fail and return a token expired error when evaluating secure macros in provisioner properties (CDAP-20146).

  • In the Oracle Batch Source (version 1.8.4 and earlier), when the source data included fields with the NUMBER data type with undefined precision and scale, Cloud Data Fusion read it as decimal and set the precision to 38 and the scale to 0. If any values in the field had a scale other than 0, the values were rounded, which could have resulted in data loss. If the scale for this field was overridden in the plugin output schema, the pipeline failed.

    In Oracle Batch Source version 1.8.5, you can edit the scale of the CDAP decimal data type in the output schema. The overridden value is used to map to the Oracle NUMBER data type without failing the pipeline. If there are any numbers present in the fields with a scale greater than the scale defined in the plugin, Cloud Data Fusion rounds the values based on the scale you set in the output schema. For example, if you specify precision=10, scale=3, the value 123.4567 is rounded to 123.457. For more information about setting precision and scale in a plugin, see Changing the precision and scale for decimal fields in the output schema (PLUGIN-1433).

  • Improved performance for batch pipelines with MySQL sinks (PLUGIN-1374).

  • For Database plugins (version 2.9.3), fixed a security issue where the database username and password were exposed in the logs (CDAP-20235).
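
The rounding described in the Oracle Batch Source 1.8.5 fix above can be sketched with Python's decimal module. This is a minimal illustration, not the plugin's implementation; the HALF_UP rounding mode is an assumption that matches the 123.4567 to 123.457 example in the note:

```python
from decimal import Decimal, ROUND_HALF_UP

def round_to_schema_scale(value: str, scale: int) -> Decimal:
    """Round a value to the scale set in the plugin output schema."""
    quantum = Decimal(1).scaleb(-scale)  # scale=3 -> Decimal("0.001")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

# With precision=10, scale=3, the value 123.4567 is rounded to 123.457.
round_to_schema_scale("123.4567", 3)  # Decimal('123.457')
```

Values whose scale already fits the schema pass through unchanged; only extra fractional digits trigger rounding.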

February 23, 2023

FTP Plugins versions 3.1.0 and 3.2.0 are generally available (GA) in Cloud Data Fusion versions 6.7.2+ and 6.8.0+, respectively. They include support for more file formats and properties. An issue was fixed in the FTP Batch Source that caused pipelines to fail when running with Dataproc 2.0. For more information, see the CDAP Hub release log.

January 05, 2023

The SAP SuccessFactors Batch Source plugin is available in Preview. You can connect your data pipeline to an SAP SuccessFactors Source and a BigQuery Sink with this plugin in Cloud Data Fusion versions 6.5.1 and later.

December 15, 2022

Cloud Data Fusion integrates with Data Catalog for asset-level lineage in Preview.

December 14, 2022

Cloud Data Fusion is available in the following regions:

  • us-east5
  • us-south1

For more information, see Locations and Pricing.

December 07, 2022

Cloud Data Fusion is available in the following region:

  • me-west1

For more information, see Locations and Pricing.

December 06, 2022

Cloud Data Fusion version 6.8.0 is in Preview. This release is in parallel with the CDAP 6.8.0 release.

Features in 6.8.0:

In Cloud Data Fusion 6.8.0, Reference name is no longer mandatory for the following plugins:

  • BigQuery Source
  • BigQuery Sink
  • Dataplex Source
  • Dataplex Sink
  • Spanner Sink
  • GCS Sink

For these plugins, unique identifiers in lineage are generated from their configuration properties. For example, project ID + dataset + table is used as the unique identifier for BigQuery. The identifier appears in the lineage diagram. For more information, see Cloud Data Fusion Plugins.

Changes in 6.8.0:

  • For Replication jobs with an Oracle (by Datastream) source, ensured data consistency when multiple CDC events are generated with the same timestamp, by ordering events reliably.

  • For Oracle replication sources, added a purge policy for a Cloud Storage bucket created by the plugin where Datastream writes its output.

  • In the Oracle replication source, added the GCS Bucket Location property, where Datastream writes its output.

  • In the Oracle replication source, added the list of Datastream regions to the Region property. You no longer need to manually enter the Datastream region.

  • The Oracle replication source identifies each row by the Primary key of the table. Previously, the plugin identified each row by the ROWID.

  • For Replication jobs, improved performance for Review Assessment.

  • Splitter Transformation-based plugins have access to the prepareRun() and onRunFinish() methods.

  • In Wrangler, added the Average arithmetic function, which calculates the average of the selected columns.

  • In Wrangler, Numeric functions support three or more columns.

  • In the Dataplex Sink plugin, added the Update Dataplex Metadata property, which adds support for updating metadata in Dataplex for newly generated data.

  • In the GCS Delete Action plugin, added support for bulk deletion of files and folders. You can use the asterisk (*) wildcard character to match any characters.
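
The wildcard selection added to the GCS Delete Action plugin can be illustrated with glob-style matching. This is a sketch of the semantics only; the plugin operates on Cloud Storage objects, and the object names below are hypothetical:

```python
from fnmatch import fnmatch

# Hypothetical object paths; "logs/*" selects everything under logs/ for deletion.
objects = ["logs/2022-01.csv", "logs/2022-02.csv", "data/keep.csv"]
to_delete = [name for name in objects if fnmatch(name, "logs/*")]
# to_delete == ["logs/2022-01.csv", "logs/2022-02.csv"]
```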

Fixed in 6.8.0:

  • For custom Dataproc compute profiles, fixed the issue causing the wrong Cloud Storage bucket to be used to stage data. Cloud Data Fusion uses the bucket specified in the custom compute profile.

  • Fixed the issue in the BigQuery Replication Target plugin causing Replication jobs to fail when the BigQuery target table already existed. The new version of the plugin is used by default in new Replication jobs. To use the new plugin version in existing jobs, recreate the job.

  • Fixed an issue causing the Replication Assessment to get stuck when the Oracle (by Datastream) storage bucket property was empty, or had an invalid bucket name. Cloud Data Fusion returns a 400 error code during assessment when the property is empty or has an invalid bucket name.

  • Fixed an issue causing Replication jobs to fail when the source column name didn't comply with BigQuery table naming conventions. In 6.8.0, if a source column name doesn't comply with BigQuery naming conventions, Cloud Data Fusion replaces invalid characters with an underscore, prepends an underscore when the first character is a number, and truncates the name when it exceeds the maximum length.

  • In the File batch source, fixed an issue causing Get Schema to appear only when Format was set to delimited. Now, Get Schema appears for all formats.

  • Fixed an issue with the output schema when connecting a Splitter transformation with a Joiner transformation.

  • Fixed an issue causing imports in the Cloud Data Fusion UI to fail for pipelines exported through the Pipeline Microservices.

  • In the Oracle Batch Source, when the source data included fields with the Numeric data type (undefined precision and scale), Cloud Data Fusion set the precision to 38 and the scale to 0. If any values in the field had scale other than 0, Cloud Data Fusion truncated the values, which could have resulted in data loss. If the scale for a field was overridden in the plugin output schema, the pipeline failed. For more information, see the CDAP 6.8.0 bug fixes.

  • In the Wrangler transformation, fixed an issue that caused the pipeline not to fail when the Error Handling property was set to Fail Pipeline. This happened when an error was returned, but no exception was thrown, and there were 0 records in the output. For example, this happened when one of the directives (such as parse-as-simple-date) failed because the input data wasn't in the correct format. This fix is not available by default. Instead, contact support if you need this fix for your projects.

  • In Wrangler, fixed the issue causing the Wrangler page to get stuck when a BigQuery table name contained non-alphanumeric characters, such as underscores. Wrangler imports BigQuery tables that follow BigQuery table naming conventions.
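
The column-name handling in the fix above, for source column names that don't comply with BigQuery naming conventions, can be sketched as follows. The 128-character limit and the exact set of valid characters are assumptions for illustration, not the plugin's documented constants:

```python
import re

MAX_LEN = 128  # assumed maximum column-name length

def sanitize_column_name(name: str) -> str:
    # Replace characters that are invalid in BigQuery column names with "_".
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Prepend "_" when the first character is a number.
    if name and name[0].isdigit():
        name = "_" + name
    # Truncate names that exceed the maximum length.
    return name[:MAX_LEN]

sanitize_column_name("order-id")   # 'order_id'
sanitize_column_name("1st_value")  # '_1st_value'
```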

Upgrading the Cloud Data Fusion version for Replication jobs is broken, so upgrading Replication jobs to Cloud Data Fusion version 6.8.0 isn't recommended.

December 01, 2022

Google Cloud Platform Plugins version 0.20.4 is generally available (GA) in Cloud Data Fusion versions 6.7.1 and 6.7.2. This version includes Dataplex Source and Sink plugins in GA. For more information, see the CDAP Hub release log.

Google Cloud Platform Plugins version 0.19.3 is generally available (GA) in Cloud Data Fusion version 6.6.0. This version includes Dataplex Source and Sink plugins in GA. For more information, see the CDAP Hub release log.

November 30, 2022

Cloud Data Fusion version 6.4 is no longer supported as of December 1, 2022. You should upgrade your instances to a supported version. For instructions, see Upgrading your Cloud Data Fusion environment.

November 08, 2022

DNS Resolution is generally available (GA). You can use domain or hostnames for sources instead of IP addresses for pipeline design-time activities, such as getting schema, wrangling, and previewing pipelines.

October 27, 2022

Cloud Data Fusion version 6.7.2 is generally available (GA). This release is in parallel with the CDAP 6.7.2 release.

In Cloud Data Fusion version 6.7.2, the default machine type changed from N2 to E2.

Fixed in 6.7.2:

  • In the BigQuery Sink plugin (version 0.20.3), fixed an issue that caused a NullPointerException error when table metrics were updated or when the output schema was not defined.
  • In the Send Email batch pipeline alert, fixed an issue where emails failed to send when the Protocol was set to TLS.
  • (Release note added January 5, 2023) In the Cloud Storage Sink and File Sink plugins, fixed an issue that caused Decimal values to be formatted incorrectly.

August 15, 2022

Cloud Data Fusion version 6.7.1 is generally available (GA). This release is in parallel with the CDAP 6.7.1 release.

Changes in 6.7.1:

  • Enhanced the Dataproc provisioner to prevent unneeded Compute Engine calls, depending on the configuration settings.

  • For new Dataproc compute profiles, changed the default value of Master Machine Type from n1 to n2.

Fixed in 6.7.1:

  • Fixed an issue that created duplicate entries in the file cache map, which resulted in multiple attempts to delete the same cache file.

  • Fixed an issue where the Log service left empty folders, which made the mounting of Persistent Disk slow. This caused the Log service not to start in a timely manner.

  • Fixed an issue that caused pipelines to take a long time to launch or get stuck. This was linked to I/O throttling that occurred on the underlying Persistent Disk.

  • Fixed an issue that caused pipelines to fail when two or more pipelines were scheduled to start simultaneously on a static Dataproc cluster. This was due to a file upload race condition.

  • Fixed an issue in flow control that caused Appfabric to return a 5xx error code in rare scenarios, instead of 429 (Too Many Requests Error), if the number of concurrently launching or running pipelines was above certain thresholds.

  • Fixed an issue that resulted in an error when a compute profile was exported from the default namespace after switching from a custom namespace.

  • Delete, edit, and export no longer appear as operations for pre-configured connections in Cloud Data Fusion. You can view connection details for those connections.

  • Fixed an issue where starting a pipeline multiple times and then stopping it before it completed resulted in the following UI error: Program <program name> is not running.

  • Removed verbose logs from the BigQuery client libraries in pipeline logs.

  • Fixed an issue that caused the BigQuery Execute action plugin configured with an Encryption Key Name (CMEK) to fail when the SQL query contained DDL Statements.

  • In the BigQuery Execute action plugin, added the property Store Results in a BigQuery Table to the UI, which hides the destination table related properties by default.

The SAP Ariba Batch Source plugin is available in Preview. You can connect your data pipeline to an SAP Ariba Source and a BigQuery Sink with this plugin in Cloud Data Fusion versions 6.5.1 and later.

August 01, 2022

Cloud Data Fusion versions 6.1, 6.2, and 6.3 are no longer supported. You should upgrade your instances to a supported version. For instructions, see Upgrading your Cloud Data Fusion environment.

June 09, 2022

Cloud Data Fusion version 6.7.0 is in Preview. This release is in parallel with the CDAP 6.7.0 release.

Features in 6.7.0:

  • Connection Management is generally available (GA).
  • DNS Resolution is available in Preview. Cloud Data Fusion supports using domain or hostnames for sources instead of IP addresses for pipeline design-time activities, such as getting schema, wrangling, and previewing pipelines.
  • Transformation Pushdown is generally available (GA) for JOIN operations. Several new transformations are available in Preview, including Group By and deduplicate aggregations. Added support for the BigQuery Storage Read API to improve performance when extracting data from BigQuery. For more information, see the Transformation Pushdown overview.
  • Dataplex Source and Sink plugins are available in Preview as system plugins in Cloud Data Fusion. You no longer need to install the plugins.

Changes in 6.7.0:

  • Increased pipeline launch and run scalability in Enterprise instances.
  • In Transformation Pushdown, added the ability to use existing connections.
  • Added the ability to parse files before loading data into a Wrangler workspace.
  • Added the ability to import the schema for JSON and some Avro formats, where schema inference isn't possible, before loading data into the Wrangler workspace.
  • In Connection Management:
    • Added the ability to edit connections.
    • Added support for connections for several plugins and sinks.
    • Added the ability to browse partial hierarchies, such as BigQuery datasets and Dataplex zones.
  • In the Cloud Storage Done File Marker Post-Action plugin, added support for the Location property, which lets you have buckets and customer-managed encryption keys in non-US locations.
  • In the BigQuery Execute action plugin and the BigQuery Argument Setter action plugin, added support for the Dataset Project ID property, which is the project ID of the dataset that stores the query results. It's required if the dataset is in a different project than the BigQuery job.
  • In BigQuery sinks, added support for the BigNumeric data type.
  • In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.
  • In Cloud Data Loss Prevention plugins, added support for templates from other projects.
  • Added a new pipeline state for when you manually stop a pipeline run: Stopping.
  • In the BigQuery Execute plugin, added the ability to look up the drive scope for the service account to read from external tables created from the drive.
  • Improved the generic Database source plugin to correctly read decimal data.
  • Improved the Google Cloud Platform plugins to validate the Encryption Key Name property.
  • In the replication configurations, added the ability to enable soft deletes from a BigQuery target.
  • In Wrangler, added support for nested arrays, such as the BigQuery STRUCT data type.
  • In the Cloud Storage File Reader Batch Source plugin, added the Allow Empty Input property.
  • In the Cloud Storage File Reader Batch Source and Amazon S3 Batch Source plugins, added the Enable Quoted Values property, which lets you treat content between quotes as a value.
  • In the Joiner transformation, added the Input with Larger Data Skew property.
  • Behavior change: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline doesn't stop after 6 hours, the pipeline is forcefully terminated.
  • Behavior change: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property isn't set, a random record is chosen from the group of duplicate records.
  • Behavior change: The BigQuery sink supports Nullable Arrays. A NULL array is converted to an empty array at insertion time.
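
The Enable Quoted Values behavior added to the Cloud Storage File Reader and Amazon S3 batch sources above can be illustrated with standard CSV quoting, where content between quotes is treated as a single value even if it contains the delimiter. This is a sketch of the semantics, not the plugin code:

```python
import csv
import io

# With quoted-value handling, "b,c" is one field despite the embedded comma.
row = next(csv.reader(io.StringIO('a,"b,c",d')))
# row == ['a', 'b,c', 'd']
```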

Fixed in 6.7.0:

  • Fixed an issue in the Group By transformation where Longest String and Shortest String aggregators returned an empty string, even when all records contained null values in the specified field. The Group By transformation returns null for empty input.
  • Fixed an issue in the Group By transformation that caused the Concat and Concat Distinct aggregate functions to produce incorrect results in some cases.
  • Fixed an issue in the Group By transformation that caused the Variance, Variance If, and Standard Deviation aggregate functions to produce incorrect results in some cases.
  • In the Oracle and MySQL Batch Source plugins, fixed an issue so that all timestamps from the database, including those older than the Gregorian cutover date (October 15, 1582), are treated in Gregorian calendar format.
  • Improved the generic Database source plugin to correctly read data when the data type is NUMBER, scale is set, and the data contains integer values.
  • Fixed an issue in sources (such as File and Cloud Storage) that resulted in an error if you clicked Get Schema when the source file contained delimiters used in regular expressions, such as "|" or ".". You no longer need to escape delimiters for sources.
  • Fixed an issue where Datastore sources read a maximum of 300 records. Datastore sources read all records.
  • Fixed an issue in BigQuery sinks where the output table was not partitioned correctly in the following cases:
    • The output table didn't exist.
    • Partitioning type was set to Time.
    • Operation was set to Upsert.
  • Fixed an issue that caused pipelines with BigQuery sinks that have input schemas with nested array fields to fail.
  • Fixed issues that caused failures when reading maps and named enums from Avro files.

Cloud Data Fusion version 6.7.0 does not support Dataproc version 1.3. For more information, see the compatible versions of Dataproc.

May 23, 2022

Google Cloud Platform Plugins version 0.19.1 is generally available (GA). This version includes Dataplex Source and Sink plugins in Preview. For more information, see the CDAP Hub release log.

April 18, 2022

Google Drive Plugins version 1.4.0 is generally available (GA). For more information, see the CDAP Hub release log.

April 01, 2022

Cloud Data Fusion version 6.6.0 is generally available (GA).

(Release note added March 14, 2023) Role-based access control (RBAC) is generally available (GA) in Cloud Data Fusion 6.6.0 and later. This gives administrators fine-grained access control over what users can do at the namespace level.

March 31, 2022

The SAP SLT Replication plugin is generally available (GA). You can replicate your data continuously and in real time from SAP sources into BigQuery with this plugin in Cloud Data Fusion versions 6.4.0 and later.

February 25, 2022

Cloud Data Fusion version 6.6.0 is in Preview. This release is in parallel with the CDAP 6.6.0 release.

Features in 6.6.0:

  • Cluster reuse is generally available (GA).
  • Predefined autoscaling is available in Preview.
  • Cloud Data Fusion flow control prevents you from submitting too many requests, which can cause stuck or failed pipeline runs. It applies to API and scheduled pipeline launch requests for batch and real-time pipelines and replication jobs. It is available in Preview.

Fixed in 6.6.0:

  • Improved instance stability. Fixed a number of system service unavailability cases resulting in the message "Necessary services are experiencing intermittent problems" and API call failures.
  • Fixed an issue that caused pipelines to fail when a Database Batch Source included a decimal column with precision greater than 19.
  • Fixed an issue that caused pipelines with a Conditional plugin that were running on MapReduce to fail.
  • Fixed an issue that caused pipelines with a Conditional plugin and running on Spark to fail.
  • Fixed an issue that caused validation to fail for Cloud Storage Multi File Sinks.

January 31, 2022

The SAP OData plugin is generally available (GA). You can use SAP as a source for batch-based data extraction in Cloud Data Fusion using the Open Data Protocol (OData). This plugin is available in any Cloud Data Fusion edition, version 6.4.0 and later.

December 21, 2021

The SAP ODP plugin is generally available (GA). You can use SAP as a source for batch-based and delta-based data extraction in Cloud Data Fusion through Operational Data Provisioning (ODP). This plugin is available in any Cloud Data Fusion edition, version 6.4.0 and later.

November 17, 2021

January 7, 2022 correction: Cloud Data Fusion is not yet available in the Santiago (southamerica-west1) region. For available locations, see Locations.

Cloud Data Fusion is now available in the Santiago (southamerica-west1) region.

November 05, 2021

Cloud Data Fusion version 6.5.1 is now available. This release is in parallel with the CDAP 6.5.1 release.

GA: Cloud Data Fusion now supports Customer-Managed Encryption Keys (CMEK), which provides user encryption control over the data written to Google internal resources in tenant projects, and data written by Cloud Data Fusion pipelines. The list of supported plugins has also expanded.

Fixed in 6.5.1 (for more information, see the CDAP release note):

  • Fixed an issue where messages could not be retrieved for Kafka topics.
  • Fixed an issue where you could not create a profile for an existing Dataproc cluster.
  • Fixed an issue that caused pipelines to fail when Transformation Pushdown was enabled and used macros as properties.
  • Fixed an issue that caused long running programs, like Replication, to fail within the default Hadoop delegation token timeout. Now, these tokens get renewed so that the job keeps running.
  • Fixed an issue in Replication that caused an error when you clicked Configure.
  • Fixed an issue that caused a pipeline to fail with an ACCESS DENIED error when running BigQuery jobs. For more information, see Troubleshooting.
  • Fixed an issue in the Cloud Storage connection that prevented browsing and parsing files stored in folders under buckets.
  • Fixed an issue that caused custom formats to be unusable in the Cloud Storage source and sink.

September 29, 2021

Preview: You can now use SAP as a source for batch-based and delta-based data extraction in Cloud Data Fusion through Operational Data Provisioning (ODP). For more information, see the SAP ODP plugin overview. This plugin is available in Cloud Data Fusion version 6.4.0 and later.

September 02, 2021

Preview: Cloud Data Fusion version 6.5.0 is now available. This version is a Preview. This release is in parallel with the CDAP 6.5.0 release.

Features in 6.5.0:

  • Preview: Cloud Data Fusion now supports role-based access control (RBAC). This gives administrators fine-grained access control over what users can do at the namespace level.

  • Preview: Cloud Data Fusion now supports customer-managed encryption keys (CMEK), which provide user encryption control over the data written to Google internal resources in tenant projects, and data written by Cloud Data Fusion pipelines.

  • Preview: Cloud Data Fusion Instance Admins can now create, view, duplicate, delete, import, and export connections from the Pipeline Studio, Wrangler, or the Namespace Admin page. A connection stores sensitive data, such as user credentials and host information, needed to connect to data sources. For more information, see Managing connections.

  • Preview: Transformation pushdown is now available. It helps you efficiently design and execute ELT workloads by pushing join transformations down to BigQuery. It gives users that prefer ELT in BigQuery access to the same visual experience that ETL users get in Cloud Data Fusion, without needing to maintain complex SQL scripts. When you enable Transformation pushdown, Cloud Data Fusion executes Join operations in BigQuery (instead of Apache Spark). All other stages in a pipeline are executed using Spark. For pipelines that perform multiple complex joins, BigQuery can execute these join operations faster than Spark.

  • Preview: Dataproc cluster reuse is now available. It can be used to speed up pipeline run startup by reusing clusters from previous runs.

Changes in 6.5.0:

  • In version 6.5.0, Spark 3 is the new default engine when using Cloud Data Fusion Preview and when running pipelines on Dataproc clusters. After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version uses the latest Dataproc 2.0 image, which has Spark 3.1 bundled. For more information, see Upgrade notes for Spark 3.

  • Added support for labels in the Dataproc provisioner.

  • Added Shielded VMs to the configuration settings for the Dataproc provisioner. For more information, see the CDAP documentation.

  • Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.

  • Added support to search for tables based on schema name when you select tables for a Replication job.

  • Added trace logging in the authorization flow for debugging.

  • Added support for the BIGNUMERIC data type for BigQuery targets in Replication.

  • Behavior change: MySQL, Oracle, Postgres, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.

Fixed in 6.5.0 preview version (for more information, see the CDAP release note):

  • Fixed an issue in Replication that caused jobs to fail if more than 1000 tables were selected for replication.

  • Fixed an issue that caused replication jobs to hang when there were too many Delete or DDL events.

  • Fixed an issue that caused Wrangler to ignore all columns other than the given column when parsing Excel files.

  • Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards-incompatible change where pipelines did not fail if there was an error and instead were marked as completed.

  • Improved resilience of TMS.

  • Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.

You can create connections for Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, but the plugin properties do not include Use Connection. This means that you cannot reference a connection in a database source plugin. For more information, see Known issues: Database connections.

August 16, 2021

SQL Server source plugin version 1.5.5 is now available. This version fixes a NullPointerException bug that occurs in version 1.5.4. Versions 1.5.4 and above support the Datetime data type. In versions 1.5.3 and earlier, if you had a Datetime column in your SQL Server source, it mapped to the Timestamp data type. Upgrades to version 1.5.4 are backwards incompatible, but upgrades to version 1.5.5 are compatible. For more information, see Troubleshooting and the CDAP SQL Server Batch Source.

June 23, 2021

Preview: You can now replicate data continuously and in real time from operational data stores in Oracle into BigQuery using the Oracle (by Datastream) plugin. The plugin is available in Cloud Data Fusion version 6.4.0 or later.

June 16, 2021

The SAP accelerator for the order-to-cash process is now available. It provides sample pipelines that you can use to build your end-to-end order-to-cash process and analytics with Cloud Data Fusion, BigQuery, and Looker. The accelerator is a sample implementation of the SAP Table Batch Source plugin, which enables bulk data integration from SAP applications with Cloud Data Fusion. The accelerator is available in Cloud Data Fusion environments running version 6.3.0 and later.

May 27, 2021

Cloud Data Fusion version 6.4.1 is now available. To upgrade, see Upgrading instances and pipelines. This release is in parallel with the CDAP 6.4.1 release.

In Cloud Data Fusion version 6.4.1, Replication supports the Datetime data type in BigQuery targets. You can now read and write to tables that contain Datetime fields.

Fixed in 6.4.1 (for more information, see the CDAP release note):

  • Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

  • Fixed the Join Condition Type so that it is displayed in the Joiner plugin for pipelines that were upgraded from versions before 6.4.0.

  • Fixed Wrangler so that pipelines fail when there is an error. In Wrangler 6.2 and above, there was a backwards-incompatible change where pipelines did not fail if there was an error and were instead marked as complete.

  • Fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped ten times.

  • Fixed an issue with writing non-null values to a nullable field in BigQuery.

  • Fixed an issue in the BigQuery plugins to correctly delete temporary storage buckets.

  • Fixed an issue in the BigQuery sink that caused pipelines to fail when the input schema was not provided.

  • Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.

  • Fixed an issue that caused pipelines to fail when a Pub/Sub source Subscription field was a macro.

May 05, 2021

There is an issue in the BigQuery sink plugin version 0.17.0, which causes data pipelines to fail or give incorrect results. This issue is resolved in BigQuery sink plugin version 0.17.1. For more information, see the Cloud Data Fusion Troubleshooting page.

March 31, 2021

Cloud Data Fusion version 6.4.0 is now available. To upgrade, see Upgrading instances and pipelines. This release is in parallel with the CDAP 6.4.0 release.

Features in 6.4.0:

  • GA: You can now ingest data from SAP tables with the SAP Table Batch Source plugin.

  • Cloud Data Fusion now supports the Datetime data type in the following plugins. You can now read and write to tables that contain Datetime fields:

    • BigQuery batch source
    • BigQuery sink
    • BigQuery multi table sink
    • Bigtable batch source
    • Bigtable sink
    • Datastore batch source
    • Datastore sink
    • GCS file batch source
    • GCS file sink
    • GCS multi file sink
    • Spanner batch source
    • Spanner sink
    • File source
    • File sink
    • Wrangler
    • Amazon S3 batch source
    • Amazon S3 sink
    • Database source
  • You can configure machine type, cluster properties, and idle TTL for the Dataproc provisioner. For the available settings, see the CDAP documentation.

  • Adding, editing, and deleting comments on draft data pipelines is now supported. For more information, see Adding comments to a data pipeline.

  • Advanced join conditions are now available in the Joiner plugin. You can specify an arbitrary SQL condition to join on. For more information, see Join Condition Type.

  • A new post-action plugin is now available: GCS Done File Marker. To help you orchestrate downstream or dependent processes, this post-action plugin marks the end of a pipeline run by creating and storing an empty SUCCESS file in a given GCS bucket upon pipeline completion, success, or failure.
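A downstream process can watch for this marker before it starts its own work. The sketch below is illustrative, not part of the plugin: the bucket path and marker name are hypothetical, and the storage check is injected as a callable so that in production you would pass a wrapper around a real GCS client (for example, `Blob.exists` from `google-cloud-storage`).

```python
import time

# Hypothetical marker URI; the actual bucket and object path come from
# how you configure the GCS Done File Marker post-action in your pipeline.
MARKER = "gs://my-pipeline-bucket/markers/daily_load/__SUCCESS"

def wait_for_marker(exists_fn, marker=MARKER, timeout_s=600, poll_s=30):
    """Poll until exists_fn reports that the marker object exists, or time out.

    exists_fn is any callable taking the marker URI and returning bool.
    In production this would wrap a google-cloud-storage existence check;
    here it is pluggable so the logic can run without GCS credentials.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if exists_fn(marker):
            return True
        time.sleep(poll_s)
    return False

# Dry run with a stub standing in for the real storage check:
seen = {MARKER}
assert wait_for_marker(seen.__contains__, timeout_s=1, poll_s=0.1)
```

Injecting the existence check also makes the wait loop trivially testable, which matters for orchestration code that otherwise only fails in production.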


Changed in version 6.4.0:

  • Behavior change: When you validate a plugin, macros get resolved with preferences. In previous releases, to validate a plugin's configuration, you had to change the pipeline to remove the macros.
  • Behavior change: Cloud Data Fusion now determines the schema dynamically at runtime instead of requiring arguments to be set. Multi-sink runtime argument requirements have been removed, which lets you add simple transformations in multi-source, multi-sink pipelines. In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, with the schema for each table.
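For context, the old pattern meant one runtime argument per output table, keyed by table name and carrying that table's schema as a JSON-encoded Avro schema. A sketch of what those arguments looked like (the `multisink.<table>` key convention and the field names are illustrative of the earlier behavior, not an exact specification):

```json
{
  "multisink.customers": "{\"type\":\"record\",\"name\":\"customers\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"email\",\"type\":[\"string\",\"null\"]}]}",
  "multisink.orders": "{\"type\":\"record\",\"name\":\"orders\",\"fields\":[{\"name\":\"order_id\",\"type\":\"long\"},{\"name\":\"total\",\"type\":\"double\"}]}"
}
```

From 6.4.0 on, these per-table schema arguments are no longer required because the schema is resolved at runtime.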

  • You can now filter tables in the Multiple Database Tables Batch Source.

  • The Multiple Database Tables Batch Source and BigQuery multi-table sink have better error handling and let pipelines continue if one or more tables fail.

  • Cloud Data Fusion Replication changes:

    • March 17, 2023 release note addition: Replication from SQL Server and MySQL is generally available (GA) in Cloud Data Fusion version 6.4.0.
    • Renamed Replication pipelines to Replication jobs.
    • The Customer-managed encryption key (CMEK) configuration property is now available for BigQuery targets in your Replication jobs.
    • On the BigQuery Target properties page, renamed the Staging Bucket Location property to Location.
    • Improved reliability by restarting Replication from the last known checkpoint.
  • You can now use files with ISO-8859, Windows, and EBCDIC encoding types with the Amazon S3, File, and GCS File Reader batch source plugins.

  • Cloud Data Fusion now supports running pipelines on a Hadoop cluster with Kerberos enabled.

Fixed in 6.4.0 (for more information, see the CDAP release note):

  • Fixed Bigtable batch source plugin. In previous versions, pipelines that included the Bigtable source would fail.
  • FTP batch source now works with empty File System Properties.
  • Strings are now supported in Min/Max aggregate functions (used in both Group By and Pivot plugins).
  • Fixed Salesforce plugin to correctly parse the schema as Avro schema to be sure all the field names are accepted by Avro.
  • Fixed data pipeline with BigQuery sink that failed with INVALID_ARGUMENT exception if the range specified was a macro.
  • Fixed a class conflict in the Kinesis Spark Streaming source plugin. You can now run pipelines with this source.
  • Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException.
  • Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record.
  • Fixed advanced joins to recognize auto broadcast setting.
  • Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts.
  • Fixed Replication statistics to display on the dashboard for SQL Server.
  • Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job.
  • Schema name is now shown when selecting tables to replicate.
  • Fixed Replication to correctly insert rows that were previously deleted by a replication job.
  • Data pipelines running on a Spark 3-enabled Dataproc cluster no longer fail with a class-not-found exception.
  • Fixed Replication with a SQL Server source to generate rows correctly in BigQuery target table if snapshot failed and restarted.
  • Fixed an issue where a SQL Server replication job stopped processing data when the connection was reset by SQL Server.
  • Fixed an error in the Replication wizard step for selecting tables, columns, and events to replicate, where selecting no columns for a table caused the wizard to fetch all of the table's columns.
  • Using a macro for a password in a replication job no longer results in an error.
  • Fixed logical type display for data pipeline preview runs.
  • Fixed the Dashboard API to return programs that are running but were started before the startTime.
  • Fixed deployed Replication jobs to show advanced configurations in the UI.
  • Fixed data pipeline with Python Evaluator transformation to run without stack trace errors.
  • Added loading indicator while fetching logs in Log Viewer.
  • Fixed Pipeline preview so logical start time function doesn't display as a macro.
  • Fixed fields with a drop-down list in the Replication wizard to default to Select one.
  • Added a message in Replication Assessment when there are tables that Cloud Data Fusion cannot access.
  • Improved the error message that appears when an invalid expression is added in Wrangler.
  • Fixed the RENAME directive in Wrangler so that it is case-sensitive.
  • Fixed the Pipeline Operations UI so that it no longer shows the loading icon indefinitely when it receives an error from the backend.
  • Fixed Wrangler to no longer generate invalid reference names.
  • Fixed Wrangler to display logical types instead of java types.
  • Fixed pipelines generated from Wrangler to no longer produce incorrect output for XML files.
  • Fixed a bug where a connection added in Wrangler hard-coded the name of the JDBC driver.
  • Batch data pipelines with the Spark 2.2 engine and HDFS sinks no longer fail with a delegation token error.

FTP Batch Source (system plugin for data pipelines)

FTP Batch Source version 3.0.0 is backward compatible, except that it uses a different artifact. This was done to ensure that updates to the plugin can be delivered out-of-band from Cloud Data Fusion releases, through the Hub.

It is recommended that you use version 3.0.0 or later in your data pipelines.

March 24, 2021

Cloud Data Fusion version 6.3.1 is now available. This version fixes a race condition that resulted in intermittent failures in concurrent pipeline executions. This release is in parallel with the CDAP 6.3.1 release.

March 17, 2021

Preview: Cloud Data Fusion now supports Access Transparency. Access Transparency is a part of Google's long-term commitment to transparency and user trust. Access Transparency logs record the actions that Google personnel take when accessing customer content. For more information, see the Access Transparency overview.

February 22, 2021

Cloud Data Fusion Beta instances (versions 6.1.0.2 and lower that were created before November 21, 2019) will be turned down on March 1, 2021. Before then, export your pipelines, delete the old instance to avoid billing impact, create a new instance, and import your pipelines into the new instance.

February 03, 2021

Preview: You can now replicate data continuously and in real time from operational data stores, such as SQL Server and MySQL, into BigQuery.

January 27, 2021

Cloud Data Fusion Beta instances (versions 6.1.0.2 and lower that were created before November 21, 2019) will be turned down on March 1, 2021. Instead, export your pipeline, create a new instance, and import your pipeline into the new instance. This note is incorrect; see the entry for February 22, 2021.

January 21, 2021

Cloud Data Fusion 6.3.0 is now available.

In-place upgrades are now supported for minor and patch versions.

You can configure the default system compute profile in the Developer edition starting in Cloud Data Fusion version 6.3.0.

October 27, 2020

You can now specify which Cloud Data Fusion version to use when you create an instance.

You can now specify the service account to use for running your Cloud Data Fusion pipeline on Dataproc.

October 21, 2020

In Cloud Data Fusion versions before 6.2, there is a known issue where pipelines get stuck during execution. Stopping the pipeline results in the following error: Malformed reply from SOCKS server. To fix this, delete the Dataproc cluster, and then update the memory settings in the compute profile.

September 30, 2020

This release is in parallel with the CDAP 6.2.2 release.

Cloud Data Fusion now supports autoscaling Dataproc clusters.

Cloud Data Fusion now displays the number of pending preview runs, if any, before the current run. In the Studio, the number of pending runs is displayed under the timer.

Improved performance for skewed joins by including Distribution in the Joiner plugin settings.

Wrangler now supports BigQuery views and materialized views.

August 24, 2020

Highlights

Cloud Data Fusion 6.1.4 provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes.

This release is in parallel with the CDAP 6.1.4 release.

You can now create autoscaling Dataproc clusters.

You can now use the schema support feature in the UI to edit the precision and scale fields.

Cloud Data Fusion now has improved memory performance in pipelines by utilizing a disk-only auto-caching strategy.

Cloud Data Fusion previews now support up to 50 concurrent runs.

Cloud Data Fusion now supports large and deeply nested schemas (>5K fields with 20+ levels of nesting).

Fixed a bug where the setting for the number of executors in streaming pipelines was ignored.

Fixed a race condition where runtime monitoring failed when programs launched concurrently.

Fixed the preview page in table mode so that it shows multiple inputs and outputs with tabs.

Fixed the stability of state transitions for starting pipelines in AppFabric when AppFabric restarts.

Fixed a bug where a metric incorrectly counted the number of records written in the Google Cloud Storage sink.

July 16, 2020

Cloud Data Fusion version 6.1.3 is now available. This version includes performance improvements and minor bug fixes.

  • Improved performance of Joiner plugins, aggregators, program startup, and previews.
  • Added support for custom images. You can select a custom Dataproc image by specifying the image URI.
  • Added support for rendering large schemas (>1000 fields) in the pipelines UI.
  • Added payload compression support to the messaging service.

April 22, 2020

Cloud Data Fusion version 6.1.2 is now available. This version includes several stability and performance improvements and new features.

  • Added support for Field Level Lineage for Spark plugins and Streaming pipelines
  • Added support for Spark 2.4
  • Added an option to skip header in the files in delimited, CSV, TSV, and text formats
  • Added an option for database source to replace the characters in the field names

Reduced preview startup time by 60%. Also added a limit on the maximum number of concurrent preview runs (10 by default).

Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files.

Fixed a bug that caused zombie processes when using the Remote Hadoop Provisioner.

Fixed a bug that caused the DBSource plugin to fail in preview mode.

Fixed a race condition that caused a failure when running a Spark program.

January 10, 2020

Cloud Data Fusion version 6.1.1 is now available. This version includes several stability and performance improvements, as well as these new features:

  • Azure Data Lake storage support in Wrangler
  • Enabled Field Level Lineage (Beta)
  • Data Loss Prevention plugin to identify, tokenize, or encrypt sensitive data at scale (Beta)

December 10, 2019

Cloud Data Fusion version 6.1.0.5 is now available. This version includes several stability and performance improvements.

November 21, 2019

Cloud Data Fusion is now generally available.

Added support for creating Cloud Data Fusion instances that use private IP addresses.

Added support for creating private Cloud Data Fusion instances and executing data pipelines in a VPC-SC environment.

Added support to encrypt resources created in Cloud Storage, BigQuery, and Pub/Sub using Cloud Data Fusion with Customer Managed Encryption Keys.

Added reference documentation for creating and managing pipelines and datasets.

The Cloud Data Fusion UI is now available at a different URL in the format: <instance-name>-<project-id>-dot-<region identifier>.datafusion.googleusercontent.com.
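The pieces of that URL can be assembled mechanically. The helper below is a small illustrative sketch following the format stated above; the instance name, project ID, and region identifier values are hypothetical placeholders, and a real region identifier is assigned by the service.

```python
def ui_url(instance_name: str, project_id: str, region: str) -> str:
    """Build the Cloud Data Fusion UI URL from its parts, per the
    format <instance-name>-<project-id>-dot-<region identifier>."""
    return (f"https://{instance_name}-{project_id}-dot-{region}"
            ".datafusion.googleusercontent.com")

# Hypothetical example values; "usw1" stands in for a real region identifier.
print(ui_url("my-instance", "my-project", "usw1"))
# → https://my-instance-my-project-dot-usw1.datafusion.googleusercontent.com
```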

May 31, 2019

Renamed "Cloud Dataprep service" to "Wrangler service" in the System Admin page of the Cloud Data Fusion UI.

Added a version number field to the Cloud Data Fusion Instance details page in the GCP Console.

Fixed a bug that caused Cloud Data Fusion to launch Cloud Dataproc clusters in an incorrect project.

Added support for specifying a subnet for the Cloud Dataproc provisioner.

Fixed the Cloud Dataproc provisioner to handle networks that do not use automatic subnet creation.

April 10, 2019

Cloud Data Fusion is now publicly available in beta.