Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines.
The Cloud Data Fusion web UI allows you to build scalable data integration solutions to clean, prepare, blend, transfer, and transform data, without having to manage the infrastructure.
Cloud Data Fusion is powered by the open source project CDAP. Throughout this page, there are links to the CDAP documentation site, where you can find more detailed information.
You can work with Cloud Data Fusion through the visual web UI or through command-line tools.
Using the code-free web UI
When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion web UI.
In the Google Cloud Console, you create a Google Cloud project, create and delete Cloud Data Fusion instances (unique deployments of Cloud Data Fusion), and view Cloud Data Fusion instance details.
In the Cloud Data Fusion UI, you use the various pages, such as Pipeline Studio or Wrangler, to visually design data pipelines and use Cloud Data Fusion functionality.
At a high level, you do the following:
1. Create a Cloud Data Fusion instance in the Google Cloud Console.
2. Find your Cloud Data Fusion instance in the Cloud Console Instances page, and click the View instance link in the Action column. This opens the Cloud Data Fusion UI in a new browser tab.
3. Use the various pages in the Cloud Data Fusion web UI to visually design your pipelines and manage metadata.
Using command-line tools
As an alternative to the web UI, you can use command-line tools to create and manage your Cloud Data Fusion instances and pipelines.
The REST reference describes the API for creating and managing your Cloud Data Fusion instances on Google Cloud.
The CDAP reference describes the REST API for creating and managing pipelines and datasets.
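As a rough illustration of the CDAP side of this split, the sketch below builds the URL that would start a deployed batch pipeline. The path layout (`/v3/namespaces/.../apps/.../workflows/DataPipelineWorkflow/start`) reflects the CDAP lifecycle API as commonly documented, but treat it as an assumption and confirm it against the CDAP reference. The endpoint, namespace, and pipeline name are hypothetical placeholders, and the code only builds a string; it sends no request and handles no authentication.

```python
from urllib.parse import quote


def pipeline_start_url(api_endpoint: str, namespace: str, pipeline: str) -> str:
    """Build the URL for starting a deployed batch pipeline.

    Assumption: batch pipelines run as a workflow named
    DataPipelineWorkflow under the CDAP v3 lifecycle API.
    """
    base = api_endpoint.rstrip("/")
    return (
        f"{base}/v3/namespaces/{quote(namespace, safe='')}"
        f"/apps/{quote(pipeline, safe='')}"
        "/workflows/DataPipelineWorkflow/start"
    )


# Hypothetical values for illustration only.
url = pipeline_start_url("https://example.invalid/api/", "default", "my_pipeline")
print(url)
```

An HTTP POST to a URL of this shape, with an appropriate access token, is the kind of call the CDAP reference describes for starting a pipeline.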
Core concepts
This section introduces some of the core concepts of Cloud Data Fusion. Some sections link to the CDAP documentation, where you can learn about each concept in more detail.
Cloud Data Fusion instance
A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started with Cloud Data Fusion, you create a Cloud Data Fusion instance through the Cloud Console.
You can create multiple instances in a single Cloud Console project and can specify the Google Cloud region to create your Cloud Data Fusion instances in.
Based on your requirements and cost constraints, you can create a Developer, Basic, or Enterprise instance.
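The choices described above (project, region, and edition) map onto the request you would send to the instance-management REST API. The following is a minimal sketch that only assembles the URL and JSON body. The URL shape and the `type` field are assumptions based on the public `v1` API and should be checked against the REST reference; the project, region, and instance ID are hypothetical.

```python
import json

# Editions named in this page; assumed to match the API's enum values.
VALID_TYPES = {"DEVELOPER", "BASIC", "ENTERPRISE"}


def create_instance_request(project: str, region: str, instance_id: str, edition: str):
    """Return (url, json_body) for a hypothetical instance-creation call."""
    edition = edition.upper()
    if edition not in VALID_TYPES:
        raise ValueError(f"unknown edition: {edition}")
    url = (
        "https://datafusion.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/instances"
        f"?instanceId={instance_id}"
    )
    body = json.dumps({"type": edition})
    return url, body


# Hypothetical values for illustration only.
url, body = create_instance_request("my-project", "us-central1", "my-instance", "basic")
print(url)
print(body)
```

Validating the edition locally keeps an obviously wrong value from reaching the API, but the service remains the authority on which editions and regions are available.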
Each Cloud Data Fusion instance contains a unique, independent Cloud Data Fusion deployment with a set of services that handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run on long-running resources in a tenant project.
Execution environment
Cloud Data Fusion creates ephemeral execution environments to run pipelines when you manually run your pipelines or when pipelines run through a time schedule or a pipeline state trigger. Cloud Data Fusion supports Dataproc as an execution environment, in which you can choose to run pipelines as MapReduce, Spark, or Spark Streaming programs. Cloud Data Fusion provisions an ephemeral Dataproc cluster in your customer project at the beginning of a pipeline run, executes the pipeline using MapReduce or Spark in the cluster, and then deletes the cluster after the pipeline execution is complete.
Alternatively, if you manage your Dataproc clusters in controlled environments, through technologies like Terraform, you can also configure Cloud Data Fusion not to provision clusters. In such environments, you can run pipelines against existing Dataproc clusters.
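The two modes described above, an ephemeral cluster per run versus an existing cluster, can be pictured as a small lifecycle model. This is a conceptual sketch in plain Python, not Cloud Data Fusion or Dataproc code: the function names and the `ephemeral-cluster` label are hypothetical, and the "events" list stands in for real provisioning calls.

```python
from contextlib import contextmanager


@contextmanager
def execution_cluster(events, existing_cluster=None):
    """Yield a cluster name, provisioning and deleting it only if ephemeral."""
    if existing_cluster is not None:
        # Existing cluster: nothing is provisioned or deleted.
        events.append(f"reuse {existing_cluster}")
        yield existing_cluster
        return
    name = "ephemeral-cluster"  # hypothetical name
    events.append(f"provision {name}")
    try:
        yield name
    finally:
        # The cluster is deleted even if the run raised an error.
        events.append(f"delete {name}")


def run_pipeline(pipeline, events, existing_cluster=None):
    """Record the lifecycle steps for one pipeline run."""
    with execution_cluster(events, existing_cluster) as cluster:
        events.append(f"run {pipeline} on {cluster}")
    return events


print(run_pipeline("daily_load", []))
print(run_pipeline("daily_load", [], existing_cluster="shared-cluster"))
```

The `try`/`finally` mirrors the idea that an ephemeral cluster exists only for the duration of one run, while a pre-existing cluster is left untouched.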
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
Recommended: Use the autoscaling option for all pipelines that don't leverage Analytics plugins, such as Distinct, Group By, Joiner, Deduplicate, or Row Denormalizer.
Not recommended: Autoscaling is not intended for scaling on-cluster HDFS. In Cloud Data Fusion, if you perform aggregations, such as grouping and joining data, autoscaling might cause your pipelines to run slowly or throw errors.
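The recommendation above amounts to a simple check on which plugins a pipeline uses. A minimal sketch follows, assuming a pipeline is described by a list of plugin names. The set below contains only the examples named in this page, so it is not an exhaustive list of Analytics plugins, and the helper is hypothetical.

```python
# Analytics plugins named as examples in this page (not exhaustive).
ANALYTICS_PLUGINS = {
    "Distinct",
    "Group By",
    "Joiner",
    "Deduplicate",
    "Row Denormalizer",
}


def autoscaling_recommended(plugin_names):
    """Return True when none of the listed Analytics plugins is present."""
    return not any(name in ANALYTICS_PLUGINS for name in plugin_names)


# Hypothetical pipelines for illustration.
print(autoscaling_recommended(["GCS", "Wrangler", "BigQuery"]))
print(autoscaling_recommended(["GCS", "Joiner", "BigQuery"]))
```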
Pipeline
A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources. Building pipelines allows you to create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs.
Pipelines enable you to express your data processing workflows using the logical flow of data, while Cloud Data Fusion handles all the functionality that is required to physically run them in an execution environment. The Cloud Data Fusion planner transforms the logical flow into parallel computations, using Apache Spark and Apache Hadoop MapReduce on Dataproc.
Pipelines are represented by a series of nodes arranged in a directed acyclic graph (DAG), forming a one-way flow. Nodes represent the various actions you can take with your pipelines, such as reading from sources, performing data transformations, and writing output to sinks. You can develop data pipelines in the Cloud Data Fusion web UI by connecting together sources, transformations, sinks, and other nodes.
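To make the DAG idea concrete, the sketch below represents a pipeline as a list of (upstream, downstream) connections and derives an order in which the nodes could run. It uses Python's standard `graphlib` module (Python 3.9 or later) and is only an illustration of a directed acyclic graph; it is not how the Cloud Data Fusion planner works, and the node names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(edges):
    """Return node names so that every upstream node precedes its downstream nodes.

    Raises CycleError if the connections do not form a DAG.
    """
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream depends on upstream.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


# Hypothetical three-node pipeline: source -> transform -> sink.
print(execution_order([("source", "transform"), ("transform", "sink")]))

# A connection that loops back is rejected, because the flow must be one-way.
try:
    execution_order([("a", "b"), ("b", "a")])
except CycleError:
    print("not a DAG")
```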
Additionally, by providing access to logs and metrics, pipelines offer a simple way for administrators to operationalize their data processing workflows without the need for custom tooling.
Learn more about pipelines on the CDAP documentation site.
Plugin
A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion. Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions. If you need a plugin that isn't provided, you can develop a custom plugin yourself.
A plugin is sometimes referred to as a node, usually in the context of the Cloud Data Fusion web UI.
The following table describes the various categories of plugins available in Cloud Data Fusion.
|Category|Description|
|---|---|
|Sources|Sources are connectors to databases, files, or real-time streams from which you obtain your data. They enable you to ingest data, using a simple UI, so you don't have to worry about coding low-level connections.|
|Analytics|Analytics plugins are used to perform aggregations such as grouping and joining data from different sources, as well as running analytics and machine learning operations. Cloud Data Fusion provides built-in plugins for a variety of such use cases.|
|Actions|Action plugins define custom actions that are scheduled to take place during a workflow but don't directly manipulate data in the workflow. For example, using the Database custom action, you can run an arbitrary database command at the end of your pipeline. Alternatively, you can trigger an action to move files within Cloud Storage.|
|Sinks|Data must be written to a sink. Cloud Data Fusion contains various sinks, such as Cloud Storage, BigQuery, Spanner, relational databases, file systems, and mainframes.|
|Error collectors|When nodes encounter null values, logical errors, or other sources of errors, you can use an error collector plugin to catch errors. You can connect this plugin to the output of any transform or analytics plugin, and it will catch errors that match a condition you define. You can then process these errors in a separate error processing flow in your pipeline.|
|Alert publishers|Alert Publisher plugins allow you to publish notifications when uncommon events occur. Downstream processes can then subscribe to these notifications to trigger custom processing for these alerts.|
|Conditionals|Pipelines offer control flow plugins in the form of conditionals. Conditional plugins allow you to branch your pipeline into two separate paths, depending on whether the specified condition predicate evaluates to true or false.|
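The error collector row describes routing problem records into a separate flow. The idea can be sketched in plain Python as a split of one record stream into two. This is a conceptual illustration only and does not use the Cloud Data Fusion plugin APIs; the function, the field names, and the error message format are hypothetical.

```python
def split_errors(records, required_field):
    """Separate records with a null or missing required field from valid ones."""
    ok, errors = [], []
    for record in records:
        if record.get(required_field) is None:
            # Keep the original record so the error flow can inspect it.
            errors.append({"record": record, "error": f"missing {required_field}"})
        else:
            ok.append(record)
    return ok, errors


# Hypothetical records for illustration.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}, {"id": 3}]
valid, failed = split_errors(rows, "email")
print(len(valid), len(failed))
```

In a pipeline, the valid records would continue to the next transform or sink, while the failed records would feed a separate error-processing branch.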
If a plugin you need does not exist, you can build your own plugin by using the Cloud Data Fusion plugin APIs.
Compute profile
A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the pipeline's physical execution environment. For example, a profile includes the type of cloud provider (such as Google Cloud), the service to use on the cloud provider (such as Dataproc), credentials, resources (memory and CPU), image, minimum and maximum node count, and other values.
A profile is identified by name and must be assigned a provisioner and its related configuration. A profile can exist either at the Cloud Data Fusion instance level or at the namespace level.
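One way to picture a profile is as a small record with a name, a provisioner, a scope, and provisioner-specific settings. The sketch below models that with a dataclass. It is not the CDAP profile schema: the field names, the `instance` and `namespace` scope labels, and the `min_nodes` and `max_nodes` keys are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComputeProfile:
    """Hypothetical model of a compute profile."""

    name: str
    provisioner: str                 # for example "dataproc"
    scope: str = "namespace"         # "instance" or "namespace"
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.name:
            raise ValueError("a profile must have a name")
        if not self.provisioner:
            raise ValueError("a profile must be assigned a provisioner")
        if self.scope not in ("instance", "namespace"):
            raise ValueError(f"unknown scope: {self.scope}")
        lo = self.properties.get("min_nodes")
        hi = self.properties.get("max_nodes")
        if lo is not None and hi is not None and lo > hi:
            raise ValueError("min_nodes must not exceed max_nodes")


# Hypothetical profile for illustration.
profile = ComputeProfile(
    name="small-batch",
    provisioner="dataproc",
    scope="instance",
    properties={"min_nodes": 2, "max_nodes": 10},
)
print(profile.name, profile.scope)
```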
Learn more about profiles on the CDAP documentation site.