Cloud Data Fusion is a fully managed, cloud-native, enterprise data
integration service for quickly building and managing data pipelines. The
Cloud Data Fusion web interface lets you build scalable data integration
solutions. It lets you connect to various data sources, transform the data, and
then transfer it to various destination systems, without having to manage the
infrastructure.
Cloud Data Fusion is powered by the open source project
CDAP.
Get started with Cloud Data Fusion
You can start exploring Cloud Data Fusion in minutes.
The main components of Cloud Data Fusion are explained in the following
sections.
Tenant project
The set of services required to build and orchestrate Cloud Data Fusion
pipelines and store pipeline metadata is provisioned in a tenant project,
inside a tenancy unit. A separate tenant project is created for each customer
project in which Cloud Data Fusion instances are provisioned. The tenant
project inherits all the networking and firewall configurations from the
customer project.
Cloud Data Fusion: Console
The Cloud Data Fusion console, also referred to as the control plane, is a
set of API operations and a web interface that deal with the
Cloud Data Fusion instance itself, such as creating, deleting, restarting,
and updating it.
Cloud Data Fusion: Studio
Cloud Data Fusion Studio, also referred to as the data plane, is a set of
REST API and web interface operations that deal with the creation,
execution, and management of pipelines and related artifacts.
Concepts
This section introduces some of the core concepts of Cloud Data Fusion.
A Cloud Data Fusion instance is a unique deployment of
Cloud Data Fusion. To get started with Cloud Data Fusion, you
create a Cloud Data Fusion instance through the
Google Cloud console.
You can create multiple instances in a single Google Cloud project
and specify the Google Cloud region in which each
Cloud Data Fusion instance is created.
Each Cloud Data Fusion instance is a unique, independent
Cloud Data Fusion deployment that includes a set of services,
which handle pipeline lifecycle management, orchestration,
coordination, and metadata management. These services run on
long-running resources in a tenant project.
A namespace is a logical grouping of applications, data, and the
associated metadata in a Cloud Data Fusion instance. You can think
of namespaces as a partitioning of the instance. In a single instance,
one namespace stores the data and metadata of an entity independently
from another namespace.
A pipeline is a way to visually design data and control
flows to extract, transform, blend, aggregate, and load data from
various on-premises and cloud data sources.
Building pipelines lets you create complex data processing
workflows that can help you solve data ingestion, integration, and
migration problems. You can use Cloud Data Fusion to build both
batch and real-time pipelines, depending on your needs.
Pipelines let you express your data processing workflows as a
logical flow of data, while Cloud Data Fusion handles everything
that is required to physically run them in an execution
environment.
On the Studio page of the Cloud Data Fusion web interface,
pipelines are represented as a series of nodes arranged in a directed
acyclic graph (DAG), forming a one-way flow.
Nodes represent the various actions that you can take with your
pipelines, such as reading from sources, performing data
transformations, and writing output to sinks. You can develop data
pipelines in the Cloud Data Fusion web interface by connecting
together sources, transformations, sinks, and other nodes.
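As a rough illustration of the DAG idea described above, the following Python sketch models a hypothetical four-node pipeline as a mapping from each node to its upstream nodes, and derives a one-way execution order with the standard library's `graphlib` module. The node names and the `execution_order` helper are invented for this example. It is not a description of how Cloud Data Fusion stores or schedules pipelines internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a node, and each value is the set of
# upstream nodes that must run before it.
pipeline = {
    "read_source": set(),
    "clean_records": {"read_source"},
    "aggregate": {"clean_records"},
    "write_sink": {"aggregate"},
}


def execution_order(dag):
    """Return one valid ordering of the nodes.

    Raises graphlib.CycleError if the graph contains a cycle, which is
    what makes the flow strictly one-way.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Because every node in this example has exactly one upstream node, only one ordering is possible: source, transformations, then sink.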
In the Cloud Data Fusion web interface, to browse plugins, sample
pipelines, and other integrations, click Hub. When a new
version of a plugin is released, it's visible in the Hub in any instance
that's compatible. This applies even if the instance was created before
the plugin was released.
Cloud Data Fusion creates ephemeral execution environments to
execute pipelines.
Cloud Data Fusion supports Dataproc as an
execution environment.
Cloud Data Fusion provisions an ephemeral
Dataproc cluster in your customer project at the
beginning of a pipeline run, executes the pipeline using Spark in the
cluster, and then deletes the cluster after the pipeline execution is
complete.
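The create, run, delete lifecycle described above can be sketched with a Python context manager. This is only a conceptual model: `ephemeral_cluster`, `run_pipeline`, and the `events` log are hypothetical names, no real Dataproc API is called, and the choice to delete the cluster even when the run fails is an assumption of this sketch.

```python
from contextlib import contextmanager

events = []  # records lifecycle steps so their order is visible


@contextmanager
def ephemeral_cluster(name):
    """Hypothetical stand-in for provisioning and deleting a cluster."""
    events.append(f"create {name}")
    try:
        yield name
    finally:
        # Assumption of this sketch: the cluster is deleted whether the
        # pipeline run succeeds or raises an error.
        events.append(f"delete {name}")


def run_pipeline(pipeline_name):
    """Run one pipeline on a cluster that exists only for this run."""
    with ephemeral_cluster(f"cluster-for-{pipeline_name}") as cluster:
        events.append(f"run {pipeline_name} on {cluster}")


run_pipeline("daily_load")
print(events)
```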
Alternatively, if you manage your Dataproc clusters
in controlled environments, through technologies like Terraform, you
can also configure Cloud Data Fusion to not provision clusters. In
those environments, you can run pipelines against existing
Dataproc clusters.
A compute profile specifies how and where a pipeline is
executed. A profile encapsulates any information required to set up and
delete the physical execution environment of a pipeline.
For example, a compute profile includes the following:
Execution provisioner
Resources (memory and CPU)
Minimum and maximum node count
Other values
A profile is identified by name and must be assigned a provisioner
and its related configuration. A profile can exist either at the
Cloud Data Fusion instance level or at the namespace level.
The Cloud Data Fusion default compute profile is
Autoscaling.
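To make the shape of a compute profile more concrete, here is a minimal Python sketch of the fields listed above. The `ComputeProfile` class, its field names, and the sample values are all hypothetical and chosen for illustration. They do not reflect the actual Cloud Data Fusion profile schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComputeProfile:
    """Hypothetical model of a compute profile (not the real schema)."""
    name: str          # a profile is identified by name
    provisioner: str   # the execution provisioner it is assigned
    scope: str         # "instance" or "namespace"
    memory_gb: int     # resources: memory
    cpus: int          # resources: CPU
    min_nodes: int
    max_nodes: int
    other: dict = field(default_factory=dict)  # any other values

    def __post_init__(self):
        if self.scope not in ("instance", "namespace"):
            raise ValueError(f"unknown scope: {self.scope!r}")
        if not 0 <= self.min_nodes <= self.max_nodes:
            raise ValueError("min_nodes must be between 0 and max_nodes")


# Sample values are made up for this example.
profile = ComputeProfile(
    name="example-profile",
    provisioner="example-provisioner",
    scope="namespace",
    memory_gb=8,
    cpus=2,
    min_nodes=2,
    max_nodes=10,
)
print(profile.name, profile.scope)
```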
Reusable data pipelines in Cloud Data Fusion let you create
a single pipeline that can apply a data integration pattern to a
variety of use cases and datasets.
Reusable pipelines are easier to manage because most of
the configuration of a pipeline is set at execution time, instead of
being hard-coded at design time.
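The idea of deferring configuration to execution time can be sketched in Python with `string.Template`. The `${name}` placeholder convention, the `pipeline_config` keys, and the `resolve` helper are assumptions made for this illustration. The sketch shows the general pattern, not Cloud Data Fusion's own mechanism.

```python
from string import Template

# Hypothetical design-time configuration: placeholders instead of
# hard-coded values.
pipeline_config = {
    "source_path": "${input_path}",
    "sink_table": "${output_table}",
    "format": "csv",
}


def resolve(config, runtime_args):
    """Fill every placeholder from the arguments supplied for one run.

    Raises KeyError if a placeholder has no matching runtime argument.
    """
    return {
        key: Template(value).substitute(runtime_args)
        for key, value in config.items()
    }


# The same pipeline definition is reused for two different datasets.
sales_run = resolve(
    pipeline_config,
    {"input_path": "gs://example-bucket/sales", "output_table": "sales"},
)
orders_run = resolve(
    pipeline_config,
    {"input_path": "gs://example-bucket/orders", "output_table": "orders"},
)
print(sales_run)
print(orders_run)
```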
Cloud Data Fusion supports creating a trigger on a data
pipeline (called the downstream pipeline), so that it runs when
one or more other pipelines (called upstream pipelines) complete.
You choose which outcomes of the upstream pipeline run start the
downstream pipeline: success, failure, stop, or any combination
of them.
Triggers are useful in the following cases:
Cleansing your data once, and then making it available to
multiple downstream pipelines for consumption.
Sharing information, such as runtime arguments and plugin
configurations, between pipelines. This is called Payload
configuration.
Having a set of dynamic pipelines that can run using the data of
the hour, day, week, or month, instead of using a static pipeline
that must be updated on every run.
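The trigger behavior described above can be modeled in a few lines of Python. In this sketch, the `Trigger` class, the `downstream_runs` helper, the pipeline names, and the lowercase status strings are all hypothetical. Passing the upstream runtime arguments along stands in for the payload configuration idea and is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """Hypothetical trigger: start `downstream` when `upstream` ends
    with one of the chosen statuses."""
    upstream: str
    downstream: str
    on_statuses: frozenset  # any combination of "success", "failure", "stop"


def downstream_runs(triggers, upstream, status, runtime_args):
    """Return (pipeline, arguments) pairs that should start now."""
    runs = []
    for trigger in triggers:
        if trigger.upstream == upstream and status in trigger.on_statuses:
            # Copy the upstream arguments so that each downstream run
            # gets its own payload.
            runs.append((trigger.downstream, dict(runtime_args)))
    return runs


triggers = [
    Trigger("cleanse_data", "build_report", frozenset({"success"})),
    Trigger("cleanse_data", "notify_team", frozenset({"failure", "stop"})),
]

started = downstream_runs(
    triggers, "cleanse_data", "success", {"date": "2025-01-17"}
)
print(started)
```

With a `success` status, only `build_report` is started. A `failure` or `stop` status would start `notify_team` instead.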
Cloud Data Fusion resources
Explore Cloud Data Fusion resources:
Release notes provide change
logs of features, changes, and deprecations.
Last updated 2025-01-17 UTC.