Cloud Data Fusion overview

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines. The Cloud Data Fusion web interface lets you build scalable data integration solutions. It lets you connect to various data sources, transform the data, and then transfer it to various destination systems, without having to manage the infrastructure.

Cloud Data Fusion is powered by the open source project CDAP.

Get started with Cloud Data Fusion

You can start exploring Cloud Data Fusion in minutes.

Explore Cloud Data Fusion

The main components of Cloud Data Fusion are explained in the following sections.

Tenant project

The set of services required to build and orchestrate Cloud Data Fusion pipelines and to store pipeline metadata is provisioned in a tenant project, inside a tenancy unit. A separate tenant project is created for each customer project in which Cloud Data Fusion instances are provisioned. The tenant project inherits all the networking and firewall configurations from the customer project.

Cloud Data Fusion: Console

The Cloud Data Fusion console, also referred to as the control plane, is a set of API operations and a web interface for managing the Cloud Data Fusion instance itself, such as creating, deleting, restarting, and updating it.

Cloud Data Fusion: Studio

Cloud Data Fusion Studio, also referred to as the data plane, is a set of REST API operations and a web interface for creating, executing, and managing pipelines and related artifacts.

Concepts

This section introduces some of the core concepts of Cloud Data Fusion.

Cloud Data Fusion instances
  • A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started, you create a Cloud Data Fusion instance through the Google Cloud console (or programmatically, as sketched after this list).
  • You can create multiple instances in a single Google Cloud project, and you can specify the Google Cloud region in which to create them.
  • Based on your requirements and cost constraints, you can create a Developer, Basic, or Enterprise instance.
  • Each Cloud Data Fusion instance is a unique, independent Cloud Data Fusion deployment containing a set of services that handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run using long-running resources in a tenant project.
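
You can also create an instance programmatically through the Cloud Data Fusion REST API (datafusion.googleapis.com). The following is a minimal sketch in Python using the google-auth library; the project, region, and instance names are placeholders, and error handling is omitted.

    # Minimal sketch: create a Basic edition instance through the
    # Cloud Data Fusion REST API. PROJECT, LOCATION, and INSTANCE_ID are
    # placeholders; Application Default Credentials must be configured.
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    PROJECT = "my-project"        # placeholder
    LOCATION = "us-central1"      # placeholder
    INSTANCE_ID = "my-instance"   # placeholder

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # instances.create returns a long-running operation; poll the returned
    # operation name until the instance reaches the RUNNING state.
    resp = session.post(
        f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
        f"/locations/{LOCATION}/instances",
        params={"instanceId": INSTANCE_ID},
        json={"type": "BASIC"},  # DEVELOPER, BASIC, or ENTERPRISE
    )
    resp.raise_for_status()
    print(resp.json()["name"])  # operation name to poll
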
Namespace
  • A namespace is a logical grouping of applications, data, and the associated metadata in a Cloud Data Fusion instance. You can think of namespaces as a partitioning of the instance. In a single instance, one namespace stores the data and metadata of an entity independently from another namespace.
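
Namespaces are served through the CDAP REST API that each instance exposes at the apiEndpoint address returned by the instances.get method. As a minimal sketch, with a placeholder endpoint value, the following lists the namespaces in an instance.

    # Minimal sketch: list namespaces through the instance's CDAP REST API.
    # API_ENDPOINT is a placeholder for the instance's apiEndpoint value.
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    API_ENDPOINT = "https://my-instance-usc1.datafusion.googleusercontent.com/api"  # placeholder

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    resp = session.get(f"{API_ENDPOINT}/v3/namespaces")
    resp.raise_for_status()
    for namespace in resp.json():
        print(namespace["name"])  # for example, "default"
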
Pipeline
  • A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources.
  • Building pipelines lets you create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs.
  • Pipelines let you express your data processing workflows using the logical flow of data, while Cloud Data Fusion handles all the functionality that's required to physically run the pipeline in an execution environment.
Pipeline node
  • On the Studio page of the Cloud Data Fusion web interface, pipelines are represented as a series of nodes arranged in a directed acyclic graph (DAG), forming a one-way flow.
  • Nodes represent the various actions that you can take in your pipelines, such as reading from sources, performing data transformations, and writing output to sinks. You develop data pipelines in the Cloud Data Fusion web interface by connecting sources, transformations, sinks, and other nodes (a skeleton of the resulting DAG is sketched after this list).
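
Internally, the DAG is captured as a JSON document that lists the stages (nodes) and the connections (edges) between them. The following Python dict is an illustrative skeleton only: the stage and plugin names are examples, and a pipeline exported from the Studio carries far more detail, such as artifact versions, plugin properties, and schemas.

    # Illustrative skeleton of a batch pipeline DAG, not a deployable config.
    pipeline = {
        "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
        "config": {
            # Nodes: each stage wraps a plugin (source, transform, or sink).
            "stages": [
                {"name": "Read", "plugin": {"name": "GCSFile", "type": "batchsource"}},
                {"name": "Clean", "plugin": {"name": "Wrangler", "type": "transform"}},
                {"name": "Write", "plugin": {"name": "BigQueryTable", "type": "batchsink"}},
            ],
            # Edges: data flows one way, from sources toward sinks.
            "connections": [
                {"from": "Read", "to": "Clean"},
                {"from": "Clean", "to": "Write"},
            ],
        },
    }
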
Plugins
  • A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion.
  • Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions.
  • A plugin is sometimes referred to as a node, usually in the context of the Cloud Data Fusion web interface.
  • To discover and access popular Cloud Data Fusion plugins, see Cloud Data Fusion plugins.
Hub
  • In the Cloud Data Fusion web interface, to browse plugins, sample pipelines, and other integrations, click Hub. When a new version of a plugin is released, it's visible in the Hub in any compatible instance, even if the instance was created before the plugin was released.
Pipeline preview
  • Cloud Data Fusion Studio lets you test the accuracy of a pipeline's design by running Preview on a subset of your data (a hedged REST sketch follows this list).
  • A pipeline in preview runs in the tenant project.
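
Previews can also be requested through the CDAP preview REST endpoint. The following is a hedged sketch: the endpoint value is a placeholder, and the preview configuration fields shown (such as numOfRecords) are assumptions to verify against the CDAP reference.

    # Hedged sketch: request a preview run through the CDAP preview API.
    # API_ENDPOINT is a placeholder; the exact preview fields are assumptions.
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    API_ENDPOINT = "https://my-instance-usc1.datafusion.googleusercontent.com/api"  # placeholder
    NAMESPACE = "default"

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    preview_request = {
        "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
        "config": {
            "stages": [],       # pipeline stages, as in a normal deployment
            "connections": [],  # DAG edges
            "preview": {"numOfRecords": 10},  # assumed field: cap records read
        },
    }
    resp = session.post(f"{API_ENDPOINT}/v3/namespaces/{NAMESPACE}/previews",
                        json=preview_request)
    resp.raise_for_status()
    print(resp.json())  # contains the preview id, used to poll status and logs
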
Pipeline execution
  • Cloud Data Fusion creates ephemeral execution environments to execute pipelines.
  • Cloud Data Fusion supports Dataproc as an execution environment.
  • Cloud Data Fusion provisions an ephemeral Dataproc cluster in your customer project at the beginning of a pipeline run, executes the pipeline using Spark in the cluster, and then deletes the cluster after the pipeline execution is complete.
  • Alternatively, if you manage your Dataproc clusters in controlled environments, through technologies like Terraform, you can configure Cloud Data Fusion to not provision clusters. In those environments, you can run pipelines against existing Dataproc clusters. (Either way, a run is started as sketched after this list.)
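
However the cluster is provisioned, a deployed batch pipeline run is started through the CDAP REST API, where the pipeline executes as the DataPipelineWorkflow program. A minimal sketch, with placeholder endpoint, namespace, and pipeline names:

    # Minimal sketch: start a run of a deployed batch pipeline.
    # API_ENDPOINT, NAMESPACE, and PIPELINE are placeholders.
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    API_ENDPOINT = "https://my-instance-usc1.datafusion.googleusercontent.com/api"  # placeholder
    NAMESPACE = "default"
    PIPELINE = "my-pipeline"  # placeholder

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # Runtime arguments, if any, are passed as the JSON body of the start call.
    resp = session.post(
        f"{API_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
        "/workflows/DataPipelineWorkflow/start",
        json={},
    )
    resp.raise_for_status()
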
Compute profiles
  • A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the physical execution environment of a pipeline.
  • For example, a compute profile includes the following:
    • Execution provisioner
    • Resources (memory and CPU)
    • Minimum and maximum node count
    • Other values
  • A profile is identified by name and must be assigned a provisioner and its related configuration. A profile can exist either at the Cloud Data Fusion instance level or at the namespace level.
  • The Cloud Data Fusion default compute profile is Autoscaling. (The general shape of a profile is sketched after this list.)
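
To illustrate what a profile encapsulates, the following dict sketches the general shape of a Dataproc-backed profile. The provisioner name and property keys here are assumptions for illustration, not the exact keys the Dataproc provisioner expects.

    # Illustrative only: a compute profile pairs a provisioner with its
    # configuration. Property keys below are assumptions, not exact keys.
    profile = {
        "name": "my-profile",        # placeholder
        "label": "My profile",
        "provisioner": {
            "name": "gcp-dataproc",  # assumed provisioner name
            "properties": [
                {"name": "region", "value": "us-central1"},
                {"name": "masterNumNodes", "value": "1"},
                {"name": "workerNumNodes", "value": "2"},
            ],
        },
    }
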
Reusable pipelines
  • Reusable data pipelines in Cloud Data Fusion let you create a single pipeline that can apply a data integration pattern to a variety of use cases and datasets.
  • Reusable pipelines improve manageability by setting most of a pipeline's configuration at execution time, instead of hard-coding it at design time (see the macro sketch after this list).
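
Reuse typically relies on macros: a plugin property is set to a ${...} placeholder at design time and resolved from runtime arguments at execution time. A small sketch, with hypothetical property and argument names:

    # Sketch of a reusable stage: the source path is a ${...} macro that is
    # resolved from runtime arguments per run. All names are examples.
    stage = {
        "name": "Read",
        "plugin": {
            "name": "GCSFile",
            "type": "batchsource",
            "properties": {"path": "${input.path}"},  # resolved at run time
        },
    }

    # Supplied per run, for example in the body of the workflow start call.
    runtime_args = {"input.path": "gs://my-bucket/2024-01-01/*.csv"}
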
Triggers
  • Cloud Data Fusion supports creating a trigger on a data pipeline (called the downstream pipeline) so that it runs at the completion of one or more other pipelines (called upstream pipelines). You choose when the downstream pipeline runs, for example, upon the success, failure, or stopping of an upstream pipeline run, or any combination of these.
  • Triggers are useful in the following cases:
    • Cleansing your data once, and then making it available to multiple downstream pipelines for consumption.
    • Sharing information, such as runtime arguments and plugin configurations, between pipelines. This is called Payload configuration.
    • Having a set of dynamic pipelines that can run on the data of the hour, day, week, or month, instead of a static pipeline that must be updated for every run. (A hedged sketch of a trigger definition follows.)
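
In CDAP terms, a trigger is a schedule on the downstream pipeline whose trigger type is a program status of the upstream pipeline. The following dict is a hedged sketch of such a definition; the exact field names are assumptions to check against the CDAP schedule API reference, and all pipeline names are examples.

    # Hedged sketch: a schedule on the downstream pipeline that fires when
    # the upstream pipeline's workflow completes. Field names are assumptions.
    schedule = {
        "name": "run-after-upstream",
        "program": {
            "programName": "DataPipelineWorkflow",  # the downstream program
            "programType": "WORKFLOW",
        },
        "trigger": {
            "type": "PROGRAM_STATUS",
            "programId": {
                "namespace": "default",
                "application": "upstream-pipeline",  # example name
                "type": "WORKFLOW",
                "program": "DataPipelineWorkflow",
            },
            "programStatuses": ["COMPLETED"],  # or FAILED, KILLED
        },
    }
    # Registered with a PUT to, for example:
    # {API_ENDPOINT}/v3/namespaces/default/apps/downstream-pipeline/schedules/run-after-upstream
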
