Architecture components

A Cloud Data Fusion instance runs within one Compute Engine zone in Google Cloud. An instance is composed of several Google Cloud technologies, including Google Kubernetes Engine (GKE), Cloud SQL, Cloud Storage, Persistent Disk, and Cloud Key Management Service.

A Cloud Data Fusion instance is provisioned in a tenancy unit. It lets you build and orchestrate data pipelines and manage metadata centrally. It runs on a GKE cluster inside a tenant project, and uses Cloud Storage, Cloud SQL, Persistent Disk, Elasticsearch, and Cloud KMS to store business, technical, and operational metadata.

The main components of the Cloud Data Fusion architecture are explained in the following sections.

Tenant project

The set of services required to build and orchestrate Cloud Data Fusion pipelines and to store pipeline metadata is provisioned in a tenant project, inside a tenancy unit. A separate tenant project is created for each customer project in which Cloud Data Fusion instances are provisioned. The tenant project inherits all the networking and firewall configurations of the customer project.

System services

System services are the set of services that Cloud Data Fusion uses to manage the pipeline lifecycle, orchestration, and metadata. Cloud Data Fusion orchestrates these services using GKE.

User interface

The Cloud Data Fusion UI is a graphical interface for developing, managing, and running data pipelines, and for searching, viewing, and managing integration metadata. The UI also runs in the GKE cluster.

Metadata storage

Cloud Data Fusion uses Cloud Storage, Cloud SQL, Persistent Disk, and Elasticsearch to store technical, business, and operational metadata.

Domain

When an instance uses a public IP address, the Cloud Data Fusion UI and backend services run on the domain datafusion.cdap.app. They are exposed over HTTPS and use an SSL certificate to encrypt the connection.
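
As a minimal sketch of what this implies for a client, the helper below accepts a URL only if it uses HTTPS and its host is on the datafusion.cdap.app domain. The function name is hypothetical, and treating any subdomain as valid is an assumption: the text above names only the domain and the protocol.

```python
from urllib.parse import urlsplit

# Domain named in the text above. Accepting its subdomains is an assumption.
DATA_FUSION_DOMAIN = "datafusion.cdap.app"


def is_public_data_fusion_url(url: str) -> bool:
    """Return True if url uses HTTPS and its host is on the Data Fusion domain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Match the domain itself or a subdomain, never a look-alike suffix.
    on_domain = host == DATA_FUSION_DOMAIN or host.endswith("." + DATA_FUSION_DOMAIN)
    return parts.scheme == "https" and on_domain
```

The explicit leading dot in the suffix check keeps hosts such as evildatafusion.cdap.app from passing.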

Pipeline execution

Cloud Data Fusion runs pipelines on Dataproc clusters. By default, it provisions an ephemeral Dataproc cluster, runs the pipeline on it, and tears down the cluster after the pipeline run completes. You can also run pipelines on existing Dataproc clusters.
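
The provision, run, and tear down sequence can be pictured with the following sketch. It does not call Dataproc or Cloud Data Fusion; the function names, the cluster naming scheme, and the event log are all hypothetical. Tearing the cluster down even when the run raises an error is also an assumption of the sketch, since the text above only describes teardown after a run completes.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional


@contextmanager
def ephemeral_cluster(name: str, log: List[str]) -> Iterator[str]:
    """Simulate provisioning a cluster and tearing it down afterwards."""
    log.append(f"provision {name}")
    try:
        yield name
    finally:
        # Assumption: teardown also happens if the pipeline run fails.
        log.append(f"teardown {name}")


def run_pipeline(pipeline: str, log: List[str],
                 existing_cluster: Optional[str] = None) -> None:
    """Run on an existing cluster if given, otherwise on an ephemeral one."""
    if existing_cluster is not None:
        # Existing clusters are neither created nor deleted by the run.
        log.append(f"run {pipeline} on {existing_cluster}")
        return
    with ephemeral_cluster(f"ephemeral-{pipeline}", log) as cluster:
        log.append(f"run {pipeline} on {cluster}")
```

Calling run_pipeline("p1", log) records a provision, a run, and a teardown event in that order, while passing existing_cluster records only the run.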

Google Cloud's operations suite

You can optionally send logs to Google Cloud's operations suite. Instances that are configured to integrate with it send two types of logs:

  1. Audit logs: For all instance management operations, Cloud Data Fusion emits audit logs to Google Cloud's operations suite.
  2. Pipeline logs: You can find logs from Cloud Data Fusion pipelines in the Dataproc cluster logs in Google Cloud's operations suite.
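
To locate the two log types, you could build Cloud Logging filter strings along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the filter fields (the cloudaudit.googleapis.com log name, the datafusion.googleapis.com service name, and the cloud_dataproc_cluster resource type with its cluster_name label) are not stated in this document, so verify them against your own log entries before relying on them.

```python
def audit_log_filter(project_id: str) -> str:
    """Filter for instance management audit logs (field values are assumptions)."""
    return (
        f'logName:"projects/{project_id}/logs/cloudaudit.googleapis.com" '
        'AND protoPayload.serviceName="datafusion.googleapis.com"'
    )


def pipeline_log_filter(cluster_name: str) -> str:
    """Filter for pipeline logs emitted under a Dataproc cluster (assumed fields)."""
    return (
        'resource.type="cloud_dataproc_cluster" '
        f'AND resource.labels.cluster_name="{cluster_name}"'
    )
```

The returned strings are intended to be pasted into a log query or passed to a logging client; the sketch itself makes no API calls.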

What's next