A Cloud Data Fusion instance runs within one Compute Engine zone in Google Cloud. An instance is composed of several Google Cloud technologies, including Google Kubernetes Engine (GKE), Cloud SQL, Cloud Storage, Persistent Disk, and Cloud Key Management Service.
A Cloud Data Fusion instance is provisioned in a tenancy unit. It lets you build and orchestrate data pipelines and manage metadata centrally. The instance runs on a GKE cluster inside a tenant project, and uses Cloud Storage, Cloud SQL, Persistent Disk, Elasticsearch, and Cloud KMS to store business, technical, and operational metadata.
The main components of the Cloud Data Fusion architecture are explained in the following sections.
The services required to build and orchestrate Cloud Data Fusion pipelines and to store pipeline metadata are provisioned in a tenant project, inside a tenancy unit. A separate tenant project is created for each customer project in which Cloud Data Fusion instances are provisioned. The tenant project inherits all the networking and firewall configurations of the customer project.
The control plane is a set of API operations that deal with the Cloud Data Fusion instance itself, such as creating, deleting, restarting, and updating it.
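As a rough illustration of a control plane call, the following sketch assembles the request for creating an instance through the Cloud Data Fusion REST API (v1). The project, location, and instance names are placeholders, and the helper name create_instance_request is hypothetical. The code only builds the request; it doesn't authenticate or send it.

```python
from urllib.parse import urlencode

# Base URL of the Cloud Data Fusion control plane API (v1).
API_ROOT = "https://datafusion.googleapis.com/v1"


def create_instance_request(project, location, instance_id, edition="BASIC"):
    """Return (method, url, body) for an instance-creation call.

    Hypothetical helper for illustration: it only assembles the request.
    """
    parent = f"projects/{project}/locations/{location}"
    query = urlencode({"instanceId": instance_id})
    url = f"{API_ROOT}/{parent}/instances?{query}"
    body = {"type": edition}  # BASIC, ENTERPRISE, or DEVELOPER
    return "POST", url, body


# Placeholder values, not real resources.
method, url, body = create_instance_request("my-project", "us-central1", "my-instance")
print(method, url)
```

Deleting, restarting, and updating an instance follow the same pattern against the instance resource itself, with a different HTTP method or custom verb.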
The data plane refers to a set of REST API operations that deal with the main functionality of Cloud Data Fusion, such as creating, executing, and monitoring pipelines and related artifacts. For example, you create or stop a pipeline with data plane operations. For more information, see the CDAP reference.
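For example, starting or stopping a batch pipeline maps to a POST request against the CDAP REST API exposed by the instance. The sketch below only builds the URLs. The endpoint and pipeline name are placeholders, the helper name pipeline_action_url is hypothetical, and the workflow name DataPipelineWorkflow is assumed to apply because the pipeline is a batch pipeline.

```python
def pipeline_action_url(api_endpoint, pipeline, action, namespace="default"):
    """Build the CDAP REST URL that starts or stops a batch pipeline."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return (
        f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
        f"/apps/{pipeline}/workflows/DataPipelineWorkflow/{action}"
    )


# Placeholder endpoint; a real one comes from the instance's apiEndpoint field.
ENDPOINT = "https://example-instance.invalid/api"
start_url = pipeline_action_url(ENDPOINT, "my-pipeline", "start")
stop_url = pipeline_action_url(ENDPOINT, "my-pipeline", "stop")
print(start_url)
```

A real call would send each URL as a POST request with an OAuth 2.0 bearer token in the Authorization header.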
Cloud Data Fusion uses a set of services to manage pipeline lifecycle, orchestration, and metadata. It orchestrates these services using GKE.
The Cloud Data Fusion web interface is a graphical interface for developing, managing, and running data pipelines, and for searching, viewing, and managing integration metadata. The web interface also runs in the GKE cluster.
In the Cloud Data Fusion web interface, to browse plugins, sample pipelines, and other integrations, click Hub. When a new version of a plugin is released, it's visible in the Hub in any instance that's compatible. This applies even if the instance was created before the plugin was released.
Cloud Data Fusion uses Cloud Storage, Cloud SQL, Persistent Disk, and Elasticsearch to store technical, business, and operational metadata.
You can use namespaces to partition a Cloud Data Fusion instance to achieve application and data isolation in your design and execution environments. For more information, see Namespaces.
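One way to see the isolation is that every data plane resource path is scoped by its namespace. The following sketch lists two CDAP REST paths per namespace. The namespace names dev and prod and the helper name namespace_paths are hypothetical, and the paths shown are only an illustrative subset.

```python
def namespace_paths(namespace):
    """Return CDAP REST paths scoped to one namespace (illustrative subset)."""
    base = f"/v3/namespaces/{namespace}"
    return {
        "create_namespace": ("PUT", base),
        "list_pipelines": ("GET", f"{base}/apps"),
    }


# Hypothetical namespaces for a design and an execution environment.
dev = namespace_paths("dev")
prod = namespace_paths("prod")
print(dev["list_pipelines"], prod["list_pipelines"])
```

Because the namespace is part of the path, a pipeline named the same way in dev and prod resolves to two distinct resources.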
When you use a public IP address, the Cloud Data Fusion web interface and backend services run on the domain datafusion.cdap.app. They are exposed over HTTPS and use an SSL certificate to encrypt the connection.
Cloud Data Fusion runs pipelines on Dataproc clusters. It automatically provisions ephemeral Dataproc clusters, runs pipelines on them, and then tears down the clusters after the pipeline run completes. Optionally, you can run pipelines on existing Dataproc clusters instead.
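The lifecycle described here can be modeled as provision, run, then tear down. The sketch below is a conceptual model only: it doesn't call Dataproc, all names are hypothetical, and the use of a context manager (so that cleanup always runs) is a choice of this sketch rather than a statement about how the service is implemented.

```python
from contextlib import contextmanager

events = []  # records the order of lifecycle steps for inspection


@contextmanager
def ephemeral_cluster(name):
    """Model a cluster that exists only for one pipeline run."""
    events.append(f"provision {name}")
    try:
        yield name
    finally:
        # Cleanup runs whether or not the pipeline run raised.
        events.append(f"delete {name}")


def run_pipeline(pipeline, existing_cluster=None):
    """Run on an existing cluster if given, else on an ephemeral one."""
    if existing_cluster is not None:
        events.append(f"run {pipeline} on {existing_cluster}")
        return
    with ephemeral_cluster(f"cdf-{pipeline}") as cluster:
        events.append(f"run {pipeline} on {cluster}")


run_pipeline("sales-etl")
run_pipeline("sales-etl", existing_cluster="shared-cluster")
print(events)
```

In the model, only the first run provisions and deletes a cluster; the second run leaves the existing cluster untouched.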
Dataproc clusters and Cloud Storage buckets exist in the same region as the Cloud Data Fusion instance. For more information, see Data Location in the general service terms and the Cloud Data Fusion FAQs.
Google Cloud's operations suite
You can optionally send logs to Google Cloud's operations suite. Instances that are configured to integrate with it send two types of logs:
Audit logs: For all instance management operations, Cloud Data Fusion emits audit logs to Google Cloud's operations suite.
Pipeline logs: You can find pipeline logs in the following places:
- The Dataproc cluster logs in Google Cloud's operations suite
- The Cloud Data Fusion Pipeline Studio page where you run your pipeline
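As a starting point for finding both log types, the sketch below builds Cloud Logging filter strings. The field values are assumptions to verify in the Logs Explorer for your project: the admin activity audit log name, the service name datafusion.googleapis.com, and the cloud_dataproc_cluster resource type. The project ID, cluster name, and helper names are placeholders.

```python
def audit_log_filter(project):
    """Filter for Cloud Data Fusion admin activity audit log entries."""
    return (
        f'logName="projects/{project}/logs/cloudaudit.googleapis.com%2Factivity" '
        'AND protoPayload.serviceName="datafusion.googleapis.com"'
    )


def pipeline_log_filter(cluster_name):
    """Filter for logs written by one Dataproc cluster that ran a pipeline."""
    return (
        'resource.type="cloud_dataproc_cluster" '
        f'AND resource.labels.cluster_name="{cluster_name}"'
    )


# Placeholder project and cluster names.
audit_filter = audit_log_filter("my-project")
pipeline_filter = pipeline_log_filter("my-cluster")
print(audit_filter)
print(pipeline_filter)
```

Either string can be pasted into the Logs Explorer query box or passed as the filter of a log listing request.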
Learn more about working with logs in Cloud Data Fusion.
When you create a data pipeline on the Cloud Data Fusion Studio page, to view a portion of the data from the pipeline sources, click Preview.
A pipeline in preview runs in the tenant project. After you deploy the pipeline, it runs in the customer project on the relevant compute profile. To use the Preview feature after deployment, you must duplicate the pipeline.