A Cloud Data Fusion instance runs within one Compute Engine zone in Google Cloud. An instance is composed of several Google Cloud technologies, including Google Kubernetes Engine (GKE), Cloud SQL, Cloud Storage, Persistent Disk, and Cloud Key Management Service.
A Cloud Data Fusion instance is provisioned in a tenancy unit. It provides the capability for building and orchestrating data pipelines, and for centralized management of metadata. It runs on a GKE cluster inside a tenant project, and uses Cloud Storage, Cloud SQL, Persistent Disk, Elasticsearch, and Cloud KMS for storing business, technical, and operational metadata.
The main components of the Cloud Data Fusion architecture are explained in the following sections.
The set of services required to build and orchestrate Cloud Data Fusion pipelines and to store pipeline metadata is provisioned in a tenant project, inside a tenancy unit. A separate tenant project is created for each customer project in which Cloud Data Fusion instances are provisioned. The tenant project inherits all the networking and firewall configurations of the customer project.
This is the set of services that Cloud Data Fusion uses to manage pipeline lifecycle, orchestration, and metadata. Cloud Data Fusion orchestrates these services using GKE.
The Cloud Data Fusion UI is a graphical interface to develop, manage, and run data pipelines, and to search, view, and manage integration metadata. The UI also runs in the GKE cluster.
In the Cloud Data Fusion UI, you can click Hub to browse plugins, sample pipelines, and other integrations. When a new version of a plugin is released, it's visible in the Hub in any instance that's compatible (even if the instance was created before the plugin was released).
Cloud Data Fusion uses Cloud Storage, Cloud SQL, Persistent Disk, and Elasticsearch to store technical, business, and operational metadata.
You can use namespaces to partition a Cloud Data Fusion instance to achieve application and data isolation in your design and execution environments. For more information, see Namespaces.
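As a rough illustration of how a namespace scopes requests, the sketch below builds the REST path for one namespace. It assumes the instance exposes the CDAP REST API, where namespaces are addressed under /v3/namespaces/. The endpoint value and the namespace name are hypothetical placeholders, and a real request would also need an authorization token.

```python
from urllib.parse import quote


def namespace_url(api_endpoint: str, namespace: str) -> str:
    """Build the URL that addresses a single namespace.

    Assumes the CDAP-style path /v3/namespaces/<namespace-id>.
    """
    base = api_endpoint.rstrip("/")  # avoid a double slash when joining
    return f"{base}/v3/namespaces/{quote(namespace, safe='')}"


# Hypothetical endpoint and namespace name, for illustration only.
url = namespace_url("https://example-instance.example.com/api/", "finance_dev")
print(url)
```

Pipelines and other resources in a namespace would then be addressed beneath this path, which is one way to see how a namespace isolates applications and data.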
When you use a public IP address, the Cloud Data Fusion UI and backend services run on the domain datafusion.cdap.app. They are exposed over HTTPS and use an SSL certificate to encrypt the connection.
Cloud Data Fusion runs pipelines using Dataproc clusters. Cloud Data Fusion automatically provisions ephemeral Dataproc clusters, runs pipelines on them, and then tears down the clusters after the pipeline run completes. Optionally, you can also choose to run pipelines against existing Dataproc clusters.
Dataproc clusters and Cloud Storage buckets exist in the same region as the Cloud Data Fusion instance. For more information, see Data Location in the general service terms and the Cloud Data Fusion FAQs.
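The ephemeral cluster lifecycle described above (provision, run, tear down) can be pictured with a small sketch. This is not how Cloud Data Fusion is implemented and it calls no Dataproc API: the function names, the cluster naming scheme, and the event list are all hypothetical. Tearing down the cluster even when the run fails is an assumption of the sketch.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(name, events):
    """Model a cluster that exists only for the duration of one run."""
    events.append(f"provision {name}")  # stand-in for cluster creation
    try:
        yield name
    finally:
        # Assumption: teardown happens whether or not the run succeeded.
        events.append(f"teardown {name}")


def run_pipeline(pipeline, events):
    """Run one pipeline on its own short-lived cluster."""
    with ephemeral_cluster(f"cluster-for-{pipeline}", events) as cluster:
        events.append(f"run {pipeline} on {cluster}")


events = []
run_pipeline("daily_load", events)
print(events)
```

Running against an existing Dataproc cluster corresponds to skipping the provision and teardown steps and reusing a cluster that outlives the run.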
Google Cloud's operations suite
You can optionally send logs to Google Cloud's operations suite. For instances that are configured with this integration, two types of logs are sent:
Audit logs: For all instance management operations, Cloud Data Fusion emits audit logs to Google Cloud's operations suite.
Pipeline logs: You can find logs from Cloud Data Fusion pipelines in the Dataproc cluster logs in Google Cloud's operations suite, or in the Cloud Data Fusion Pipeline Studio page where you run your pipeline.
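To locate pipeline logs among the Dataproc cluster logs, a logging filter along these lines may help. The sketch only builds the filter string. It assumes the logs are recorded under the cloud_dataproc_cluster monitored resource type with a cluster_name label, and the cluster name shown is a hypothetical placeholder.

```python
def dataproc_log_filter(cluster_name: str, min_severity: str = "INFO") -> str:
    """Build a Cloud Logging filter for one Dataproc cluster's logs."""
    clauses = [
        'resource.type="cloud_dataproc_cluster"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f"severity>={min_severity}",
    ]
    return " AND ".join(clauses)


# Hypothetical cluster name, for illustration only.
print(dataproc_log_filter("example-pipeline-cluster", "ERROR"))
```

The resulting string can be pasted into the logs query editor or passed to a logging client.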
Learn more about working with logs in Cloud Data Fusion.
When you create a data pipeline on the Cloud Data Fusion Studio page, you can click Preview to view a portion of the data from the pipeline's sources. A pipeline in preview runs in the tenant project. When you deploy the pipeline, it runs in the customer project on the relevant compute profile. After you deploy the pipeline, you must duplicate it to use the Preview feature again.