This page provides background information about connecting to your data sources from public or private Cloud Data Fusion instances, in both design and execution environments.
Before you start
This page assumes that you are familiar with these terms:
- Tenant project
Cloud Data Fusion creates a tenant project to hold the resources and services it needs to manage pipelines on your behalf, for example, to run pipelines on the Dataproc clusters that reside in your customer project. A tenant project is not exposed to customers, but when you create a private instance, you might need to use the tenant project name to set up VPC peering.
A tenant project can have multiple Cloud Data Fusion instances. You access the resources and services that a tenant project holds through a Cloud Data Fusion instance from either the Cloud Data Fusion web UI or Google Cloud CLI.
For more information, see the Service Infrastructure documentation about tenant projects.
- Customer project
The customer creates and owns this project. By default, Cloud Data Fusion creates an ephemeral Dataproc cluster in this project to run the customer's pipelines.
- Cloud Data Fusion instance
A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started with Cloud Data Fusion, you create a Cloud Data Fusion instance using the Google Cloud console.
You can create multiple instances in a single Google Cloud project and can specify the Google Cloud region in which to create your Cloud Data Fusion instances.
Based on your requirements and cost constraints, you can create a Developer, Basic, or Enterprise instance.
Each Cloud Data Fusion instance is an independent deployment with its own set of services that handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run using long-running resources in a tenant project.
You can build data pipelines that extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources.
For Cloud Data Fusion versions below 6.4, the following system architecture diagram shows how Cloud Data Fusion connects with data sources from services like Preview or Wrangler in a tenant project and Dataproc in a customer project.
Advantages of using a tenant project
Using a tenant project in Cloud Data Fusion has the following advantages:
- Users and developers can use the managed services in a tenant project only through the Cloud Data Fusion web UI or gcloud CLI.
- Users cannot view or manage resources in a tenant project, so they won't be charged for them and can't make unintended changes to the services, which might cause system outages.
- Each managed service in the tenant project has its own VPC network and subnet.
Design and execution environments
Cloud Data Fusion provides separation of design and execution environments, which lets you design a pipeline once, and then execute it in multiple environments. The design environment resides in the tenant project, while the execution environment is in one or more customer projects.
Example: You design your pipeline using Cloud Data Fusion services, such as Wrangler and Preview. Those services run in the tenant project, where access to data is controlled by the Google-managed Cloud Data Fusion Service Agent role. You then execute the pipeline in your customer project so that it uses your Dataproc cluster. In the customer project, access to data is controlled by the default Compute Engine service account. You can configure your project to use a custom service account.
For more information about configuring service accounts, see Cloud Data Fusion service accounts.
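The split described in the example can be summarized as a small data structure. The following is a minimal sketch, not part of any Cloud Data Fusion API: the `Environment` class, its field names, and the helper function are hypothetical and exist only to show that each environment runs in a different project under a different identity, so a data source must be reachable by both.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Environment:
    """Illustrative record of where a pipeline stage runs (hypothetical type)."""
    name: str
    project: str                # kind of project that hosts the environment
    runs: str                   # what runs there, per the example above
    data_access_identity: str   # identity that controls access to data


DESIGN = Environment(
    name="design",
    project="tenant project",
    runs="Wrangler and Preview",
    data_access_identity="Cloud Data Fusion Service Agent role (Google-managed)",
)

EXECUTION = Environment(
    name="execution",
    project="customer project",
    runs="Dataproc cluster",
    data_access_identity="default Compute Engine service account (or a custom one)",
)


def identities_needing_access() -> List[str]:
    """Both identities need access to a source used at design and run time."""
    return [env.data_access_identity for env in (DESIGN, EXECUTION)]


for identity in identities_needing_access():
    print(identity)
```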
When you create a Cloud Data Fusion instance in your customer project, Cloud Data Fusion automatically creates a separate, Google-managed tenant project for each customer project. In the tenant project, it runs the services required to manage the lifecycle of pipelines and metadata, the Cloud Data Fusion UI, and design-time tools like Preview and Wrangler.
After you verify and deploy your pipeline in an instance, you either execute the pipeline manually, or it executes on a time schedule or a pipeline state trigger.
Whether the execution environment is provisioned and managed by Cloud Data Fusion or the customer, the environment exists in your customer project.
Cloud Data Fusion instances
There are two types of Cloud Data Fusion instances based on an access model: a public (default) instance and a private instance.
Public instances (default)
The easiest way to provision a Cloud Data Fusion instance is to create a public instance. It serves well as a starting point and provides access to external endpoints on the public internet.
A public instance in Cloud Data Fusion uses the default VPC network in your project.
The default VPC network has the following characteristics:
- Autogenerated subnets for each region
- Routing tables
- Firewall rules to ensure communication among your computing resources
Networking across regions
When you create a new project, a benefit of the default VPC network is that it autopopulates one subnet per region using a predefined IP address range, expressed as a CIDR block. The IP address ranges, such as 10.132.0.0/20, are assigned across the Google Cloud global regions.
To ensure that your computing resources connect to each other across regions, the default VPC network sets the default local routes to each subnet. By setting up the default route to the internet (0.0.0.0/0), you gain access to the internet and capture any unrouted network traffic.
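To make the CIDR and routing behavior concrete, here is a minimal sketch using Python's standard `ipaddress` module. It assumes routes are chosen by most specific match (longest prefix). The route table, the second subnet, and the next-hop labels are hypothetical; only 10.132.0.0/20 and 0.0.0.0/0 come from the text above.

```python
import ipaddress

# 10.132.0.0/20 and 0.0.0.0/0 are quoted in the text. The second subnet is a
# hypothetical neighbor, added so the table has more than one local route.
ROUTES = {
    ipaddress.ip_network("10.132.0.0/20"): "local subnet A",
    ipaddress.ip_network("10.128.0.0/20"): "local subnet B (hypothetical)",
    ipaddress.ip_network("0.0.0.0/0"): "default internet gateway",
}


def pick_route(destination: str) -> str:
    """Return the next hop of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    # 0.0.0.0/0 contains every IPv4 address, so matches is never empty here.
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


subnet = ipaddress.ip_network("10.132.0.0/20")
print(subnet.num_addresses)       # 4096 addresses in a /20 block
print(subnet[0], subnet[-1])      # 10.132.0.0 10.132.15.255
print(pick_route("10.132.3.7"))   # local subnet A
print(pick_route("8.8.8.8"))      # default internet gateway
```

An address inside a subnet matches both its local route and the default route. The local route wins because its prefix is longer, which is why the default route only captures traffic that nothing else claims.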
The default VPC network provides a set of firewall rules:
|Default allow internal||Enable
These default VPC network settings minimize the prerequisites for setting up cloud services, including Cloud Data Fusion. Due to concerns about network security, organizations often do not allow you to use the default VPC network for business operations. Without the default VPC network, you cannot create a Cloud Data Fusion public instance. Instead, follow the steps to create a Cloud Data Fusion private instance.
The default VPC network does not grant open access to resources. Instead, the Identity and Access Management (IAM) service controls who can access resources:
- A validated identity is required to log in to Google Cloud.
- After you've logged in, you need explicit permission (for example, the Viewer role) to view Google Cloud services.
Private instances
Some organizations require that all of their production systems be isolated from public IP addresses. A Cloud Data Fusion private instance meets that requirement in all kinds of VPC network settings.
In Cloud Data Fusion versions below 6.4, design and execution environments use private IP addresses. They don't use public internet IP addresses attached to any Cloud Data Fusion Compute Engine VM. As a result, when used as a design-time tool, a Cloud Data Fusion private instance can't access data sources on the public internet.
To connect to data sources on the public internet from a private instance, you design your pipeline in a public instance and then, for execution, move it to a private instance in a customer project, where you control the project's VPC policies. You need to connect to your data from both projects: the one you use during design and the one you use during execution.
Access to data in design and execution environments
In a public instance, network communication happens over the open internet, which is not recommended for critical environments. To securely access your data sources, always execute your pipelines from a private instance in your execution environment.
In Cloud Data Fusion version 6.4, when you design your pipeline, you can't access data sources on the open internet from a private instance. Instead, you design your pipeline in a tenant project using a public instance to connect to data sources on the internet. After you've built your pipeline, move it to a customer project and execute it in a private instance, so that you can control VPC policies. You must connect to your data from both projects.
For more information about the types of projects and instances needed to access various data sources, see the Access to sources section.
Access to sources
If your execution environment runs in a Cloud Data Fusion version below 6.4, you can only access resources within your VPC network. Setting up Cloud VPN or Cloud Interconnect lets you access on-premises data sources. Cloud Data Fusion versions before 6.4 can only access sources on the public internet if you set up a Cloud NAT gateway.
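The rules in the preceding paragraph amount to a small lookup from source type to the extra networking you set up. The sketch below is illustrative only and applies to execution environments below version 6.4. The `Source` enum and the `prerequisites` function are hypothetical names, not part of Cloud Data Fusion.

```python
from enum import Enum
from typing import List


class Source(Enum):
    """Kinds of data source mentioned in the paragraph above."""
    IN_VPC = "resource inside your VPC network"
    ON_PREMISES = "on-premises data source"
    PUBLIC_INTERNET = "source on the public internet"


# Extra networking that an execution environment below version 6.4 needs,
# as stated in the text. The strings are descriptive labels, not API values.
PREREQUISITES = {
    Source.IN_VPC: [],
    Source.ON_PREMISES: ["Cloud VPN or Cloud Interconnect"],
    Source.PUBLIC_INTERNET: ["Cloud NAT gateway"],
}


def prerequisites(source: Source) -> List[str]:
    """Return a copy of the networking prerequisites for the given source."""
    return list(PREREQUISITES[source])


for kind in Source:
    needed = prerequisites(kind) or ["nothing beyond the VPC network itself"]
    print(f"{kind.value}: {', '.join(needed)}")
```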
When accessing data sources, public and private instances:
- make outgoing calls to Google Cloud APIs using Private Google Access
- communicate with an execution (Dataproc) environment through VPC peering
The following table compares public and private instances during design and execution for various data sources:
|Data sources|Public Cloud Data Fusion instance|Public Cloud Data Fusion Dataproc|Private Cloud Data Fusion instance|Private Cloud Data Fusion Dataproc|
|Google Cloud source
(after you grant permissions and set firewall rules)
(after you set up VPN/Interconnect, grant permissions, and set firewall rules)
|Public internet source
(after you grant permissions and set firewall rules)
|versions ≥ 6.4 versions < 6.4|
What's next
- Access control in Cloud Data Fusion
- Service accounts in Cloud Data Fusion
- Creating a public instance
- Creating a private instance