This page provides background information about connecting to your data sources from public or private Cloud Data Fusion instances, in both design and execution environments.
Before you begin
Networking in Cloud Data Fusion requires a basic understanding of the following:
Tenant project
Cloud Data Fusion creates a tenant project that holds the resources and services needed to manage pipelines on your behalf, such as when it runs pipelines on the Dataproc clusters that reside in your customer project. A tenant project is not exposed to customers, but when you create a private instance, you use the project's name to set up VPC peering. A tenant project can have multiple Cloud Data Fusion instances. You manage the resources and services it holds when you access an instance in the Cloud Data Fusion UI or Google Cloud CLI. For more information, see the Service Infrastructure documentation about tenant projects.
Customer project
The customer creates and owns this project. By default, Cloud Data Fusion creates an ephemeral Dataproc cluster in this project to run your pipelines.
Cloud Data Fusion instance
A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion, where you design and execute pipelines. You can create multiple instances in a single project and specify the Google Cloud region in which to create the Cloud Data Fusion instances. Based on your requirements and cost constraints, you can create an instance that uses the Developer, Basic, or Enterprise edition of Cloud Data Fusion. Each instance contains a unique, independent Cloud Data Fusion deployment that contains a set of services that handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run using long-running resources in a tenant project.
The following diagram shows the connections when you build data pipelines that extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources.
For versions earlier than 6.4, the following system architecture diagram shows how Cloud Data Fusion connects with data sources from services like Preview or Wrangler in a tenant project and Dataproc in a customer project.
Advantages of the tenant project
Cloud Data Fusion's tenant projects have the following advantages:
- Users and developers can only use managed services provided in the tenant project via Cloud Data Fusion's UI or gcloud CLI.
- Users cannot view or manage resources in a tenant project. This prevents them from being charged or making unintended changes to the services, which might cause system outages.
- Each managed service in the tenant project has its own VPC network and subnet.
Pipeline design and execution
Cloud Data Fusion provides separation of design and execution environments, which lets you design a pipeline once, and then execute it in multiple environments. The design environment resides in the tenant project, while the execution environment is in one or more customer projects.
Example: You design your pipeline using Cloud Data Fusion services, such as Wrangler and Preview. Those services run in the tenant project, where access to data is controlled by the Google-managed Cloud Data Fusion Service Agent role. You then execute the pipeline in your customer project so that it uses your Dataproc cluster. In the customer project, the default Compute Engine service account controls access to data. You can configure your project to use a custom service account.
For more information about configuring service accounts, see Cloud Data Fusion service accounts.
When you create a Cloud Data Fusion instance in your customer project, Cloud Data Fusion automatically creates a separate, Google-managed tenant project to run the services required to manage the lifecycle of pipelines and metadata, the Cloud Data Fusion UI, and design-time tools like Preview and Wrangler.
DNS resolution in Cloud Data Fusion
To resolve domain names in your design-time environment when you wrangle and preview the data that you're transferring into Google Cloud, use DNS Peering (available starting in Cloud Data Fusion 6.7.0). It lets you use domain or hostnames for sources and sinks, which you don't need to reconfigure as often as IP addresses.
DNS resolution is recommended in your Cloud Data Fusion design-time environment when you test connections and preview pipelines that use the domain names of on-premises or other servers (such as databases or FTP servers) in a private VPC network.
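As a rough illustration of the distinction between hostnames and IP addresses for sources and sinks, the following sketch classifies a configured host value. The function name `needs_dns_resolution` and the sample host values are hypothetical and are not part of Cloud Data Fusion; the sketch only uses Python's standard `ipaddress` module.

```python
import ipaddress


def needs_dns_resolution(host: str) -> bool:
    """Return True if the host is a name that must be resolved through DNS.

    Hypothetical helper: a value that parses as an IPv4 or IPv6 literal
    needs no DNS lookup; anything else is treated as a domain or hostname.
    """
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so the design-time environment must resolve it.
        return True
    return False


# Illustrative values only; these hosts are assumptions, not real endpoints.
for host in ("10.132.0.5", "db.internal.example.com"):
    print(host, needs_dns_resolution(host))
```

A source configured with a hostname keeps working when the server's IP address changes, but only if the design-time environment can resolve that name.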
After you verify and deploy your pipeline in an instance, you either execute the pipeline manually, or it executes on a time schedule or a pipeline state trigger.
Whether the execution environment is provisioned and managed by Cloud Data Fusion or the customer, the environment exists in your customer project.
Public instances (default)
The easiest way to provision a Cloud Data Fusion instance is to create a public instance. It serves well as a starting point and provides access to external endpoints on the public internet.
A public instance in Cloud Data Fusion uses the default VPC network in your project.
The default VPC network has the following:
- Autogenerated subnets for each region
- Routing tables
- Firewall rules to ensure communication among your computing resources
Networking across regions
A benefit of the default VPC network is that, when you create a new project, it autopopulates one subnet per region using a predefined IP address range, expressed as a CIDR block. The IP address ranges are /20 blocks, such as 10.128.0.0/20 and 10.132.0.0/20, spread across the Google Cloud global regions.
To ensure that your computing resources connect to each other across regions, the default VPC network sets the default local routes to each subnet. By setting up the default route to the internet (0.0.0.0/0), you gain access to the internet and capture any unrouted network traffic.
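To make the CIDR notation above concrete, here is a minimal sketch using Python's standard `ipaddress` module. It inspects the 10.132.0.0/20 block mentioned in the text and shows why 0.0.0.0/0 acts as a catch-all route. The parent range 10.128.0.0/9 is an assumption taken from general VPC auto mode behavior, not from this page, and the external address 203.0.113.10 is a documentation-only example.

```python
import ipaddress

# Assumption: auto mode VPC networks draw regional subnets from 10.128.0.0/9.
AUTO_MODE_RANGE = ipaddress.ip_network("10.128.0.0/9")

# The example regional subnet from the text and the default internet route.
example_subnet = ipaddress.ip_network("10.132.0.0/20")
default_route = ipaddress.ip_network("0.0.0.0/0")


def describe(subnet):
    """Summarize the first address, last address, and size of a CIDR block."""
    return {
        "first": str(subnet.network_address),
        "last": str(subnet.broadcast_address),
        "size": subnet.num_addresses,
    }


summary = describe(example_subnet)
in_auto_range = example_subnet.subnet_of(AUTO_MODE_RANGE)

# Every IPv4 address matches 0.0.0.0/0, which is why it captures
# traffic that no more specific route claims.
matches_default_route = ipaddress.ip_address("203.0.113.10") in default_route

print(summary, in_auto_range, matches_default_route)
```

A /20 block holds 4096 addresses, so each regional subnet has ample room for the Dataproc workers that a pipeline run provisions.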
The default VPC network provides a set of firewall rules:
|Default allow internal||Enable
These default VPC network settings minimize the prerequisites for setting up cloud services, including Cloud Data Fusion. Due to concerns about network security, organizations often don't let you use the default VPC network for business operations. Without the default VPC network, you cannot create a Cloud Data Fusion public instance. Instead, create a private instance.
The default VPC network does not grant open access to resources. Instead, Identity and Access Management (IAM) controls access:
- A validated identity is required to log in to Google Cloud.
- After you've logged in, you need explicit permission (for example, the Viewer role) to view Google Cloud services.
Some organizations require that all of their production systems be isolated from public IP addresses. A Cloud Data Fusion private instance meets that requirement in all kinds of VPC network settings.
Private instances in versions earlier than 6.4
In Cloud Data Fusion versions earlier than 6.4, design and execution environments only use internal IP addresses. They don't use public internet IP addresses attached to any Cloud Data Fusion Compute Engine VM. As a design-time tool, the Cloud Data Fusion private instance can't access data sources on the public internet.
Instead, design the pipeline in a public instance. Then, for execution, move it to a private instance in a customer project, where you control the project's VPC policies. You must connect to your data from both projects.
Access to data in design and execution environments
In a public instance, network communication happens over the open internet, which is not recommended for critical environments. To securely access your data sources, always execute your pipelines from a private instance in your execution environment.
In Cloud Data Fusion versions earlier than 6.4, when you design your pipeline, you can't access data sources on the open internet from a private instance. Instead, you design your pipeline in a tenant project using a public instance to connect to data sources on the internet. After you've built your pipeline, move it to a customer project and execute it in a private instance, so that you can control VPC policies. You must connect to your data from both projects.
Access to sources
If your execution environment runs in a Cloud Data Fusion version earlier than 6.4, you can only access resources within your VPC network. Set up Cloud VPN or Cloud Interconnect to access on-premises data sources. Cloud Data Fusion versions before 6.4 can only access sources on the public internet if you set up a Cloud NAT gateway.
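The rules in the preceding paragraph can be sketched as a small lookup. This is an illustration of the stated rules only, not a Cloud Data Fusion API: the names `is_before_6_4`, `prerequisites`, and the source-kind keys are hypothetical, and the sketch deliberately returns `None` for versions 6.4 and later because this paragraph doesn't describe them.

```python
# Hypothetical source kinds mapped to the networking prerequisites that the
# text lists for execution environments in versions earlier than 6.4.
PRE_6_4_PREREQUISITES = {
    "vpc": [],
    "on_premises": ["Cloud VPN or Cloud Interconnect"],
    "public_internet": ["Cloud NAT gateway"],
}


def is_before_6_4(version: str) -> bool:
    """Compare the major.minor part of a version string against 6.4."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor < (6, 4)


def prerequisites(version: str, source_kind: str):
    """Return the prerequisites for a source, or None if not covered here."""
    if not is_before_6_4(version):
        # Versions 6.4 and later are outside the scope of this sketch.
        return None
    return PRE_6_4_PREREQUISITES[source_kind]


print(prerequisites("6.3.1", "public_internet"))
print(prerequisites("6.3.1", "on_premises"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "6.10" as earlier than "6.4".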
When accessing data sources, public and private instances:
- make outgoing calls to Google Cloud APIs using Private Google Access
- communicate with an execution (Dataproc) environment through VPC peering
The following table compares public and private instances during design and execution for various data sources:
|Data sources|Public Cloud Data Fusion instance|Public Cloud Data Fusion Dataproc|Private Cloud Data Fusion instance|Private Cloud Data Fusion Dataproc|
|Google Cloud source (after you grant permissions and set firewall rules)|
|On-premises source (after you set up VPN/Interconnect, grant permissions, and set firewall rules)|
|Public internet source (after you grant permissions and set firewall rules)|versions ≥ 6.4 versions < 6.4|
- Access control in Cloud Data Fusion
- Service accounts in Cloud Data Fusion
- Creating a public instance
- Creating a private instance