Introduction to Cloud Data Fusion networking

This page provides background information about connecting to your data sources from public or private Cloud Data Fusion instances, in both design and execution environments.

Before you begin

Networking in Cloud Data Fusion requires a basic understanding of the following:

Tenant project

Cloud Data Fusion creates a tenant project that holds the resources and services needed to manage pipelines on your behalf, such as when it runs pipelines on the Dataproc clusters that reside in your customer project.

The tenant project isn't exposed to you directly, but when you create a private instance, you use the project's name to set up VPC peering. Each private instance in the tenant project has its own VPC network and subnet.

The project can have multiple Cloud Data Fusion instances. You manage the resources and services it holds when you access an instance in the Cloud Data Fusion UI or Google Cloud CLI.
For more information, see the Service Infrastructure documentation about tenant projects.

Customer project

The customer creates and owns this project. By default, Cloud Data Fusion creates an ephemeral Dataproc cluster in this project to run your pipelines.

Cloud Data Fusion instance

A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion, where you design and execute pipelines. You can create multiple instances in a single project and specify the Google Cloud region in which to create each one. Based on your requirements and cost constraints, you can create an instance that uses the Developer, Basic, or Enterprise edition of Cloud Data Fusion.

Each instance is an independent deployment with its own set of services that handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run on long-running resources in a tenant project.

Network diagram

The following diagram shows the connections when you build data pipelines that extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources.

For Cloud Data Fusion versions 6.4 and later, see the diagrams for controlling egress in a private instance and connecting to a public source.

For versions earlier than 6.4, the following system architecture diagram shows how Cloud Data Fusion connects with data sources from services like Preview or Wrangler in a tenant project and Dataproc in a customer project.

Figure: Cloud Data Fusion network diagram

Pipeline design and execution

Cloud Data Fusion provides separation of design and execution environments, which lets you design a pipeline once, and then execute it in multiple environments. The design environment resides in the tenant project, while the execution environment is in one or more customer projects.

Example: You design your pipeline using Cloud Data Fusion services, such as Wrangler and Preview. Those services run in the tenant project, where access to data is controlled by the Google-managed Cloud Data Fusion Service Agent role. You then execute the pipeline in your customer project so that it uses your Dataproc cluster. In the customer project, the default Compute Engine service account controls access to data. You can configure your project to use a custom service account.

For more information about configuring service accounts, see Cloud Data Fusion service accounts.

Design environment

When you create a Cloud Data Fusion instance in your customer project, Cloud Data Fusion automatically creates a separate, Google-managed tenant project to run the services required to manage the lifecycle of pipelines and metadata, the Cloud Data Fusion UI, and design-time tools like Preview and Wrangler.

DNS resolution in Cloud Data Fusion

To resolve domain names in your design-time environment when you wrangle and preview the data that you're transferring into Google Cloud, use DNS Peering (available starting in Cloud Data Fusion 6.7.0). It lets you use domain names or hostnames for sources and sinks, which you don't need to reconfigure as often as IP addresses.

DNS resolution is recommended in your design-time environment when you test connections and preview pipelines that use the domain names of on-premises or other servers (such as databases or FTP servers) in a private VPC network.

For more information, see DNS Peering and Cloud DNS Forwarding.

Execution environment

After you verify and deploy your pipeline in an instance, you either execute the pipeline manually, or it executes on a time schedule or a pipeline state trigger.

Whether the execution environment is provisioned and managed by Cloud Data Fusion or by the customer, the environment exists in your customer project.

Public instances (default)

The easiest way to provision a Cloud Data Fusion instance is to create a public instance. It serves well as a starting point and provides access to external endpoints on the public internet.

A public instance in Cloud Data Fusion uses the default VPC network in your project.

The default VPC network has the following:

  • Autogenerated subnets for each region
  • Routing tables
  • Firewall rules to ensure communication among your computing resources

Networking across regions

When you create a new project, a benefit of the default VPC network is that it automatically creates one subnet per region, each with a predefined IP address range expressed as a CIDR block. The ranges start with 10.128.0.0/20 and 10.132.0.0/20, and continue across the Google Cloud regions.
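To make the CIDR notation concrete, the following sketch uses Python's standard ipaddress module to expand the two ranges named above. It only does arithmetic on the ranges given in the text; it doesn't query any Google Cloud API, and the helper name describe is an illustrative choice.

```python
import ipaddress

# The two example subnet ranges named in the text above.
subnets = [
    ipaddress.ip_network("10.128.0.0/20"),
    ipaddress.ip_network("10.132.0.0/20"),
]


def describe(net):
    """Return (first address, last address, total address count) for a CIDR block."""
    return str(net.network_address), str(net.broadcast_address), net.num_addresses


summary = {str(net): describe(net) for net in subnets}

for cidr, (first, last, count) in summary.items():
    print(f"{cidr}: {first} - {last} ({count} addresses)")

# The two ranges don't overlap, so each region gets its own address space.
print("overlap:", subnets[0].overlaps(subnets[1]))
```

A /20 prefix leaves 12 bits for addresses, so each of these subnets spans 4,096 addresses.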

To ensure that your computing resources connect to each other across regions, the default VPC network sets the default local routes to each subnet. By setting up the default route to the internet (0.0.0.0/0), you gain access to the internet and capture any unrouted network traffic.

Firewall rules

The default VPC network provides a set of firewall rules:

Default rule            Description
Default allow icmp      Enable icmp protocol for source 0.0.0.0/0
Default allow internal  Enable tcp:0-65535; udp:0-65535; icmp for source 10.128.0.0/9, which covers IP addresses from 10.128.0.1 (min) to 10.255.255.254 (max)
Default allow rdp       Enable tcp:3389 for source 0.0.0.0/0
Default allow ssh       Enable tcp:22 for source 0.0.0.0/0
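As a rough illustration of how the four rules above combine, the following sketch models them as plain Python data and reports which rules match a given source address, protocol, and port. It is a simplified model under stated assumptions: the hyphenated rule names are derived from the table labels, and real VPC firewall rules also have priorities, targets, and deny rules that this sketch ignores.

```python
import ipaddress

# (name, source CIDR, {protocol: allowed ports, or None when ports don't apply})
# Values are copied from the table above; the data layout is an assumption.
DEFAULT_RULES = [
    ("default-allow-icmp", "0.0.0.0/0", {"icmp": None}),
    ("default-allow-internal", "10.128.0.0/9",
     {"tcp": range(0, 65536), "udp": range(0, 65536), "icmp": None}),
    ("default-allow-rdp", "0.0.0.0/0", {"tcp": range(3389, 3390)}),
    ("default-allow-ssh", "0.0.0.0/0", {"tcp": range(22, 23)}),
]


def matching_rules(source_ip, protocol, port=None):
    """Return the names of the default rules that allow this traffic."""
    ip = ipaddress.ip_address(source_ip)
    hits = []
    for name, cidr, protocols in DEFAULT_RULES:
        if ip not in ipaddress.ip_network(cidr):
            continue  # source address is outside the rule's range
        if protocol not in protocols:
            continue  # rule doesn't cover this protocol
        ports = protocols[protocol]
        if ports is None or port in ports:
            hits.append(name)
    return hits


# First and last host addresses of the internal range, as stated in the table.
internal = ipaddress.ip_network("10.128.0.0/9")
internal_bounds = (str(internal[1]), str(internal[-2]))

print(matching_rules("10.128.0.5", "tcp", 5432))
print(matching_rules("203.0.113.7", "tcp", 5432))
print(internal_bounds)
```

In this model, traffic from inside 10.128.0.0/9 is allowed on any TCP or UDP port, while traffic from an external address matches only the icmp, rdp, and ssh rules.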

These default VPC network settings minimize the prerequisites for setting up cloud services, including Cloud Data Fusion. Due to concerns about network security, organizations often don't let you use the default VPC network for business operations. Without the default VPC network, you cannot create a Cloud Data Fusion public instance. Instead, create a private instance.

The default VPC network does not grant open access to resources. Instead, Identity and Access Management (IAM) controls access:

  • A validated identity is required to log in to Google Cloud.
  • After you've logged in, you need explicit permission (for example, the Viewer role) to view Google Cloud services.

Private instances

Some organizations require that all of their production systems be isolated from public IP addresses. A Cloud Data Fusion private instance meets that requirement in all kinds of VPC network settings.

Private instances in version 6.4 and earlier

In Cloud Data Fusion versions earlier than 6.4, design and execution environments only use internal IP addresses. They don't use public internet IP addresses attached to any Cloud Data Fusion Compute Engine VM. As a design-time tool, the Cloud Data Fusion private instance can't access data sources on the public internet.

Instead, design the pipeline in a public instance. Then, for execution, move it to a private instance in a customer project, where you control the project's VPC policies. You must connect to your data from both projects.

Access to data in design and execution environments

In a public instance, network communication happens over the open internet, which is not recommended for critical environments. To securely access your data sources, always execute your pipelines from a private instance in your execution environment.

In Cloud Data Fusion version 6.4, when you design your pipeline, you can't access data sources on the open internet from a private instance. Instead, you design your pipeline in a tenant project using a public instance to connect to data sources on the internet. After you've built your pipeline, move it to a customer project and execute it in a private instance, so that you can control VPC policies. You must connect to your data from both projects.

Access to sources

If your execution environment runs in a Cloud Data Fusion version earlier than 6.4, you can only access resources within your VPC network. Set up Cloud VPN or Cloud Interconnect to access on-premises data sources. Cloud Data Fusion versions before 6.4 can only access sources on the public internet if you set up a Cloud NAT gateway.
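The rules in the preceding paragraph can be written as a small decision helper. This is a sketch only: the function names and the source labels "vpc", "on_premises", and "public_internet" are hypothetical, it covers only the networking prerequisites stated above (not permissions or firewall rules), and it doesn't model versions 6.4 and later.

```python
def is_before_6_4(version):
    """True when a 'major.minor[.patch]' version string is earlier than 6.4."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (6, 4)


def can_reach_before_6_4(source, *, vpn_or_interconnect=False, cloud_nat=False):
    """Apply the pre-6.4 execution rules from the text to one source category.

    `source` is one of the hypothetical labels "vpc", "on_premises",
    or "public_internet".
    """
    if source == "vpc":
        return True  # resources inside your VPC network are reachable
    if source == "on_premises":
        return vpn_or_interconnect  # needs Cloud VPN or Cloud Interconnect
    if source == "public_internet":
        return cloud_nat  # needs a Cloud NAT gateway
    raise ValueError(f"unknown source category: {source!r}")


version = "6.3.1"
if is_before_6_4(version):
    for source in ("vpc", "on_premises", "public_internet"):
        print(source, can_reach_before_6_4(source, cloud_nat=True))
```

Comparing (major, minor) as integers avoids the string-comparison trap where "6.10" would sort before "6.4".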

When accessing data sources, public and private instances:

  • make outgoing calls to Google Cloud APIs using Private Google Access
  • communicate with an execution (Dataproc) environment through VPC peering

The following table compares public and private instances during design and execution for various data sources:

Columns:

  • Data sources
  • Public Cloud Data Fusion instance (design-time)
  • Public Cloud Data Fusion Dataproc (execution)
  • Private Cloud Data Fusion instance (design-time)
  • Private Cloud Data Fusion Dataproc (execution)

Rows:

  • Google Cloud source (after you grant permissions and set firewall rules)
  • On-premises source (after you set up VPN/Interconnect, grant permissions, and set firewall rules)
  • Public internet source (after you grant permissions and set firewall rules); versions ≥ 6.4 and versions < 6.4 are listed separately

What's next