This page describes how to create a Cloud Data Fusion instance with an internal IP address. You create the instance in a VPC network or a Shared VPC network.
A private Cloud Data Fusion instance has the following benefits:
- Connections to the instance are established over a private VPC network in your Google Cloud project. Traffic over the network doesn't go through the public internet.
- The instance can connect to your on-premises resources, such as relational databases, because your on-premises network connects to the Google Cloud private VPC network through Cloud VPN or Cloud Interconnect. You can reach these resources securely over the private network without opening up access to Google Cloud.
Objectives
- Set up the VPC network or the Shared VPC network.
- Allocate an IP range that will be used to deploy the Cloud Data Fusion instance in the tenant project.
- Create the Cloud Data Fusion private instance.
- Set up the VPC network peering between the VPC that contains the Cloud Data Fusion instance and the VPC that contains the associated tenant project.
- For Shared VPC networks, set up Identity and Access Management (IAM) permissions.
- If your private instance uses Cloud Data Fusion version 6.2.0 or earlier, create a firewall rule.
- Let different Google Cloud services communicate internally with each other by enabling Private Google Access on the Dataproc subnet.
Before you begin
- To learn about Cloud Data Fusion's deployment architecture, see Networking.
Set up the VPC network
If you haven't already done so, create a VPC network or a Shared VPC network.
To set up your VPC network, you must allocate an IP address range.
Allocate an IP range
VPC network
If you're not using a Shared VPC network, Cloud Data Fusion allocates an IP range by default when you create an instance.
Shared VPC network
To use a Shared VPC network, you must allocate an IP range for your Cloud Data Fusion instance. Follow these steps:
In the Google Cloud console, go to the VPC networks page.
In the Name column, click the VPC network in which you want to create a private Cloud Data Fusion instance.
The VPC network details page opens.
Click Private service connection. If prompted, enable the Service Networking API by clicking Enable API.
Click Allocate IP range.
Give your IP range a name.
For IP range, click Automatic.
Specify a prefix size of 22.
Click Allocate.
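The console steps above correspond roughly to reserving a global internal range for private services access. The following is a minimal sketch with the gcloud CLI, not the documented procedure for this page. The helper name `allocate_range`, the `DRY_RUN` switch, and the example range name are assumptions made for illustration. By default the helper only prints the command so that you can review it first.

```shell
# Hypothetical helper: reserve a /22 range for the Cloud Data Fusion instance.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
allocate_range() {
  # $1 = range name, $2 = VPC network name, $3 = ID of the project hosting the network
  set -- gcloud compute addresses create "$1" \
    --global --purpose=VPC_PEERING --prefix-length=22 \
    --network="$2" --project="$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `allocate_range cdf-range NETWORK_NAME PROJECT_ID` prints the command, and `DRY_RUN=0 allocate_range cdf-range NETWORK_NAME PROJECT_ID` runs it.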
Create a private instance
Create the private Cloud Data Fusion instance in a VPC network or a Shared VPC network.
VPC network
To create the instance in a VPC network, use either the Google Cloud console or cURL.
If you use the Google Cloud console to create your private instance, Cloud Data Fusion allocates a /22 IP address range by default. To choose a different IP range, you must use the cURL command.
Console
Go to the Create Data Fusion instance page.
Enter an instance name and description for your instance.
Select the Region in which to create the instance.
Select a Cloud Data Fusion Version and Edition.
Specify the Dataproc service account to use for running your Cloud Data Fusion pipeline in Dataproc. The default Compute Engine account is pre-selected.
Expand the Advanced Options menu and click Enable Private IP.
In the Network field, choose a network in which to create the instance.
Click Create. It takes up to 30 minutes for the instance creation process to complete.
cURL
For your convenience, you can export the following variables, or you can directly substitute these values into the following commands:
export PROJECT=PROJECT_ID
export LOCATION=REGION
export DATA_FUSION_API_NAME=datafusion.googleapis.com
To create the instance, call its create() method:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"description": "Private CDF instance created through REST.", "type": "ENTERPRISE", "privateInstance": true, "networkConfig": {"network": "NETWORK_NAME", "ipAllocation": "IP_RANGE"}}' \
  "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instanceId=INSTANCE_ID"
Replace the following:
- INSTANCE_ID: The ID string your new instance should get.
- NETWORK_NAME: The name of the VPC network where you want to create your private instance.
- IP_RANGE: The IP range that you allocated. To find the IP range in the Google Cloud console, go to VPC network details > Private service connection > Internal IP range.
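Because the request body is a long inline JSON string, it can help to build it from arguments. The following sketch assumes a hypothetical helper named `instance_body`. It reuses the field names from the request above and does not escape special characters in its arguments.

```shell
# Hypothetical helper: print the JSON body for the instance create call.
# $1 = network (a name, or a full path for a Shared VPC network)
# $2 = the IP range that you allocated
# Note: arguments are inserted as-is and are not JSON-escaped.
instance_body() {
  printf '{"description": "%s", "type": "ENTERPRISE", "privateInstance": true, "networkConfig": {"network": "%s", "ipAllocation": "%s"}}\n' \
    "Private CDF instance created through REST." "$1" "$2"
}

# Example call (commented out so that nothing is created by accident):
# curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d "$(instance_body NETWORK_NAME IP_RANGE)" \
#   "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instanceId=INSTANCE_ID"
```

For a Shared VPC network, pass the full network path (projects/SHARED_VPC_HOST_PROJECT_ID/global/networks/NETWORK_NAME) as the first argument.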
Shared VPC network
To create your instance in a Shared VPC network, use cURL, not the Google Cloud console.
cURL
For your convenience, you can export the following variables. Alternatively, you can directly substitute these values in the following commands:
export PROJECT=PROJECT_ID
export LOCATION=REGION
export DATA_FUSION_API_NAME=datafusion.googleapis.com
To create the instance, call its create() method:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"description": "Private CDF instance created through REST.", "type": "ENTERPRISE", "privateInstance": true, "networkConfig": {"network": "projects/SHARED_VPC_HOST_PROJECT_ID/global/networks/NETWORK_NAME", "ipAllocation": "IP_RANGE"}}' \
  "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instanceId=INSTANCE_ID"
Replace the following:
- INSTANCE_ID: The ID string your new instance should get.
- SHARED_VPC_HOST_PROJECT_ID: The ID of the project that's hosting the Shared VPC network.
- NETWORK_NAME: The name of the VPC network in which you want to create the private instance.
- IP_RANGE: The IP range that you allocated. To find the IP range in the Google Cloud console, go to the VPC network details page > Private service connection > Internal IP range.
Set up VPC network peering
Cloud Data Fusion services that you use in your design environment (for example: Wrangler, Connection Manager, and Schema Validation) initiate network connections from the tenant project VPC to the source systems. Cloud Data Fusion uses VPC network peering to establish network connectivity to the VPC or Shared VPC that contains your instance. The VPC network peering lets Cloud Data Fusion access resources in your network through internal IP addresses using your own VPC and its controls. To connect with a resource in another network, see the steps for connection use cases.
The following section describes how to create a peering configuration between your network and the Cloud Data Fusion tenant project network.
Get the tenant project ID
To create a peering configuration, you need the tenant project ID.
Go to the Cloud Data Fusion Instances page.
In the Instance Name column, select the instance.
On the Instance details page, copy the Tenant project ID, which is required when you create a peering connection in the following steps.
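If you prefer the command line, the tenant project ID can also be read from the instance resource returned by the REST API. The following sketch assumes that the resource exposes a `tenantProjectId` field and uses a hypothetical helper named `tenant_project_id`. Verify the field name against the API reference before relying on it.

```shell
# Hypothetical helper: read an instance resource (JSON) on stdin and print
# the value of its tenantProjectId field (field name is an assumption).
tenant_project_id() {
  sed -n 's/.*"tenantProjectId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example call (uses the variables exported earlier on this page):
# curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances/INSTANCE_ID" \
#   | tenant_project_id
```

The helper uses sed so that it has no dependency beyond a POSIX shell. A JSON-aware tool is the more robust choice if one is available.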
Create a peering connection
Go to the VPC network peering page.
Click Create connection > Continue.
On the Create peering connection page that opens, do the following:
- Enter a Name for your peering connection.
- For Your VPC network, select the network that contains your Cloud Data Fusion instance.
- For Peered VPC network, select In another project.
- For Project ID, enter the tenant project ID you found previously in this tutorial.
- For VPC network name, select a network or enter INSTANCE_REGION-INSTANCE_ID.
  Replace the following:
  - INSTANCE_REGION: the region in which you created your Cloud Data Fusion instance.
  - INSTANCE_ID: the ID of your Cloud Data Fusion instance.
- Select the Internet Protocol version for the peering connection to exchange IPv4 and IPv6 routes between your VPC network and the peered VPC network. For more information, see VPC network peering.
- Select Export custom routes so that custom routes can be exported from your VPC network to the tenant VPC network.
- Choose whether to allow subnet routes with public IPv4 to be imported or exported into your VPC network.
Click Create.
The VPC network peering becomes active shortly after it is created.
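The same peering can be sketched with the gcloud CLI. The helper name `create_peering` and the `DRY_RUN` switch are assumptions made for illustration. The peer network name is built as INSTANCE_REGION-INSTANCE_ID, as described in the steps above. The command runs against your currently configured gcloud project unless you add a `--project` flag.

```shell
# Hypothetical helper: peer your VPC network with the tenant project network.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
create_peering() {
  # $1 = peering name, $2 = your VPC network, $3 = tenant project ID,
  # $4 = instance region, $5 = instance ID
  set -- gcloud compute networks peerings create "$1" \
    --network="$2" --peer-project="$3" \
    --peer-network="$4-$5" --export-custom-routes
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```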
Set up IAM permissions
VPC network
Skip this step and go to Create a firewall rule.
Shared VPC network
If you create your Cloud Data Fusion instance in a Shared VPC network, you must grant the Compute Network User role to the following service accounts:
- Cloud Data Fusion service account: service-PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com
- Dataproc service account: service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com
PROJECT_NUMBER is the number of the Google Cloud project that contains your Cloud Data Fusion instance.
To give permissions to all subnets, grant the role on the Shared VPC host project. To further control access, grant the role on a specific subnet instead, and also grant the Network Viewer role on the host project.
For more information, see Granting access to the required service accounts.
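As an illustration of the project-level option, the following sketch grants the Compute Network User role (roles/compute.networkUser) on the host project to both service accounts. The helper name `grant_network_user` and the `DRY_RUN` switch are assumptions. The sketch does not cover the subnet-level variant.

```shell
# Hypothetical helper: grant Compute Network User on the Shared VPC host
# project to the Cloud Data Fusion and Dataproc service accounts.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
grant_network_user() {
  # $1 = Shared VPC host project ID
  # $2 = number of the project that contains the Cloud Data Fusion instance
  for sa in \
    "service-$2@gcp-sa-datafusion.iam.gserviceaccount.com" \
    "service-$2@dataproc-accounts.iam.gserviceaccount.com"
  do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo gcloud projects add-iam-policy-binding "$1" \
        --member="serviceAccount:$sa" --role=roles/compute.networkUser
    else
      gcloud projects add-iam-policy-binding "$1" \
        --member="serviceAccount:$sa" --role=roles/compute.networkUser
    fi
  done
}
```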
Create a firewall rule
Create a firewall rule on your VPC network that allows incoming SSH connections from the IP range you specified when you created your private Cloud Data Fusion instance.
This step is required for Cloud Data Fusion versions earlier than 6.2.0. It allows communication between Cloud Data Fusion and Dataproc clusters running pipelines.
You can create the firewall rule by using the Google Cloud console or by using the gcloud CLI.
Console
gcloud
Run the following command:
gcloud compute firewall-rules create FIREWALL_NAME-allow-ssh --allow=tcp:22 --source-ranges=IP_RANGE --network=NETWORK_NAME --project=PROJECT_ID
Replace the following:
- FIREWALL_NAME: The name of the firewall rule to create.
- IP_RANGE: The IP range you allocated.
- NETWORK_NAME: The name of the network to which the firewall rule is attached. It's the name of the VPC network in which you created the private instance.
- PROJECT_ID: The ID of the project that's hosting the VPC network.
Steps for connection use cases
The following sections describe connection-related use cases for private instances.
Enable Private Google Access
To access resources through internal IP addresses, Cloud Data Fusion must create the Dataproc clusters and run the data pipelines in a subnet that has Private Google Access. You must enable Private Google Access for the subnet that contains the Dataproc clusters.
- If only one subnet is present in the region where the Dataproc clusters are launched, then the cluster is launched in that subnet.
- If there are multiple subnets in a region, you must configure Cloud Data Fusion to select the subnet with Private Google Access for launching Dataproc clusters.
To enable Private Google Access for the subnet, see Private Google Access configuration.
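For reference, Private Google Access can be turned on for an existing subnet with the gcloud CLI. The following is a minimal sketch. The helper name `enable_private_google_access` and the `DRY_RUN` switch are assumptions, and the linked configuration guide remains the authoritative procedure.

```shell
# Hypothetical helper: enable Private Google Access on the Dataproc subnet.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
enable_private_google_access() {
  # $1 = subnet name, $2 = region of the subnet
  set -- gcloud compute networks subnets update "$1" \
    --region="$2" --enable-private-ip-google-access
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```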
Optional: Connect to other sources
After you create a private instance in Cloud Data Fusion, you can connect to other sources, such as the following:
- On-premises databases and systems running in other VPC networks
- Other Google Cloud services running in their own network in private mode, such as Cloud SQL
- Sources on the public internet
Optional: Enable DNS Peering
Enable DNS Peering in the following cases:
- When Cloud Data Fusion connects to systems through hostnames, and not IP addresses
- When the target system is deployed behind a load balancer, as is the case in some SAP deployments
What's next
- Learn about security concepts in Cloud Data Fusion.
- Learn about connecting to resources in external networks.
- Learn about other key Cloud Data Fusion concepts and features.
- See Cloud Data Fusion pricing.