This page explains how to specify a network and a subnetwork when you run Dataflow jobs.
VPC and Shared VPC
A VPC network, sometimes just called a network, provides connectivity for resources in a project. To learn more about VPC networks, see the VPC Networks overview.
A Shared VPC network is one that exists in a separate project, called a host project, within your organization. If a Shared VPC Admin has defined you as a Service Project Admin, you have permission to use at least some of the subnetworks in networks of the host project. For background information about Shared VPC, see the Shared VPC Overview.
VPC Service Controls
Using VPC Service Controls with Dataflow provides additional security for your pipeline's resources and services. To learn more about VPC Service Controls, see the VPC Service Controls overview.
To learn about the limitations when using Dataflow with VPC Service Controls, see the supported products and limitations.
Specifying a network and a subnetwork
When you create a Dataflow job, you can specify either a network or a subnetwork. The following sections describe when you should use each parameter.
If you omit both the subnetwork and network parameters, Google Cloud assumes you intend to use an auto mode VPC network with the name default. If you omit both the subnetwork and network parameters and you do not have a network named default in your project, you receive an error.
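One quick way to check whether your project still has a network named default is to list the project's networks with the gcloud CLI; this check is only an illustration, not a required step:
gcloud compute networks list --filter="name=default"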
Network parameter
You can use the network parameter to specify an auto mode network in your project.
Specify a network using only its name and not the complete URL.
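For example, the following sketch passes an auto mode network by name. It reuses the WordCount example and the placeholder project and bucket names from later on this page, and assumes your project still has the default auto mode network:
python -m apache_beam.examples.wordcount \
  --project my-cloud-project \
  --runner DataflowRunner \
  --temp_location gs://my-wordcount-storage-bucket/temp \
  --output gs://my-wordcount-storage-bucket/output \
  --network default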
You can also use the network parameter to select a Shared VPC network only if both of these conditions are true:
- The Shared VPC network you are selecting is an auto mode network.
- You are a Service Project Admin with project-level permissions to the whole Shared VPC host project. This means that a Shared VPC Admin has granted you the Compute Network User role on the entire host project, so you can use all of its networks and subnetworks.
For all other cases, you must specify a subnetwork.
Subnetwork parameter
If you need to select a specific subnetwork in a network, specify the subnetwork parameter. You must select a subnetwork in the same region as the zone where you run your Dataflow workers. For example, you must specify the subnetwork parameter in the following situations:
- The subnetwork you need is in a custom mode network.
- You are a Service Project Admin with subnet-level permissions to a specific subnetwork in a Shared VPC host project.
You can specify a subnetwork using either a complete URL or an abbreviated path. If the subnetwork is located in a Shared VPC network, you must use the complete URL.
- Complete URL: https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK
- Short form: regions/REGION/subnetworks/SUBNETWORK
Replace the following:
- HOST_PROJECT_ID: the host project ID
- REGION: the regional endpoint of your Dataflow job
- SUBNETWORK: the name of your Compute Engine subnetwork
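For example, for a hypothetical host project named my-host-project and a subnetwork named my-subnet in the us-central1 region, the two forms look like this:
--subnetwork=https://www.googleapis.com/compute/v1/projects/my-host-project/regions/us-central1/subnetworks/my-subnet
--subnetwork=regions/us-central1/subnetworks/my-subnet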
If you specify a subnetwork, Dataflow chooses the network for you. Therefore, when specifying a subnetwork, you can omit the network parameter.
The subnetwork size limits the number of instances only through the number of available IP addresses. For example, a subnetwork with a /24 primary range provides 256 addresses, four of which Google Cloud reserves, so it can support at most 252 workers. This sizing has no impact on Dataflow VPC Service Controls performance.
Public IP parameter
The Public IP parameter tells the Dataflow service whether to assign public IP addresses to Dataflow workers. By default, the Dataflow service assigns workers both public and private IP addresses. When you turn off public IP addresses, the Dataflow pipeline can access resources only in the following places:
- Another instance in the same VPC network
- A Shared VPC network
- A network with VPC Network Peering enabled
With public IPs turned off, you can still perform administrative and monitoring tasks by accessing your workers over SSH through any of the options in the preceding list. However, the pipeline cannot access the internet or other Google Cloud networks, and internet hosts cannot access your Dataflow workers.
Turning off public IPs allows you to better secure your data processing infrastructure. By not using public IP addresses for your Dataflow workers, you also lower the number of public IP addresses you consume against your Google Cloud project quota.
If you turn off public IPs, your Dataflow jobs cannot access APIs and services outside of Google Cloud that require internet access. For information on setting up internet access for jobs with private IPs, read Internet access for Dataflow.
JAVA
To turn off public IPs:
- Enable Private Google Access for your network or subnetwork.
- In the parameters of your Dataflow job, specify --usePublicIps=false and --network=NETWORK or --subnetwork=SUBNETWORK.
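For example, the following is a sketch of the WordCount invocation shown later on this page, with public IPs turned off; the project, bucket, and subnetwork names are placeholders:
mvn compile exec:java \
  -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--project=my-cloud-project \
    --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
    --output=gs://my-wordcount-storage-bucket/output \
    --runner=DataflowRunner \
    --usePublicIps=false \
    --subnetwork=regions/us-central1/subnetworks/mysubnetwork"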
PYTHON
To turn off public IPs:
- Follow the Apache Beam pipeline dependencies instructions to stage all Python package dependencies.
- Enable Private Google Access for your network or subnetwork.
- In the parameters of your Dataflow job, specify --no_use_public_ips and --network=NETWORK or --subnetwork=SUBNETWORK.
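For example, the following is a sketch of the Python WordCount invocation shown later on this page, with public IPs turned off; the project, bucket, and subnetwork names are placeholders:
python -m apache_beam.examples.wordcount \
  --project my-cloud-project \
  --runner DataflowRunner \
  --staging_location gs://my-wordcount-storage-bucket/staging \
  --temp_location gs://my-wordcount-storage-bucket/temp \
  --output gs://my-wordcount-storage-bucket/output \
  --no_use_public_ips \
  --subnetwork regions/us-central1/subnetworks/mysubnetwork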
Shared VPC
When specifying the subnetwork URL, verify that HOST_PROJECT_ID is the project in which the VPC network is hosted.
Additionally, make sure that both the Dataflow service account and the controller service account have the Compute Network User (roles/compute.networkUser) IAM role on the subnetwork being used.
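For example, one way to grant that role at the subnetwork level is with the gcloud CLI, as in the following sketch; SERVICE_ACCOUNT_EMAIL is a placeholder for the service account you are granting access to:
gcloud compute networks subnets add-iam-policy-binding SUBNETWORK \
  --project=HOST_PROJECT_ID \
  --region=REGION \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="roles/compute.networkUser"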
Running your pipeline with the network and subnetwork specified
The following examples show how to run your pipeline on the Dataflow service with the network and subnetwork parameters specified.
If you are a Service Project Admin who only has permission to use specific subnetworks in a Shared VPC network, you must specify the subnetwork parameter with a subnetwork that you have permission to use.
Using the command line
The following example shows how to run your pipeline from the command line, specifying the subnetwork. Specifying the subnetwork implicitly specifies the network.
JAVA
mvn compile exec:java \
  -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--project=my-cloud-project \
    --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
    --output=gs://my-wordcount-storage-bucket/output \
    --runner=DataflowRunner \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork"
PYTHON
python -m apache_beam.examples.wordcount \
  --project my-cloud-project \
  --runner DataflowRunner \
  --staging_location gs://my-wordcount-storage-bucket/staging \
  --temp_location gs://my-wordcount-storage-bucket/temp \
  --output gs://my-wordcount-storage-bucket/output \
  --subnetwork https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork
Using the REST API
The following example shows how to execute a template and specify a subnetwork. Specifying the subnetwork implicitly specifies the network.
If you're executing a Dataflow template using the REST API, add network and/or subnetwork to the environment object. For example:
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
"jobName": "my_job",
"parameters": {
"inputFile" : "gs://dataflow-samples/shakespeare/kinglear.txt",
"output": "gs://my-wordcount-storage-bucket/output"
},
"environment": {
"tempLocation": "gs://my-wordcount-storage-bucket/temp",
"subnetwork": "https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork",
"zone": "us-central1-f"
}
}