This document explains how to specify a network, a subnetwork, or both when you run Dataflow jobs.
This document assumes that you know how to create Google Cloud networks and subnetworks, and that you are familiar with the network terms discussed in the next section.
Google Cloud network terminology
VPC network. A VPC network, sometimes called a network, provides connectivity for resources in a project.
To learn more about VPC, see VPC network overview.
Shared VPC network. A Shared VPC network is one that exists in a separate project, called a host project, within your organization. If a Shared VPC Admin has defined you as a Service Project Admin, you have permission to use at least some of the subnetworks in networks of the host project.
To learn more about Shared VPC, see Shared VPC overview.
VPC Service Controls. When used with Dataflow, VPC Service Controls help secure your pipeline's resources and services.
To learn more about VPC Service Controls, see VPC Service Controls overview. To learn about the limitations when using Dataflow with VPC Service Controls, see supported products and limitations.
Network and subnetwork for a Dataflow job
When you create a Dataflow job, you can specify a network, a subnetwork, or both.
Consider the following guidelines:
If you are unsure about which parameter to use, specify only the subnetwork parameter. The network parameter is then implicitly specified for you.
If you omit both the subnetwork and network parameters, Google Cloud assumes you intend to use an auto mode VPC network named default. If you omit both parameters and your project does not have a network named default, you receive an error.
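The fallback rules above can be sketched in Python. This is a simplified illustration of the documented behavior, not Dataflow's actual implementation:

```python
def resolve_network(network=None, subnetwork=None, project_networks=()):
    """Mirror the documented parameter-fallback rules (illustration only).

    - A subnetwork implies its network, so the network may be omitted.
    - With neither parameter, the auto mode network "default" is assumed.
    - If no "default" network exists, the job fails with an error.
    """
    if subnetwork:
        return subnetwork
    if network:
        return network
    if "default" in project_networks:
        return "default"
    raise ValueError("No network named 'default' exists in this project.")

print(resolve_network(project_networks=("default", "custom-net")))  # default
```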
Guidelines for specifying a network parameter
You can select an auto mode network in your project with the network parameter.
You can specify a network using only its name and not the complete URL.
You can also use the network parameter to select a Shared VPC network only if both of the following conditions are true:
The Shared VPC network that you select is an auto mode network.
You are a Service Project Admin with project-level permissions to the whole Shared VPC host project. This means that a Shared VPC Admin has granted you the Compute Network User role for the whole host project, so you are able to use all of its networks and subnetworks.
For all other cases, you must specify a subnetwork.
Guidelines for specifying a subnetwork parameter
If you specify a subnetwork, Dataflow chooses the network for you. Therefore, when specifying a subnetwork, you can omit the network parameter.
To select a specific subnetwork in a network, use the subnetwork parameter.
Specify a subnetwork using either a complete URL or an abbreviated path.
You must select a subnetwork in the same region as the zone where you run your Dataflow workers.
You must specify the subnetwork parameter in the following situations:
The subnetwork you specify is in a custom mode network.
You are a Service Project Admin with subnet-level permissions to a specific subnetwork in a Shared VPC host project.
The subnetwork size limits the number of instances only by the number of available IP addresses. This sizing has no impact on Dataflow VPC Service Controls performance.
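As a rough sizing sketch, you can estimate how many worker VMs a subnet's primary IPv4 range can hold. Google Cloud reserves four addresses in every subnet's primary range (the network, gateway, second-to-last, and broadcast addresses); this helper is an illustration, not an official capacity formula:

```python
import ipaddress

def usable_worker_ips(cidr: str) -> int:
    """Estimate addresses available to worker VMs in a subnet's primary range.

    Google Cloud reserves four IPv4 addresses in each primary range,
    so the remainder can be assigned to instances.
    """
    return ipaddress.ip_network(cidr).num_addresses - 4

print(usable_worker_ips("10.128.0.0/24"))  # 252
```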
Guidelines for specifying a subnetwork parameter for Shared VPC
When specifying the subnetwork URL for Shared VPC, ensure that HOST_PROJECT_ID is the project in which the VPC is hosted.
If the subnetwork is located in a Shared VPC network, you must use the complete URL.
Make sure that the Shared VPC subnetwork is shared with the Dataflow service account, and that the Dataflow service account has the Compute Network User role on the specified subnet.
- In the Google Cloud console, go to the Shared VPC page and search for the subnet. In the Shared with column, you can see whether the VPC subnetwork is shared with the Dataflow service account.
- If the subnetwork is not shared with the Dataflow service account, the following error message appears: Error: Message: Required 'compute.subnetworks.get' permission.
Example network and subnetwork specifications
Example of a complete URL that specifies a subnetwork:
https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
Replace the following:
HOST_PROJECT_ID: the host project ID
REGION_NAME: the regional endpoint of your Dataflow job
SUBNETWORK_NAME: the name of your Compute Engine subnetwork
The following is an example URL, where the host project ID is my-cloud-project, the region is us-central1, and the subnetwork name is mysubnetwork:
https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork
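If you assemble this URL in a script, a small helper keeps the parts consistent. This is an illustration; the placeholder names match those used above:

```python
def subnetwork_url(host_project_id: str, region: str, subnetwork: str) -> str:
    """Build the complete subnetwork URL in the form Dataflow expects."""
    return (
        "https://www.googleapis.com/compute/v1/"
        f"projects/{host_project_id}/regions/{region}/subnetworks/{subnetwork}"
    )

print(subnetwork_url("my-cloud-project", "us-central1", "mysubnetwork"))
```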
The following is an example of a short form that specifies a subnetwork:
regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
Replace the following:
REGION_NAME: the regional endpoint of your Dataflow job
SUBNETWORK_NAME: the name of your Compute Engine subnetwork
Run your pipeline with the network and subnetwork specified
If you are a Service Project Admin who only has permission to use specific subnetworks in a Shared VPC network, you must specify the subnetwork parameter with a subnetwork that you have permission to use.
The following examples show how to run your pipeline from the command line or by using the REST API. The examples specify a subnetwork. You can also specify the network.
Java
mvn compile exec:java \
    -Dexec.mainClass=com.example.WordCount \
    -Dexec.args="--project=HOST_PROJECT_ID \
    --stagingLocation=gs://STORAGE_BUCKET/staging/ \
    --output=gs://STORAGE_BUCKET/output \
    --region=REGION \
    --runner=DataflowRunner \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME"
Python
python -m apache_beam.examples.wordcount \
    --project HOST_PROJECT_ID \
    --region=REGION \
    --runner DataflowRunner \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --output gs://STORAGE_BUCKET/output \
    --subnetwork https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
Go
wordcount --project HOST_PROJECT_ID \
    --region HOST_GCP_REGION \
    --runner dataflow \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://STORAGE_BUCKET/output \
    --subnetwork https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
API
If you're running a Dataflow template using the REST API, add network or subnetwork, or both, to the environment object.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
    "jobName": "JOB_NAME",
    "parameters": {
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://STORAGE_BUCKET/output"
    },
    "environment": {
        "tempLocation": "gs://STORAGE_BUCKET/temp",
        "subnetwork": "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME",
        "zone": "us-central1-f"
    }
}
Replace the following:
JOB_NAME: the name of your Dataflow job (API only)
HOST_PROJECT_ID: the host project ID
REGION: a Dataflow regional endpoint, like us-central1
STORAGE_BUCKET: the storage bucket
REGION_NAME: the regional endpoint of your Dataflow job
SUBNETWORK_NAME: the name of your Compute Engine subnetwork
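If you build the launch request body programmatically, the JSON shown above can be assembled and serialized in Python. This is a sketch; the values are the same placeholders used in the example:

```python
import json

# Request body for the templates:launch call shown above (placeholders unchanged).
launch_body = {
    "jobName": "JOB_NAME",
    "parameters": {
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://STORAGE_BUCKET/output",
    },
    "environment": {
        "tempLocation": "gs://STORAGE_BUCKET/temp",
        "subnetwork": (
            "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID"
            "/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME"
        ),
        "zone": "us-central1-f",
    },
}

# Serialize to the JSON payload sent with the POST request.
print(json.dumps(launch_body, indent=2))
```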
Turn off an external IP address
To turn off an external IP address, see Configure internet access and firewall rules.