Specifying your Network and Subnetwork

This page explains how to specify a network and a subnetwork when you run Cloud Dataflow jobs.

VPC and Shared VPC

A VPC network, sometimes just called a network, provides connectivity for resources in a project. If you are not familiar with VPC networks, review the VPC Network Overview first.

A Shared VPC network is one that exists in a separate project, called a host project, within your organization. If a Shared VPC Admin has defined you as a Service Project Admin, you have permission to use at least some of the subnetworks in networks of the host project. Refer to the Shared VPC Overview for background information about Shared VPC.

Specifying a network and a subnetwork

When you create a Cloud Dataflow job, you can specify either a network or a subnetwork. The following sections describe when you should use each parameter.

If you omit both the subnetwork and network parameters, GCP assumes you intend to use an auto mode VPC network named default. If you omit both parameters and your project does not have a network named default, you receive an error.

Note: If you are unsure about which parameter to use, specify only the subnetwork parameter. The network parameter will be implicitly specified for you.

Network parameter

You can use the network parameter to specify an auto mode network in your project.

You can also use the network parameter to select a Shared VPC network only if both of these conditions are true:

  • The Shared VPC network you are selecting is an auto mode network.
  • You are a Service Project Admin with project-level permissions to the whole Shared VPC host project. This means that a Shared VPC Admin has granted you the Network User role to the whole host project, so you are able to use all of its networks and subnetworks.

For all other cases, you must specify a subnetwork.
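
For example, assuming your project has an auto mode network named default, you could run a job on it with a command like the following sketch (the project and bucket names are placeholders, reused from the examples later on this page):

mvn compile exec:java \
  -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--project=my-cloud-project \
    --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
    --output=gs://my-wordcount-storage-bucket/output \
    --runner=DataflowRunner \
    --network=default"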

Subnetwork parameter

If you need to select a specific subnetwork in a network, specify the subnetwork parameter. You must select a subnetwork in the same region as the zone where you run your Cloud Dataflow job. For example, you must specify the subnetwork parameter in the following situations:

  • The subnetwork you need is in a custom mode network.
  • You are a Service Project Admin with subnet-level permissions to a specific subnetwork in a Shared VPC host project.

You can specify a subnetwork using either a complete URL or an abbreviated path. If the subnetwork is located in a Shared VPC network, you must use the complete URL.

  • Complete URL:
    https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT>/regions/<REGION>/subnetworks/<SUBNETWORK>
  • Short form:
    regions/<REGION>/subnetworks/<SUBNETWORK>

If you specify a subnetwork, Cloud Dataflow chooses the network for you. Therefore, when specifying a subnetwork you can omit the network parameter.
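
For example, either of the following values selects the same subnetwork in your own project (the project, region, and subnetwork names are placeholders):

--subnetwork=https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork
--subnetwork=regions/us-central1/subnetworks/mysubnetwork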

Public IP parameter

The Public IP parameter tells the Cloud Dataflow service whether to assign public IP addresses to Cloud Dataflow workers. By default, the Cloud Dataflow service assigns workers both public and private IP addresses. When you turn off public IP addresses, the Cloud Dataflow pipeline can access resources only in the following places:

  • Another instance in the same VPC network
  • A Shared VPC network
  • A network with VPC Network Peering enabled

With public IPs turned off, you can still perform administrative and monitoring tasks. You can access your workers by using SSH through the options listed above. However, the pipeline cannot access the internet or other GCP networks, and internet hosts cannot access your Cloud Dataflow workers.

Turning off public IPs allows you to better secure your data processing infrastructure. By not using public IP addresses for your Cloud Dataflow workers, you also lower the number of public IP addresses you consume against your GCP project quota.

JAVA

To turn off public IPs:

  1. Enable Private Google Access for your network or subnetwork.
  2. In the parameters of your Cloud Dataflow job, specify --usePublicIps=false and --network=[NETWORK] or --subnetwork=[SUBNETWORK].
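
For example, a complete run command might look like the following sketch, which reuses the placeholder names from the examples later on this page and assumes Private Google Access is already enabled on the subnetwork:

mvn compile exec:java \
  -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--project=my-cloud-project \
    --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
    --output=gs://my-wordcount-storage-bucket/output \
    --runner=DataflowRunner \
    --usePublicIps=false \
    --subnetwork=regions/us-central1/subnetworks/mysubnetwork"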

PYTHON

The Public IP parameter requires the Apache Beam SDK for Python; the Cloud Dataflow SDK for Python does not support it. To turn off public IPs:

  1. Follow the Apache Beam pipeline dependencies instructions to stage all Python package dependencies.
  2. Enable Private Google Access for your network or subnetwork.
  3. In the parameters of your Cloud Dataflow job, specify --no_use_public_ips and --network=[NETWORK] or --subnetwork=[SUBNETWORK].
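
For example, a complete run command might look like the following sketch (placeholder names, with Private Google Access assumed to be enabled on the subnetwork):

python -m apache_beam.examples.wordcount \
  --project my-cloud-project \
  --runner DataflowRunner \
  --staging_location gs://my-wordcount-storage-bucket/staging \
  --temp_location gs://my-wordcount-storage-bucket/temp \
  --output gs://my-wordcount-storage-bucket/output \
  --no_use_public_ips \
  --subnetwork regions/us-central1/subnetworks/mysubnetwork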

Running your pipeline with the network and subnetwork specified

The following examples show how to run your pipeline on the Cloud Dataflow service with the network and subnetwork parameters specified.

If you are a Service Project Admin who only has permission to use specific subnetworks in a Shared VPC network, you must specify the subnetwork parameter with a subnetwork that you have permission to use.
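
For example, if the Shared VPC host project were named my-host-project (a hypothetical name), you would keep the project parameter pointed at your service project but reference the host project in the subnetwork URL:

--project=my-cloud-project
--subnetwork=https://www.googleapis.com/compute/v1/projects/my-host-project/regions/us-central1/subnetworks/mysubnetwork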

Using the command line

The following example shows how to run your pipeline from the command line, specifying the subnetwork. Specifying the subnetwork implicitly specifies the network.

JAVA

mvn compile exec:java \
  -Dexec.mainClass=com.example.WordCount \
  -Dexec.args="--project=my-cloud-project \
    --stagingLocation=gs://my-wordcount-storage-bucket/staging/ \
    --output=gs://my-wordcount-storage-bucket/output \
    --runner=DataflowRunner \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork"

PYTHON

python -m apache_beam.examples.wordcount \
  --project my-cloud-project \
  --runner DataflowRunner \
  --staging_location gs://my-wordcount-storage-bucket/staging \
  --temp_location gs://my-wordcount-storage-bucket/temp \
  --output gs://my-wordcount-storage-bucket/output \
  --subnetwork https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork

Using the REST API

The following example shows how to execute a template and specify a subnetwork. Specifying the subnetwork implicitly specifies the network.

If you're executing a Cloud Dataflow template using the REST API, add network and/or subnetwork to the environment object. For example:

POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
    "jobName": "my_job",
    "parameters": {
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://my-wordcount-storage-bucket/output"
    },
    "environment": {
        "tempLocation": "gs://my-wordcount-storage-bucket/temp",
        "subnetwork": "https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork",
        "zone": "us-central1-f"
    }
}
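
For example, you might send this request with curl, using gcloud to supply an OAuth access token (a sketch; the project, bucket, and subnetwork names are placeholders):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "my_job",
        "parameters": {
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": "gs://my-wordcount-storage-bucket/output"
        },
        "environment": {
            "tempLocation": "gs://my-wordcount-storage-bucket/temp",
            "subnetwork": "https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork",
            "zone": "us-central1-f"
        }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-cloud-project/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file"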