Specify a network and subnetwork

This document explains how to specify a network, a subnetwork, or both when you run Dataflow jobs.

This document assumes that you know how to create Google Cloud networks and subnetworks, and that you're familiar with the network terms discussed in the next section.

The default network has configurations that allow Dataflow jobs to run. However, other services might also use this network. Ensure that your changes to the default network are compatible with all of your services. Alternatively, create a separate network for Dataflow.

For more information about how to troubleshoot networking issues, see Troubleshoot Dataflow networking issues.

Google Cloud network terminology

  • VPC network. A VPC network, sometimes called a network, provides connectivity for resources in a project.

    To learn more about VPC, see VPC network overview.

  • Shared VPC network. A Shared VPC network is one that exists in a separate project, called a host project, within your organization. If a Shared VPC Admin has defined you as a Service Project Admin, you have permission to use at least some of the subnetworks in networks of the host project.

    To learn more about Shared VPC, see Shared VPC overview.

  • VPC Service Controls. Dataflow VPC Service Controls help secure your pipeline's resources and services.

    To learn more about VPC Service Controls, see VPC Service Controls overview. To learn about the limitations when using Dataflow with VPC Service Controls, see supported products and limitations.

Network and subnetwork for a Dataflow job

When you create a Dataflow job, you can specify a network, a subnetwork, or both.

Consider the following guidelines:

  • If you are unsure about which parameter to use, specify only the subnetwork parameter. The network parameter is then implicitly specified for you.

  • If you omit both the subnetwork and network parameters, Google Cloud assumes you intend to use an auto mode VPC network named default. If you don't have a network named default in your project, you must specify an alternate network or subnetwork.
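
If you don't know whether your project still has a network named default, you can list the project's networks by using the gcloud CLI. The following command is a minimal sketch; PROJECT_ID is a placeholder for your project ID:

gcloud compute networks list \
    --project=PROJECT_ID \
    --filter="name=default"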

Guidelines for specifying a network parameter

  • You can select an auto mode VPC network in your project with the network parameter. To check whether a network is in auto mode, see the sketch after this list.

  • You can specify a network using only its name and not the complete URL.

  • You can only use the network parameter to select a Shared VPC network if both of the following conditions are true:

    • The Shared VPC network that you select is an auto mode VPC network.

    • You are a Service Project Admin with project-level permissions to the whole Shared VPC host project. That is, a Shared VPC Admin has granted you the Compute Network User role for the whole host project, so you can use all of its networks and subnetworks.

  • In all other cases, you must specify a subnetwork.
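
To check whether a network is an auto mode VPC network, you can inspect its autoCreateSubnetworks field, which is true for auto mode networks. The following gcloud CLI command is a minimal sketch; it assumes a Shared VPC host project:

gcloud compute networks describe NETWORK_NAME \
    --project=HOST_PROJECT_ID \
    --format="value(autoCreateSubnetworks)"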

Guidelines for specifying a subnetwork parameter

  • If you specify a subnetwork, Dataflow chooses the network for you. Therefore, when specifying a subnetwork, you can omit the network parameter.

  • To select a specific subnetwork in a network, use the subnetwork parameter.

  • Specify a subnetwork using either a complete URL or an abbreviated path. If the subnetwork is located in a Shared VPC network, you must use the complete URL.

  • You must select a subnetwork in the same region as the zone where you run your Dataflow workers. To confirm a subnetwork's region, see the sketch after this list.

  • You must specify the subnetwork parameter in the following situations:

    • The subnetwork that you specify is in a custom mode VPC network.

    • You are a Service Project Admin with subnet-level permissions to a specific subnetwork in a Shared VPC host project.

  • The subnetwork size limits the number of instances only by the number of available IP addresses. This sizing doesn't impact Dataflow VPC Service Controls performance.
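
To confirm the region of a subnetwork before you run your job, you can look it up by using the gcloud CLI. The following command is a minimal sketch; the region that it prints must match the region of your Dataflow job:

gcloud compute networks subnets list \
    --project=HOST_PROJECT_ID \
    --filter="name=SUBNETWORK_NAME" \
    --format="value(region.basename())"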

Guidelines for specifying a subnetwork parameter for Shared VPC

  • When specifying the subnetwork URL for Shared VPC, ensure that HOST_PROJECT_ID is the project in which the VPC is hosted.

  • If the subnetwork is located in a Shared VPC network, you must use the complete URL.

  • Make sure that the Shared VPC subnetwork is shared with the Dataflow service account. The Compute Network User role must be granted to the Dataflow service account on the specified subnetwork in the host project, as shown in the command after this list.

    • In the Google Cloud console, go to the Shared VPC page and search for the subnet. In the Shared with column, you can see whether the VPC subnetwork is shared with the Dataflow service account.
    • If the subnetwork is not shared, the following error message appears: Error: Message: Required 'compute.subnetworks.get' permission.
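
The following gcloud CLI command is a minimal sketch that grants the Compute Network User role on a specific subnet. It assumes that SERVICE_PROJECT_NUMBER is the project number of the service project that runs the job; depending on your setup, other service accounts, such as the worker service account, might also need the role:

gcloud compute networks subnets add-iam-policy-binding SUBNETWORK_NAME \
    --project=HOST_PROJECT_ID \
    --region=REGION_NAME \
    --member="serviceAccount:service-SERVICE_PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"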

Example network and subnetwork specifications

Example of a complete URL that specifies a subnetwork:

https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME

Replace the following:

  • HOST_PROJECT_ID: the host project ID
  • REGION_NAME: the region of your Dataflow job
  • SUBNETWORK_NAME: the name of your Compute Engine subnetwork

The following is an example URL, where the host project ID is my-cloud-project, the region is us-central1, and the subnetwork name is mysubnetwork:

https://www.googleapis.com/compute/v1/projects/my-cloud-project/regions/us-central1/subnetworks/mysubnetwork

The following is an example of a short form that specifies a subnetwork:

regions/REGION_NAME/subnetworks/SUBNETWORK_NAME

Replace the following:

  • REGION_NAME: the region of your Dataflow job
  • SUBNETWORK_NAME: the name of your Compute Engine subnetwork

Run your pipeline with the network specified

If you want to use a network other than the default network created by Google Cloud, in most cases, you need to specify the subnetwork. The network is automatically inferred from the subnetwork that you specify. For more information, see Guidelines for specifying a network parameter in this document.

The following examples show how to run your pipeline from the command line or by using the REST API. The examples specify a network.

Java

mvn compile exec:java \
    -Dexec.mainClass=INPUT_PATH \
    -Dexec.args="--project=HOST_PROJECT_ID \
        --stagingLocation=gs://STORAGE_BUCKET/staging/ \
        --output=gs://STORAGE_BUCKET/output \
        --region=REGION \
        --runner=DataflowRunner \
        --network=NETWORK_NAME"

Python

python -m INPUT_PATH \
    --project HOST_PROJECT_ID \
    --region REGION \
    --runner DataflowRunner \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --output gs://STORAGE_BUCKET/output \
    --network NETWORK_NAME

Go

wordcount \
    --project HOST_PROJECT_ID \
    --region REGION \
    --runner dataflow \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --input INPUT_PATH \
    --output gs://STORAGE_BUCKET/output \
    --network NETWORK_NAME

API

If you're running a Dataflow template by using the REST API, add network or subnetwork, or both, to the environment object.

POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
    "jobName": "JOB_NAME",
    "parameters": {
       "inputFile" : "INPUT_PATH",
       "output": "gs://STORAGE_BUCKET/output"
    },
    "environment": {
       "tempLocation": "gs://STORAGE_BUCKET/temp",
       "network": "NETWORK_NAME",
       "zone": "us-central1-f"
    }
}

Replace the following:

  • JOB_NAME: the name of your Dataflow job (API only)
  • INPUT_PATH: for the Java and Python examples, the path to your Apache Beam source code; for the Go and API examples, the path to your input file
  • HOST_PROJECT_ID: the host project ID
  • REGION: a Dataflow region, like us-central1
  • STORAGE_BUCKET: the storage bucket
  • NETWORK_NAME: the name of your Compute Engine network
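
If you launch a classic template by using the gcloud CLI instead of the REST API, you can pass the network with the --network flag. The following command is a minimal sketch that mirrors the REST example:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location=gs://dataflow-templates/wordcount/template_file \
    --region=REGION \
    --staging-location=gs://STORAGE_BUCKET/temp \
    --network=NETWORK_NAME \
    --parameters=inputFile=INPUT_PATH,output=gs://STORAGE_BUCKET/output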

Run your pipeline with the subnetwork specified

If you are a Service Project Admin who only has permission to use specific subnetworks in a Shared VPC network, you must specify the subnetwork parameter with a subnetwork that you have permission to use.

The following examples show how to run your pipeline from the command line or by using the REST API. The examples specify a subnetwork. You can also specify the network.

Java

mvn compile exec:java \
    -Dexec.mainClass=INPUT_PATH \
    -Dexec.args="--project=HOST_PROJECT_ID \
        --stagingLocation=gs://STORAGE_BUCKET/staging/ \
        --output=gs://STORAGE_BUCKET/output \
        --region=REGION \
        --runner=DataflowRunner \
        --subnetwork=https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME"

Python

python -m INPUT_PATH \
    --project HOST_PROJECT_ID \
    --region REGION \
    --runner DataflowRunner \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --output gs://STORAGE_BUCKET/output \
    --subnetwork https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME

Go

wordcount \
    --project HOST_PROJECT_ID \
    --region REGION \
    --runner dataflow \
    --staging_location gs://STORAGE_BUCKET/staging \
    --temp_location gs://STORAGE_BUCKET/temp \
    --input INPUT_PATH \
    --output gs://STORAGE_BUCKET/output \
    --subnetwork https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME

API

If you're running a Dataflow template by using the REST API, add network or subnetwork, or both, to the environment object.

POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/wordcount/template_file
{
    "jobName": "JOB_NAME",
    "parameters": {
       "inputFile" : "INPUT_PATH",
       "output": "gs://STORAGE_BUCKET/output"
    },
    "environment": {
       "tempLocation": "gs://STORAGE_BUCKET/temp",
       "subnetwork": "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME",
       "zone": "us-central1-f"
    }
}

Replace the following:

  • JOB_NAME: the name of your Dataflow job (API only)
  • INPUT_PATH: for the Java and Python examples, the path to your Apache Beam source code; for the Go and API examples, the path to your input file
  • HOST_PROJECT_ID: the host project ID
  • REGION: a Dataflow region, like us-central1
  • STORAGE_BUCKET: the storage bucket
  • SUBNETWORK_NAME: the name of your Compute Engine subnetwork
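
As in the network example, a classic template launched by using the gcloud CLI can take the subnetwork directly. The following command is a minimal sketch; because the subnetwork is in a Shared VPC network, it uses the complete URL:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location=gs://dataflow-templates/wordcount/template_file \
    --region=REGION \
    --staging-location=gs://STORAGE_BUCKET/temp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME \
    --parameters=inputFile=INPUT_PATH,output=gs://STORAGE_BUCKET/output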

Turn off an external IP address

To turn off an external IP address, see Configure internet access and firewall rules.
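
For reference, Apache Beam pipelines can also turn off external IP addresses at submission time: the Python SDK accepts the --no_use_public_ips flag, and the Java SDK accepts --usePublicIps=false. The following command is a minimal sketch of the Python form; it assumes that the subnetwork has Private Google Access enabled so that workers can reach Google APIs without external IP addresses:

python -m INPUT_PATH \
    --project HOST_PROJECT_ID \
    --region REGION \
    --runner DataflowRunner \
    --temp_location gs://STORAGE_BUCKET/temp \
    --subnetwork https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK_NAME \
    --no_use_public_ips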