Configure internet access and firewall rules

This document explains how to configure Dataflow virtual machine (VM) instances for internet access, create network tags, and define firewall rules for the network associated with your Dataflow jobs.

This document requires that you have basic knowledge of Google Cloud networks. To define a network for your Dataflow job, see Specify your network and subnetwork.

Internet access for Dataflow

Dataflow worker virtual machines (VMs) must reach Google Cloud APIs and services. Depending on your use case, your VMs may also need access to resources outside Google Cloud. Use one of the following methods to configure internet access for Dataflow:

  • Configure worker VMs with an external IP address so that they meet the internet access requirements.

  • Configure Private Google Access. With Private Google Access, VMs that have only internal IP addresses can reach the IP addresses of Google Cloud APIs and services.

  • Configure a Private Service Connect endpoint IP address to access Google Cloud APIs and services.

  • Configure a NAT solution, such as Cloud NAT. This option is for running jobs that access APIs and services outside of Google Cloud that require internet access. For example, Python SDK jobs might require access to the Python Package Index (PyPI) to download pipeline dependencies. In this case, you must either configure worker VMs with external IP addresses or use Cloud NAT. You can also supply Python pipeline dependencies during job submission. For example, you can use custom containers to supply Python pipeline dependencies, which removes the need to access PyPI at runtime.

    For more information, see Managing Python Pipeline Dependencies in the Apache Beam documentation.

Turn off external IP address

By default, the Dataflow service assigns workers both external and internal IP addresses. When you turn off external IP addresses, the Dataflow pipeline can access resources only in the following places:

Without external IP addresses, you can still perform administrative and monitoring tasks. You can access your workers by using SSH through the options listed in the preceding list. However, the pipeline cannot access the internet, and internet hosts cannot access your Dataflow workers.

Not using external IP addresses helps to better secure your data processing infrastructure. You also lower the number of external IP addresses that you consume against your Google Cloud project quota.

If you turn off external IP addresses, your Dataflow jobs cannot access APIs and services outside of Google Cloud that require internet access.

For information on setting up internet access for jobs with internal IP addresses, see the preceding section, Internet access for Dataflow.

To turn off external IP addresses, do one of the following:

Java

  1. Enable Private Google Access for your network or subnetwork.
  2. In the parameters of your Dataflow job, specify --usePublicIps=false and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME.

    Depending on your choice, replace one of the following:

    • NETWORK-NAME: the name of your Compute Engine network
    • SUBNETWORK-NAME: the name of your Compute Engine subnetwork

Python

  1. To stage all Python package dependencies, follow the Apache Beam pipeline dependencies instructions.
  2. Enable Private Google Access for your network or subnetwork.
  3. In the parameters of your Dataflow job, specify --no_use_public_ips and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME.

    Depending on your choice, replace one of the following:

    • NETWORK-NAME: the name of your Compute Engine network
    • SUBNETWORK-NAME: the name of your Compute Engine subnetwork
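
The Python steps above can be sketched as a small helper that assembles the job parameters. This is a minimal illustration, not part of the Dataflow SDK: the function name private_ip_flags and the value my-subnetwork are hypothetical, and only the flag names come from the steps above.

```python
def private_ip_flags(network=None, subnetwork=None):
    """Return Python SDK job parameters that turn off external IP addresses.

    Hypothetical helper for illustration; only the flag names are taken
    from the documented steps.
    """
    if not (network or subnetwork):
        raise ValueError("specify a network or a subnetwork")
    flags = ["--no_use_public_ips"]
    if network:
        flags.append(f"--network={network}")
    if subnetwork:
        flags.append(f"--subnetwork={subnetwork}")
    return flags


# "my-subnetwork" is a placeholder, not a real resource name.
flags = private_ip_flags(subnetwork="my-subnetwork")
print(flags)
```

A list like this could be appended to the other arguments that you pass when you launch the pipeline, for example through the Apache Beam PipelineOptions class.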

Go

  1. Enable Private Google Access for your network or subnetwork.
  2. In the parameters of your Dataflow job, specify --no_use_public_ips and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME.

    Depending on your choice, replace one of the following:

    • NETWORK-NAME: the name of your Compute Engine network
    • SUBNETWORK-NAME: the name of your Compute Engine subnetwork

Network tags for Dataflow

Network tags are text attributes that you can attach to Compute Engine VMs. Network tags let you make VPC network firewall rules and certain custom static routes applicable to specific VM instances. Dataflow supports adding network tags to all worker VMs that run a particular Dataflow job.

Even if you do not use the network parameter, Dataflow always adds the default network tag dataflow to every worker VM it creates.

Enable network tags

You can specify the network tags only when you run the Dataflow job template to create a job. After a job starts, you cannot add more network tags to the job. To apply additional network tags to a job, you must recreate your job template with the required network tags.

Pass the following option when you launch your pipeline, whether you run in Java or Python:

--experiments=use_network_tags=TAG-NAME

Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;....
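
As a rough sketch of the format described above, the following helper joins several tags with semicolons. The function name network_tags_flag and the tag names are hypothetical. Because a semicolon separates commands in most shells, the example also prints a quoted form of the flag that is safer to paste on a command line.

```python
import shlex


def network_tags_flag(tags):
    """Build the --experiments flag that attaches network tags to worker VMs."""
    if not tags:
        raise ValueError("at least one tag is required")
    # Multiple tags are separated with semicolons.
    return "--experiments=use_network_tags=" + ";".join(tags)


# "tag-a" and "tag-b" are placeholder tag names.
flag = network_tags_flag(["tag-a", "tag-b"])
print(flag)
# Quote the flag before using it in a shell, because ";" ends a command.
print(shlex.quote(flag))
```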

Even if you do not use this parameter, Dataflow always adds the network tag dataflow to every worker VM it creates.

Enable network tags for Flex Template VMs

When using Flex Templates, to enable network tags for Dataflow worker VMs, use the --additional-experiments option as shown in the following example:

--additional-experiments=use_network_tags=TAG-NAME

To enable network tags for both worker VMs and launcher VMs, you need to use the following two options:

--additional-experiments=use_network_tags=TAG-NAME
--additional-experiments=use_network_tags_for_flex_templates=TAG-NAME

Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;....

After you enable the network tags, the tags are parsed and attached to the VMs.
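
The two Flex Template cases can be sketched with one helper. This is an illustration only: the function name flex_template_tag_flags and its include_launcher parameter are hypothetical, and the sketch assumes that the worker VMs and the launcher VM receive the same tags, as in the example options above.

```python
def flex_template_tag_flags(tags, include_launcher=False):
    """Return --additional-experiments options that enable network tags.

    With include_launcher=True, the launcher VM gets the same tags as the
    worker VMs (an assumption made for this sketch).
    """
    if not tags:
        raise ValueError("at least one tag is required")
    joined = ";".join(tags)
    flags = [f"--additional-experiments=use_network_tags={joined}"]
    if include_launcher:
        flags.append(
            f"--additional-experiments=use_network_tags_for_flex_templates={joined}"
        )
    return flags


# Placeholder tag names; prints one option per line.
for option in flex_template_tag_flags(["tag-a", "tag-b"], include_launcher=True):
    print(option)
```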

See the limits that apply to network tags.

Firewall rules for Dataflow

Firewall rules let you allow or deny traffic to and from your VMs. If your Dataflow jobs use Dataflow Shuffle or Streaming Engine, then you don't need to configure any firewall rules. Otherwise, you must configure firewall rules so that Dataflow VMs can send and receive network traffic on TCP port 12345 for streaming jobs and on TCP port 12346 for batch jobs. A project owner, editor, or security administrator must create the necessary firewall rules in the VPC network used by your Dataflow VMs.
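
The port requirement described above reduces to a small decision, sketched here. The function name required_worker_ports and its parameters are hypothetical; uses_service_backend stands for a job that uses Dataflow Shuffle or Streaming Engine.

```python
STREAMING_PORT = 12345  # TCP port used between workers in streaming jobs
BATCH_PORT = 12346      # TCP port used between workers in batch jobs


def required_worker_ports(streaming, uses_service_backend):
    """Return the TCP ports that firewall rules must allow between workers.

    uses_service_backend is True when the job uses Dataflow Shuffle or
    Streaming Engine, in which case no firewall rules are needed.
    """
    if uses_service_backend:
        return []
    return [STREAMING_PORT] if streaming else [BATCH_PORT]


print(required_worker_ports(streaming=True, uses_service_backend=False))
print(required_worker_ports(streaming=False, uses_service_backend=True))
```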

Before you configure firewall rules for Dataflow, read the following documents:

When you create firewall rules for Dataflow, specify the Dataflow network tags. Otherwise, the firewall rules apply to all VMs in the VPC network.

Where applicable, hierarchical firewall policies are evaluated first, and their rules preempt VPC firewall rules. If the Dataflow job is in a project that is part of a folder or organization where hierarchical firewall policies are used, the compute.orgFirewallPolicyAdmin role is required to make policy modifications.

If you did not create custom network tags when you ran the pipeline code, Dataflow VMs use the default dataflow tag. In the absence of custom network tags, create the firewall rules with the default dataflow tag.

If you created custom network tags when you ran the pipeline code, Dataflow VMs use those tags. Create the firewall rules with the custom tags.

Some VPC networks, such as the automatically created default network, include a default-allow-internal rule that meets the firewall requirement for Dataflow.

Example firewall ingress rule

The ingress firewall rule permits Dataflow VMs to receive packets from each other. You must always create ingress allow firewall rules or traffic is always blocked, even if egress rules allow such traffic.

In the following example, a firewall ingress rule is created for Dataflow, where all worker VMs have the default network tag dataflow. A project owner, editor, or security admin can use the following gcloud command to create an ingress allow rule that permits traffic on TCP ports 12345 and 12346 from VMs with the network tag dataflow to other VMs with the same tag:

gcloud compute firewall-rules create FIREWALL_RULE_NAME_INGRESS \
    --action=allow \
    --direction=ingress \
    --network=NETWORK \
    --target-tags=CUSTOM_TAG \
    --source-tags=CUSTOM_TAG \
    --priority=PRIORITY_NUM \
    --rules tcp:12345-12346

Replace the following:

  • FIREWALL_RULE_NAME_INGRESS: a name for the firewall rule

  • NETWORK: the name of the network that your worker VMs use

  • CUSTOM_TAG: a comma-delimited list of network tags

    The following is a list of guidelines for using network tags:

    • If you omit --target-tags, the rule applies to all VMs in the VPC network.

    • If you omit --source-tags and all other source specifications, traffic from any source is allowed.

    • If you have not specified custom network tags and you want the rule to be specific to Dataflow VMs, use dataflow as the network tag.

    • If you have specified custom network tags and you want the rule to be specific to Dataflow VMs, use your custom network tags.

  • PRIORITY_NUM: the priority of the firewall rule

    Lower numbers have higher priorities and 0 is the highest priority.
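
If you script rule creation, the ingress command above can be assembled as an argument list. The following is a sketch under stated assumptions: the function name ingress_rule_command, the rule name, the network name, and the priority of 1000 are illustrative values, not recommendations. The helper falls back to the default dataflow tag when no custom tags are given and only prints the command instead of running it.

```python
import shlex


def ingress_rule_command(rule_name, network, tags=None, priority=1000):
    """Build the gcloud arguments for a Dataflow ingress allow rule."""
    # Firewall rule priorities range from 0 (highest) to 65535 (lowest).
    if not 0 <= priority <= 65535:
        raise ValueError("priority must be between 0 and 65535")
    # Without custom tags, Dataflow worker VMs carry the default tag.
    tag_list = ",".join(tags or ["dataflow"])
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        "--action=allow",
        "--direction=ingress",
        f"--network={network}",
        f"--target-tags={tag_list}",
        f"--source-tags={tag_list}",
        f"--priority={priority}",
        "--rules", "tcp:12345-12346",
    ]


# Placeholder rule and network names.
cmd = ingress_rule_command("allow-dataflow-ingress", "my-network")
print(shlex.join(cmd))
```

Building a list rather than a single string keeps each argument separate, so the same list could later be handed to subprocess.run without shell quoting concerns.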

Example firewall egress rule

The egress firewall rule permits Dataflow VMs to send packets to each other. If you've created any egress deny firewall rules, you might need to create custom egress allow firewall rules in your VPC network.

In this example, a firewall egress rule is created for Dataflow, where all worker VMs have the default network tag of dataflow. A project owner, editor, or security admin can use the following gcloud command to create an egress allow rule that permits traffic on TCP ports 12345 and 12346 from VMs with the network tag dataflow to the specified destination ranges:

gcloud compute firewall-rules create FIREWALL_RULE_NAME_EGRESS \
    --network=NETWORK \
    --action=allow \
    --direction=egress \
    --target-tags=CUSTOM_TAG \
    --destination-ranges=DESTINATION-RANGES \
    --priority=PRIORITY_NUM \
    --rules tcp:12345-12346

Replace the following:

  • FIREWALL_RULE_NAME_EGRESS: a name for the firewall rule

  • NETWORK: the name of the network that your worker VMs use

  • CUSTOM_TAG: a comma-delimited list of network tags

    The following is a list of guidelines for using network tags:

    • If you omit --target-tags, the rule applies to all VMs in the VPC network.

    • If you omit --destination-ranges, the rule applies to traffic to any destination.

    • If you have not specified custom network tags and you want the rule to be specific to Dataflow VMs, use dataflow as the network tag.

    • If you have specified custom network tags and you want the rule to be specific to Dataflow VMs, use your custom network tags.

  • DESTINATION-RANGES: a comma-delimited list of CIDRs

    Include the selected subnetwork's primary IP address range.

  • PRIORITY_NUM: the priority of the firewall rule

    Lower numbers have higher priorities and 0 is the highest priority.
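
The egress command can be scripted in the same way. The sketch below additionally checks each destination range with the standard ipaddress module, so a malformed CIDR is rejected before the command is built. The function name egress_rule_command, the rule and network names, the example range 10.128.0.0/20, and the priority of 1000 are assumptions for illustration.

```python
import ipaddress
import shlex


def egress_rule_command(rule_name, network, destination_ranges,
                        tags=None, priority=1000):
    """Build the gcloud arguments for a Dataflow egress allow rule."""
    if not destination_ranges:
        raise ValueError("at least one destination range is required")
    # ip_network raises ValueError for malformed CIDRs or set host bits.
    cidrs = [str(ipaddress.ip_network(r)) for r in destination_ranges]
    tag_list = ",".join(tags or ["dataflow"])
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--action=allow",
        "--direction=egress",
        f"--target-tags={tag_list}",
        f"--destination-ranges={','.join(cidrs)}",
        f"--priority={priority}",
        "--rules", "tcp:12345-12346",
    ]


# Placeholder names; the range stands in for the subnetwork's primary range.
cmd = egress_rule_command("allow-dataflow-egress", "my-network", ["10.128.0.0/20"])
print(shlex.join(cmd))
```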

For specific TCP ports used by Dataflow, you can view the project container manifest. The container manifest explicitly specifies the ports in order to map host ports into the container.

SSH access to worker VMs

Dataflow does not require SSH; however, SSH is useful for troubleshooting.

If your worker VM has an external IP address, you can connect to the VM either through the Google Cloud console or by using the Google Cloud CLI. To connect using SSH, you must have a firewall rule that allows incoming connections on TCP port 22 from at least the IP address of the system on which you're running gcloud or the system running the web browser you use to access the Google Cloud console.
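
A rule of the kind described above might be assembled as follows. This is a hedged sketch: the function name ssh_rule_command, the rule and network names, and the address 203.0.113.7 (from a range reserved for documentation) are placeholders, and restricting the rule to VMs with the dataflow tag is an assumption of this example rather than a requirement.

```python
import ipaddress
import shlex


def ssh_rule_command(rule_name, network, client_ip, tags=None):
    """Build gcloud arguments for a rule that allows SSH from one address."""
    # A bare address is treated as a single-host network (/32 for IPv4).
    source = ipaddress.ip_network(client_ip)
    tag_list = ",".join(tags or ["dataflow"])
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--action=allow",
        "--direction=ingress",
        f"--target-tags={tag_list}",
        f"--source-ranges={source}",
        "--rules", "tcp:22",
    ]


# Placeholder names and address; prints the command without running it.
print(shlex.join(ssh_rule_command("allow-ssh-dataflow", "my-network", "203.0.113.7")))
```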

You can view network configuration and activity by opening an SSH session on one of your workers and running commands from the iproute2 suite, such as ip. For more information, see the iproute2 page.

If you need to connect to a worker VM that only has an internal IP address, see Choose a connection option for internal-only VMs.