This document explains how to configure Dataflow virtual machine (VM) instances for internet access, create network tags, and define firewall rules for the network associated with your Dataflow jobs.
This document assumes that you have basic knowledge of Google Cloud networks. To define a network for your Dataflow job, see Specify your network and subnetwork. For more information about how to troubleshoot networking issues, see Troubleshoot Dataflow networking issues.
Access to Google Cloud APIs for Dataflow
Dataflow worker virtual machines (VMs) must reach Google Cloud APIs and services. The set of dependent Google Cloud endpoints might change over time, but they all support VPC Service Controls. Use one of the following methods to configure access to Google Cloud APIs:
Configure Private Google Access. With Private Google Access, VMs that have only internal IP addresses can access IP addresses for Google Cloud APIs and services. For an example of enabling this setting, see the sketch after this list.
Configure a Private Service Connect endpoint IP address to access Google Cloud APIs and services.
Configure worker VMs with external IP addresses.
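As a minimal sketch, you might enable Private Google Access on an existing subnetwork by using the gcloud CLI; SUBNETWORK-NAME and REGION are placeholders for your own values:

# Enable Private Google Access on an existing subnetwork so that
# VMs with only internal IP addresses can reach Google Cloud APIs.
gcloud compute networks subnets update SUBNETWORK-NAME \
    --region=REGION \
    --enable-private-ip-google-access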
By default, firewall rules and DNS configurations allow access to Google Cloud APIs. However, you might be actively restricting access to a subset of Google Cloud APIs, such as if you're using VPC Service Controls. In this case, at a minimum, provide access to restricted.googleapis.com. If you're using Private Service Connect, provide access to the vpc-sc bundle. Providing access to more permissive domains, such as private.googleapis.com, also provides the required functionality.
To allow access to the necessary Google Cloud APIs through a given domain, your environment must meet the following requirements:
- Firewall rules must allow egress to all address ranges under the chosen domain.
- DNS must resolve *.googleapis.com to your chosen domain.

For example, if your firewall rules limit egress to the restricted.googleapis.com address range, then *.googleapis.com must resolve to addresses within that range. For more information, see Configure DNS for googleapis.com. Similarly, if you're using Private Service Connect, you must create DNS records for the googleapis.com default domain to ensure access to at least all services in the vpc-sc bundle.
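As an illustration, one way to meet the DNS requirement is a Cloud DNS private zone that resolves *.googleapis.com to the restricted.googleapis.com range (199.36.153.4/30). The zone name and NETWORK are placeholder values in this sketch:

# Create a private DNS zone for googleapis.com that is visible to NETWORK.
gcloud dns managed-zones create googleapis-private-zone \
    --description="Private zone for googleapis.com" \
    --dns-name=googleapis.com. \
    --visibility=private \
    --networks=NETWORK

# Point restricted.googleapis.com at its documented address range.
gcloud dns record-sets create restricted.googleapis.com. \
    --zone=googleapis-private-zone \
    --type=A \
    --ttl=300 \
    --rrdatas=199.36.153.4,199.36.153.5,199.36.153.6,199.36.153.7

# Resolve all other googleapis.com names to the restricted domain.
gcloud dns record-sets create "*.googleapis.com." \
    --zone=googleapis-private-zone \
    --type=CNAME \
    --ttl=300 \
    --rrdatas=restricted.googleapis.com.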
Internet access for Dataflow
Depending on your use case, your VMs might also need access to resources outside Google Cloud. Use one of the following methods to configure internet access for Dataflow:
Configure worker VMs with an external IP address so that they meet the internet access requirements.
Configure a NAT solution, such as Cloud NAT; a minimal setup sketch follows this list. Use this option to run jobs that access APIs and services outside of Google Cloud that require internet access. For example, Python SDK jobs might require access to the Python Package Index (PyPI) to download pipeline dependencies. In this case, you must either configure worker VMs with external IP addresses or use Cloud NAT. Alternatively, you can supply Python pipeline dependencies during job submission, for example by using custom containers, which removes the need to access PyPI at runtime.
For more information, see Managing Python Pipeline Dependencies in the Apache Beam documentation.
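A minimal Cloud NAT setup might look like the following sketch; the router and gateway names are placeholders, and NETWORK and REGION are your own values:

# Create a Cloud Router in the region where the Dataflow workers run.
gcloud compute routers create dataflow-nat-router \
    --network=NETWORK \
    --region=REGION

# Create a NAT gateway so that workers without external IP addresses
# can reach the internet, for example to download packages from PyPI.
gcloud compute routers nats create dataflow-nat-gateway \
    --router=dataflow-nat-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges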
Turn off external IP address
By default, the Dataflow service assigns workers both external and internal IP addresses. When you turn off external IP addresses, the Dataflow pipeline can access resources only in the following places:
- Another instance in the same VPC network
- A Shared VPC network
- A network with VPC Network Peering enabled
Without external IP addresses, you can still perform administrative and monitoring tasks. You can reach your workers over SSH from resources in the preceding list. However, the pipeline cannot access the internet, and internet hosts cannot access your Dataflow workers.
Not using external IP addresses helps you better secure your data processing infrastructure. You also lower the number of external IP addresses that you consume against your Google Cloud project quota.
If you turn off external IP addresses, your Dataflow jobs cannot access APIs and services outside of Google Cloud that require internet access.
For information on setting up internet access for jobs with internal IP addresses, read the previous section.
To turn off external IP addresses, do one of the following:
Java
- Enable Private Google Access for your network or subnetwork.
- In the parameters of your Dataflow job, specify --usePublicIps=false and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME, as shown in the example after these steps. Depending on your choice, replace one of the following:
  - NETWORK-NAME: the name of your Compute Engine network
  - SUBNETWORK-NAME: the name of your Compute Engine subnetwork
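For example, a Maven-based launch might pass these options as follows; com.example.MyPipeline, PROJECT_ID, and REGION are placeholders, not values from these instructions:

# Launch a Java pipeline without external worker IP addresses.
mvn compile exec:java \
    -Dexec.mainClass=com.example.MyPipeline \
    -Dexec.args="--runner=DataflowRunner \
        --project=PROJECT_ID \
        --region=REGION \
        --usePublicIps=false \
        --subnetwork=regions/REGION/subnetworks/SUBNETWORK-NAME"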
Python
- To stage all Python package dependencies, follow the Apache Beam pipeline dependencies instructions.
- Enable Private Google Access for your network or subnetwork.
- In the parameters of your Dataflow job, specify --no_use_public_ips and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME, as shown in the example after these steps. Depending on your choice, replace one of the following:
  - NETWORK-NAME: the name of your Compute Engine network
  - SUBNETWORK-NAME: the name of your Compute Engine subnetwork
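For example, assuming a hypothetical script named my_pipeline.py:

# Launch a Python pipeline without external worker IP addresses.
python my_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=REGION \
    --no_use_public_ips \
    --subnetwork=regions/REGION/subnetworks/SUBNETWORK-NAME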
Go
- Enable Private Google Access for your network or subnetwork.
- In the parameters of your Dataflow job, specify --no_use_public_ips and --network=NETWORK-NAME or --subnetwork=SUBNETWORK-NAME, as shown in the example after these steps. Depending on your choice, replace one of the following:
  - NETWORK-NAME: the name of your Compute Engine network
  - SUBNETWORK-NAME: the name of your Compute Engine subnetwork
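For example, assuming your pipeline's entry point is a hypothetical main.go:

# Launch a Go pipeline without external worker IP addresses.
go run main.go \
    --runner=dataflow \
    --project=PROJECT_ID \
    --region=REGION \
    --no_use_public_ips \
    --subnetwork=regions/REGION/subnetworks/SUBNETWORK-NAME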
Network tags for Dataflow
Network tags are text attributes that you can attach to Compute Engine VMs. Network tags let you make VPC network firewall rules and certain custom static routes applicable to specific VM instances. Dataflow supports adding network tags to all worker VMs that run a particular Dataflow job.
Even if you do not use the network parameter, Dataflow always adds the default network tag dataflow to every worker VM it creates.
Enable network tags
You can specify the network tags only when you run the Dataflow job template to create a job. After a job starts, you cannot add more network tags to the job. To apply additional network tags to a job, you must recreate your job template with the required network tags.
Add the following to your pipeline code, whether you run in Java or Python:
--experiments=use_network_tags=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;....
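Because semicolons are command separators in most shells, quote the option value when you pass it on a command line, for example:

# Quote the value so that the shell does not split it at the semicolons.
--experiments="use_network_tags=TAG-NAME-1;TAG-NAME-2"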
Even if you do not use this parameter, Dataflow always adds the network tag dataflow to every worker VM it creates.
Enable network tags for Flex Template VMs
When using Flex Templates, to enable network tags for Dataflow worker VMs, use the --additional-experiments option as shown in the following example:
--additional-experiments=use_network_tags=TAG-NAME
To enable network tags for both worker VMs and launcher VMs, you need to use the following two options:
--additional-experiments=use_network_tags=TAG-NAME
--additional-experiments=use_network_tags_for_flex_templates=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;....
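For example, a launch command with both options might look like the following sketch; JOB_NAME and the template location are placeholders. Because --additional-experiments accepts a comma-separated list, this sketch combines the two experiments in a single flag:

# Launch a Flex Template job that tags both worker and launcher VMs.
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://BUCKET/TEMPLATE_FILE \
    --region=REGION \
    --additional-experiments=use_network_tags=TAG-NAME,use_network_tags_for_flex_templates=TAG-NAME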
After you enable the network tags, the tags are parsed and attached to the VMs.
See the limits applicable for network tags.
Firewall rules for Dataflow
Firewall rules let you allow or deny traffic to and from your VMs. If your Dataflow jobs use Dataflow Shuffle or Streaming Engine, then you only need to ensure that firewall rules allow access to Google Cloud APIs. Otherwise, you must configure additional firewall rules so that Dataflow VMs can send and receive network traffic on TCP port 12345 for streaming jobs and on TCP port 12346 for batch jobs.
A project owner, editor, or security administrator must create the necessary firewall rules in the VPC network used by your Dataflow VMs.
Before you configure firewall rules for Dataflow, make sure that you understand how VPC firewall rules and hierarchical firewall policies work.
When you create firewall rules for Dataflow, specify the Dataflow network tags. Otherwise, the firewall rules apply to all VMs in the VPC network.
Where applicable, hierarchical firewall policies are evaluated first, and these rules preempt VPC firewall rules. If the Dataflow job is in a project that is part of a folder or organization where hierarchical firewall policies are used, the compute.orgFirewallPolicyAdmin role is required to make policy modifications.
If you did not create custom network tags when you ran the pipeline code, Dataflow VMs use the default dataflow tag. In the absence of custom network tags, create the firewall rules with the default dataflow tag.
If you created custom network tags when you ran the pipeline code, Dataflow VMs use those tags. Create the firewall rules with the custom tags.
Some VPC networks, such as the automatically created default network, include a default-allow-internal rule that meets the firewall requirement for Dataflow.
Example firewall ingress rule
The ingress firewall rule permits Dataflow VMs to receive packets from each other. You must always create the ingress allow firewall rules; otherwise, this traffic is blocked, even if egress rules allow it.
In the following example, a firewall ingress rule is created for Dataflow, where all worker VMs have the default network tag dataflow. A project owner, editor, or security admin can use the following gcloud command to create an ingress allow rule that permits traffic on TCP ports 12345 and 12346 from VMs with the network tag dataflow to other VMs with the same tag:
gcloud compute firewall-rules create FIREWALL_RULE_NAME_INGRESS \
--action=allow \
--direction=ingress \
--network=NETWORK \
--target-tags=CUSTOM_TAG \
--source-tags=CUSTOM_TAG \
--priority=PRIORITY_NUM \
--rules tcp:12345-12346
Replace the following:
- FIREWALL_RULE_NAME_INGRESS: a name for the firewall rule
- NETWORK: the name of the network that your worker VMs use
- CUSTOM_TAG: a comma-delimited list of network tags. The following is a list of guidelines for using network tags:
  - If you omit --target-tags, the rule applies to all VMs in the VPC network.
  - If you omit --source-tags and all other source specifications, traffic from any source is allowed.
  - If you have not specified custom network tags and you want the rule to be specific to Dataflow VMs, use dataflow as the network tag.
  - If you have specified custom network tags and you want the rule to be specific to Dataflow VMs, use your custom network tags.
- PRIORITY_NUM: the priority of the firewall rule. Lower numbers have higher priorities, and 0 is the highest priority.
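For example, with the default dataflow tag, an assumed network named my-network, and a priority of 1000, the filled-in command might look like this:

# Allow Dataflow workers tagged "dataflow" to receive traffic from
# each other on TCP ports 12345 and 12346.
gcloud compute firewall-rules create allow-dataflow-internal-ingress \
    --action=allow \
    --direction=ingress \
    --network=my-network \
    --target-tags=dataflow \
    --source-tags=dataflow \
    --priority=1000 \
    --rules tcp:12345-12346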
Example firewall egress rule
The egress firewall rule permits Dataflow VMs to send packets to each other. If you've created any egress deny firewall rules, you might need to create custom egress allow firewall rules in your VPC network.
In this example, a firewall egress rule is created for Dataflow, where all worker VMs have the default network tag of dataflow. A project owner, editor, or security admin can use the following gcloud command to create an egress allow rule that permits traffic from TCP ports 12345 and 12346 on VMs with the network tag dataflow to other VMs with the same tag:
gcloud compute firewall-rules create FIREWALL_RULE_NAME_EGRESS \
--network=NETWORK \
--action=allow \
--direction=egress \
--target-tags=CUSTOM_TAG \
--source-tags=CUSTOM_TAG \
--destination-ranges=DESTINATION-RANGES \
--priority=PRIORITY_NUM \
--rules tcp:12345-12346
Replace the following:
- FIREWALL_RULE_NAME_EGRESS: a name for the firewall rule
- NETWORK: the name of the network that your worker VMs use
- CUSTOM_TAG: a comma-delimited list of network tags. The following is a list of guidelines for using network tags:
  - If you omit --target-tags, the rule applies to all VMs in the VPC network.
  - If you omit --source-tags and all other source specifications, traffic from any source is allowed.
  - If you have not specified custom network tags and you want the rule to be specific to Dataflow VMs, use dataflow as the network tag.
  - If you have specified custom network tags and you want the rule to be specific to Dataflow VMs, use your custom network tags.
- DESTINATION-RANGES: a comma-delimited list of CIDRs. Include the selected subnetwork's primary IP address range.
- PRIORITY_NUM: the priority of the firewall rule. Lower numbers have higher priorities, and 0 is the highest priority.
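After you create the rules, you can confirm that they exist in the expected network; my-network is a placeholder here:

# List the firewall rules attached to the network that the Dataflow VMs use.
gcloud compute firewall-rules list \
    --filter="network:my-network"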
For specific TCP ports used by Dataflow, you can view the project container manifest. The container manifest explicitly specifies the ports in order to map host ports into the container.
SSH access to worker VMs
Dataflow does not require SSH; however, SSH is useful for troubleshooting.
If your worker VM has an external IP address, you can connect to the VM through the Google Cloud console or by using the Google Cloud CLI. To connect using SSH, you must have a firewall rule that allows incoming connections on TCP port 22 from at least the IP address of the system on which you're running gcloud or the system running the web browser that you use to access the Google Cloud console.
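As an illustration, the following sketch creates such a rule for VMs with the dataflow tag and then opens an SSH session; my-network, the source address, WORKER_VM_NAME, and ZONE are placeholders:

# Allow SSH to Dataflow workers from a single administrative IP address.
gcloud compute firewall-rules create allow-ssh-to-dataflow \
    --network=my-network \
    --action=allow \
    --direction=ingress \
    --target-tags=dataflow \
    --source-ranges=203.0.113.10/32 \
    --rules tcp:22

# Connect to a worker VM.
gcloud compute ssh WORKER_VM_NAME --zone=ZONE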
You can view network configuration and activity by opening an SSH session on one of your workers and running iproute2. For more information, see the iproute2 page.
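For example, inside the SSH session, the following iproute2 commands show the worker's interfaces and routes:

# Show the network interfaces and their IP addresses.
ip addr show

# Show the routing table that the worker VM uses.
ip route show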
If you need to connect to a worker VM that only has an internal IP address, see Choose a connection option for internal-only VMs.
What's next
- Learn about Connectivity Tests, a diagnostics tool that lets you check connectivity between network endpoints.
- Create and run Connectivity Tests.