Configuring Internet Access and Firewall Rules

This page explains how to provide routes and define your Google Cloud Platform (GCP) firewall rules for the network associated with your Cloud Dataflow jobs.

Note: The default network has configurations that allow Cloud Dataflow jobs to run. However, other services might also use this network. Make sure that any changes you make to the default network are compatible with all of your services. Alternatively, create a separate network for Cloud Dataflow.

Internet access for Cloud Dataflow

Cloud Dataflow worker virtual machines (VMs) need to be able to reach GCP APIs and services. You can either configure worker VMs with an external IP address so that they meet the internet access requirements, or you can use Private Google Access.

With Private Google Access, VMs that have only internal IP addresses can reach a select set of public IP addresses for GCP APIs and services. Read Configuring Private Google Access for information about the routing and firewall rule requirements and the configuration steps.
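As a rough sketch of the configuration step, Private Google Access is turned on per subnet. The command below assumes that your worker VMs run in an existing subnet; [SUBNET_NAME] and [REGION] are placeholders that do not come from this page. It does not cover the routing and firewall requirements, which are described in Configuring Private Google Access.

```shell
# Enable Private Google Access on the subnet used by the worker VMs.
# [SUBNET_NAME] and [REGION] are placeholders; substitute your own values.
gcloud compute networks subnets update [SUBNET_NAME] \
    --region [REGION] \
    --enable-private-ip-google-access
```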

Jobs that access APIs and services outside of GCP require internet access. For example, Python SDK jobs need access to the Python Package Index (PyPI). In this case, you must either configure worker VMs with external IP addresses or route their internet traffic through a NAT instance. Read Managing Python Pipeline Dependencies on the Apache Beam website for more details.

Firewall rules

Firewall rules let you allow or deny traffic to and from your VMs. This page assumes familiarity with how GCP firewall rules work as described on the Firewall Rules Overview and Using Firewall Rules pages, including the implied firewall rules.

Firewall rules required by Cloud Dataflow

Cloud Dataflow requires that worker VMs communicate with one another using specific TCP ports within the VPC network that you specify in your pipeline options. You need to configure firewall rules in your VPC network to allow this type of communication.
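For illustration, the following sketch shows one way the network might be specified in pipeline options when launching a job with the Apache Beam Python SDK. It assumes the SDK's bundled wordcount example and the --network and --subnetwork options; every bracketed value is a placeholder that does not come from this page, and other SDKs spell these options differently.

```shell
# Launch the Beam wordcount example on Cloud Dataflow in a specific network.
# All bracketed values are placeholders; substitute your own.
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project [PROJECT_ID] \
    --region [REGION] \
    --temp_location gs://[BUCKET]/tmp \
    --output gs://[BUCKET]/output \
    --network [NETWORK] \
    --subnetwork regions/[REGION]/subnetworks/[SUBNET_NAME]
```

The firewall rules described in this section must exist in the network named here, because that is where the worker VMs are created.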

Some VPC networks, like the automatically created default network, include a default-allow-internal rule that meets the firewall requirement for Cloud Dataflow.
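If you are unsure whether your project still has this rule, you can inspect it before creating a new one. This is a minimal check that assumes the rule kept its original name, default-allow-internal; if it was deleted or renamed, the command returns an error.

```shell
# Show the rule's network, source ranges, and allowed protocols and ports.
gcloud compute firewall-rules describe default-allow-internal
```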

Because all worker VMs have a network tag with the value dataflow, you can create a more specific firewall rule for Cloud Dataflow. A project owner, editor, or Security Admin can use the following gcloud command to create an ingress allow rule that permits traffic on all TCP ports from VMs with the network tag dataflow to VMs with the same tag:

gcloud compute firewall-rules create [FIREWALL_RULE_NAME] \
    --network [NETWORK] \
    --action allow \
    --direction ingress \
    --target-tags dataflow \
    --source-tags dataflow \
    --priority 0 \
    --rules tcp:1-65535

In the above example, replace [FIREWALL_RULE_NAME] with a name for the firewall rule and replace [NETWORK] with the name of the network that your worker VMs use.

For further guidance about firewall rules, refer to Using Firewall Rules. For specific TCP ports used by Cloud Dataflow, you can view the project container manifest. The container manifest explicitly specifies the ports in order to map host ports into the container.

SSH access to worker VMs

Cloud Dataflow does not require SSH; however, SSH is useful for troubleshooting.

If your worker VM has an external IP address, you can connect to the VM either through the GCP Console or by using the gcloud command-line tool. To connect using SSH, you must have a firewall rule that allows incoming connections on TCP port 22 from at least the IP address of the system on which you're running gcloud or of the system running the web browser that you use to access the GCP Console.
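The following sketch shows what such a rule and connection might look like when you connect with gcloud from a single workstation. It assumes that restricting the rule to VMs tagged dataflow is acceptable for your setup; [FIREWALL_RULE_NAME], [NETWORK], [YOUR_IP_ADDRESS], [WORKER_VM_NAME], and [ZONE] are placeholders.

```shell
# Allow SSH (TCP port 22) to Cloud Dataflow workers from one source address.
# All bracketed values are placeholders; substitute your own.
gcloud compute firewall-rules create [FIREWALL_RULE_NAME] \
    --network [NETWORK] \
    --action allow \
    --direction ingress \
    --target-tags dataflow \
    --source-ranges [YOUR_IP_ADDRESS]/32 \
    --rules tcp:22

# Connect to a worker VM once the rule is in place.
gcloud compute ssh [WORKER_VM_NAME] --zone [ZONE]
```

Limiting --source-ranges to a single /32 address keeps port 22 closed to the rest of the internet.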

If you need to connect to a worker VM that only has an internal IP address, see Connecting to instances that do not have external IP addresses.
