Overview
The Compute Engine Virtual Machine instances in a Dataproc cluster, consisting of master and worker VMs, must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports).
Firewall rule requirement
Dataproc requires that you create an ingress allow firewall rule with the following characteristics:
The source for the rule must include the cluster's VMs. You can define a source using IP address ranges or you can identify VMs using source tags or source service accounts. If you omit a source specification, the firewall rule will use the range 0.0.0.0/0 (any IP address) as the source. If your Dataproc VMs have external IP addresses, this means they can accept traffic from anywhere on the Internet. Consequently, you should define the source to be as narrow as possible to meet your needs and secure your cluster.
The target for the rule must identify the cluster's VMs. The target can be all VMs in the VPC network, or you can identify specific target VMs using target tags or target service accounts.
The rule must include the following protocols and ports: TCP (all ports, 0 through 65535), UDP (all ports, 0 through 65535), and ICMP.
For example, the following command omits a source specification, so the resulting rule allows TCP traffic on all ports from 0.0.0.0/0:
gcloud compute firewall-rules create my-subnet-firewall-rule --allow tcp
It's best to create the ingress allow firewall rule with a specific source range, or to identify Google Cloud VMs by network tag or service account. Refer to the firewall rules overview for additional information.
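The three characteristics above (an explicit source, a target that identifies the cluster's VMs, and all of TCP, UDP, and ICMP) can be sketched as a small Python helper that assembles the arguments for gcloud compute firewall-rules create. This is an illustration only, not the documented procedure: the function name, rule name, network name, and tag are hypothetical placeholders, and the helper prints the command rather than running it.

```python
import shlex

# Protocols and ports listed in the rule requirement above.
REQUIRED_RULES = "tcp:0-65535,udp:0-65535,icmp"


def firewall_rule_args(rule_name, network, source_ranges, target_tags=None):
    """Build (but do not run) a gcloud command for an ingress allow rule."""
    if not source_ranges:
        # Omitting the source would make the rule default to 0.0.0.0/0.
        raise ValueError("an explicit source range is required")
    args = [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=INGRESS",
        "--action=ALLOW",
        f"--rules={REQUIRED_RULES}",
        f"--source-ranges={','.join(source_ranges)}",
    ]
    if target_tags:
        # Without target tags, the rule applies to all VMs in the network.
        args.append(f"--target-tags={','.join(target_tags)}")
    return args


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(shlex.join(firewall_rule_args(
        "dataproc-internal", "my-vpc", ["10.128.0.0/9"], ["dataproc-node"])))
```

Refusing an empty source list is a design choice of this sketch: it keeps a rule from silently falling back to 0.0.0.0/0.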
How to set the source IP range
You can set the source IP range when you create a firewall rule from the Google Cloud Console or using the gcloud command-line tool.
Console
Use the Cloud Console Create a firewall rule page to create a firewall rule with a specified source IP range.
gcloud command
Use the gcloud compute firewall-rules create command to create a firewall rule with a specified source IP range.
gcloud compute firewall-rules create "tcp-rule" --allow tcp:80 \
    --source-ranges="10.0.0.0/22,10.0.0.0/14" \
    --description="Narrowing TCP traffic"
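In the example command, 10.0.0.0/22 lies inside 10.0.0.0/14, so the second range already covers the first. If you maintain a list of source ranges, a check along the following lines can reduce it to the smallest equivalent set before you pass it to --source-ranges. This is a minimal sketch using Python's standard ipaddress module; the function name is hypothetical.

```python
import ipaddress


def minimal_source_ranges(ranges):
    """Collapse overlapping or adjacent CIDR ranges into an equivalent list."""
    # ip_network() raises ValueError for malformed input or set host bits.
    networks = [ipaddress.ip_network(r) for r in ranges]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


if __name__ == "__main__":
    collapsed = minimal_source_ranges(["10.0.0.0/22", "10.0.0.0/14"])
    print(",".join(collapsed))
```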
The Google Cloud Console VM instances→Network details page shows the applicable firewall rules for instances in a Dataproc cluster. If you use the default network, it includes a pre-populated default-allow-internal firewall rule that allows ingress from the 10.128.0.0/9 source range. If you delete this pre-populated firewall rule or use a VPC network other than the default network, ingress traffic is blocked by the implied deny ingress rule. In those situations, you must create an ingress allow firewall rule that permits traffic to all TCP and UDP ports of instances in the cluster.
Your Network or Security Administrator can refer to the firewall rules overview for more information.
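To see whether a given VM's internal address would be matched by the default-allow-internal source range, you can test it against 10.128.0.0/9. The following is a minimal sketch with Python's standard ipaddress module; the function name and the sample addresses are hypothetical.

```python
import ipaddress

# Source range of the pre-populated default-allow-internal rule.
DEFAULT_ALLOW_INTERNAL = ipaddress.ip_network("10.128.0.0/9")


def covered_by_default_rule(address):
    """Return True if the address falls inside 10.128.0.0/9."""
    return ipaddress.ip_address(address) in DEFAULT_ALLOW_INTERNAL


if __name__ == "__main__":
    for addr in ("10.128.0.2", "10.0.0.2"):
        print(addr, covered_by_default_rule(addr))
```

Note that 10.128.0.0/9 spans 10.128.0.0 through 10.255.255.255, so an address such as 10.0.0.2 is not covered.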

Create a VPC network
You can specify your own Virtual Private Cloud (VPC) network when you create a Dataproc cluster. To do this, you must first create a VPC network with firewall rules. Then, when you create the cluster, you associate your network with the cluster.
Creating a VPC network
You can create a VPC network from the Cloud Console or using the gcloud compute networks create command. You can create an auto mode VPC network or a custom mode VPC network (called "auto" and "custom" networks, respectively, below). An auto network is automatically configured with subnets in each Compute Engine region. Custom networks are not automatically configured with subnets; you must create one or more subnets in one or more Compute Engine regions when you create the custom network. For more information, see Types of VPC Networks.
Let's look at the options available when you create an auto and custom network from the Cloud Console.
Auto
The Cloud Console screenshot, below, shows the Cloud Console fields that are populated for the Automatic creation of subnetworks (an auto mode VPC network). You must select one or more firewall rules. The network-name-allow-internal rule, which opens udp:0-65535;tcp:0-65535;icmp ports, should be selected to enable full internal IP networking access among VM instances in the network. You can also select the network-name-allow-ssh rule to open standard SSH port 22 to allow SSH connections to the network.

Custom
If you choose Custom subnetworks when creating a network (a custom mode VPC network), you must specify the region and private IP address range for each subnetwork. To enable full internal access among VMs in the network, you can specify an IP address range of 10.0.0.0/8 (or a more restrictive range if appropriate, such as 10.128.0.0/16) and open udp:0-65535;tcp:0-65535;icmp ports (as shown in the Cloud Console screenshot below).
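When you narrow the range from 10.0.0.0/8 to something more restrictive, it is worth confirming that every subnetwork you defined still falls inside it. The following minimal sketch lists the subnet ranges that a chosen range would leave out; the function name and sample ranges are hypothetical, and the check uses only Python's standard ipaddress module (subnet_of requires Python 3.7 or later).

```python
import ipaddress


def uncovered_subnets(allowed_range, subnet_ranges):
    """Return the subnet ranges that are not contained in allowed_range."""
    allowed = ipaddress.ip_network(allowed_range)
    return [
        r for r in subnet_ranges
        if not ipaddress.ip_network(r).subnet_of(allowed)
    ]


if __name__ == "__main__":
    subnets = ["10.128.0.0/20", "10.1.0.0/20"]
    print(uncovered_subnets("10.0.0.0/8", subnets))
    print(uncovered_subnets("10.128.0.0/16", subnets))
```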

Creating a cluster that uses your VPC network
gcloud command
You can use the Cloud SDK gcloud dataproc clusters create command with the --network or --subnet flag to create a cluster that will use an auto or custom subnetwork.
Using the --network flag
You can use the --network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster will be created.
gcloud dataproc clusters create my-cluster \
    --network network-name \
    --region=region \
    ... other args ...
For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the --network flag (--network auto-net-name) to create a cluster that will use the auto subnetwork in the cluster's region.
Using the --subnet flag
You can use the --subnet flag to create a cluster that will use an auto or custom subnetwork in the region where the cluster will be created. You must pass the full resource path of the subnet your cluster will use to the --subnet flag.
gcloud dataproc clusters create cluster-name \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...
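The full resource path follows the pattern shown in the command above. If you generate cluster commands from a script, a helper along these lines can assemble the path from its three parts. This is a minimal sketch; the function name is hypothetical and the sample values are placeholders.

```python
def subnet_resource_path(project_id, region, subnetwork_name):
    """Assemble the full subnet resource path expected by --subnet."""
    parts = {
        "project_id": project_id,
        "region": region,
        "subnetwork_name": subnetwork_name,
    }
    for label, value in parts.items():
        # Each part is a single path segment, so it must be non-empty
        # and must not contain a slash.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return (
        f"projects/{project_id}/regions/{region}"
        f"/subnetworks/{subnetwork_name}"
    )


if __name__ == "__main__":
    print(subnet_resource_path("my-project-id", "us-central1", "custom-subnet-1"))
```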
REST API
You can specify either the networkUri or subnetworkUri GceClusterConfig field as part of a clusters.create request.
Example
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b"
    },
    ...
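The network-related part of such a request body can be sketched in Python as follows. The function name is hypothetical, only the fields relevant to networking are included (the example above is truncated, and the elided fields are not reconstructed here), and the check that exactly one of networkUri or subnetworkUri is set is this sketch's reading of "either", not a statement of how the API validates requests.

```python
import json


def cluster_request_body(project_id, cluster_name, zone_uri,
                         network_uri=None, subnetwork_uri=None):
    """Build a partial clusters.create body with one network field set."""
    if (network_uri is None) == (subnetwork_uri is None):
        raise ValueError("set exactly one of network_uri or subnetwork_uri")
    gce_config = {"zoneUri": zone_uri}
    if network_uri is not None:
        gce_config["networkUri"] = network_uri
    else:
        gce_config["subnetworkUri"] = subnetwork_uri
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {"gceClusterConfig": gce_config},
    }


if __name__ == "__main__":
    body = cluster_request_body(
        "my-project-id", "example-cluster", "us-central1-b",
        subnetwork_uri="custom-subnet-1")
    print(json.dumps(body, indent=2))
```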
Console
After creating a VPC network with firewall rules that allow VMs full access over the network's private IP address range, you can create a cluster from the Dataproc Create a cluster page in the Cloud Console. Select your primary network in the Network configuration section on the Customize cluster panel. After you choose the network, the Subnetwork selector displays the subnetwork(s) available in the region you have selected for the creation of the cluster.
Creating a cluster that uses a VPC network in another project
A Dataproc cluster can use a Shared VPC network by participating as a service project. With Shared VPC, the Shared VPC network is defined in a different project, which is called the host project. The host project is made available for use by IAM members in attached service projects. See Shared VPC Overview for background information.
You will create your Dataproc cluster in a project. In the Shared VPC scenario, this project will be a service project. You will need to reference the project number of this project. Here's one way to find the project number:
Navigate to the IAM & admin page Settings tab.
From the project drop-down list at the top of the page, select the project you will use to create the Dataproc cluster.
Note the project number.
An IAM member who is a Shared VPC Admin must perform the following steps. See directions for setting up Shared VPC for background information.
Make sure that the Shared VPC host project has been enabled.
Attach the Dataproc project to the host project.
Configure either or both of the following service accounts to have the Network User role for the host project. Dataproc will attempt to use the first service account, falling back to the Google APIs service account if required.
The Dataproc service agent service account: service-[project-number]@dataproc-accounts.iam.gserviceaccount.com
The Google APIs service account: [project-number]@cloudservices.gserviceaccount.com
Navigate to the IAM tab of the IAM & admin page.
Use the project drop-down list at the top of the page to select the host project.
Click ADD. Repeat these steps to add both service accounts:
Add the service account to the Members field.
From the Roles menu, select Compute Engine > Compute Network User.
Click Add.
Once both service accounts have the Network User role for the host project, you can create a cluster that uses your VPC network.
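Both service account addresses are derived from the service project's project number, so they can be generated rather than typed by hand. The following minimal sketch builds the two member strings and an IAM-style binding for the Network User role. The function names are hypothetical, and the role ID roles/compute.networkUser is the identifier for the Compute Network User role, added here for illustration rather than taken from the steps above.

```python
NETWORK_USER_ROLE = "roles/compute.networkUser"


def network_user_members(project_number):
    """Return the two service accounts, in the order Dataproc tries them."""
    number = str(project_number)
    if not number.isdigit():
        # The project number is numeric; it is not the project ID.
        raise ValueError(f"not a project number: {project_number!r}")
    return [
        f"serviceAccount:service-{number}@dataproc-accounts.iam.gserviceaccount.com",
        f"serviceAccount:{number}@cloudservices.gserviceaccount.com",
    ]


def network_user_binding(project_number):
    """Return a binding granting Network User to both accounts."""
    return {
        "role": NETWORK_USER_ROLE,
        "members": network_user_members(project_number),
    }


if __name__ == "__main__":
    # 123456789012 is a made-up project number.
    print(network_user_binding(123456789012))
```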
Create a Dataproc cluster with internal IP addresses only
You can create a Dataproc cluster that is isolated from the public internet and whose VM instances communicate over a private IP subnetwork (the VM instances will not have public IP addresses). To do this, the subnetwork of the cluster must have Private Google Access enabled so that cluster nodes can access Google APIs and services, such as Cloud Storage, from internal IPs.
gcloud command
You can create a Dataproc cluster with internal IP addresses only by using the gcloud dataproc clusters create command with the --no-address flag.
Using the --no-address and --network flags
Use the --no-address flag with the --network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster will be created.
gcloud dataproc clusters create my-cluster \
    --no-address \
    --network network-name \
    --region=region \
    ... other args ...
For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the --network flag (--network auto-net-name) to create a cluster that will use the auto subnetwork in the cluster's region.
Using the --no-address and --subnet flags
Use the --no-address flag with the --subnet flag to create a cluster that will use an auto or custom subnetwork in the region where the cluster will be created. You must pass the full resource path of the subnet your cluster will use to the --subnet flag.
gcloud dataproc clusters create cluster-name \
    --no-address \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...
REST API
You can set the GceClusterConfig internalIpOnly field to true as part of a clusters.create request to enable internal IP addresses only.
Example
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b",
      "internalIpOnly": true
    },
    ...
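The example request uses the JSON boolean true for internalIpOnly rather than a quoted string. If you build request bodies in code, a check like the following can confirm that the field is a real boolean before you send the request. This is a minimal sketch with a hypothetical function name; whether the API also accepts the string form is not something this page states, so the sketch simply matches the form shown in the example.

```python
import json


def internal_ip_only_enabled(body):
    """Return True only if internalIpOnly is the boolean True."""
    gce_config = body.get("config", {}).get("gceClusterConfig", {})
    return gce_config.get("internalIpOnly") is True


if __name__ == "__main__":
    body = {
        "projectId": "my-project-id",
        "clusterName": "example-cluster",
        "config": {
            "gceClusterConfig": {
                "subnetworkUri": "custom-subnet-1",
                "zoneUri": "us-central1-b",
                "internalIpOnly": True,
            }
        },
    }
    print(internal_ip_only_enabled(body))
    # json.dumps serializes Python True as the JSON literal true.
    print(json.dumps(body["config"]["gceClusterConfig"]))
```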
Console
You can create a Dataproc cluster with Private Google Access enabled from the Dataproc Create a cluster page in the Cloud Console. Click Internal IP only on the Customize cluster panel to enable this feature for your cluster.
Dataproc and VPC-SC networks
With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.
Note the following limitations and strategies when using VPC-SC networks with Dataproc clusters:
To install components outside the VPC-SC perimeter, create a Dataproc custom image that pre-installs the components, then create the cluster using the custom image.
See Dataproc special steps to protect using VPC Service Controls.