This page explains Dataproc cluster network configuration requirements and options.
Dataproc connectivity requirements
Your Dataproc cluster must be in a VPC network that meets route and firewall requirements to securely access Google APIs and other resources.
Route requirements
The Dataproc agent running on cluster VMs needs a route to the internet to access the Dataproc control API to get jobs and report status. When created, VPC networks contain a system-generated default route to the internet. Deleting the default route to the internet is not recommended; instead, use firewalls to control network access. Note that internal-ip-only clusters also require this default route to the internet to access Dataproc control APIs and other Google services, such as Cloud Storage, but their traffic does not leave Google data centers.
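To confirm that the default route to the internet is still present in your network, you can list routes whose next hop is the default internet gateway. This is a sketch; NETWORK_NAME is a placeholder, and the filter expression is one assumed way to match the route:
gcloud compute routes list \
    --filter="network:NETWORK_NAME AND nextHopGateway:default-internet-gateway"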
Firewall requirements
Dataproc cluster Virtual Machines (VMs) must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports) protocols.
The default VPC network's default-allow-internal firewall rule meets Dataproc cluster connectivity requirements: it allows ingress from the 10.128.0.0/9 source range, which covers all VMs on the VPC network, as follows:
Rule | Network | Direction | Priority | Source range | Protocols:Ports |
---|---|---|---|---|---|
default-allow-internal | default | ingress | 65534 | 10.128.0.0/9 | tcp:0-65535, udp:0-65535, icmp |
If you delete the default-allow-internal firewall rule, ingress traffic on the default network is blocked by the implied deny ingress rule. If you delete the default-allow-internal firewall rule or don't use the default VPC network, you must create your own rule that meets Dataproc connectivity requirements, and then apply it to your cluster's VPC network.
Best practice: Create an ingress firewall rule for your cluster VPC network that allows ingress connectivity only among cluster VMs by using a source IP range or by identifying cluster VMs by network tag or service account.
Create an ingress firewall rule
If you or your network or security administrator create an ingress firewall rule to apply to a Dataproc cluster VPC network, it must have the following characteristics:
The sources parameter specifies the sources for packets. All Dataproc cluster VMs must be able to communicate with each other. You can identify the VMs in the cluster by IP address range, source tags, or service accounts associated with the VMs.
The target for the rule must identify the cluster VMs. The target can be all VMs in the VPC network, or you can identify VMs by IP address range, target tag, or target service account.
The rule must include the following protocols and ports:
- TCP (all ports, 0 through 65535)
- UDP (all ports, 0 through 65535)
- ICMP
Dataproc uses services that run on multiple ports. Specifying all ports helps the services run successfully.
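As an illustration, a rule like the following meets these requirements by identifying cluster VMs with a network tag. The rule name and the dataproc-cluster tag are hypothetical placeholders; attach the tag to cluster VMs when you create the cluster (for example, with the gcloud dataproc clusters create ‑‑tags flag):
gcloud compute firewall-rules create allow-dataproc-internal \
    --network=NETWORK_NAME \
    --direction=INGRESS \
    --source-tags=dataproc-cluster \
    --target-tags=dataproc-cluster \
    --allow=tcp:0-65535,udp:0-65535,icmp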
Diagnose VPC firewall rules
To audit packets not processed by higher priority firewall rules, you can create two low priority (65534) deny firewall rules. Unlike the implied firewall rules, you can enable firewall rules logging on each of these low priority rules:
- An ingress deny rule (sources 0.0.0.0/0, all protocols, all targets in the VPC network)
- An egress deny rule (destinations 0.0.0.0/0, all protocols, all targets in the VPC network)
With these low priority rules and firewall rules logging, you can log packets not processed by higher priority, and potentially more specific, firewall rules. These two low priority rules also align with security best practices by implementing a "final drop packets" strategy.
Examine the firewall rules logs for these rules to determine if you need to create or amend higher priority rules to permit packets. For example, if packets sent between Dataproc cluster VMs are dropped, this can be a signal that your firewall rules need to be adjusted.
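As a sketch, the following commands create the two logging-enabled deny rules described above; the rule names are hypothetical placeholders:
gcloud compute firewall-rules create deny-all-ingress-log \
    --network=NETWORK_NAME \
    --direction=INGRESS \
    --action=DENY \
    --rules=all \
    --source-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging

gcloud compute firewall-rules create deny-all-egress-log \
    --network=NETWORK_NAME \
    --direction=EGRESS \
    --action=DENY \
    --rules=all \
    --destination-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging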
Create a VPC network
Instead of using the default VPC network, you can create your own auto mode or custom VPC network. When you create the cluster, you associate your network with the cluster.
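For example, you can create either network type with gcloud; the network and subnet names and the IP range are hypothetical placeholders:
# Auto mode network (a subnet is created automatically in each region):
gcloud compute networks create my-auto-network --subnet-mode=auto

# Custom mode network (you create the subnets yourself):
gcloud compute networks create my-custom-network --subnet-mode=custom
gcloud compute networks subnets create my-subnet \
    --network=my-custom-network \
    --region=REGION \
    --range=10.0.0.0/16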
Assured Workloads environment: When you use an Assured Workloads environment for regulatory compliance, the cluster, its VPC network, and its Cloud Storage buckets must be contained within the Assured Workloads environment.
Create a cluster that uses your VPC network
Console
Select your network in the Network configuration section on the Customize cluster panel. After you choose the network, the Subnetwork selector displays the subnetworks available in the region that you selected for the cluster.
Google Cloud CLI
Use gcloud dataproc clusters create with the ‑‑network or ‑‑subnet flag to create a cluster on a subnet in your network. If you use the ‑‑network flag, the cluster uses a subnetwork with the same name as the specified network in the region where the cluster is created.
--network example. Since auto mode networks are created with a subnet in each region, with each subnet given the network name, you can pass the auto mode VPC network name to the ‑‑network flag. The cluster will use the auto mode VPC subnetwork in the region specified with the ‑‑region flag.
gcloud dataproc clusters create CLUSTER_NAME \
    --network NETWORK_NAME \
    --region=REGION \
    ... other args ...
--subnet example. You can use the ‑‑subnet flag to create a cluster that uses an auto mode or custom VPC network subnet in the cluster region. Specify the full resource path of the subnet.
gcloud dataproc clusters create CLUSTER_NAME \
    --subnet projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --region=REGION \
    ... other args ...
REST API
You can specify either the networkUri or subnetworkUri GceClusterConfig field as part of a clusters.create request.
Example
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "SUBNET_NAME"
    },
    ...
Create a cluster that uses a VPC network in another project
A Dataproc cluster can use a Shared VPC network that is defined in a host project. The project where the Dataproc cluster is created is referred to as the service project.
Find the Dataproc cluster project number:
- Open the IAM & Admin Settings page in the Google Cloud console. Select the project where you will create the Dataproc cluster. Copy the project number.
A principal with the Shared VPC Admin role must perform the following steps. See directions for setting up Shared VPC for background information.
Make sure that the Shared VPC host project is enabled.
Attach the project with the Dataproc cluster to the host project.
Configure the Dataproc service agent service account (service-[project-number]@dataproc-accounts.iam.gserviceaccount.com) to have the Network User role for the host project:
Open the IAM & Admin page in the Google Cloud console.
Use the project selector to select the host project.
Click Grant Access.
Fill in the Grant Access form:
Add principals: Input the service account.
Assign roles: Enter "Compute Network" in the filter box, then select the Compute Network User role.
Click Save.
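Alternatively, a Shared VPC Admin can script these steps with gcloud; the project IDs and project number shown are hypothetical placeholders:
gcloud compute shared-vpc enable HOST_PROJECT_ID

gcloud compute shared-vpc associated-projects add SERVICE_PROJECT_ID \
    --host-project=HOST_PROJECT_ID

gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"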
After the service account has the Network User role for the host project, create a cluster that uses the Shared VPC network.
Create a cluster that uses a VPC subnetwork in another project
A Dataproc cluster can use a Shared VPC subnetwork that is defined in a host project. The project where the Dataproc cluster is created is referred to as the service project.
Find the Dataproc cluster project number:
- Open the IAM & Admin Settings page in the Google Cloud console. Select the project where you will create the Dataproc cluster. Copy the project number.
A principal with the Shared VPC Admin role must perform the following steps. See directions for setting up Shared VPC for background information.
Make sure that the Shared VPC host project is enabled.
Attach the project with the Dataproc cluster to the host project.
Configure the Dataproc service agent service account (service-[project-number]@dataproc-accounts.iam.gserviceaccount.com) to have the Network User role for the subnetwork:
Open the VPC networks page in the Google Cloud console.
Use the project selector to select the host project.
Click the network that contains the subnetwork that your Dataproc cluster will use.
In the VPC Network Details page, click the checkbox next to the name of the subnetwork that your cluster will use.
If the Info Panel is not open, click Show Info Panel.
Perform the following steps for each service account:
In the Info Panel, click Add Principal.
Fill in the Grant Access form:
Add principals: Input the service account.
Assign roles: Enter "Compute Network" in the filter box, then select the Compute Network User role.
Click Save.
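Alternatively, the subnet-level grant can be made with gcloud; the subnet name, region, host project ID, and project number are hypothetical placeholders:
gcloud compute networks subnets add-iam-policy-binding SUBNET_NAME \
    --project=HOST_PROJECT_ID \
    --region=REGION \
    --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"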
After the service account has the Network User role for the subnetwork, create a cluster that uses the Shared VPC subnetwork.
Create a Dataproc cluster with internal IP addresses only
You can create a Dataproc cluster that is isolated from the public internet, with VM instances that communicate over a private IP subnetwork (cluster VMs are not assigned public IP addresses). To do this, the subnetwork must have Private Google Access enabled to allow cluster nodes to access Google APIs and services, such as Cloud Storage, from internal IPs.
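If Private Google Access is not already enabled on the subnetwork, you can enable it with gcloud, for example:
gcloud compute networks subnets update SUBNET_NAME \
    --region=REGION \
    --enable-private-ip-google-access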
Console
You can create a Dataproc cluster with Private Google Access enabled from the Dataproc Create a cluster page in the Google Cloud console. Click Internal IP only on the Customize cluster panel to enable this feature for your cluster.
gcloud CLI
You can create a Dataproc cluster with internal IP addresses only by using the gcloud dataproc clusters create command with the ‑‑no-address flag.
Use the ‑‑no-address and ‑‑network flags:
Use the ‑‑no-address flag with the ‑‑network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster is created.
gcloud dataproc clusters create CLUSTER_NAME \
    --no-address \
    --network NETWORK_NAME \
    --region=REGION \
    ... other args ...
For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the ‑‑network flag to create a cluster that will use the auto subnetwork in the cluster's region.
Use the ‑‑no-address and ‑‑subnet flags:
Use the ‑‑no-address flag with the ‑‑subnet flag to create a cluster that will use an auto or custom subnetwork in the region where the cluster will be created. Pass the ‑‑subnet flag the full resource path of the subnet.
gcloud dataproc clusters create CLUSTER_NAME \
    --no-address \
    --subnet projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --region=REGION \
    ... other args ...
REST API
You can set the GceClusterConfig internalIpOnly field to true as part of a clusters.create request to enable internal IP addresses only.
Example:
POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b",
      "internalIpOnly": true
    },
    ...
Since, by default, internal-ip-only clusters don't have access to the internet, jobs that download dependencies from the internet, for example jobs that download Spark dependency packages from Maven Central, will fail. There are several workarounds to avoid the problem:
- Use Cloud NAT to enable cluster access to the internet (see the sketch after this list).
- Create a custom image that includes the dependencies (for example, Spark dependency packages in /usr/lib/spark/jars/).
- Upload the dependencies to a Cloud Storage bucket, then use an initialization action to download the dependencies from the bucket during cluster creation.
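For the Cloud NAT workaround, a minimal sketch with gcloud looks like the following; the router and NAT configuration names are hypothetical placeholders:
gcloud compute routers create nat-router \
    --network=NETWORK_NAME \
    --region=REGION

gcloud compute routers nats create nat-config \
    --router=nat-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges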
Dataproc and VPC Service Controls networks
With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.
Note the following limitations and strategies when using VPC Service Controls networks with Dataproc clusters:
To install components outside the VPC Service Controls perimeter, create a Dataproc custom image that pre-installs the components, then create the cluster using the custom image.
See the special steps for protecting Dataproc with VPC Service Controls.