Dataproc Cluster Network Configuration

Dataproc connectivity requirements

The Compute Engine Virtual Machine instances (VMs) in a Dataproc cluster, consisting of master and worker VMs, must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports) protocols.

The default VPC network's default-allow-internal firewall rule meets Dataproc cluster connectivity requirements: it allows ingress traffic from the 10.128.0.0/9 source range to all VMs on the VPC network, as follows:

Rule                    Network  Direction  Priority  Source range  Protocols:Ports
default-allow-internal  default  ingress    65534     10.128.0.0/9  tcp:0-65535, udp:0-65535, icmp

If you delete the default-allow-internal firewall rule, ingress traffic on the default network is blocked by the implied deny ingress rule. If you delete the default-allow-internal firewall rule or don't use the default VPC network, you must create your own rule that meets Dataproc connectivity requirements and apply it to your cluster's VPC network.

Create an ingress firewall rule

If you, or your network or security administrator, create an ingress firewall rule to apply to your cluster's VPC network, the rule must have the following characteristics:

  • The sources parameter specifies the sources for packets. In Dataproc, all the VMs in the cluster must be able to communicate with each other. You can identify the VMs in the cluster by IP address range, or by the source tags or service accounts associated with the VMs.

  • The target for the rule must identify the cluster's VMs. The target can be all VMs in the VPC network, or you can identify VMs using target tags or target service accounts.

  • The rule must include the following protocols and ports: TCP (all ports, 0 through 65535), UDP (all ports, 0 through 65535), and ICMP. Dataproc uses services that run on multiple ports; specifying all the ports will help the services run successfully.
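As a sketch, a rule that meets these requirements could be created as follows; the rule name, network name, and the cluster-vms tag are illustrative placeholders, and the tag must be attached to all of the cluster's VMs since it serves as both source and target:

```shell
# Ingress rule meeting Dataproc connectivity requirements: all TCP and UDP
# ports plus ICMP, between VMs carrying the (placeholder) cluster-vms tag.
gcloud compute firewall-rules create allow-dataproc-internal \
    --network=network-name \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:0-65535,udp:0-65535,icmp \
    --source-tags=cluster-vms \
    --target-tags=cluster-vms
```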

Diagnose your VPC firewall rules

To audit packets not processed by higher priority firewall rules, you can create two low priority (65534) deny firewall rules. Unlike the implied firewall rules, you can enable firewall rules logging on each of these low priority rules:

  1. An ingress deny rule (sources 0.0.0.0/0, all protocols, all targets in the VPC network)

  2. An egress deny rule (destinations 0.0.0.0/0, all protocols, all targets in the VPC network)

With these low priority rules and firewall rules logging, you can log packets not processed by higher priority (and potentially more specific) firewall rules. These two low priority rules also align with security best practices by implementing a "final drop packets" strategy.

Examine the firewall rules logs for these rules to determine if you need to create or amend higher priority rules to permit packets. For example, if packets sent between Dataproc cluster instances are dropped, this can be a signal that your firewall rules need to be adjusted.
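The two low priority deny rules described above could be created with commands along these lines (rule and network names are placeholders):

```shell
# Low priority (65534) ingress deny rule with Firewall Rules Logging enabled.
gcloud compute firewall-rules create deny-all-ingress \
    --network=network-name \
    --direction=INGRESS \
    --action=DENY \
    --rules=all \
    --source-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging

# Matching low priority egress deny rule, also with logging enabled.
gcloud compute firewall-rules create deny-all-egress \
    --network=network-name \
    --direction=EGRESS \
    --action=DENY \
    --rules=all \
    --destination-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging
```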

Create a VPC network

Instead of using the default VPC network, you can create your own auto mode or custom VPC network. When you create the cluster, you associate your network with the cluster.
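For example, a custom mode VPC network and a subnet for the cluster might be created as follows; the network name, subnet name, region, and IP range are placeholders:

```shell
# Create a custom mode VPC network (no subnets are created automatically).
gcloud compute networks create network-name --subnet-mode=custom

# Create a subnet for the cluster in the region where it will run.
gcloud compute networks subnets create subnetwork-name \
    --network=network-name \
    --region=region \
    --range=10.0.0.0/16
```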

Create a cluster that uses your VPC network

gcloud command

Use gcloud dataproc clusters create with the ‑‑network or ‑‑subnet flag to create a cluster on a subnet in your network. If you use the ‑‑network flag, the cluster will use a subnetwork with the same name as the specified network in the region where the cluster is created.

--network example. Since auto networks are created with subnets in each region, and each subnet is given the network name, you can pass the auto mode network name to the ‑‑network flag. The cluster will use the auto mode subnetwork in the region specified with the ‑‑region flag.

gcloud dataproc clusters create my-cluster \
    --network network-name \
    --region=region \
    ... other args ...

--subnet example. You can use the ‑‑subnet flag to create a cluster that will use an auto mode or custom network subnet in the region specified with the ‑‑region flag. Specify the full resource path of the subnet.

gcloud dataproc clusters create cluster-name \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...

REST API

You can specify either the networkUri or subnetworkUri GceClusterConfig field as part of a clusters.create request.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b"
    },
    ...

Console

Select your network in the Network configuration section on the Customize cluster panel. After you choose the network, the Subnetwork selector displays the subnetwork(s) available in the region you selected for cluster creation.

Create a cluster that uses a VPC network in another project

A Dataproc cluster can use a Shared VPC network by participating as a service project. With Shared VPC, the Shared VPC network is defined in a different project, which is called the host project. The host project is made available for use in attached service projects. See Shared VPC overview for background information.

You create your Dataproc cluster in a project. In the shared VPC scenario, the project will be a service project. You must reference the project number. To find the project number:

  1. Open the IAM & Admin Settings page in the console. Select the project where you will create the Dataproc cluster, then copy the project number.
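Alternatively, the project number can be looked up from the command line (project-id is a placeholder for your project ID):

```shell
# Print the project number for the given project ID.
gcloud projects describe project-id --format='value(projectNumber)'
```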

A principal with the Shared VPC Admin role must perform the following steps. See directions for setting up Shared VPC for background information.

  1. Make sure that the Shared VPC host project has been enabled.

  2. Attach the Dataproc project to the host project.

  3. Configure either or both of the following service accounts to have the Network User role for the host project. Dataproc will attempt to use the Dataproc service agent service account first, falling back to the Google APIs service account if required:

    • Dataproc service agent service account: service-[project-number]@dataproc-accounts.iam.gserviceaccount.com

    • Google APIs service account: [project-number]@cloudservices.gserviceaccount.com

  4. Navigate to the IAM tab of the IAM & admin page.

  5. Use the project selector to select the host project.

  6. Click ADD. Repeat these steps to add both service accounts:

    1. Add the service account to the New principals field.

    2. From the Roles menu, select Compute Engine > Compute Network User.

    3. Click Save.

Once both service accounts have the Network User role for the host project, you can create a cluster that uses your VPC network.
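The console steps above can also be scripted. The following sketch grants the Network User role on the host project to both service accounts; host-project-id and the project number 123456789 are placeholders for your host project ID and your Dataproc project's number:

```shell
# Grant the Network User role to the Dataproc service agent service account.
gcloud projects add-iam-policy-binding host-project-id \
    --member="serviceAccount:service-123456789@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"

# Grant the Network User role to the Google APIs service account.
gcloud projects add-iam-policy-binding host-project-id \
    --member="serviceAccount:123456789@cloudservices.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```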

Create a Dataproc cluster with internal IP addresses only

You can create a Dataproc cluster that is isolated from the public internet, with VM instances that communicate over a private IP subnetwork (the VM instances do not have public IP addresses). To do this, the cluster's subnetwork must have Private Google Access enabled so that cluster nodes can access Google APIs and services, such as Cloud Storage, from internal IPs.
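Private Google Access can be enabled on an existing subnet with a command such as the following (subnet name and region are placeholders):

```shell
# Enable Private Google Access on the cluster's subnet so internal-IP-only
# nodes can reach Google APIs and services.
gcloud compute networks subnets update subnetwork-name \
    --region=region \
    --enable-private-ip-google-access
```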

gcloud command

You can create a Dataproc cluster with internal IP addresses only by using the gcloud dataproc clusters create command with the ‑‑no-address flag.

Use the ‑‑no-address and ‑‑network flags
Use the ‑‑no-address flag with the ‑‑network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster will be created.

gcloud dataproc clusters create my-cluster \
    --no-address \
    --network network-name \
    --region=region \
    ... other args ...

For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the ‑‑network flag to create a cluster that will use the auto subnetwork in the cluster's region.

Use the ‑‑no-address and ‑‑subnet flags
Use the ‑‑no-address flag with the ‑‑subnet flag to create a cluster that will use an auto mode or custom subnetwork in the region where the cluster will be created. Pass the full resource path of the subnet to the ‑‑subnet flag.

gcloud dataproc clusters create cluster-name \
    --no-address \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...

REST API

You can set the GceClusterConfig internalIpOnly field to true as part of a clusters.create request to enable internal IP addresses only.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b",
      "internalIpOnly": true
    },
    ...

Console

You can create a Dataproc cluster with Private Google Access enabled from the Dataproc Create a cluster page in the console. Click Internal IP only on the Customize cluster panel to enable this feature for your cluster.

Since, by default, internal-IP-only clusters do not have access to the Internet, jobs that download dependencies from the Internet, for example a download of Spark dependency packages from Maven Central, will fail. There are several workarounds to avoid this problem:

  1. Use Cloud NAT to enable cluster access to the Internet.

  2. Create a custom image that includes the dependencies (for example, Spark dependency packages in /usr/lib/spark/jars/).

  3. Upload the dependencies to a Cloud Storage bucket, then use an initialization action to download the dependencies from the bucket during cluster creation.
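As a sketch of the first workaround, Cloud NAT requires a Cloud Router in the cluster's region plus a NAT configuration; the router and NAT config names below are placeholders:

```shell
# Create a Cloud Router on the cluster's network in the cluster's region.
gcloud compute routers create nat-router \
    --network=network-name \
    --region=region

# Configure Cloud NAT on the router so internal-IP-only VMs can reach
# the Internet for outbound traffic.
gcloud compute routers nats create nat-config \
    --router=nat-router \
    --region=region \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```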

Dataproc and VPC-SC networks

With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.

Note the following limitations and strategies when using VPC-SC networks with Dataproc clusters: