Dataproc Cluster network configuration

This page explains Dataproc cluster network configuration requirements and options.

Dataproc connectivity requirements

Dataproc cluster Virtual Machines (VMs) must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports) protocols.

The default VPC network's default-allow-internal firewall rule meets Dataproc cluster connectivity requirements and allows ingress from the 10.128.0.0/9 source range from all VMs on the VPC network as follows:

Rule                     Network   Direction   Priority   Source range    Protocols:Ports
default-allow-internal   default   ingress     65534      10.128.0.0/9    tcp:0-65535, udp:0-65535, icmp
  • If you delete the default-allow-internal firewall rule, ingress traffic on the default network is blocked by the implied deny ingress rule.

  • If you delete the default-allow-internal firewall rule or don't use the default VPC network, you must create your own rule that meets Dataproc connectivity requirements, and then apply it to your cluster's VPC network.

Best practice: Create an ingress firewall rule for your cluster VPC network that allows ingress connectivity only among cluster VMs by using a source IP range or by identifying cluster VMs by network tag or service account.

Create an ingress firewall rule

If you or your network or security administrator create an ingress firewall rule to apply to a Dataproc cluster VPC network, it must have the following characteristics:

  • The sources parameter specifies the sources for packets. All Dataproc cluster VMs must be able to communicate with each other. You can identify the VMs in the cluster by IP address range, source tags, or service accounts associated with the VMs.

  • The target for the rule must identify the cluster VMs. The target can be all VMs in the VPC network, or you can identify VMs by IP address range, target tag, or target service account.

  • The rule must include the following protocols and ports:

    • TCP (all ports, 0 through 65535)
    • UDP (all ports, 0 through 65535)
    • ICMP

    Dataproc uses services that run on multiple ports. Specifying all ports helps the services run successfully.
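As a sketch, a rule with these characteristics can be created with the gcloud CLI. The network name and tag below are placeholders, not values from your project:

```shell
# Hypothetical example: allow all intra-cluster traffic between VMs tagged
# "dataproc-cluster" on a VPC network named "my-vpc" (both names are placeholders).
gcloud compute firewall-rules create allow-dataproc-internal \
    --network=my-vpc \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:0-65535,udp:0-65535,icmp \
    --source-tags=dataproc-cluster \
    --target-tags=dataproc-cluster
```

To make the rule apply to cluster VMs, attach the same tag at cluster creation time, for example with `gcloud dataproc clusters create ... --tags=dataproc-cluster`.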

Diagnose your VPC firewall rules

To audit packets not processed by higher priority firewall rules, you can create two low priority (65534) deny firewall rules. Unlike the implied firewall rules, you can enable firewall rules logging on each of these low priority rules:

  1. An ingress deny rule (sources 0.0.0.0/0, all protocols, all targets in the VPC network)

  2. An egress deny rule (destinations 0.0.0.0/0, all protocols, all targets in the VPC network)

  • With these low priority rules and firewall rules logging, you can log packets not processed by higher priority, and potentially more specific, firewall rules. These two low priority rules also align with security best practices by implementing a "final drop packets" strategy.

  • Examine the firewall rules logs for these rules to determine if you need to create or amend higher priority rules to permit packets. For example, if packets sent between Dataproc cluster VMs are dropped, this can be a signal that your firewall rules need to be adjusted.
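The two low-priority deny rules described above might be created as follows; the network name is a placeholder, and `--enable-logging` turns on firewall rules logging for each rule:

```shell
# Hypothetical sketch: lowest-configurable-priority (65534) deny rules with
# logging enabled, assuming a VPC network named "my-vpc".
gcloud compute firewall-rules create deny-all-ingress \
    --network=my-vpc \
    --direction=INGRESS \
    --action=DENY \
    --rules=all \
    --source-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging

gcloud compute firewall-rules create deny-all-egress \
    --network=my-vpc \
    --direction=EGRESS \
    --action=DENY \
    --rules=all \
    --destination-ranges=0.0.0.0/0 \
    --priority=65534 \
    --enable-logging
```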

Create a VPC network

Instead of using the default VPC network, you can create your own auto mode or custom VPC network. When you create the cluster, you associate your network with the cluster.

Assured Workloads environment: When you use an Assured Workloads environment for regulatory compliance, the cluster, its VPC network, and its Cloud Storage buckets must be contained within the Assured Workloads environment.
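As a sketch, either kind of network can be created with the gcloud CLI; the network, subnet, region, and IP range below are placeholders:

```shell
# Hypothetical example of an auto mode network
# (one subnet per region is created automatically):
gcloud compute networks create my-auto-vpc --subnet-mode=auto

# Hypothetical example of a custom mode network plus one subnet:
gcloud compute networks create my-custom-vpc --subnet-mode=custom
gcloud compute networks subnets create my-subnet \
    --network=my-custom-vpc \
    --region=us-central1 \
    --range=10.0.0.0/16
```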

Create a cluster that uses your VPC network

Google Cloud CLI

Use gcloud dataproc clusters create with the ‑‑network or ‑‑subnet flag to create a cluster on a subnet in your network. If you use the ‑‑network flag, the cluster will use a subnetwork with the same name as the specified network in the region where the cluster is created.

--network example. Since auto networks are created with subnets in each region, with each subnet given the network name, you can pass the auto mode VPC network name to the ‑‑network flag. The cluster will use the auto mode VPC subnetwork in the region specified with the ‑‑region flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --network NETWORK_NAME \
    --region=REGION \
    ... other args ...

--subnet example. You can use the ‑‑subnet flag to create a cluster that uses an auto mode or custom VPC network subnet in the cluster region. Specify the full resource path of the subnet.

gcloud dataproc clusters create CLUSTER_NAME \
    --subnet projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --region=REGION \
    ... other args ...

REST API

You can specify either the networkUri or subnetworkUri GceClusterConfig field as part of a clusters.create request.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "SUBNET_NAME"
    },
    ...

Console

Select your network in the Network configuration section on the Customize cluster panel. After you choose the network, the Subnetwork selector displays the subnetworks available in the region that you selected for the cluster.

Create a cluster that uses a VPC network in another project

A Dataproc cluster can use a shared VPC network that is defined in a host project. The project where the Dataproc cluster is created is referred to as the service project.

  1. Find the Dataproc cluster project number:

    1. Open the IAM & Admin Settings page in the Google Cloud console. Select the project where you will create the Dataproc cluster. Copy the project number.
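If you prefer the gcloud CLI, the project number can also be retrieved directly; PROJECT_ID is a placeholder for your cluster project's ID:

```shell
# Prints the numeric project number for the given project ID.
gcloud projects describe PROJECT_ID --format='value(projectNumber)'
```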
  2. A principal with the Shared VPC Admin role must perform the following steps. See directions for setting up Shared VPC for background information.

    1. Make sure that the Shared VPC host project is enabled.

    2. Attach the project with the Dataproc cluster to the host project.

    3. Configure both required service accounts to have the Network User role for the host project by following these steps:

      1. Open the IAM & Admin page in the Google Cloud console.

      2. Use the project selector to select the new host project.

      3. Click Grant Access.

      4. Fill in the Grant Access form. Repeat these steps to add both service accounts:

        1. Add principals: Input the service account.

        2. Assign roles: Enter "Compute Network" in the filter box, then select the Compute Network User role.

        3. Click Save.

  3. After both service accounts have the Network User role for the host project, create a cluster that uses the shared VPC network.
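The console steps above can also be performed with the gcloud CLI. In this sketch, HOST_PROJECT_ID is a placeholder for the Shared VPC host project, and SERVICE_ACCOUNT_EMAIL stands in for each of the two service accounts (run the command once per account):

```shell
# Hypothetical example: grant a service account the Compute Network User
# role on the Shared VPC host project.
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/compute.networkUser"
```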

Create a cluster that uses a VPC subnetwork in another project

A Dataproc cluster can use a shared VPC subnetwork that is defined in a host project. The project where the Dataproc cluster is created is referred to as the service project.

  1. Find the Dataproc cluster project number:

    1. Open the IAM & Admin Settings page in the Google Cloud console. Select the project where you will create the Dataproc cluster. Copy the project number.
  2. A principal with the Shared VPC Admin role must perform the following steps. See directions for setting up Shared VPC for background information.

    1. Make sure that the Shared VPC host project is enabled.

    2. Attach the project with the Dataproc cluster to the host project.

    3. Configure both required service accounts to have the Network User role for the host project by following these steps:

      1. Open the VPC networks page in the Google Cloud console.

      2. Use the project selector to select the host project.

      3. Click the network that contains the subnetwork that your Dataproc cluster will use.

      4. In the VPC Network Details page, click the checkbox next to the name of the subnetwork that your cluster will use.

      5. If the Info Panel is not open, click Show Info Panel.

      6. Perform the following steps for each service account:

        1. In the Info Panel, click Add Principal.

        2. Fill in the Grant Access form:

          1. Add principals: Input the service account.

          2. Assign roles: Enter "Compute Network" in the filter box, then select the Compute Network User role.

          3. Click Save.

  3. After both service accounts have the Network User role for the host project, create a cluster that uses the shared VPC subnetwork.
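Because this grant is scoped to a single subnetwork rather than the whole host project, the console steps above correspond to a subnet-level IAM binding, which can also be made with the gcloud CLI. SUBNET_NAME, REGION, HOST_PROJECT_ID, and SERVICE_ACCOUNT_EMAIL are placeholders (run once per service account):

```shell
# Hypothetical example: grant a service account the Compute Network User
# role on one subnetwork in the Shared VPC host project.
gcloud compute networks subnets add-iam-policy-binding SUBNET_NAME \
    --project=HOST_PROJECT_ID \
    --region=REGION \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/compute.networkUser"
```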

Create a Dataproc cluster with internal IP addresses only

You can create a Dataproc cluster that is isolated from the public internet: its VM instances communicate over a private IP subnetwork and are not assigned public IP addresses. For this to work, the subnetwork must have Private Google Access enabled so that cluster nodes can access Google APIs and services, such as Cloud Storage, from internal IP addresses.
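If Private Google Access is not yet enabled on the subnetwork, it can be turned on with the gcloud CLI; the subnet name and region below are placeholders:

```shell
# Hypothetical example: enable Private Google Access on an existing subnet.
gcloud compute networks subnets update SUBNET_NAME \
    --region=REGION \
    --enable-private-ip-google-access
```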

gcloud CLI

You can create a Dataproc cluster with internal IP addresses only by using the gcloud dataproc clusters create command with the ‑‑no-address flag.

Use the ‑‑no-address and ‑‑network flags: Use the ‑‑no-address flag with the ‑‑network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster is created.

gcloud dataproc clusters create CLUSTER_NAME \
    --no-address \
    --network NETWORK_NAME \
    --region=REGION \
    ... other args ...

For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the ‑‑network flag to create a cluster that will use the auto subnetwork in the cluster's region.

Use the ‑‑no-address and ‑‑subnet flags: Use the ‑‑no-address flag with the ‑‑subnet flag to create a cluster that will use an auto mode or custom subnetwork in the region where the cluster will be created. Pass the full resource path of the subnet to the ‑‑subnet flag.

gcloud dataproc clusters create CLUSTER_NAME \
    --no-address \
    --subnet projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME \
    --region=REGION \
    ... other args ...

REST API

You can set the GceClusterConfig internalIpOnly field to true as part of a clusters.create request to enable internal IP addresses only.

Example:

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b",
      "internalIpOnly": true
    },
    ...

Console

You can create a Dataproc cluster with Private Google Access enabled from the Dataproc Create a cluster page in the Google Cloud console. Click Internal IP only on the Customize cluster panel to enable this feature for your cluster.

By default, internal-IP-only clusters don't have access to the internet, so jobs that download dependencies from the internet, for example Spark dependency packages from Maven Central, will fail. There are several workarounds to avoid the problem:

  1. Use Cloud NAT to enable cluster access to the internet.

  2. Create a custom image that includes the dependencies (for example, Spark dependency packages in /usr/lib/spark/jars/).

  3. Upload the dependencies to a Cloud Storage bucket, then use an initialization action to download the dependencies from the bucket during cluster creation.
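The first workaround, Cloud NAT, can be sketched with the gcloud CLI. The router, NAT gateway, and network names below are placeholders; the NAT gateway lets internal-IP-only VMs initiate outbound connections to the internet without receiving public IP addresses:

```shell
# Hypothetical example: create a Cloud Router and a Cloud NAT gateway
# covering all subnet ranges on the network "my-vpc" in REGION.
gcloud compute routers create my-router \
    --network=my-vpc \
    --region=REGION

gcloud compute routers nats create my-nat \
    --router=my-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```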

Dataproc and VPC Service Controls networks

With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.

Note the following limitations and strategies when using VPC Service Controls networks with Dataproc clusters: