Dataproc Cluster Network Configuration

Overview

The Compute Engine Virtual Machine instances in a Dataproc cluster, consisting of master and worker VMs, must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports).

Firewall rule requirement

Dataproc requires that you create an ingress allow firewall rule with the following characteristics:

  • The source for the rule must include the cluster's VMs. You can define a source using IP address ranges or you can identify VMs using source tags or source service accounts. If you omit a source specification, the firewall rule will use the range 0.0.0.0/0 (any IP address) as the source. If your Dataproc VMs have external IP addresses, this means they can accept traffic from anywhere on the Internet. Consequently, you should define the source to be as narrow as possible to meet your needs and secure your cluster.

  • The target for the rule must identify the cluster's VMs. The target can be all VMs in the VPC network, or you can identify specific target VMs using target tags or target service accounts.

  • The rule must include the following protocols and ports: TCP (all ports, 0 through 65535), UDP (all ports, 0 through 65535), and ICMP.

    gcloud compute firewall-rules create my-subnet-firewall-rule \
        --network=network-name \
        --source-ranges=subnet-range \
        --allow=tcp:0-65535,udp:0-65535,icmp
    
    It's best to specify an ingress allow firewall rule with a specific source range or to identify Google Cloud VMs by network tag or service account. Refer to the firewall rules overview for additional information.
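For example, if your cluster's VMs share a network tag, you can scope both the source and the target of the rule to that tag instead of an IP range. The following is a minimal sketch; the dataproc-vm tag and the network-name network are hypothetical placeholders, and you would apply the same tag to the cluster's VMs (for example, with the ‑‑tags flag of gcloud dataproc clusters create).

# Hypothetical tag-scoped ingress allow rule for Dataproc cluster VMs.
gcloud compute firewall-rules create dataproc-allow-internal-tags \
    --network=network-name \
    --direction=INGRESS \
    --source-tags=dataproc-vm \
    --target-tags=dataproc-vm \
    --allow=tcp:0-65535,udp:0-65535,icmp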

How to set the source IP range

You can set the source IP range when you create a firewall rule from the Google Cloud Console or using the gcloud command-line tool.

Console

Use the Cloud Console Create a firewall rule page to create a firewall rule with a specified source IP range.

gcloud command

Use the gcloud compute firewall-rules create command to create a firewall rule with a specified source IP range.

gcloud compute firewall-rules create "tcp-rule" --allow tcp:80 \
    --source-ranges="10.0.0.0/22,10.0.0.0/14" \
    --description="Narrowing TCP traffic"

Dataproc cluster default network configuration

When you create a Dataproc cluster, you can accept the default network for the cluster.

Default network

Here's a Google Cloud Console snapshot that shows the default network selected from the Dataproc Create a cluster page.

After the cluster is created, the Google Cloud Console VM instances→Network details page shows the applicable firewall rules for the instances in the cluster. If you use a default network, it includes a pre-populated default-allow-internal firewall rule that allows ingress from the 10.128.0.0/9 source range. If you delete this pre-populated firewall rule or use a VPC network other than the default network, ingress traffic is blocked by the implied deny ingress rule. In those situations, you must create an ingress allow firewall rule that permits traffic to all TCP and UDP ports of instances in the cluster. Your Network or Security Administrator can refer to the firewall rules overview for more information.
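If you are unsure whether the pre-populated rule is still in place, you can check before creating the cluster, and recreate an equivalent rule if needed. The following gcloud sketch assumes the default network; the create command simply mirrors the rule characteristics described above.

# Confirm the pre-populated rule still exists on the default network.
gcloud compute firewall-rules describe default-allow-internal

# If it was deleted, recreate an equivalent ingress allow rule.
gcloud compute firewall-rules create default-allow-internal \
    --network=default \
    --source-ranges=10.128.0.0/9 \
    --allow=tcp:0-65535,udp:0-65535,icmp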

Create a VPC network

You can specify your own Virtual Private Cloud (VPC) network when you create a Dataproc cluster. To do this, you must first create a VPC network with firewall rules. Then, when you create the cluster, you associate your network with the cluster.

Creating a VPC network

You can create a VPC network from the Cloud Console or with the gcloud compute networks create command. You can create an auto mode VPC network or a custom mode VPC network (called "auto" and "custom" networks, respectively, below). An auto network is automatically configured with a subnet in each Compute Engine region. Custom networks are not automatically configured with subnets; you must create one or more subnets in one or more Compute Engine regions when you create the custom network. For more information, see Types of VPC Networks.
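If you prefer the command line to the Console flow described below, the same two network types can be created with gcloud. This is a minimal sketch; the network and subnet names, region, and IP range are hypothetical.

# Auto mode: a subnet is created automatically in each region.
gcloud compute networks create my-auto-net --subnet-mode=auto

# Custom mode: you create each subnet yourself.
gcloud compute networks create my-custom-net --subnet-mode=custom
gcloud compute networks subnets create my-custom-subnet \
    --network=my-custom-net \
    --region=us-central1 \
    --range=10.0.0.0/16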

Let's look at the options available when you create an auto and custom network from the Cloud Console.

Auto

The Cloud Console screenshot, below, shows the Cloud Console fields that are populated for the Automatic creation of subnetworks option (an auto mode VPC network). You must select one or more firewall rules. The network-name-allow-internal rule, which opens the udp:0-65535;tcp:0-65535;icmp ports, should be selected to enable full internal IP networking access among VM instances in the network. You can also select the network-name-allow-ssh rule to open standard SSH port 22 and allow SSH connections to VMs in the network.
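If you create the network with gcloud instead of the Console, these rules are not added for you automatically; the allow-ssh rule, for example, corresponds to a simple port 22 allow. A hedged sketch with hypothetical rule and network names; narrow the source range if you do not need SSH access from anywhere.

# Hypothetical SSH allow rule for an auto mode network.
gcloud compute firewall-rules create my-auto-net-allow-ssh \
    --network=my-auto-net \
    --source-ranges=0.0.0.0/0 \
    --allow=tcp:22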

Custom

If you choose Custom subnetworks when creating a network (a custom mode VPC network), you must specify the region and private IP address range for each subnetwork. To enable full internal access among VMs in the network, you can specify an IP address range of 10.0.0.0/8 (or a more restrictive range if appropriate, such as 10.128.0.0/16).

Note that you provide firewall rules for custom subnetworks after the network is created. Again, to enable full network access among VMs in your network, select or create a firewall rule that opens the udp:0-65535;tcp:0-65535;icmp ports (as shown in the Cloud Console screenshot below).
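For example, a full-internal-access rule for a custom network could look like the following sketch; the rule and network names are hypothetical, and the 10.0.0.0/8 source range should be narrowed to match your subnetworks where possible.

# Hypothetical internal allow rule for a custom mode network.
gcloud compute firewall-rules create my-custom-net-allow-internal \
    --network=my-custom-net \
    --source-ranges=10.0.0.0/8 \
    --allow=tcp:0-65535,udp:0-65535,icmp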

Creating a cluster that uses your VPC network

gcloud command

You can use the Cloud SDK gcloud dataproc clusters create command with the ‑‑network or ‑‑subnet flag to create a cluster that will use an auto or custom subnetwork.

Using the ‑‑network flag
You can use the ‑‑network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster will be created.

gcloud dataproc clusters create my-cluster \
    --network network-name \
    --region=region \
    ... other args ...

For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the ‑‑network flag (‑‑network auto-net-name) to create a cluster that will use the auto subnetwork in the cluster's region.

Using the ‑‑subnet flag
You can use the ‑‑subnet flag to create a cluster that will use an auto or custom subnetwork in the region where the cluster will be created. You must pass the ‑‑subnet flag the full resource path of the subnet your cluster will use.

gcloud dataproc clusters create cluster-name \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...

REST API

You can specify either the networkUri or subnetworkUri GceClusterConfig field as part of a clusters.create request.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b"
    },
    ...
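For instance, assuming the completed request body is saved to a local file named cluster.json (a hypothetical filename), the request could be sent with curl and a gcloud access token:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @cluster.json \
    "https://dataproc.googleapis.com/v1/projects/my-project-id/regions/us-central1/clusters"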

Console

After creating a VPC network with firewall rules that allow VMs full access over the network's private IP address range, you can create a cluster from the Cloud Console→Create cluster page, then select your network from the Network selector (expand the Preemptible workers, bucket, network, version, initialization, & access options heading to access the selector).

After you choose the network, the Subnetwork selector displays the subnetwork(s) available in the region you selected for the cluster. If no subnetwork is available in that region, "No subnetworks in this region" is displayed in the Subnetwork selector.

Below is a screenshot that shows the Network and Subnetwork selectors on the Dataproc Create a cluster Cloud Console page. As shown, a custom subnetwork in a custom network has been selected.

Creating a cluster that uses a VPC network in another project

A Dataproc cluster can use a Shared VPC network by participating as a service project. With Shared VPC, the network is defined in a separate project, called the host project, and is made available for use by IAM members in attached service projects. See Shared VPC Overview for background information.

You will create your Dataproc cluster in a project. In the Shared VPC scenario, this project is a service project. You will need to reference the project number of this project. Here's one way to find the project number (a gcloud alternative is sketched after these steps):

  1. Navigate to the IAM & admin page Settings tab.

  2. From the project drop-down list at the top of the page, select the project you will use to create the Dataproc cluster.

  3. Note the project number shown on the page.
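Alternatively, you can look up the project number with gcloud; in this sketch, project-id is a placeholder for your service project ID.

gcloud projects describe project-id --format="value(projectNumber)"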

An IAM member who is a Shared VPC Admin must perform the following steps. See directions for setting up Shared VPC for background information.

  1. Make sure that the Shared VPC host project has been enabled.

  2. Attach the Dataproc project to the host project.

  3. Configure both of the following service accounts to have the Network User role for the host project: the Dataproc service agent service account and the Google APIs service account. Dataproc will attempt to use the Dataproc service agent service account, falling back to the Google APIs service account if required. A gcloud equivalent of the role grant is sketched after this list.

  4. Navigate to the IAM tab of the IAM & admin page.

  5. Use the project drop-down list at the top of the page to select the host project.

  6. Click ADD. Repeat these steps to add both service accounts:

    1. Add the service account to the Members field.

    2. From the Roles menu, select Compute Engine > Compute Network User.

    3. Click Add.
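If the Shared VPC Admin prefers the command line, the same grant can be made with gcloud. In this sketch, host-project-id and service-account-email are placeholders; run the command once for each of the two service accounts.

# Grant the Compute Network User role on the host project.
gcloud projects add-iam-policy-binding host-project-id \
    --member="serviceAccount:service-account-email" \
    --role="roles/compute.networkUser"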

Once both service accounts have the Network User role for the host project, you can create a cluster that uses your VPC network.
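When you create the cluster in the service project, reference the shared subnetwork by its full resource path in the host project, using the ‑‑subnet flag described earlier. All names in this sketch are placeholders.

gcloud dataproc clusters create my-cluster \
    --subnet projects/host-project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...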

Create a Dataproc cluster with internal IP addresses only

You can create a Dataproc cluster that is isolated from the public internet: its VM instances communicate over a private IP subnetwork and do not have public IP addresses. To do this, the subnetwork of the cluster must have Private Google Access enabled to allow cluster nodes to access Google APIs and services, such as Cloud Storage, from internal IPs.
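If the subnetwork does not already have Private Google Access enabled, you can turn it on with gcloud; subnetwork-name and region are placeholders in this sketch.

gcloud compute networks subnets update subnetwork-name \
    --region=region \
    --enable-private-ip-google-access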

gcloud command

You can create a Dataproc cluster with internal IP addresses only by using the gcloud dataproc clusters create command with the ‑‑no-address flag.

Using the ‑‑no-address and ‑‑network flags
Use the ‑‑no-address flag with the ‑‑network flag to create a cluster that will use a subnetwork with the same name as the network in the region where the cluster will be created.

gcloud dataproc clusters create my-cluster \
    --no-address \
    --network network-name \
    --region=region \
    ... other args ...

For example, since auto networks are created with subnets in each region with the same name as the auto network, you can pass the auto network name to the ‑‑network flag (‑‑network auto-net-name) to create a cluster that will use the auto subnetwork in the cluster's region.

Using the ‑‑no-address and ‑‑subnet flags
Use the ‑‑no-address flag with the ‑‑subnet flag to create a cluster that will use an auto or custom subnetwork in the region where the cluster will be created. You must pass the ‑‑subnet flag the full resource path of the subnet your cluster will use.

gcloud dataproc clusters create cluster-name \
    --no-address \
    --subnet projects/project-id/regions/region/subnetworks/subnetwork-name \
    --region=region \
    ... other args ...

REST API

You can set the GceClusterConfig internalIpOnly field to "true" as part of a clusters.create request to enable internal IP addresses only.

Example

POST /v1/projects/my-project-id/regions/us-central1/clusters/
{
  "projectId": "my-project-id",
  "clusterName": "example-cluster",
  "config": {
    "configBucket": "",
    "gceClusterConfig": {
      "subnetworkUri": "custom-subnet-1",
      "zoneUri": "us-central1-b",
      "internalIpOnly": true
    },
    ...

Console

You can create a Dataproc cluster with Private Google Access enabled from the Dataproc Create a cluster Cloud Console page. Expand the Preemptible workers, bucket, network, version, initialization, & access options link at the bottom of the page, and then click Internal IP only to enable this feature for your cluster.

Dataproc and VPC-SC networks

With VPC Service Controls, administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.

Note the following limitations and strategies when using VPC-SC networks with Dataproc clusters: