Dataproc Metastore networking overview

This document provides an overview of the networking settings you can use to set up a Dataproc Metastore service.

Quick reference for networking topics

Default network settings

  • VPC networks: By default, Dataproc Metastore services use VPC networks to connect to Google Cloud. After the VPC network is created, Dataproc Metastore also automatically configures VPC Network Peering for your service.
  • VPC subnetworks: You can optionally create Dataproc Metastore services with a VPC subnetwork by using Private Service Connect. This is an alternative to using VPC networks.

Additional network settings

  • Shared VPC networks: You can optionally create Dataproc Metastore services in a Shared VPC network.
  • On-premises networking: You can connect an on-premises environment to a Dataproc Metastore service by using Cloud VPN or Cloud Interconnect.
  • VPC Service Controls: You can optionally create Dataproc Metastore services with VPC Service Controls.
  • Firewall rules: In non-default or private environments with an established security footprint, you might need to create your own firewall rules.

Default networking settings

The following section describes the default network settings that Dataproc Metastore uses: VPC networks and VPC Network Peering.

VPC networks

By default, Dataproc Metastore services use VPC networks to connect to Google Cloud. A VPC network is a virtual version of a physical network that is implemented inside Google's production network. When you create a Dataproc Metastore service, the service automatically creates the VPC network for you.

If you don't change any settings when you create your service, Dataproc Metastore uses the default VPC network. With this setting, the VPC network that you use with your Dataproc Metastore service can belong to the same Google Cloud project or a different project. This setting also lets you expose your service in a single VPC network or make your service accessible from multiple VPC networks (through the use of subnetworks).

Dataproc Metastore requires the following per region for each VPC network:

VPC Network Peering

After the VPC network is created, Dataproc Metastore also automatically configures VPC Network Peering for your service. VPC Network Peering provides your service with access to the Dataproc Metastore endpoint protocols. After you create your service, you can see its underlying VPC Network Peering on the VPC Network Peering page in the Google Cloud console.

VPC Network Peering is not transitive. This means that only directly peered networks can communicate with each other. For example, consider the following scenario:

You have the following VPC networks: N1, N2, and N3.

  • VPC network N1 is peered with N2 and N3.
  • VPC networks N2 and N3 are not directly connected.

What does this mean?

It means that through VPC Network Peering, VPC network N2 can't communicate with VPC network N3. This impacts Dataproc Metastore connections in the following ways:

  • Virtual machines that are in networks peered with your Dataproc Metastore project network can't reach Dataproc Metastore.
  • Only hosts on the VPC network can reach a Dataproc Metastore service.
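
The non-transitive behavior described above can be illustrated with a small model. The following Python sketch is illustrative only: `PEERINGS` and `can_communicate` are hypothetical names, not part of any Google Cloud API. It treats each peering as a direct link and never follows a chain of links.

```python
# Hypothetical model of the N1/N2/N3 scenario; not a Google Cloud API.
# Each peering is an unordered pair of directly connected networks.
PEERINGS = {frozenset({"N1", "N2"}), frozenset({"N1", "N3"})}


def can_communicate(a: str, b: str) -> bool:
    """Return True only if a and b are the same network or directly peered."""
    return a == b or frozenset({a, b}) in PEERINGS


print(can_communicate("N1", "N2"))  # True: directly peered
print(can_communicate("N2", "N3"))  # False: N1 does not relay traffic
```

Because the check looks only at direct pairs, N2 and N3 stay isolated from each other even though both are peered with N1.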

VPC Network Peering security considerations

  • Traffic over VPC Network Peering is provided with a certain level of encryption. For more information, see Google Cloud virtual network encryption and authentication.

  • Creating one VPC network for each service with an internal IP address provides better network isolation than putting all services in the default VPC network.

VPC subnetworks

Private Service Connect (PSC) lets you set up a private connection to Dataproc Metastore metadata across VPC networks. With PSC, you can create a service without VPC peering. This lets you use your own internal IP addresses to access Dataproc Metastore, without leaving your VPC networks or using external IP addresses.

To set up Private Service Connect when creating a service, see Private Service Connect with Dataproc Metastore.

IP addresses

To connect to a network and help protect your metadata, Dataproc Metastore services only use internal IP addresses. This means that public IP addresses aren't exposed or available for networking purposes.

Because it uses an internal IP address, Dataproc Metastore can only connect to virtual machines (VMs) that exist on specified Virtual Private Cloud (VPC) networks or in an on-premises environment.

Connections to a Dataproc Metastore service that use an internal IP address use RFC 1918 address ranges. Dataproc Metastore allocates a /17 range and a /20 range from this address space for each region. For example, placing Dataproc Metastore services in two regions requires that the allocated IP address range contains the following:

  • At least two unused address blocks of size /17.
  • At least two unused address blocks of size /20.

If RFC 1918 address blocks aren't found, then Dataproc Metastore finds suitable non-RFC 1918 address blocks instead. Note that the allocation of non-RFC 1918 blocks doesn't take into account whether or not those addresses are in use in your VPC network or on-premises.
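
The sizing arithmetic above can be checked with Python's standard `ipaddress` module. This is a minimal sketch, not how Dataproc Metastore selects ranges: the helper names are hypothetical, and `can_hold` assumes the allocated range is a single contiguous IPv4 block that is entirely unused.

```python
import ipaddress

# Block sizes allocated per region, as described in the text above.
BLOCK_PREFIXES = (17, 20)

# The three private IPv4 ranges defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def addresses_needed(regions: int) -> int:
    """Total addresses in one /17 plus one /20 for each region."""
    return regions * sum(2 ** (32 - prefix) for prefix in BLOCK_PREFIXES)


def can_hold(allocated_cidr: str, regions: int) -> bool:
    """Assumption: the range is one empty block, so aligned power-of-two
    blocks pack without gaps and comparing total sizes is sufficient."""
    net = ipaddress.ip_network(allocated_cidr)
    fits_a_17 = net.prefixlen <= min(BLOCK_PREFIXES)
    return fits_a_17 and net.num_addresses >= addresses_needed(regions)


def is_rfc1918(cidr: str) -> bool:
    """True if the IPv4 block lies entirely inside an RFC 1918 range."""
    net = ipaddress.ip_network(cidr)
    return any(net.subnet_of(private) for private in RFC1918)


print(addresses_needed(2))          # 73728
print(can_hold("10.0.0.0/16", 2))   # False: only 65536 addresses
print(can_hold("10.0.0.0/15", 2))   # True
print(is_rfc1918("10.100.0.0/17"))  # True
```

A range that already contains used addresses needs a real overlap check against existing subnets, which this sketch does not attempt.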

Additional networking settings

If you require different networking settings, you can use the following options with your Dataproc Metastore service.

Shared VPC network

You can create Dataproc Metastore services in a Shared VPC network. A Shared VPC lets you connect Dataproc Metastore resources from multiple projects to a common Virtual Private Cloud (VPC) network.

To set up a Shared VPC when creating a service, see Create a Dataproc Metastore Service.

On-premises networking

You can connect an on-premises environment to a Dataproc Metastore service by using Cloud VPN or Cloud Interconnect.

VPC Service Controls

VPC Service Controls improve your ability to mitigate the risk of data exfiltration. With VPC Service Controls, you create perimeters around the Dataproc Metastore service. VPC Service Controls restrict access to resources within the perimeter from the outside. Only clients and resources within the perimeter can interact with one another.

To use VPC Service Controls with Dataproc Metastore, see VPC Service Controls with Dataproc Metastore. Also review Dataproc Metastore limitations when using VPC Service Controls.

Firewall rules for Dataproc Metastore

In non-default or private environments with an established security footprint, you might need to create your own firewall rules. If you do, don't create a firewall rule that blocks the IP address range or port of your Dataproc Metastore services.

When you create a Dataproc Metastore service, you can accept the default network for the service. The default network ensures full internal IP networking access for your VMs.

For more general information about firewall rules, see VPC firewall rules and Using VPC firewall rules.

Create a firewall rule for a custom network

When you use a custom network, make sure your firewall rule permits traffic coming from and going to the Dataproc Metastore endpoint. To explicitly allow Dataproc Metastore traffic, run the following gcloud commands:

gcloud compute firewall-rules create dpms-allow-egress-DPMS_NETWORK-REGION \
    --allow tcp \
    --destination-ranges DPMS_NET_PREFIX/17 \
    --network DPMS_NETWORK \
    --direction OUT

gcloud compute firewall-rules create dpms-allow-ingress-DPMS_NETWORK-REGION \
    --allow tcp,udp \
    --source-ranges DPMS_NET_PREFIX/17 \
    --network DPMS_NETWORK

For DPMS_NET_PREFIX, apply a /17 subnet mask to your Dataproc Metastore service IP. You can find your Dataproc Metastore IP address information in the endpointUri configuration on the Service detail page.
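
Applying a /17 mask by hand is error-prone, so the following Python sketch shows one way to derive the prefix with the standard library. The helper names are hypothetical, and the sample endpoint URI (scheme, address, and port) is an assumed example of the `scheme://IP:PORT` form, not a value from a real service.

```python
import ipaddress
from urllib.parse import urlsplit


def dpms_net_prefix(service_ip: str) -> str:
    """Return the network address of the /17 block that contains service_ip."""
    # strict=False lets ip_network accept a host address and mask off host bits.
    net = ipaddress.ip_network(f"{service_ip}/17", strict=False)
    return str(net.network_address)


def ip_from_endpoint_uri(endpoint_uri: str) -> str:
    """Extract the host part, assuming a URI of the form scheme://IP:PORT."""
    return urlsplit(endpoint_uri).hostname


print(dpms_net_prefix("10.100.0.100"))  # 10.100.0.0

# Hypothetical endpoint URI used only for illustration.
ip = ip_from_endpoint_uri("thrift://10.100.200.5:9083")
print(dpms_net_prefix(ip))  # 10.100.128.0
```

The second call shows why the mask matters: an address in the upper half of a /16 falls in the block that starts at `.128.0`, not `.0.0`.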

Considerations

Networks have an implied allow egress rule that normally allows access from your network to Dataproc Metastore. If you create deny egress rules that override the implied allow egress rule, you should create an allow egress rule with a higher priority to permit egress to the Dataproc Metastore IP.

Some features, such as Kerberos, require Dataproc Metastore to initiate connections to hosts in your project network. All networks have an implied deny ingress rule that blocks these connections and prevents those features from working. You should create a firewall rule that allows TCP and UDP ingress on all ports from the /17 IP block that contains the Dataproc Metastore IP.
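
To see why the allow rule needs a higher priority than a custom deny rule, the following Python sketch models egress evaluation in simplified form. It relies on two properties of VPC firewall rules: a lower priority number means a higher priority, and a deny rule wins over an allow rule at the same priority. The `EgressRule` class, the rule names, and the priority values are hypothetical, and the model ignores protocols, ports, and targets.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressRule:
    name: str
    priority: int     # lower number = higher priority
    action: str       # "allow" or "deny"
    destination: str  # CIDR range


# The implied allow egress rule has the lowest possible priority.
IMPLIED_ALLOW_EGRESS = EgressRule("implied-allow-egress", 65535, "allow", "0.0.0.0/0")


def egress_decision(rules: list[EgressRule], dest_ip: str) -> str:
    """Return the action of the highest-priority rule that matches dest_ip."""
    ip = ipaddress.ip_address(dest_ip)
    matching = [
        rule for rule in [*rules, IMPLIED_ALLOW_EGRESS]
        if ip in ipaddress.ip_network(rule.destination)
    ]
    # Sort key: priority first, then deny before allow on ties.
    winner = min(matching, key=lambda rule: (rule.priority, rule.action != "deny"))
    return winner.action


deny_all = EgressRule("deny-all-egress", 1000, "deny", "0.0.0.0/0")
allow_dpms = EgressRule("dpms-allow-egress", 900, "allow", "10.100.0.0/17")

print(egress_decision([], "10.100.0.100"))                      # allow
print(egress_decision([deny_all], "10.100.0.100"))              # deny
print(egress_decision([deny_all, allow_dpms], "10.100.0.100"))  # allow
```

With only the deny rule in place, traffic to the service is blocked. Adding the allow rule at a lower priority number restores access to the /17 block while other egress stays denied.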

Custom routing

Custom routes are for subnets that use privately used public IP addresses (PUPI). Custom routes allow your VPC network to connect to a peer network. Custom routes can only be received when your VPC network imports them and the peer network explicitly exports them. Custom routes can be either static or dynamic.

Sharing custom routes with peered VPC networks allows networks to "learn" routes directly from their peered networks. This means that when a custom route in a peered network is updated, your VPC network automatically learns and implements the custom route without requiring any additional action from you.

For more information about custom routing, see network config.

Dataproc Metastore networking example

In the following example, Google allocates the 10.100.0.0/17 and 10.200.0.0/20 address ranges in the customer VPC network for Google services and uses the address ranges in a peered VPC network.

Figure 1. Dataproc Metastore VPC network configuration

Description of the networking example:

  • On the Google services side of the VPC peering, Google creates a project for the customer. The project is isolated, meaning no other customers share it and the customer is billed for only the resources the customer provisions.
  • When you create the first Dataproc Metastore service in a region, Dataproc Metastore allocates a /17 range and a /20 range in the customer's network for use by all future Dataproc Metastore services in that region and network. Dataproc Metastore further subdivides these ranges to create subnetworks and address ranges in the service producer project.
  • VM instances in the customer's network can access Dataproc Metastore service resources in any region if the Google Cloud service supports it. Some Google Cloud services might not support cross-region communication.
  • Egress costs for cross-regional traffic, where a VM instance communicates with resources in a different region, still apply.
  • Google assigns the Dataproc Metastore service the IP address 10.100.0.100. In the customer VPC network, requests with a destination of 10.100.0.100 are routed through the VPC peering to the service producer's network. After a request reaches the service network, routes in that network direct it to the correct resource.
  • Traffic between VPC networks travels internally within Google's network, not through the public internet.
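
The routing decision in this example can be sketched as a range-membership check. The following Python snippet is a simplified illustration that uses the two address ranges from the example. `ALLOCATED_RANGES` and `routed_through_peering` are hypothetical names, and real VPC routing considers more than these two ranges.

```python
import ipaddress

# The ranges allocated for Google services in the example above.
ALLOCATED_RANGES = [
    ipaddress.ip_network("10.100.0.0/17"),
    ipaddress.ip_network("10.200.0.0/20"),
]


def routed_through_peering(destination: str) -> bool:
    """Return True if the destination falls inside an allocated range."""
    ip = ipaddress.ip_address(destination)
    return any(ip in allocated for allocated in ALLOCATED_RANGES)


print(routed_through_peering("10.100.0.100"))  # True: the service IP address
print(routed_through_peering("10.100.128.1"))  # False: just outside the /17
```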

What's next