After you create a Dataproc Metastore service, you can attach either a Dataproc cluster or a self-managed Apache Hive, Apache Spark, or Presto cluster to use the service as its Hive metastore.
Before you begin
For optimal network connectivity, create the Dataproc cluster in the same region as the Dataproc Metastore service.
The Dataproc image and Dataproc Metastore Hive version must be compatible. Check the following image versioning pages to ensure that the Hive version is compatible:
For more information, see Dataproc Image version list.
In order to use any Dataproc Metastore service, including one in the same project as the Dataproc cluster, the cluster and the metastore must be on the same network.
You can attach a Dataproc Metastore service to any Dataproc cluster on the peered network by setting the following Hive properties when you run gcloud dataproc clusters create:

gcloud dataproc clusters create CLUSTER_NAME \
    --properties="hive:hive.metastore.uris=$ENDPOINT_URI,hive:hive.metastore.warehouse.dir=$WAREHOUSE_DIR/hive-warehouse"
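As a sketch, the two shell variables can be populated before the call; the endpoint URI and warehouse directory below are placeholder values (see the "Attach a self-managed cluster" section for how to look up the real values for your service):

```shell
# Hypothetical values for illustration; copy the real ones from your
# service's detail page on the Dataproc Metastore page.
ENDPOINT_URI="thrift://10.1.2.3:9083"
WAREHOUSE_DIR="gs://example-bucket/hive"

# Build the Hive property string expected by --properties.
PROPERTIES="hive:hive.metastore.uris=${ENDPOINT_URI},hive:hive.metastore.warehouse.dir=${WAREHOUSE_DIR}/hive-warehouse"
echo "${PROPERTIES}"

# The cluster create call itself (requires gcloud and a Google Cloud project):
# gcloud dataproc clusters create my-cluster --properties="${PROPERTIES}"
```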
A cluster may also use a subnet of the metastore's network. To create a cluster using a subnetwork from the network project, you must configure shared network permissions.
If you're working with a cross-project deployment, you must set up additional permissions before creating a Dataproc Metastore cluster. A cross-project deployment can consist of two to three projects, with the Dataproc cluster in a cluster project, the Dataproc Metastore service in a metastore project, and the network in either of the previous two projects or in its own network project. It's also possible for the Dataproc cluster and the Dataproc Metastore service to share a project while the network is in its own network project.
The following diagram provides an overview of the possible project configurations when deploying a Dataproc Metastore cluster:
When using a VPC network belonging to a different project than the service, you must provide the network's entire relative resource name when you run gcloud metastore services create:

gcloud metastore services create SERVICE \
    --network=projects/HOST_PROJECT/global/networks/NETWORK_ID
Attach a Dataproc cluster
You can create and attach a Dataproc cluster that uses the Dataproc Metastore service as its Hive metastore.
Set up a cross-project deployment
Cross-project deployments where the Dataproc cluster and the Dataproc Metastore service are in separate projects require a permissions setup. You don't need to perform this setup for cases where the Dataproc cluster and the Dataproc Metastore service share a project while the network is in its own network project.
After configuring the network permissions, you must grant the Dataproc Metastore Viewer role in the metastore project to the Dataproc Service Agent of the cluster project. The Dataproc Service Agent account is in the format service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com, where PROJECT_NUMBER is the project number of the cluster project.
To find the project number:
Navigate to the IAM & Admin Settings tab.
From the project list at the top of the page, select the project you'll use to create the Dataproc cluster.
Note the project number.
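The steps above can also be done from the command line. This sketch assumes the standard Dataproc Service Agent address format; the project number shown is a placeholder:

```shell
# Look up the project number with gcloud (requires gcloud and credentials):
# gcloud projects describe PROJECT_ID --format='value(projectNumber)'

# Hypothetical project number; substitute your cluster project's number.
PROJECT_NUMBER=123456789012

# Dataproc Service Agent address, assuming the standard format.
SERVICE_AGENT="service-${PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com"
echo "${SERVICE_AGENT}"
```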
Configure the permissions:
Navigate to the IAM tab.
From the project list at the top of the page, select the metastore project.
Enter the service account in the New Principals field.
From the Roles menu, select Dataproc Metastore > Dataproc Metastore Viewer.
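The same grant can be made from the command line with gcloud. This is a sketch: the project IDs and service agent address are placeholders, and the role ID roles/metastore.viewer is assumed to correspond to the Dataproc Metastore Viewer role shown in the console:

```shell
# Grant the (assumed) Dataproc Metastore Viewer role in the metastore
# project to the cluster project's Dataproc Service Agent.
# METASTORE_PROJECT_ID and the service account address are placeholders.
gcloud projects add-iam-policy-binding METASTORE_PROJECT_ID \
    --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/metastore.viewer"
```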
You can now create a Dataproc cluster that uses the metastore project's Dataproc Metastore service and the network or subnetwork that the service is on.
Create a Dataproc cluster
The following instructions demonstrate how to create and attach a Dataproc cluster.
In the Cloud Console, open the Dataproc Create a cluster page:
Enter a name in the Cluster Name field.
On the Region and Zone menus, select a region and zone for the cluster. You can select a distinct region to isolate resources and metadata storage locations within that region. If you select a distinct region, you can select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto zone placement).
Click on the Customize cluster tab.
In the Network configuration section, select the same network specified during the metastore service creation.
In the Dataproc Metastore section, select your metastore service. If you haven't created one yet, you can select Create New Service.
Click Create to create the cluster.
Your new cluster appears in the Clusters list. Cluster status is listed as "Provisioning" until the cluster is ready to use. Its status then changes to "Running."
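You can also check the cluster state from the command line by polling it with gcloud; the cluster and region names below are placeholders:

```shell
# Print the cluster's current state, for example RUNNING.
# CLUSTER_NAME and REGION are placeholders; requires gcloud and credentials.
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format='value(status.state)'
```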
Run the following gcloud dataproc clusters create command to create a cluster and attach it to the Dataproc Metastore service:

gcloud dataproc clusters create CLUSTER_NAME \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE \
    --region=LOCATION
Replace the following:

CLUSTER_NAME: the name of the new cluster.
PROJECT_ID: the project ID of the project you created your Dataproc Metastore service in.
LOCATION: the same region you specified for the Dataproc Metastore service.
SERVICE: the name of your Dataproc Metastore service.
Follow the API instructions to create a cluster by using the API Explorer.
Attach a self-managed cluster
After you create a service, you can attach a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the service as its Hive metastore by setting the following properties in the client config:

hive.metastore.uris=ENDPOINT_URI
hive.metastore.warehouse.dir=WAREHOUSE_DIR

Replace the following:
ENDPOINT_URI: The Hive metastore endpoint URI used to access the metastore service.
To find the endpoint URI value to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the URI value that starts with thrift://.
WAREHOUSE_DIR: The Hive warehouse directory specified in the service's Hive metastore config overrides.
To find the warehouse directory to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the hive.metastore.warehouse.dir value under Metastore config overrides.
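For a self-managed Hive cluster, these settings typically go in hive-site.xml. The following is a sketch with placeholder values, assuming a Thrift endpoint and a Cloud Storage warehouse path:

```xml
<!-- hive-site.xml: point the self-managed cluster at the
     Dataproc Metastore service. All values are placeholders. -->
<configuration>
  <!-- Copy the real endpoint URI from the Service detail page. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://METASTORE_HOST:9083</value>
  </property>
  <!-- Copy hive.metastore.warehouse.dir from the service's
       metastore config overrides. -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>gs://BUCKET_NAME/hive-warehouse</value>
  </property>
</configuration>
```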