After you create a Dataproc Metastore service, you can create and attach a Dataproc cluster that uses the service as its Hive metastore.
Before you begin
You must create the Dataproc cluster in the same region as the Dataproc Metastore service for optimal network connectivity.
The Dataproc image and Dataproc Metastore Hive version must be compatible:
Dataproc 2.x images require Dataproc Metastore services created with Hive 3.1.2.
Dataproc 1.x images require Dataproc Metastore services created with either Hive 2.3.6 or 3.1.2, but perform optimally with 2.3.6.
For more information on Dataproc image versions and to find out which Hive version is used by a Dataproc image, see Dataproc Versioning.
If you're working with a cross-project deployment, you must set up additional permissions before creating the Dataproc cluster. A cross-project deployment can span two or three projects: the Dataproc cluster lives in a cluster project, the Dataproc Metastore service in a metastore project, and the network in either of those two projects or in its own network project. The Dataproc cluster and the Dataproc Metastore service can also share a project while the network resides in its own network project.
The following diagram provides an overview of the possible project configurations when attaching a Dataproc cluster to a Dataproc Metastore service:
When the VPC network belongs to a different project than the service, you must provide the full relative resource name of the network when running gcloud metastore services create SERVICE:

gcloud metastore services create SERVICE \
    --network=projects/HOST_PROJECT/global/networks/NETWORK_ID
In order to use any Dataproc Metastore service, including one in the same project as the Dataproc cluster, the cluster and the metastore must be on the same network.
You can attach a Dataproc Metastore service to any Dataproc cluster on the peered network. To do so, provide the following Hive property configuration when running gcloud dataproc clusters create CLUSTER_NAME:

gcloud dataproc clusters create CLUSTER_NAME \
    --properties="hive:hive.metastore.uris=$ENDPOINT_URI,hive:hive.metastore.warehouse.dir=$WAREHOUSE_DIR/hive-warehouse"
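The properties string above can be assembled in a small shell sketch. The endpoint URI and warehouse directory below are placeholder values, not real service values; in practice you would read them from your own service's detail page:

```shell
# Placeholder values; in practice, read them from your service, e.g.:
#   gcloud metastore services describe SERVICE --location=LOCATION \
#       --format="value(endpointUri)"
ENDPOINT_URI="thrift://example-metastore:9083"
WAREHOUSE_DIR="gs://example-bucket"

# Hive properties string to pass to: gcloud dataproc clusters create --properties=...
PROPERTIES="hive:hive.metastore.uris=${ENDPOINT_URI},hive:hive.metastore.warehouse.dir=${WAREHOUSE_DIR}/hive-warehouse"
echo "$PROPERTIES"
```

Both properties travel in a single --properties flag, separated by a comma, with each key prefixed by the hive: file prefix.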
A cluster may also use a subnet of the metastore's network. To create a cluster using a network or subnetwork from the network project, you must configure shared network permissions.
Attaching a Dataproc cluster that uses the Dataproc Metastore service
After you create a service, you can create and attach a Dataproc cluster that uses the service as its Hive metastore by using the Google Cloud Console, the gcloud tool, or the Dataproc API.
Setting up a cross-project deployment
Cross-project deployments where the Dataproc cluster and the Dataproc Metastore service are in separate projects require a permissions setup. You do not need to perform this setup for cases where the Dataproc cluster and the Dataproc Metastore service share a project while the network is in its own network project.
After configuring the network permissions, grant the Dataproc Metastore Viewer role in the metastore project to the Dataproc Service Agent of the cluster project. The Dataproc Service Agent account is in the format service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com, where PROJECT_NUMBER is the project number of the cluster project.
You'll need the project number of the cluster project. To find the project number:
Navigate to the IAM & Admin Settings tab.
From the project list at the top of the page, select the project you'll use to create the Dataproc cluster.
Note the project number.
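With the project number in hand, the Dataproc Service Agent email can be constructed. A sketch with a placeholder project number (the service agent format shown is the standard Dataproc one):

```shell
# Placeholder project number; find yours on the IAM & Admin Settings page,
# or with: gcloud projects describe PROJECT_ID --format="value(projectNumber)"
CLUSTER_PROJECT_NUMBER=123456789012

# Dataproc Service Agent account of the cluster project
SERVICE_AGENT="service-${CLUSTER_PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com"
echo "$SERVICE_AGENT"
```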
Configure the permissions:
Navigate to the IAM tab.
From the project list at the top of the page, select the metastore project.
Enter the service account in the New Members field.
From the Roles menu, select Dataproc Metastore > Dataproc Metastore Viewer.
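The same grant can also be made from the command line. A sketch using placeholder project and service-agent values, and assuming the role ID for Dataproc Metastore Viewer is roles/metastore.viewer; the script only prints the command rather than running it:

```shell
# Placeholder values; substitute your metastore project ID and the
# Dataproc Service Agent of your cluster project.
METASTORE_PROJECT="example-metastore-project"
SERVICE_AGENT="service-123456789012@dataproc-accounts.iam.gserviceaccount.com"

# Assumed role ID for Dataproc Metastore Viewer: roles/metastore.viewer
CMD="gcloud projects add-iam-policy-binding ${METASTORE_PROJECT} --member=serviceAccount:${SERVICE_AGENT} --role=roles/metastore.viewer"
echo "$CMD"
```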
You can now create a Dataproc cluster using the metastore project's Dataproc Metastore service and the network or subnetwork that the service is on.
Creating a Dataproc cluster
The following instructions demonstrate how to create and attach a Dataproc cluster by using the Google Cloud Console, the gcloud tool, or the Dataproc API.
In the Cloud Console, open the Dataproc Create a cluster page:
Enter a name in the Cluster Name field.
On the Region and Zone menus, select a region and zone for the cluster. You can select a distinct region to isolate resources and metadata storage locations within the specified region. If you select a distinct region, you can select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto zone placement).
Click on the Customize cluster tab.
In the Network configuration section, select the same network specified during the metastore service creation.
In the Dataproc Metastore section, select your metastore service. If you haven't created one yet you can select Create New Service.
Click Create to create the cluster.
Your new cluster appears in the Clusters list. Cluster status is listed as "Provisioning" until the cluster is ready to use, then changes to "Running."
Use the following
gcloud dataproc clusters create command to create a cluster:
gcloud dataproc clusters create CLUSTER_NAME \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE \
    --region=LOCATION

Replace the following:

- CLUSTER_NAME: the name of the new cluster.
- PROJECT_ID: the project ID of the project you created your Dataproc Metastore service in.
- LOCATION: the same region you specified above for the Dataproc Metastore service.
- SERVICE: the name of the Dataproc Metastore service you're attaching to the cluster.
Follow the API instructions to create a cluster by using the APIs Explorer.
Attaching a self-managed cluster that uses the Dataproc Metastore service
After you create a service, you can also choose to attach a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the service as its Hive metastore by setting the following properties in the client config:

hive.metastore.uris=ENDPOINT_URI
hive.metastore.warehouse.dir=WAREHOUSE_DIR

Replace the following:
ENDPOINT_URI: The Hive metastore endpoint URI used to access the metastore service.
To find the endpoint URI value to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the URL value that starts with thrift://.
WAREHOUSE_DIR: Refers to the Hive warehouse directory set in the metastore's config overrides. It takes the form of a Cloud Storage path, such as gs://BUCKET_NAME/DIRECTORY.
To find the warehouse directory to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the hive.metastore.warehouse.dir value under Metastore config overrides.
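For a self-managed Hive cluster, these client config settings live in hive-site.xml. A sketch of the fragment, using placeholder values; substitute your own service's endpoint URI and warehouse directory from the Service detail page:

```xml
<!-- hive-site.xml fragment; the values below are placeholders -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://example-metastore:9083</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>gs://example-bucket/hive-warehouse</value>
</property>
```

Spark and Presto clients accept the same two property names through their respective Hive connector configuration.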
- Work through the quickstart guide.
- Learn more about Dataproc Metastore.
- Learn more about Dataproc.