Attaching a Dataproc or self-managed cluster

After you create a Dataproc Metastore service, you can create and attach a Dataproc cluster that uses the service as its Hive metastore.

You can also attach a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the Dataproc Metastore service as its Hive metastore by modifying the client config.

Before you begin

  • You must create the Dataproc cluster in the same region as the Dataproc Metastore service for optimal network connectivity.

  • The Dataproc image and Dataproc Metastore versions must be compatible. For example, the Dataproc 2.0 image uses Hive version 3.1.2, and therefore supports only Dataproc Metastore services that were also created with Hive version 3.1.2.

  • If you're working with a cross-project deployment, you must set up additional permissions before creating a Dataproc cluster. A cross-project deployment can consist of two or three projects: the Dataproc cluster in a cluster project, the Dataproc Metastore service in a metastore project, and the network in either of those two projects or in its own network project. It's also possible for the Dataproc cluster and the Dataproc Metastore service to share a project while the network is in its own network project.

    The following diagram provides an overview of the possible project configurations when deploying a Dataproc Metastore service and a Dataproc cluster:

    [Diagram: overview of the possible project configurations when deploying a Dataproc Metastore service and a Dataproc cluster]

  • To use any Dataproc Metastore service, including one in the same project as the Dataproc cluster, the cluster and the metastore must be on the same network.

    • A cluster can also use a subnet of the metastore's network. To create a cluster using a network or subnetwork from the network project, you must configure shared network permissions, as shown in the sketch after this list.
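
As a hedged illustration, configuring shared network permissions typically means granting the cluster project's service accounts the Compute Network User role in the network project. A minimal sketch, where NETWORK_PROJECT_ID and CLUSTER_PROJECT_NUMBER are placeholders; verify the exact set of accounts your deployment requires:

 # Allow the cluster project's Dataproc Service Agent to use the
 # shared network (placeholder project IDs).
 gcloud projects add-iam-policy-binding NETWORK_PROJECT_ID \
    --member="serviceAccount:service-CLUSTER_PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"

 # The cluster project's Google APIs service account typically needs
 # the same role.
 gcloud projects add-iam-policy-binding NETWORK_PROJECT_ID \
    --member="serviceAccount:CLUSTER_PROJECT_NUMBER@cloudservices.gserviceaccount.com" \
    --role="roles/compute.networkUser"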

Attaching a Dataproc cluster that uses the Dataproc Metastore service

After you create a service, you can create and attach a Dataproc cluster that uses the service as its Hive metastore by using the Google Cloud Console, the gcloud tool, or the Dataproc API.

Setting up a cross-project deployment

Cross-project deployments where the Dataproc cluster and the Dataproc Metastore service are in separate projects require a permissions setup. You do not need to perform this setup for cases where the Dataproc cluster and the Dataproc Metastore service share a project while the network is in its own network project.

After configuring the network permissions, grant the Dataproc Metastore Viewer role in the metastore project to the Dataproc Service Agent of the cluster project. The Dataproc Service Agent account is in the format service-CLUSTER_PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com, where CLUSTER_PROJECT_NUMBER is the project number of the cluster project.

Console

To find the project number:

  1. Navigate to the IAM & Admin Settings tab.

  2. From the project list at the top of the page, select the project you'll use to create the Dataproc cluster.

  3. Note the project number.
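
You can also look up the project number with the gcloud tool. A minimal sketch, where CLUSTER_PROJECT_ID is a placeholder for the ID of your cluster project:

 # Print the numeric project number for the cluster project.
 gcloud projects describe CLUSTER_PROJECT_ID \
    --format="value(projectNumber)"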

Configure the permissions:

  1. Navigate to the IAM tab.

  2. From the project list at the top of the page, select the metastore project.

  3. Click Add.

    1. Enter the service account in the New Members field.

    2. From the Roles menu, select Dataproc Metastore > Dataproc Metastore Viewer.

    3. Click Save.
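
You can also grant the role with the gcloud tool. A minimal sketch, assuming the Dataproc Metastore Viewer role maps to the role ID roles/metastore.viewer (verify the role ID in your environment); the placeholders match those used above:

 # Grant the cluster project's Dataproc Service Agent read access to
 # metastore metadata in the metastore project.
 gcloud projects add-iam-policy-binding METASTORE_PROJECT_ID \
    --member="serviceAccount:service-CLUSTER_PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/metastore.viewer"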

You can now create a Dataproc cluster that uses the metastore project's Dataproc Metastore service and the network or subnetwork that the service is on.

Creating a Dataproc cluster

The following instructions demonstrate how to create and attach a Dataproc cluster using the Google Cloud Console, the gcloud tool, or the Dataproc API.

Console

  1. In the Cloud Console, open the Dataproc Create a cluster page.

  2. In the Cluster Name field, enter a name for your cluster.

  3. On the Region and Zone menus, select a region and zone for the cluster. You can select a distinct region to isolate resources and metadata storage locations within that region. If you select a distinct region, you can select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto zone placement).

  4. Click on the Customize cluster tab.

  5. In the Network configuration section, select the same network specified during the metastore service creation.

  6. In the Dataproc Metastore section, select your metastore service. If you haven't created one yet, you can select Create New Service.

  7. Click Create to create the cluster.

Your new cluster appears in the Clusters list. Cluster status is listed as "Provisioning" until the cluster is ready to use, then changes to "Running."

gcloud

Use the following gcloud dataproc clusters create command to create a cluster:

 gcloud dataproc clusters create example-cluster \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/example-service \
    --region=LOCATION

Replace the following:

  • PROJECT_ID: The ID of the project in which you created your Dataproc Metastore service.

  • LOCATION: The region you specified for the Dataproc Metastore service. The cluster must be created in this same region.
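
To confirm that the cluster is attached to the service, you can inspect the cluster's metastore configuration. A minimal sketch, assuming the attached service is exposed under the config.metastoreConfig.dataprocMetastoreService field of the cluster resource:

 # Print the Dataproc Metastore service attached to the cluster.
 gcloud dataproc clusters describe example-cluster \
    --region=LOCATION \
    --format="value(config.metastoreConfig.dataprocMetastoreService)"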

REST

Follow the API instructions to create a cluster by using the APIs Explorer.
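
For reference, a minimal sketch of the equivalent clusters.create request, assuming the Dataproc v1 API carries the service name in the metastoreConfig field; the placeholders match the gcloud example above:

 curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "clusterName": "example-cluster",
          "config": {
            "metastoreConfig": {
              "dataprocMetastoreService": "projects/PROJECT_ID/locations/LOCATION/services/example-service"
            }
          }
        }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/LOCATION/clusters"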

Attaching a self-managed cluster that uses the Dataproc Metastore service

After you create a service, you can also choose to attach a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the service as its Hive metastore by setting the following in the client config:

hive.metastore.uris=endpoint_uri
hive.metastore.warehouse.dir=warehouse_dir

Replace the following:

  • endpoint_uri: The Hive metastore endpoint URI used to access the metastore service.

    To find the endpoint URI value to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the URL value that starts with thrift://.

  • warehouse_dir: The Hive warehouse directory, which is set in the service's metastore config overrides. It can follow the form gs://.*hive-warehouse.

    To find the warehouse directory to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the hive.metastore.warehouse.dir value under Metastore config overrides.
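
You can also retrieve both values with the gcloud tool. A minimal sketch, assuming the service resource exposes the endpointUri and hiveMetastoreConfig.configOverrides fields; example-service and LOCATION are placeholders:

 # Print the Thrift endpoint URI of the service.
 gcloud metastore services describe example-service \
    --location=LOCATION \
    --format="value(endpointUri)"

 # Print the metastore config overrides, including
 # hive.metastore.warehouse.dir.
 gcloud metastore services describe example-service \
    --location=LOCATION \
    --format="yaml(hiveMetastoreConfig.configOverrides)"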

[Screenshot: Service detail page showing the endpoint URI and the hive.metastore.warehouse.dir value]
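
As a hedged example, on a self-managed Spark cluster you could supply both properties as Hadoop configuration when launching a session; the host, port, and bucket below are placeholders:

 # Point Spark's embedded Hive client at the Dataproc Metastore service.
 # 9083 is the default Hive metastore Thrift port; substitute your
 # service's actual endpoint URI and warehouse directory.
 spark-shell \
    --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_HOST:9083 \
    --conf spark.hadoop.hive.metastore.warehouse.dir=gs://example-bucket/hive-warehouse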

What's next?