Attach a Dataproc or self-managed cluster

After you create a Dataproc Metastore service, you can attach either a Dataproc cluster or a self-managed cluster (Apache Hive, Apache Spark, or Presto) to use the service as its Hive metastore.

Before you begin

Required roles

To get the permissions that you need to create a Dataproc Metastore service and a Dataproc cluster, ask your administrator to grant you the following IAM roles:

  • To grant full control of Dataproc Metastore resources, either:
    • Dataproc Metastore Editor (roles/metastore.editor) on the user account or service account
    • Dataproc Metastore Admin (roles/metastore.admin) on the user account or service account
  • To create a Dataproc cluster: Dataproc Worker (roles/dataproc.worker) on the Dataproc VM service account
  • To grant read and write permissions to the Hive warehouse directory: Storage Object Admin (roles/storage.objectAdmin) on the Dataproc VM service account
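
For example, you can grant these roles with gcloud projects add-iam-policy-binding. The following is a minimal sketch; the project, user, and service account names are hypothetical, and it assumes your Dataproc VMs run as the Compute Engine default service account:

# Hypothetical values; substitute your own project, user, and
# Dataproc VM service account.
gcloud projects add-iam-policy-binding my-project \
    --member="user:alice@example.com" \
    --role="roles/metastore.editor"

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"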

For more information about granting roles, see Manage access.

These predefined roles contain the permissions required to create a Dataproc Metastore service and a Dataproc cluster. To see the exact permissions that are required, see the following Required permissions section:

Required permissions

  • To create a Dataproc Metastore: metastore.services.create on the user account or service account
  • To create a Dataproc cluster: dataproc.clusters.create on the user account or service account
  • To access the Hive warehouse directory: orgpolicy.policy.get, resourcemanager.projects.get, resourcemanager.projects.list, storage.objects.*, and storage.multipartUploads.* on the Dataproc VM service account

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about specific Dataproc Metastore roles and permissions, see Manage access with IAM.

Dataproc clusters

You can create and attach a Dataproc cluster that uses the Dataproc Metastore service as its Hive metastore.

For more information about the different ways you can set up a Dataproc cluster, see Project configurations.

Considerations

  • If you're using a Dataproc Metastore service with a Thrift endpoint:

    • The Dataproc cluster and the metastore must be on the same network. A cluster may also use a subnet of the Dataproc Metastore service's network.
    • If the Dataproc cluster belongs to a different project than the network, you must configure shared network permissions.
  • If you're working with a cross-project deployment, you must set up additional permissions before creating a Dataproc cluster.

  • If you are using Dataproc Personal Cluster Authentication, your metastore must use the gRPC endpoint protocol.
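
If you're unsure which endpoint protocol your service uses, you can inspect the service with gcloud. This is a sketch with a hypothetical service name and location, and it assumes the protocol is exposed as the hiveMetastoreConfig.endpointProtocol field:

# Prints THRIFT or GRPC for the hypothetical service "my-metastore".
gcloud metastore services describe my-metastore \
    --location=us-central1 \
    --format="value(hiveMetastoreConfig.endpointProtocol)"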

Create a Dataproc cluster and attach a metastore

The following instructions show you how to create a Dataproc cluster and connect it to a Dataproc Metastore service.

  • For optimal network connectivity, create the Dataproc cluster in the same region as your Dataproc Metastore service.

  • The Dataproc image and Dataproc Metastore Hive version must be compatible. For more information, see Dataproc Image version list.

Console

  1. In the Google Cloud console, open the Dataproc Create a cluster page.

  2. In the Cluster Name field, enter a name for your cluster.

  3. On the Region and Zone menus, select a region and zone for the cluster. You can select a distinct region to isolate resources and metadata storage locations within that region. If you select a distinct region, you can select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto zone placement).

  4. Click on the Customize cluster tab.

  5. In the Network configuration section, select the same network specified during the metastore service creation.

  6. In the Dataproc Metastore section, select your metastore service. If you haven't created one yet, you can select Create New Service.

  7. Click Create to create the cluster.

Your new cluster appears in the Clusters list. Cluster status is listed as "Provisioning" until the cluster is ready to use. Its status then changes to "Running."

gcloud

Run the following gcloud dataproc clusters create command to create a cluster:

 gcloud dataproc clusters create CLUSTER_NAME \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE \
    --region=LOCATION
 

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • PROJECT_ID: the project ID of the project you created your Dataproc Metastore service in.
  • LOCATION: the same region you specified for the Dataproc Metastore service.
  • SERVICE: the Dataproc Metastore service name.
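
For example, with hypothetical names filled in, the command might look like this:

gcloud dataproc clusters create my-cluster \
    --dataproc-metastore=projects/my-project/locations/us-central1/services/my-metastore \
    --region=us-central1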

REST

Follow the API instructions to create a cluster by using the API Explorer.
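
As a rough sketch, the equivalent clusters.create request attaches the service through the cluster's metastoreConfig field. The project, region, and service names below are hypothetical, and the request body is trimmed to the relevant field:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters" \
    -d '{
      "clusterName": "my-cluster",
      "config": {
        "metastoreConfig": {
          "dataprocMetastoreService": "projects/my-project/locations/us-central1/services/my-metastore"
        }
      }
    }'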

Attach a Dataproc cluster using the ENDPOINT_URI and WAREHOUSE_DIR

You can attach a Dataproc cluster using the ENDPOINT_URI and WAREHOUSE_DIR properties. This is useful if your Dataproc Metastore service uses Private Service Connect or if you want to attach the auxiliary version of your Dataproc Metastore service.

For more information on these properties and where to find them, see Attach a self-managed cluster.

There are two ways you can attach a Dataproc cluster using the ENDPOINT_URI and WAREHOUSE_DIR properties:

  • Option 1: Set the following Hive properties when you create the Dataproc cluster (a complete end-to-end example follows this list).

    1. Run the following gcloud dataproc clusters create command:

         gcloud dataproc clusters create CLUSTER_NAME \
             --region=LOCATION \
             --properties="hive:hive.metastore.uris=ENDPOINT_URI,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR"
      
  • Option 2: Update the hive-site.xml on the Dataproc cluster with the endpoint URI listed in NetworkConfig.

    1. SSH into the Dataproc cluster's master instance and perform the following:

      1. Modify /etc/hive/conf/hive-site.xml on the Dataproc cluster:

        <property>
          <name>hive.metastore.uris</name>
          <!-- Update this value. -->
          <value>ENDPOINT_URI</value>
        </property>
        <!-- Add this property entry. -->
        <property>
          <name>hive.metastore.warehouse.dir</name>
          <value>WAREHOUSE_DIR</value>
        </property>
        
      2. Restart HiveServer2:

        sudo systemctl restart hive-server2.service
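
Putting Option 1 together, a minimal end-to-end sketch might look like the following. The endpoint URI and warehouse bucket are hypothetical values you would read from your service's detail page:

# Hypothetical values copied from the service's detail page.
ENDPOINT_URI=thrift://10.1.2.3:9083
WAREHOUSE_DIR=gs://my-bucket/hive-warehouse

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties="hive:hive.metastore.uris=${ENDPOINT_URI},hive:hive.metastore.warehouse.dir=${WAREHOUSE_DIR}"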
        

Set up a cross-project deployment for Dataproc clusters

A cross-project deployment can consist of two to three projects, with the Dataproc cluster in a cluster project, the Dataproc Metastore service in a metastore project, and the network in either of those two projects or in its own network project. It's also possible for the Dataproc cluster and the Dataproc Metastore service to share a project while the network is in its own network project.

Cross-project permissions

Cross-project deployments where the Dataproc cluster and the Dataproc Metastore service are in separate projects require you to grant additional permissions. You don't need to perform this setup for cases where the Dataproc cluster and the Dataproc Metastore service share a project while the network is in its own network project.

After configuring the network permissions, you must grant the Dataproc Metastore Viewer role in the metastore project to the Dataproc Service Agent of the cluster project. The Dataproc Service Agent account is in the format service-<cluster-project-number>@dataproc-accounts.iam.gserviceaccount.com. You must reference the project number of the cluster project.
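
For example, you can grant the role with gcloud. The project ID and project number below are hypothetical:

# Grant Dataproc Metastore Viewer in the metastore project to the
# cluster project's Dataproc Service Agent.
gcloud projects add-iam-policy-binding metastore-project \
    --member="serviceAccount:service-123456789@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/metastore.viewer"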

Console

To find the project number:

  1. Navigate to the IAM & Admin Settings tab.

  2. From the project list at the top of the page, select the project you'll use to create the Dataproc cluster.

  3. Note the project number.

Configure the permissions:

  1. Navigate to the IAM tab.

  2. From the project list at the top of the page, select the metastore project.

  3. Click Add.

    1. Enter the service account in the New Principals field.

    2. From the Roles menu, select Dataproc Metastore > Dataproc Metastore Viewer.

    3. Click Add.

You can now create a Dataproc cluster that uses the metastore project's Dataproc Metastore service and the network or subnetwork that the service is on.

If you are using a VPC network that belongs to a different project than the service, you must provide the network's entire relative resource name when you run gcloud metastore services create. For example:

gcloud metastore services create SERVICE \
  --network=projects/HOST_PROJECT/global/networks/NETWORK_ID

Self-managed clusters

After you create a service, you can attach a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the service as its Hive metastore.

Attach a self-managed cluster

Set the following values in your client config file:

hive.metastore.uris=ENDPOINT_URI
hive.metastore.warehouse.dir=WAREHOUSE_DIR

Replace the following:

  • ENDPOINT_URI: The Hive metastore endpoint URI used to access the metastore service.

    To find the endpoint URI value to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the URL value that starts with thrift://.

  • WAREHOUSE_DIR: The Hive warehouse directory, which appears in the service's metastore config overrides. It follows the form gs://.*hive-warehouse.

    To find the warehouse directory to use, click the service name of your service on the Dataproc Metastore page. This brings you to the Service detail page for that service, where you can use the hive.metastore.warehouse.dir value under Metastore config overrides.

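If you prefer the command line, you can also read both values from the service with gcloud. This sketch uses a hypothetical service name and location, and it assumes the values are exposed as the endpointUri field and the hiveMetastoreConfig.configOverrides map:

# Print the Thrift endpoint URI and the config overrides, which
# include hive.metastore.warehouse.dir.
gcloud metastore services describe my-metastore \
    --location=us-central1 \
    --format="yaml(endpointUri, hiveMetastoreConfig.configOverrides)"

With hypothetical values filled in, the client config might look like this:

hive.metastore.uris=thrift://10.1.2.3:9083
hive.metastore.warehouse.dir=gs://my-bucket/hive-warehouse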

What's next