Attach a Dataproc cluster or a self-managed cluster

After you create a Dataproc Metastore service, you can attach any of the following services:

  • Dataproc clusters
  • Self-managed clusters, such as an Apache Hive instance, an Apache Spark instance, or a Presto cluster

After you connect one of these services, it uses your Dataproc Metastore service as its Hive metastore during query execution.

Before you begin

Required roles

To get the permissions that you need to create a Dataproc Metastore and a Dataproc cluster, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to create a Dataproc Metastore and a Dataproc cluster. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create a Dataproc Metastore and a Dataproc cluster:

  • To create a Dataproc Metastore: metastore.services.create on the user account or service account
  • To create a Dataproc cluster: dataproc.clusters.create on the user account or service account
  • To access the Hive warehouse directory: orgpolicy.policy.get, resourcemanager.projects.get, resourcemanager.projects.list, storage.objects.*, and storage.multipartUploads.* on the Dataproc VM service account

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about specific Dataproc Metastore roles and permissions, see Manage access with IAM.
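
For example, an administrator can grant predefined roles that carry the create permissions listed above. The following is a minimal gcloud CLI sketch; it assumes that the Dataproc Metastore Editor and Dataproc Editor roles fit your setup, and USER_EMAIL and PROJECT_ID are placeholders:

# Grant a role that includes metastore.services.create.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/metastore.editor"

# Grant a role that includes dataproc.clusters.create.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/dataproc.editor"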

Dataproc clusters

Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

Considerations

Before you create and attach a Dataproc cluster, check what endpoint protocol your Dataproc Metastore service is using. This protocol defines how your Hive Metastore clients access metadata stored in your Dataproc Metastore. This choice can also affect the features that you can integrate and use with your service.
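
If you're not sure which protocol your service uses, you can check with the gcloud CLI. This is a minimal sketch; it assumes that your service exposes the protocol in the hiveMetastoreConfig.endpointProtocol field:

# Print the endpoint protocol (THRIFT or GRPC) of an existing service.
gcloud metastore services describe SERVICE \
    --location=LOCATION \
    --format="value(hiveMetastoreConfig.endpointProtocol)"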

Apache Thrift

If you use the Apache Thrift endpoint protocol, consider the following network requirements:

  • By default, create your Dataproc cluster and your Dataproc Metastore service on the same network. Your Dataproc cluster can also use a subnet of the Dataproc Metastore service's network.

  • If your Dataproc cluster belongs to a different project than the network, you must configure shared network permissions (see the sketch after this list).

  • If your Dataproc cluster belongs to a different project than your Dataproc Metastore service, you must set up additional permissions before creating a Dataproc cluster.
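
The following is a hedged sketch of the shared network permissions mentioned earlier in this list. It assumes a Shared VPC setup and that granting the Compute Network User role on the shared subnetwork is sufficient for your configuration; the service agent address is illustrative for your project number:

# In the Shared VPC host project, let the service project's Dataproc
# service agent use the shared subnetwork.
gcloud compute networks subnets add-iam-policy-binding SUBNET_NAME \
    --project=HOST_PROJECT_ID \
    --region=REGION \
    --member="serviceAccount:service-SERVICE_PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"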

gRPC

If you use the gRPC endpoint protocol, consider the following network requirements:

Create a cluster and attach a Dataproc Metastore

The following instructions show you how to create a Dataproc cluster and connect the cluster to a Dataproc Metastore service. These instructions assume that you've already created a Dataproc Metastore service.

  • Before creating your Dataproc cluster, make sure the Dataproc image you choose is compatible with the Hive metastore version you selected when you created your Dataproc Metastore service (to look up that version, see the sketch after this list). For more information, see the Dataproc image version list.
  • To optimize network connectivity, create the Dataproc cluster in the same region as your Dataproc Metastore service.
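
If you need to look up the service's Hive metastore version first, the following minimal sketch assumes that the version is recorded in the service's hiveMetastoreConfig:

# Print the Hive metastore version of an existing service.
gcloud metastore services describe SERVICE \
    --location=LOCATION \
    --format="value(hiveMetastoreConfig.version)"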

Console

  1. In the Google Cloud console, open the Dataproc Create a cluster page:

    Open Create a cluster

  2. In the Cluster Name field, enter a name for your cluster.

  3. For the Region and Zone menus, select the same region that you created your Dataproc Metastore service in. You can choose any Zone.

  4. Click the Customize cluster tab.

  5. In the Network configuration section, select the same network that you created your Dataproc Metastore service in.

  6. In the Dataproc Metastore section, select the Dataproc Metastore service you want to attach. If you haven't created one yet, you can select Create New Service.

  7. Optional: If your Dataproc Metastore service uses the gRPC endpoint protocol:

    1. Click the Manage Security tab.
    2. In the Project Access section, select Enables the cloud-platform scope for this cluster.
  8. Configure the remaining service options as needed.

  9. To create the cluster, click Create.

    Your new cluster appears in the Clusters list. The cluster status is listed as Provisioning until the cluster is ready to use. When it's ready for use, the status changes to Running.

gcloud CLI

To create a cluster and attach a Dataproc Metastore, run the following gcloud dataproc clusters create command:

gcloud dataproc clusters create CLUSTER_NAME \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE \
    --region=LOCATION \
    --scopes=SCOPES

Replace the following:

  • CLUSTER_NAME: the name of your new Dataproc cluster.
  • PROJECT_ID: the project ID of the project you created your Dataproc Metastore service in.
  • LOCATION: the same region you created your Dataproc Metastore service in.
  • SERVICE: the name of the Dataproc Metastore service that you're attaching to the cluster.
  • SCOPES: (Optional) If your Dataproc Metastore service uses the gRPC endpoint protocol, use cloud-platform.
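
For example, the following call fills in the placeholders with illustrative values (my-cluster, my-project, us-central1, and my-metastore-service are not names from your project) and includes the cloud-platform scope for a gRPC service:

gcloud dataproc clusters create my-cluster \
    --dataproc-metastore=projects/my-project/locations/us-central1/services/my-metastore-service \
    --region=us-central1 \
    --scopes=cloud-platform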

REST

Follow the API instructions to create a cluster by using the APIs Explorer.
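
The following is a minimal sketch of the equivalent clusters.create request, assuming the v1 Dataproc API and that the service is attached through the cluster's metastoreConfig field; all uppercase values are placeholders:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/LOCATION/clusters" \
    -d '{
      "projectId": "PROJECT_ID",
      "clusterName": "CLUSTER_NAME",
      "config": {
        "metastoreConfig": {
          "dataprocMetastoreService": "projects/PROJECT_ID/locations/LOCATION/services/SERVICE"
        }
      }
    }'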

Attach a cluster using Dataproc cluster properties

You can also attach a Dataproc cluster to a Dataproc Metastore using Dataproc properties. These properties include the Dataproc Metastore ENDPOINT_URI and WAREHOUSE_DIR.

Use these instructions if your Dataproc Metastore service uses Private Service Connect or if you want to attach a Dataproc cluster to the auxiliary version of your Dataproc Metastore service.

There are two ways that you can attach a Dataproc cluster using the ENDPOINT_URI and WAREHOUSE_DIR properties:

Option 1: While creating a Dataproc cluster

When you create a Dataproc cluster, set the following Hive configuration values with the --properties flag.

gcloud dataproc clusters create CLUSTER_NAME \
     --properties="hive:hive.metastore.uris=ENDPOINT_URI,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR/hive-warehouse"

Replace the following:

  • CLUSTER_NAME: the name of your new Dataproc cluster.
  • ENDPOINT_URI: The endpoint URI of your Dataproc Metastore service.
  • WAREHOUSE_DIR: The location of your Hive warehouse directory.
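
If you don't have these values at hand, you can read both from the service itself. A minimal sketch, assuming the default layout in which the warehouse directory lives under the service's artifacts bucket (which is why the command above appends /hive-warehouse):

# Print the Thrift endpoint URI and the artifacts bucket of the service.
gcloud metastore services describe SERVICE \
    --location=LOCATION \
    --format="value(endpointUri,artifactGcsUri)"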

Option 2: Update the hive-site.xml file

You can also attach a Dataproc cluster by directly modifying the cluster's hive-site.xml file.

  1. Use SSH to connect to your Dataproc cluster's master node (the VM whose name ends in -m).
  2. Open the /etc/hive/conf/hive-site.xml file and modify the following lines:

    <property>
       <name>hive.metastore.uris</name>
       <!-- Update this value. -->
       <value>ENDPOINT_URI</value>
    </property>
    <!-- Add this property entry. -->
    <property>
       <name>hive.metastore.warehouse.dir</name>
       <value>WAREHOUSE_DIR</value>
    </property>
    

    Replace the following:

      • ENDPOINT_URI: the endpoint URI of your Dataproc Metastore service.
      • WAREHOUSE_DIR: the location of your Hive warehouse directory.

  3. Restart HiveServer2:

    sudo systemctl restart hive-server2.service
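
    After the restart, you can verify that the cluster reads metadata through the attached service. A minimal sketch, assuming Beeline is installed on the master node and HiveServer2 listens on its default local port:

    # Connect to HiveServer2 and list the databases served by the metastore.
    beeline -u "jdbc:hive2://localhost:10000" -e "SHOW DATABASES;"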
    

Self-managed clusters

A self-managed cluster can be an Apache Hive instance, an Apache Spark instance, or a Presto cluster.

Attach a self-managed cluster

Set the following values in your client configuration file:

hive.metastore.uris=ENDPOINT_URI
hive.metastore.warehouse.dir=WAREHOUSE_DIR

Replace the following:

  • ENDPOINT_URI: the endpoint URI of your Dataproc Metastore service.
  • WAREHOUSE_DIR: the location of your Hive warehouse directory.
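
For example, on a self-managed Apache Spark installation you can pass the same two settings as Hadoop configuration overrides instead of editing hive-site.xml. A minimal sketch; the host and port in the endpoint URI are illustrative:

spark-shell \
    --conf spark.hadoop.hive.metastore.uris=thrift://METASTORE_HOST:9083 \
    --conf spark.hadoop.hive.metastore.warehouse.dir=WAREHOUSE_DIR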

What's next