Set up a multi-regional Dataproc Metastore service

This page shows you how to set up a multi-regional Dataproc Metastore service. For more information about how multi-regional Dataproc Metastore services work, see Dataproc Metastore regions.

Before you begin

Required roles

To get the permission that you need to create a multi-regional Dataproc Metastore service, ask your administrator to grant you the following IAM roles on your project, based on the principle of least privilege:

For more information about granting roles, see Manage access.

This predefined role contains the metastore.services.create permission, which is required to create a multi-regional Dataproc Metastore service.

You might also be able to get this permission with custom roles or other predefined roles.

For more information about specific Dataproc Metastore roles and permissions, see Manage access with IAM.

About multi-regional Dataproc Metastore services

Multi-regional Dataproc Metastore services store your data in two different regions and use the two regions to run your workloads. For example, the multi-region nam7 contains the us-central1 and us-east4 regions.

  • A multi-regional Dataproc Metastore service replicates metadata across two regions and exposes the relevant endpoints to access the Hive Metastore. For gRPC, one endpoint per region is exposed. For Thrift, one endpoint per subnetwork is exposed.

  • A multi-regional Dataproc Metastore service provides an active-active high availability (HA) cluster configuration. This configuration means that workloads can access either region when running jobs. It also provides a failover mechanism for your service. For example, if your primary regional endpoint goes down, your workloads are automatically routed to the secondary region. This helps prevent disruptions to your Dataproc jobs.

Considerations

The following considerations apply to multi-regional Dataproc Metastore services.

Create a multi-regional Dataproc Metastore service

Choose one of the following tabs to learn how to create a multi-regional service using either the Thrift or gRPC endpoint protocol with a Dataproc Metastore 2 service.

gRPC

When creating a multi-regional service that uses the gRPC endpoint protocol, you don't have to set any specific network settings. The gRPC protocol handles the network routing for you.

Console

  1. In the Google Cloud console, go to the Dataproc Metastore page.

    Go to Dataproc Metastore

  2. In the navigation bar, click +Create.

    The Create Metastore service dialog opens.

  3. Select Dataproc Metastore 2.

  4. In the Pricing and Capacity section, select Enterprise Plus - Dual region.

  5. For the Endpoint protocol, select gRPC.

  6. To create and start the service, click Submit.

    Your new metastore service appears on the Dataproc Metastore page. The status displays Creating until the service is ready to use. When it's ready, the status changes to Active. Provisioning the service might take a few minutes.

gcloud CLI

To create a multi-regional Dataproc Metastore service, run the following gcloud metastore services create command. This command creates a service that uses Dataproc Metastore version 3.1.2.

gcloud metastore services create SERVICE \
  --location=MULTI_REGION \
  { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
  --endpoint-protocol=grpc

Replace the following:

  • SERVICE: the name of your Dataproc Metastore service.
  • MULTI_REGION: the multi-region that you're creating your Dataproc Metastore service in.
  • INSTANCE_SIZE: the instance size of your multi-regional Dataproc Metastore. For example, small, medium or large. If you specify a value for INSTANCE_SIZE, don't specify a value for SCALING_FACTOR.
  • SCALING_FACTOR: the scaling factor of your Dataproc Metastore service. For example, 0.1. If you specify a value for SCALING_FACTOR, don't specify a value for INSTANCE_SIZE.
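
For example, the following command (with illustrative names and values rather than required ones) creates a gRPC-based service named example-grpc-service in the nam7 multi-region with a scaling factor of 0.1:

gcloud metastore services create example-grpc-service \
  --location=nam7 \
  --scaling-factor=0.1 \
  --endpoint-protocol=grpc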

Thrift

When creating a multi-regional service that uses the Thrift endpoint protocol, you must set the appropriate subnetwork settings. In this case, for each VPC network you are using, you must provide at least one subnetwork from each region.

For example, to create the nam7 multi-region, you must provide both the us-central1 and us-east4 regions.
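
If you don't already have a subnetwork in each of those regions, you can add them to an existing VPC network before creating the service. The following sketch uses hypothetical network, subnet, and IP range values:

gcloud compute networks subnets create dpms-subnet-central \
  --network=my-vpc \
  --region=us-central1 \
  --range=10.0.0.0/24

gcloud compute networks subnets create dpms-subnet-east \
  --network=my-vpc \
  --region=us-east4 \
  --range=10.0.1.0/24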

Console

  1. In the Google Cloud console, go to the Dataproc Metastore page.

    Go to Dataproc Metastore

  2. In the navigation bar, click +Create.

    The Create Metastore service dialog opens.

  3. Select Dataproc Metastore 2.

  4. In the Pricing and Capacity section, select Enterprise Plus - Dual region.

    For more information, see pricing plans and scaling configurations.

  5. In the Service name field, enter a unique name for your service.

    For information on naming conventions, see Resource naming convention.

  6. For the Endpoint protocol, select Thrift.

  7. For Network Config, provide the subnetworks that form your chosen multi-regional configuration.

  8. For the remaining service configuration options, use the provided defaults.

  9. To create and start the service, click Submit.

    Your new metastore service appears on the Dataproc Metastore page. The status displays Creating until the service is ready to use. When it's ready, the status changes to Active. Provisioning the service might take a few minutes.

gcloud CLI

To create a multi-regional Dataproc Metastore service, run the following gcloud metastore services create command. This command creates a service that uses Dataproc Metastore version 3.1.2.

gcloud metastore services create SERVICE \
  --location=MULTI_REGION \
  --consumer-subnetworks="projects/PROJECT_ID/regions/LOCATION1/subnetworks/SUBNET1,projects/PROJECT_ID/regions/LOCATION2/subnetworks/SUBNET2" \
  { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
  --endpoint-protocol=thrift

Or you can store your network settings in a file, as shown in the following command.

gcloud metastore services create SERVICE \
  --location=MULTI_REGION \
  --network-config-from-file=NETWORK_CONFIG_FROM_FILE \
  { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
  --endpoint-protocol=thrift

Replace the following:

  • SERVICE: the name of your Dataproc Metastore service.
  • MULTI_REGION: the multi-region that you're creating your Dataproc Metastore service in.
  • PROJECT_ID: the Google Cloud project ID that you're creating your Dataproc Metastore service in.
  • SUBNET1,SUBNET2: a list of subnetworks that form a multi-regional configuration. You can use the ID, fully qualified URL, or relative name of the subnetwork. You can specify up to six subnetworks.
  • LOCATION1,LOCATION2: a list of locations that form a multi-regional configuration. You can use the ID of the location. For example, for a nam7 multi-region, you use us-central1 and us-east4.
  • NETWORK_CONFIG_FROM_FILE: the path to a YAML file containing your network configuration. A sample file sketch follows this list.
  • INSTANCE_SIZE: the instance size of your multi-regional Dataproc Metastore. For example, small, medium or large. If you specify a value for INSTANCE_SIZE, don't specify a value for SCALING_FACTOR.
  • SCALING_FACTOR: the scaling factor of your Dataproc Metastore service. For example, 0.1. If you specify a value for SCALING_FACTOR, don't specify a value for INSTANCE_SIZE.
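
The file that NETWORK_CONFIG_FROM_FILE points to mirrors the service's network configuration. A minimal sketch, assuming the same consumers/subnetwork structure shown in the REST section that follows (the project and subnet names are hypothetical):

# network-config.yaml (hypothetical project and subnet names)
consumers:
- subnetwork: projects/example-project/regions/us-central1/subnetworks/dpms-subnet-central
- subnetwork: projects/example-project/regions/us-east4/subnetworks/dpms-subnet-east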

REST

To learn how to create a multi-regional Dataproc Metastore service, follow the instructions to create a service by using the Google APIs Explorer.

To configure a multi-regional service, provide the following information in the Network Config object.

  "network_config": {
    "consumers": [
        {"subnetwork": "projects/PROJECT_ID/regions/LOCATION/subnetworks/SUBNET1"},
        {"subnetwork": "projects/PROJECT_ID/regions/LOCATION/subnetworks/SUBNET2"}
    ],
    "scaling_config": {
    "scaling_factor": SCALING_FACTOR
    }
  }

Replace the following:

  • PROJECT_ID: the Google Cloud project ID of the project that contains your Dataproc Metastore service.
  • LOCATION1,LOCATION2: the Google Cloud regions that form your multi-regional configuration. For example, for the nam7 multi-region, use us-central1 and us-east4.
  • SUBNET1,SUBNET2: a list of subnetworks that form a multi-regional configuration. You can use the ID, fully qualified URL, or relative name of the subnetwork. You can specify up to five subnetworks.
  • SCALING_FACTOR: the scaling factor that you want to use for the service.
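
If you prefer to call the REST API directly instead of using the APIs Explorer, the following is a hedged sketch of the create request. The v1 services.create path and the serviceId query parameter are assumptions based on the standard API pattern; verify them and the full request body against the API reference:

# request.json contains the network_config block shown above plus any other service fields you need.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/MULTI_REGION/services?serviceId=SERVICE_ID"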

Connect Dataproc Metastore to a Dataproc cluster

Choose one of the following tabs to learn how to connect a Dataproc cluster to a multi-regional Dataproc Metastore service.

gRPC

To connect a Dataproc cluster, choose the tab that corresponds with the version of Dataproc Metastore that you're using.

Dataproc Metastore 3.1.2

  1. Create the following variables for your Dataproc cluster:

    CLUSTER_NAME=CLUSTER_NAME
    PROJECT_ID=PROJECT_ID
    MULTI_REGION=MULTI_REGION
    DATAPROC_IMAGE_VERSION=DATAPROC_IMAGE_VERSION
    PROJECT=PROJECT
    SERVICE_ID=SERVICE_ID
    

    Replace the following:

    • CLUSTER_NAME: the name of your Dataproc cluster.
    • PROJECT_ID: the Google Cloud project that contains your Dataproc cluster. Make sure that the subnet you're using has the appropriate permissions to access this project.
    • MULTI_REGION: the Google Cloud multi-region that you want to create your Dataproc cluster in.
    • DATAPROC_IMAGE_VERSION: the Dataproc image version that you're using with your Dataproc Metastore service. You must use an image version of 2.0 or higher.
    • PROJECT: the project that contains your Dataproc Metastore service.
    • SERVICE_ID: the service ID of your Dataproc Metastore service.
  2. To create your cluster, run the following gcloud dataproc clusters create command. The --enable-kerberos flag is optional; include it only if you're using Kerberos with your cluster.

    gcloud dataproc clusters create ${CLUSTER_NAME} \
     --project ${PROJECT_ID} \
     --region ${MULTI_REGION} \
     --image-version ${DATAPROC_IMAGE_VERSION} \
     --scopes "https://www.googleapis.com/auth/cloud-platform" \
     --dataproc-metastore projects/${PROJECT}/locations/${MULTI_REGION}/services/${SERVICE_ID} \
    [ --enable-kerberos ]
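
To confirm that the cluster can reach the multi-regional metastore, you can submit a quick Hive job. This is a hedged sketch that reuses the variables defined above:

gcloud dataproc jobs submit hive \
  --cluster=${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${MULTI_REGION} \
  --execute="SHOW DATABASES;"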

Dataproc Metastore 2.3.6

  1. Create the following variables for your Dataproc Metastore service:

    METASTORE_PROJECT=METASTORE_PROJECT
    METASTORE_ID=METASTORE_ID
    MULTI_REGION=MULTI_REGION
    SUBNET=SUBNET

    Replace the following:

    • METASTORE_PROJECT: the Google Cloud project that contains your Dataproc Metastore service.
    • METASTORE_ID: the service ID of your Dataproc Metastore service.
    • MULTI_REGION: the multi-region location that you want to use for your Dataproc Metastore service.
    • SUBNET: one of the subnets that you're using for your Dataproc Metastore service, or any subnetwork in the parent VPC network of those subnets.
  2. Create the following variables for your Dataproc cluster:

    CLUSTER_NAME=CLUSTER_NAME
    DATAPROC_PROJECT=DATAPROC_PROJECT
    DATAPROC_REGION=DATAPROC_REGION
    HIVE_VERSION=HIVE_VERSION
    IMAGE_VERSION=IMAGE_VERSION
    

    Replace the following:

    • CLUSTER_NAME: the name of your Dataproc cluster.
    • DATAPROC_PROJECT: the Google Cloud project that contains your Dataproc cluster. Make sure that the subnet you're using has the appropriate permissions to access this project.
    • DATAPROC_REGION: the Google Cloud region that you want to create your Dataproc cluster in.
    • HIVE_VERSION: the version of Hive that your Dataproc Metastore service uses.
    • IMAGE_VERSION: the Dataproc image version you are using with your Dataproc Metastore service.
      • For Hive Metastore version 2.0, use image version 1.5.
      • For Hive Metastore version 3.1.2, use image version 2.0.
  3. Retrieve the warehouse directory of your Dataproc Metastore service and store it in a variable.

    WAREHOUSE_DIR=$(gcloud metastore services describe "${METASTORE_ID}" \
        --project "${METASTORE_PROJECT}" \
        --location "${MULTI_REGION}" \
        --format="get(hiveMetastoreConfig.configOverrides[hive.metastore.warehouse.dir])")
  4. Create a Dataproc cluster configured with a multi-regional Dataproc Metastore.

    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --project "${DATAPROC_PROJECT}" \
        --region ${DATAPROC_REGION} \
        --scopes "https://www.googleapis.com/auth/cloud-platform" \
        --subnet "${SUBNET}" \
        --optional-components=DOCKER \
        --image-version ${IMAGE_VERSION} \
        --metadata "hive-version=${HIVE_VERSION},dpms-name=${DPMS_NAME}" \
        --properties "hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=${WAREHOUSE_DIR}" \
        --initialization-actions gs://metastore-init-actions/mr-metastore-grpc-proxy/metastore-grpc-proxy.sh

Thrift

Option 1: Edit the hive-site.xml file

  1. Find the endpoint URI and warehouse directory of your Dataproc Metastore service. You can pick any one of the endpoints exposed.
  2. In the Google Cloud console, go to the VM Instances page.
  3. In the list of virtual machine instances, click SSH in the row of the Dataproc primary node (.*-m).

    A browser window opens in your home directory on the node.

  4. Open the /etc/hive/conf/hive-site.xml file.

    sudo vim /etc/hive/conf/hive-site.xml
    

    You see output similar to the following:

    <property>
        <name>hive.metastore.uris</name>
        <value>ENDPOINT_URI</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>WAREHOUSE_DIR</value>
    </property>
    

    Replace the following:

    • ENDPOINT_URI: the Thrift endpoint URI of your Dataproc Metastore service.
    • WAREHOUSE_DIR: the warehouse directory of your Dataproc Metastore service.

  5. Restart HiveServer2 (a quick verification sketch follows these steps):

    sudo systemctl restart hive-server2.service
    

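After HiveServer2 restarts, you can sanity-check the connection from the same SSH session. A minimal sketch using Beeline, assuming the default HiveServer2 port of 10000:

beeline -u "jdbc:hive2://localhost:10000/default" -e "SHOW DATABASES;"
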
Option 2: Use the gcloud CLI

  1. Find the endpoint URI and warehouse directory of your Dataproc Metastore service. You can pick any one of the exposed endpoints. (A describe sketch that shows how to look up these values follows the replacement list.)

  2. Run the following gcloud dataproc clusters create command:

    gcloud dataproc clusters create CLUSTER_NAME \
        --network NETWORK \
        --project PROJECT_ID \
        --scopes "https://www.googleapis.com/auth/cloud-platform" \
        --image-version IMAGE_VERSION \
        --properties "hive:hive.metastore.uris=ENDPOINT,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR"

Replace the following:

  • CLUSTER_NAME: the name of your Dataproc cluster.
  • NETWORK: the VPC network that your Dataproc cluster uses. Make sure this network has access to your Dataproc Metastore service.
  • PROJECT_ID: the Google Cloud project ID of the project that contains your Dataproc cluster.
  • IMAGE_VERSION: the Dataproc image version you are using with your Dataproc Metastore service.
    • For Hive Metastore version 2.0, use image version 1.5.
    • For Hive Metastore version 3.1.2, use image version 2.0.
  • ENDPOINT: the Thrift endpoint that your Dataproc Metastore uses.
  • WAREHOUSE_DIR: the warehouse directory of your Dataproc Metastore.
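
In both options, one way to look up the endpoint URI and warehouse directory is to describe the service and inspect the output; exact field names can vary between service versions, so this sketch prints the full description:

gcloud metastore services describe SERVICE \
  --location=MULTI_REGION \
  --project=PROJECT_ID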

What's next