Creating a service

You can create a Dataproc Metastore service by using the Google Cloud Console, the Cloud SDK gcloud command-line tool in a local terminal window or in Cloud Shell, or the services.create API method.

When you create a service, you're required to specify the region for it. See Cloud locations for information on which locations support Dataproc Metastore.

Additional fields include metastore version, network, port, and service tier. Note that if you don't specify a network, the default network in the Dataproc Metastore service project is used. Dataproc Metastore uses private IP, so only VMs on the same network can access the Dataproc Metastore service.

Before you begin

  • Enable billing for the project.

  • Enable the Dataproc Metastore API.

  • Most gcloud metastore commands require a location. You can specify the location by using the --location flag or by setting the default location, as shown in the example after this list.

  • Do not set the org-policy constraint to restrict VPC peering. Specifying constraints/compute.restrictVpcPeering will cause your creation request to fail with an INVALID_ARGUMENT error. If you must set the constraint, use the following command to allow under:folders/270204312590:

    gcloud resource-manager org-policies allow compute.restrictVpcPeering under:folders/270204312590 --organization ORGANIZATION_ID
    

    For more information, see Organization policy constraints.

  • If you'd like to enable Kerberos for your Hive metastore instance, you must:

    • Host your own Kerberos Key Distribution Center (KDC).
    • Set up IP connectivity between the VPC network and your KDC.
    • Set up a Secret Manager secret that contains the contents of a Hive Keytab.
    • Specify a principal that is in both the KDC and the Hive Keytab.
    • Specify a krb5.conf file in a Google Cloud Storage bucket.

    For more information, see Configuring Kerberos.

  • To create a Dataproc Metastore service that is accessible in a network belonging to a different project than the one the service belongs to, you must grant roles/metastore.serviceAgent to the service project's Dataproc Metastore service agent (service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com) in the network project's IAM policy.

    gcloud projects add-iam-policy-binding NETWORK_PROJECT_ID \
        --role "roles/metastore.serviceAgent" \
        --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com"
    

Access control

  • To create a service, you must be granted an IAM role that contains the metastore.services.create IAM permission. The Dataproc Metastore-specific roles roles/metastore.admin and roles/metastore.editor can be used to grant create permission, as shown in the example below.

  • You can also grant create permission to users or groups by using the roles/owner and roles/editor legacy roles.

For more information, see Dataproc Metastore IAM and access control.
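
For example, the following command grants roles/metastore.editor to a user; PROJECT_ID and USER_EMAIL are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/metastore.editor"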

Creating a Dataproc Metastore service

The following instructions demonstrate how to create a Dataproc Metastore service using the Google Cloud Console, the gcloud tool, or the Dataproc Metastore API.

Console

  1. In the Cloud Console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the Cloud Console

  2. At the top of the Dataproc Metastore page, click the Create button. The Create service page opens.

  3. Enter a unique name for your service in the Service name field. For information on the naming convention, see Resource naming convention.

  4. Select the Data location.

  5. Select the Hive Metastore version. If not specified, Hive version 2.3.6 is used. For more information, see Version policy.

  6. Select the Release channel. If not specified, Stable is used. For more information, see Release channel.

  7. Enter the Port. This is the TCP port at which the Dataproc Metastore Thrift interface is available. If not provided, port number 9083 is used.

  8. Select the Service tier. This determines the capacity of the service. Developer is the default tier; it's suited to low-cost proof-of-concept work because it provides limited scalability and no fault tolerance. The Enterprise tier provides flexible scalability, fault tolerance, and multi-zone high availability, and it can handle heavy Dataproc Metastore workloads.

  9. Select the Network. The service must be attached to the same network as other metastore clients, such as your Dataproc cluster, so that those clients can access the service. If not provided, the default network is used.

    Optional: Click Use shared VPC network and enter the Project ID and VPC network name of the shared VPC network. For more information, see VPC Service Controls with Dataproc Metastore.

  10. Optional: Enable Data Catalog sync to sync the Dataproc Metastore service to Data Catalog. For more information, see Dataproc Metastore to Data Catalog sync.

  11. Optional: Select the Day of week and Hour of day for the service's maintenance window. For more information, see Maintenance windows.

  12. Optional: Enable a Kerberos keytab file:

    1. Click the toggle to enable Kerberos.

    2. Select or enter your secret resource ID.

    3. Choose whether to use the latest secret version or to select a specific earlier version.

    4. Enter the Kerberos principal. This is the principal allocated for this Dataproc Metastore service.

    5. Browse to the krb5 config file.

  13. Optional: Click + Add Overrides to override Hive metastore configuration properties.

  14. Optional: Click + Add Labels to add additional metadata to the metastore service resource.

  15. Click the Submit button to create and start the service.

  16. Verify that you have returned to the Dataproc Metastore page, and that your new service appears in the list.

gcloud

  1. Use the following gcloud metastore services create command to create a service:

    gcloud metastore services create SERVICE \
        --location=LOCATION \
        --labels=k1=v1,k2=v2,k3=v3 \
        --network=NETWORK \
        --port=PORT \
        --tier=TIER \
        --hive-metastore-version=HIVE_METASTORE_VERSION \
        --release-channel=RELEASE_CHANNEL \
        --hive-metastore-configs=K1=V1,K2=V2 \
        --kerberos-principal=KERBEROS_PRINCIPAL \
        --krb5-config=KRB5_CONFIG \
        --keytab=CLOUD_SECRET
    

    Replace the following:

    • SERVICE: The name of the new service.
    • LOCATION: The Google Cloud region in which to create the service.
    • k1=v1,k2=v2,k3=v3: The labels to attach to the service, as comma-separated KEY=VALUE pairs.
    • NETWORK: The name of the VPC network on which the service can be accessed. When using a VPC network belonging to a different project than the service, the entire relative resource name must be provided, for example projects/HOST_PROJECT/global/networks/NETWORK_ID.
    • PORT: The TCP port at which the metastore Thrift interface is available. Default: 9083.
    • TIER: The tier of the new service, which determines its capacity.
    • HIVE_METASTORE_VERSION: The Hive metastore version of the new service. If not specified, the default version is used.
    • RELEASE_CHANNEL: The release channel of the service.
    • K1=V1,K2=V2: Optional: The Hive metastore configs used.
    • KERBEROS_PRINCIPAL: Optional: A Kerberos principal that exists in both the keytab and the KDC. A typical principal is of the form "primary/instance@REALM", but there is no exact format requirement.
    • KRB5_CONFIG: Optional: The krb5.conf file, which specifies the KDC and Kerberos realm information, including the locations of KDCs and defaults for the realm and for Kerberos applications.
    • CLOUD_SECRET: Optional: The relative resource name of a Secret Manager secret version.
  2. Verify that the creation was successful.
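
    For example, you can describe the new service and check that its state is ACTIVE, using the same SERVICE and LOCATION values as above:

    gcloud metastore services describe SERVICE \
        --location=LOCATION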

REST

Follow the API instructions to create a service by using the APIs Explorer.
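
For reference, a request to the services.create method has roughly the following shape. This is a minimal sketch: the body fields and Hive version shown are assumptions, and PROJECT_ID, LOCATION, and SERVICE_ID are placeholders.

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"port": 9083, "tier": "DEVELOPER", "hiveMetastoreConfig": {"version": "3.1.2"}}' \
    "https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services?serviceId=SERVICE_ID"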

Using non-RFC 1918 private IP address ranges

The provided VPC network can run out of the available RFC 1918 addresses that Dataproc Metastore services require. If that happens, Dataproc Metastore attempts to reserve private IP address ranges outside of the RFC 1918 ranges for service creation. See Valid ranges in the VPC network documentation for a list of supported non-RFC 1918 private ranges.

Non-RFC 1918 private IP addresses used in Dataproc Metastore may conflict with a range in an on-premises network that is connected to the provided VPC network. To check the list of RFC 1918 and non-RFC 1918 private IP addresses reserved by Dataproc Metastore:

gcloud compute addresses list \
    --project NETWORK_PROJECT_ID \
    --filter="purpose:VPC_PEERING AND name ~ cluster|resourcegroup"

If a conflict is found and cannot be mitigated by reconfiguring the on-premises network, delete the offending Dataproc Metastore service, then re-create it after 2 hours.
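
For example, a sketch of deleting the service with the gcloud tool, where SERVICE and LOCATION are the values used when the service was created:

gcloud metastore services delete SERVICE \
    --location=LOCATION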

After you create a Dataproc Metastore service

After you create a service, you can create and attach a Dataproc cluster or a self-managed Apache Hive, Apache Spark, or Presto cluster that uses the service as its Hive metastore.
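
For example, a minimal sketch of creating a Dataproc cluster that uses the service as its Hive metastore; CLUSTER_NAME, REGION, PROJECT_ID, LOCATION, and SERVICE are placeholders:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/SERVICE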

What's next?