Metadata federation

Metadata federation lets you access metadata that is stored in multiple Dataproc Metastore instances.

To set up federation, you create a federation service and then configure multiple Dataproc Metastore instances as your backend metastores. The federation service then exposes a single gRPC endpoint, which you can use to access metadata across all of your metastore instances.

For example, you can create a Dataproc cluster and expose all your Dataproc Metastore services through a single endpoint. Afterward, you can run big data jobs through open source software (OSS) engines, such as Spark or Hive, to access your metadata across multiple metastore instances.

How federation works

OSS big data workloads that run on Spark or Hive send requests to the Hive Metastore API to fetch metadata at runtime. The Hive Metastore interface supports both read and write methods. The federation service exposes a gRPC version of the Hive Metastore interface.

At runtime, when the federation service receives a request, it completes one of the following actions:

  • If the request contains a database name, it's routed to the backend metastore that contains this database. If more than one metastore contains the same database name, the request is routed to the metastore with the lower rank in the source ordering (also known as the primary metastore).
  • If the request doesn't contain a database name, it's routed to the metastore with the lowest rank in the source ordering (the primary metastore).
  • If none of the metastores contain the requested database, then the OSS engine responds with the equivalent of a not-found error.
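
The routing rules above can be sketched as a small simulation. This is an illustrative model only, not the federation service's implementation: the Backend class, the route function, and the sample metastore names are hypothetical, and a lower rank number is assumed to mean an earlier position in the source ordering.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backend:
    """Hypothetical stand-in for one backend metastore in a federation."""
    name: str
    rank: int  # assumption: lower rank = earlier in the source ordering
    databases: set = field(default_factory=set)


def route(backends: list, database: Optional[str] = None) -> Optional[Backend]:
    """Return the backend a request would be routed to, or None if not found."""
    ordered = sorted(backends, key=lambda b: b.rank)
    if not ordered:
        return None
    if database is None:
        # No database in the request: use the primary metastore.
        return ordered[0]
    for backend in ordered:
        # The first match in rank order wins when names collide.
        if database in backend.databases:
            return backend
    # No backend contains the database: model the not-found case as None.
    return None


backends = [
    Backend("metastore-b", rank=2, databases={"sales", "logs"}),
    Backend("metastore-a", rank=1, databases={"sales"}),
]
print(route(backends, "sales").name)  # metastore-a (lower rank wins)
print(route(backends, "logs").name)   # metastore-b (only match)
print(route(backends).name)           # metastore-a (primary metastore)
print(route(backends, "missing"))     # None (not found)
```

In this model, both metastores contain a database named sales, so the request goes to the one with the lower rank.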

Restrictions

The following restrictions apply to federation services:

  • Federation services are only available through gRPC endpoints. You must create your Dataproc Metastore services with gRPC endpoints to include them in a federation service.
  • Federation services must be located in the same region as any associated metastores. For example, if you create your federation service in us-central1, then you must also create your metastores in us-central1.
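
As a rough illustration of these two restrictions, the following sketch checks a list of candidate backends before you add them to a federation. The Service class, its field names, and the "GRPC" and "THRIFT" protocol labels are assumptions made for this example; the sketch doesn't call any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical description of a Dataproc Metastore service."""
    name: str
    region: str
    endpoint_protocol: str  # assumption: "GRPC" or "THRIFT"


def check_backends(federation_region: str, services: list) -> list:
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for service in services:
        if service.endpoint_protocol != "GRPC":
            problems.append(
                f"{service.name}: endpoint protocol is "
                f"{service.endpoint_protocol}, expected GRPC"
            )
        if service.region != federation_region:
            problems.append(
                f"{service.name}: region {service.region} differs from "
                f"federation region {federation_region}"
            )
    return problems


services = [
    Service("metastore-a", "us-central1", "GRPC"),
    Service("metastore-b", "us-central1", "THRIFT"),
    Service("metastore-c", "europe-west1", "GRPC"),
]
for problem in check_backends("us-central1", services):
    print(problem)
```

Here, metastore-a passes both checks, while metastore-b and metastore-c each report one problem.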

Metadata sources

When you create a federation service, you must add a metadata source. The following restrictions apply to metadata sources:

  • A federation service doesn't contain its own data. Instead, the federation service just serves metadata from one of its metadata sources.
  • A federation service can't be a source of metadata in another federation service.

Source types

You can use the following sources to populate metadata in your federation service:

  • A Dataproc Metastore service.

Source ordering

Your federation service processes metadata requests in a priority order. This concept is known as source ordering.

When you send a request to the federation service, it checks the source ordering and decides which metastore to call to return the applicable metadata.

The metastore with the lowest source ordering rank (in other words, the first one in the list) is referred to as the primary metastore. If a request to the federation service doesn't specify a database, it's dispatched to the primary metastore. Some examples of Hive Metastore requests that don't specify a database are set_ugi and create_database.

Before you begin

Access control

To use a federation service, you need metastore.federation.* IAM permissions to complete the following actions:

  • list and get Dataproc Metastore federations
  • create and update Dataproc Metastore federations
  • delete Dataproc Metastore federations

The user account or service account that is used to access metadata through the federation service should have the following IAM roles:

  • To access the federation service, use the roles/metastore.federationAccessor role.
  • To complete metadata operations on a Dataproc Metastore configured with a federation service, add both of the following roles:

For more information, see Dataproc Metastore IAM and access control.

Create a federation service

The following instructions show you how to create a federation service and attach it to your Dataproc Metastore service.

To create a federation service, you must already have created one or more Dataproc Metastore services.

Console

  1. In the console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the console

  2. In the Dataproc navigation menu, click Federation.

    The Federated metastore services page opens.

  3. In the Federated metastore navigation menu, click Create.

    The Create federated service page opens.

  4. In the Service name field, enter a unique name for your service.

    For more information, see Resource naming convention.

  5. Select the Location.

    Make sure the region of your federation service is the same as the region of your primary metastore.

  6. Select the Hive Version.

  7. To add a source for your federation service, click Add Source.

    For a Dataproc Metastore service, specify its project ID, region, and service ID.

    You can add one or more sources. The first source that you add in this list is automatically set as your primary metastore. You can update the source ordering after creation.

    Make sure your primary metastore uses a Hive version that is compatible with your federation service. The Hive version of your primary metastore must be greater than or equal to the Hive version of your federation service.

  8. To create and start the service, click Create.
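
The Hive version requirement in step 7 can be checked before you create the service. The following is a minimal sketch that assumes versions are plain dotted numbers, such as 3.1.2; the function names are hypothetical, and the comparison is not taken from the service itself.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as "3.1.2" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def primary_is_compatible(primary_version: str, federation_version: str) -> bool:
    """True if the primary metastore's Hive version is >= the federation's."""
    return parse_version(primary_version) >= parse_version(federation_version)


print(primary_is_compatible("3.1.2", "2.3.6"))  # True
print(primary_is_compatible("2.3.6", "3.1.2"))  # False
```

Comparing tuples of integers avoids the ordering mistakes that plain string comparison makes, for example with a component such as 10.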

Update a federation service

The following instructions show you how to update a federation service.

When updating a federation service, you can complete the following tasks:

  • Add a Dataproc Metastore source to a federation service.
  • Remove a Dataproc Metastore source from a federation service.
  • Change the source ordering of the Dataproc Metastores contained in a federation.
  • Delete a federation permanently. After you delete a federation, all of its resources are released.

Console

  1. In the console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the console

  2. In the Dataproc navigation menu, click Federation.

    The Federated metastore services page opens.

  3. On the Federated metastore services page, click the service name of the federated service you'd like to update.

    The Service detail page for that service opens.

  4. In the menu bar, click Edit.

    The Edit service page opens.

  5. Choose the updated federation parameter values.

  6. To update the service, click Submit.

Attach a Dataproc cluster to a federation service

The following instructions show you how to create a Dataproc cluster and attach a federation endpoint as its metastore. Before you start these instructions, complete the following tasks:

gcloud

To create a Dataproc cluster and attach a federation endpoint, run the following gcloud dataproc clusters create command.

 gcloud dataproc clusters create CLUSTER_NAME \
    --region LOCATION \
    --project PROJECT_ID \
    --scopes https://www.googleapis.com/auth/cloud-platform \
    --image-version IMAGE_VERSION \
    --service-account SERVICE_ACCOUNT \
    --optional-components=DOCKER \
    --initialization-actions gs://metastore-init-actions/metastore-grpc-proxy/metastore-grpc-proxy.sh \
    --metadata "proxy-uri=FEDERATION_URI,hive-version=FEDERATION_VERSION" \
    --properties "hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR"
 

Replace the following:

  • CLUSTER_NAME: the name of your new cluster.
  • PROJECT_ID: the project ID of the project you're creating the Dataproc cluster in.
  • LOCATION: the region of your Dataproc cluster.
  • IMAGE_VERSION: your Dataproc image version.
  • SERVICE_ACCOUNT: the service account you're using to create your Dataproc cluster. If unspecified, the cluster uses your default Compute Engine service account.
  • FEDERATION_URI: the endpoint URI of your federation service.
  • FEDERATION_VERSION: the Hive version of your federation service.
  • WAREHOUSE_DIR: the warehouse directory of your primary metastore.

What's next