This page explains how to create a metadata federation service for Dataproc Metastore. A federation service lets you access metadata that is stored in multiple sources from a single gRPC endpoint.
For more information about how federation works and its limitations, see About metadata federation.
Before you begin
- Enable Dataproc Metastore.
- Create a Dataproc Metastore service that uses the gRPC endpoint.
- Optional: If you're using a BigQuery source for federation, complete the following:
  - Enable the BigQuery API in the project that contains the BigQuery source.
  - Enable the Resource Manager API.
- Optional: If you're using a Dataplex Lake as a source for federation (Preview), complete the following:
  - Enable the Dataplex API in the project that contains the Dataplex Lake.
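The API prerequisites above can also be handled from the command line. The following is a minimal sketch, not an official procedure: the helper name `enable_apis_cmd` is hypothetical, and it assumes all APIs are enabled in a single project (your BigQuery or Dataplex source might live in a different project). It prints the `gcloud services enable` command instead of running it, so you can review it first.

```shell
# Sketch: print the `gcloud services enable` command for the APIs listed above.
# Usage: enable_apis_cmd PROJECT_ID [bigquery] [dataplex]
# Assumption: every API is enabled in the same project.
enable_apis_cmd() {
  local project="$1"
  shift
  # The Dataproc Metastore API is always required.
  local apis="metastore.googleapis.com"
  local src
  for src in "$@"; do
    case "$src" in
      bigquery) apis="$apis bigquery.googleapis.com cloudresourcemanager.googleapis.com" ;;
      dataplex) apis="$apis dataplex.googleapis.com" ;;
      *) echo "unknown source type: $src" >&2; return 1 ;;
    esac
  done
  echo "gcloud services enable $apis --project=$project"
}

# Example with a placeholder project ID.
enable_apis_cmd my-project bigquery
```

To run the command, copy the printed line into your terminal after checking the project ID.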
Required roles
To get the permissions that you need to create a federation service and attach a Dataproc cluster, following the principle of least privilege, ask your administrator to grant you the following IAM roles:
- To access the federation service: Federation accessor (roles/metastore.federationAccessor) on the user account or service account
- To grant full control of all Dataproc Metastore resources: Dataproc Metastore editor (roles/metastore.editor) on the user account or service account
- To complete metadata operations on a Dataproc Metastore configured with a federation service: Metadata editor (roles/metastore.metadataEditor) on the user account or service account
- To create a Dataproc cluster: Dataproc worker (roles/dataproc.worker) on the Dataproc VM service account
- (Optional) To access BigQuery datasets: Use an appropriate BigQuery predefined role applicable for your use case on the user account or service account
- (Optional) To access Dataplex Lakes (Preview): Use an appropriate Dataplex predefined role applicable for your use case on the user account or service account
For more information about granting roles, see Manage access to projects, folders, and organizations.
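For an administrator who grants these roles at the project level, the following sketch shows one way to generate the commands. It assumes project-level bindings are acceptable in your organization; the helper name `grant_roles_cmds`, the project ID, and the email address are placeholders. The commands are printed, not executed.

```shell
# Sketch: print one `gcloud projects add-iam-policy-binding` command per role.
# MEMBER uses IAM member syntax, for example user:EMAIL or serviceAccount:EMAIL.
grant_roles_cmds() {
  local project="$1"
  local member="$2"
  shift 2
  local role
  for role in "$@"; do
    echo "gcloud projects add-iam-policy-binding $project --member=$member --role=$role"
  done
}

# Example with placeholder values.
grant_roles_cmds my-project user:alice@example.com \
  roles/metastore.federationAccessor \
  roles/metastore.editor \
  roles/metastore.metadataEditor
```

Grant only the roles that a given account needs; the list passed in the example is illustrative.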
These predefined roles contain the permissions required to create a federation service and attach a Dataproc cluster, following the principle of least privilege. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create a federation service and attach a Dataproc cluster, following the principle of least privilege:
- To create a Dataproc Metastore: metastore.services.create on the user account or service account
- To list, get, create, update, and delete a federation service: metastore.federations.create, metastore.federations.update, metastore.federations.delete, metastore.federations.get, metastore.federations.list on the user account or service account
- To complete metadata operations on a Dataproc Metastore: metastore.services.get, metastore.services.use, metastore.databases.create, metastore.databases.update, metastore.databases.delete, metastore.databases.get, metastore.databases.list, metastore.databases.getIamPolicy, metastore.tables.create, metastore.tables.update, metastore.tables.delete, metastore.tables.get, metastore.tables.list, metastore.tables.getIamPolicy on the user account or service account
- (Optional) To access BigQuery datasets, on the user account or service account: for more information, see BigQuery permissions
- (Optional) To access Dataplex Lakes (Preview), on the user account or service account: for more information, see Dataplex permissions
You might also be able to get these permissions with custom roles or other predefined roles.
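As an illustration of the custom-role option, the sketch below prints a `gcloud iam roles create` command that bundles only the federation permissions. The helper name `federation_role_cmd`, the role ID, and the title are hypothetical, and whether each permission can be used in a custom role in your organization is an assumption to verify before running the command.

```shell
# Sketch: print a `gcloud iam roles create` command for a custom role that
# carries only the federation permissions.
# The role ID and title are hypothetical placeholders.
federation_role_cmd() {
  local project="$1"
  local role_id="$2"
  local perms="metastore.federations.create,metastore.federations.update"
  perms="$perms,metastore.federations.delete,metastore.federations.get"
  perms="$perms,metastore.federations.list"
  echo "gcloud iam roles create $role_id --project=$project --title=\"Federation admin\" --permissions=$perms"
}

# Example with placeholder values.
federation_role_cmd my-project federationAdmin
```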
For more information about specific Dataproc Metastore roles and permissions, see Manage Dataproc Metastore access with IAM.
Create a federation service
The following instructions show you how to create a federation service and attach it to a source. After you complete these steps, you can attach your federation service to a Dataproc cluster.
To learn more about federation sources and their limitations, see metadata sources.
Console
In the Google Cloud console, open the Dataproc Metastore page.
In the Dataproc navigation menu, click Federation.
The Federated metastore services page opens.
In the Federated metastore menu bar, click Create.
The Create federation service page opens.
In the Federation name field, enter a unique name for your service.
For more information, see Resource naming convention.
Select the Data location.
Make sure you create your federation service in the same region as your Dataproc Metastore sources.
Select the Hive Version.
To add a source for your federation service, click Add a Source.
You can add one or more sources. The first source that you add in this list is automatically set as your primary metastore. You can update the source ordering after creation.
For the Source type, select your federation source.
You can choose a Dataproc Metastore instance, a project that contains one or more BigQuery datasets, or a Dataplex lake (Preview).
In the Source field, enter the following information:
For a Dataproc Metastore service:
In the Selected project field, click Browse and select the project that contains the Dataproc Metastore service that you want to use as a source.
Make sure your Dataproc Metastore sources use a Hive version that is compatible with your federation service. Your primary metastore must use a Hive version that is greater than or equal to the Hive version of your federation service.
In the Metastore service drop-down, select the Dataproc Metastore service that you want to use as a source.
For BigQuery: In the Selected project field, click Browse and select the project ID of the project that contains the BigQuery dataset.
For Dataplex (Preview): In the Selected project field, click Browse and select the project ID of the project that contains the Dataplex Lake.
Click Done.
To create and start the service, click Submit.
You can now attach your federation service to a Dataproc cluster.
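The console steps above can be approximated on the command line. The sketch below first checks the Hive version rule for the primary metastore, then prints a `gcloud metastore federations create` command in which the first source receives rank 1. The flag names and the `RANK=TYPE:NAME` backend format are written from memory and should be confirmed with `gcloud metastore federations create --help`. The helper names, resource names, and version numbers are hypothetical, and `version_ge` relies on `sort -V`, which assumes GNU coreutils or a compatible `sort`.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the create command. Sources are given as TYPE:NAME, for example
# dpms:my-metastore or bigquery:my-bq-project. The first source gets rank 1.
# Assumption: flag names and backend format match your gcloud version.
federation_create_cmd() {
  local name="$1"
  local location="$2"
  local hive_version="$3"
  shift 3
  local backends=""
  local rank=1
  local src
  for src in "$@"; do
    backends="${backends:+$backends,}$rank=$src"
    rank=$((rank + 1))
  done
  echo "gcloud metastore federations create $name --location=$location --hive-metastore-version=$hive_version --backends=$backends"
}

primary_hive_version="3.1.2"     # hypothetical: Hive version of the primary source
federation_hive_version="3.1.2"  # hypothetical: Hive version of the federation
if version_ge "$primary_hive_version" "$federation_hive_version"; then
  federation_create_cmd my-federation us-central1 "$federation_hive_version" \
    dpms:my-metastore bigquery:my-bq-project
else
  echo "primary metastore Hive version is older than the federation version" >&2
fi
```

Printing the command keeps the sketch side-effect free; run the printed line only after confirming the flags against the CLI help.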
Update a federation service
The following instructions show you how to update a federation service. You can complete the following tasks:
- Add a source to a federation service.
- Remove a source from a federation service.
- Change the source ordering of the sources contained in a federation service.
- Delete a federation service permanently. After you delete a service, all of its resources are released.
Console
In the Google Cloud console, open the Dataproc Metastore page.
In the Dataproc navigation menu, click Federation.
The Federated metastore services page opens.
On the Federated metastore services page, click the name of the service that you want to update.
The Service detail page opens.
In the menu bar, click Edit.
The Edit service page opens.
Choose the values that you want to update.
To update the service, click Submit.
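A command-line equivalent of these updates might look like the sketch below. The `--update-backends` and `--remove-backends` flag names are assumptions based on the usual gcloud update pattern; confirm them with `gcloud metastore federations update --help` before use. The helper names and resource names are hypothetical, and the commands are only printed.

```shell
# Sketch: print update commands for the sources of a federation service.
# Assumption: --update-backends and --remove-backends exist in your gcloud version.

# Add or replace the source at a given rank. SOURCE is TYPE:NAME.
federation_set_source_cmd() {
  local name="$1"
  local location="$2"
  local rank="$3"
  local src="$4"
  echo "gcloud metastore federations update $name --location=$location --update-backends=$rank=$src"
}

# Remove the source at a given rank.
federation_remove_source_cmd() {
  local name="$1"
  local location="$2"
  local rank="$3"
  echo "gcloud metastore federations update $name --location=$location --remove-backends=$rank"
}

# Example with placeholder values.
federation_set_source_cmd my-federation us-central1 2 dpms:my-second-metastore
federation_remove_source_cmd my-federation us-central1 3
```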
Attach a Dataproc cluster to a federation service
The following instructions show you how to create a Dataproc cluster and attach a federation service endpoint as its metastore.
Before you start these instructions, complete all the steps listed in Before you begin and create a federation service.
gcloud CLI
To create a Dataproc cluster and attach a federation endpoint, run the following gcloud dataproc clusters create command.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=LOCATION \
    --project=PROJECT_ID \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --image-version=IMAGE_VERSION \
    --service-account=SERVICE_ACCOUNT \
    --optional-components=DOCKER \
    --initialization-actions=gs://metastore-init-actions/metastore-grpc-proxy/metastore-grpc-proxy.sh \
    --metadata="proxy-uri=FEDERATION_URI,hive-version=FEDERATION_VERSION" \
    --properties="hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR"
Replace the following:
- CLUSTER_NAME: the name of your new Dataproc cluster.
- PROJECT_ID: the Google Cloud project ID of the project you're creating the Dataproc cluster in.
- LOCATION: the region of your Dataproc cluster.
- IMAGE_VERSION: the Dataproc image version that you want to use.
  Make sure the Dataproc image that you're using in this command is compatible with the Hive version used with your federation service. For more information, see Dataproc image version list.
- SERVICE_ACCOUNT: optional. The service account that you're using to create your Dataproc cluster. If unspecified, the cluster uses your default Compute Engine service account.
- FEDERATION_URI: the endpoint URI of your federation service.
- FEDERATION_VERSION: the Hive version that your federation service is using.
- WAREHOUSE_DIR: the warehouse directory of your primary Dataproc Metastore.
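If you don't have the FEDERATION_URI and FEDERATION_VERSION values at hand, one way to look them up is to describe the federation service. The sketch below prints the lookup commands; it assumes the federation resource exposes `endpointUri` and `version` fields, as the Dataproc Metastore API describes. The helper name `federation_field_cmd` and the resource names are placeholders.

```shell
# Sketch: print a command that reads one field of a federation service.
# Assumption: the field names endpointUri and version match the API resource.
federation_field_cmd() {
  local name="$1"
  local location="$2"
  local field="$3"
  echo "gcloud metastore federations describe $name --location=$location --format='value($field)'"
}

# Example with placeholder values.
federation_field_cmd my-federation us-central1 endpointUri  # for FEDERATION_URI
federation_field_cmd my-federation us-central1 version      # for FEDERATION_VERSION
```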