Metadata federation is a service that lets you access multiple sources of metadata from a single endpoint.
To set up federation, you create a federation service and then configure your metadata sources. Afterward, the service exposes a single gRPC endpoint that you can use to access all of your metadata.
For example, using federation, you can create a Dataproc cluster that exposes multiple Dataproc Metastore services through a single endpoint. Afterwards, you can run big data jobs through open source software (OSS) engines, such as Spark or Hive, to access your metadata across multiple metastores.
How federation works
OSS big data workloads that run on Spark or Hive send requests to the Hive Metastore API to fetch metadata at runtime.
- The Hive Metastore interface supports both read and write methods. The federation service exposes a gRPC version of the Hive Metastore interface.
- At runtime, when the federation service receives a request, it checks the source ordering to retrieve the appropriate metadata.
Metadata sources
When you create a federation service, you must add a metadata source. You can use the following sources as backend metastores:
- A Dataproc Metastore instance.
- A project containing one or more BigQuery datasets.
- A Dataplex Lake (Preview).
Source restrictions
The following section lists the restrictions that you must adhere to when using various metadata sources.
All sources
The following restrictions apply to all metadata sources:
- A federation service doesn't contain its own data. Instead, the federation service just serves metadata from one of its metadata sources.
- A federation service can't be a source of metadata in another federation service.
Dataproc Metastore
If you're using a Dataproc Metastore as a source, the following restrictions apply:
- Federation services are only available through gRPC endpoints. To use a Dataproc Metastore with federation, create your metastore with a gRPC endpoint.
- Federation services can be attached to single-region Dataproc Metastore services in any region. Federation services don't support multi-region Dataproc Metastore services.
BigQuery
If you're using a project that contains BigQuery datasets as a source, you must satisfy the following conditions:
- Grant the correct IAM roles to access the project that contains the BigQuery datasets.
- Add at least one Dataproc Metastore service as a source, along with your BigQuery datasets.
Dataplex Lakes
- Grant an IAM role that contains the dataplex.lakes.get permission.
- Add at least one Dataproc Metastore service as a source, along with your Dataplex Lake.
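The source restrictions listed in this section can be read as a checklist. The following Python sketch is one way to express that checklist. It is an illustration only: the Source class, its field names, and the check_sources function are hypothetical and are not part of any Google Cloud API or client library. The sketch also doesn't verify IAM roles or permissions, which you must grant separately.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical description of one metadata source (not a real API type)."""
    kind: str                   # "dataproc_metastore", "bigquery", "dataplex", or "federation"
    endpoint: str = "grpc"      # only meaningful for Dataproc Metastore sources
    multi_region: bool = False  # only meaningful for Dataproc Metastore sources


def check_sources(sources):
    """Return a list of violations; an empty list means none were found."""
    problems = []
    if not sources:
        problems.append("a federation service needs at least one metadata source")

    has_dpms = any(s.kind == "dataproc_metastore" for s in sources)
    for s in sources:
        if s.kind == "federation":
            problems.append("a federation service can't be a source in another federation service")
        elif s.kind == "dataproc_metastore":
            if s.endpoint != "grpc":
                problems.append("Dataproc Metastore sources need a gRPC endpoint")
            if s.multi_region:
                problems.append("multi-region Dataproc Metastore services aren't supported")
        elif s.kind in ("bigquery", "dataplex") and not has_dpms:
            problems.append(f"{s.kind} sources require at least one Dataproc Metastore source")
    return problems
```

For example, check_sources([Source("bigquery")]) reports one violation, because a BigQuery source must be accompanied by at least one Dataproc Metastore source.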
Source ordering
Your federation service processes metadata requests in a priority order. This concept is known as source ordering. At runtime, when the federation service receives a request, it checks the source ordering and completes one of the following actions:
- If the request contains a database name, the request is routed to the backend metastore that contains that database. If more than one metastore contains the same database name, the request is routed to the metastore with the lowest rank.
- If the request creates or drops a database, the request is routed to the metastore with the lowest rank.
- If the request doesn't contain a database name and it doesn't create or drop a database, the request is routed to the Dataproc Metastore instance with the lowest rank. Some examples of Hive Metastore requests that don't specify a database are set_ugi and create_database.
- If none of the metastores contain the requested database, the OSS engine responds with the equivalent of a not-found error.
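The routing rules above can be sketched as a small Python function. This is a simplified model for illustration, not the federation service's implementation. The Backend class, the route function, and the DatabaseNotFound exception are hypothetical names. The sketch assumes that a smaller rank number means a lower rank, and that the create-or-drop rule is checked before the database-name lookup, because such requests also carry a database name.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical in-memory stand-in for one backend metastore."""
    name: str
    rank: int
    kind: str  # "dataproc_metastore", "bigquery", or "dataplex"
    databases: set = field(default_factory=set)


class DatabaseNotFound(Exception):
    """Stands in for the not-found error that the OSS engine reports."""


def route(backends, database=None, creates_or_drops=False):
    """Pick the backend that should serve a request, following source ordering."""
    by_rank = sorted(backends, key=lambda b: b.rank)

    # Requests that create or drop a database go to the lowest-ranked metastore.
    if creates_or_drops:
        return by_rank[0]

    # Requests that name a database go to the lowest-ranked metastore containing it.
    if database is not None:
        for backend in by_rank:
            if database in backend.databases:
                return backend
        raise DatabaseNotFound(database)

    # Requests without a database name go to the lowest-ranked Dataproc Metastore.
    for backend in by_rank:
        if backend.kind == "dataproc_metastore":
            return backend
    raise ValueError("no Dataproc Metastore backend is configured")
```

With a BigQuery source at rank 1 and a Dataproc Metastore at rank 2, a request that creates a database is routed to the BigQuery source, while a request with no database name is routed to the Dataproc Metastore.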