You can enable syncing from a Dataproc Metastore service to Data Catalog to take advantage of metadata discovery and management. Once sync is enabled, database and table metadata is automatically synced from Dataproc Metastore to Data Catalog.
Data Catalog lets you tag and search for service-specific resources, such as databases and tables.
What is Data Catalog
Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products. It provides a unified view of technical and business metadata, along with mechanisms for tagging it.
For more information, see the Data Catalog feature guides.
Permissions
Data Catalog abides by Dataproc Metastore level permissions. For metadata that is synced from Dataproc Metastore to Data Catalog, IAM permissions specified in Dataproc Metastore apply to the metadata in Data Catalog as well.
Data Catalog checks the permissions for each metastore database and table at the time of access, so only users with access to the Dataproc Metastore service can see the synced service resources as entries in Data Catalog.
You must request the roles/metastore.metadataViewer role to view synced Dataproc Metastore entries in Data Catalog. The roles/metastore.admin and roles/metastore.editor roles don't support permissions on metastore databases and tables.
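As a rough illustration of the role requirement above, the following sketch adds a `roles/metastore.metadataViewer` binding to an IAM policy held as a plain dictionary in the standard IAM policy JSON shape. The helper name `add_metadata_viewer`, the example members, and the starting policy are assumptions made for illustration. Reading the policy from the service and writing it back through the IAM API is not shown.

```python
import copy

METADATA_VIEWER_ROLE = "roles/metastore.metadataViewer"


def add_metadata_viewer(policy: dict, member: str) -> dict:
    """Return a copy of `policy` in which `member` holds the metadata viewer role."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == METADATA_VIEWER_ROLE:
            # Reuse the existing binding and avoid duplicate members.
            if member not in binding.setdefault("members", []):
                binding["members"].append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": METADATA_VIEWER_ROLE, "members": [member]})
    return updated


# Hypothetical starting policy and member, for illustration only.
policy = {
    "bindings": [
        {"role": "roles/metastore.editor", "members": ["user:a@example.com"]}
    ]
}
updated_policy = add_metadata_viewer(policy, "user:analyst@example.com")
```

The helper returns a modified copy rather than editing the policy in place, so the original can still be compared against the result before anything is written back.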
Enable Data Catalog sync
After enabling Data Catalog sync, Data Catalog performs live, full syncs of your Dataproc Metastore service.
Data Catalog syncs the following metadata:
- Instances
- Databases, including name and description
- Tables, including name, description, and schema (columns with descriptions)
The following table shows the resource mapping between Dataproc Metastore and Data Catalog:
Dataproc Metastore Resource | Data Catalog Resource |
---|---|
Instance | Entry group, Entry |
Database | Entry |
Table | Entry |
Column | Schema |
You can enable syncing from a Dataproc Metastore service to Data Catalog when you create or update the service in the Google Cloud console. You can disable the sync the same way.
Create a service with Data Catalog sync enabled
Data Catalog sync is disabled by default.
To enable Data Catalog sync for a new service:
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the Dataproc Metastore page, click the Create button. The Create service page opens.
Configure your service as desired.
Under Metadata integration, enable Data Catalog sync to sync the Dataproc Metastore service to Data Catalog.
Click Submit.
Enable or disable Data Catalog sync for an existing service
To enable or disable Data Catalog sync for an existing service:
Console
In the Google Cloud console, open the Dataproc Metastore page.
On the Dataproc Metastore page, click the service name of the service you'd like to update. The Service detail page for that service opens.
Under the Configuration tab, click the Edit button. The Edit service page opens.
In the Metadata integration section, click to toggle Enable on or off for Data Catalog sync.
Click the Submit button to update the service.
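The console steps above toggle a single setting on the service. For readers who script their configuration, the sketch below shows what an equivalent update payload might look like. The field names (`metadata_integration`, `data_catalog_config`, `enabled`), the update mask path, and the helper name `data_catalog_sync_patch` are assumptions modeled on the Metadata integration section of the console. Verify them against the Dataproc Metastore API reference before relying on them.

```python
def data_catalog_sync_patch(enabled: bool) -> dict:
    """Build a hypothetical update payload that toggles Data Catalog sync.

    The field names are assumed, not taken from this guide.
    """
    return {
        # Restrict the update to the one field being changed.
        "update_mask": "metadata_integration.data_catalog_config.enabled",
        "service": {
            "metadata_integration": {
                "data_catalog_config": {"enabled": enabled}
            }
        },
    }


enable_payload = data_catalog_sync_patch(True)
disable_payload = data_catalog_sync_patch(False)
```

Limiting the update mask to one field mirrors what the Edit service page does: only the sync toggle changes, and the rest of the service configuration is left alone.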
Search with Data Catalog
You can search synced Dataproc Metastore metadata using Data Catalog.
Although there are no custom search options for Dataproc Metastore, there are multiple ways to search for different Dataproc Metastore resources:
- Dataproc Metastore instance
  - By display name
  - Standard Data Catalog methods, such as by tag
- Database
  - By display name
  - By description
  - By Dataproc Metastore instance
  - Standard Data Catalog methods, such as by tag
- Table
  - By display name
  - By description
  - By column name
  - By column description
  - By database
  - By Dataproc Metastore instance
  - Standard Data Catalog methods, such as by tag
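The search options listed above can be combined into a single Data Catalog search string. The following sketch builds one from optional name, description, and column terms. The helper name `build_search_query` is hypothetical, and the `system=dataproc_metastore` qualifier and the `name:`, `description:`, and `column:` prefixes are assumed to follow the standard Data Catalog search syntax. Check the search syntax reference for the exact qualifiers available to you.

```python
from typing import Optional


def build_search_query(
    name: Optional[str] = None,
    description: Optional[str] = None,
    column: Optional[str] = None,
    system: str = "dataproc_metastore",
) -> str:
    """Join search qualifiers with spaces, which the search syntax treats as AND."""
    parts = [f"system={system}"]
    if name:
        parts.append(f"name:{name}")
    if description:
        parts.append(f"description:{description}")
    if column:
        parts.append(f"column:{column}")
    return " ".join(parts)


# Hypothetical table and column names, for illustration only.
query = build_search_query(name="orders", column="customer_id")
print(query)
```

The resulting string can be pasted into the Data Catalog search box or passed as the query to the Data Catalog search API.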
FAQ
Wait 6 hours before checking Data Catalog for metadata sync completeness and correctness.
If you suspect a problem with the Dataproc Metastore to Data Catalog sync, check the metadata publishing logs for Dataproc Metastore in Cloud Logging with the filter textPayload=~".*Publish.*". For more information on accessing logs, see Access job logs in Logging.
If you disable Data Catalog sync, metadata stops syncing from Dataproc Metastore to Data Catalog. However, metadata that was already synced remains in Data Catalog.
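When checking the publishing logs mentioned above, it can help to narrow the filter to a recent time window, such as the six-hour wait recommended earlier. The sketch below combines the documented `textPayload` filter with a `timestamp` comparison in the Cloud Logging query language. The helper name `publish_log_filter` and the choice of window are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Filter from this guide that matches metadata publishing log entries.
PUBLISH_FILTER = 'textPayload=~".*Publish.*"'


def publish_log_filter(since: datetime) -> str:
    """Restrict the publishing filter to entries at or after `since`.

    `since` must be a timezone-aware datetime.
    """
    ts = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'{PUBLISH_FILTER} AND timestamp>="{ts}"'


# Look back over the last six hours.
recent_filter = publish_log_filter(datetime.now(timezone.utc) - timedelta(hours=6))
print(recent_filter)
```

The printed string can be pasted into the Logs Explorer query box or supplied as the filter when listing entries programmatically.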
If you delete a Dataproc Metastore instance, then the corresponding instance, database, and table entries are also removed from Data Catalog.
Data Catalog adheres to standard Google Cloud retention periods.
There is no additional cost for enabling Data Catalog sync for Dataproc Metastore.