Dataproc Metastore to Data Catalog sync

You can enable syncing from a Dataproc Metastore service to Data Catalog to take advantage of Data Catalog's metadata discovery and management features. Once sync is enabled, database and table metadata, such as schema information, is automatically synced from Dataproc Metastore to Data Catalog.

Data Catalog allows you to tag and search for service-specific resources, such as databases and tables.

What is Data Catalog

Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products. It provides a unified view and tagging mechanisms for technical and business metadata.

For more information, see the Data Catalog feature guides.

Permissions

Dataproc Metastore level permissions are honored by Data Catalog.

For metadata that is synced from Dataproc Metastore to Data Catalog, the IAM permissions specified in Dataproc Metastore also apply to the metadata in Data Catalog. Only users with access to the Dataproc Metastore service can see the synced service resources as entries in Data Catalog.

Enabling Data Catalog sync

After enabling Data Catalog sync, Data Catalog performs live, full syncs of your Dataproc Metastore service.

It syncs the following metadata:

  • Instances
  • Databases, including name and description
  • Tables, including name, description, and schema (columns with descriptions)
  • Database properties
  • Table properties

The following table shows the resource mapping between Dataproc Metastore and Data Catalog:

Dataproc Metastore Resource    Data Catalog Resource
Instance                       Entry group, Entry
Database                       Entry
Table                          Entry
Column                         Schema

You can enable Data Catalog sync when you create or update a Dataproc Metastore service in the Google Cloud Console.

Creating a service with Data Catalog sync enabled

Data Catalog sync is disabled by default.

To enable Data Catalog sync for a new service:

Console

  1. In the Cloud Console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the Cloud Console

  2. At the top of the Dataproc Metastore page, click the Create button. The Create service page opens.

  3. Configure your service as desired.

  4. Under Metadata integration, enable Data Catalog sync to sync the Dataproc Metastore service to Data Catalog.

  5. Click Submit.
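The console steps above can also be pictured as a single API request. The following is a minimal sketch, not an official client: it only builds the request that would create a service with sync turned on. It assumes the v1beta REST surface and that the setting lives at `metadataIntegration.dataCatalogConfig.enabled` on the service resource; the project, location, and service ID are hypothetical placeholders, and sending the request (with authentication) is left out.

```python
import json
from urllib.parse import urlencode

# Assumption: the v1beta REST endpoint of the Dataproc Metastore API.
API_ROOT = "https://metastore.googleapis.com/v1beta"


def build_create_request(project, location, service_id, sync_enabled=True):
    """Return (method, url, body) for creating a service with Data Catalog sync.

    Nothing is sent over the network; the caller would still need to
    authenticate and issue the request.
    """
    parent = f"projects/{project}/locations/{location}"
    url = f"{API_ROOT}/{parent}/services?{urlencode({'serviceId': service_id})}"
    # Assumption: field path of the sync setting on the service resource.
    body = {"metadataIntegration": {"dataCatalogConfig": {"enabled": sync_enabled}}}
    return "POST", url, body


# Hypothetical identifiers, for illustration only.
method, url, body = build_create_request("my-project", "us-central1", "my-metastore")
print(method, url)
print(json.dumps(body, indent=2))
```

Leaving the setting out of the request body would correspond to the default described above, where sync is disabled.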

Enabling or disabling Data Catalog sync for an existing service

To enable or disable Data Catalog sync for an existing service:

Console

  1. In the Cloud Console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the Cloud Console

  2. On the Dataproc Metastore page, click the service name of the service you'd like to update. The Service detail page for that service opens.

  3. Under the Configuration tab, click the Edit button. The Edit service page opens.

  4. In the Metadata integration section, click to toggle Enable on or off for Data Catalog sync.

  5. Click the Submit button to update the service.
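As a rough programmatic counterpart to toggling the setting in the console, the sketch below builds a partial-update request that changes only the sync flag. It assumes the v1beta REST surface, an `updateMask` query parameter, and the field path `metadata_integration.data_catalog_config.enabled`; the service name is a hypothetical placeholder, and authentication and sending are not shown.

```python
from urllib.parse import urlencode

# Assumption: the v1beta REST endpoint of the Dataproc Metastore API.
API_ROOT = "https://metastore.googleapis.com/v1beta"
# Assumption: field-mask path of the Data Catalog sync setting.
SYNC_FIELD = "metadata_integration.data_catalog_config.enabled"


def build_sync_patch(service_name, enabled):
    """Return (method, url, body) for a PATCH that only toggles Data Catalog sync."""
    url = f"{API_ROOT}/{service_name}?{urlencode({'updateMask': SYNC_FIELD})}"
    body = {"metadataIntegration": {"dataCatalogConfig": {"enabled": enabled}}}
    return "PATCH", url, body


# Hypothetical service name, for illustration only.
name = "projects/my-project/locations/us-central1/services/my-metastore"
print(build_sync_patch(name, enabled=False))
```

Restricting the update mask to the one field keeps the rest of the service configuration untouched, which mirrors editing only the Metadata integration section.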

Searching with Data Catalog

You can search synced Dataproc Metastore metadata using Data Catalog.

Although there are no custom search options for Dataproc Metastore, there are multiple ways to search for different Dataproc Metastore resources:

  • Dataproc Metastore instance
    • By display name
    • Standard Data Catalog methods, such as by tag
  • Database
    • By display name
    • By description
    • By Dataproc Metastore instance
    • Standard Data Catalog methods, such as by tag
  • Table
    • By display name
    • By description
    • By column name
    • By column description
    • By database
    • By Dataproc Metastore instance
    • Standard Data Catalog methods, such as by tag
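The search options listed above can be combined into a single Data Catalog query string. The helper below is a sketch under stated assumptions: it assumes the search qualifiers `system=`, `type=`, `name:`, `description:`, and `column:`, that `dataproc_metastore` is the system value for synced entries, and that space-separated terms are combined with a logical AND. The function name and the example values are hypothetical.

```python
def build_search_query(entry_type=None, name=None, description=None,
                       column=None, system="dataproc_metastore"):
    """Compose a Data Catalog search query from optional filters.

    Terms are joined with spaces (assumed to mean AND).
    """
    parts = [f"system={system}"]
    if entry_type:
        parts.append(f"type={entry_type}")
    if name:
        parts.append(f"name:{name}")
    if description:
        parts.append(f"description:{description}")
    if column:
        parts.append(f"column:{column}")
    return " ".join(parts)


# Hypothetical example: tables that have a column named customer_id.
print(build_search_query(entry_type="table", column="customer_id"))
```

The resulting string could be pasted into the Data Catalog search box or passed as the query of a search request.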

FAQ

  • Wait 6 hours before checking Data Catalog for metadata sync completeness and correctness.

  • If you suspect that there is a problem with the Dataproc Metastore to Data Catalog sync, check the metadata publishing logs in Dataproc Metastore Cloud Logging with the filter textPayload=~".*Publish.*". For more information on accessing logs, see Accessing job logs in Logging.

  • If you disable Data Catalog sync, metadata will no longer be synced from Dataproc Metastore to Data Catalog. However, metadata that was already synced will remain in Data Catalog.

  • If you delete a Dataproc Metastore instance, then the corresponding instance, database, and table entries are also removed from Data Catalog.

  • Data Catalog honors standard retention periods.

  • There are no additional costs to enabling Data Catalog sync for Dataproc Metastore.
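The troubleshooting notes above (waiting 6 hours, then filtering the logs) can be sketched as a few small helpers. This is an illustration only: the `textPayload` pattern comes from the note above, while the added `timestamp` clause, the function names, and the sample log messages are assumptions made for the example. Python's `re` module stands in for the Logging regular-expression engine, which behaves the same way for this simple pattern.

```python
import re
from datetime import datetime, timedelta, timezone

PUBLISH_PATTERN = ".*Publish.*"  # pattern from the troubleshooting note
SYNC_WAIT = timedelta(hours=6)   # recommended wait before checking sync results


def earliest_check_time(enabled_at):
    """Return the earliest time at which checking sync completeness makes sense."""
    return enabled_at + SYNC_WAIT


def build_log_filter(since):
    """Build a Logging filter for publish logs written at or after `since`.

    `since` must be a timezone-aware datetime.
    """
    ts = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'textPayload=~"{PUBLISH_PATTERN}" AND timestamp>="{ts}"'


def matches_publish(text_payload):
    """Check locally whether a log line would match the publish pattern."""
    return re.search(PUBLISH_PATTERN, text_payload) is not None


enabled_at = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical time
print(earliest_check_time(enabled_at).isoformat())
print(build_log_filter(enabled_at))
print(matches_publish("Publish request completed"))  # hypothetical log message
```

The filter string is meant to be pasted into the Logs Explorer query box; narrowing by timestamp keeps the results focused on the period after sync was enabled.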

What's next?