Dataplex Catalog overview

This document describes Dataplex Catalog, which provides a platform for storing, managing, and accessing your metadata.

Dataplex Catalog provides a unified inventory of Google Cloud resources, such as BigQuery, and other resources, such as on-premises resources. Dataplex Catalog automatically retrieves metadata for Google Cloud resources, and you bring metadata for third-party resources into Dataplex Catalog.

Dataplex Catalog lets you enrich your inventory with additional business and technical metadata to capture the context and knowledge about your resources. With Dataplex Catalog, you can search and discover your data across the organization and enable data governance over your data assets.

You can set your default catalog experience to Dataplex Catalog. If you're using Data Catalog, you can transition your Data Catalog content and usage to Dataplex Catalog. For more information, see Transition from Data Catalog to Dataplex Catalog.

Use cases

You can use Dataplex Catalog to do the following:

  • Discover and understand your data. Dataplex Catalog provides visibility over your data resources across the organization. It lets you find the resources that are relevant to your data consumption needs. It also provides context for those resources, which helps you judge whether they suit your data consumers' needs.

  • Enable data governance and data management. Dataplex Catalog supplies metadata that can inform and power your data governance and data management capabilities.

  • Maintain an extensible and comprehensive repository for your metadata. Dataplex Catalog stores and provides access to metadata that is automatically harvested from your Google Cloud resources. You can integrate your own metadata from non-Google Cloud systems. You can enrich all metadata with additional business and technical metadata annotations.

How Dataplex Catalog works

Dataplex Catalog is based on the following concepts:

  • Entry: An entry represents a data asset. Most of the metadata is described by aspects within an entry. This is similar to entries in Data Catalog. For more information, see Entries.

  • Aspect: An aspect is a set of related metadata fields within an entry. An aspect can be interpreted either as a building block of an entry or as additional metadata attached to it. This is similar to tags in Data Catalog. However, aspects are stored within entries and not as standalone resources. For more information, see Aspects.

  • Aspect type: An aspect type is a reusable template for aspects. Every aspect is an instance of an aspect type. This is similar to tag templates in Data Catalog. For more information, see Aspect types.

  • Entry group: An entry group is a container for entries that serves as a unit of management for these entries. For example, use an entry group to configure Identity and Access Management access control, project attribution, or location for the entries in the entry group. This is similar to entry groups in Data Catalog. For more information, see Entry groups.

  • Entry type: An entry type is a template for creating entries. It establishes the essential metadata elements, outlined as a list of required aspects for entries of this type. For more information, see Entry types.

    Figure 1. Entries and entry groups
    Figure 2. Aspect types and entry types
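
The relationships between these concepts can be sketched in a few lines of Python. This is only an illustrative model, not the Dataplex API: the class names, field names, and the missing_required_aspects helper are hypothetical and chosen for this example. It shows an entry type listing its required aspect types, and an entry storing its aspects inside itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AspectType:
    """Reusable template for aspects (hypothetical model, not the Dataplex API)."""
    name: str
    fields: tuple[str, ...]


@dataclass(frozen=True)
class EntryType:
    """Template for entries: names the aspect types every entry must carry."""
    name: str
    required_aspect_types: tuple[str, ...]


@dataclass
class Entry:
    """A data asset; its metadata lives in aspects stored inside the entry."""
    name: str
    entry_type: EntryType
    # Maps an aspect type name to that aspect's field values.
    aspects: dict[str, dict[str, object]] = field(default_factory=dict)

    def missing_required_aspects(self) -> list[str]:
        """Return required aspect type names that the entry doesn't have yet."""
        return [t for t in self.entry_type.required_aspect_types
                if t not in self.aspects]


@dataclass
class EntryGroup:
    """Container for entries; the unit for access control and location."""
    name: str
    location: str
    entries: list[Entry] = field(default_factory=list)


schema = AspectType(name="schema", fields=("columns",))
ownership = AspectType(name="ownership", fields=("owner", "team"))
table_type = EntryType(name="custom-table", required_aspect_types=(schema.name,))

orders = Entry(name="orders", entry_type=table_type)
group = EntryGroup(name="sales", location="us-central1", entries=[orders])

print(orders.missing_required_aspects())  # ['schema']
orders.aspects[schema.name] = {"columns": ["id", "amount"]}
# An aspect that the entry type doesn't require is simply an optional aspect.
orders.aspects[ownership.name] = {"owner": "data-team", "team": "sales"}
print(orders.missing_required_aspects())  # []
```

Keeping aspects in a dictionary on the entry mirrors the point that aspects are stored within entries rather than as standalone resources.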

The following are some of the use cases for Dataplex Catalog:

  • As a data analyst or a business analyst, you can search entries across the organization and explore metadata that is associated with the entries. For more information, see Search for data assets.
  • As a data owner or a data governor, you can capture additional technical and business metadata by annotating your entries with aspects. For more information, see Manage aspects and enrich metadata.
  • As a data owner or a data governor, you can bring consistency into your metadata by defining the standards for annotation (using aspect types) and custom entries (using entry types). For more information, see Manage aspects and enrich metadata.
  • As a data engineer, you can have a unified inventory for your resources, including Google Cloud resources and resources from third-party systems. Dataplex Catalog automatically harvests metadata for Google Cloud resources, and you ingest metadata for non-Google Cloud resources yourself. For more information, see Manage entries and ingest custom sources.

For existing Data Catalog users

If you're already using Data Catalog, note the following:

  • Custom entries, overview context, and entry groups that you created in Data Catalog are made available in Dataplex Catalog.
  • As an administrator, you can choose to make the content of Data Catalog tag templates and tags simultaneously available in Dataplex Catalog. For more information, see Transition from Data Catalog to Dataplex Catalog.
  • When you search for data assets in Dataplex Catalog, both the metadata that was created in Dataplex Catalog directly and the metadata that was brought from Data Catalog into Dataplex Catalog are included.
  • When you search for data assets in Data Catalog, only the metadata that was created in Data Catalog is included.
  • The entry group descriptions in Data Catalog that exceed 1024 characters are truncated to 1024 characters in Dataplex Catalog.
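
Before transitioning, it can help to know which entry group descriptions would lose text. The following is a minimal sketch under stated assumptions: the function names and the sample data are hypothetical, and it counts Python string characters, which might differ from how the service counts characters.

```python
MAX_DESCRIPTION_CHARS = 1024  # limit stated above for Dataplex Catalog


def truncated_description(description: str) -> str:
    """Return the description cut to the first 1024 characters."""
    return description[:MAX_DESCRIPTION_CHARS]


def groups_losing_text(descriptions: dict[str, str]) -> list[str]:
    """List entry group names whose descriptions exceed the limit."""
    return [name for name, text in descriptions.items()
            if len(text) > MAX_DESCRIPTION_CHARS]


# Hypothetical sample: entry group name -> Data Catalog description.
descriptions = {"sales": "short note", "finance": "x" * 1500}
print(groups_losing_text(descriptions))                     # ['finance']
print(len(truncated_description(descriptions["finance"])))  # 1024
```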

If you want to transition your Data Catalog content and usage to Dataplex Catalog, see Transition from Data Catalog to Dataplex Catalog.

Dataplex Catalog versus Data Catalog

Dataplex Catalog provides a capability for managing your metadata in Dataplex. It comes with separate metadata storage and a new set of API methods that are integrated into the Dataplex API.

The main features of Dataplex Catalog include the following:

  • More robust metamodel

    • Typed entries. You can enforce minimal metadata standards by defining the required metadata content for custom entries.
    • User-configurable metamodel for custom entries, which helps to make custom ingestion more robust and improves custom metadata consistency and comprehensiveness.
    • Support for a wider variety and complexity of metadata, including support for nesting structures like lists, maps, and arrays.
  • Improved scalability, including the ability to interact with all metadata that is associated with an entry through single atomic CRUD operations and the ability to fetch multiple metadata annotations for an entry in search or list responses.

The following table compares the features of Dataplex Catalog and Data Catalog:

Comparison between Dataplex Catalog and Data Catalog

Feature: Supported Google Cloud sources

  • Dataplex Catalog: All sources as described in the Supported Google Cloud sources section of this document.
  • Data Catalog: All sources described in Entries and entry groups.

Feature: Custom sources ingestion

  • Dataplex Catalog: Ingestion into custom entries with governed structure, defined by entry types. Data Catalog custom entries and entry groups are made available in Dataplex Catalog under the generic entry type.
  • Data Catalog: Ingestion into generic custom entries.

Feature: Metadata enrichment

  • Dataplex Catalog: Metadata context for entries is captured using aspects and aspect types.
  • Data Catalog: Metadata context for entries is captured using tags and tag templates.

Feature: Search

  • Dataplex Catalog: Search is performed over the following:

    • All Google Cloud sources described in Supported Google Cloud sources
    • Custom entries that are created in Dataplex Catalog
    • Aspects that are created in Dataplex Catalog
    • Custom entries that are created in Data Catalog and are brought into Dataplex Catalog

    The search results include only those resources that belong to the same VPC-SC perimeter as the project under which search is performed. When using the Google Cloud console, this is the project that is selected in the console.

    Note that, to search for entries, you need at least one of the Dataplex Catalog IAM roles on the project that is used for search. Permissions on search results are checked independently of the selected project.

  • Data Catalog: Search is performed over the following:

    • All Google Cloud sources described in Entries and entry groups
    • Custom entries that are created in Data Catalog
    • Tags that are created in Data Catalog

The following table describes how Dataplex Catalog resources correspond to Data Catalog resources:

Mapping between Dataplex Catalog and Data Catalog resources

  • Dataplex Catalog resource: Aspect type (global)
    Data Catalog resource: Public tag template
    Description: Tag templates are regional resources. However, you can use them to create tags across regions. Tag templates correspond to global aspect types in Dataplex Catalog.

  • Dataplex Catalog resource: Optional aspect
    Data Catalog resource: Public tag
    Description: Public tags in Data Catalog correspond to optional aspects in Dataplex Catalog.

  • Dataplex Catalog resource: Entry group
    Data Catalog resource: Entry group
    Description: For Google Cloud sources, system entry groups such as @bigquery are established per-project in Dataplex Catalog.

  • Dataplex Catalog resource: Custom entry required aspects
    Data Catalog resource: Custom entry
    Description: Data Catalog and Dataplex Catalog share similar concepts for custom entries. Standard entry properties are modeled as required aspects in Dataplex Catalog.

  • Dataplex Catalog resource: System entry required aspects
    Data Catalog resource: System (Google Cloud) entry
    Description: Metadata describing built-in entities, such as Schema for BigQuery tables, is captured in required aspects of the system-defined aspect types.
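
If you maintain scripts or internal notes that refer to Data Catalog resource kinds, the correspondence above can be kept as a small lookup. This is a sketch only: the dictionary and the dataplex_counterpart function are hypothetical names, and the strings are copied from the mapping above rather than taken from any API.

```python
from typing import Optional

# Data Catalog resource -> Dataplex Catalog counterpart, copied from the mapping above.
DATA_CATALOG_TO_DATAPLEX = {
    "Public tag template": "Aspect type (global)",
    "Public tag": "Optional aspect",
    "Entry group": "Entry group",
    "Custom entry": "Custom entry required aspects",
    "System (Google Cloud) entry": "System entry required aspects",
}


def dataplex_counterpart(data_catalog_resource: str) -> Optional[str]:
    """Return the Dataplex Catalog counterpart, or None if the mapping has no row."""
    return DATA_CATALOG_TO_DATAPLEX.get(data_catalog_resource)


print(dataplex_counterpart("Public tag"))            # Optional aspect
print(dataplex_counterpart("Private tag template"))  # None (not listed in the mapping)
```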

For more information about the features that are available in Data Catalog and are not supported in Dataplex Catalog, see the Features that are not supported in Dataplex Catalog section in this document.

Supported sources

Metadata from the following Google Cloud sources is automatically ingested into Dataplex Catalog:

  • Analytics Hub exchanges and listings
  • BigQuery datasets, tables, models, routines, connections, and linked datasets
  • Bigtable instances, clusters, and tables (including column family details)
  • Dataform repositories and code assets
  • Cloud SQL instances, databases, schemas, tables, and views (see Enabling the Cloud SQL integration)
  • Dataproc Metastore services, databases, and tables
  • Pub/Sub topics
  • Spanner instances, databases, tables, and views
  • Vertex AI models, datasets, feature groups, feature views, and online store instances

To import metadata from a third-party source into Dataplex Catalog, you can use a managed connectivity pipeline.

Project and location constraints

Dataplex Catalog resources are housed within various projects and locations. The following limitations apply:

  • Location:

    • The location of an entry must either match the location of the entry type, or the entry type must be global.
    • An aspect added to an entry must be based on an aspect type that is stored in the same location as the entry, or the aspect type must be global.
    • An entry type must be composed of aspect types that are stored in the same location as the entry type.
  • Project:

    • If an entry type references custom aspect types, then the aspect types must be in the same location and project as the entry type.
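
The constraints above can be expressed as simple checks. The following sketch is an assumption-laden illustration, not the validation that the service performs: the Resource class, the custom flag, and the function names are hypothetical, and the entry type check follows the strict wording above (same location as the entry type).

```python
from dataclasses import dataclass

GLOBAL = "global"


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a catalog resource's project and location."""
    project: str
    location: str
    custom: bool = True  # False models a system-defined resource


def entry_location_ok(entry: Resource, entry_type: Resource) -> bool:
    """The entry type is in the entry's location, or the entry type is global."""
    return entry_type.location in (entry.location, GLOBAL)


def aspect_location_ok(entry: Resource, aspect_type: Resource) -> bool:
    """The aspect type is in the entry's location, or the aspect type is global."""
    return aspect_type.location in (entry.location, GLOBAL)


def entry_type_composition_ok(entry_type: Resource,
                              aspect_types: list[Resource]) -> bool:
    """Aspect types share the entry type's location; custom ones also share its project."""
    for aspect_type in aspect_types:
        if aspect_type.location != entry_type.location:
            return False
        if aspect_type.custom and aspect_type.project != entry_type.project:
            return False
    return True


entry = Resource("proj-a", "us-central1")
global_type = Resource("proj-a", GLOBAL)
eu_aspect_type = Resource("proj-a", "europe-west1")
print(entry_location_ok(entry, global_type))      # True
print(aspect_location_ok(entry, eu_aspect_type))  # False
```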

Features that aren't supported in Dataplex Catalog

The following features that are available in Data Catalog are not supported in Dataplex Catalog:

  • The notion of private aspects and aspect types isn't supported in Dataplex Catalog. Access to aspects is governed by permissions that are associated with the entry that contains the aspects. For more information, see Dataplex IAM roles.
  • Search for policy tags isn't supported in Dataplex Catalog search; consequently, the predicates policytag and policytagid don't work in the Dataplex Catalog search.
  • For Data Catalog custom entries that are brought into Dataplex Catalog, the existing IAM permissions for your current metadata aren't automatically propagated to copied metadata. You must explicitly configure IAM permissions for the copied metadata before using it.
  • Sending Sensitive Data Protection job results to Dataplex Catalog isn't supported.
  • You can't list entry types and aspect types across projects using the API. You can scope the list request to a project only.
  • You can't attach business glossary terms to the columns of Dataplex entries.
  • You can't modify the list of required aspect types in an entry type after you create the entry type.
  • For entries that were created directly in Dataplex Catalog, data lineage shows lineage events in the Google Cloud console but doesn't display detailed information about the source, target, or process. Also, data lineage doesn't display aspects for any entries in the Google Cloud console.

Pricing

Dataplex uses the metadata storage SKU to charge for metadata storage. For more information, see Dataplex pricing.

There are no charges for the following:

  • Creating and managing Dataplex Catalog resources
  • Search API calls for Dataplex Catalog
  • Search queries performed on the Dataplex Catalog page in the Google Cloud console

What's next