Data Catalog overview

Managing data assets can be time-consuming and expensive without the right tools. Data Catalog provides a centralized place where organizations can find, curate, and describe their data assets.

Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.

Using Data Catalog

There are two main ways you interact with Data Catalog:

  • Searching for data assets that you have access to
  • Tagging assets with metadata

In addition, Data Catalog integrates with Cloud Data Loss Prevention (DLP), whose auto-tagging mechanism can automatically identify sensitive data.

How Data Catalog works

Data Catalog can catalog the native metadata on data assets from the following Google Cloud storage system sources:

  • BigQuery datasets, tables, and views
  • Pub/Sub topics

You can also use Data Catalog APIs to create and manage entries for custom data resource types.
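As a rough illustration of the custom-entry workflow, the following Python sketch creates an entry group and then a custom entry with the google-cloud-datacatalog client library. The system name `onprem_hive`, the type `hive_table`, the linked resource, and the helper function names are hypothetical placeholders chosen for this example. Treat it as a sketch under those assumptions and check the request fields against the API reference before relying on it.

```python
def custom_entry_payload(system, entry_type, display_name, linked_resource):
    """Build the body of a custom entry as a plain dict (no API call)."""
    return {
        "user_specified_system": system,    # hypothetical, e.g. "onprem_hive"
        "user_specified_type": entry_type,  # hypothetical, e.g. "hive_table"
        "display_name": display_name,
        "linked_resource": linked_resource,
    }


def create_custom_entry(project_id, location, entry_group_id, entry_id, payload):
    """Create an entry group, then a custom entry inside it.

    Needs the google-cloud-datacatalog package and application credentials,
    so the import is deferred until the function is actually called.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    entry_group = client.create_entry_group(
        request={
            "parent": f"projects/{project_id}/locations/{location}",
            "entry_group_id": entry_group_id,
            "entry_group": {"display_name": entry_group_id},
        }
    )
    return client.create_entry(
        request={
            "parent": entry_group.name,
            "entry_id": entry_id,
            "entry": payload,
        }
    )


payload = custom_entry_payload(
    "onprem_hive", "hive_table", "Orders", "//hive.example.com/warehouse/orders"
)
```

In this sketch the entry group is created on every call, so a second call with the same `entry_group_id` would be expected to fail; a real integration would look up or reuse the existing group.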

After your data is cataloged, you can add your own metadata to these assets using tags.

Technical and business metadata

Data Catalog handles two types of metadata: technical metadata and business metadata. In a Data Catalog entry for a BigQuery table, for example, the two appear as follows:

  • Technical metadata: Shown under BigQuery Table Details in the entry, this is sourced from the underlying storage system where the data asset lives, and includes:

    • Project information, such as name and ID
    • Asset name and description
    • Google Cloud resource labels
    • Schema name and description for BigQuery tables and views
  • Business metadata: Shown under Tags in the entry, this is user-generated metadata applied to the asset using Data Catalog tags. Business metadata is always linked to a technical metadata entry.

Search and discovery

Data Catalog offers structured search and predicate-based filtering over both the technical and business metadata for a data asset. To search for and discover a data asset, you must be able to read its metadata. Data Catalog indexes only the metadata that describes an asset, not the data within it.
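To make the search predicates concrete, here is a minimal Python sketch that assembles a query string and runs it through the google-cloud-datacatalog client library. The qualifier forms used (`type=`, `system=`, `tag:`) are assumed from the Data Catalog search syntax and should be confirmed against the search reference; the function names and the sample values are hypothetical.

```python
def build_search_query(keyword=None, entry_type=None, system=None, tag_filter=None):
    """Join optional predicates with spaces, which search treats as AND.

    The qualifier spellings are assumptions about the search syntax.
    """
    parts = []
    if keyword:
        parts.append(keyword)
    if entry_type:
        parts.append(f"type={entry_type}")
    if system:
        parts.append(f"system={system}")
    if tag_filter:
        parts.append(f"tag:{tag_filter}")
    return " ".join(parts)


def search_catalog(project_id, query):
    """Return the resource names of entries matching the query.

    Only assets whose metadata the caller can read are returned. The import
    is deferred because it needs google-cloud-datacatalog and credentials.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)
    results = client.search_catalog(request={"scope": scope, "query": query})
    return [result.relative_resource_name for result in results]


query = build_search_query(keyword="orders", entry_type="table", system="bigquery")
```

Here `query` holds the string `orders type=table system=bigquery`, which could then be passed to `search_catalog` together with a project ID.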

Data Catalog controls some metadata, such as user-generated tags. For all metadata sourced from the underlying storage system, however, Data Catalog is a read-only service that reflects the metadata and permissions provided by that system. To add, remove, or update an asset's native metadata, make the change in the underlying storage system.

For a given project, Data Catalog automatically catalogs all BigQuery datasets, tables, and views, as well as external tables in Cloud Storage, Cloud Bigtable, or Google Sheets. It also automatically catalogs the Pub/Sub topics in that project.

In addition to cataloging assets in the projects whose metadata you can access, Data Catalog can catalog data stored in the BigQuery projects that contain public datasets.

Tags

Documenting data assets at scale is difficult, especially when different groups within an organization consume the same data. Each group can have its own set of documentation for describing data assets. Data Catalog tag templates help you create and manage common metadata about data assets in a single location. Tags created from a template are attached to the data asset, so the asset can be discovered through that metadata in Data Catalog. You can also build applications that consume this contextual metadata and take further action.
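The relationship between a tag template and the tags built from it can be pictured with a small in-memory model. The Python sketch below is purely illustrative and does not call or mirror the Data Catalog API: the class names, the field kinds, and the `data_governance` template are all hypothetical. It shows only the idea that a template fixes which fields a tag may carry and what values they accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateField:
    """One field of a hypothetical tag template."""
    field_id: str
    kind: str  # one of "string", "double", "bool", "enum"
    required: bool = False
    allowed_values: tuple = ()  # only used when kind == "enum"


def _matches(spec, value):
    """Check a single value against the field's declared kind."""
    if spec.kind == "string":
        return isinstance(value, str)
    if spec.kind == "double":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec.kind == "bool":
        return isinstance(value, bool)
    if spec.kind == "enum":
        return value in spec.allowed_values
    return False


@dataclass
class TagTemplate:
    """A shared definition that every tag of this kind must follow."""
    template_id: str
    template_fields: dict  # field_id -> TemplateField

    def validate(self, values):
        """Return a list of problems; an empty list means the tag is valid."""
        errors = []
        for field_id, spec in self.template_fields.items():
            if spec.required and field_id not in values:
                errors.append(f"missing required field: {field_id}")
        for field_id, value in values.items():
            spec = self.template_fields.get(field_id)
            if spec is None:
                errors.append(f"unknown field: {field_id}")
            elif not _matches(spec, value):
                errors.append(f"wrong type for field: {field_id}")
        return errors


template = TagTemplate(
    "data_governance",
    {
        "owner": TemplateField("owner", "string", required=True),
        "has_pii": TemplateField("has_pii", "bool"),
        "tier": TemplateField("tier", "enum", allowed_values=("gold", "silver")),
    },
)
good_tag = {"owner": "analytics-team", "has_pii": True, "tier": "gold"}
bad_tag = {"has_pii": "yes", "team": "finance"}
```

Because every group fills in the same template, tags stay comparable across teams, which is what makes them useful as search predicates and as input to downstream applications.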

How to interact with Data Catalog

You can access Data Catalog through the Cloud Console, the gcloud command-line interface (CLI), or the Data Catalog APIs, which you can call directly or through the Cloud Client Libraries.
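As one example of the client-library route, the following Python sketch looks up the Data Catalog entry for a BigQuery table by its linked resource name, using the google-cloud-datacatalog package. The function names and the sample project, dataset, and table IDs are hypothetical, and the call assumes that application credentials are already configured.

```python
def bigquery_table_linked_resource(project_id, dataset_id, table_id):
    """Build the full resource name that identifies a BigQuery table."""
    return (
        "//bigquery.googleapis.com/"
        f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
    )


def lookup_table_entry(project_id, dataset_id, table_id):
    """Fetch the catalog entry that describes the given BigQuery table.

    The import is deferred because it needs google-cloud-datacatalog
    and credentials, which are only required when the call is made.
    """
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    return client.lookup_entry(
        request={
            "linked_resource": bigquery_table_linked_resource(
                project_id, dataset_id, table_id
            )
        }
    )


resource = bigquery_table_linked_resource("my-project", "sales", "orders")
```

The returned entry carries the technical metadata for the table, such as its name, description, and schema, subject to the caller's read permissions.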

What's next

  • To get started with Data Catalog, see the Quickstart.
  • See the How-to guides for instructions on using Data Catalog features.