Data Catalog overview

Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.

Why do you need a Data Catalog?

Most organizations today are dealing with a large and growing number of data assets.

Data stakeholders (consumers, producers, and administrators) within an organization face a number of challenges:

  • Searching for insightful data:

    • Data consumers don't know what data is where. They have to navigate data "swamps" they stumble into.
    • Data consumers don't know what data to use to get insights because most data is not well documented and, even if documented, is not well maintained.
    • Data can't be found and is often lost when it resides only in people's minds.
  • Understanding data:

    • Is the data fresh, clean, validated, approved for use in production?
    • Which data set out of several duplicate sets is relevant and up-to-date?
    • How does one data set relate to another?
    • Who is using the data and who is the owner?
    • Who and what processes are transforming the data?
  • Making data useful:

    • Data producers don't have an efficient way to put forward their data for consumers. If there's no self-service, consumers may overwhelm producers. A handful of data engineers can't manually provide data to thousands of data analysts.

    • Valuable time is lost if data consumers have to find out how to request data access, request it, wait without a defined response time, escalate, and wait again.

Without the right tools, the above challenges together become a major obstacle to the efficient use of data. Data Catalog provides a centralized place that allows organizations to:

  • Gain a unified view to reduce the pain of searching for the right data.
  • Enrich data with technical and business metadata to allow data-driven decision making and accelerate time to insight.
  • Improve data management to increase operational efficiency and productivity.
  • Take ownership over the data to improve trust and confidence in it.

Using Data Catalog

There are two main ways you interact with Data Catalog:

  • Searching for data assets that you have access to
  • Tagging assets with metadata

In addition, Data Catalog integrates with Cloud Data Loss Prevention (DLP), whose auto-tagging mechanism can automatically identify and tag sensitive data.

How Data Catalog works

Data Catalog can catalog the native metadata on data assets from the following Google Cloud storage system sources:

  • BigQuery datasets, tables, and views
  • Pub/Sub topics
  • Dataproc Metastore services, databases, and tables

You can also use Data Catalog APIs to create and manage entries for custom data resource types.
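
As a rough illustration of a custom entry, the following Python sketch assumes the google-cloud-datacatalog client library is installed and credentials are configured. The helper names, the example system and type values, and the choice to reuse the entry group ID as its display name are illustrative assumptions, not part of Data Catalog itself.

```python
def custom_entry_payload(display_name, system, entry_type, linked_resource,
                         description=""):
    """Collect the fields that describe a custom (user-specified) entry."""
    return {
        "display_name": display_name,
        "description": description,
        "user_specified_system": system,    # e.g. "onprem_db" (hypothetical)
        "user_specified_type": entry_type,  # e.g. "table" (hypothetical)
        "linked_resource": linked_resource,
    }


def create_custom_entry(project_id, location, entry_group_id, entry_id, payload):
    """Create an entry group, then a custom entry inside it."""
    # Imported lazily so the payload helper works without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    parent = f"projects/{project_id}/locations/{location}"
    group = client.create_entry_group(
        parent=parent,
        entry_group_id=entry_group_id,
        entry_group=datacatalog_v1.EntryGroup(display_name=entry_group_id),
    )
    return client.create_entry(
        parent=group.name,
        entry_id=entry_id,
        entry=datacatalog_v1.Entry(**payload),
    )
```

Keeping the payload separate from the API call makes it easy to review what will be written to the catalog before any request is sent.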

After your data is cataloged, you can add your own metadata to these assets using tags.

Technical and business metadata

Data Catalog handles two types of metadata: technical metadata and business metadata. To understand the difference, consider an example Data Catalog entry for a BigQuery table:

  • Technical metadata: Shown under BigQuery Table Details in the entry, this is sourced from the underlying storage system where the data asset lives, and includes:

    • Project information, such as name and ID
    • Asset name and description
    • Google Cloud resource labels
    • Schema name and description for BigQuery tables and views
  • Business metadata: Shown under Tags in the entry, this is user-generated metadata applied to the asset using Data Catalog tags. Business metadata is always linked to a technical metadata entry.

Search and discovery

Data Catalog offers powerful, structured search capabilities and predicate-based filtering over both the technical and business metadata for a data asset. To search for and discover a data asset, you must be able to read its metadata. Data Catalog indexes only the metadata that describes an asset, not the data within the asset.
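
A predicate-based search might look like the following Python sketch. It assumes the google-cloud-datacatalog client library and Data Catalog's qualifier syntax (for example `type=table` or `column:email`). The helper names are hypothetical, and the rule that only `type` and `system` use `=` is a simplification made for this example.

```python
def build_search_query(keywords="", **qualifiers):
    """Join free-text keywords and qualifier predicates into one query string."""
    parts = [keywords] if keywords else []
    for key, value in qualifiers.items():
        # Simplifying assumption: type and system match exactly with "=",
        # every other qualifier uses ":".
        op = "=" if key in ("type", "system") else ":"
        parts.append(f"{key}{op}{value}")
    # Predicates separated by spaces are combined with a logical AND.
    return " ".join(parts)


def search(project_id, query):
    """Return the resource names of entries in one project that match the query."""
    # Imported lazily so the query builder works without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope(
        include_project_ids=[project_id]
    )
    results = client.search_catalog(request={"scope": scope, "query": query})
    return [result.relative_resource_name for result in results]
```

For example, `build_search_query("orders", type="table", system="bigquery")` produces a query that looks for BigQuery tables matching the keyword "orders". Results include only assets whose metadata the caller is allowed to read.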

Data Catalog controls some metadata, such as user-generated tags. For all metadata sourced from the underlying storage system, however, Data Catalog is a read-only service that reflects the metadata and permissions provided by that system. To add, remove, or update an asset's native metadata, make the change in the underlying storage system.

For a given project, Data Catalog automatically catalogs the following assets:

  • BigQuery datasets, tables, views, and external tables in Cloud Storage, Cloud Bigtable, or Google Sheets
  • Pub/Sub topics
  • Dataproc Metastore services, databases, and tables

In addition to cataloging assets within the project IDs you have metadata access to, Data Catalog can catalog data stored in the BigQuery projects that contain public datasets.

Tags

Documenting data assets at a large scale is difficult, especially when the data is consumed by different groups within an organization. Each group can have its own set of documentation for describing data assets. Data Catalog tag templates help you create and manage common metadata about data assets in a single location. Tags are attached to the data asset, which means the asset can be discovered through them in Data Catalog. Using this feature, you can also build additional applications that consume this contextual metadata about a data asset and take further actions.
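
Attaching a tag to an existing entry could be sketched in Python as follows. This assumes the google-cloud-datacatalog client library, an existing tag template, and an existing entry. The helper names and the mapping from Python values to tag field types are illustrative assumptions.

```python
def tag_field_kwargs(value):
    """Pick the tag field attribute that matches a Python value's type."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return {"bool_value": value}
    if isinstance(value, (int, float)):
        return {"double_value": float(value)}
    return {"string_value": str(value)}


def attach_tag(entry_name, template_name, values):
    """Create a tag on an entry from a {field_id: value} mapping."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    tag = datacatalog_v1.Tag(template=template_name)
    for field_id, value in values.items():
        # Each field ID must already be defined in the tag template.
        tag.fields[field_id] = datacatalog_v1.TagField(**tag_field_kwargs(value))
    return client.create_tag(parent=entry_name, tag=tag)
```

Because every tag is created from a template, different groups end up describing assets with the same field names and types.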

How to interact with Data Catalog

You can access Data Catalog by using the Cloud Console, the gcloud command-line interface (CLI), or the Data Catalog APIs, either directly or through the Cloud Client Libraries.
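
As one example of the client library route, the Python sketch below looks up the catalog entry for a BigQuery table. It assumes the google-cloud-datacatalog library is installed and that the caller can read the table's metadata. The helper names are hypothetical.

```python
def bigquery_table_resource(project_id, dataset_id, table_id):
    """Build the full resource name Data Catalog uses to identify a BigQuery table."""
    return (
        "//bigquery.googleapis.com/"
        f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
    )


def lookup_table_entry(project_id, dataset_id, table_id):
    """Fetch the Data Catalog entry that describes a BigQuery table."""
    # Imported lazily so the name helper works without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    resource = bigquery_table_resource(project_id, dataset_id, table_id)
    return client.lookup_entry(request={"linked_resource": resource})
```

The returned entry carries the technical metadata for the table, and its name can serve as the parent when you attach tags.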

What's next

  • To get started with Data Catalog, see the Quickstart.
  • See the How-to guides for instructions on using Data Catalog features.