Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.
Why do you need a Data Catalog?
Most organizations today are dealing with a large and growing number of data assets.
Data stakeholders (consumers, producers, and administrators) within an organization face a number of challenges:
Searching for insightful data:
- Data consumers don't know what data is where. They have to navigate data "swamps" they stumble into.
- Data consumers don't know what data to use to get insights because most data is not well documented and, even if documented, is not well maintained.
- Data can't be found and is often lost when it resides only in people's minds.
Understanding data:
- Is the data fresh, clean, validated, and approved for use in production?
- Which data set out of several duplicate sets is relevant and up-to-date?
- How does one data set relate to another?
- Who is using the data and who is the owner?
- Who and what processes are transforming the data?
Making data useful:
- Data producers don't have an efficient way to put forward their data for consumers.
- If there's no self-service, consumers may overwhelm producers. A handful of data engineers can't manually provide data to thousands of data analysts.
- Valuable time is lost if data consumers have to find out how to request data access, request it, wait without a defined response time, escalate, and wait again.
Without the right tools, the above challenges together become a major obstacle to the efficient use of data. Data Catalog provides a centralized place that allows organizations to:
- Gain a unified view to reduce the pain of searching for the right data.
- Enrich data with technical and business metadata to allow data-driven decision making and accelerate time to insight.
- Improve data management to increase operational efficiency and productivity.
- Take ownership over the data to improve trust and confidence in it.
Using Data Catalog
There are two main ways you interact with Data Catalog: searching for data assets that you have access to, and tagging assets with metadata.
In addition, Data Catalog can leverage the results of a Cloud Data Loss Prevention (DLP) scan to identify sensitive data directly within Data Catalog in the form of tag templates.
How Data Catalog works
Data Catalog can catalog the native metadata on data assets from the following Google Cloud system sources:
- BigQuery datasets, tables, and views
- Pub/Sub topics
- Dataproc Metastore services, databases, and tables
You can also use Data Catalog APIs to create and manage entries for custom data resource types.
After your data is catalogued, you can add your own metadata to these assets using tags.
Technical and business metadata
Data Catalog handles two types of metadata: technical metadata and business metadata. To understand the difference, see the example Data Catalog entry below:
Technical metadata: Shown under BigQuery Table Details above, this is sourced from the underlying storage system where the data asset lives, and includes:
- Project information, such as name and ID
- Asset name and description
- Google Cloud resource labels
- Schema name and description for BigQuery tables and views
Business metadata: Shown under Tags (1) above, this is user-generated metadata applied to the asset using Data Catalog tags. Business metadata is always linked to a technical metadata entry.
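The split between the two metadata types can be pictured with a small in-memory model. This is only an illustrative sketch in plain Python, not the Data Catalog API: the class names (TechnicalMetadata, Tag, Entry), the attach_tag helper, and the sample values are all hypothetical. It assumes that technical metadata is treated as read-only and that every tag refers back to exactly one entry.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class TechnicalMetadata:
    """Read-only facts mirrored from the source system (hypothetical shape)."""
    project_id: str
    asset_name: str
    description: str
    labels: Mapping[str, str]
    schema: Mapping[str, str]  # column name -> column description


@dataclass
class Tag:
    """User-generated business metadata; always points at one entry."""
    entry_name: str
    template: str
    fields: dict


@dataclass
class Entry:
    name: str
    technical: TechnicalMetadata
    tags: list = field(default_factory=list)

    def attach_tag(self, template: str, fields: dict) -> Tag:
        # The tag records the entry it belongs to, so it cannot exist on its own.
        tag = Tag(entry_name=self.name, template=template, fields=dict(fields))
        self.tags.append(tag)
        return tag


# Hypothetical BigQuery table entry.
entry = Entry(
    name="bigquery/my-project/sales/orders",
    technical=TechnicalMetadata(
        project_id="my-project",
        asset_name="orders",
        description="One row per customer order",
        labels=MappingProxyType({"env": "prod"}),
        schema=MappingProxyType(
            {"order_id": "Primary key", "total": "Order total in USD"}
        ),
    ),
)
tag = entry.attach_tag("data_governance", {"owner": "sales-eng", "has_pii": False})
```

In this model, assigning to a field of entry.technical raises an error because the dataclass is frozen, while tags can be added freely. That mirrors the idea that business metadata is yours to manage and technical metadata belongs to the source system.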
Search and discovery
Data Catalog offers powerful, structured search capabilities and predicate-based filtering over both the technical and business metadata for a data asset. To search for and discover a data asset, you must be able to read its metadata. Data Catalog indexes the metadata that describes an asset, not the data within the asset.
Data Catalog controls some metadata, such as user-generated tags. For all metadata sourced from the underlying storage system, Data Catalog is a read-only service that reflects the metadata and permissions provided by that system. To add, remove, or update an asset's native metadata, make the change in the underlying storage system.
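The search behavior described above can be sketched as a toy in-memory filter. This is a simplified model under stated assumptions, not Data Catalog's search syntax or implementation: the CatalogEntry class, the search function, its keyword predicates (system, tag), and the sample principals are hypothetical. It shows three ideas from the text: only metadata is matched, predicates narrow the results, and entries whose metadata the caller cannot read are never returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    system: str          # e.g. "bigquery" or "pubsub"
    description: str
    tags: tuple          # business metadata as (key, value) pairs
    readers: frozenset   # principals allowed to read the metadata


def search(entries, principal, text="", **predicates):
    """Return names of entries `principal` may read that match the query.

    Only metadata is inspected; the rows or messages inside the asset
    are never read.
    """
    results = []
    for e in entries:
        if principal not in e.readers:
            continue  # no metadata read permission -> not discoverable
        haystack = f"{e.name} {e.description}".lower()
        if text.lower() not in haystack:
            continue
        if "system" in predicates and e.system != predicates["system"]:
            continue
        if "tag" in predicates and predicates["tag"] not in e.tags:
            continue
        results.append(e.name)
    return results


# Hypothetical catalog contents.
entries = [
    CatalogEntry("sales.orders", "bigquery", "Customer orders",
                 (("has_pii", "false"),), frozenset({"alice", "bob"})),
    CatalogEntry("sales.customers", "bigquery", "Customer master data",
                 (("has_pii", "true"),), frozenset({"alice"})),
    CatalogEntry("order-events", "pubsub", "Stream of order events",
                 (), frozenset({"alice", "bob"})),
]
```

With this sample data, search(entries, "bob", text="customer") returns only sales.orders, because bob cannot read the metadata for sales.customers. The same query for alice returns both BigQuery tables, and adding tag=("has_pii", "true") narrows her results to sales.customers.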
For a given project, Data Catalog automatically catalogs the following Google Cloud assets:
- BigQuery datasets, tables, and views
- Pub/Sub topics
- (Preview) Dataproc Metastore services, databases, and tables
In addition to cataloging assets within the project IDs you have metadata access to, Data Catalog can catalog data stored in the BigQuery projects that contain public datasets.
To catalog metadata from non-Google Cloud systems in your organization, you can use the following:
- Community-contributed connectors to a number of popular on-premises data sources
- The Data Catalog APIs, called manually to create custom entries
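As a rough illustration of the custom-entry route, the sketch below assembles a JSON-style request body for an asset that lives outside Google Cloud. It makes no network call. The field names (displayName, userSpecifiedSystem, userSpecifiedType, linkedResource, schema.columns) follow the JSON form of the Data Catalog entry resource as understood here, and the identifier rule is likewise an assumption, so verify both against the API reference before relying on them. The build_custom_entry helper, the MySQL host name, and the column values are hypothetical.

```python
import re

# Assumed rule for user-specified system and type names: starts with a letter
# or underscore, then letters, digits, or underscores, at most 64 characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def build_custom_entry(system, entry_type, display_name, linked_resource, columns):
    """Build a JSON-style body describing an asset in a non-Google Cloud system.

    `columns` is a list of (name, type, description) tuples.
    """
    for label, value in (("system", system), ("entry_type", entry_type)):
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    return {
        "displayName": display_name,
        "userSpecifiedSystem": system,
        "userSpecifiedType": entry_type,
        "linkedResource": linked_resource,
        "schema": {
            "columns": [
                {"column": name, "type": col_type, "description": desc}
                for name, col_type, desc in columns
            ]
        },
    }


# Hypothetical on-premises MySQL table.
body = build_custom_entry(
    system="onprem_mysql",
    entry_type="table",
    display_name="inventory",
    linked_resource="//mysql.example.internal/warehouse/inventory",
    columns=[
        ("sku", "VARCHAR", "Stock keeping unit"),
        ("qty", "INT", "Units on hand"),
    ],
)
```

Building the body separately from sending it keeps the sketch testable offline. In practice you would pass equivalent values to a Data Catalog client library or REST call of your choice.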
Documenting data assets at scale is difficult, especially when different groups within an organization consume the data. Each group can have its own set of documentation for describing data assets. Data Catalog tag templates help you create and manage common metadata about data assets in a single location. Tags are attached to the data asset, so the asset can be discovered in Data Catalog through its tags. You can also build applications that consume this contextual metadata about a data asset and take further actions.
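The role of a tag template as a shared contract can be sketched in plain Python. This is a conceptual model only, not the Data Catalog tag template API: TemplateField, GOVERNANCE_TEMPLATE, make_tag, and the field names are hypothetical, and Python types stand in for the service's own field types. The point is that every group fills in the same fields, with the same types, so downstream applications can rely on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateField:
    kind: type
    required: bool = False


# A hypothetical template shared by every group that documents data assets.
GOVERNANCE_TEMPLATE = {
    "owner": TemplateField(str, required=True),
    "has_pii": TemplateField(bool, required=True),
    "retention_days": TemplateField(int),
}


def make_tag(template, values):
    """Validate `values` against `template` and return the tag as a dict."""
    unknown = set(values) - set(template)
    if unknown:
        raise ValueError(f"fields not in template: {sorted(unknown)}")
    missing = [k for k, f in template.items() if f.required and k not in values]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, value in values.items():
        expected = template[key].kind
        # bool is a subclass of int in Python, so compare exact types.
        if type(value) is not expected:
            raise TypeError(f"{key!r} must be {expected.__name__}")
    return dict(values)


def needs_review(tag):
    """Example downstream action: flag assets that hold personal data."""
    return tag["has_pii"]
```

For example, make_tag(GOVERNANCE_TEMPLATE, {"owner": "sales-eng", "has_pii": True}) returns a valid tag that needs_review flags, while omitting owner or passing has_pii="yes" raises an error before the tag is ever attached.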
How to interact with Data Catalog
- To get started with Data Catalog, see the Quickstart.
- To integrate your data sources, follow the steps in Integrate Google Cloud and on-premises data sources.