Work with Data Catalog
Data Catalog is a feature of Dataplex that integrates with BigQuery by automatically cataloging metadata about BigQuery resources like tables, datasets, views, and models. This document describes how to search these resources, view data lineage, and add tags by using Data Catalog.
Search for BigQuery resources
To use Data Catalog to search for BigQuery datasets, tables, and starred projects, follow these steps:
In the Google Cloud console, go to the Dataplex Search page.
In the Search field, enter a query, and then click Search.
To refine your search parameters, use the Filters panel. For example, in the Systems section, select the BigQuery checkbox. The results are filtered to BigQuery systems.
You can perform basic searches in Data Catalog through the Google Cloud console. For more information about searching in the Google Cloud console, see Open a public dataset.
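The same kind of search can also be run programmatically. The following is a minimal sketch, assuming that the google-cloud-datacatalog Python client library is installed and that application default credentials are configured. The helper names build_query and search_bigquery_entries are hypothetical, and the project ID you pass in is your own. The system=bigquery qualifier plays the same role as selecting the BigQuery checkbox in the Filters panel.

```python
def build_query(keywords, system="bigquery", entry_type=None):
    """Compose a Data Catalog search string from keywords and qualifiers."""
    parts = [keywords.strip()] if keywords.strip() else []
    if system:
        parts.append(f"system={system}")      # restrict results to one source system
    if entry_type:
        parts.append(f"type={entry_type}")    # for example "table" or "dataset"
    return " ".join(parts)


def search_bigquery_entries(project_id, keywords, entry_type=None):
    """Return the resource names of catalog entries that match the search."""
    # Imported lazily so that build_query works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)
    results = client.search_catalog(
        scope=scope, query=build_query(keywords, entry_type=entry_type)
    )
    return [result.relative_resource_name for result in results]
```

For example, build_query("orders", entry_type="table") produces the string "orders system=bigquery type=table", which you could also paste into the Search field.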
Data lineage
Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it. You can access the data lineage feature directly from BigQuery.
Enabling data lineage in your BigQuery project causes Dataplex to automatically record lineage information for tables created by the following operations:
- Copy jobs.
- Query jobs that use the following data definition language (DDL) or data manipulation language (DML) statements in GoogleSQL:
  - CREATE TABLE (including the CREATE TABLE AS SELECT statement)
  - INSERT
  - UPDATE
  - DELETE
  - MERGE
Before you begin
In this section, you enable the Data Lineage API and grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.
Enable data lineage
- In the Google Cloud console, on the project selector page, select the project that contains the resources for which you want to track lineage.
- Enable the Data Lineage and Data Catalog APIs.
Required IAM roles
Lineage information is tracked automatically when you enable the Data Lineage API.
To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:
- Data Catalog Viewer (roles/datacatalog.viewer) on a Data Catalog resource project.
- Data lineage viewer (roles/datalineage.viewer) on the project where you use systems supported by data lineage.
- BigQuery Metadata Viewer (roles/bigquery.metadataViewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
For more information, see Data lineage roles.
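When you audit access, it can help to keep the three role IDs in one place. The following is a small illustrative sketch in plain Python, not an official tool: the helper names lineage_viewer_bindings and missing_roles are hypothetical, and the binding dictionaries follow the general role and members shape of an IAM policy binding. Each binding still has to be applied on the appropriate project, as described in the list above.

```python
# Role IDs listed in this section for viewing lineage visualization graphs.
LINEAGE_VIEWER_ROLES = [
    "roles/datacatalog.viewer",
    "roles/datalineage.viewer",
    "roles/bigquery.metadataViewer",
]


def lineage_viewer_bindings(member):
    """Build one IAM-style binding per required role for a single member."""
    return [{"role": role, "members": [member]} for role in LINEAGE_VIEWER_ROLES]


def missing_roles(granted_roles):
    """Return the required roles that are absent from granted_roles, in order."""
    granted = set(granted_roles)
    return [role for role in LINEAGE_VIEWER_ROLES if role not in granted]
```

For example, missing_roles(["roles/datacatalog.viewer"]) returns the two remaining role IDs that an administrator would still need to grant.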
View lineage graphs in BigQuery
To view the data lineage visualization graph from BigQuery, follow these steps:
In the Google Cloud console, go to the BigQuery page.
In the Explorer panel, expand your project and dataset, then select a table.
Click the Lineage tab.
Your data lineage visualization graph is displayed.
Optional: Select a node to view additional details about the entities or processes involved in constructing lineage information.
For more information about data lineage, see About data lineage.
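Lineage information can also be read outside the console. The following is a rough sketch, assuming that the google-cloud-datacatalog-lineage Python client library is installed, that the Data Lineage API is enabled, and that BigQuery tables are identified by a fully qualified name of the form bigquery:PROJECT.DATASET.TABLE. The helper names bigquery_fqn and upstream_sources are hypothetical, and the location value depends on where your lineage is recorded. Treat the request fields as assumptions to verify against the client library reference.

```python
def bigquery_fqn(project_id, dataset_id, table_id):
    """Fully qualified name assumed for a BigQuery table in lineage requests."""
    return f"bigquery:{project_id}.{dataset_id}.{table_id}"


def upstream_sources(project_id, location, dataset_id, table_id):
    """List the fully qualified names of entities that feed the given table."""
    # Imported lazily so that bigquery_fqn works without the client library.
    from google.cloud import datacatalog_lineage_v1 as lineage

    client = lineage.LineageClient()
    request = lineage.SearchLinksRequest(
        parent=f"projects/{project_id}/locations/{location}",
        # Searching by target returns links whose destination is this table.
        target=lineage.EntityReference(
            fully_qualified_name=bigquery_fqn(project_id, dataset_id, table_id)
        ),
    )
    return [
        link.source.fully_qualified_name
        for link in client.search_links(request=request)
    ]
```

Each returned name corresponds to a node that appears upstream of the selected table in the lineage graph.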
Tags and tag templates
Tags let organizations create, search, and manage metadata for all their data entries in a unified service.
This section explains two key Data Catalog concepts:
- Tags let you provide context for a data entry by attaching custom metadata fields.
- Tag templates are reusable structures that you can use to rapidly create new tags.
Tags
Data Catalog provides two types of tags: private tags and public tags.
Private tags
Private tags provide strict access controls. You can search or view the tags and the data entries associated with the tags only if you are granted the required view permissions on both the private tag template and the data entries.
Searching for private tags in the Data Catalog page requires that you use the tag: search syntax or the search filters.
Private tags are suitable for scenarios where you need to store some sensitive information in the tag and you want to apply additional access restrictions beyond checking whether the user has the permissions to view the tagged entry.
Public tags
Public tags provide less strict access control for searching and viewing the tag as compared to private tags. Any user who has the required view permissions for a data entry can view all the public tags associated with it. View permissions for public tags are only required when you perform a search in Data Catalog using the tag: syntax or when you view an unattached tag template.
Public tags support both simple search and search with predicates in the Data Catalog search page. When you create a tag template, the option to create a public tag template is the default and recommended option in the Google Cloud console.
For example, let's assume you have a public tag template called employee data that you used to create tags for three data entries called Name, Location, and Salary. Among the three data entries, only members of a specific group called HR can view the Salary data entry. The other two data entries have view permissions for all employees of the company.
If any employee who is not a member of the HR group uses the Data Catalog search page and searches with the word employee, the search result displays only Name and Location data entries with the associated public tags.
Public tags are useful for a broad set of scenarios. Public tags support simple search and search with predicates, while private tags support only search with predicates.
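The employee data example can be modeled with a few lines of plain Python. This is a toy illustration of the visibility rule only, not how Data Catalog is implemented: the Entry class, the simple_search function, and the all-employees group name are all hypothetical. The rule it encodes is that a public tag is returned by a simple search only if the user can view the entry that the tag is attached to.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    viewers: set                      # groups allowed to view the data entry
    public_tags: list = field(default_factory=list)


def simple_search(entries, user_groups, term):
    """Return (entry name, matching tags) pairs that the user may see."""
    term = term.lower()
    hits = []
    for entry in entries:
        if not (entry.viewers & user_groups):
            continue  # no view permission on the entry, so its tags stay hidden
        matching = [tag for tag in entry.public_tags if term in tag.lower()]
        if matching:
            hits.append((entry.name, matching))
    return hits


entries = [
    Entry("Name", {"all-employees"}, ["employee data"]),
    Entry("Location", {"all-employees"}, ["employee data"]),
    Entry("Salary", {"HR"}, ["employee data"]),
]
```

Calling simple_search(entries, {"all-employees"}, "employee") returns only the Name and Location entries, while a user who is also in the HR group sees all three.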
Tag templates
To start tagging metadata, you first need to create one or more tag templates. A tag template can be a public or private tag template. When you create a tag template, the option to create a public tag template is the default and recommended option in the Google Cloud console. A tag template is a group of metadata key-value pairs called fields. Having a set of templates is similar to having a database schema for your metadata.
You can structure your tags by topic. For example:
- A data governance tag with fields for data governor, retention date, deletion date, PII (yes or no), data classification (public, confidential, sensitive, regulatory)
- A data quality tag with fields for quality issues, update frequency, SLO information
- A data usage tag with fields for top users, top queries, average daily users
You can then mix and match tags, using only the tags relevant for each data asset and your business needs.
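As an illustration, the data governance template from the list above could be described as plain data and then created with the client library. This is a sketch under stated assumptions: the google-cloud-datacatalog Python client library is installed, credentials are configured, and the field IDs, field types, and the helper names field_kind and create_tag_template are choices made for this example rather than anything prescribed by Data Catalog.

```python
# Hypothetical field layout: a string names a primitive type,
# and a list gives the allowed values of an enum field.
GOVERNANCE_FIELDS = {
    "data_governor": "STRING",
    "retention_date": "TIMESTAMP",
    "deletion_date": "TIMESTAMP",
    "has_pii": "BOOL",
    "data_classification": ["PUBLIC", "CONFIDENTIAL", "SENSITIVE", "REGULATORY"],
}


def field_kind(spec):
    """Classify a field spec as 'enum' (list of values) or 'primitive'."""
    return "enum" if isinstance(spec, list) else "primitive"


def create_tag_template(project_id, location, template_id, display_name, fields):
    """Create a tag template from a field layout like GOVERNANCE_FIELDS."""
    # Imported lazily so the layout can be inspected without the client library.
    from google.cloud import datacatalog_v1

    template = datacatalog_v1.TagTemplate()
    template.display_name = display_name
    for field_id, spec in fields.items():
        template.fields[field_id] = datacatalog_v1.TagTemplateField()
        template.fields[field_id].display_name = field_id
        if field_kind(spec) == "enum":
            for value in spec:
                template.fields[field_id].type_.enum_type.allowed_values.append(
                    datacatalog_v1.FieldType.EnumType.EnumValue(display_name=value)
                )
        else:
            template.fields[field_id].type_.primitive_type = getattr(
                datacatalog_v1.FieldType.PrimitiveType, spec
            )

    client = datacatalog_v1.DataCatalogClient()
    return client.create_tag_template(
        parent=f"projects/{project_id}/locations/{location}",
        tag_template_id=template_id,
        tag_template=template,
    )
```

Keeping the layout as plain data makes it easy to define the data quality and data usage templates the same way and to review all of them together, much like a schema.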
View the tag template gallery
To help you get started, Data Catalog includes a gallery of sample tag templates to illustrate common tagging use cases. Use these examples to learn about the power of tagging, for inspiration, or as a starting point for creating your own tagging infrastructure.
To use the tag template gallery, follow these steps:
In the Google Cloud console, go to the Dataplex Tag templates page.
Click Create tag template.
The template gallery is displayed as part of the Create template page.
After you select a template from the gallery, you can use it just like any other tag template. You can add or delete attributes and change anything in the template to suit your business needs. You can then search for the template fields and values using Data Catalog.
For more information about tags and tag templates, see Tags and tag templates.
Regional resources
Every tag template and tag is stored in a particular Google Cloud region. You can use a tag template to create a tag in any region, so you don't need to create copies of your template if you have metadata entries spread across multiple regions.