Understanding the fundamentals of tagging in Data Catalog
Google Cloud Data Catalog is a fully managed and scalable metadata management service. Data Catalog helps your organization quickly discover, understand, and manage all your data from one simple interface, letting you gain valuable business insights out of your data investments. One of Data Catalog’s core concepts, called tag templates, helps you organize complex metadata while making it searchable under Cloud Identity and Access Management (Cloud IAM) control. In this post, we’ll offer some best practices and useful tag templates (referred to as templates from here) to help you start your journey.
Understanding Data Catalog templates
A tag template is a collection of related fields that represent your vocabulary for classifying data assets. Each field has a name and a type. The type can be a string, double, boolean, enum, or datetime. When the type is an enum, the template also stores the possible values for this field. The fields are stored as an unordered set in the template, and each field is treated as optional unless marked as required. A required field must be assigned a value each time the template is used; an optional field can be left out when an instance of the template is created.
You’ll create instances of templates when tagging data resources, such as BigQuery tables and views. Tagging means associating a tag template with a specific resource and assigning values to the template fields to describe that resource. We refer to these tags as structured tags because the fields in these tags are typed as instances of the template. Typed fields let you avoid common misspellings and other inconsistencies, a known pitfall of simple key-value pairs.
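To make the benefit of typed fields concrete, here is a minimal, hypothetical sketch in plain Python (not the Data Catalog API) of validating proposed tag values against a template definition. The template contents and field names are invented for illustration:

```python
# Hypothetical, simplified model of a tag template:
# field name -> (type, required?, allowed enum values).
TEMPLATE = {
    "data_domain": ("enum", True, {"sales", "marketing", "finance"}),
    "data_retention": ("string", False, None),
}

def validate_tag(values: dict) -> list:
    """Return a list of validation errors for a proposed tag."""
    errors = []
    for name, (ftype, required, allowed) in TEMPLATE.items():
        if name not in values:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if ftype == "enum" and values[name] not in allowed:
            errors.append(f"invalid enum value for {name}: {values[name]!r}")
    for name in values:
        if name not in TEMPLATE:
            # This is the check a free-form key-value store can't do:
            # a misspelled key is rejected instead of silently accepted.
            errors.append(f"unknown field: {name}")
    return errors

# A misspelled key is caught, unlike with simple key-value pairs.
print(validate_tag({"data_domian": "sales"}))
```

With a plain key-value store, the misspelled `data_domian` would simply become a new, unsearchable key; a typed template rejects it.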
Two common questions we hear about Data Catalog templates are: What kind of fields should go into a template and how should templates be organized? The answer to the first question really depends on what kind of metadata your organization wants to keep track of and how that metadata will be used. There are various metadata use cases, ranging from data discovery to data governance, and the requirements for each one should drive the contents of the templates.
Let’s look at a simple example of how you might organize your templates. Suppose the goal is to make it easier for analysts to discover data assets in a data lake because they spend a lot of time searching for the right assets. In that case, create a Data Discovery template that categorizes the assets along the dimensions the analysts want to search. This would include fields such as creation_date. If the data governance team wants to categorize the assets for data compliance purposes, you can create a separate template with governance-specific fields, such as storage_location. In other words, we recommend creating templates that represent a single concept, rather than placing multiple concepts into one template. This avoids confusing the people who use the templates and helps template administrators maintain them over time.
Some clients create their templates in multiple projects, others create them in a central project, and still others use both options. When creating templates that will be used widely across multiple teams, we recommend creating them in a central project so that they are easier to track. For example, a data governance template is typically maintained by a central group. This group might meet monthly to ensure that the fields in each template are clearly defined and decide how to handle requirements for additional fields. Storing their template in a central project makes sense for maintainability. When the scope of the template is restricted to one team, such as a data discovery template that is customized to the needs of one data science team, then creating the template in that team’s project makes more sense. When the scope is even more restricted, say to one individual, then creating the template in their personal project makes more sense. In other words, choose the storage location of a template based on its scope.
Access control for templates
Data Catalog offers a wide range of permissions for managing access to templates and tags. Templates can be completely private, visible only to authorized users (through the tag template viewer role), or visible to and usable by authorized users for creating tags (through the tag template user role). When a template is visible, authorized users can not only view the contents of the template, but also search for assets that were tagged using the template (as long as they also have access to view those underlying assets). You can’t search for metadata if you don’t have access to the underlying data. To obtain read access to the cataloged assets, users need to be granted the Data Catalog Viewer role; alternatively, the BigQuery Metadata Viewer role can be used if the underlying assets are stored in BigQuery.
In addition to the viewer and user roles, there are also the tag template creator role and the tag template owner role. A creator can only create new templates, while an owner has complete control of a template, including the right to delete it. Deleting a template has the ripple effect of deleting all the tags created from it. For creating and modifying tags, use the tag editor role. This role should be used in conjunction with a tag template role so that users can access the templates from which to tag.
Billing considerations for templates
There are two components to Data Catalog’s billing: metadata storage and API calls. For storage, projects in which templates are created incur the billing charges pertaining to templates and tags. They are billed for their templates’ storage usage even if the tags created from those templates are on resources that reside in different projects. For example, project A owns a Data Discovery template and project B uses this template to tag its own resources in BigQuery. Project A will incur the billing charges for Project B’s tags because the Data Discovery template resides in project A.
From an API calls perspective, the charges are billed to the project selected when the calls are made for searching, reading, and writing. More details on pricing are available from the product documentation page.
Another common question we hear from potential clients is: Do you have prebuilt templates to help us get started with creating our own? Due to the popularity of this request, we created a few examples to illustrate the types of templates being deployed by our users. You can find them in YAML format below and through a GitHub repo. There is also a script in the same repo that reads the YAML-based templates and creates the actual templates in Data Catalog.
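The actual script in the repo presumably uses a full YAML parser; as an illustration only, here is a stdlib-only sketch that reads the flat template format shown in this post (name/display_name followed by field/display pairs) into a Python dict. All names here come from the governance template below:

```python
# Simplified, stdlib-only sketch of reading a flat YAML-style template spec.
# (The real repo script would use a proper YAML library; this is illustrative.)
def parse_template_spec(text: str) -> dict:
    template = {"fields": []}
    for line in text.strip().splitlines():
        line = line.strip().lstrip("- ")       # drop list markers and indent
        key, _, value = line.partition(":")
        value = value.strip().strip('"')       # unquote display strings
        if key == "name":
            template["name"] = value
        elif key == "display_name":
            template["display_name"] = value
        elif key == "field":
            template["fields"].append({"field": value})
        elif key == "display":
            template["fields"][-1]["display"] = value
    return template

spec = """
- name: dg_template
  display_name: "Data Governance Template"
- field: data_domain
  display: "Data Domain"
"""
t = parse_template_spec(spec)
print(t["name"], len(t["fields"]))
```

A parsed spec like this could then drive calls to the Data Catalog API to create the real template.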
Data governance template
The data governance template categorizes data assets based on their domain, environment, sensitivity, ownership, and retention details. It is intended to be used for data discovery and compliance with usage policies such as GDPR and CCPA. The template is expected to grow over time with the addition of new policies and regulations around data usage and privacy.
- name: dg_template
  display_name: "Data Governance Template"
- field: data_domain
  display: "Data Domain"
- field: broad_data_category
  display: "Broad Data Category"
- field: data_category_customer
  display: "Data Category Customer"
- field: data_category_financial
  display: "Data Category Financial"
- field: data_category_location
  display: "Data Category Location"
- field: data_category_employee
  display: "Data Category Employee"
- field: data_category_hippa
  display: "Data Category Health"
- field: data_category_competitor
  display: "Data Category Competitor"
- field: data_confidentiality
  display: "Data Confidentiality"
- field: environment
- field: data_origin
  display: "Data Origin"
- field: data_creation
  display: "Data Creation Time"
- field: data_retention
  display: "Data Retention"
Derived data template
The derived data template is for categorizing derivative data that originates from one or more data sources. Derivative data is produced through a variety of means, including Dataflow pipelines, Airflow DAGs, BigQuery queries, and many others. The data can be transformed in multiple ways, such as aggregation, anonymization, normalization, etc. From a metadata perspective, we want to broadly categorize those transformations as well as keep track of the data sources that produced it. The parents field in the template is for storing the URIs of the origin data sources and is populated by the process producing the derived data. It is declared as a string because complex types are not supported by Data Catalog as of this writing.
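Because parents is a single string, the producing process has to flatten its list of source URIs itself. A minimal sketch, with the caveat that the delimiter and URI scheme shown are arbitrary choices for illustration, not something the template mandates:

```python
# The `parents` field is a single string, so a producing pipeline must flatten
# its list of source URIs itself. Comma-delimited and sorted here; both are
# arbitrary conventions chosen for this example.
def pack_parents(uris: list) -> str:
    return ",".join(sorted(uris))

def unpack_parents(value: str) -> list:
    return value.split(",") if value else []

parents = pack_parents([
    "bigquery:project.dataset.orders",
    "bigquery:project.dataset.customers",
])
print(parents)
```

Sorting makes the packed string deterministic, so two runs over the same sources produce an identical tag value.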
- name: derived_template
  display_name: "Derived Data Template"
- field: parents
  display: "Parent Data Sources"
- field: aggregated_data
  display: "Aggregated Data"
- field: pseudo_anonymized_data
  display: "Pseudo Anonymized Data"
- field: anonymized_data
  display: "Anonymized Data"
- field: normalized_data
  display: "Normalized Data"
- field: date_created
  display: "Date Created"
- field: product_created
  display: "Product Used to Create Derived Data"
Data quality template
The data quality template is intended to store the results of various quality checks to help in assessing the accuracy of the underlying data. Unlike the previous two templates, which are attached to a whole table, this one is attached to a specific column of a table. This would typically be an important numerical column that is used by critical business reports. As Data Catalog already ingests the schema of BigQuery tables through its technical metadata, this template omits the data type of the column and stores only the results of the quality checks. The quality checks are customizable and can easily be implemented in BigQuery.
- name: quality_template
  display_name: "Data Quality Template"
- field: count
  display: "Number of Values"
- field: unique_values
  display: "Number of Unique Values"
- field: percent_missing
  display: "Percentage Missing Values"
- field: mean
  display: "Mean Value"
- field: std_dev
  display: "Standard Deviation"
- field: zeros
  display: "Number of Zero Values"
- field: min
  display: "Min Value"
- field: median
  display: "Median Value"
- field: max
  display: "Max Value"
- field: date_created
  display: "Date Created"
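The checks behind this template are typically run as a query in BigQuery; as an equivalent illustration, here is a stdlib Python sketch that computes the same metrics for one numeric column, with None standing in for a missing value:

```python
import statistics

def column_quality(values: list) -> dict:
    """Compute the quality-template metrics for one numeric column.

    `values` is the raw column; None entries count as missing.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "unique_values": len(set(present)),
        "percent_missing": 100.0 * (len(values) - len(present)) / len(values),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),
        "zeros": present.count(0),
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }

stats = column_quality([10, 0, 20, None, 10])
print(stats["count"], stats["percent_missing"], stats["median"])
```

Each key in the returned dict lines up with a field of the quality template, so the result can be written directly as a tag on the column.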
Data engineering template
The data engineering template is also attached to individual columns of a table. It is intended for describing how those columns are mapped to the same data in a different storage system. Its goal is to support database replication scenarios such as warehouse migrations to BigQuery, continuous real-time replication to BigQuery, and replication to a data lake on Cloud Storage. In those scenarios, data engineers want to capture the mappings between the source and target columns of tables for two primary reasons: to facilitate querying the replicated data, which usually has a different schema in BigQuery than in the source; and to capture how the data is being replicated so that replication issues can be detected and resolved more easily.
- name: eng_template
  display_name: "Data Engineering Template"
- field: source_col
  display: "Source Column"
- field: source_table
  display: "Source Table"
- field: pk_col
  display: "Primary Key Column"
- field: fk_col
  display: "Foreign Key Column"
- field: incr_col
  display: "Incremental Column"
- field: null_col
  display: "Field can be NULL"
- field: updatable_col
  display: "Field can be updated"
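As a sketch of how a replication pipeline might fill this template per target column, here is a hypothetical mapping table turned into per-column tag values. Only the field names match the template above; the source system, table, and column names are invented for illustration:

```python
# Hypothetical source-to-target column mapping captured during replication;
# each entry would become one engineering tag on the matching BigQuery column.
MAPPING = [
    {"target": "cust_id", "source_col": "CUSTOMER_ID",
     "source_table": "ORCL.SALES.CUSTOMERS", "pk_col": True, "null_col": False},
    {"target": "email", "source_col": "EMAIL_ADDR",
     "source_table": "ORCL.SALES.CUSTOMERS", "pk_col": False, "null_col": True},
]

def eng_tag_values(entry: dict) -> dict:
    """Build eng_template field values for one target column."""
    return {
        "source_col": entry["source_col"],
        "source_table": entry["source_table"],
        "pk_col": entry["pk_col"],
        "null_col": entry["null_col"],
    }

tags = {m["target"]: eng_tag_values(m) for m in MAPPING}
print(tags["cust_id"]["source_col"])
```

With the mapping recorded this way, an analyst querying the replicated table can trace any column back to its source, and a broken replication job can be diagnosed column by column.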