Data Analytics

Understanding the fundamentals of tagging in Data Catalog

Google Cloud Data Catalog is a fully managed and scalable metadata management service. Data Catalog helps your organization quickly discover, understand, and manage all your data from one simple interface, letting you gain valuable business insights from your data investments. One of Data Catalog’s core concepts, called tag templates, helps you organize complex metadata while making it searchable under Cloud Identity and Access Management (Cloud IAM) control. In this post, we’ll offer some best practices and useful tag templates (referred to as templates from here on) to help you start your journey.

Understanding Data Catalog templates

A tag template is a collection of related fields that represent your vocabulary for classifying data assets. Each field has a name and a type. The type can be a string, double, boolean, enumeration, or datetime. When the type is an enumeration, the template also stores the allowed values for that field. The fields are stored as an unordered set in the template, and each field is treated as optional unless marked as required. A required field must be assigned a value each time the template is used; an optional field can be left out when an instance of the template is created.
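
To make the structure concrete, here is a minimal sketch of how a template with one required string field and one enum field might be created with the google-cloud-datacatalog Python client; the project, location, template ID, and field names are placeholders for illustration only.

# Minimal sketch: create a tag template with the google-cloud-datacatalog client.
# The project, location, template ID, and field names are illustrative placeholders.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

template = datacatalog_v1.TagTemplate()
template.display_name = "Demo Discovery Template"

# A required string field: a value must be assigned every time the template is used.
owner = datacatalog_v1.TagTemplateField()
owner.display_name = "Data Owner"
owner.type_.primitive_type = datacatalog_v1.FieldType.PrimitiveType.STRING
owner.is_required = True
template.fields["data_owner"] = owner

# An optional enum field; the allowed values are stored in the template itself.
domain = datacatalog_v1.TagTemplateField()
domain.display_name = "Data Domain"
for value in ("ENG", "FINANCE", "MARKETING"):
    domain.type_.enum_type.allowed_values.append(
        datacatalog_v1.FieldType.EnumType.EnumValue(display_name=value)
    )
template.fields["data_domain"] = domain

created = client.create_tag_template(
    parent="projects/my-project/locations/us-central1",
    tag_template_id="demo_discovery_template",
    tag_template=template,
)
print(f"Created template: {created.name}")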

You’ll create instances of templates when tagging data resources, such as BigQuery tables and views. Tagging means associating a tag template with a specific resource and assigning values to the template fields to describe the resource. We refer to these tags as structured tags because their fields are typed according to the template. Typed fields let you avoid common misspellings and other inconsistencies, a known pitfall of simple key-value pairs.
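
For example, the sketch below tags a BigQuery table with the template created above; the BigQuery resource path, template path, and field values are again placeholders rather than names from the examples later in this post.

# Minimal sketch: attach a tag (an instance of a template) to a BigQuery table.
# The project, dataset, table, and template paths are illustrative placeholders.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the Data Catalog entry that represents the BigQuery table.
resource = (
    "//bigquery.googleapis.com/projects/my-project"
    "/datasets/my_dataset/tables/my_table"
)
entry = client.lookup_entry(request={"linked_resource": resource})

# Fill in the typed fields defined by the template, then create the tag.
tag = datacatalog_v1.Tag()
tag.template = (
    "projects/my-project/locations/us-central1"
    "/tagTemplates/demo_discovery_template"
)
tag.fields["data_owner"] = datacatalog_v1.TagField(string_value="analytics-team")
tag.fields["data_domain"] = datacatalog_v1.TagField(
    enum_value=datacatalog_v1.TagField.EnumValue(display_name="FINANCE")
)
tag = client.create_tag(parent=entry.name, tag=tag)
print(f"Created tag: {tag.name}")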

Organizing templates

Two common questions we hear about Data Catalog templates are: What kinds of fields should go into a template, and how should templates be organized? The answer to the first question depends on what kind of metadata your organization wants to keep track of and how that metadata will be used. There are various metadata use cases, ranging from data discovery to data governance, and the requirements for each one should drive the contents of the templates.

Let’s look at a simple example of how you might organize your templates. Suppose the goal is to make it easier for analysts to discover data assets in a data lake because they spend a lot of time searching for the right assets. In that case, create a Data Discovery template, which would categorize the assets along the dimensions that the analysts want to search. This would include fields such as data_domain, data_owner, creation_date, etc. If the data governance team wants to categorize the assets for data compliance purposes, you can create a separate template with governance-specific fields, such as data_retention, data_confidentiality, storage_location, etc. In other words, we recommend creating templates to represent a single concept, rather than placing multiple concepts into one template. This avoids confusing those who are using the templates and helps the template administrators maintain them over time. 

Some clients create their templates in multiple projects, others create them in a central project, and still others use both options. When creating templates that will be used widely across multiple teams, we recommend creating them in a central project so that they are easier to track. For example, a data governance template is typically maintained by a central group. This group might meet monthly to ensure that the fields in each template are clearly defined and decide how to handle requirements for additional fields. Storing their template in a central project makes sense for maintainability. When the scope of the template is restricted to one team, such as a data discovery template that is customized to the needs of one data science team, then creating the template in that team’s project makes more sense. When the scope is even more restricted, say to one individual, then creating the template in their personal project makes more sense. In other words, choose the storage location of a template based on its scope. 

Access control for templates

Data Catalog offers a wide range of permissions for managing access to templates and tags. Templates can be completely private, visible only to authorized users (through the tag template viewer role), or visible to and usable by authorized users for creating tags (through the tag template user role). When a template is visible, authorized users can not only view the contents of the template, but also search for assets that were tagged using the template (as long as they also have access to view those underlying assets). You can’t search for metadata if you don’t have access to the underlying data. To obtain read access to the cataloged assets, users would need to be granted the Data Catalog Viewer role; alternatively, the BigQuery Metadata Viewer role can be used if the underlying assets are stored in BigQuery.

In addition to the viewer and user roles, there is also the concept of a template creator (via the tag template creator role) and a template owner (via the tag template owner role). The creator can only create new templates, while the owner has complete control of the template, including the right to delete it. Deleting a template has the ripple effect of deleting all the tags created from it. For creating and modifying tags, use the tag editor role. This role should be used in conjunction with a tag template role so that users can access the templates they need in order to tag.
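
As an illustration, a template owner might grant the tag template user role to an analyst group using the client library’s IAM methods; the template path and group address below are placeholders.

# Minimal sketch: grant the tag template user role on a template so that a group
# can create tags from it. The template path and group address are placeholders.
from google.cloud import datacatalog_v1
from google.iam.v1 import iam_policy_pb2

client = datacatalog_v1.DataCatalogClient()
template_name = (
    "projects/my-central-project/locations/us-central1/tagTemplates/dg_template"
)

# Read the current IAM policy on the template, add a binding, and write it back.
policy = client.get_iam_policy(
    request=iam_policy_pb2.GetIamPolicyRequest(resource=template_name)
)
policy.bindings.add(
    role="roles/datacatalog.tagTemplateUser",
    members=["group:data-analysts@example.com"],
)
client.set_iam_policy(
    request=iam_policy_pb2.SetIamPolicyRequest(resource=template_name, policy=policy)
)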

Billing considerations for templates

There are two components to Data Catalog’s billing: metadata storage and API calls. For storage, projects in which templates are created incur the billing charges pertaining to templates and tags. They are billed for their templates’ storage usage even if the tags created from those templates are on resources that reside in different projects. For example, project A owns a Data Discovery template and project B uses this template to tag its own resources in BigQuery. Project A will incur the billing charges for Project B’s tags because the Data Discovery template resides in project A. 

For API calls, charges are billed to the project that is selected when making the search, read, and write calls. More details on pricing are available on the product documentation page.

Prebuilt templates

Another common question we hear from potential clients is: Do you have prebuilt templates to help us get started with creating our own? Due to the popularity of this request, we created a few examples to illustrate the types of templates being deployed by our users. You can find them in YAML format below and through a GitHub repo. There is also a script in the same repo that reads the YAML-based templates and creates the actual templates in Data Catalog. 

Data governance template

The data governance template categorizes data assets based on their domain, environment, sensitivity, ownership, and retention details. It is intended to be used for data discovery and compliance with usage policies such as GDPR and CCPA. The template is expected to grow over time with the addition of new policies and regulations around data usage and privacy.

template:
- name: dg_template
  display_name: "Data Governance Template"
  fields:
    - field: data_domain
      type: enum
      values: ENG|PRODUCT|OPS|LOGISTICS|FINANCE|HR|MARKETING|SALES
      display: "Data Domain"
    - field: broad_data_category
      type: enum
      values: "CONTENT|METADATA|CONFIGURATION"
      display: "Broad Data Category"
    - field: data_category_customer
      type: bool
      display: "Data Category Customer"
    - field: data_category_financial
      type: bool
      display: "Data Category Financial"
    - field: data_category_location
      type: bool
      display: "Data Category Location"
    - field: data_category_employee
      type: bool
      display: "Data Category Employee"
    - field: data_category_hipaa
      type: bool
      display: "Data Category Health"
    - field: data_category_competitor
      type: bool
      display: "Data Category Competitor"
    - field: data_confidentiality
      type: enum
      values: PUBLIC|SHARED_EXTERNALLY|SHARED_INTERNALLY|SENSITIVE|UNKNOWN
      display: "Data Confidentiality"
    - field: environment
      type: enum
      values: PROD|QA|DEV|STAGING
      display: "Environment"
    - field: data_origin
      type: enum
      values: WORKDAY|SALESFORCE|DATA_LAKE|EVENT|PROMOTION|PARTNER|CONTRACTOR|OPEN_DATA
      display: "Data Origin"
    - field: data_creation
      type: timestamp
      display: "Data Creation Time"
    - field: data_retention
      type: enum
      values: 30_DAYS|60_DAYS|90_DAYS|120_DAYS|1_YEAR|2_YEARS|5_YEARS|UNKNOWN
      display: "Data Retention"

Derived data template

The derived data template is for categorizing derivative data that originates from one or more data sources. Derivative data is produced through a variety of means, including Dataflow pipelines, Airflow DAGs, BigQuery queries, and many others. The data can be transformed in multiple ways, such as aggregation, anonymization, normalization, etc. From a metadata perspective, we want to broadly categorize those transformations as well as keep track of the data sources that produced the derived data. The parents field in the template stores the URIs of the origin data sources and is populated by the process producing the derived data; it is declared as a string because complex types are not supported by Data Catalog as of this writing.

template:
- name: derived_template
  display_name: "Derived Data Template"
  fields:
    - field: parents
      type: string
      display: "Parent Data Sources"
      required: true
    - field: aggregated_data
      type: bool
      display: "Aggregated Data"
    - field: pseudo_anonymized_data
      type: bool
      display: "Pseudo Anonymized Data"
    - field: anonymized_data
      type: bool
      display: "Anonymized Data"
    - field: normalized_data
      type: bool
      display: "Normalized Data"
    - field: date_created
      type: timestamp
      display: "Date Created"
    - field: product_created
      type: enum
      values: BIG_QUERY|DATAFLOW|COMPOSER|CLOUD_FUNCTION
      display: "Product Used to Create Derived Data"

Data quality template

The data quality template is intended to store the results of various quality checks to help in assessing the accuracy of the underlying data. Unlike the previous two templates, which are attached to a whole table, this one is attached to a specific column of a table. This would typically be an important numerical column that is used by critical business reports. As Data Catalog already ingests the schema of BigQuery tables through its technical metadata, this template omits the data type of the column and stores only the results of the quality checks. The quality checks are customizable and can easily be implemented in BigQuery.

template:
- name: quality_template
  display_name: "Data Quality Template"
  fields:
    - field: count
      type: double
      display: "Number of Values"
    - field: unique_values
      type: double
      display: "Number of Unique Values"
    - field: percent_missing
      type: double
      display: "Percentage Missing Values"
    - field: mean
      type: double
      display: "Mean Value"
    - field: std_dev
      type: double
      display: "Standard Deviation"
    - field: zeros
      type: double
      display: "Number of Zero Values"
    - field: min
      type: double
      display: "Min Value"
    - field: median
      type: double
      display: "Median Value"
    - field: max
      type: double
      display: "Max Value"
    - field: date_created
      type: timestamp
      display: "Date Created"

Data engineering template

The data engineering template is also attached to individual columns of a table. It is intended for describing how those columns are mapped to the same data in a different storage system. Its goal is to support database replication scenarios such as warehouse migrations to BigQuery, continuous real-time replication to BigQuery, and replication to a data lake on Cloud Storage. In those scenarios, data engineers want to capture the mappings between the source and target columns of tables for two primary reasons: to facilitate querying the replicated data, which usually has a different schema in BigQuery than in the source, and to capture how the data is being replicated so that replication issues can be more easily detected and resolved.

template:
- name: eng_template
  display_name: "Data Engineering Template"
  fields:
    - field: source_col
      type: string
      display: "Source Column"
      required: true
    - field: source_table
      type: string
      display: "Source Table"
      required: true
    - field: pk_col
      type: bool
      display: "Primary Key Column"
    - field: fk_col
      type: bool
      display: "Foreign Key Column"
    - field: incr_col
      type: bool
      display: "Incremental Column"
    - field: null_col
      type: bool
      display: "Field can be NULL"
    - field: updatable_col
      type: bool
      display: "Field can be updated"

You can now use Data Catalog structured tags to bring together all your disparate operational and business metadata, attach it to your data assets, and make it easily searchable. To learn more about tagging in Data Catalog, try out our quickstart for tagging tables.