Tag tables in Dataplex based on insights from data profiles

This page describes how to automatically apply Dataplex tags to BigQuery tables after Sensitive Data Protection profiles those tables. This page also provides example queries that you can use to find tagged data across your organization and projects.

This feature is useful if you want to enrich your manually curated metadata in Dataplex with insights gathered from Sensitive Data Protection data profiles. The generated tags include the following insights:

  • Information types (infoTypes) detected in the columns of the table
  • Calculated sensitivity level of the table
  • Calculated data risk level of the table

Insights from Sensitive Data Protection data profiles can help you use Dataplex to discover sensitive and high-risk data in your organization. Use these insights to help you make informed decisions about how to manage and govern your data.

If you want to send the results of inspection jobs—not data profiling operations—to Dataplex, see Send Sensitive Data Protection inspection results to Data Catalog instead.

About data profiles

You can configure Sensitive Data Protection to automatically generate profiles about data across an organization, folder, or project. Data profiles contain metrics and metadata about your data and help you determine where sensitive and high-risk data reside. Sensitive Data Protection reports these metrics at various levels of detail. For information about the types of data you can profile, see Supported resources.

About Dataplex and Data Catalog

Dataplex is a Google Cloud service that unifies distributed data and automates data management and governance for that data. Data Catalog is a fully managed, scalable metadata management service within Dataplex.

Data Catalog lets you use tags and tag templates to attach business metadata to your data. You can then search and manage all metadata for your organization or project in a unified service. For more information, see Tags and tag templates.

How it works

If the Send to Dataplex as tags action is enabled in your discovery scan configuration, Sensitive Data Protection applies the action only to new and updated profiles. Existing profiles that aren't updated aren't sent to Dataplex. Each time it profiles your data, Sensitive Data Protection does the following:

  1. Creates a private tag template containing the schema of the tags that will be attached to your BigQuery tables. For information about the name, ID, and location of the tag template, see Tag template details.

    Only principals with the proper roles and permissions can view the tag template.

  2. Creates a tag for each BigQuery table that you profile. The tag is based on the newly created tag template.

    For example, a resulting tag attached to a table can have the following metadata:

    Display name            Value
    Column Insights         ccn: CREDIT_CARD_NUMBER
                            first_name: PERSON_NAME
                            last_name: PERSON_NAME
                            ssn: US_SOCIAL_SECURITY_NUMBER
                            email: EMAIL_ADDRESS
    Column Sensitivity      ccn: HIGH
                            first_name: MODERATE
                            last_name: MODERATE
                            favorite_animal: LOW
                            ssn: HIGH
                            email: MODERATE
                            id: LOW
    Data Risk Level         HIGH
    Other InfoTypes         PHONE_NUMBER
    Predicted InfoTypes     CREDIT_CARD_NUMBER,US_SOCIAL_SECURITY_NUMBER,EMAIL_ADDRESS,PERSON_NAME
    Profile Last Generated  DATE at TIME
    Sensitive Data Profile  organizations/ORGANIZATION_ID/locations/REGION/tableDataProfiles/TABLE_DATA_PROFILE_ID
    Sensitivity Score       HIGH

A table has two tags if it was profiled through both of the following:

  • An organization-level or folder-level scan configuration
  • A project-level scan configuration

After the tables are tagged, you can search Dataplex for all data in your organization or project with specific tag values.

Tag template details

The template name, template ID, and the project where the new tag template is stored depend on the resource that the scan configuration pertains to.

  • If the scan configuration is an organization-level or folder-level configuration, the tag template is stored in the service agent container. The name of the tag template is Sensitive Data Profile. Its template ID is sensitive_data_profile.
  • If the scan configuration is a project-level configuration, the tag template is stored in the project to be profiled. The name of the tag template is Sensitive Data Profile (Project). Its template ID is sensitive_data_profile_project.
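The mapping above can be captured in a small lookup. The following Python sketch is illustrative only: the function name tag_template_for, the TagTemplate type, and the scan-level strings ("organization", "folder", "project") are hypothetical names chosen for this example and are not part of any Google Cloud API.

```python
from typing import NamedTuple


class TagTemplate(NamedTuple):
    """Display name and ID of a generated tag template (hypothetical helper type)."""
    display_name: str
    template_id: str


def tag_template_for(scan_level: str) -> TagTemplate:
    """Return the tag template that corresponds to a scan configuration level."""
    level = scan_level.lower()
    if level in ("organization", "folder"):
        # Stored in the service agent container.
        return TagTemplate("Sensitive Data Profile", "sensitive_data_profile")
    if level == "project":
        # Stored in the project to be profiled.
        return TagTemplate(
            "Sensitive Data Profile (Project)", "sensitive_data_profile_project"
        )
    raise ValueError(f"unknown scan level: {scan_level!r}")


print(tag_template_for("folder").template_id)
```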

Pricing

For information about how other Google Cloud services may charge you for exporting data profiles, see Pricing for exporting data profiles.

Automatically tag BigQuery tables based on data profiles

  1. Create a scan configuration. Alternatively, edit an existing scan configuration.

  2. In the Add actions step, make sure Send to Dataplex as tags is turned on.

    • If you're creating a scan configuration, this action is enabled by default.
    • If you're editing a scan configuration, you must enable this action.

After the data is profiled and tagged, you can start searching for tagged data in Dataplex.

Roles and permissions for viewing tags

Dataplex search results show you only the data that you have access to. You need the following Identity and Access Management (IAM) roles or permissions to search for the tags that are attached to your BigQuery tables.

Purpose                                   Predefined role                                                        Relevant permissions
View the private tag template             Data Catalog TagTemplate Viewer (roles/datacatalog.tagTemplateViewer)  datacatalog.tagTemplates.getTag
View the tags applied to BigQuery tables  BigQuery Metadata Viewer (roles/bigquery.metadataViewer)               bigquery.datasets.get
                                                                                                                 bigquery.tables.get

For more information about Dataplex roles, see Roles to view public and private tags.

For information about granting a predefined role, see Grant a single role. If you want to use a custom role instead of a predefined role, make sure that the custom role has the relevant permissions. For more information, see Create a custom role.
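If you plan to use a custom role, a quick local check can confirm that it covers the permissions listed in the table. This is a minimal sketch that assumes you already have the role's permission list as plain strings; the helper name missing_permissions is hypothetical, and the sketch doesn't call IAM.

```python
# Permissions taken from the roles table on this page.
REQUIRED_PERMISSIONS = frozenset({
    "datacatalog.tagTemplates.getTag",
    "bigquery.datasets.get",
    "bigquery.tables.get",
})


def missing_permissions(role_permissions):
    """Return the required permissions that the given role doesn't include."""
    return sorted(REQUIRED_PERMISSIONS - set(role_permissions))


# Example: a custom role that only grants table metadata access.
print(missing_permissions(["bigquery.tables.get"]))
```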

Find the generated tag template

  1. In the Google Cloud console, go to the Dataplex Tag Templates page.

    Go to Tag templates

  2. In the list, find the tag template. For information about the name, ID, and location of the tag template, see Tag template details.

  3. Optional: To find the tag template that was generated by a given discovery scan configuration, enter the following in the Filter field:

    name:PROJECT_ID.TAG_TEMPLATE_ID
    

    Replace the following:

    • PROJECT_ID: the ID of the project that is associated with the scan configuration. If you profiled your data at the organization or folder level, enter the project ID of the service agent container.
    • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.

Find the generated tag for a given table data profile

  1. In the Google Cloud console, go to the Dataplex Search page.

    Go to Search

  2. In the Search field, enter the following:

    name:TABLE_ID tag:PROJECT_ID.TAG_TEMPLATE_ID
    

    Replace the following:

    • TABLE_ID: the ID of the table that was profiled.
    • PROJECT_ID: the ID of the project that contains the tag template. If you profiled your data at the organization or folder level, enter the project ID of the service agent container.
    • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  3. In the list that appears, click the table ID. The details of the BigQuery table appear along with any Sensitive Data Profile or Sensitive Data Profile (Project) tags attached to it.

    A table has two tags if it was profiled through both of the following:

    • An organization-level or folder-level scan configuration
    • A project-level scan configuration

For information about how to perform a search through the Data Catalog API, see How to search for data assets.
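As a rough illustration of the API route, the following Python sketch builds the same search string and passes it to the Data Catalog client library. It assumes that the google-cloud-datacatalog package is installed and that credentials are configured. The function names and the search_project_id parameter are hypothetical, and you might need to scope the search differently (for example, by organization).

```python
def table_tag_query(table_id: str, project_id: str, template_id: str) -> str:
    """Build the search string used in the console steps above."""
    return f"name:{table_id} tag:{project_id}.{template_id}"


def search_tagged_table(search_project_id, table_id, project_id, template_id):
    """Run the query through the Data Catalog API and return linked resources.

    search_project_id is an assumption: the project whose assets are searched.
    """
    # Imported lazily so that the query builder works without the library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(search_project_id)

    query = table_tag_query(table_id, project_id, template_id)
    return [
        result.linked_resource
        for result in client.search_catalog(scope=scope, query=query)
    ]


print(table_tag_query("orders", "my-project", "sensitive_data_profile"))
```

The values "orders" and "my-project" are placeholders for illustration only.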

Example search queries

This section provides example search queries that you can use in Dataplex to find data in your organization or project with specific tag values.

You can find only the data that you have access to. Data access is controlled through IAM permissions. For more information, see Roles and permissions for viewing tags on this page.

You can enter these queries in the Dataplex Search page in the Google Cloud console.

Go to Search

For information about how to form the queries, see Data Catalog search syntax. For information about how to perform a search through the Data Catalog API, see How to search for data assets.

Find all tables that are tagged using the new tag template

tag:PROJECT_ID.TAG_TEMPLATE_ID

Replace the following:

  • PROJECT_ID: the ID of the project that contains the tag template. If you profiled your data at the organization or folder level, enter the project ID of the service agent container.
  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.

The remaining examples on this page don't include the project ID, so you might get results associated with various discovery scan configurations. To limit your results to a particular scan configuration, add the project ID to the query as shown in this example.
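If you generate queries programmatically, adding the project ID can be a simple string rewrite. The sketch below assumes that a field-level query accepts the same PROJECT_ID.TAG_TEMPLATE_ID prefix as the query in this example; the helper name scope_to_project is hypothetical.

```python
def scope_to_project(query: str, project_id: str) -> str:
    """Insert a project ID after the 'tag:' prefix of a tag query."""
    prefix = "tag:"
    if not query.startswith(prefix):
        raise ValueError("expected a query that starts with 'tag:'")
    return f"{prefix}{project_id}.{query[len(prefix):]}"


print(scope_to_project("tag:sensitive_data_profile.data_risk_level=HIGH", "my-project"))
```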

Find all tables that were last profiled before a given date

tag:TAG_TEMPLATE_ID.profile_last_generated<DATE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • DATE: a date in the format YYYY-MM-DD—for example, 2023-01-15.
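To avoid formatting mistakes in the date, you can build this query from a date object. This is a small sketch; the function name profiled_before_query is hypothetical.

```python
from datetime import date


def profiled_before_query(template_id: str, cutoff: date) -> str:
    """Build a query for tables last profiled before the cutoff date."""
    # date.isoformat() produces the YYYY-MM-DD format that the query expects.
    return f"tag:{template_id}.profile_last_generated<{cutoff.isoformat()}"


print(profiled_before_query("sensitive_data_profile", date(2023, 1, 15)))
# prints: tag:sensitive_data_profile.profile_last_generated<2023-01-15
```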

Find all tables with a given table-level sensitivity score

tag:TAG_TEMPLATE_ID.sensitivity_score=SENSITIVITY_SCORE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • SENSITIVITY_SCORE: one of HIGH, MODERATE, or LOW.

For more information, see Data risk and sensitivity levels.

Find all tables with a given data risk level

tag:TAG_TEMPLATE_ID.data_risk_level=DATA_RISK_LEVEL

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • DATA_RISK_LEVEL: one of HIGH, MODERATE, or LOW.

For more information, see Data risk and sensitivity levels.
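The table-level sensitivity score and data risk level queries have the same shape, so one helper can build both and reject values other than HIGH, MODERATE, or LOW. This is an illustrative sketch; the function name level_query is hypothetical.

```python
LEVELS = ("HIGH", "MODERATE", "LOW")
LEVEL_FIELDS = ("sensitivity_score", "data_risk_level")


def level_query(template_id: str, field: str, level: str) -> str:
    """Build a table-level query for a sensitivity score or data risk level."""
    if field not in LEVEL_FIELDS:
        raise ValueError(f"field must be one of {LEVEL_FIELDS}")
    normalized = level.upper()
    if normalized not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    return f"tag:{template_id}.{field}={normalized}"


print(level_query("sensitive_data_profile", "sensitivity_score", "HIGH"))
print(level_query("sensitive_data_profile", "data_risk_level", "moderate"))
```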

Find all tables that contain a given predicted infoType

tag:TAG_TEMPLATE_ID.predicted_info_types:INFOTYPE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • INFOTYPE: the infoType—for example, PERSON_NAME.

For a list of all built-in infoTypes, see InfoType detector reference.

For more information, see Predicted infoType in the Metrics reference.

Find all tables that partially contain a given infoType

tag:TAG_TEMPLATE_ID.other_info_types:INFOTYPE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • INFOTYPE: the infoType—for example, PERSON_NAME.

For a list of all built-in infoTypes, see InfoType detector reference.

For more information, see Other infoTypes in the Metrics reference.
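The predicted and other infoType queries differ only in the field name. The following sketch builds either form for a list of infoTypes; the function name info_type_queries and its predicted flag are hypothetical.

```python
def info_type_queries(template_id, info_types, predicted=True):
    """Build one query per infoType against the chosen tag field."""
    field = "predicted_info_types" if predicted else "other_info_types"
    return [f"tag:{template_id}.{field}:{info_type}" for info_type in info_types]


for query in info_type_queries("sensitive_data_profile", ["PERSON_NAME", "EMAIL_ADDRESS"]):
    print(query)

print(info_type_queries("sensitive_data_profile", ["PHONE_NUMBER"], predicted=False)[0])
```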

Find all tables that contain a given column with a given predicted infoType

tag:TAG_TEMPLATE_ID.column_insights:COLUMN_NAME:INFOTYPE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • COLUMN_NAME: the name of the column in the BigQuery table.
  • INFOTYPE: the infoType—for example, PERSON_NAME.

For a list of all built-in infoTypes, see InfoType detector reference.

For more information, see Predicted infoType in the Metrics reference.

Find all tables that contain a given column with a given column-level sensitivity score

tag:TAG_TEMPLATE_ID.column_sensitivity:COLUMN_NAME:SENSITIVITY_SCORE

Replace the following:

  • TAG_TEMPLATE_ID: sensitive_data_profile if the scan configuration is for an organization or a folder; sensitive_data_profile_project if the scan configuration is for a project.
  • COLUMN_NAME: the name of the column in the BigQuery table.
  • SENSITIVITY_SCORE: one of HIGH, MODERATE, or LOW.

For more information, see Data risk and sensitivity levels.
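The two column-level queries can be generated the same way. This sketch uses the column names from the example tag earlier on this page; the function names are hypothetical.

```python
COLUMN_LEVELS = ("HIGH", "MODERATE", "LOW")


def column_info_type_query(template_id: str, column: str, info_type: str) -> str:
    """Query for tables where a column has a given predicted infoType."""
    return f"tag:{template_id}.column_insights:{column}:{info_type}"


def column_sensitivity_query(template_id: str, column: str, score: str) -> str:
    """Query for tables where a column has a given sensitivity score."""
    if score not in COLUMN_LEVELS:
        raise ValueError(f"score must be one of {COLUMN_LEVELS}")
    return f"tag:{template_id}.column_sensitivity:{column}:{score}"


print(column_info_type_query("sensitive_data_profile", "ssn", "US_SOCIAL_SECURITY_NUMBER"))
print(column_sensitivity_query("sensitive_data_profile", "ssn", "HIGH"))
```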

Truncated tag values

If the column heading data of a BigQuery table exceeds 10 MB, the resulting tag might show [TRUNCATED] in the Column Insights or Column Sensitivity field. In this case, we recommend that you go to Sensitive Data Protection to review the table data profile and associated column data profiles.
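If you export tag values for review, a simple check can flag the tables whose column fields were cut off. This sketch assumes that the tag is available as a dictionary of display names to string values, which is a hypothetical representation and not an API response format.

```python
TRUNCATION_MARKER = "[TRUNCATED]"
COLUMN_FIELDS = ("Column Insights", "Column Sensitivity")


def truncated_fields(tag_fields: dict) -> list:
    """Return the column-level fields whose values contain the truncation marker."""
    return [
        name
        for name in COLUMN_FIELDS
        if TRUNCATION_MARKER in tag_fields.get(name, "")
    ]


example_tag = {
    "Column Insights": "ccn: CREDIT_CARD_NUMBER [TRUNCATED]",
    "Column Sensitivity": "ccn: HIGH",
}
print(truncated_fields(example_tag))
```

A non-empty result suggests reviewing the table data profile in Sensitive Data Protection, as recommended above.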