Data profiles for BigQuery data


This page describes the data profiler and how to use it to determine where sensitive and high-risk data reside in your organization.

Overview

The data profiler lets you protect data across your organization by identifying where sensitive and high-risk data reside. When you turn on data profiling, Cloud DLP automatically scans all BigQuery tables and columns in the resource that you select: the entire organization, a folder, or a project. It then creates data profiles at the table, column, and project levels.

A data profile is a set of metrics that Cloud DLP gathers from scanning a particular resource. These metrics include the predicted infoTypes, the assessed data risk and sensitivity levels, and metadata about your tables. Use these insights to make informed decisions about how you protect, share, and use your data.

As long as the data profiler configuration is active, Cloud DLP automatically scans tables that you add and modify, and generates new and updated data profiles for those tables.

The following image shows a list of column data profiles.

Screenshot of column data profiles

For a list of metrics included in each data profile, see Metrics reference.

Data profile creation

To start generating data profiles, you create a scan configuration (also called a data profile configuration). This scan configuration is where you set the resource (organization, folder, or project) that you want to scan. All BigQuery datasets and tables in that resource are in scope for data profiling.

When creating a scan configuration, you also set the inspection template to use. The inspection template is where you specify the types of sensitive data that Cloud DLP must scan for.

When Cloud DLP creates data profiles, it analyzes your BigQuery tables and columns based on your scan configuration and inspection template. A data profile is a snapshot of the analysis and metrics gathered at a point in time.
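
As an illustration, a minimal inspection template body might look like the following sketch. The field names follow the DLP API's InspectTemplate resource; the display name and the infoTypes chosen here are only examples.

```python
# Hypothetical inspection template body. Field names follow the DLP API's
# InspectTemplate resource; the display name and infoTypes are illustrative.
inspect_template = {
    "display_name": "profiler-template",
    "inspect_config": {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},       # built-in infoType
            {"name": "CREDIT_CARD_NUMBER"},  # built-in infoType
        ],
        # Narrow down what Cloud DLP considers to be a match.
        "min_likelihood": "POSSIBLE",
    },
}
```

A scan configuration then references a template like this one, and every table in the configured resource is inspected against it.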

Work with data profiles

The workflow for using data profiles is as follows:

  1. Confirm that you have the required user roles
  2. Profile a single project
  3. Profile an organization or folder
  4. Organization or folder scans only: grant profiling access to the service agent
  5. View the data profiles
  6. Analyze the data profiles
  7. Remediate the findings

Supported tables

Cloud DLP profiles tables that are supported by the BigQuery Storage Read API, including the following:

  • Normal BigQuery tables
  • BigLake tables stored in Cloud Storage

The following aren't supported:

  • BigQuery Omni tables.
  • Tables where the serialized data size of an individual row exceeds 128 MB, the maximum serialized row size that the BigQuery Storage Read API supports.
  • Non-BigLake external tables, like Sheets.
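
The row-size restriction above can be sketched as a simple check. The 128 MB figure is the BigQuery Storage Read API limit described above; how the API measures a serialized row internally is not specified here, so this is illustrative only.

```python
# Illustrative sketch of the row-size constraint. The exact byte accounting
# is internal to the BigQuery Storage Read API; this is an assumption.
MAX_ROW_BYTES = 128 * 1024 * 1024  # 128 MB limit

def row_within_read_api_limit(serialized_row: bytes) -> bool:
    """Return True if a serialized row is small enough to be read, and so profiled."""
    return len(serialized_row) <= MAX_ROW_BYTES
```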

Roles required to configure and view data profiles

The following sections list the required user roles, categorized according to their purpose. Depending on how your organization is set up, you might decide to have different people perform different tasks. For example, the person who configures data profiles might be different from the person who regularly monitors them.

Roles required to work with data profiles at the organization or folder level

These roles let you configure and view data profiles at the organization or folder level.

Make sure these roles are granted to the proper people at the organization level. Alternatively, your Google Cloud administrator can create custom roles that only have the relevant permissions.

Configure and view data profiles:

  • DLP Administrator (roles/dlp.admin), with relevant permissions:
      • dlp.inspectTemplates.create
      • dlp.jobs.create
      • dlp.jobTriggers.create
      • dlp.columnDataProfiles.list
      • dlp.jobs.list
      • dlp.jobTriggers.list
      • dlp.projectDataProfiles.list
      • dlp.tableDataProfiles.list
  • Project Creator (roles/resourcemanager.projectCreator), with relevant permissions:
      • resourcemanager.organizations.get
      • resourcemanager.projects.create

Grant data profiling access:

  • One of the following roles:
      • Organization Administrator (roles/resourcemanager.organizationAdmin)
      • Security Admin (roles/iam.securityAdmin)
    Relevant permissions:
      • resourcemanager.organizations.getIamPolicy
      • resourcemanager.organizations.setIamPolicy

View data profiles (read-only):

  • DLP Data Profiles Reader (roles/dlp.dataProfilesReader), with relevant permissions:
      • dlp.columnDataProfiles.list
      • dlp.projectDataProfiles.list
      • dlp.tableDataProfiles.list
  • DLP Reader (roles/dlp.reader), with relevant permissions:
      • dlp.jobs.list
      • dlp.jobTriggers.list

Roles required to work with data profiles at the project level

These roles let you configure and view data profiles at the project level.

Make sure these roles are granted to the proper people at the project level. Alternatively, your Google Cloud administrator can create custom roles that only have the relevant permissions.

Configure and view data profiles:

  • DLP Administrator (roles/dlp.admin), with relevant permissions:
      • dlp.inspectTemplates.create
      • dlp.jobs.create
      • dlp.jobTriggers.create
      • dlp.columnDataProfiles.list
      • dlp.jobs.list
      • dlp.jobTriggers.list
      • dlp.projectDataProfiles.list
      • dlp.tableDataProfiles.list

View data profiles (read-only):

  • DLP Data Profiles Reader (roles/dlp.dataProfilesReader), with relevant permissions:
      • dlp.columnDataProfiles.list
      • dlp.projectDataProfiles.list
      • dlp.tableDataProfiles.list
  • DLP Reader (roles/dlp.reader), with relevant permissions:
      • dlp.jobs.list
      • dlp.jobTriggers.list

Scan configuration

A scan configuration (also called a data profile configuration) specifies which resource (an organization, folder, or project) to scan, which inspection template to use, and what to do with the results. It also contains administrative details, such as the service agent container to associate with the scan and the billing account to use.

You can create a scan configuration for your organization and another one for a particular folder. If two or more active scan configurations have the same project in their scope, Cloud DLP determines which scan configuration can generate profiles for that project.

You can also create a scan configuration at the project level. This type of scan configuration can always profile the target project and does not compete with other configurations at the level of the parent folder or organization.

The first time you create a scan configuration, you specify where you want Cloud DLP to store it. All subsequent scan configurations that you create are stored in that same region.

For example, if you create a scan configuration for Folder A and store it in the us-west1 region, then any scan configuration that you later create for any other resource is also stored in that region.

Inspection template

An inspection template specifies what information types (or infoTypes) Cloud DLP looks for while scanning your data. Here, you provide a combination of built-in infoTypes and optional custom infoTypes.

You can also provide a likelihood level to narrow down what Cloud DLP considers to be a match. You can add rule sets to exclude unwanted findings or include additional findings.

If you change an inspection template that your scan configuration uses, the changes are applied only to future scans. Any existing data profiles are not overwritten. For example, if you edit your template to add an infoType, the change only affects tables that have yet to be scanned. Your action doesn't cause a rescan of all existing tables.

The inspection template must be in the same region as the data to be profiled. If you have data in multiple regions, use an inspection template that is stored in the global region. For more information, see Data residency considerations.

Inspection templates are a core component of the Cloud DLP platform. Data profiles use the same inspection templates that you can use across all Cloud DLP services. For more information on inspection templates, see Templates.

Service agent container and service agent

When you create a scan configuration for your organization or for a folder, Cloud DLP requires you to provide a service agent container. A service agent container is a Google Cloud project that Cloud DLP uses to track billed charges related to organization- and folder-level profiling operations.

The service agent container contains a service agent, which is a Google-managed service account that Cloud DLP uses to profile data on your behalf. You need a service agent to authenticate to Cloud DLP and other APIs. Your service agent must have all the required permissions to access and profile your data. The service agent's ID is in the following format:

service-PROJECT_NUMBER@dlp-api.iam.gserviceaccount.com

Here, PROJECT_NUMBER is the numerical identifier of the service agent container.
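
For illustration, the service agent ID can be derived from the container project's number as follows (the project number used in the example is made up):

```python
def dlp_service_agent_id(project_number: int) -> str:
    """Build the DLP service agent ID from the service agent container's project number."""
    return f"service-{project_number}@dlp-api.iam.gserviceaccount.com"

# Example with a made-up project number:
print(dlp_service_agent_id(123456789012))
# → service-123456789012@dlp-api.iam.gserviceaccount.com
```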

When setting the service agent container, you can choose an existing project. If the project you select contains a service agent, Cloud DLP grants the required IAM permissions to that service agent. If the project doesn't have a service agent, Cloud DLP creates one and automatically grants data profiling permissions to it.

Alternatively, you can choose to have Cloud DLP automatically create the service agent container and service agent. Cloud DLP automatically grants data profiling permissions to the service agent.

In both cases, if Cloud DLP fails to grant data profiling access to your service agent, it shows an error when you view the scan configuration details.

For project-level scan configurations, you don't need a service agent container. The project you're profiling serves the service agent container's purpose. To run profiling operations, Cloud DLP uses that project's own service agent.

Data profiling access at the organization or folder level

When you configure profiling at the organization or folder level, Cloud DLP attempts to automatically grant data profiling access to your service agent. However, if you don't have the permissions to grant IAM roles, Cloud DLP can't do this action on your behalf. Someone with those permissions in your organization, such as a Google Cloud admin, must grant data profiling access to your service agent.

Default frequency of data profile generation

By default, Cloud DLP profiles your data as follows:

  1. After you create a scan configuration for a particular resource, Cloud DLP performs an initial scan, profiling all tables in that resource. After the initial scan, it continuously monitors your BigQuery tables for any additions or changes you introduce.

  2. Cloud DLP profiles new tables you add shortly after you add them.

  3. Every 30 days, Cloud DLP reprofiles existing tables that underwent schema changes within the last 30 days.

However, in your scan configuration, you can customize the profiling frequency by creating one or more schedules for different subsets of your data. You can also specify subsets of data that you never want to be profiled. For more information, see Manage schedules in the instructions for configuring profiling.

By default, Cloud DLP doesn't reprofile tables that have no schema changes. If you want Cloud DLP to reprofile existing tables, you can send a request.
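
The default rules above can be sketched as a small decision function. The exact scheduling that Cloud DLP uses internally isn't documented here, so treat this as an approximation:

```python
from datetime import date, timedelta
from typing import Optional

REPROFILE_CYCLE = timedelta(days=30)  # approximate default cycle

def should_reprofile(last_profiled: Optional[date],
                     last_schema_change: Optional[date],
                     today: date) -> bool:
    """Approximate the default rules: new tables are profiled promptly, tables
    with no schema change are not reprofiled, and changed tables are
    reprofiled on a roughly 30-day cycle."""
    if last_profiled is None:
        return True   # new table: profiled shortly after it is added
    if last_schema_change is None or last_schema_change <= last_profiled:
        return False  # no schema change since the last profile
    return today - last_profiled >= REPROFILE_CYCLE
```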

For example scenarios, see Data profiling pricing examples.

Profiling performance

The time it takes to profile your data varies depending on several factors, including, but not limited to, the following:

  • Number of tables being profiled
  • Sizes of the tables
  • Number of columns in the tables
  • Data types in the columns

Therefore, Cloud DLP's performance in a past inspection or profiling task isn't indicative of how it will perform in future profiling tasks.

Retention of data profiles

Cloud DLP retains the latest version of a data profile for 13 months. When Cloud DLP reprofiles an updated table, it replaces that table's existing data profiles with new ones.

Consider the following scenarios:

  • On January 1, Cloud DLP profiles Table A. Table A does not change in over a year, and so it's not profiled again. In this case, Cloud DLP retains the data profiles for Table A for 13 months before deleting them.

  • On January 1, Cloud DLP profiles Table A. Within the month, someone in your organization updates the schema of that table. Because of this change, the following month, Cloud DLP automatically reprofiles Table A. The newly generated data profiles overwrite the ones that were created in January.
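
The retention behavior in these scenarios can be sketched as follows. The month arithmetic here ignores the day of the month, which is a simplification:

```python
from datetime import date

RETENTION_MONTHS = 13

def months_between(start: date, end: date) -> int:
    """Whole months between two dates, ignoring the day of month (a simplification)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def profile_expired(profile_created: date, today: date) -> bool:
    """True once an unrefreshed data profile falls outside the 13-month retention window."""
    return months_between(profile_created, today) > RETENTION_MONTHS
```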

For information on how Cloud DLP charges for profiling new and modified tables, see Data profiling pricing.

If you want to retain data profiles indefinitely or keep a record of the changes they undergo, consider saving the data profiles to BigQuery when you configure profiling. You choose which BigQuery dataset to save the profiles to, and you control the table expiration policy for that dataset.

Overriding scan configurations

You can create a maximum of one scan configuration for each organization, folder, and project.

If two or more active scan configurations have the same project in their scope, the following rules apply:

  • Among organization- and folder-level scan configurations, the one closest to the project generates data profiles for that project. This rule applies even if a project-level scan configuration for that project also exists.
  • Cloud DLP treats project-level scan configurations independently of organization- and folder-level configurations. A scan configuration that you create at the project level can't override one that you create for a parent folder or organization.

Consider the following example, where there are three active scan configurations:

Diagram of a resource hierarchy with a scan configuration applied
              to an organization, a folder, and a project

Here, Scan configuration 1 applies to the entire organization, Scan configuration 2 applies to the Team B folder, and Scan configuration 3 applies to the Production project. In this example:

  • Cloud DLP profiles all tables in projects that aren't in the Team B folder according to Scan configuration 1.
  • Cloud DLP profiles all tables in projects in the Team B folder—including tables in the Production project—according to Scan configuration 2.
  • Cloud DLP profiles all tables in the Production project according to Scan configuration 3.

In this example, Cloud DLP effectively generates two sets of profiles for the Production project—one set for each of the following scan configurations:

  • Scan configuration 2
  • Scan configuration 3

However, even though there are two sets of profiles for the same project, you don't see them all together in your dashboard. You only see the profiles that were generated in the purview and region that you're currently viewing.
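
The precedence rules in this example can be sketched as a lookup over the project's ancestry. The resource names below are hypothetical:

```python
from typing import Dict, List, Optional

def winning_ancestor_config(ancestry: List[str],
                            configs: Dict[str, str]) -> Optional[str]:
    """Among org- and folder-level scan configurations, the configuration on the
    ancestor closest to the project wins. A project-level configuration runs
    independently, so the project itself is skipped here."""
    for resource in ancestry[1:]:  # ancestry[0] is the project itself
        if resource in configs:
            return configs[resource]
    return None

# Hypothetical resource names mirroring the example above.
configs = {
    "organizations/example-org": "Scan configuration 1",
    "folders/team-b": "Scan configuration 2",
    "projects/production": "Scan configuration 3",
}

# The Team B folder is closer to the Production project than the organization,
# so Scan configuration 2 wins at the folder/org level; Scan configuration 3
# still profiles the project independently at the project level.
print(winning_ancestor_config(
    ["projects/production", "folders/team-b", "organizations/example-org"], configs))
# → Scan configuration 2
```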

For more information on Google Cloud's resource hierarchy, see Resource hierarchy.

Data profile snapshots

Each data profile includes a snapshot of the scan configuration and the inspection template that were used to generate it. You can use this snapshot to check the settings that you used to generate a particular data profile.

Data residency considerations

Cloud DLP is designed to support data residency. If you must comply with data residency requirements, consider the following points:

Inspection regions

Cloud DLP inspects your data in the same region where that data is stored. That is, your BigQuery data doesn't leave its current region.

Furthermore, an inspection template can only be used to profile data that resides in the same region as that template. For example, if you configure the data profiler to use an inspection template that is stored in the us-west1 region, Cloud DLP can only profile data in that region.

You can set a dedicated inspection template for each region where you have data. If you provide an inspection template that's stored in the global region, Cloud DLP uses that template for data in regions with no dedicated inspection template.

The following table provides example scenarios:

Scenario Support
Scan data in the us region using an inspection template from the us region. Supported
Scan data in the global region using an inspection template from the us region. Not supported
Scan data in the us region using an inspection template from the global region. Supported
Scan data in the us region using an inspection template from the us-east1 region. Not supported
Scan data in the us-east1 region using an inspection template from the us region. Not supported
Scan data in the us region using an inspection template from the asia region. Not supported
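
The scenarios in this table reduce to a simple compatibility check. Note that this sketch omits the rule that a global template applies only where no dedicated regional template exists:

```python
def template_can_scan(template_region: str, data_region: str) -> bool:
    """A template can profile data in its own region; a template in the global
    region can profile data in any region. Multi-regions such as `us` count as
    single, distinct regions. (Simplified: the precedence of dedicated regional
    templates over a global one is not modeled here.)"""
    return template_region == "global" or template_region == data_region
```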

Data profile configuration

When Cloud DLP creates data profiles, it takes a snapshot of your scan configuration and inspection template and stores them in each table data profile. If you configure the data profiler to use an inspection template from the global region, then Cloud DLP copies that template to any region that has data to be profiled. Similarly, it copies the scan configuration to those regions.

Consider this example: Project A contains Table 1. Table 1 is in the us-west1 region; the scan configuration is in the us-west2 region; and the inspection template is in the global region.

When Cloud DLP scans Project A, it creates data profiles for Table 1 and stores them in the us-west1 region. Table 1's table data profile contains copies of the scan configuration and the inspection template used in the profiling operation.

If you don't want your inspection template to be copied to other regions, don't configure Cloud DLP to scan data in those regions.

Regional storage of data profiles

After inspecting your data, Cloud DLP generates data profiles. It stores each data profile in the same region where its target data is stored, which is also where inspection is processed. To view data profiles in your dashboard, you must first select the region where they reside. If you have data in multiple regions, then you must switch regions to view each set of profiles.

Unsupported regions

If you have tables in a region that Cloud DLP doesn't support, then it skips those tables and shows an error when you view the data profiles.

Multi-regions

Cloud DLP treats a multi-region as one region, and not a collection of regions. For example, the us multi-region and the us-west1 region are treated as two separate regions as far as data residency is concerned.

Compliance

For information on how Cloud DLP handles your data and helps you meet compliance requirements, see Data security.

What's next