Using Cloud DLP to scan BigQuery data
Knowing where your sensitive data exists is often the first step in ensuring that it is properly secured and managed. This knowledge can help reduce the risk of exposing sensitive details such as credit card numbers, medical information, Social Security numbers, driver's license numbers, addresses, full names, and company-specific secrets. Periodic scanning of your data can also help with compliance requirements and ensure best practices are followed as your data grows and changes with use. To help meet compliance requirements, use Cloud Data Loss Prevention (Cloud DLP) to inspect your BigQuery tables and to help protect your sensitive data.
There are two ways to scan your BigQuery data:
Sensitive data profiling. Cloud DLP can generate profiles about BigQuery data across an organization, folder, or project. Data profiles contain metrics and metadata about your tables and help you determine where sensitive and high-risk data reside. Cloud DLP reports these metrics at the project, table, and column levels. For more information, see Data profiles for BigQuery data.
On-demand inspection. Cloud DLP can perform a deep inspection on a single table or a subset of columns and report its findings down to the cell level. This kind of inspection can help you identify individual instances of specific data types, such as the precise location of a credit card number inside a table cell. You can do an on-demand inspection through the Data Loss Prevention page in the Google Cloud console, the BigQuery page in the Google Cloud console, or programmatically through the DLP API.
This page describes how to do an on-demand inspection through the BigQuery page in the Google Cloud console.
Cloud DLP is a fully managed service that lets Google Cloud customers identify and protect sensitive data at scale. Cloud DLP uses more than 150 predefined detectors to identify patterns, formats, and checksums. Cloud DLP also provides a set of tools to de-identify your data including masking, tokenization, pseudonymization, date shifting, and more, all without replicating customer data.
To learn more about Cloud DLP, see the Cloud DLP documentation.
Before you begin
- Get familiar with Cloud DLP pricing and how to keep Cloud DLP costs under control.
Scanning BigQuery data using the Google Cloud console
To scan BigQuery data, you create a Cloud DLP job that analyzes a table. You can scan a BigQuery table quickly by using the Scan with DLP option on the BigQuery page of the Google Cloud console.
To scan a BigQuery table using Cloud DLP:
In the Google Cloud console, go to the BigQuery page.
In the Explorer panel, expand your project and dataset, then select the table.
Click Export > Scan with DLP. The Cloud DLP job creation page opens in a new tab.
For Step 1: Choose input data, enter a job ID. The values in the Location section are automatically generated. Also, the Sampling section is automatically configured to run a sample scan against your data, but you can adjust the settings as needed.
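If you later want to reproduce this step through the DLP API, the input data and sampling settings correspond roughly to the `storageConfig` part of an inspection job. The following is a minimal sketch that only builds that fragment as a plain dictionary in the REST JSON shape; it does not call the API. The helper name `build_storage_config`, the project, dataset, and table names, and the sampling values are hypothetical and are not the defaults the console generates.

```python
def build_storage_config(project_id, dataset_id, table_id,
                         rows_limit_percent=10, sample_method="RANDOM_START"):
    """Build the storageConfig fragment of a DLP inspectJob (REST JSON shape)."""
    if not 0 < rows_limit_percent <= 100:
        raise ValueError("rows_limit_percent must be in the range 1-100")
    return {
        "bigQueryOptions": {
            # The table to scan (placeholder identifiers).
            "tableReference": {
                "projectId": project_id,
                "datasetId": dataset_id,
                "tableId": table_id,
            },
            # Sample a percentage of rows instead of scanning the whole table.
            "rowsLimitPercent": rows_limit_percent,
            # "TOP" scans from the first row; "RANDOM_START" picks a random offset.
            "sampleMethod": sample_method,
        }
    }


# Hypothetical identifiers, for illustration only.
storage_config = build_storage_config("my-project", "my_dataset", "my_table")
```

Sampling keeps the first scan inexpensive; you can raise the percentage once you know the scan finds what you expect.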
Optional: For Step 2: Configure detection, you can configure what types of data to look for, called infoTypes.
Do one of the following:
- To select from the list of predefined infoTypes, click Manage infoTypes. Then, select the infoTypes you want to search for.
- To use an existing inspection template, in the Template name field, enter the template's full resource name.
For more information on infoTypes, see InfoTypes and infoType detectors in the Cloud DLP documentation.
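The two detection options above map, in the DLP API, to an `inspectConfig` with a list of infoTypes or to an `inspectTemplateName`. The sketch below shows one way the corresponding fragment could be assembled. The helper name `build_detection_config` is hypothetical, and the template resource name uses placeholder IDs.

```python
def build_detection_config(info_type_names=None, template_name=None):
    """Build the detection fields of a DLP inspectJob (REST JSON shape)."""
    if not info_type_names and not template_name:
        raise ValueError("provide infoType names, a template name, or both")
    fields = {}
    if template_name:
        # Full resource name of an existing inspection template.
        fields["inspectTemplateName"] = template_name
    if info_type_names:
        # Each predefined detector is referenced by its infoType name.
        fields["inspectConfig"] = {
            "infoTypes": [{"name": name} for name in info_type_names]
        }
    return fields


# Option 1: choose predefined infoTypes directly.
by_info_types = build_detection_config(
    info_type_names=["CREDIT_CARD_NUMBER", "US_SOCIAL_SECURITY_NUMBER"]
)

# Option 2: reuse an inspection template (placeholder project and template IDs).
by_template = build_detection_config(
    template_name="projects/my-project/inspectTemplates/my-template"
)
```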
Optional: For Step 3: Add actions, turn on Save to BigQuery to publish your Cloud DLP findings to a BigQuery table. If you don't store findings, the completed job contains only statistics about the number of findings and their infoTypes. Saving findings to BigQuery saves details about the precise location and confidence of each individual finding.
Optional: If you turned on Save to BigQuery, in the Save to BigQuery section, enter the following information:
- Project ID: the project ID where your results are stored.
- Dataset ID: the name of the dataset that stores your results.
- Optional: Table ID: the name of the table that stores your results. If no table ID is specified, a default name is assigned to a new table similar to the following: dlp_googleapis_date_1234567890. If you specify an existing table, findings are appended to it.
To include the actual content that was detected, turn on Include quote.
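In the DLP API, the Save to BigQuery setting corresponds to a `saveFindings` action, and Include quote corresponds to the `includeQuote` field of `inspectConfig`. The following sketch builds both fragments as plain dictionaries. The helper name `build_save_findings` and all identifiers are hypothetical; the table ID is left out when not supplied so that the service can assign a default name, as described above.

```python
def build_save_findings(project_id, dataset_id, table_id=None, include_quote=False):
    """Build the action and quote settings of a DLP inspectJob (REST JSON shape)."""
    output_table = {"projectId": project_id, "datasetId": dataset_id}
    if table_id:
        # Optional: omit to let the service name a new table.
        output_table["tableId"] = table_id
    return {
        "actions": [
            {"saveFindings": {"outputConfig": {"table": output_table}}}
        ],
        # includeQuote stores the matched content itself with each finding.
        "inspectConfig": {"includeQuote": include_quote},
    }


# Hypothetical identifiers, for illustration only.
save_settings = build_save_findings("my-project", "dlp_results", include_quote=True)
```

Because quotes contain the detected sensitive values, consider restricting access to the results dataset when you turn this option on.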
Optional: For Step 4: Schedule, configure a time span or schedule by selecting either Specify time span or Create a trigger to run the job on a periodic schedule.
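For comparison, a periodic schedule in the DLP API is expressed as a job trigger that wraps the inspection job and carries a recurrence period. The sketch below builds such a request body as a plain dictionary under that assumption. The helper name `build_job_trigger` is hypothetical, and the one-day minimum check reflects the API's documented lower bound for the recurrence period.

```python
SECONDS_PER_DAY = 86400


def build_job_trigger(inspect_job, period_days=1):
    """Wrap an inspectJob in a jobTrigger that recurs every period_days days."""
    if period_days < 1:
        raise ValueError("the recurrence period must be at least one day")
    return {
        "jobTrigger": {
            "inspectJob": inspect_job,
            "triggers": [
                {
                    "schedule": {
                        # Durations are written as a number of seconds plus "s".
                        "recurrencePeriodDuration": f"{period_days * SECONDS_PER_DAY}s"
                    }
                }
            ],
            # HEALTHY means the trigger is active and will start jobs.
            "status": "HEALTHY",
        }
    }


# A placeholder inspectJob; in practice it would hold the storage,
# detection, and action settings chosen in the earlier steps.
weekly_trigger = build_job_trigger({"storageConfig": {}}, period_days=7)
```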
Optional: On the Review page, examine the details of your job. If needed, adjust the previous settings.
After the Cloud DLP job completes, you are redirected to the job details page and notified by email. You can view the results of the scan on the job details page, which you can also reach from the link in the job completion email.
If you chose to publish Cloud DLP findings to BigQuery, on the Job details page, click View Findings in BigQuery to open the table in the Google Cloud console. You can then query the table and analyze your findings. For more information on querying your results in BigQuery, see Querying Cloud DLP findings in BigQuery in the Cloud DLP documentation.
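As a starting point for analysis, a common first query counts findings per infoType. The sketch below only composes the SQL text; it does not run it. The helper name `findings_by_info_type_sql` and the table identifiers are hypothetical, and the query assumes the findings table exposes the infoType as an `info_type.name` field, so check the schema of your own results table before relying on it.

```python
def findings_by_info_type_sql(project_id, dataset_id, table_id):
    """Compose a query that counts saved DLP findings per infoType."""
    table = f"`{project_id}.{dataset_id}.{table_id}`"
    return (
        "SELECT info_type.name AS info_type_name, COUNT(*) AS finding_count\n"
        f"FROM {table}\n"
        "GROUP BY info_type_name\n"
        "ORDER BY finding_count DESC"
    )


# Hypothetical identifiers, for illustration only.
query = findings_by_info_type_sql("my-project", "dlp_results", "my_findings")
print(query)
```

You can paste the printed query into the BigQuery query editor, replacing the identifiers with those of your findings table.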
Learn more about profiling data in an organization, folder, or project.
Read the Identity & Security blog post Take charge of your data: using Cloud DLP to de-identify and obfuscate sensitive information.
If you want to redact or otherwise de-identify the sensitive data that the Cloud DLP scan found, see the following:
- Inspect text to de-identify sensitive information
- De-identifying sensitive data in the Cloud DLP documentation
- AEAD encryption concepts in Google Standard SQL for information on encrypting individual values within a table
- Protecting data with Cloud KMS keys for information on creating and managing your own encryption keys in Cloud KMS to encrypt BigQuery tables