Using Cloud DLP to scan BigQuery data
Knowing where your sensitive data exists is often the first step in ensuring that it is properly secured and managed. This knowledge can help reduce the risk of exposing sensitive details such as credit card numbers, medical information, Social Security numbers, driver's license numbers, addresses, full names, and company-specific secrets. Periodic scanning of your data can also help with compliance requirements and ensure best practices are followed as your data grows and changes with use. To help meet compliance requirements, use Cloud Data Loss Prevention (Cloud DLP) to scan your BigQuery tables and to protect your sensitive data.
Cloud DLP is a fully managed service that lets Google Cloud customers identify and protect sensitive data at scale. Cloud DLP uses more than 150 predefined detectors to identify patterns, formats, and checksums. Cloud DLP also provides a set of tools to de-identify your data including masking, tokenization, pseudonymization, date shifting, and more, all without replicating customer data.
To learn more about Cloud DLP, see the Cloud DLP documentation.
Before you begin
- Get familiar with Cloud DLP pricing and how to keep Cloud DLP costs under control.
- Enable the DLP API.
- Ensure that the user creating your Cloud DLP jobs is granted an appropriate predefined Cloud DLP IAM role or sufficient permissions to run Cloud DLP jobs.
Scanning BigQuery data using the Google Cloud console
To scan BigQuery data, you create a Cloud DLP job that analyzes a table. You can scan a BigQuery table quickly by using the Scan with DLP option in the BigQuery Google Cloud console.
To scan a BigQuery table using Cloud DLP:
In the Google Cloud console, go to the BigQuery page.
In the Explorer panel, expand your project and dataset, then select the table.
Click Export > Scan with DLP. The Cloud DLP job creation page opens in a new tab.
For Step 1: Choose input data, enter a job ID. The values in the Location section are automatically generated. Also, the Sampling section is automatically configured to run a sample scan against your data, but you can adjust the settings as needed.
Optional: For Step 2: Configure detection, you can configure what types of data to look for, called infoTypes. Do one of the following:
- To select from the list of predefined infoTypes, click Manage infoTypes. Then, select the infoTypes you want to search for.
- To use an existing inspection template, in the Template name field, enter the template's full resource name.
For more information on infoTypes, see InfoTypes and infoType detectors in the Cloud DLP documentation.
Optional: For Step 3: Add actions, turn on Save to BigQuery to publish your Cloud DLP findings to a BigQuery table. If you don't store findings, the completed job contains only statistics about the number of findings and their infoTypes. Saving findings to BigQuery saves details about the precise location and confidence of each individual finding.
Optional: If you turned on Save to BigQuery, in the Save to BigQuery section, enter the following information:
- Project ID: the project ID where your results are stored.
- Dataset ID: the name of the dataset that stores your results.
- Optional: Table ID: the name of the table that stores your results. If no table ID is specified, a default name similar to dlp_googleapis_date_1234567890 is assigned to a new table. If you specify an existing table, findings are appended to it.
To include the actual content that was detected, turn on Include quote.
Optional: For Step 4: Schedule, configure a time span or schedule by selecting either Specify time span or Create a trigger to run the job on a periodic schedule.
Optional: On the Review page, examine the details of your job. If needed, adjust the previous settings.
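The console steps above can also be expressed as an API request. The following sketch builds an InspectJobConfig equivalent to the walkthrough, assuming request shapes from the DLP v2 API; the project, dataset, and table names are hypothetical, and the job-creation call is commented out because it requires credentials.

```python
# Sketch: the scan job from the console walkthrough, built for the DLP v2 API.
# Field names assumed from the API; all project/dataset/table names hypothetical.

def build_inspect_job(project, dataset, table, findings_dataset):
    """Build an InspectJobConfig that scans a BigQuery table and saves findings."""
    return {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project,
                    "dataset_id": dataset,
                    "table_id": table,
                },
                # Mirror the console's Sampling section: scan a sample of rows.
                "rows_limit": 1000,
                "sample_method": "RANDOM_START",
            }
        },
        "inspect_config": {
            # Equivalent to selecting predefined infoTypes in Step 2.
            "info_types": [
                {"name": "CREDIT_CARD_NUMBER"},
                {"name": "US_SOCIAL_SECURITY_NUMBER"},
            ],
            # Equivalent to turning on Include quote in Step 3.
            "include_quote": True,
        },
        "actions": [
            {
                # Equivalent to the Save to BigQuery action in Step 3.
                "save_findings": {
                    "output_config": {
                        "table": {
                            "project_id": project,
                            "dataset_id": findings_dataset,
                            # Table ID omitted: a default name is assigned.
                        }
                    }
                }
            }
        ],
    }

inspect_job = build_inspect_job("my-project", "my_dataset", "my_table", "dlp_results")

# With credentials configured, the job would be created like this:
#   from google.cloud import dlp_v2
#   client = dlp_v2.DlpServiceClient()
#   job = client.create_dlp_job(
#       request={"parent": "projects/my-project", "inspect_job": inspect_job}
#   )
```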
After the Cloud DLP job completes, you are redirected to the job details page, and you're notified by email. You can view the results of the scan on the job details page, or you can click the link to the Cloud DLP job details page in the job completion email.
If you chose to publish Cloud DLP findings to BigQuery, on the Job details page, click View Findings in BigQuery to open the table in the Google Cloud console. You can then query the table and analyze your findings. For more information on querying your results in BigQuery, see Querying Cloud DLP findings in BigQuery in the Cloud DLP documentation.
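A common first query over the findings table is a count of findings per infoType. The sketch below assumes the findings table name from the default described earlier and the nested info_type record that Cloud DLP writes to its output tables; the query execution is commented out because it requires credentials.

```python
# Sketch: summarizing saved Cloud DLP findings in BigQuery.
# Table name is hypothetical (the default dlp_googleapis_date_1234567890 pattern);
# the info_type.name column is assumed from Cloud DLP's findings output schema.

QUERY = """
SELECT
  info_type.name AS info_type,
  COUNT(*) AS finding_count
FROM `my-project.dlp_results.dlp_googleapis_date_1234567890`
GROUP BY info_type
ORDER BY finding_count DESC
"""

# With credentials configured, the query would run like this:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(QUERY).result():
#       print(row.info_type, row.finding_count)
```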
To learn more about inspecting BigQuery and other storage repositories for sensitive data using Cloud DLP, see the Cloud DLP documentation.
If you want to redact or otherwise de-identify the sensitive data that the Cloud DLP scan found, see:
- Inspect text to de-identify sensitive information
- De-identifying sensitive data in the Cloud DLP documentation
- AEAD encryption concepts in standard SQL for information on encrypting individual values within a table
- Protecting data with Cloud KMS keys for information on creating and managing your own encryption keys in Cloud KMS to encrypt BigQuery tables
- Identity and Security blog post: Taking charge of your data: using Cloud DLP to de-identify and obfuscate sensitive information