This tutorial shows you how to validate the sample dataset that you de-identified in the Running an automated Dataflow pipeline to de-identify a PII dataset tutorial. After validation, you re-identify the data using the Cloud DLP templates that were previously used to de-identify the personally identifiable information (PII).
This document is part of a series:
- De-identification and re-identification of PII in large volume data using Cloud DLP
- Creating Cloud DLP de-identification transformation templates for PII dataset
- Running an automated Dataflow pipeline to de-identify a PII dataset
- Validating de-identified data in BigQuery and re-identifying PII data (this document)
This tutorial assumes that you are familiar with shell scripting, SQL, BigQuery, and Dataflow.
Reference architecture
This tutorial demonstrates the data validation and re-identification pipeline that is illustrated in the following diagram.
The data validation and re-identification pipeline validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at scale.
Objectives
- Validate a de-identified dataset in BigQuery using standard SQL.
- Re-identify data by using a Dataflow pipeline.
Costs
This tutorial uses the following billable components of Google Cloud:
- BigQuery
- Cloud Data Loss Prevention
- Cloud Storage
- Dataflow
- Pub/Sub
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Before you begin
This tutorial uses the project, Cloud Storage buckets, Cloud DLP templates, and de-identified BigQuery dataset that you created in the previous tutorial, Running an automated Dataflow pipeline to de-identify a PII dataset. Complete that tutorial before you continue.
Validating the de-identified dataset in BigQuery
You validate that the data in BigQuery is de-identified to ensure that it's safe to share without exposing PII. Because the automated pipeline created a BigQuery table from the sample CSV files in the previous tutorial, you can validate the data by comparing the table schema and contents in BigQuery with the original CSV file.
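The following steps use Cloud Shell and the BigQuery Query editor. If you prefer to stay in Cloud Shell, you can also inspect the table schema with the bq CLI; this optional check isn't part of the main steps:
bq show --schema --format=prettyjson deid_dataset.CCRecords_1564602825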
In Cloud Shell, display the header row of the CSV file that you used to create the schema:
head -1 solution-test/CCRecords_1564602825.csv
The output is the following:
ID,Card Type Code,Card Type Full Name,Issuing Bank,Card Number,Card Holder's Name,Issue Date,Expiry Date,Billing Date,Card PIN,Credit Limit,Age,SSN,JobTitle,Additional Details
In the Cloud Console, go to the BigQuery Query editor page.
In the Query editor, run the following query to compare the schema to the header row of the CSV file:
select table_name, column_name from `deid_dataset.INFORMATION_SCHEMA.COLUMNS` WHERE table_name="CCRecords_1564602825"
In the output, the column names correspond to the header row of the CSV file. There are no spaces in the column names because the pipeline ensures that the column and table names contain only valid characters according to the BigQuery naming standard.
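You can see the kind of mapping involved by normalizing the CSV header yourself. The following Cloud Shell snippet is only an illustration, not the pipeline's actual code: it strips apostrophes and replaces the remaining non-alphanumeric characters with underscores, which turns Card Holder's Name into Card_Holders_Name.
head -1 solution-test/CCRecords_1564602825.csv | tr ',' '\n' \
  | sed -e "s/'//g" -e 's/[^A-Za-z0-9]/_/g'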
Validate that the number of rows in the table is 100,000:
select count(*) as number_of_rows from `deid_dataset.CCRecords_*` LIMIT 1
In the output, the number_of_rows column is 100000 instead of 100001. The CSV file has a header record, which isn't included in the table row count.
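If the sample CSV file is still in your Cloud Shell home directory, you can confirm the off-by-one by counting its lines; the count should be 100001 because it includes the header row:
wc -l solution-test/CCRecords_1564602825.csv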
Validate that the bucketing transformation is successfully applied to the JobTitle column:
select JobTitle, count(*) as number_of_records_found from `deid_dataset.CCRecords_1564602825` group by JobTitle
In the output, the JobTitle values are grouped into three generalized buckets: Executive, Engineer, and Manager.
Validate that values in the Age column are grouped into six different buckets from 60 to 20:
select Age, count(*) as number_of_records_found from `deid_dataset.CCRecords_1564602825` group by Age order by Age desc
In the output, the ages appear only as bucketed values rather than exact ages.
The ages and job titles are grouped into these generalized buckets to further strengthen the de-identification of the PII. For example, consider a well-known young CEO at a startup company who can easily be recognized. Someone might identify this person in the dataset by using these quasi-identifiers (job title and age) from the original dataset. A bucketing transformation makes it more difficult to identify this person in the de-identified copies of the dataset.
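If you want to quantify how well this generalization protects the quasi-identifiers, you can look at the size of each (JobTitle, Age) group; larger groups make re-identification by quasi-identifiers harder. The following bq CLI query is a suggested check, not part of the original steps:
bq query --use_legacy_sql=false \
'select JobTitle, Age, count(*) as group_size
from `deid_dataset.CCRecords_1564602825`
group by JobTitle, Age
order by group_size asc'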
Validate the masking transformation for the SSN:
select SSN from `deid_dataset.CCRecords_*` where regexp_contains(SSN, r"^\*{3}-\*{2}")
In the output, the first five digits of all SSN entries are masked (for example, ***-**-9395).
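To turn this spot check into a pass/fail test, you can count the SSN values that don't match the masked shape shown above (***-**-1234); the expected result is 0. This is a suggested check using the bq CLI, assuming all SSNs follow that shape:
bq query --use_legacy_sql=false \
'select count(*) as unmasked_ssn_count
from `deid_dataset.CCRecords_*`
where not regexp_contains(SSN, r"^\*{3}-\*{2}-[0-9]{4}$")'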
Validate that the cryptographic transformation used deterministic encryption for the card_holders_name, card_number, and card_pin entries:
select card_holders_name, card_number, card_pin from `deid_dataset.CCRecords_*` limit 1
In the output, all three entries are replaced with a base64-encoded encrypted string.
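As a rough additional check that no plaintext values slipped through, you can count card_number values that still consist only of digits, as the original card numbers do; the expected result is 0. This is a suggested query, not part of the original steps:
bq query --use_legacy_sql=false \
'select count(*) as digit_only_card_numbers
from `deid_dataset.CCRecords_*`
where regexp_contains(card_number, r"^[0-9]+$")'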
Validate that the infoType transformation was applied to the Additional Details column:
select additional_details from `deid_dataset.CCRecords_*` where regexp_contains(additional_details, r'(IBAN_CODE+)') or regexp_contains(additional_details, r'(EMAIL_ADDRESS+)') or regexp_contains(additional_details, r'(PHONE_NUMBER+)') or regexp_contains(additional_details, r'(ONLINE_USER_ID+)')
In the output, sensitive values are replaced with placeholder values such as [IBAN_CODE], [EMAIL_ADDRESS], [PHONE_NUMBER], and [ONLINE_USER_ID].
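The query above only confirms that the placeholders are present. As a rough complementary check, you can count rows whose Additional_Details column still contains something that looks like a raw email address; the expected result is 0. A simple regular expression isn't as thorough as a Cloud DLP inspection, so treat this as a sanity check only:
bq query --use_legacy_sql=false \
'select count(*) as email_like_values
from `deid_dataset.CCRecords_*`
where regexp_contains(additional_details, r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")'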
Query the de-identified copies of the dataset for the ID 76901:
select * from `deid_dataset.CCRecords_1564602825` where id='76901'
The output shows the following values:
"ID": "76901" "Card_Type_Code": "AX" "Card_Type_Full_Name": "American Express" "Issuing_Bank": "American Express" "Card_Number": "encrypted value" "Card_Holders_Name": "encrypted value" "Issue_Date": "03/2013" "Expiry_Date": "03/2015" "Billing_Date": "14" "Card_PIN": "encrypted value" "Credit_Limit": "57000" "Age": "40" "SSN": "***-**-9395" "JobTitle": "Manager" "Additional_Details": "[IBAN_CODE][PHONE_NUMBER] [EMAIL_ADDRESS] 102-326-2388 hugedomains.com Maggio[ONLINE_USER_ID]"
In Cloud Shell, compare the output from the previous step with the original dataset in the CSV file for the ID 76901:
awk -F "," '$1 == 76901' solution-test/CCRecords_1564602825.csv
The output is:
76901,AX,American Express,American Express,376467075604410,Francis U Riggs,03/2013,03/2015,14,7425,57000,43,623-12-9395,Product Manager,ES20 6871 8240 0493 0298 3587 dsumpton1nc4@reddit.com 102-326-2388 hugedomains.com Maggio:5282194096147081
Re-identifying the dataset from BigQuery
Finally, you can re-identify the data back to its original values by using the Cloud DLP templates that were used to de-identify it. To re-identify the dataset at scale, you use an automated re-identification Dataflow pipeline. This re-identification process is useful when the token encryption key (TEK) must be rotated: you re-identify the dataset before the key rotation, and then re-tokenize it by using the new TEK.
In Cloud Shell, create a Pub/Sub topic where the re-identified values will be published:
export TOPIC_ID="reid-topic" gcloud pubsub topics create ${TOPIC_ID}
Create a Pub/Sub subscription for the topic:
export SUBSCRIPTION_ID="reid-subscription" gcloud pubsub subscriptions create ${SUBSCRIPTION_ID} --topic=${TOPIC_ID}
Export the BigQuery SQL query:
export QUERY='select id,card_number,card_holders_name from `deid_dataset.CCRecords_1564602825` where safe_cast(credit_limit as int64)>100000 and safe_cast(age as int64)>50 group by id,card_number,card_holders_name limit 10'
For this tutorial, you re-identify the card number and card holder's name for anyone who is older than 50 and has a credit limit over US$100,000.
Upload the query in the data storage bucket:
cat << EOF | gsutil cp - gs://${DATA_STORAGE_BUCKET}/reid_query.sql
${QUERY}
EOF
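Optionally, confirm that the query file was written to the bucket:
gsutil cat gs://${DATA_STORAGE_BUCKET}/reid_query.sql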
Trigger the pipeline:
gcloud beta dataflow flex-template run "dlp-reid-demo" --project=${PROJECT_ID} \
--region=${REGION} \
--template-file-gcs-location=gs://dataflow-dlp-solution-sample-data/dynamic_template_dlp_v2.json \
--parameters=^~^streaming=true~enableStreamingEngine=true~tempLocation=gs://${DATAFLOW_TEMP_BUCKET}/temp~numWorkers=1~maxNumWorkers=2~runner=DataflowRunner~tableRef=${PROJECT_ID}:deid_dataset.CCRecords_1564602825~dataset=deid_dataset~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-highmem-8~topic=projects/${PROJECT_ID}/topics/${TOPIC_ID}~deidentifyTemplateName=${REID_TEMPLATE_NAME}~DLPMethod=REID~keyRange=1024~queryPath=gs://${DATA_STORAGE_BUCKET}/reid_query.sql
To monitor the pipeline, in the Cloud Console, go to the Dataflow page.
After a few minutes, the pipeline completes successfully.
To validate the re-identified data, in Cloud Shell, pull the re-identified records from the Pub/Sub subscription and write them to a file:
gcloud pubsub subscriptions pull ${SUBSCRIPTION_ID} \
--auto-ack --limit 1000 --project ${PROJECT_ID} >> re-id-data.txt
View the re-identified data:
cat re-id-data.txt
The output shows that the data is successfully re-identified to the original format. It is similar to the following:
{"id":"17880","Card Number":"6558977436350773","Card Holder's Name":"Elden E Courtright"} {"id":"57055","Card Number":"3529749048585358","Card Holder's Name":"Francisco Duffy"} {"id":"31487","Card Number":"6227182611813833","Card Holder's Name":"Leslie C Wade"} {"id":"38521","Card Number":"36533383873740","Card Holder's Name":"Laura Z Thomas"} {"id":"81585","Card Number":"3569920557158538","Card Holder's Name":"Kelly Reed"} {"id":"96408","Card Number":"340687318048371","Card Holder's Name":"Ignacio V Pisano"}
Cleaning up
The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.
Delete the project
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Take charge of your data: How tokenization makes data usable without sacrificing privacy.
- Take charge of your data: Using Cloud DLP to find and protect PII.
- Automating the classification of data uploaded to Cloud Storage.
- Protecting sensitive datasets in Google Cloud.
- De-identification and re-identification of PII in large volume data using Cloud DLP.
- Review the sample code in the Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP repo on GitHub.
- Learn about other pattern recognition solutions.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.