Validating de-identified data in BigQuery and re-identifying PII data

This tutorial shows you how to validate the sample dataset that you de-identified in the Running an automated Dataflow pipeline to de-identify a PII dataset tutorial. After validation, you re-identify the data using the Cloud DLP templates that were previously used to de-identify the personally identifiable information (PII).

This document is part of a series that shows you how to de-identify and re-identify PII in large-scale datasets by using Cloud DLP.

This tutorial assumes that you are familiar with shell scripting, SQL, BigQuery, and Dataflow.

Reference architecture

This tutorial demonstrates the data validation and re-identification pipeline that is illustrated in the following diagram.

Architecture of re-identification pipeline.

The data validation and re-identification pipeline validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at scale.

Objectives

  • Validate a de-identified dataset in BigQuery using standard SQL.
  • Re-identify data by using a Dataflow pipeline.

Costs

This tutorial uses the following billable components of Google Cloud:

  • BigQuery
  • Cloud Data Loss Prevention
  • Cloud Storage
  • Dataflow
  • Pub/Sub

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Complete part 2 of the series.
  2. Complete part 3 of the series.

Validating the de-identified dataset in BigQuery

You validate that the data loaded into BigQuery is de-identified so that it's safe to share without exposing PII. Because the automated pipeline in the previous tutorial created a BigQuery table from the sample CSV files, you can validate the data by comparing the BigQuery tables and schemas against the original CSV files.
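
Before you step through the console-based checks that follow, you can also inspect the table schema directly from Cloud Shell. The following sketch uses the bq command-line tool and assumes the table name CCRecords_1564602825 that the pipeline created in the previous tutorial; adjust the name to match your table:

    # List the column names and types of the de-identified table.
    bq show --schema --format=prettyjson deid_dataset.CCRecords_1564602825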

  1. In Cloud Shell, display the header row of the CSV file that you used to create the schema:

    head -1 solution-test/CCRecords_1564602825.csv
    

    The output is the following:

    ID,Card Type Code,Card Type Full Name,Issuing Bank,Card Number,Card Holder's Name,Issue Date,Expiry Date,Billing Date,Card PIN,Credit Limit,Age,SSN,JobTitle,Additional Details
    
  2. In the Cloud Console, go to the BigQuery Query editor page.

    Go to Query editor

  3. In the Query editor, run the following query to compare the schema to the header row of the CSV file:

    select table_name, column_name
    from `deid_dataset.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name="CCRecords_1564602825"
    

    The output is the following:

    Output of query in a table with labelled column names.

    There are no spaces in the column names because the pipeline ensures that the column and table names contain only characters that are valid according to the BigQuery naming standard.

  4. Validate that the number of rows in the table is 100,000:

    select count(*) as number_of_rows
    from `deid_dataset.CCRecords_*` LIMIT 1
    

    In the output, the number_of_rows column is 100000 instead of 100001. The CSV file has a header record, which isn't included in the table row count.

    Output showing number of rows.

  5. Validate that the bucketing transformation is successfully applied to the JobTitle column:

    select JobTitle, count(*) as number_of_records_found
    from `deid_dataset.CCRecords_1564602825`
    group by JobTitle
    

    In the output, the JobTitle values are grouped into three generalized buckets: Executive, Engineer, and Manager.

    Output showing job title groupings.

  6. Validate that values in the Age column are grouped into six buckets, ranging from 60 down to 20:

    select Age, count(*) as number_of_records_found
    from `deid_dataset.CCRecords_1564602825`
    group by Age
    order by Age desc
    

    The output is the following:

    Output showing age groupings.

    The ages and job titles are grouped into these categories to further support de-identification of the PII. For example, consider a well-known young CEO at a startup who could easily be recognized. Someone might identify this person in the dataset by using these quasi-identifiers (job title and age) from the original dataset. A bucketing transformation makes it more difficult to identify this person in the de-identified copies of the dataset. (A group-size check that illustrates this effect appears after these steps.)

  7. Validate the masking transformation for the SSN:

    select SSN
    from `deid_dataset.CCRecords_*`
    where regexp_contains(SSN, r"\*")
    

    In the output, the first five digits for all SSN entries are masked:

    Output masking SSN.

  8. Validate that the cryptographic transformation used deterministic encryption for the card_holders_name, card_number, and card_pin entries:

    select card_holders_name, card_number, card_pin
    from `deid_dataset.CCRecords_*`
    limit 1
    

    In the output, all three entries are replaced with a base64-encoded encrypted string:

    Output showing encryption of card details.

  9. Validate that the infoType transformation was applied to the Additional Details column:

    select additional_details
    from `deid_dataset.CCRecords_*`
    where regexp_contains(additional_details, r'(IBAN_CODE+)')
      or regexp_contains(additional_details, r'(EMAIL_ADDRESS+)')
      or regexp_contains(additional_details, r'(PHONE_NUMBER+)')
      or regexp_contains(additional_details, r'(ONLINE_USER_ID+)')
    

    In the output, sensitive values are replaced with placeholder values such as [IBAN_CODE], [EMAIL_ADDRESS], [PHONE_NUMBER], and [ONLINE_USER_ID]:

    Output showing placeholder values.

  10. Query the de-identified copies of the dataset for the ID 76901:

    select * from `deid_dataset.CCRecords_1564602825` where id='76901'
    

    The output shows the following values:

    "ID": "76901"
    "Card_Type_Code": "AX"
    "Card_Type_Full_Name": "American Express"
    "Issuing_Bank": "American Express"
    "Card_Number": "encrypted value"
    "Card_Holders_Name": "encrypted value"
    "Issue_Date": "03/2013"
    "Expiry_Date": "03/2015"
    "Billing_Date": "14"
    "Card_PIN": "encrypted value"
    "Credit_Limit": "57000"
    "Age": "40"
    "SSN": "***-**-9395"
    "JobTitle": "Manager"
    "Additional_Details": "[IBAN_CODE][PHONE_NUMBER] [EMAIL_ADDRESS] 102-326-2388 hugedomains.com Maggio[ONLINE_USER_ID]"
    
  11. In Cloud Shell, compare the output from the previous step with the original dataset in the CSV file for the ID 76901:

    awk -F "," '$1 == 76901' solution-test/CCRecords_1564602825.csv
    

    The output is:

    76901,AX,American Express,American Express,376467075604410,Francis U Riggs,03/2013,03/2015,14,7425,57000,43,623-12-9395,Product Manager,ES20 6871 8240 0493 0298 3587 dsumpton1nc4@reddit.com 102-326-2388 hugedomains.com Maggio:5282194096147081
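
As noted in step 6, bucketing the Age and JobTitle values weakens quasi-identifier attacks. One way to see the effect — a sketch that assumes the same table name as in the previous steps — is to count how many records share each JobTitle and Age combination. In the de-identified copy, every combination maps to a large group rather than to an individual:

    # Show the group size for each combination of the generalized quasi-identifiers.
    bq query --use_legacy_sql=false '
    select JobTitle, Age, count(*) as group_size
    from `deid_dataset.CCRecords_1564602825`
    group by JobTitle, Age
    order by group_size asc'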
    

Re-identifying the dataset from BigQuery

Finally, you can re-identify the data back to its original values by using the Cloud DLP templates that were used to de-identify it. To re-identify the data at scale, you use an automated re-identification Dataflow pipeline. This re-identification process is useful when the token encryption key (TEK) must be rotated: you re-identify the dataset before the key rotation, and then re-tokenize it by using the new TEK.
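
The pipeline that you trigger later in this section references the re-identification template by its full resource name through the REID_TEMPLATE_NAME variable, which was set when you created the Cloud DLP templates earlier in the series. If that variable is no longer set in your shell, one way to look up the template name — a sketch that calls the Cloud DLP REST API directly and assumes that you are authenticated in Cloud Shell — is the following:

    # List the de-identify templates in the project and print their resource names.
    curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/deidentifyTemplates" \
        | grep '"name"'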

  1. In Cloud Shell, create a Pub/Sub topic where the re-identified values will be published:

    export TOPIC_ID="reid-topic"
    gcloud pubsub topics create ${TOPIC_ID}
    
  2. Create a Pub/Sub subscription for the topic:

    export SUBSCRIPTION_ID="reid-subscription"
    gcloud pubsub subscriptions create ${SUBSCRIPTION_ID} --topic=${TOPIC_ID}
    
  3. Export the BigQuery SQL query:

    export QUERY='select id,card_number,card_holders_name
    from `deid_dataset.CCRecords_1564602825`
    where safe_cast(credit_limit as int64)>100000 and safe_cast(age as int64)>50
    group by id,card_number,card_holders_name limit 10'
    

    For this tutorial, you re-identify the card number and card holder's name for anyone who is older than 50 and has a credit limit over US$100,000.

  4. Upload the query to the data storage bucket:

      cat << EOF | gsutil cp - gs://${DATA_STORAGE_BUCKET}/reid_query.sql
      ${QUERY}
      EOF
    
  5. Trigger the pipeline:

    gcloud beta dataflow flex-template run "dlp-reid-demo" --project=${PROJECT_ID} \
      --region=${REGION} \
      --template-file-gcs-location=gs://dataflow-dlp-solution-sample-data/dynamic_template_dlp_v2.json \
      --parameters=^~^streaming=true~enableStreamingEngine=true~tempLocation=gs://${DATAFLOW_TEMP_BUCKET}/temp~numWorkers=1~maxNumWorkers=2~runner=DataflowRunner~tableRef=${PROJECT_ID}:deid_dataset.CCRecords_1564602825~dataset=deid_dataset~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-highmem-8~topic=projects/${PROJECT_ID}/topics/${TOPIC_ID}~deidentifyTemplateName=${REID_TEMPLATE_NAME}~DLPMethod=REID~keyRange=1024~queryPath=gs://${DATA_STORAGE_BUCKET}/reid_query.sql
    
  6. To monitor the pipeline, in the Cloud Console, go to the Dataflow page.

    Go to Dataflow

    After a few minutes, the pipeline completes successfully and the job status changes to Succeeded:

    Successful completion of pipeline with a job status of succeeded.

  7. To validate the re-identified data, in Cloud Shell, pull the re-identified records from the Pub/Sub subscription and write them to a file:

    gcloud pubsub subscriptions pull ${SUBSCRIPTION_ID} \
        --auto-ack --limit 1000 --project ${PROJECT_ID} >> re-id-data.txt
    

    Inspect the file to verify that the data is re-identified to its original format; a spot-check against the original CSV file is suggested after these steps:

    cat re-id-data.txt
    

    The output is similar to the following:

    {"id":"17880","Card Number":"6558977436350773","Card Holder's Name":"Elden E Courtright"}
    {"id":"57055","Card Number":"3529749048585358","Card Holder's Name":"Francisco Duffy"}
    {"id":"31487","Card Number":"6227182611813833","Card Holder's Name":"Leslie C Wade"}
    {"id":"38521","Card Number":"36533383873740","Card Holder's Name":"Laura Z Thomas"}
    {"id":"81585","Card Number":"3569920557158538","Card Holder's Name":"Kelly Reed"}
    {"id":"96408","Card Number":"340687318048371","Card Holder's Name":"Ignacio V Pisano"}
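
To spot-check a record, you can search the original CSV file for one of the re-identified values. The following sketch assumes the sample file path from the validation section; substitute a card holder's name that appears in your own output:

    # Confirm that a re-identified name matches a record in the original CSV file.
    grep "Elden E Courtright" solution-test/CCRecords_1564602825.csv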
    

Clean up

The easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.
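
If you want to keep the project, the following sketch deletes the main resources that this tutorial created, plus the BigQuery dataset from the previous tutorial. It assumes the environment variables used earlier in this tutorial; adjust the names to match your environment:

    # Cancel the streaming re-identification pipeline if it is still running.
    gcloud dataflow jobs list --region=${REGION} --status=active --format="value(id)" \
        | xargs -r -I {} gcloud dataflow jobs cancel {} --region=${REGION}

    # Delete the Pub/Sub subscription and topic.
    gcloud pubsub subscriptions delete ${SUBSCRIPTION_ID}
    gcloud pubsub topics delete ${TOPIC_ID}

    # Delete the uploaded re-identification query and the de-identified dataset.
    gsutil rm gs://${DATA_STORAGE_BUCKET}/reid_query.sql
    bq rm -r -f deid_dataset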

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next