Creating Cloud DLP de-identification transformation templates for PII datasets

This tutorial shows you how to create and manage de-identification transformations for large-scale personally identifiable information (PII) datasets using Cloud Data Loss Prevention (Cloud DLP) templates. This tutorial also offers guidance on how to select the appropriate transformations for your use case.

This document is part of a series.

This tutorial is intended for enterprise security admins and assumes that you have basic shell scripting knowledge.

Reference architecture

This tutorial demonstrates the configuration (DLP template and key) management section that is illustrated in the following diagram.

Architecture of de-identification configuration.

This architecture consists of a managed de-identification configuration that is accessible by only a small group of people—for example, security admins—to avoid exposing de-identification methods and encryption keys.

Objectives

  • Design a Cloud DLP transformation for a sample dataset.
  • Create Cloud DLP templates to store the transformation configuration.

Costs

This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. You run all commands in this tutorial from Cloud Shell.
  6. In Cloud Shell, enable the Cloud DLP, Cloud Key Management Service, BigQuery, Cloud Storage, Dataflow, and Cloud Build APIs.
    gcloud services enable dlp.googleapis.com \
        cloudkms.googleapis.com \
        bigquery.googleapis.com \
        storage-component.googleapis.com \
        dataflow.googleapis.com \
        cloudbuild.googleapis.com
    

Creating Cloud Storage buckets

In this series, you need two Cloud Storage buckets. The first bucket stores the sample dataset, and the second bucket stores temporary data for the automated pipeline used in the next part of this series (Running an automated Dataflow pipeline to de-identify a PII dataset).

  1. In Cloud Shell, create two Cloud Storage buckets (replace REGION with a Dataflow region of your choosing, for example, us-central1):

    export REGION=REGION
    export PROJECT_ID=$(gcloud config get-value project)
    export DATA_STORAGE_BUCKET=${PROJECT_ID}-data-storage-bucket
    export DATAFLOW_TEMP_BUCKET=${PROJECT_ID}-dataflow-temp-bucket
    gsutil mb -c standard -l ${REGION} gs://${DATA_STORAGE_BUCKET}
    gsutil mb -c standard -l ${REGION} gs://${DATAFLOW_TEMP_BUCKET}
    

Downloading the sample files

You download the sample files to identify the columns that require de-identification transformations.

  1. In Cloud Shell, download the sample dataset and scripts used in this tutorial:

    curl -o "sample_data_scripts.tar.gz" \
        "https://storage.googleapis.com/dataflow-dlp-solution-sample-data/sample_data_scripts.tar.gz"
    
  2. Decompress and unpack the contents of the file:

    tar -zxvf sample_data_scripts.tar.gz
    
  3. To validate the transfer, confirm the line count of the downloaded sample data file:

    wc -l solution-test/CCRecords_1564602825.csv
    

    The output is the following:

    100001 solution-test/CCRecords_1564602825.csv
    
  4. To identify which columns might require DLP de-identification, examine the header record in the CSV file:

    head -1 solution-test/CCRecords_1564602825.csv
    

    The output is:

    ID,Card Type Code,Card Type Full Name,Issuing Bank,Card Number,Card Holder's Name,Issue Date,Expiry Date,Billing Date,Card PIN,Credit Limit,Age,SSN,JobTitle,Additional Details
    

    The first line of each CSV file defines the data schema and column names. The extracted dataset contains direct identifiers (SSN and Card Holder's Name) and quasi-identifiers (Age and JobTitle).
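To review the column names one per line, you can split the header record on commas. This is a local sketch using standard shell tools; the header string below is copied from the command output shown earlier:

```shell
# Split the CSV header record into one column name per line.
header="ID,Card Type Code,Card Type Full Name,Issuing Bank,Card Number,Card Holder's Name,Issue Date,Expiry Date,Billing Date,Card PIN,Credit Limit,Age,SSN,JobTitle,Additional Details"
printf '%s\n' "${header}" | tr ',' '\n'
```

In the tutorial itself, you would pipe the output of `head -1 solution-test/CCRecords_1564602825.csv` into `tr` instead of using a literal string.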

The process to determine the transformations required varies based on your use case. For the sample dataset used in this tutorial, a list of transformations is summarized in the following table.

For each column, the following list shows the infoType (custom or built-in) to inspect for, the type of Cloud DLP transformation, and a description of the transformation:

  • Card PIN (infoType: N/A): encryption using cryptographic hashing. Cryptographic hashing replaces the original data with a base64-encoded value that can't be reversed.
  • Card Number (infoType: N/A): deterministic encryption (DE). DE replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
  • Card Holder's Name (infoType: N/A): deterministic encryption (DE). DE replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
  • SSN (Social Security Number) (infoType: N/A): character masking. Masking is a non-cryptographic technique that replaces part or all of the original data with a specified character.
  • Age (infoType: N/A): bucketing with a general value. The bucketing transformation replaces an identifiable value with a more general value.
  • Job Title (infoType: N/A): bucketing with a general value. The bucketing transformation replaces an identifiable value with a more general value.
  • Additional Details (built-in infoTypes: IBAN_CODE, EMAIL_ADDRESS, PHONE_NUMBER; custom infoType: ONLINE_USER_ID): replacement with the infoType name. The replacement transformation replaces the original data with the name of the detected infoType.
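As an illustration of the bucketing transformation, the following hypothetical shell function mimics the Age buckets that the de-identification template defines later in this tutorial. Cloud DLP performs this transformation server-side; this is only a local sketch of the logic:

```shell
# Local sketch (for illustration only) of the Age bucketing transformation:
# replace an exact age with a general bucket value, mirroring the buckets
# in the template ([18,30)->20, [30,40)->30, [40,50)->40, [50,60)->50, [60,99)->60).
bucket_age() {
  age=$1
  if   [ "${age}" -ge 60 ]; then echo 60
  elif [ "${age}" -ge 50 ]; then echo 50
  elif [ "${age}" -ge 40 ]; then echo 40
  elif [ "${age}" -ge 30 ]; then echo 30
  elif [ "${age}" -ge 18 ]; then echo 20
  else echo "${age}"   # below all buckets: the value passes through unchanged
  fi
}

bucket_age 34   # prints 30
bucket_age 75   # prints 60
```

An exact age such as 34 is no longer recoverable from the bucket value 30, which is what makes bucketing useful for quasi-identifiers.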

Creating a BigQuery dataset

  1. Create a dataset in BigQuery where the Cloud DLP pipeline can store the de-identified data (replace LOCATION with your preferred BigQuery location, for example, US):

    bq mk --location=LOCATION \
        --description="De-Identified PII Dataset" \
        deid_dataset
    

The steps in this tutorial assume that sensitive data is stored in Cloud Storage in a column-delimited format, typically CSV. The pipeline used in this tutorial automatically creates a BigQuery table based on the CSV header record found in your data files. Because the field transformations are matched by column name, removing this header record from your CSV files risks loading un-de-identified PII data into BigQuery.
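BigQuery column names can't contain spaces or apostrophes, so the pipeline derives names such as Card_Holders_Name from the header value Card Holder's Name, as the re-identification template later in this tutorial shows. The following is a hypothetical sketch of that kind of sanitization; the pipeline's exact rules may differ:

```shell
# Sketch: drop apostrophes, then replace any character that isn't
# alphanumeric or a comma with an underscore. Illustration only;
# the pipeline's actual sanitization logic may differ.
header="Card Holder's Name,Card Number,SSN"
printf '%s\n' "${header}" | sed "s/'//g" | sed 's/[^A-Za-z0-9,]/_/g' | tr ',' '\n'
```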

Creating a key encryption key (KEK)

In this section, you create a token encryption key (TEK) and protect (wrap) it with another key, the key encryption key, from Cloud Key Management Service (Cloud KMS):

  1. In Cloud Shell, create a TEK locally. For this tutorial, you generate a random 256-bit key and encode it in base64:

    export TEK=$(openssl rand -base64 32); echo ${TEK}
    

    The output is a random key generated in the following format:

    MpyFxEQKYKscEJVOiKMuEPdwqdffk4vTF+qwGwrp7Ps=
    

    The openssl rand -base64 32 command generates 32 random bytes (256 bits) of key material and encodes them as a 44-character base64 string, so the key is suitable for use as an AES-256 key. This key is used later in a script.

  2. Export the key, key ring, and KEK file as variables:

    export KEY_RING_NAME=my-kms-key-ring
    export KEY_NAME=my-kms-key
    export KEK_FILE_NAME=kek.json
    
  3. Grant the Cloud KMS admin and key encrypter roles to the Cloud Build service account:

    export PROJECT_NUMBER=$(gcloud projects list \
        --filter=${PROJECT_ID} --format="value(PROJECT_NUMBER)")
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com \
        --role roles/cloudkms.cryptoKeyEncrypter
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com \
        --role roles/cloudkms.admin
    
  4. Clone the following GitHub repository and go to the project root folder:

    git clone https://github.com/GoogleCloudPlatform/dlp-dataflow-deidentification.git
    cd dlp-dataflow-deidentification
    
  5. Create a KEK:

    gcloud builds submit . \
        --config dlp-demo-part-1-crypto-key.yaml \
        --substitutions \
        _GCS_BUCKET_NAME=gs://${DATA_STORAGE_BUCKET},_KEY_RING_NAME=${KEY_RING_NAME},_KEY_NAME=${KEY_NAME},_TEK=${TEK},_KEK=${KEK_FILE_NAME},_API_KEY=$(gcloud auth print-access-token)
    
  6. Validate that the KEK was successfully created:

    gsutil cat gs://${DATA_STORAGE_BUCKET}/${KEK_FILE_NAME}
    

    The output looks like the following:

    {
      "name": "kms-key-resource-path",
      "ciphertext": "kms-wrapped-key",
      "ciphertextCrc32c": "checksum"
    }
    

    Notes about the output:

    • kms-key-resource-path: The resource path of the KEK in the format: projects/${PROJECT_ID}/locations/global/keyRings/${KEY_RING_NAME}/cryptoKeys/${KEY_NAME}/cryptoKeyVersions/1.

    • kms-wrapped-key: Base64-encoded value of the Cloud KMS-wrapped TEK.

    • checksum: A CRC32C checksum of the ciphertext.
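As a sanity check on the TEK from step 1: a 44-character base64 string with one padding character decodes to exactly 32 bytes, which confirms 256 bits of key material. For example, using the sample value shown earlier (not a key you should reuse):

```shell
# Decode the sample 44-character base64 key and count the raw bytes.
# (This is the example value from step 1, not a key to reuse.)
SAMPLE_TEK="MpyFxEQKYKscEJVOiKMuEPdwqdffk4vTF+qwGwrp7Ps="
printf '%s' "${SAMPLE_TEK}" | base64 -d | wc -c   # prints 32
```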

Creating the Cloud DLP templates

At this point, you have investigated the sample dataset and determined which Cloud DLP transformations are required. For the columns requiring cryptographic transformations, you have also created a key encryption key (KEK). The next step is to execute a Cloud Build script to create Cloud DLP templates based on the required transformation and the KEK.

Creating a service account for Cloud DLP

  1. In Cloud Shell, create a service account:

    export SERVICE_ACCOUNT_NAME=my-service-account
    gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME} \
        --display-name "DLP Demo Service Account"
    
  2. Create a JSON key file called service-account-key.json for the service account:

    gcloud iam service-accounts keys create \
        --iam-account ${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        service-account-key.json
    
  3. Assign the project editor and storage admin roles to the service account:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/editor
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.admin
    
  4. Activate the service account:

    gcloud auth activate-service-account --key-file service-account-key.json
    

Creating the templates

  1. In Cloud Shell, execute the Cloud Build script to create the templates:

    gcloud builds submit . \
        --config dlp-demo-part-2-dlp-template.yaml \
        --substitutions \
        _KEK_CONFIG_FILE=gs://${DATA_STORAGE_BUCKET}/${KEK_FILE_NAME},_GCS_BUCKET_NAME=gs://${DATA_STORAGE_BUCKET},_API_KEY=$(gcloud auth print-access-token)
    
  2. Validate that the de-identification template was successfully created:

    gsutil cp gs://${DATA_STORAGE_BUCKET}/deid-template.json .
    cat deid-template.json
    

    The output looks like the following:

    {
      "name": "projects/<project_id>/deidentifyTemplates/<template_id>",
      "displayName": "Config to DeIdentify Sample Dataset",
      "description": "De-identifies Card Number, Card PIN, Card Holder's Name, SSN, Age, Job Title, Additional Details and Online UserId Fields",
      "createTime": "2019-12-01T19:21:07.306279Z",
      "updateTime": "2019-12-01T19:21:07.306279Z",
      "deidentifyConfig": {
        "recordTransformations": {
          "fieldTransformations": [
            {
              "fields": [
                {
                  "name": "Card PIN"
                }
              ],
              "primitiveTransformation": {
                "cryptoHashConfig": {
                  "cryptoKey": {
                    "kmsWrapped": {
                      "wrappedKey": "<var>kms-wrapped-key</var>",
                      "cryptoKeyName": "<var>kms-key-resource-name</var>"
                    }
                  }
                }
              }
            },
            {
              "fields": [
                {
                  "name": "SSN"
                }
              ],
              "primitiveTransformation": {
                "characterMaskConfig": {
                  "maskingCharacter": "*",
                  "numberToMask": 5,
                  "charactersToIgnore": [
                    {
                      "charactersToSkip": "-"
                    }
                  ]
                }
              }
            },
            {
              "fields": [
                {
                  "name": "Age"
                }
              ],
              "primitiveTransformation": {
                "bucketingConfig": {
                  "buckets": [
                    {
                      "min": {
                        "integerValue": "18"
                      },
                      "max": {
                        "integerValue": "30"
                      },
                      "replacementValue": {
                        "stringValue": "20"
                      }
                    },
                    {
                      "min": {
                        "integerValue": "30"
                      },
                      "max": {
                        "integerValue": "40"
                      },
                      "replacementValue": {
                        "stringValue": "30"
                      }
                    },
                    {
                      "min": {
                        "integerValue": "40"
                      },
                      "max": {
                        "integerValue": "50"
                      },
                      "replacementValue": {
                        "stringValue": "40"
                      }
                    },
                    {
                      "min": {
                        "integerValue": "50"
                      },
                      "max": {
                        "integerValue": "60"
                      },
                      "replacementValue": {
                        "stringValue": "50"
                      }
                    },
                    {
                      "min": {
                        "integerValue": "60"
                      },
                      "max": {
                        "integerValue": "99"
                      },
                      "replacementValue": {
                        "stringValue": "60"
                      }
                    }
                  ]
                }
              }
            },
            {
              "fields": [
                {
                  "name": "JobTitle"
                }
              ],
              "primitiveTransformation": {
                "bucketingConfig": {
                  "buckets": [
                    {
                      "min": {
                        "stringValue": "CIO"
                      },
                      "max": {
                        "stringValue": "CIOz"
                      },
                      "replacementValue": {
                        "stringValue": "Executive"
                      }
                    },
                    {
                      "min": {
                        "stringValue": "CEO"
                      },
                      "max": {
                        "stringValue": "CEOz"
                      },
                      "replacementValue": {
                        "stringValue": "Executive"
                      }
                    },
                    {
                      "min": {
                        "stringValue": "Vice President"
                      },
                      "max": {
                        "stringValue": "Vice Presidentz"
                      },
                      "replacementValue": {
                        "stringValue": "Executive"
                      }
                    },
                    {
                      "min": {
                        "stringValue": "Software Engineer"
                      },
                      "max": {
                        "stringValue": "Software Engineerz"
                      },
                      "replacementValue": {
                        "stringValue": "Engineer"
                      }
                    },
                    {
                      "min": {
                        "stringValue": "Product Manager"
                      },
                      "max": {
                        "stringValue": "Product Managerz"
                      },
                      "replacementValue": {
                        "stringValue": "Manager"
                      }
                    }
                  ]
                }
              }
            },
            {
              "fields": [
                {
                  "name": "Additional Details"
                }
              ],
              "infoTypeTransformations": {
                "transformations": [
                  {
                    "infoTypes": [
                      {
                        "name": "EMAIL_ADDRESS"
                      },
                      {
                        "name": "PHONE_NUMBER"
                      },
                      {
                        "name": "IBAN_CODE"
                      },
                      {
                        "name": "ONLINE_USER_ID"
                      }
                    ],
                    "primitiveTransformation": {
                      "replaceWithInfoTypeConfig": {}
                    }
                  }
                ]
              }
            },
            {
              "fields": [
                {
                  "name": "Card Holder's Name"
                },
                {
                  "name": "Card Number"
                }
              ],
              "primitiveTransformation": {
                "cryptoDeterministicConfig": {
                  "cryptoKey": {
                    "kmsWrapped": {
                      "wrappedKey": "<var>kms-wrapped-key</var>",
                      "cryptoKeyName": "<var>kms-key-resource-name</var>"
                    }
                  }
                }
              }
            }
          ]
        }
      }
    }
  3. Validate that the inspect template was successfully created:

    gsutil cp gs://${DATA_STORAGE_BUCKET}/inspect-template.json .
    cat inspect-template.json
    

    The output looks like the following:

    {
      "name": "projects/<project_id>/inspectTemplates/<template_id>",
      "displayName": "Config to Inspect Additional Details Column",
      "description": "Inspect template for built in info types EMAIL_ADDRESS, PHONE_NUMBER, IBAN_CODE and custom Info type ONLINE_USER_ID",
      "createTime": "2019-12-01T19:21:08.063415Z",
      "updateTime": "2019-12-01T19:21:08.063415Z",
      "inspectConfig": {
        "infoTypes": [
          {
            "name": "IBAN_CODE"
          },
          {
            "name": "EMAIL_ADDRESS"
          },
          {
            "name": "PHONE_NUMBER"
          }
        ],
        "minLikelihood": "LIKELY",
        "limits": {},
        "customInfoTypes": [
          {
            "infoType": {
              "name": "ONLINE_USER_ID"
            },
            "regex": {
              "pattern": "\\b:\\d{16}"
            }
          }
        ]
      }
    }
  4. Validate that the re-identification template was successfully created:

    gsutil cp gs://${DATA_STORAGE_BUCKET}/reid-template.json .
    cat reid-template.json
    

    The output looks like the following:

    {
      "name": "projects/<project_id>/deidentifyTemplates/<template_id>",
      "displayName": "Config to ReIdentify Sample Dataset",
      "description": "Used to re-identify Card Number and Card Holder's Name",
      "createTime": "2019-12-01T19:21:07.306279Z",
      "updateTime": "2019-12-01T19:21:07.306279Z",
      "deidentifyConfig": {
        "recordTransformations": {
          "fieldTransformations": [
            {
              "fields": [
                {
                  "name": "Card_Holders_Name"
                },
                {
                  "name": "Card_Number"
                }
              ],
              "primitiveTransformation": {
                "cryptoDeterministicConfig": {
                  "cryptoKey": {
                    "kmsWrapped": {
                      "wrappedKey": "<var>kms-wrapped-key</var>",
                      "cryptoKeyName": "<var>kms-key-resource-name</var>"
                    }
                  }
                }
              }
            }
          ]
        }
      }
    }

    The re-identification template is similar to the de-identification template, except it only contains transformations that are reversible, in this case the ones for the Card Holder's Name and Card Number fields.

  5. Export the Cloud DLP template names:

    export DEID_TEMPLATE_NAME=$(jq -r '.name' deid-template.json)
    export INSPECT_TEMPLATE_NAME=$(jq -r '.name' inspect-template.json)
    export REID_TEMPLATE_NAME=$(jq -r '.name' reid-template.json)
    
  6. Validate that the following variables are set:

    echo ${DATA_STORAGE_BUCKET}
    echo ${DATAFLOW_TEMP_BUCKET}
    echo ${DEID_TEMPLATE_NAME}
    echo ${INSPECT_TEMPLATE_NAME}
    echo ${REID_TEMPLATE_NAME}
    

You have successfully completed this tutorial. In the next tutorial, you trigger an automated Dataflow pipeline to inspect and de-identify the sample dataset and store this dataset in BigQuery.

Cleaning up

If you don't intend to continue with the tutorials in the series, the easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.

Delete the project

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next