De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection

Last reviewed 2024-06-07 UTC

This document discusses how to use Sensitive Data Protection to create an automated data transformation pipeline to de-identify sensitive data like personally identifiable information (PII). De-identification techniques like tokenization (pseudonymization) let you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by obfuscating the raw sensitive identifiers. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified replicas. Sensitive Data Protection enables transformations such as redaction, masking, tokenization, bucketing, and other methods of de-identification. When a dataset hasn't been characterized, Sensitive Data Protection can also inspect the data for sensitive information by using more than 100 built-in classifiers.

This document is intended for a technical audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy, though you don't need to be an expert.

Reference architecture

The following diagram shows a reference architecture for using Google Cloud products to add a layer of security to sensitive datasets by using de-identification techniques.

Architecture of de-identification pipeline, configuration management, and re-identification pipeline.

The architecture consists of the following:

  • Data de-identification streaming pipeline: De-identifies sensitive data in text using Dataflow. You can reuse the pipeline for multiple transformations and use cases.

  • Configuration (Sensitive Data Protection template and key) management: A managed de-identification configuration that is accessible by only a small group of people—for example, security admins—to avoid exposing de-identification methods and encryption keys.

  • Data validation and re-identification pipeline: Validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at a large scale.

Helping to secure sensitive data

One of the key tasks of any enterprise is to help ensure the security of its users' and employees' data. Google Cloud provides built-in security measures to facilitate data security, including encryption of stored data and encryption of data in transit.

Encryption at rest: Cloud Storage

Maintaining data security is critical for most organizations. Unauthorized access to even moderately sensitive data can damage the trust, relationships, and reputation that you have with your customers. Google encrypts data stored at rest by default. By default, any object uploaded to a Cloud Storage bucket is encrypted using a Google-owned and Google-managed encryption key. If your dataset requires a non-default option before uploading, for example because it already uses a pre-existing encryption method, Cloud Storage provides other encryption options. For more information, see Data encryption options.

Encryption in transit: Dataflow

When your data is in transit, encryption at rest no longer protects it. Instead, in-transit data is protected by secure network protocols, referred to as encryption in transit. By default, Dataflow uses Google-owned and Google-managed encryption keys. The tutorials associated with this document use an automated pipeline that uses the default Google-owned and Google-managed encryption keys.

Sensitive Data Protection data transformations

There are two main types of transformations performed by Sensitive Data Protection: the recordTransformations and infoTypeTransformations methods.

Both the recordTransformations and infoTypeTransformations methods can de-identify and encrypt sensitive information in your data. For example, you can transform the values in a column of US_SOCIAL_SECURITY_NUMBER data so that they're unidentifiable, or use tokenization to obscure them while keeping referential integrity.

The infoTypeTransformations method enables you to inspect for sensitive data and transform the findings. For example, if you have unstructured or free-text data, the infoTypeTransformations method can help you identify an SSN inside a sentence and encrypt the SSN value while leaving the rest of the text intact. You can also define custom infoType detectors.
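As an illustration, the following sketch shows what such an infoTypeTransformations configuration might look like with the Python client library (google-cloud-dlp). The project ID and sample text are placeholders, and replacement with the infoType name is just one of several transformations you could choose.

```python
# Sketch: inspect free text for SSNs and replace each finding with the
# infoType name, leaving the rest of the text intact.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

response = dlp.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",  # placeholder project
        "inspect_config": {
            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        },
                    }
                ]
            }
        },
        "item": {"value": "Please update the SSN 372-58-9964 on file."},
    }
)

# The SSN is replaced with [US_SOCIAL_SECURITY_NUMBER]; other text is unchanged.
print(response.item.value)
```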

The recordTransformations method enables you to apply a transformation configuration per field when using structured or tabular data. With the recordTransformations method, you can apply the same transformation to every value in a field, such as hashing or tokenizing every value in a column that has SSN as the field or header name.

With the recordTransformations method, you can also mix in the infoTypeTransformations method so that it applies only to the values in the specified fields. For example, you can use an infoTypeTransformations method inside of a recordTransformations method for the field named comments to redact any US_SOCIAL_SECURITY_NUMBER findings in the text of that field.
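A configuration for that example might look like the following sketch, expressed in the Python client's dictionary form (the field name comments and the choice of redaction follow the example above):

```python
# Sketch: a recordTransformations config that applies an
# infoTypeTransformations method only to the "comments" field,
# redacting any US_SOCIAL_SECURITY_NUMBER findings in its text.
deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                "fields": [{"name": "comments"}],
                "info_type_transformations": {
                    "transformations": [
                        {
                            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                            "primitive_transformation": {"redact_config": {}},
                        }
                    ]
                },
            }
        ]
    }
}
```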

In increasing order of complexity, the de-identification processes are as follows:

  • Redaction: Remove the sensitive content with no replacement of content.
  • Masking: Replace the sensitive content with fixed characters.
  • Encryption: Replace sensitive content with encrypted strings, possibly reversibly.

Working with delimited data

Often, data consists of records delimited by a selected character, with fixed types in each column, like a CSV file. For this class of data, you can apply de-identification transformations (recordTransformations) directly, without inspecting the data. For example, you can expect a column labeled SSN to contain only SSN data. You don't need to inspect the data to know that the infoType detector is US_SOCIAL_SECURITY_NUMBER. However, a free-form column labeled Additional Details can contain sensitive information, but the infoType class is unknown beforehand. For a free-form column, you need to inspect the data for infoTypes (infoTypeTransformations) before applying de-identification transformations. Sensitive Data Protection allows both of these transformation types to co-exist in a single de-identification template, as shown in the sketch that follows. Sensitive Data Protection includes more than 100 built-in infoType detectors. You can also create custom infoTypes or modify built-in infoType detectors to find sensitive data that is unique to your organization.
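For instance, a single configuration might combine a direct transformation on the SSN column with inspection of the free-form column. The following is a minimal sketch; the column names come from this example, and masking and replacement are chosen for illustration:

```python
# Sketch: mix both transformation types in one de-identification config.
deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                # Known column type: transform directly without inspection.
                "fields": [{"name": "SSN"}],
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "*"}
                },
            },
            {
                # Free-form column: inspect first, then replace each finding
                # with the name of its detected infoType.
                "fields": [{"name": "Additional Details"}],
                "info_type_transformations": {
                    "transformations": [
                        {
                            "primitive_transformation": {
                                "replace_with_info_type_config": {}
                            }
                        }
                    ]
                },
            },
        ]
    }
}
```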

Determining transformation type

Determining when to use the recordTransformations or infoTypeTransformations method depends on your use case. Because using the infoTypeTransformations method requires more resources and is therefore more costly, we recommend using this method only for situations where the data type is unknown. You can evaluate the costs of running Sensitive Data Protection using the Google Cloud pricing calculator.

For examples of transformation, this document refers to a dataset that contains CSV files with fixed columns, as demonstrated in the following table.

| Column name | Inspection infoType (custom or built-in) | Sensitive Data Protection transformation type |
| --- | --- | --- |
| Card Number | Not applicable | Deterministic encryption (DE) |
| Card Holder's Name | Not applicable | Deterministic encryption (DE) |
| Card PIN | Not applicable | Crypto hashing |
| SSN (Social Security Number) | Not applicable | Masking |
| Age | Not applicable | Bucketing |
| Job Title | Not applicable | Bucketing |
| Additional Details | Built-in: IBAN_CODE, EMAIL_ADDRESS, PHONE_NUMBER. Custom: ONLINE_USER_ID | Replacement |

This table lists the column names and describes which type of transformation is needed for each column. For example, the Card Number column contains credit card numbers that need to be encrypted; however, they don't need to be inspected, because the data type (infoType) is known.

The only column where an inspection transformation is recommended is the Additional Details column. This column is free-form and might contain PII, which, for the purposes of this example, should be detected and de-identified.

The examples in this table present five different de-identification transformations (a configuration sketch follows this list):

  • Two-way tokenization: Replaces the original data with a token that is deterministic, preserving referential integrity. You can use the token to join data or use the token in aggregate analysis. You can reverse or de-tokenize the data using the same key that you used to create the token. There are two methods for two-way tokenization:

    • Deterministic encryption (DE): Replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
    • Format-preserving encryption with FFX (FPE-FFX): Replaces the original data with a token generated by using format-preserving encryption in FFX mode. By design, FPE-FFX preserves the length and character set of the input text. To do so, it omits authentication and an initialization vector, which would expand the length of the output token. Other methods, like DE, provide stronger security and are recommended for tokenization use cases unless length and character-set preservation are strict requirements, such as backward compatibility with legacy data systems.
  • One-way tokenization, using cryptographic hashing: Replaces the original value with a hashed value, preserving referential integrity. However, unlike two-way tokenization, a one-way method isn't reversible. The hash value is generated by using an SHA-256-based message authentication code (HMAC-SHA-256) on the input value.

  • Masking: Replaces the original data with a specified character, either partially or completely.

  • Bucketing: Replaces a more identifiable value with a less distinguishing value.

  • Replacement: Replaces original data with a token or the name of the infoType if detected.
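The following sketch shows plausible primitive-transformation configurations for these methods in the Python client's dictionary form. The KMS key name, wrapped key bytes, surrogate infoType name, and bucket bounds are all illustrative.

```python
# Illustrative primitive transformations for the methods listed above.
# A KMS-wrapped token encryption key backs the cryptographic methods.
kms_wrapped_key = {
    "kms_wrapped": {
        "wrapped_key": b"...",  # token key wrapped by Cloud KMS (elided)
        "crypto_key_name": (
            "projects/my-project/locations/global/"
            "keyRings/my-keyring/cryptoKeys/my-key"  # placeholder key
        ),
    }
}

# Two-way tokenization, method 1: deterministic encryption (DE).
deterministic = {
    "crypto_deterministic_config": {
        "crypto_key": kms_wrapped_key,
        "surrogate_info_type": {"name": "CARD_NUMBER_TOKEN"},  # illustrative
    }
}

# Two-way tokenization, method 2: format-preserving encryption (FPE-FFX).
format_preserving = {
    "crypto_replace_ffx_fpe_config": {
        "crypto_key": kms_wrapped_key,
        "common_alphabet": "NUMERIC",  # preserve a numeric character set
    }
}

# One-way tokenization: cryptographic hashing (HMAC-SHA-256).
one_way_hash = {"crypto_hash_config": {"crypto_key": kms_wrapped_key}}

# Masking: replace characters with a fixed character.
masking = {"character_mask_config": {"masking_character": "*"}}

# Bucketing: for example, an age of 27 falls into the 20-30 bucket.
bucketing = {
    "fixed_size_bucketing_config": {
        "lower_bound": {"integer_value": 0},
        "upper_bound": {"integer_value": 100},
        "bucket_size": 10,
    }
}

# Replacement: substitute the detected infoType name for the finding.
replacement = {"replace_with_info_type_config": {}}
```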

Method selection

The best de-identification method varies based on your use case. For example, if a legacy app processes the de-identified records, then format preservation might be important. If you're dealing with strictly formatted 10-digit numbers, FPE-FFX preserves the length (10 digits) and character set (numeric) of an input for legacy system support.

However, if strict formatting isn't required for legacy compatibility, as is the case for values in the Card Holder's Name column, then DE is the preferred choice because it has a stronger authentication method. Both FPE and DE enable the tokens to be reversed or de-tokenized. If you don't need de-tokenization, then cryptographic hashing provides integrity but the tokens can't be reversed.

Other methods—like masking, bucketing, date-shifting, and replacement—are good for values that don't need to retain full integrity. For example, bucketing an age value (for example, 27) to an age range (20-30) can still be analyzed while reducing the uniqueness that might lead to the identification of an individual.

Token encryption keys

For cryptographic de-identification transformations, a cryptographic key, also known as a token encryption key, is required. The token encryption key that is used for de-identification encryption is also used to re-identify the original value. The secure creation and management of token encryption keys are beyond the scope of this document. However, there are some important principles to consider that are used later in the associated tutorials:

  • Avoid using plaintext keys in the template. Instead, use Cloud KMS to create a wrapped key, as shown in the sketch after this list.
  • Use separate token encryption keys for each data element to reduce the risk of compromising keys.
  • Rotate token encryption keys. Although you can rotate the wrapped key, rotating the token encryption key breaks the integrity of the tokenization. When the key is rotated, you need to re-tokenize the entire dataset.
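As one way to follow the first principle, you might generate a raw token encryption key and wrap it with Cloud KMS using the google-cloud-kms client library. This is a minimal sketch; the project, key ring, and key names are placeholders:

```python
# Sketch: create a token encryption key and store only its KMS-wrapped form.
import os
from google.cloud import kms

client = kms.KeyManagementServiceClient()
kms_key_name = client.crypto_key_path(
    "my-project", "global", "my-keyring", "my-key"  # placeholder names
)

token_key = os.urandom(32)  # raw 256-bit token encryption key
wrap_response = client.encrypt(
    request={"name": kms_key_name, "plaintext": token_key}
)

# Only the wrapped key goes into the de-identification template; the raw
# token_key should be discarded rather than stored.
crypto_key = {
    "kms_wrapped": {
        "wrapped_key": wrap_response.ciphertext,
        "crypto_key_name": kms_key_name,
    }
}
```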

Sensitive Data Protection templates

For large-scale deployments, use Sensitive Data Protection templates to accomplish the following (a usage sketch follows the list):

  • Enable security control with Identity and Access Management (IAM).
  • Decouple configuration information, and how you de-identify that information, from the implementation of your requests.
  • Reuse a set of transformations. You can use the de-identify and re-identify templates over multiple datasets.
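The following is a minimal sketch of creating a template and then referencing it by name, so that pipelines never embed the configuration or keys in their requests; the project and display name are placeholders:

```python
# Sketch: store the de-identification config in a template, then have
# callers reference the template by name.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder

# A trivial config for illustration; a real template would hold the
# transformations described earlier in this document.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

template = dlp.create_deidentify_template(
    request={
        "parent": parent,
        "deidentify_template": {
            "display_name": "pii-deid-template",  # illustrative name
            "deidentify_config": deidentify_config,
        },
    }
)

# Callers reference the template by name instead of embedding the config.
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_template_name": template.name,
        "item": {"value": "text to de-identify"},
    }
)
```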

BigQuery

The final component of the reference architecture is viewing and working with the de-identified data in BigQuery. BigQuery is Google's data warehouse tool that includes serverless infrastructure, BigQuery ML, and the ability to run Sensitive Data Protection as a native tool. In the example reference architecture, BigQuery serves as a data warehouse for the de-identified data and as a backend to an automated re-identification data pipeline that can share data through Pub/Sub.
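For the re-identification side, the following is a minimal sketch of reversing a deterministic-encryption token with reidentify_content. The surrogate infoType name and wrapped key must match the ones used during de-identification; both are placeholders here, and the tokenized input is elided.

```python
# Sketch: reverse two-way tokenization. The surrogate infoType tells the
# API which substrings are tokens to decrypt.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
surrogate = {"name": "CARD_NUMBER_TOKEN"}  # must match de-identification

crypto_key = {
    "kms_wrapped": {
        "wrapped_key": b"...",  # same wrapped key used to de-identify (elided)
        "crypto_key_name": (
            "projects/my-project/locations/global/"
            "keyRings/my-keyring/cryptoKeys/my-key"  # placeholder key
        ),
    }
}

response = dlp.reidentify_content(
    request={
        "parent": "projects/my-project/locations/global",  # placeholder
        "reidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "info_types": [surrogate],
                        "primitive_transformation": {
                            "crypto_deterministic_config": {
                                "crypto_key": crypto_key,
                                "surrogate_info_type": surrogate,
                            }
                        },
                    }
                ]
            }
        },
        # Detect the surrogate tokens in the input text.
        "inspect_config": {
            "custom_info_types": [
                {"info_type": surrogate, "surrogate_type": {}}
            ]
        },
        "item": {"value": "Card: CARD_NUMBER_TOKEN(44):..."},  # token elided
    }
)
print(response.item.value)
```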

What's next