This document discusses how to use Cloud Data Loss Prevention (Cloud DLP) to create an automated data transformation pipeline to de-identify sensitive data like personally identifiable information (PII). De-identification techniques like tokenization (pseudonymization) let you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by obfuscating the raw sensitive identifiers. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified replicas. Cloud DLP enables transformations such as redaction, masking, tokenization, bucketing, and other methods of de-identification. When a dataset hasn't been characterized, Cloud DLP can also inspect the data for sensitive information by using more than 100 built-in classifiers.
This document is intended for a technical audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy, without the need to be an expert.
This document is part of a series:
- De-identification and re-identification of PII in large-scale datasets using Cloud DLP (this document)
- Creating DLP de-identification transformation templates for PII datasets
- Running an automated Dataflow pipeline to de-identify a PII dataset
- Validating de-identified data in BigQuery and re-identifying PII data
The following diagram shows a reference architecture for using Google Cloud products to add a layer of security to sensitive datasets by using de-identification techniques.
The architecture consists of the following:
- Data de-identification streaming pipeline: De-identifies sensitive data in text using Dataflow. You can reuse the pipeline for multiple transformations and use cases.
- Configuration (DLP template and key) management: A managed de-identification configuration that is accessible by only a small group of people (for example, security admins) to avoid exposing de-identification methods and encryption keys.
- Data validation and re-identification pipeline: Validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at a large scale.
Helping to secure sensitive data
One of the key tasks of any enterprise is to help ensure the security of their users' and employees' data. Google Cloud provides built-in security measures to facilitate data security, including encryption of stored data and encryption of data in transit.
Encryption at rest: Cloud Storage
Maintaining data security is critical for most organizations. Unauthorized access to even moderately sensitive data can damage the trust, relationships, and reputation that you have with your customers. Google encrypts data at rest by default: any object uploaded to a Cloud Storage bucket is encrypted using a Google-managed encryption key. If your dataset already uses its own encryption method and requires a non-default option before uploading, Cloud Storage provides other encryption options. For more information, see Data encryption options.
Encryption in transit: Dataflow
When your data is in transit, the at-rest encryption isn't in place. In-transit data is protected by secure network protocols referred to as encryption in transit. By default, Dataflow uses Google-managed encryption keys. The tutorials associated with this document use an automated pipeline that uses the default Google-managed encryption keys.
Cloud DLP data transformations
There are two main types of transformations performed by Cloud DLP: recordTransformations and infoTypeTransformations.
The recordTransformations and infoTypeTransformations methods can de-identify and encrypt sensitive information in your data. For example, you can transform the values in the US_SOCIAL_SECURITY_NUMBER column to be unidentifiable or use tokenization to obscure it while keeping referential integrity.
The infoTypeTransformations method enables you to inspect for sensitive data and transform the finding. For example, if you have unstructured or free-text data, the infoTypeTransformations method can help you identify an SSN inside of a sentence and encrypt the SSN value while leaving the rest of the text intact. You can also define custom infoType detectors.
The recordTransformations method enables you to apply a transformation configuration per field when using structured or tabular data. With the recordTransformations method, you can apply the same transformation across every value in that field, such as hashing or tokenizing every value in a column with the SSN column as the field.
With the recordTransformations method, you can also mix in infoTypeTransformations methods that apply only to the values in the specified fields. For example, you can use an infoTypeTransformations method inside of a recordTransformations method for the field named comments to redact any findings for US_SOCIAL_SECURITY_NUMBER that are found inside the text in the field.
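The nesting described above can be sketched as a plain Python dictionary. This is a minimal illustration, not a complete template: the field names SSN and comments come from the examples in the text, the snake_case keys are assumed to mirror the DeidentifyConfig message accepted by the google-cloud-dlp Python client, and masking stands in for hashing or tokenizing so that no key is needed.

```python
def build_deidentify_config() -> dict:
    """Return a config that mixes field-level and infoType transformations."""
    # Field-level transformation: mask every value in the SSN column
    # without inspecting it, because the column's data type is known.
    ssn_field = {
        "fields": [{"name": "SSN"}],
        "primitive_transformation": {
            "character_mask_config": {"masking_character": "#"}
        },
    }
    # infoType transformation scoped to a single field: inspect the
    # free-text comments column and redact only SSN findings.
    comments_field = {
        "fields": [{"name": "comments"}],
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                    "primitive_transformation": {"redact_config": {}},
                }
            ]
        },
    }
    return {
        "record_transformations": {
            "field_transformations": [ssn_field, comments_field]
        }
    }


config = build_deidentify_config()
```

Keeping both entries under one record_transformations block means a single configuration can cover structured columns and free-text columns in the same dataset. Verify the key names against the API reference before relying on them.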
In increasing order of complexity, the de-identification processes are as follows:
- Redaction: Remove the sensitive content with no replacement of content.
- Masking: Replace the sensitive content with fixed characters.
- Encryption: Replace sensitive content with encrypted strings, possibly reversibly.
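To make the first two processes concrete, the following toy sketch applies redaction and masking to an SSN found in free text. It uses a simple regular expression rather than Cloud DLP, so the pattern, function names, and sample sentence are illustrative assumptions. Encryption is left out because it requires key management.

```python
import re

# Simplified SSN pattern for illustration only; real detectors are stricter.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Remove each match with no replacement content."""
    return SSN_PATTERN.sub("", text)


def mask(text: str, masking_character: str = "#") -> str:
    """Replace every digit of each match with a fixed character."""
    def _mask(match: re.Match) -> str:
        return "".join(
            masking_character if ch.isdigit() else ch for ch in match.group()
        )
    return SSN_PATTERN.sub(_mask, text)


sample = "My SSN is 123-45-6789."
redacted = redact(sample)  # "My SSN is ."
masked = mask(sample)      # "My SSN is ###-##-####."
```

Masking keeps the shape of the value visible, which can help downstream validation, while redaction leaves no trace of the value's format.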
Working with delimited data
Often, data consists of records delimited by a selected character, with fixed types in each column, like a CSV file. For this class of data, you can apply de-identification transformations (recordTransformations) directly, without inspecting the data. For example, you can expect a column labeled SSN to contain only SSN data. You don't need to inspect the data to know that the infoType detector is US_SOCIAL_SECURITY_NUMBER. However, a free-form column such as Additional Details can contain sensitive information, but the infoType class is unknown beforehand. For a free-form column, you need to use an infoTypes detector (infoTypeTransformations) before applying de-identification transformations. Cloud DLP allows both of these transformation types to co-exist in a single de-identification template.
Cloud DLP includes more than 100 built-in infoTypes detectors. You can also create custom types or modify built-in infoTypes detectors to find sensitive data that is unique to your organization.
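The split between columns that can be transformed directly and columns that need inspection can be sketched as follows. The mapping of column names to infoTypes and the helper name are hypothetical, and the sketch assumes that every column without a known infoType is free-form.

```python
import csv
import io

# Hypothetical mapping of column headers to their known infoType.
KNOWN_COLUMN_INFOTYPES = {"SSN": "US_SOCIAL_SECURITY_NUMBER"}


def plan_columns(csv_text: str, known: dict = KNOWN_COLUMN_INFOTYPES):
    """Split CSV headers into directly transformable and inspect-first columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    direct = [name for name in header if name in known]
    needs_inspection = [name for name in header if name not in known]
    return direct, needs_inspection


sample_csv = "SSN,Additional Details\n123-45-6789,Call me at home\n"
direct, needs_inspection = plan_columns(sample_csv)
```

In this sample, SSN lands in the direct list and Additional Details lands in the inspect-first list.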
Determining transformation type
Determining when to use the recordTransformations or the infoTypeTransformations method depends on your use case. Because using the infoTypeTransformations method requires more resources and is therefore more costly, we recommend using this method only for situations where the data type is unknown. You can evaluate the costs of running Cloud DLP using the Google Cloud pricing calculator.
To provide examples of transformation, this document has several associated tutorials that use a dataset. The dataset contains CSV files with fixed columns, as demonstrated in the following table.
|Column name|Inspection infoType|DLP transformation type|
||Not applicable|Deterministic encryption (DE)|
||Not applicable|Deterministic encryption (DE)|
||Not applicable|Crypto hashing|
This table lists the column names and describes which type of transformation is needed for each column. For example, the Card Number column contains credit card numbers that need to be encrypted; however, they don't need to be inspected, because the data type (infoType) is known.
The only column where an inspection transformation is recommended is the
Additional Details column. This column is free-form and might contain PII,
which, for the purposes of this series, should be detected and de-identified.
The examples in this table present five different de-identification transformations:
Two-way tokenization: Replaces the original data with a token that is deterministic, preserving referential integrity. You can use the token to join data or use the token in aggregate analysis. You can reverse or de-tokenize the data using the same key that you used to create the token. There are two methods for two-way tokenizations:
- Deterministic encryption (DE): Replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
- Format-preserving encryption with FFX (FPE-FFX): Replaces the original data with a token generated by using format-preserving encryption in FFX mode. By design, FPE-FFX preserves the length and character set of the input text. To do so, it omits authentication and an initialization vector, both of which would lengthen the output token. Other methods, like DE, provide stronger security guarantees and are recommended for tokenization use cases unless length and character-set preservation are strict requirements, such as backward compatibility with legacy data systems.
One-way tokenization, using cryptographic hashing: Replaces the original value with a hashed value, preserving referential integrity. However, unlike two-way tokenization, a one-way method isn't reversible. The hash value is generated by using an SHA-256-based message authentication code (HMAC-SHA-256) on the input value.
Masking: Replaces the original data with a specified character, either partially or completely.
Bucketing: Replaces a more identifiable value with a less distinguishing value.
Replacement: Replaces original data with a token or the name of the detected infoType.
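The one-way tokenization described above relies on HMAC-SHA-256. The following sketch uses Python's standard library to show why such tokens preserve referential integrity: the same input and key always produce the same token. The key, the function name, and the base64 output encoding are assumptions for illustration, and the output is not claimed to match what Cloud DLP produces.

```python
import base64
import hashlib
import hmac


def one_way_token(value: str, key: bytes) -> str:
    """Return a deterministic, non-reversible token for value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Encode the 32-byte digest as text so it can be stored in a column.
    return base64.b64encode(digest).decode("ascii")


# Hypothetical key for demonstration; never hard-code real keys.
demo_key = b"demo-key-for-illustration-only"

token_a = one_way_token("123-45-6789", demo_key)
token_b = one_way_token("123-45-6789", demo_key)
token_c = one_way_token("987-65-4321", demo_key)
```

Because token_a and token_b are equal, two tables tokenized with the same key can still be joined on the tokenized column, even though the original value can't be recovered from the token.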
Choosing the best de-identification method can vary based on your use case. For example, if a legacy app is processing the de-identified records, then format preservation might be important. If you're dealing with strictly formatted 10-digit numbers, FPE preserves the length (10 digits) and character set (numeric) of an input for legacy system support.
However, if strict formatting isn't required for legacy compatibility, as is the case for values in the Card Holder's Name column, then DE is the preferred choice because it has a stronger authentication method. Both FPE and DE enable the tokens to be reversed or de-tokenized. If you don't need de-tokenization, then cryptographic hashing provides integrity but the tokens can't be reversed.
Other methods, like masking, bucketing, date-shifting, and replacement, are good for values that don't need to retain full integrity. For example, bucketing an age value (for example, 27) into an age range (20-30) yields data that can still be analyzed while reducing the uniqueness that might lead to the identification of an individual.
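The age example can be written as a small function. This is a sketch of fixed-width bucketing in plain Python, not Cloud DLP's bucketing configuration. The function name and the default width of 10 are assumptions.

```python
def bucket_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range label such as "20-30"."""
    if age < 0 or width <= 0:
        raise ValueError("age must be non-negative and width positive")
    # The lower bound is inclusive and the upper bound is exclusive,
    # so ages 20 through 29 share the label "20-30".
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


bucketed = bucket_age(27)  # "20-30"
```

Wider buckets reduce uniqueness further but also reduce analytic precision, so the width is a trade-off to choose per use case.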
Token encryption keys
For cryptographic de-identification transformations, a cryptographic key, also known as a token encryption key, is required. The token encryption key that is used for de-identification encryption is also used to re-identify the original value. The secure creation and management of token encryption keys are beyond the scope of this document. However, there are some important principles to consider that are used later in the associated tutorials:
- Avoid using plaintext keys in the template. Instead, use Cloud KMS to create a wrapped key.
- Use separate token encryption keys for each data element to reduce the risk of compromising keys.
- Rotate token encryption keys. Although you can rotate the wrapped key, rotating the token encryption key breaks the integrity of the tokenization. When the key is rotated, you need to re-tokenize the entire dataset.
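The first two principles above can be sketched as a helper that builds a deterministic encryption configuration around a wrapped key, called once per data element. The key names follow the snake_case fields that the google-cloud-dlp Python client is assumed to accept (crypto_deterministic_config, kms_wrapped, surrogate_info_type). The Cloud KMS resource name, wrapped key bytes, and surrogate names are placeholders, not real values.

```python
import base64


def deterministic_config(
    wrapped_key_b64: str, kms_key_name: str, surrogate: str
) -> dict:
    """Build a DE transformation that references a KMS-wrapped key."""
    return {
        "crypto_deterministic_config": {
            "crypto_key": {
                "kms_wrapped": {
                    # The wrapped (encrypted) key, never the plaintext key.
                    "wrapped_key": base64.b64decode(wrapped_key_b64),
                    "crypto_key_name": kms_key_name,
                }
            },
            # Label attached to tokens so they can be found for re-identification.
            "surrogate_info_type": {"name": surrogate},
        }
    }


# Placeholder values for illustration only.
kms_key = (
    "projects/example-project/locations/global/"
    "keyRings/example-ring/cryptoKeys/example-key"
)
wrapped_1 = base64.b64encode(b"placeholder-wrapped-key-1").decode("ascii")
wrapped_2 = base64.b64encode(b"placeholder-wrapped-key-2").decode("ascii")

# One token encryption key per data element.
card_number_cfg = deterministic_config(wrapped_1, kms_key, "CARD_NUMBER_TOKEN")
card_holder_cfg = deterministic_config(wrapped_2, kms_key, "CARD_HOLDER_TOKEN")
```

Using a distinct wrapped key for each element limits the exposure if any single token encryption key is compromised.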
Cloud DLP templates
For large-scale deployments, use Cloud DLP templates to accomplish the following:
- Enable security control with Identity and Access Management (IAM).
- Decouple configuration information, and how you de-identify that information, from the implementation of your requests.
- Reuse a set of transformations. You can use the de-identify and re-identify templates over multiple datasets.
The associated tutorials take you through the process to create and manage Cloud DLP templates.
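As a rough illustration of how a template decouples configuration from requests, the following sketch builds a de-identification request that refers to a template by name instead of embedding the transformation details. The project ID, template ID, and helper name are hypothetical, and the request fields are assumed to match the deidentify_content request of the google-cloud-dlp Python client.

```python
def build_deidentify_request(project_id: str, template_id: str, text: str) -> dict:
    """Build a request that references a stored de-identify template."""
    parent = f"projects/{project_id}"
    return {
        "parent": parent,
        # Only the template name appears here; the transformation methods
        # and keys stay in the template, guarded by IAM.
        "deidentify_template_name": f"{parent}/deidentifyTemplates/{template_id}",
        "item": {"value": text},
    }


request = build_deidentify_request(
    "example-project", "example-template", "My SSN is 123-45-6789."
)

# With the google-cloud-dlp package installed and credentials configured,
# the request could be sent along these lines (not executed here):
#   from google.cloud import dlp_v2
#   response = dlp_v2.DlpServiceClient().deidentify_content(request=request)
```

Because callers pass only the template name, the same pipeline code can be reused across datasets by swapping the template. A template that uses infoTypeTransformations may also need an inspection configuration or inspection template in the request.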
The final component of the reference architecture is viewing and working with the de-identified data in BigQuery. BigQuery is Google's data warehouse tool that includes serverless infrastructure, BigQuery ML, and the ability to run Cloud DLP as a native tool. In the following tutorials, BigQuery serves as a data warehouse for the de-identified data and as a backend to an automated re-identification data pipeline that can share data through Pub/Sub.
To learn more about advanced applications of BigQuery ML and sensitive data, see Considerations for sensitive data within machine learning datasets.
- Creating Cloud DLP de-identification transformation templates for PII datasets.
- Running an automated Dataflow pipeline to de-identify a PII dataset.
- Validating de-identified data in BigQuery and re-identifying PII data.
- Review the sample code in the Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP repo on GitHub.
- Learn about other pattern recognition solutions.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.