Take charge of your data: How tokenization makes data usable without sacrificing privacy
Achin (Ace) Kulshrestha
Software Engineer, Google
Scott Ellis
Senior Product Manager
Privacy regulations place strict controls on how to examine and share sensitive data. At the same time, you can’t let your business come to a standstill. De-identification techniques can help you strike a balance between utility and privacy for your data. In previous “Take charge of your data” posts, we showed you how to gain visibility into your data using Cloud Data Loss Prevention (DLP) and how to protect sensitive data by incorporating data de-identification, obfuscation, and minimization techniques. In this post, we’ll dive a bit deeper into one of these de-identification techniques: tokenization.
Tokenization substitutes sensitive data with surrogate values called tokens, which can then be used to represent the original (or raw) sensitive value. It is sometimes referred to as pseudonymization or surrogate replacement. The concept of tokenization is widely used in industries like finance and healthcare to help reduce the risk of data in use, compliance scope, and minimize sensitive data being exposed to systems that do not need it. It’s important to understand how tokenization can help protect sensitive data while allowing your business operations and analytical workflows to use the information they need. With Cloud DLP, customers can perform tokenization at scale with minimal setup and overhead.
Understanding the problem
First, let’s look at the following scenario: Casey works as a data scientist at a large financial company that services businesses and end users. Casey’s primary job is to analyze data and improve the user experience for people using the company’s vast portfolio of financial applications. In the normal course of doing business, the company collects sensitive and regulated data including personally identifiable information (PII) like Social Security numbers.
In order to demonstrate the benefits of tokenization, let’s consider a task that Casey might do as part of her job: The company wants to determine what products they can build to help users improve their credit scores depending on their age range. In order to answer this question, Casey needs to join user information in the company’s banking app with customers’ credit score data received from a third party.
Casey requests access to both the users table and the table filled with the third party’s credit score data.
Let’s take a look at these tables.
The first few rows of each appear as follows:
At this point, it’s worth enumerating some of the risks involved with Casey getting access to this data.
Email address and exact age is not required for the task at hand but are being disclosed because they are part of the user’s table.
Social Security numbers are required for joining the tables, but they’re being disclosed in their raw form.
While using this raw data will allow Casey to complete the job at hand, it exposes sensitive data which could be propagated into new systems. Now let’s try to tackle this privacy problem using de-identification and tokenization with Cloud DLP.
Fixing the problem with tokenization
As mentioned above, tokenization substitutes sensitive data with surrogate values called tokens. These tokens can then be used to represent surrogate values in multiple ways. For example, they can retain the format of the original data while revealing only a few characters. This can be useful in cases where you need to retain a record identifier or join data, but don’t want to reveal the sensitive underlying elements. This is sometimes referred to as preserving referential integrity, and can be used to strike a balance between utility and reducing risk when using the data.
Continuing with our example, Casey can use this tokenized data to join the two data sources and perform analysis.
Step 1: Joining tables on a token instead of raw SSN
Cloud DLP supports multiple cryptographic token formats that keep the referential integrity needed to join tables:
Deterministic encryption: Replaces an input value with a cryptographic token. This encryption method is reversible, which helps to maintain referential integrity across your database and has no character-set limitations.
Format Preserving Encryption (FPE): Creates a token of the same length and character set as the input. Similarly to deterministic encryption, it’s reversible. FPE is great for inputs with a well defined alphabet space—for example, an alphabet including only [0-9a-zA-Z].
Secure, key-based hashes: Creates a token based on a one-way hash generated using an encryption key. This encryption method is inherently irreversible. This may not be appropriate for use cases that need to reverse or de-tokenize data in another workflow.
For Casey’s task, let’s use deterministic encryption in Cloud DLP to tokenize the Social Security Numbers in both tables. Note: To support your data’s privacy, DLP supports encryption keys which are wrapped using Cloud Key Management Service.
Step 2: Masking or redacting unneeded raw PII values.
For the business use case described above, we don’t need email addresses or exact age.
Given that, one option is to use Cloud DLP’s value replacement and bucketing to transform values in a column. For example, the users_db table can be transformed to replace email address with the string “[email-address]” and replace exact age values with bucket ranges.
Great, the datasets have been de-identified and tokenized! Casey can now join and analyze the data
Tokenization for unstructured data
What we’ve described so far is tokenization of structured data. However, in a real-word scenario, it’s likely that unstructured data containing PII is present. For example, let’s consider the users_db table again and add a customer_support_notes column to the table. This column stores a user’s call log from when they last called the company’s automated customer support line.
Cloud DLP can also detect and de-identify sensitive information in this unstructured data. In the example below we have configured Cloud DLP to tokenize the SSN to keep referential integrity and redacted other sensitive data. Referential integrity here would allow someone to see that the Social Security Number in the unstructured notes is the same as the Social Security Number in other parts of the data such as in the SSN column. This happens because the token is the same value for the same given input values:
What we’ve just shown is how you can use leverage the power of Cloud DLP’s inspection engine along with its ability to transform and tokenize to help protect both structured and unstructured text.
Tokenization in action
On Google Cloud Platform, you can tokenize data using Cloud DLP and a click-to-deploy Cloud Dataflow pipeline. This ready-to-use pipeline takes data from Cloud Storage, processes it and ingests it into BigQuery.
To do this:
Click on “Create job from template”
Select “Data Masking/Tokenization using Cloud DLP from Cloud Storage to BigQuery”
In short, tokenization using Cloud DLP can help you support privacy-sensitive use cases and adhere to data security policies within your organization. It can also help your business satisfy policy and regulatory requirements.
For more about tokenization and Cloud DLP, watch our recent Cloud OnAir webinar, “Protecting sensitive datasets in Google Cloud Platform” to see a demo of tokenization with Cloud DLP in action. Then, to learn more, visit Cloud Data Loss Prevention for resources on getting started.