
Take charge of your data: using Cloud DLP to de-identify and obfuscate sensitive information

November 20, 2018
Jordanna Chord

Senior Staff Software Engineer

Scott Ellis

Product Manager, Google Cloud

In our previous “Taking charge of your data” post, we talked about how to gain visibility into your data using the Cloud Data Loss Prevention (DLP) API. But discovering sensitive data is just the start. In this post we’ll tackle how to protect that data by incorporating data obfuscation and minimization techniques automatically into your workflows—leaving less potential for human error.

Imagine you have a database of chat logs from your customer interactions or unstructured medical notes. Both contain valuable insights but also occasional bits of sensitive data such as phone numbers, email addresses, and other forms of personally identifiable information (PII). You have a few options to protect this sensitive data. For one, you can apply traditional security techniques like access control and encryption. These are great for securing data, but are agnostic to the actual content. In other words, any user to whom you grant access gets access to all the information in a dataset. This can lead to exposing more data than is needed, even for approved business uses such as analytics.

What you really want is to give users access to all of the data except the bits that are sensitive. One option is to remove the sensitive bits from the rest of the data using a class of privacy-preserving transformations referred to as de-identification. NIST defines de-identification as the process of “remov[ing] identifying information from a dataset so that individual data cannot be linked with specific individuals.”

Thus, with de-identification, if a customer gave their phone number in a chat message like:

My phone number is 8582394000 if you need to reach me

You could share the text minus the phone number—something like:

My phone number is [PHONE_NUMBER] if you need to reach me

In this way, de-identification can help reduce the risks inherent in data so that when someone is granted access to it, they are less likely to be exposed to any sensitive PII.  
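To make the idea concrete, here is a minimal local sketch of that replacement in Python. It is not Cloud DLP: the regular expression is a stand-in that assumes a phone number is any run of exactly 10 digits, and the function name `replace_phone_numbers` is hypothetical. Cloud DLP's own detectors handle far more formats than this.

```python
import re

# Assumption: a phone number is any standalone run of exactly 10 digits.
# This is a simplification for illustration, not how Cloud DLP detects PII.
PHONE_PATTERN = re.compile(r"\b\d{10}\b")


def replace_phone_numbers(text: str, label: str = "PHONE_NUMBER") -> str:
    """Replace every 10-digit run in `text` with a bracketed label."""
    return PHONE_PATTERN.sub(f"[{label}]", text)


message = "My phone number is 8582394000 if you need to reach me"
print(replace_phone_numbers(message))
# prints: My phone number is [PHONE_NUMBER] if you need to reach me
```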

De-identification in Cloud DLP

There are several de-identification techniques that can help obscure sensitive information while preserving some utility. Below are a few common techniques supported in Cloud DLP.

  • Replacement - Replaces each input value with a given value.

  • Redaction - Redacts a value by removing it.

  • Mask - Masks a string either fully or partially by replacing a given number of characters with a specified fixed character. This technique can, for example, mask everything but the last four digits of an account number or social security number.

  • Pseudonymization with secure hash - Replaces input values with a secure one-way hash generated using a data encryption key.

  • Pseudonymization with format-preserving token - Replaces an input value with a “token,” or surrogate value, of the same character set and length using format-preserving encryption (FPE). Preserving the format can help ensure compatibility with legacy systems that have restricted schema or format requirements.

  • Date shifting - Shifts dates by a random number of days per user or entity. This helps obfuscate actual dates while still preserving the sequence and duration of a series of events or transactions.

  • Date extraction - Extracts or preserves a portion of Date, Timestamp, and TimeOfDay values.

  • Generalization: Bucketing - Masks input values by replacing them with “buckets,” or ranges, within which the input value falls. For example, you can bucket specific ages into age ranges or distinct values into ranges like “low,” “medium,” or “high.”

You can apply these de-identification techniques on both structured and unstructured data. That is, you can apply them to an entire column (such as a user ID) or on findings inside a block of text.
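To give a feel for what a few of these techniques do to a value, here is a rough local sketch using only the Python standard library. It is not Cloud DLP's implementation. The function names are hypothetical, the keyed hash is assumed to be HMAC-SHA-256, and the date shift derives a per-entity offset from that same keyed hash so that one entity always gets the same shift.

```python
import hashlib
import hmac
from datetime import date, timedelta


def mask(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `keep_last` characters (assumes keep_last >= 0)."""
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


def bucket_age(age: int, width: int = 10) -> str:
    """Generalize an exact age into a fixed-width range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def keyed_hash(value: str, key: bytes) -> str:
    """One-way pseudonym: the same value and key always give the same hash."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def shift_date(d: date, entity_id: str, key: bytes, max_days: int = 30) -> date:
    """Shift a date by an offset in [-max_days, max_days] fixed per entity."""
    digest = hmac.new(key, entity_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


key = b"demo-key-not-for-production"  # assumption: a throwaway demo key
print(mask("1234567890"))  # prints: ******7890
print(bucket_age(37))      # prints: 30-39
print(keyed_hash("user-42", key)[:12])

visit_1 = shift_date(date(2018, 1, 1), "patient-7", key)
visit_2 = shift_date(date(2018, 3, 1), "patient-7", key)
print((visit_2 - visit_1).days)  # the gap between visits is unchanged
```

Because the date offset depends only on the entity and the key, every date for one entity moves by the same amount, which is what keeps sequences and durations intact.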

Expanding on the phone number example above, let’s say you wanted to replace the phone number using pseudonymization with a format-preserving token. This would look something like the following:

My phone number is 6070548884 if you need to reach me

Or, here’s the pseudonymized output plus an optional prefix indicating it can be reverted:

My phone number is PHONE(10):6070548884 if you need to reach me
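As a sketch of how such a request might be put together, the Python below builds a JSON body for the Cloud DLP v2 `content:deidentify` REST method. It only constructs and prints the body and does not call the service. The helper name `build_fpe_request` is hypothetical, the surrogate name `PHONE` is chosen to match the prefix in the example above, and the raw ("unwrapped") key is an assumption made to keep the example short; a key wrapped with Cloud KMS is the safer choice for real workloads. Check the field names against the current API reference before relying on them.

```python
import base64
import json
import os


def build_fpe_request(text: str, key: bytes) -> dict:
    """Build a de-identify request body that tokenizes phone numbers with FPE."""
    if len(key) not in (16, 24, 32):
        raise ValueError("key must be 16, 24, or 32 bytes long")
    fpe_config = {
        # Assumption: an unwrapped key, base64-encoded for JSON transport.
        "cryptoKey": {"unwrapped": {"key": base64.b64encode(key).decode("ascii")}},
        # Phone numbers are digits, so the token keeps a numeric alphabet.
        "commonAlphabet": "NUMERIC",
        # Adds the reversible prefix, e.g. PHONE(10):<token>.
        "surrogateInfoType": {"name": "PHONE"},
    }
    return {
        "item": {"value": text},
        "inspectConfig": {"infoTypes": [{"name": "PHONE_NUMBER"}]},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "infoTypes": [{"name": "PHONE_NUMBER"}],
                        "primitiveTransformation": {
                            "cryptoReplaceFfxFpeConfig": fpe_config
                        },
                    }
                ]
            }
        },
    }


body = build_fpe_request(
    "My phone number is 8582394000 if you need to reach me",
    key=os.urandom(32),  # demo only: a random key cannot be reused later
)
print(json.dumps(body, indent=2))
```

Dropping the `surrogateInfoType` entry should yield the bare token form shown in the first example, without the prefix.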

All of Cloud DLP’s de-identification options are available through a simple REST API, as well as several client libraries in common scripting and programming languages. To get started, check out the Cloud DLP documentation.

Cloud DLP in action

Cloud DLP provides a variety of flexible and scalable tools to help you de-identify sensitive data and reduce risk in your production workloads. Here are a few examples of how we see customers using Cloud DLP.

Automated large-scale data obfuscation

Your business needs to share data with a third party to run an analysis, but you don’t want to share raw PII that the partner doesn’t need. Cloud DLP helps you automatically mask, pseudonymize, or generalize the data that you work with. You can configure it to fit your workload by aggressively removing data, by choosing transformations that preserve the referential integrity of joins and aggregate analysis, or both.

https://storage.googleapis.com/gweb-cloudblog-publish/images/large-scale_data_obfuscation.max-2800x2800.max-2200x2200.jpg

To get started, here is a reference pipeline that leverages Cloud DLP and Dataflow to automatically obfuscate data and ingest it to Cloud Storage or BigQuery.
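To illustrate what preserving referential integrity means in practice, here is a small local sketch in Python. It is not the reference pipeline and does not use Cloud DLP or Dataflow. It assumes a deterministic keyed hash (HMAC-SHA-256, truncated for readability) as the pseudonym, and the table contents and function names are made up for the example.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Deterministic pseudonym: equal inputs map to equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deidentify_rows(rows, column, key):
    """Return copies of `rows` with `column` replaced by its pseudonym."""
    return [{**row, column: pseudonymize(row[column], key)} for row in rows]


def join_on(left, right, column):
    """Inner-join two lists of dicts on a shared column."""
    index = {}
    for row in right:
        index.setdefault(row[column], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[column], [])]


# Hypothetical sample tables that share an email column.
customers = [
    {"email": "ana@example.com", "plan": "pro"},
    {"email": "bo@example.com", "plan": "free"},
]
orders = [
    {"email": "ana@example.com", "total": 30},
    {"email": "ana@example.com", "total": 12},
    {"email": "cy@example.com", "total": 5},
]

key = b"shared-demo-key"  # assumption: both tables use the same key
safe_customers = deidentify_rows(customers, "email", key)
safe_orders = deidentify_rows(orders, "email", key)

# The join finds the same matches, but no raw email appears in the result.
for row in join_on(safe_customers, safe_orders, "email"):
    print(row)
```

The join still works because both tables were transformed with the same key. With different keys per table, the tokens would no longer line up.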

Real-time data minimization

In the spirit of reducing privacy risks, you need to manage the data that you collect from customers. Whether you collect data directly from your customers or through partners using various services and APIs, Cloud DLP can help you reduce the collection of unnecessary PII at the point of collection. Because Cloud DLP is an API, you can use its streaming “inspection” and “de-identify” methods to classify and obfuscate data in real time.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/data_minimization.gif

Better yet, you can call Cloud DLP’s API directly to integrate with virtually any application, including Google Apigee, our API management platform, to help protect both inbound and outbound traffic on all your API endpoints.
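The pattern of scrubbing data before it is ever stored can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `scrub` function is a regex stand-in for a call to a de-identification service such as Cloud DLP, the event shape is invented, and the names `scrub` and `minimize` are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

# Assumption: simplified patterns; a real detector covers many more formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{10}\b")


def scrub(text: str) -> str:
    """Stand-in for a de-identification call on a single string."""
    text = EMAIL.sub("[EMAIL_ADDRESS]", text)
    return PHONE.sub("[PHONE_NUMBER]", text)


def minimize(
    events: Iterable[Dict],
    free_text_fields: Iterable[str],
    deidentify: Callable[[str], str] = scrub,
) -> Iterator[Dict]:
    """Yield copies of incoming events with free-text fields de-identified."""
    fields = list(free_text_fields)
    for event in events:
        cleaned = dict(event)
        for field in fields:
            if isinstance(cleaned.get(field), str):
                cleaned[field] = deidentify(cleaned[field])
        yield cleaned


incoming = [
    {"id": 1, "message": "mail me at a@b.com or 8582394000"},
    {"id": 2, "message": "no contact details here"},
]
for event in minimize(incoming, ["message"]):
    print(event)  # only the cleaned copy would be written to storage
```

Passing the de-identification step in as a callable keeps the collection code unchanged if the regex stand-in is later swapped for a real API call.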


Keeping the PII stored in your data from being exposed is a key concern for many organizations, and it is not easy to do. Cloud DLP provides powerful tools to help you protect the security and privacy of your data through an easy-to-use, flexible API. To learn more, visit our Cloud Data Loss Prevention page, which has resources on getting started.