Example architecture for using a DLP proxy to query a database containing sensitive data

Last reviewed 2022-09-29 UTC

This document describes how to use Sensitive Data Protection to mitigate the risk of exposing sensitive data stored in Google Cloud databases, while still letting users query meaningful data.

Sensitive data can exist across your enterprise. Data that is collected, processed, and shared can contain information such as personally identifiable information (PII) that is subject to external and internal policies or regulations. In addition to proper security controls to restrict access to sensitive data, you can also use these techniques to help protect the data in use. De-identification helps to strike a balance between utility and privacy of data by using techniques such as data masking, bucketing, and tokenization.

Tokenization substitutes sensitive data with surrogate values called tokens, which represent the original (raw) sensitive value when the data is queried or viewed. This process is sometimes referred to as pseudonymization or surrogate replacement. The concept of tokenization is widely used in industries such as finance and healthcare, to help lower the risk of data in use, reduce compliance scope, and minimize sensitive data being exposed to people or systems that don't need it.
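The key property of tokenization is that the same raw value always maps to the same surrogate, so joins and aggregations still work on tokenized data. The following sketch illustrates that property with a simple HMAC-based token generator; it is an illustration of the concept, not the cryptographic transformation that Sensitive Data Protection performs, and the key material shown is a placeholder.

```python
import hashlib
import hmac

def tokenize(value: str, key: bytes) -> str:
    """Replace a sensitive value with a deterministic surrogate token.

    The same input always yields the same token, which preserves
    referential integrity across rows (joins and counts still work).
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"TOKEN({digest[:16]})"

key = b"example-key-material"  # in production, key material wrapped by Cloud KMS
token_a = tokenize("jane.doe@example.com", key)
token_b = tokenize("jane.doe@example.com", key)
print(token_a == token_b)  # True: identical input produces the identical token
```

Because the mapping is deterministic, two tables tokenized with the same key can still be joined on the tokenized column.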

With Sensitive Data Protection, you can classify and de-identify sensitive data in batches and in real time. Classification is the process of identifying sensitive information and deciding what type it is. This document discusses where you can employ these de-identification techniques and shows how you can use a proxy to accomplish these tasks.

The following diagram illustrates the scenario outlined in this document.

Architecture of data stored in Cloud Storage, ingested through ETL, and then queried by users.

  • Data is stored at rest in Cloud Storage. For example, data that is received from a partner.
  • Data is ingested through an extract, transform, and load (ETL) process to a SQL database.
  • Data in this database is queried by users to perform analysis.

In this scenario, the query returns raw data, so sensitive data is displayed and PII is potentially exposed to the user running the query. You should design your application to audit and prevent unauthorized queries of sensitive data.

The DLP proxy architecture

One way to protect PII data is to pass all queries and results through a service that parses, inspects, and then either logs the findings or de-identifies the results by using Sensitive Data Protection before returning the requested data to the user. In this document, this service is called DLP proxy.

The DLP proxy application accepts a SQL query as input, runs that query on the database, and then applies Sensitive Data Protection to the results, before returning them to the user requesting the data.
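The query path described above can be sketched as follows. This is a minimal illustration, using an in-memory sqlite3 database as a stand-in for the SQL database and a simple regex mask in place of a real Cloud DLP `deidentify_content` call; the function names are hypothetical.

```python
import re
import sqlite3

def deidentify(value):
    """Stand-in for a Cloud DLP de-identification call: mask email-like strings."""
    if isinstance(value, str):
        return re.sub(r"[^@\s]+@[^@\s]+", "[EMAIL]", value)
    return value

def dlp_proxy_query(conn, sql):
    """Run the user's SQL, then de-identify every cell before returning it."""
    rows = conn.execute(sql).fetchall()
    return [tuple(deidentify(cell) for cell in row) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Jane', 'jane@example.com')")
print(dlp_proxy_query(conn, "SELECT * FROM users"))
# [('Jane', '[EMAIL]')]
```

The important point is that the raw result set never leaves the proxy; only the transformed rows are returned to the user.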

The following diagram illustrates the architecture of the DLP proxy application.

Architecture of the DLP proxy app with data transformation commands.

Sensitive Data Protection allows detailed configuration of what types of data to inspect for, and how to transform that data based on these inspection findings or data structure (for example, field names). To simplify the creation and management of the configuration, you use Sensitive Data Protection templates. The DLP proxy application references both inspect and de-identify templates.

You can use templates to create and persist configuration information with Sensitive Data Protection. Templates are useful for decoupling configuration information—such as what you inspect for and how you de-identify it—from the implementation of your requests. For more information about templates, see Sensitive Data Protection templates.
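As a sketch of what such templates contain, the following dictionaries follow the request-body shapes accepted by the google-cloud-dlp client's `create_inspect_template` and `create_deidentify_template` methods; the specific infoTypes and masking character are illustrative choices, not values prescribed by this architecture.

```python
# An inspect template body: what types of sensitive data to look for.
inspect_template = {
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
        "min_likelihood": "LIKELY",
    }
}

# A de-identify template body: how to transform matching findings.
deidentify_template = {
    "deidentify_config": {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "character_mask_config": {"masking_character": "#"}
                    }
                }
            ]
        }
    }
}
```

The DLP proxy application only needs to reference the resource names of these templates at request time, so security admins can change what is inspected or how it is masked without redeploying the proxy.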

Cloud Audit Logs is an integrated logging service from Google Cloud that is used in this architecture. First, Cloud Audit Logs provides an audit trail of calls made to the Cloud Data Loss Prevention API (part of Sensitive Data Protection). The audit log entries include information about who made the API call, which Google Cloud project it was run against, and details about the request, including if a template was used as part of the request. Second, if you use the application's configuration file to turn on auditing, Cloud Audit Logs records a summary of the inspection findings.

Cloud Key Management Service (Cloud KMS) is a cloud-hosted key management service from Google Cloud that lets you manage cryptographic keys for your cloud services.

Sensitive Data Protection methods for tokenization and date shifting use cryptography to generate the replacement values. These cryptographic methods use a key to encrypt the values in a consistent way to preserve referential integrity or, for reversible methods, to support detokenization. You can provide this key directly to Sensitive Data Protection when the call is made, or you can wrap it by using Cloud KMS. Wrapping your key in Cloud KMS provides another layer of access control and auditing, and is therefore the preferred method for production deployments.
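The following sketch shows a transformation in the shape used by the Cloud DLP API, with the data key wrapped by Cloud KMS rather than passed raw. The wrapped-key value and the resource-name segments in capitals are placeholders you would substitute for your own project.

```python
# A CryptoDeterministicConfig transformation whose data key is
# wrapped by a Cloud KMS key instead of being supplied in plaintext.
crypto_transformation = {
    "primitive_transformation": {
        "crypto_deterministic_config": {
            "crypto_key": {
                "kms_wrapped": {
                    "wrapped_key": "BASE64_WRAPPED_KEY",  # placeholder
                    "crypto_key_name": (
                        "projects/PROJECT_ID/locations/global/"
                        "keyRings/KEY_RING/cryptoKeys/KEY_NAME"
                    ),
                }
            },
            "surrogate_info_type": {"name": "EMAIL_TOKEN"},
        }
    }
}
```

With this configuration, unwrapping the data key requires permission on the Cloud KMS key, so every use of the key is subject to IAM and appears in the audit trail.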

In a production configuration, you should use the principle of least privilege to assign permissions. The following diagram illustrates an example of this principle.

Production configuration with three personas and their permissions.

The preceding diagram shows that a typical production configuration has three personas with different roles and access to the raw data:

  • Infrastructure admin: Installs and configures the proxy, so they have access to the compute environment that the DLP proxy is installed on.
  • Data analyst: Accesses the client that connects to the DLP proxy.
  • Security admin: Classifies the data, creates the Sensitive Data Protection templates, and configures Cloud KMS.

For more information about using Cloud KMS to encrypt and decrypt data, see Encrypting and decrypting data.

For the DLP proxy used in this document, this key configuration is captured in a Sensitive Data Protection de-identification template.

Protecting PII with auditing, masking, and tokenization

There are two strategies that you can implement to mitigate the risk of exposing PII in this scenario.

Raw data stored in the database

If your application stores raw data in a database, you can use the DLP proxy to process results returned to the user by automatically inspecting and generating an audit of any sensitive findings. Or you can mask the query results in real time, as illustrated in the following diagram.

Architecture where query results are masked in real time.

This configuration requires that you use a SQL client that connects to the DLP proxy. If you enable auditing on your app, a log is created in Cloud Audit Logs with a summary of the inspection findings. This summary indicates what type of sensitive information was returned in the query.
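When auditing is enabled, the proxy's log entry is essentially a count of findings per infoType in the result set. A minimal sketch of producing such a summary, using a regex stand-in for a real Cloud DLP inspection call (the helper name is illustrative):

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+")

def audit_findings(rows):
    """Summarize sensitive findings in query results, similar to the
    summary the proxy logs to Cloud Audit Logs when auditing is enabled."""
    counts = Counter()
    for row in rows:
        for cell in row:
            if isinstance(cell, str) and EMAIL_RE.search(cell):
                counts["EMAIL_ADDRESS"] += 1
    return dict(counts)

rows = [("Jane", "jane@example.com"), ("Raj", "raj@example.com")]
print(audit_findings(rows))
# {'EMAIL_ADDRESS': 2}
```

Logging only the finding types and counts, rather than the matched values themselves, keeps the audit log from becoming another copy of the sensitive data.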

Data stored in de-identified form

If you don't want to store the raw data, you can store the data for your application in a de-identified or masked form by performing the de-identification transforms during the ETL process into the database, as illustrated in the following diagram.

Architecture where query results are masked during the ETL process.

The preceding diagram illustrates the basic flow, where data is inspected and masked before being ingested into the database. When a user queries this data, even if they have raw access to the database, they can only see the masked version.
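The load step of that flow can be sketched as follows, again using sqlite3 as a stand-in database and a regex mask in place of a real Cloud DLP transformation; the function names are illustrative. The difference from the query-time approach is that the transformation runs once, before the data is stored.

```python
import re
import sqlite3

def mask(value):
    """Mask email-like strings before they ever reach the database."""
    if isinstance(value, str):
        return re.sub(r"[^@\s]+@[^@\s]+", "[EMAIL]", value)
    return value

def etl_load(conn, records):
    """De-identify each record during the load step of the ETL pipeline."""
    masked = [tuple(mask(v) for v in record) for record in records]
    conn.executemany("INSERT INTO users VALUES (?, ?)", masked)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
etl_load(conn, [("Jane", "jane@example.com")])
print(conn.execute("SELECT * FROM users").fetchall())
# [('Jane', '[EMAIL]')]
```

Because only masked values are stored, even a user with direct database access sees no raw PII.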

If you permit unmasked data to be seen by the user, you need to use a client that can connect to an instance of the DLP proxy that has permission to unmask the data, as illustrated in the following diagram.

Architecture where you use a client to connect to DLP proxy to view unmasked data.

The preceding diagram illustrates how a client connects to an authorized instance of the DLP proxy so that unmasked data can be shown to that user.

What's next