Security & Identity

Protecting Healthcare data with DLP: A guide for getting started

August 30, 2021

Steve Kluger

Lead Architect, Google Cloud

Nelly Wilson

Data & Analytics Cloud Consultant, Google Cloud

Protecting patient data is of paramount importance to any healthcare provider. This is not only because many laws and regulations in countries around the world require this data to be safeguarded, but also because it is a foundational requirement for trust between a provider and their patient. This requirement does create some tension between patients and their providers. To give patients the best care possible, a significant amount of Protected Health Information (PHI) is shared with providers, who in turn share it with other providers, insurers, labs, etc. While sharing data can lead to better quality of care, it also introduces more risk to patient privacy.

With many healthcare providers and insurers adopting cloud technologies, the concerns around protecting PHI evolve. The data types encountered in the healthcare ecosystem are diverse and complex, which means there are many different systems and formats to protect. This is where Cloud Data Loss Prevention (DLP) comes into play. Cloud DLP can help identify PHI/PII and obfuscate it, allowing that data to be used while adding a layer of protection for the privacy of the patient. This layer complements traditional security measures like access control, encryption-at-rest and encryption-in-transit because it can change or mask the data itself. This helps attain a deeper level of “least privilege” access, or data minimization.

At Google Cloud we help many healthcare providers build and deploy globally scaled data infrastructure. These providers use Cloud Data Loss Prevention to discover PHI and protect it.

In this series of articles, our goal is to discuss the various data formats (e.g. structured data, CSV, etc.) and systems in Google Cloud that may handle PHI, and how the Google DLP API can be used across all of them to protect sensitive data.

What is DLP?

Cloud DLP helps customers inspect and mask sensitive data with techniques like redaction, bucketing, date-shifting, and tokenization, which help strike a balance between risk and utility. This is especially crucial when dealing with unstructured or free-text workloads, in which it can be challenging to know what data to redact. Google Cloud DLP provides many system- and storage-agnostic capabilities that enable it to be used in virtually any workload, migration, or real-time application.

In 2021, Forrester Research named Google Cloud a Leader in The Forrester Wave™: Unstructured Data Security Platforms, Q2 2021 report, and rated Google Cloud highest in the current offering category among the providers evaluated. Additionally, Google Cloud received the highest possible score in the Obfuscation criterion, a technique that can help protect sensitive data, like personally identifiable information (PII).

While there are many great articles describing Google Cloud DLP and its components, this article focuses on the key features of Cloud DLP that Healthcare providers leverage, including:

Data discovery and classification
Data masking and obfuscation
Measuring re-identification risk

Types of Data Healthcare Providers Manage

While all Healthcare providers have data needs that are unique to their organization, we generally see Healthcare data falling into two buckets:

Text-based data

This data is typically seen as CSVs, flat files and transactional database entries

HL7/FHIR/DICOM data

This data is typically received from EMR and other systems that follow interoperability standards in Healthcare
In the case of HL7/FHIR data, the data is typically structured as a JSON or XML object
DICOM, being the standard for medical imaging, stores image files that often have text embedded in them

These common Healthcare data structures leverage various data sources and sinks to ingest, store and analyze data. The following is the list of the services we typically see leveraged for Healthcare data:

Google Cloud Storage
Google BigQuery
Google Cloud SQL
Google Cloud Pub/Sub
Google Cloud Dataflow

One service in Google Cloud deserves a special callout here because of its impact on managing and leveraging Healthcare data: the Google Healthcare API. See the Google Healthcare API overview videos and documentation for more complete information. For the purposes of these articles we will focus on a few of its key capabilities, including:

Ingestion of HL7/FHIR data
Ingestion of DICOM data
Leveraging DLP through the inbuilt features of the Healthcare API

Getting Started with DLP for Healthcare Data

There are a few key steps required to begin leveraging the DLP API which we will walk through. We will begin with a simple use case that scans a Google Cloud Storage bucket for the information we define, and replaces the information with the name of the information type. This is a good method to use to periodically scan a common data store for PHI, for example, in development environments.

Inspecting Data

The first step is to create a template that tells the DLP API what data you need to find. To do so you will build an inspect template (example shown below). Inspect templates have many built-in infotypes (over 150) that allow users to discover common data elements that require redaction, like names, social security numbers, etc. They also allow custom infotypes to be defined in case you need to extend past the built-in detectors.

Knowing where sensitive data exists is a critical step to protecting it. Using Cloud DLP to help discover, inspect, and classify data can help you understand how to best protect and secure your data. This inspection can be integrated into workflows to proactively detect and prevent data loss, or it can be used for ongoing inspections, which can generate security notifications/alerts when data is found in areas that it’s not expected.

For the purposes of this article, we will show how to build an inspect template similar to the one that has been integrated into our Cloud Healthcare API for the de-identification of FHIR data, with a couple of additional infotypes we often see used by our customers. There are many other built-in infotypes that are useful for Healthcare, like ICD10 codes. We also added a custom infotype to show how a regex can be used to match data based on unique needs.

{
  "templateId": "exampleInspectTemplate",
  "inspectTemplate": {
    "inspectConfig": {
      "infoTypes": [
        {
          "name": "CREDIT_CARD_NUMBER"
        },
        {
          "name": "EMAIL_ADDRESS"
        },
        {
          "name": "IP_ADDRESS"
        },
        {
          "name": "MAC_ADDRESS_LOCAL"
        },
        {
          "name": "MAC_ADDRESS"
        },
        {
          "name": "PHONE_NUMBER"
        },
        {
          "name": "US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER"
        },
        {
          "name": "US_SOCIAL_SECURITY_NUMBER"
        },
        {
          "name": "US_VEHICLE_IDENTIFICATION_NUMBER"
        },
        {
          "name": "PASSPORT"
        },
        {
          "name": "AGE"
        },
        {
          "name": "DATE"
        },
        {
          "name": "LOCATION"
        },
        {
          "name": "PERSON_NAME"
        },
        {
          "name": "SWIFT_CODE"
        },
        {
          "name": "DATE_OF_BIRTH"
        },
        {
          "name": "STREET_ADDRESS"
        },
        {
          "name": "ETHNIC_GROUP"
        },
        {
          "name": "US_DRIVERS_LICENSE_NUMBER"
        }
      ],
      "minLikelihood": "POSSIBLE",
      "customInfoTypes": [
        {
          "infoType": {
            "name": "CUSTOM_ID1"
          },
          "regex": {
            "pattern": "(A[0-9]{5}-[0-9]{7})"
          }
        }
      ]
    }
  }
}
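Before relying on a custom infotype, it can help to check its pattern against sample values. The following is a minimal sketch of that check for the `CUSTOM_ID1` pattern from the template above. It makes two assumptions: the sample strings are invented for illustration, and the pattern is evaluated locally with Python's `re` module, whereas Cloud DLP evaluates custom regex detectors using RE2 syntax. For a simple pattern like this one the two should agree, but more complex patterns may not.

```python
import re

# Pattern copied from the CUSTOM_ID1 custom infoType in the inspect template.
CUSTOM_ID1_PATTERN = r"(A[0-9]{5}-[0-9]{7})"


def find_custom_ids(text):
    """Return every substring of `text` that matches the CUSTOM_ID1 pattern."""
    return re.findall(CUSTOM_ID1_PATTERN, text)


# Invented sample text: one valid ID, one with the wrong prefix, one too short.
sample = "Patient A12345-0000001 was referred by B12345-0000001 (ref A1234-567)."
print(find_custom_ids(sample))

# The pattern is not anchored, so it also matches inside a longer token.
print(find_custom_ids("XA12345-00000019"))
```

Because the pattern is unanchored, it matches an ID embedded in a longer string. If that is not the intent for your identifiers, consider tightening the pattern (for example with word boundaries) before adding it to the template.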

De-Identifying Data

The next step, de-identification, tells the DLP API what to do once it finds information based on the inspect template that you built. The De-Identify template can do many things such as:

Tokenization with secure one-way hashing
Tokenization with two-way Deterministic Encryption or Format Preserving Encryption
Date shifting
Data masking
Bucketing or generalization
Combinations and variations of the above
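To make these techniques more concrete, here is a small, purely local sketch of what four of them do to a value. It is not how Cloud DLP implements them and does not call the DLP API; the key, function names, parameters, and the choice of HMAC-SHA-256 are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical demo key; a real deployment would use a managed secret.
SECRET_KEY = b"demo-key-do-not-use-in-production"


def hash_token(value, key=SECRET_KEY):
    """One-way tokenization: keyed SHA-256 hash of the value, base64-encoded."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def shift_date(d, patient_id, max_days=30, key=SECRET_KEY):
    """Date shifting: move a date by a per-patient offset in [-max_days, max_days].

    The offset is derived from the patient ID, so all dates for one patient
    move by the same amount and the intervals between them are preserved.
    """
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


def mask(value, visible=4, mask_char="#"):
    """Data masking: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def bucket_age(age, width=10):
    """Bucketing: replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(mask("123-45-6789"))
print(bucket_age(47))
admit = shift_date(date(2021, 3, 1), "patient-001")
discharge = shift_date(date(2021, 3, 9), "patient-001")
print((discharge - admit).days)  # interval between the two dates is unchanged
```

Each function trades some utility for privacy in a different way: hashing keeps joinability, date shifting keeps intervals, masking keeps a recognizable suffix, and bucketing keeps coarse ranges.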

For the purposes of this article we created a De-Identify template that replaces the data matching the inspection configuration with the name of the infotype detected (e.g. [DATE_OF_BIRTH]). The Cloud DLP documentation on de-identification describes many more options for what a De-Identify template can do.
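As a rough sketch of such a template, the Python snippet below builds a request body that uses the `replaceWithInfoTypeConfig` transformation and prints it as JSON in the same shape as the inspect template above. The template ID is hypothetical, and the field names reflect our reading of the DLP REST reference, so verify them against the current documentation before use. The `replace_with_info_type` helper is not part of the DLP API; it is a local stand-in, with invented sample text and findings, that illustrates the expected effect on a string.

```python
import json

# Request body for creating a de-identify template (field names assumed from
# the DLP REST reference; "exampleDeidentifyTemplate" is a hypothetical ID).
deidentify_template_request = {
    "templateId": "exampleDeidentifyTemplate",
    "deidentifyTemplate": {
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    # No infoTypes listed: apply to every finding from inspection.
                    {"primitiveTransformation": {"replaceWithInfoTypeConfig": {}}}
                ]
            }
        }
    },
}
print(json.dumps(deidentify_template_request, indent=2))


def replace_with_info_type(text, findings):
    """Local illustration: replace each finding with its bracketed infotype name.

    `findings` is a list of (start, end, info_type) character spans with an
    exclusive end; spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, info_type in sorted(findings):
        pieces.append(text[cursor:start])
        pieces.append(f"[{info_type}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span(text, value, info_type):
    """Build a finding tuple for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return (start, start + len(value), info_type)


record = "DOB: 01/02/1980, SSN: 123-45-6789"
findings = [
    span(record, "01/02/1980", "DATE_OF_BIRTH"),
    span(record, "123-45-6789", "US_SOCIAL_SECURITY_NUMBER"),
]
print(replace_with_info_type(record, findings))
```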

Scheduled Inspection Jobs

Now that you know what you are scanning for and the actions you want to take when sensitive data is discovered, you can schedule a scan. This process is well documented in the Cloud DLP documentation, but here are a few key callouts before we proceed:

When configuring a scan you must know the GCS bucket (or BigQuery dataset) that you want to scan
When configuring your scan, consider reducing the sampling rate (percentage of data scanned) and only scanning data changed since the last scan to reduce costs
While configuring the scan you will be given several options for notifications and outputs of DLP scans called Actions (seen below)

https://storage.googleapis.com/gweb-cloudblog-publish/images/dlp_hcls-1.max-600x600.jpg

These actions have various capabilities for analysis and notifications. Detailed descriptions of the options are listed here:

Publish to BigQuery: This setting allows you to publish all results of DLP scans to a BigQuery dataset for future analysis.
Publish to Pub/Sub: This option will create messages in a selected Pub/Sub topic about the outcome of DLP scans. This is a great option if you want to have other applications, like a SIEM, consume the results.
Publish to Security Command Center: Results can be published into Security Command Center for review by security teams.
Publish to Data Catalog: If you leverage Data Catalog in your environment to manage and understand your data in Google Cloud, you can add your scan result data to your catalog.
Notify by email: This action sends an email to project owners and editors when the job completes.
Publish to Cloud Monitoring: Send inspection results to Cloud Monitoring in Google Cloud's Operations suite.
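The scan settings and actions described above can also be expressed as a job trigger request body rather than configured in the console. The sketch below assembles one as a Python dictionary. The bucket URL, template name, table, and topic are hypothetical placeholders, the function name and defaults are our own, and the field names follow our reading of the DLP REST reference, so check them against the current documentation. The Data Catalog action is left out because, to our understanding, it applies to BigQuery scans rather than Cloud Storage scans.

```python
import json


def build_job_trigger(bucket_url, inspect_template_name, findings_table,
                      pubsub_topic, sample_percent=10, period_days=7):
    """Build a job trigger body for a scheduled Cloud Storage inspection.

    `findings_table` is a (project_id, dataset_id, table_id) tuple.
    """
    if not 1 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 1 and 100")
    if not 1 <= period_days <= 60:
        raise ValueError("period_days must be between 1 and 60")
    project_id, dataset_id, table_id = findings_table
    return {
        "jobTrigger": {
            "displayName": "Scheduled PHI scan",
            "status": "HEALTHY",
            # Schedule: run once every `period_days` days.
            "triggers": [
                {"schedule": {"recurrencePeriodDuration": f"{period_days * 86400}s"}}
            ],
            "inspectJob": {
                "inspectTemplateName": inspect_template_name,
                "storageConfig": {
                    "cloudStorageOptions": {
                        "fileSet": {"url": bucket_url},
                        # Sampling: scan only a share of files and of each file.
                        "filesLimitPercent": sample_percent,
                        "bytesLimitPerFilePercent": sample_percent,
                    },
                    # Only scan content that changed since the last run.
                    "timespanConfig": {"enableAutoPopulationOfTimespanConfig": True},
                },
                "actions": [
                    {"saveFindings": {"outputConfig": {"table": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_id,
                    }}}},
                    {"pubSub": {"topic": pubsub_topic}},
                    {"publishSummaryToCscc": {}},
                    {"jobNotificationEmails": {}},
                ],
            },
        }
    }


# All resource names below are hypothetical placeholders.
trigger = build_job_trigger(
    bucket_url="gs://example-phi-bucket/*",
    inspect_template_name="projects/example-project/inspectTemplates/exampleInspectTemplate",
    findings_table=("example-project", "dlp_results", "findings"),
    pubsub_topic="projects/example-project/topics/dlp-scan-results",
)
print(json.dumps(trigger, indent=2))
```

Keeping the sampling percentage and recurrence period as parameters makes it easier to tune cost against coverage for each environment.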

See Results

In the console, navigate to Data Loss Prevention and select the Inspection tab. You should see a list of jobs similar to the one in the following image:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image2_D4i9n4B.max-1000x1000.jpg

If you select any job ID you will see the job details including findings, bytes scanned, errors, and a result listing that shows which infotypes were discovered.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_2TCqIfL.max-1500x1500.jpg

As noted above, you can send your results to many locations via DLP Actions. That flexibility gives you many options for building bespoke analyses of scan results in your existing BI tools. If you don’t have that tooling available, a good alternative is to send your results to BigQuery and use a Data Studio dashboard to analyze them.
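As an illustration of the kind of summary such a dashboard might show, the sketch below counts findings per infotype. It runs entirely locally on invented sample rows. The row shape (`info_type.name` and `likelihood`) is an assumption modeled on the findings table that the BigQuery action writes, so confirm the actual schema of your table before adapting it.

```python
from collections import Counter

# Invented sample rows, shaped like exported findings (schema is an assumption).
sample_findings = [
    {"info_type": {"name": "DATE_OF_BIRTH"}, "likelihood": "LIKELY"},
    {"info_type": {"name": "PERSON_NAME"}, "likelihood": "POSSIBLE"},
    {"info_type": {"name": "DATE_OF_BIRTH"}, "likelihood": "VERY_LIKELY"},
    {"info_type": {"name": "US_SOCIAL_SECURITY_NUMBER"}, "likelihood": "LIKELY"},
    {"info_type": {"name": "DATE_OF_BIRTH"}, "likelihood": "POSSIBLE"},
    {"info_type": {"name": "PERSON_NAME"}, "likelihood": "LIKELY"},
]


def count_by_info_type(rows):
    """Return (infotype name, count) pairs, most frequent first."""
    counts = Counter(row["info_type"]["name"] for row in rows)
    return counts.most_common()


for name, count in count_by_info_type(sample_findings):
    print(f"{name}: {count}")
```

The same grouping can be expressed as a `GROUP BY` query over the findings table once the results are in BigQuery.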

Next Steps

In this article we started down the path of leveraging DLP to protect PHI in Healthcare environments. Now that we have the basic building blocks set up, we want to start using them.

In the rest of this blog post series we will discuss:

DLP use cases for the different data stores commonly used in Healthcare
DLP in the Healthcare API
Alternate De-Identification methods
Viewing and managing scanning results

Google Cloud DLP is built for the modern technology landscape. By utilizing the steps above, you can create a secure foundation for protecting patient data.

The Google Cloud team is here to help. To learn more about getting started on DLP or general best practices to manage risk, reach out to your Technical Account Manager or contact a Google Cloud account team.
