De-identification of medical images through the Cloud Healthcare API

Introduction

This document explains how researchers, data scientists, IT teams, or healthcare and life sciences organizations can use the Cloud Healthcare API to remove personally identifying information (PII) and protected health information (PHI) from Digital Imaging and Communications in Medicine (DICOM) data. This process, known as de-identification, helps to ensure patient privacy and to prepare DICOM data for use in research, data sharing, and machine learning.

The accompanying tutorial, Using the Cloud Healthcare API to de-identify medical images, guides you through two use cases of de-identifying medical image data by using the Cloud Healthcare API.

How DICOM data de-identification works

Medical images acquired for clinical purposes can have important secondary uses in research projects and teaching libraries. However, you might need to remove or modify sensitive data elements (PII or PHI) from DICOM images before you analyze them or share them with authorized collaborators.

The following diagram shows several pipelines of medical images from on-premises sources that are routed to Google Cloud and then anonymized by the Cloud Healthcare API de-identify operation.

DICOM de-identification pipeline.

First, you upload DICOM-formatted medical images to Cloud Storage, and then to the Cloud Healthcare API. Alternatively, you can upload DICOM images directly to the Cloud Healthcare API. The medical images, which are kept in a DICOM store in the Cloud Healthcare API, are then routed through the Cloud Healthcare API de-identification process to anonymize the images and associated metadata.

For example, as a medical researcher, you might have access to x-ray images of spinal fractures from patients in an on-premises picture archiving and communication system (PACS). You can move the image pixel data to Cloud Storage by using the Storage Transfer Service, the Transfer Appliance, or one of the hybrid connectivity products. You can then copy or move the data from Cloud Storage to the Cloud Healthcare API. After the data is in the Cloud Healthcare API, you can use it as backup, view it remotely, or allow it to be accessed by approved third-party cloud services and apps.

In another scenario, you might send de-identified DICOM images to AutoML Vision to train a model for helping healthcare teams detect spinal fractures in x-rays. This way, you can build a clinical decision-support tool by using your own data.

Cloud Healthcare API

The Cloud Healthcare API offers a managed solution for storing and accessing healthcare data in Google Cloud, providing a critical bridge between existing care systems and applications hosted on Google Cloud.

Within a Google Cloud project, data ingested through Cloud Healthcare API is stored in a dataset, which resides in a geographic location corresponding to a Google Cloud region. The Cloud Healthcare API supports the regions listed in Regions. For a list of Google Cloud products and the regions in which they are implemented, see Cloud locations.

Because each healthcare data modality—for example, DICOM, Fast Healthcare Interoperability Resources (FHIR), and HL7v2—has different structural and processing characteristics, datasets are split into modality-specific stores.

The following diagram shows how the Cloud Healthcare API organizes clinical data by location, dataset, and store.

Cloud Healthcare API organization of clinical data.

Each dataset contains one or more stores that service the same modality, or different modalities, as needed by the app. Using multiple stores in the same dataset might be appropriate if an app processes different types of data. For example, you might want to separate data according to its source hospital, clinic, or department. An app can access as many datasets or stores as required with no performance penalty. It's important to design your overall dataset and store architecture to meet your organization's broad goals, such as proximity to compute resources or end users, partitioning, or access control.

The following diagram shows two datasets containing HL7v2, DICOM, and FHIR stores.

Architecture of datasets with HL7v2, DICOM, and FHIR stores.

You can copy DICOM images to a DICOM store or stores inside a dataset from a variety of sources. For more information, see Creating and managing DICOM stores.

De-identifying DICOM data

The Cloud Healthcare API includes de-identification tools that can scalably redact (remove) or modify sensitive content in text and images, based on the specified configuration.

These tools operate on text and images encoded in specific medical record formats, such as DICOM and FHIR. When you work with DICOM instances, the components of a de-identification API call are as follows:

  • Source: A dataset or DICOM store that contains one or more DICOM instances with sensitive data. The accompanying tutorial uses a dataset, but you can modify the examples to work on a single DICOM store.
  • What to de-identify: Configuration parameters that specify how to process the dataset. You can configure the DICOM de-identification operation to de-identify DICOM instance metadata by using tag keywords, by obscuring burned-in text in DICOM images, or both.
  • Destination: De-identification doesn't impact the original dataset or its data. Instead, processed copies of the original data are written to a new dataset or DICOM store, called the destination. The accompanying tutorial uses a dataset, but you can modify the examples to work on a DICOM store.
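As a rough sketch, the three components above map onto a single REST request. The following Python snippet only assembles the request URL and JSON body; it doesn't send anything. The endpoint shape (`:deidentify` on the source dataset), the field names (`destinationDataset`, `config`, `filterProfile`, `textRedactionMode`), and the project, location, and dataset names are assumptions based on the v1 REST API and should be checked against the current reference.

```python
import json

API_ROOT = "https://healthcare.googleapis.com/v1"  # assumed API version


def build_deidentify_request(project, location, source_dataset,
                             destination_dataset, config):
    """Return (url, body) for a dataset-level de-identify call."""
    parent = f"projects/{project}/locations/{location}/datasets"
    # Source: the dataset that the operation reads from.
    url = f"{API_ROOT}/{parent}/{source_dataset}:deidentify"
    body = {
        # Destination: a new dataset that receives the processed copies.
        "destinationDataset": f"{parent}/{destination_dataset}",
        # What to de-identify: metadata and burned-in text settings.
        "config": config,
    }
    return url, body


# All resource names below are hypothetical placeholders.
url, body = build_deidentify_request(
    "my-project", "us-central1", "source-dataset", "deid-dataset",
    config={
        "dicom": {"filterProfile": "MINIMAL_KEEP_LIST_PROFILE"},
        "image": {"textRedactionMode": "REDACT_ALL_TEXT"},
    },
)
print(url)
print(json.dumps(body, indent=2))
```

Keeping the request assembly separate from the HTTP call makes it easy to review the configuration before any data is processed.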

The following two images show a sample x-ray image before and after de-identification, where the goal is to remove or modify all metadata and burned-in text associated with the image.

The first image shows an x-ray image with sample PII and PHI data appearing in both metadata and burned-in text.

Sample x-ray image before de-identification (with sample data).

The second image shows the same x-ray image with all sample PII and PHI metadata removed or obscured.

Sample x-ray image after de-identification (with sample data).

After de-identification, all image metadata is removed, and all text burned into the image is obscured with an opaque rectangle. This de-identification configuration is useful when you need only the image pixel data for further analysis, machine learning (ML) model training, or inference.

For example, you might want to train an image classification model to determine whether there is a fracture present in an x-ray. To train this model, you need a large number of image samples—some that include fractured bones and some that don't. However, you won't need any sensitive information, such as patient gender, age, or birthdate, because this information isn't relevant to the model.

Or you might want to analyze the progression of a particular disease in a patient population as the patients age. In this case, you need to know information such as the patient's age and gender, as well as the date of each study, because this information is relevant to clinical analysis. You have the option of keeping some of the metadata, while redacting other identifiable information about patients, such as their names and medical record numbers.

A best practice is to shift the dates in each study so that the relative timelines are preserved while matching the dates to a specific patient becomes nearly impossible. For more information, see date shifting.
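The idea behind date shifting can be illustrated with a minimal sketch. This is not the Cloud Healthcare API's implementation. It assumes a hypothetical secret key and a hypothetical bound of 100 days, and it derives a per-patient offset with HMAC-SHA256 so that every date for the same patient moves by the same amount.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 100  # hypothetical bound on the shift, in either direction


def shift_for_patient(patient_id: str, key: bytes) -> timedelta:
    """Derive a stable per-patient offset in [-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS]."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * MAX_SHIFT_DAYS + 1
    days = int.from_bytes(digest[:8], "big") % span - MAX_SHIFT_DAYS
    return timedelta(days=days)


def shift_dates(patient_id, dates, key):
    """Apply the same offset to every date that belongs to one patient."""
    offset = shift_for_patient(patient_id, key)
    return [d + offset for d in dates]


key = b"example-secret-key"  # hypothetical; keep real keys out of source code
studies = [date(2020, 1, 10), date(2020, 3, 15)]
shifted = shift_dates("patient-123", studies, key)

# The interval between the two studies is unchanged by the shift.
print(shifted[1] - shifted[0] == studies[1] - studies[0])
```

Because the offset depends on the key, someone without the key can't recover the original dates from the shifted ones, yet the spacing between a patient's studies stays intact.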

Required access and Identity and Access Management roles

In Google Cloud, access to resources is managed through Identity and Access Management (IAM) roles. Access to the Cloud Healthcare API requires that your IAM account has the appropriate roles for the function that you want to perform.

You can use either a user account (the one that you use to access the Google Cloud console) or an IAM service account. The accompanying tutorial uses a service account except for medical image viewing, for which you need to use a user account. The general information presented here applies to all account types.

To create the destination dataset, you must have at least the healthcare.datasets.deidentify permission on the source dataset and the healthcare.datasets.create permission on the Google Cloud project. The Healthcare Dataset Administrator IAM role includes both of these permissions.

For information about how to control access to datasets and DICOM stores, see Controlling access to Cloud Healthcare API resources. For information about the required permissions for dataset methods, see Access control for the Cloud Healthcare API.

Medical image viewers

The following DICOM viewers are integrated with the Cloud Healthcare API, and you can use them to view images before and after de-identification:

For the viewer to function properly, your login credentials must have the healthcare.dicomViewer role.

API structure

You can access and manage data in Cloud Healthcare API datasets and stores by using a REST API that identifies each store by its Google Cloud project, location, dataset, store type, and store name. The Cloud Healthcare API implements modality-specific standards for access that are consistent with industry standards for each respective modality. For example, the Cloud Healthcare API natively provides operations for reading DICOM studies and series that are consistent with the DICOMweb standard.

Operations that access a modality-specific store use a request path that consists of a base path and a modality-specific request path. Administrative operations—which generally operate only on locations, datasets, and data stores—might use only the base path.

To reference a particular store within a Cloud Healthcare API dataset, use a base path structured like this:

 /projects/project/locations/location/datasets/dataset/store-type/store-name

Replace the following:

  • project: your Google Cloud project
  • location: the region where your resources are located
  • dataset: the name of your dataset
  • store-type: the type of data store
  • store-name: the name of your data store

Following is an example of a base path:

/projects/MyProj/locations/us-central1/datasets/dataset1/dicomStores/dicomstore1

The preceding path example references a Cloud Healthcare API DICOM store in the Google Cloud project MyProj, in the us-central1 region, in a dataset called dataset1, and with the name of dicomstore1.
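The base path structure can be expressed as a small helper. This is a sketch, not part of any client library. The set of store type names (`dicomStores`, `fhirStores`, `hl7V2Stores`) is an assumption about the REST collection names, and the function name is hypothetical.

```python
# Assumed REST collection names for the three modality-specific store types.
STORE_TYPES = ("dicomStores", "fhirStores", "hl7V2Stores")


def store_base_path(project, location, dataset, store_type, store_name):
    """Assemble the base path for a store from its five components."""
    if store_type not in STORE_TYPES:
        raise ValueError(f"unknown store type: {store_type!r}")
    return (f"/projects/{project}/locations/{location}"
            f"/datasets/{dataset}/{store_type}/{store_name}")


path = store_base_path("MyProj", "us-central1", "dataset1",
                       "dicomStores", "dicomstore1")
print(path)
```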

To access a piece of data, you combine the base path with a request path that is formatted according to the appropriate modality standard. For example, DICOMweb requests to a DICOM store might look like this:

 base-path/dicomWeb/studies/{study_id}/series?PatientName={patient_name}

The base-path part of the path represents a base path specific to this request. The {study_id} part of the path identifies a particular DICOM study, and the patient's name is specified by {patient_name}. In the preceding example, the path specification is consistent with the DICOMweb standard path structure.
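The following sketch combines a base path with the DICOMweb request path shown above. The helper name, the study UID, and the patient name are hypothetical. The query value is percent-encoded because DICOM person names can contain characters such as `^` that aren't safe in a URL.

```python
from urllib.parse import quote, urlencode


def series_search_path(base_path, study_id, patient_name):
    """Build a DICOMweb series search path under a DICOM store."""
    query = urlencode({"PatientName": patient_name})
    study = quote(study_id, safe="")
    return f"{base_path}/dicomWeb/studies/{study}/series?{query}"


base = ("/projects/MyProj/locations/us-central1"
        "/datasets/dataset1/dicomStores/dicomstore1")
request_path = series_search_path(base, "1.2.840.1", "Doe^Jane")
print(request_path)
```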

De-identification using tags and image redaction configuration

De-identification of DICOM data includes two processes:

  • De-identifying DICOM metadata
  • Redacting burned-in text in images

In the Cloud Healthcare API, metadata de-identification is based on DICOM tags, and burned-in text redaction is performed through the TextRedactionMode option.

Using tags and profiles for de-identification

You can de-identify DICOM instances based on tag keywords in the DICOM metadata. The following tag filtering methods are available in the DicomConfig object:

  • keepList: A list of tags to keep. Removes all other tags.
  • removeList: A list of tags to remove. Keeps all other tags.
  • TagFilterProfile: A tag-filtering profile that specifies which tags to keep or remove.

DICOM minimum attribute tags

The following tags are the minimum attributes of a valid DICOM instance within the Cloud Healthcare API:

  • StudyInstanceUID
  • SeriesInstanceUID
  • SOPInstanceUID
  • TransferSyntaxUID
  • MediaStorageSOPInstanceUID
  • MediaStorageSOPClassUID
  • PixelData
  • Rows
  • Columns
  • SamplesPerPixel
  • BitsAllocated
  • BitsStored
  • HighBit
  • PhotometricInterpretation
  • PixelRepresentation
  • NumberOfFrames

keepList

To use the keepList tag filtering method, you need to provide a list of tag names. These tags are the only ones that are retained in the de-identified resources. When you specify keepList tags in the DicomConfig object, the DICOM minimum attribute tags are added by default.

If no keepList tags are provided, then no DICOM tags in the dataset are removed. Generally, when a tag is kept, it appears unchanged in the output compared to the original. However, the StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, and MediaStorageSOPInstanceUID tags are regenerated with new, unique values in the output.

removeList

You can specify a removeList tag in the DicomConfig object. The de-identify operation removes only the tags specified in the list. If no removeList tags are provided, then the de-identification operation proceeds as usual, but no DICOM tags in the destination dataset are redacted.

DICOM minimum attribute tags cannot be added to a removeList.
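The keepList and removeList rules described above can be modelled locally on a plain dictionary of tag keywords. This is an illustrative sketch only: the function names and sample metadata are hypothetical, the code doesn't call the API, and it doesn't model the regeneration of UID tags. Raising an error for a minimum attribute tag in a removeList is one possible reading of "cannot be added"; the service's actual response to such a request isn't specified here.

```python
# Minimum attribute tags from this document, using DICOM keyword spellings.
MINIMUM_TAGS = frozenset({
    "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID",
    "TransferSyntaxUID", "MediaStorageSOPInstanceUID",
    "MediaStorageSOPClassUID", "PixelData", "Rows", "Columns",
    "SamplesPerPixel", "BitsAllocated", "BitsStored", "HighBit",
    "PhotometricInterpretation", "PixelRepresentation", "NumberOfFrames",
})


def apply_keep_list(metadata, keep_tags):
    """Keep only the listed tags plus the minimum attribute tags."""
    if not keep_tags:
        return dict(metadata)  # no keepList tags: nothing is removed
    allowed = set(keep_tags) | MINIMUM_TAGS
    return {k: v for k, v in metadata.items() if k in allowed}


def apply_remove_list(metadata, remove_tags):
    """Drop only the listed tags; reject minimum attribute tags."""
    to_remove = set(remove_tags)
    blocked = sorted(to_remove & MINIMUM_TAGS)
    if blocked:
        raise ValueError(f"minimum attribute tags cannot be removed: {blocked}")
    return {k: v for k, v in metadata.items() if k not in to_remove}


sample = {"PatientName": "Doe^Jane", "PatientAge": "042Y",
          "Rows": 512, "Columns": 512}
print(apply_keep_list(sample, ["PatientAge"]))
print(apply_remove_list(sample, ["PatientName"]))
```

In this sample, both calls produce the same result. The keepList call keeps PatientAge and the two minimum attribute tags, and the removeList call drops only PatientName.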

TagFilterProfile

Rather than specifying which tags to keep or remove, you can use the TagFilterProfile profile. This predefined profile determines how tags are handled and modified. For example, the MINIMAL_KEEP_LIST_PROFILE profile keeps only the tags required to produce valid DICOM resources and removes all the other tags. For more information, see the TagFilterProfile documentation.

We recommend the TagFilterProfile profile as a tag filtering method, especially for non-technical users, because the preselected profile means there's no need to review and understand all the DICOM tags and their contents.

Frequently used profiles

You can perform one of the industry's common de-identification use cases, removing tags based on the DICOM Standard's attribute confidentiality profiles, by using the ATTRIBUTE_CONFIDENTIALITY_BASIC_PROFILE profile.

Another frequently used profile is DEIDENTIFY_TAG_CONTENTS, which inspects metadata within tag contents and replaces sensitive text. When using the DEIDENTIFY_TAG_CONTENTS profile, you can also apply configurations such as information types and primitive transformations. Information types and primitive transformations cannot be applied to the other profiles.

You can use information types to define what data is scanned when performing de-identification with tags. An information type is a type of sensitive data, such as a patient name, email address, telephone number, identification number, or credit card number. For more information, see InfoTypes and infoType detectors.

Primitive transformations are rules that you use for transforming an input value. You can customize how DICOM tags are de-identified by applying a primitive transformation to each tag's information type. For example, you can de-identify a patient's last name and replace it with a series of asterisks. For information about primitive transformations, see primitive transformation options.
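The following sketch shows how the DEIDENTIFY_TAG_CONTENTS profile might be combined with an information type and a character-masking transformation. The field names (`filterProfile`, `text.transformations`, `infoTypes`, `characterMaskConfig`, `maskingCharacter`) are assumptions based on the v1 REST representation and should be verified against the current reference. The `mask` helper is a hypothetical local illustration of what masking does to a value, not the service's implementation.

```python
import json


def mask(value: str, masking_character: str = "*") -> str:
    """Local illustration: replace every character with the mask character."""
    return masking_character * len(value)


def tag_contents_config(info_types, masking_character="*"):
    """Build a de-identify config that masks the given information types."""
    return {
        "dicom": {"filterProfile": "DEIDENTIFY_TAG_CONTENTS"},
        "text": {
            "transformations": [{
                "infoTypes": list(info_types),
                "characterMaskConfig": {"maskingCharacter": masking_character},
            }],
        },
    }


config = tag_contents_config(["PERSON_NAME"])
print(json.dumps(config, indent=2))
print(mask("Doe"))
```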

The accompanying tutorial provides a use case for the MINIMAL_KEEP_LIST_PROFILE profile.

Default information types

By default, the DEIDENTIFY_TAG_CONTENTS profile handles the following information types:

  • AGE
  • CREDIT_CARD_NUMBER
  • DATE
  • EMAIL_ADDRESS
  • IP_ADDRESS
  • LOCATION
  • MAC_ADDRESS
  • PERSON_NAME
  • PHONE_NUMBER
  • SWIFT_CODE
  • US_DRIVERS_LICENSE_NUMBER
  • US_PASSPORT
  • US_SOCIAL_SECURITY_NUMBER
  • US_VEHICLE_IDENTIFICATION_NUMBER
  • US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER

If you need to modify only the information types in the preceding list, you can use the DEIDENTIFY_TAG_CONTENTS profile without any additional parameters.

Redacting burned-in text from images

The Cloud Healthcare API can redact sensitive burned-in text from images. Sensitive data, such as PII or PHI, is detected by the Cloud Healthcare API, which then obscures it with an opaque rectangle. The Cloud Healthcare API returns the same DICOM images as input, but any text identified as containing sensitive information, according to your criteria, is redacted.
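To make the "opaque rectangle" idea concrete, here is a minimal local sketch that fills rectangular regions of a 2D pixel grid with a constant value. It isn't how the Cloud Healthcare API is implemented. Text detection isn't shown, and the bounding box and pixel values are hypothetical inputs.

```python
def redact_regions(pixels, boxes, fill=0):
    """Return a copy of a 2D pixel grid with each box filled.

    Each box is (top, left, bottom, right), with bottom and right exclusive.
    Boxes that extend past the image edges are clipped.
    """
    out = [row[:] for row in pixels]
    for top, left, bottom, right in boxes:
        for r in range(max(top, 0), min(bottom, len(out))):
            for c in range(max(left, 0), min(right, len(out[r]))):
                out[r][c] = fill
    return out


image = [[9] * 6 for _ in range(4)]  # 4 rows x 6 columns of sample pixels
redacted = redact_regions(image, [(0, 1, 2, 4)])
for row in redacted:
    print(row)
```

The function returns a copy and leaves the input grid untouched, which mirrors how de-identification writes to a destination and leaves the source data unchanged.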

You can redact burned-in text from images by specifying a TextRedactionMode option inside of an ImageConfig object:

  • REDACT_ALL_TEXT: Redacts all burned-in text from DICOM images in a dataset.
  • REDACT_SENSITIVE_TEXT: Redacts sensitive burned-in text from DICOM images in a dataset.

When you specify REDACT_SENSITIVE_TEXT, the operation redacts text that matches the default infoTypes and a custom infoType for patient identifiers. As a result, information such as medical record numbers (MRNs) is redacted from images.

For more information about image redaction configuration, see redacting burned-in text from images.
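A small helper can guard against typos in the redaction mode before a request is built. This is a sketch under assumptions: the `image` and `textRedactionMode` field names follow the v1 REST representation, the helper name is hypothetical, and only the two modes named in this document are accepted.

```python
# The two modes described in this document; the API may define others.
TEXT_REDACTION_MODES = {"REDACT_ALL_TEXT", "REDACT_SENSITIVE_TEXT"}


def image_config(mode):
    """Build the image portion of a de-identify config."""
    if mode not in TEXT_REDACTION_MODES:
        raise ValueError(f"unsupported text redaction mode: {mode!r}")
    return {"image": {"textRedactionMode": mode}}


print(image_config("REDACT_SENSITIVE_TEXT"))
```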

What's next