Data de-identification

De-identification is the process of removing identifying information from data. The Cloud Healthcare API supports the de-identification of health information through the deidentify operation. The API detects sensitive data such as personally identifiable information (PII), and then uses a de-identification transformation to mask, delete, or otherwise obscure the data.

De-identification is available for DICOM instances and FHIR resources.

Some use cases for de-identification are:

  • Sharing health information with non-privileged parties
  • Creating datasets from multiple sources and analyzing them
  • Anonymizing data so that it can be used in machine learning models

Dataset de-identification

De-identification in the Cloud Healthcare API occurs at the dataset level. This means that, when you want to de-identify health data, you run the deidentify operation on the entire dataset in which the data resides. The de-identified data is then copied to a new dataset.

You cannot, for example, de-identify the resources in a specific FHIR store within a dataset. If a dataset contains FHIR stores and DICOM stores that hold medical data, then de-identification occurs on all of the FHIR and DICOM data in that dataset.

De-identification does not modify the original dataset or its data. Instead, de-identified copies of the original data are written to a new dataset. The new dataset contains the same items as the original, in the same format, but with sensitive information transformed according to your configuration.
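As a rough illustration of the dataset-level model described above, the following Python sketch assembles the URL and JSON body for a deidentify call without sending anything over the network. The base URL, the `:deidentify` suffix, and the `destinationDataset` and `config` field names reflect the public REST interface as best understood here and should be checked against the API reference. The helper names and the project, location, and dataset IDs are hypothetical, and the sketch assumes the destination dataset is in the same project and location as the source.

```python
import json

# Assumed REST base URL; verify against the current API reference.
BASE_URL = "https://healthcare.googleapis.com/v1"


def dataset_name(project, location, dataset):
    """Return the fully qualified resource name of a dataset."""
    return f"projects/{project}/locations/{location}/datasets/{dataset}"


def build_deidentify_request(project, location, source_dataset,
                             destination_dataset, config=None):
    """Build (url, body) for a dataset-level deidentify call.

    Hypothetical helper: it performs no network I/O and only shows that
    the operation targets a whole source dataset and names a new
    destination dataset for the de-identified copies.
    """
    source = dataset_name(project, location, source_dataset)
    url = f"{BASE_URL}/{source}:deidentify"
    body = {
        "destinationDataset": dataset_name(project, location,
                                           destination_dataset),
        "config": config or {},
    }
    return url, body


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    url, body = build_deidentify_request(
        "my-project", "us-central1", "raw-data", "deid-data")
    print("POST", url)
    print(json.dumps(body, indent=2))
```

Note that the request names only datasets. Nothing in it selects an individual FHIR or DICOM store, which matches the dataset-level behavior described in this section.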

DICOM de-identification

A DICOM instance contains a set of key-value metadata elements (also known as “tags”) and one or more images. The deidentify operation can remove specific tags that contain sensitive data. The operation can also use automated optical character recognition (OCR) to redact burnt-in text on images contained in DICOM instances.
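The two DICOM capabilities above, tag removal and redaction of burnt-in text, could be expressed in a de-identification configuration roughly as follows. This is a minimal sketch: the field names (`dicom.removeList.tags`, `image.textRedactionMode`) and the `REDACT_ALL_TEXT` value are assumed to match the API's configuration schema and should be confirmed in the reference documentation. The helper name and the tag choices are illustrative only.

```python
def build_dicom_config(remove_tags, redact_burnt_in_text=True):
    """Build a de-identification config fragment for DICOM data.

    Hypothetical helper. `remove_tags` is a list of DICOM tag keywords
    to strip from instance metadata; other tags are left in place.
    """
    config = {"dicom": {"removeList": {"tags": list(remove_tags)}}}
    if redact_burnt_in_text:
        # Assumed enum value asking the service to redact all text
        # detected in the image pixels.
        config["image"] = {"textRedactionMode": "REDACT_ALL_TEXT"}
    return config


if __name__ == "__main__":
    # Example tag keywords chosen for illustration.
    print(build_dicom_config(["PatientName", "PatientBirthDate"]))
```

A fragment like this would be supplied as the configuration of the deidentify operation.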

For examples of how to de-identify DICOM data, see De-identifying DICOM data.

FHIR de-identification

Each FHIR resource is a JSON-like object that contains key-value elements. Some elements are standardized, while others are free text. You can use the deidentify operation to:

  • Remove specific values in the resource
  • Process the arbitrary text portions to remove only the sensitive portions, leaving the rest of the data as is
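
The two FHIR options listed above could map onto a configuration along these lines. This is a sketch under assumptions: the field names (`fhir.fieldMetadataList`, `paths`, `action`, `text.transformations`, `infoTypes`, `redactConfig`) and the action values `TRANSFORM` and `INSPECT_AND_TRANSFORM` are believed to match the API's configuration schema but should be verified against the reference. The helper name, the field paths, and the infoType are illustrative placeholders.

```python
def build_fhir_config(structured_paths, free_text_paths, info_types):
    """Build a de-identification config fragment for FHIR data.

    Hypothetical helper.
    - structured_paths: fields whose whole value is transformed.
    - free_text_paths: fields that are inspected so that only the
      sensitive portions are transformed.
    - info_types: kinds of sensitive data to look for in text.
    """
    return {
        "fhir": {
            "fieldMetadataList": [
                {"paths": list(structured_paths), "action": "TRANSFORM"},
                {"paths": list(free_text_paths),
                 "action": "INSPECT_AND_TRANSFORM"},
            ]
        },
        "text": {
            "transformations": [
                # An empty redactConfig is assumed to mean "remove the
                # matched text".
                {"infoTypes": list(info_types), "redactConfig": {}}
            ]
        },
    }


if __name__ == "__main__":
    # Illustrative paths and infoType only.
    print(build_fhir_config(["Patient.name"], ["Observation.note"],
                            ["PERSON_NAME"]))
```

The first entry corresponds to removing specific values. The second corresponds to processing free text so that only its sensitive portions change.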

For examples of how to de-identify FHIR data, see De-identifying FHIR data.