Using the Cloud Healthcare API to de-identify medical images

Last reviewed 2023-03-28 UTC

This tutorial shows you how to use the DICOM de-identification operation of the Cloud Healthcare API to remove or modify personally identifying information (PII) and protected health information (PHI) in Digital Imaging and Communications in Medicine (DICOM) data. De-identifying DICOM data helps to ensure patient privacy and to prepare healthcare data for use in research, data sharing, and machine learning.

The tutorial and its accompanying conceptual document, De-identification of medical images through the Cloud Healthcare API, are intended for researchers, data scientists, IT teams, and healthcare and life sciences organizations. This tutorial guides you through two common use cases of de-identifying medical image data by using the Cloud Healthcare API. The conceptual document explains the rationale of DICOM data de-identification and outlines its high-level steps.

This tutorial assumes that you have a fundamental knowledge of Linux. A basic understanding of Google Cloud and DICOM standards is also helpful. Run all commands in this tutorial in a Linux terminal.

Objectives

  • Use the DICOM de-identification operation of the Cloud Healthcare API to remove or modify PII and PHI in DICOM instances in a DICOM store.
  • Remove or modify PII and PHI metadata and burned-in text in one Cloud Healthcare API call.
  • Use either the curl command-line tool or the Google Cloud CLI to make DICOM de-identification Cloud Healthcare API calls.

Costs

In this document, you use billable components of Google Cloud, including the Cloud Healthcare API.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

This tutorial assumes that your DICOM images have already been imported into a DICOM store. For information about creating DICOM stores on Google Cloud, see Creating and managing DICOM stores. For information about importing DICOM data into DICOM stores, see Importing and exporting DICOM data using Cloud Storage.

Additionally, this tutorial assumes that:

  • You're working in a project called MyProj.
  • You've created a dataset called dataset1 in the us-central1 Google Cloud region in MyProj.
  • You've created a DICOM store called dicomstore1 in dataset1.

If your resources are named differently, you'll need to modify the commands listed in this document accordingly.

  1. In the Google Cloud console, go to the Project selector page.
  2. Select a Google Cloud project called MyProj.
  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Cloud Healthcare API.

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
  7. In a shell, run the gcloud components update command to make sure that you have the latest version of the gcloud CLI that includes Cloud Healthcare API-related functionality.
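The console steps for selecting the project and enabling the API can also be approximated from a shell. The following is a minimal sketch, assuming the project MyProj used in this tutorial and an account that is allowed to enable services. The function name enable_healthcare_api is illustrative and is not part of the gcloud CLI.

```shell
#!/usr/bin/env bash
# Sketch: select the project and enable the Cloud Healthcare API.
# The function name is hypothetical; the gcloud commands are standard ones.
enable_healthcare_api() {
  local project_id="$1"
  # Make the project the default for later gcloud commands.
  gcloud config set project "$project_id" &&
  # healthcare.googleapis.com is the service name of the Cloud Healthcare API.
  gcloud services enable healthcare.googleapis.com --project "$project_id"
}

# Usage (not run automatically):
#   enable_healthcare_api MyProj
```

Wrapping the commands in a function keeps the second command from running if the first one fails.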

Creating an IAM service account

The Healthcare Dataset Administrator role includes all of the permissions required for this tutorial.

  1. Create a service account.

  2. Assign the Healthcare Dataset Administrator role to the service account.

  3. Create and download the service account JSON key.

  4. Activate your service account key:

    gcloud auth activate-service-account --key-file=path-to-key-file
    

    The output is the following:

    Activated service account credentials for: [key-name@project-name.iam.gserviceaccount.com]
    
    • key-name is the name that you assigned to the service account key.
    • project-name is the name of your Google Cloud project.
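The four steps above can also be scripted with the gcloud CLI. The following is a minimal sketch, not the only way to do it. The service account name deid-tutorial-sa and the key file path are hypothetical values chosen for illustration, and the sketch assumes that roles/healthcare.datasetAdmin is the role ID that corresponds to Healthcare Dataset Administrator.

```shell
#!/usr/bin/env bash
# Sketch: create a service account, grant it the Healthcare Dataset
# Administrator role, download a JSON key, and activate the key.
# SA_NAME and KEY_FILE are hypothetical values chosen for illustration.
SA_NAME="deid-tutorial-sa"
KEY_FILE="./deid-tutorial-sa-key.json"

setup_service_account() {
  local project_id="$1"
  local sa_email="${SA_NAME}@${project_id}.iam.gserviceaccount.com"

  # Step 1: create the service account.
  gcloud iam service-accounts create "$SA_NAME" --project "$project_id" &&
  # Step 2: assign the role at the project level.
  gcloud projects add-iam-policy-binding "$project_id" \
    --member "serviceAccount:${sa_email}" \
    --role "roles/healthcare.datasetAdmin" &&
  # Step 3: create and download the JSON key.
  gcloud iam service-accounts keys create "$KEY_FILE" \
    --iam-account "$sa_email" &&
  # Step 4: activate the key for later gcloud commands.
  gcloud auth activate-service-account --key-file="$KEY_FILE"
}

# Usage (not run automatically):
#   setup_service_account MyProj
```

Treat the downloaded key file as a secret, and delete it when you finish the tutorial.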

Using a medical image viewer

This tutorial uses the Mach7 diagnostic viewer as a medical image viewer. You can request a demonstration version of the viewer at the Mach7 website.

To use this viewer, assign the Healthcare DICOM Viewer role to your user account by performing the following steps:

  1. As an administrator in the Google Cloud console, go to the IAM page.

  2. Click Add.

  3. In the New principals field, enter your user account or your Gmail address.

  4. In the Select a role drop-down list, select Cloud Healthcare.

  5. Hold your pointer over Cloud Healthcare and then select the Healthcare DICOM Viewer role.

  6. Click Save.
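The same role assignment can be made from a shell instead of the console. This is a minimal sketch, assuming that roles/healthcare.dicomViewer is the role ID that corresponds to Healthcare DICOM Viewer. The function name and the example email address are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: grant the Healthcare DICOM Viewer role to a user account.
# The function name is hypothetical.
grant_dicom_viewer() {
  local project_id="$1"
  local user_email="$2"
  gcloud projects add-iam-policy-binding "$project_id" \
    --member "user:${user_email}" \
    --role "roles/healthcare.dicomViewer"
}

# Usage (not run automatically; the address is a placeholder):
#   grant_dicom_viewer MyProj alice@example.com
```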

To use the viewer for production purposes, you need to obtain a full version.

Obtaining an OAuth 2.0 access token

To call the Cloud Healthcare API, you need an OAuth 2.0 access token. Some of the example requests in this tutorial use the curl command-line tool. These examples run the gcloud auth print-access-token command to obtain an OAuth 2.0 bearer token and include the token in the request's Authorization header, so you don't need to obtain a token separately. For more information about this command, see gcloud auth application-default print-access-token.

This tutorial covers two of the most common use cases of removing identifying information from DICOM data. In both cases, the solution is provided by using either the curl command-line tool or the Google Cloud CLI. For more information about de-identifying DICOM data by using the Cloud Healthcare API, configuration options, and sample curl and Windows PowerShell commands, see De-identifying DICOM data.

Set up environment variables

This step applies to both use cases.

  • Export the environment variables based on the location and attributes of the DICOM store where your images are stored.

    export PROJECT_ID=MyProj
    export REGION=us-central1
    export SOURCE_DATASET_ID=dataset1
    export DICOM_STORE_ID=dicomstore1
    export DESTINATION_DATASET_ID=deid-dataset1
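Every request in this tutorial builds resource names from these variables, so a missing variable produces a malformed URL rather than a clear error. The following minimal sketch shows one way to check the variables and to assemble a dataset name. The helper names dataset_name and require_env are hypothetical, and the sketch assumes that the variables are exported as shown above.

```shell
#!/usr/bin/env bash
# Sketch: validate the exported variables and build dataset resource names.
# Both helper names are hypothetical.

# Print the full resource name of a dataset in the configured project/region.
dataset_name() {
  printf 'projects/%s/locations/%s/datasets/%s' "$PROJECT_ID" "$REGION" "$1"
}

# Return non-zero, with a readable message, if any named variable is not
# exported or is empty. printenv only sees exported variables.
require_env() {
  local name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage (not run automatically):
#   require_env PROJECT_ID REGION SOURCE_DATASET_ID DESTINATION_DATASET_ID
#   dataset_name "$DESTINATION_DATASET_ID"
```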
    

Use case I: Removing all metadata and redacting all burned-in text

This use case shows how to de-identify a dataset containing DICOM stores and DICOM data by removing all metadata (except for the minimum data required for a valid DICOM resource) and by redacting all burned-in text from DICOM images. To do so, you perform the following steps:

  • Create a POST request and provide the name of the destination dataset and an access token.
  • Remove all metadata and create a set of minimum keepList tags to have a valid DICOM resource.
  • Redact all sensitive burned-in text from the DICOM image by creating a DeidentifyConfig object with image.text_redaction_mode set to REDACT_ALL_TEXT.

You can perform all of these steps in a single curl command, like the following:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/$PROJECT_ID/locations/$REGION/datasets/$DESTINATION_DATASET_ID',
      'config': {
        'dicom': {'keepList': {
           'tags': [
              'StudyInstanceUID',
              'SOPInstanceUID',
              'TransferSyntaxUID',
              'PixelData',
              'Columns',
              'NumberOfFrames',
              'PixelRepresentation',
              'MediaStorageSOPClassUID',
              'MediaStorageSOPInstanceUID',
              'Rows',
              'SamplesPerPixel',
              'BitsAllocated',
              'HighBit',
              'PhotometricInterpretation',
              'BitsStored'
           ]}
        },
        'image': {
          'textRedactionMode': 'REDACT_ALL_TEXT'
        }
      }
    }" "https://healthcare.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID:deidentify"

Alternatively, you can complete the same de-identification operation without knowing or specifying any tag name by using the MINIMAL_KEEP_LIST_PROFILE tag filter profile. See the following example:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/$PROJECT_ID/locations/$REGION/datasets/$DESTINATION_DATASET_ID',
      'config': {
        'dicom': {'filterProfile': 'MINIMAL_KEEP_LIST_PROFILE'},
        'image': {
          'textRedactionMode': 'REDACT_ALL_TEXT'
        }
      }
    }" "https://healthcare.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID:deidentify"
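The preceding commands embed the request body in a double-quoted shell string with single-quoted keys. If you prefer strictly valid JSON, one option is to generate the body with a here-document and pass it to curl on standard input. The following is a minimal sketch under that assumption. The function names deidentify_body and deidentify_dataset are hypothetical, and the sketch assumes that the environment variables from this tutorial are set.

```shell
#!/usr/bin/env bash
# Sketch: build the request body as strict JSON and send it on stdin.
# Assumes PROJECT_ID, REGION, SOURCE_DATASET_ID, and DESTINATION_DATASET_ID
# are set as shown earlier. Both function names are hypothetical.

# Print the request body; the unquoted here-document expands the variables.
deidentify_body() {
  cat <<EOF
{
  "destinationDataset": "projects/${PROJECT_ID}/locations/${REGION}/datasets/${DESTINATION_DATASET_ID}",
  "config": {
    "dicom": {"filterProfile": "MINIMAL_KEEP_LIST_PROFILE"},
    "image": {"textRedactionMode": "REDACT_ALL_TEXT"}
  }
}
EOF
}

# Send the body to the dataset deidentify method; --data @- reads from stdin.
deidentify_dataset() {
  deidentify_body | curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data @- \
    "https://healthcare.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/datasets/${SOURCE_DATASET_ID}:deidentify"
}

# Usage (not run automatically):
#   deidentify_dataset
```

Keeping the body in its own function makes it possible to inspect or validate the JSON before any request is sent.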

For either of the preceding curl commands, if the request is successful, the server returns a response in JSON format, like the following:

{
  "name": "projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME"
}

The response contains an operation name. You can use the operation name with the Operation get method to track the status of the operation.

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
"https://healthcare.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME"

If the request is successful, the server returns a response in JSON format. After the de-identification process completes, the response includes "done": true.

{
  "name": "projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "2018-01-01T00:00:00Z",
    "endTime": "2018-01-01T00:00:00Z"
  },
  "done": true,
  "response": {
    "@type": "...",
    "successStoreCount": "SUCCESS_STORE_COUNT"
  }
}
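Because de-identification runs as a long-running operation, you typically repeat the status request until the response contains "done": true. The following is a minimal polling sketch, assuming the environment variables from this tutorial are set. The function names, the 10-second default interval, and the 60-attempt limit are hypothetical choices, and the done check is a simple text match rather than a full JSON parse.

```shell
#!/usr/bin/env bash
# Sketch: poll a long-running operation until it reports "done": true.
# Assumes PROJECT_ID, REGION, and SOURCE_DATASET_ID are set as shown earlier.
# Function names, the interval, and the attempt limit are hypothetical.

# Print the last path segment of an operation name (the operation ID).
operation_id() {
  printf '%s' "${1##*/}"
}

# Succeed if a JSON operation response contains "done": true.
is_done() {
  printf '%s\n' "$1" | grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

# Usage: wait_for_operation OPERATION_NAME [INTERVAL_SECONDS] [MAX_ATTEMPTS]
wait_for_operation() {
  local op_id interval max_attempts attempt response
  op_id="$(operation_id "$1")"
  interval="${2:-10}"
  max_attempts="${3:-60}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    response="$(curl -s \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://healthcare.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/datasets/${SOURCE_DATASET_ID}/operations/${op_id}")"
    if is_done "$response"; then
      # Print the final response so the caller can inspect it.
      printf '%s\n' "$response"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Operation ${op_id} did not finish after ${max_attempts} attempts" >&2
  return 1
}
```

A finished operation can contain an error field instead of a response field, so inspect the printed response before you rely on the destination dataset.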

You can also use the Google Cloud CLI to call the Cloud Healthcare API, including its de-identification operations. For a complete list of available commands, see the Cloud Healthcare API gcloud documentation or run the following command:

gcloud healthcare --help

The following sample shows how to use the gcloud CLI to de-identify a dataset containing DICOM stores and DICOM data in order to remove all metadata and redact all burned-in text from DICOM images.

gcloud healthcare datasets deidentify $SOURCE_DATASET_ID \
--location $REGION \
--dicom-filter-tags=StudyInstanceUID,SOPInstanceUID,TransferSyntaxUID,PixelData,Columns,NumberOfFrames,PixelRepresentation,MediaStorageSOPClassUID,MediaStorageSOPInstanceUID,Rows,SamplesPerPixel,BitsAllocated,HighBit,PhotometricInterpretation,BitsStored \
--text-redaction-mode all \
--destination-dataset projects/$PROJECT_ID/locations/$REGION/datasets/$DESTINATION_DATASET_ID \
--async

If the request is successful, the server returns a response like the following:

Request issued for: [$SOURCE_DATASET_ID]
Check operation [OPERATION_NAME] for status.
name: projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME

To check the operation status, run the following command:

gcloud healthcare operations describe --dataset $SOURCE_DATASET_ID --location $REGION OPERATION_NAME

If the request is successful, the server returns a response like the following. After the de-identification process completes, the response contains "done": true.

done: true
metadata:
  '@type': type.googleapis.com/google.cloud.healthcare.v1.OperationMetadata
  apiMethodName: google.cloud.healthcare.v1.dataset.DatasetService.DeidentifyDataset
  createTime: '2018-01-01T00:00:00Z'
  endTime: '2018-01-01T00:00:00Z'
name: "projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME"
response:
  '@type': type.googleapis.com/google.cloud.healthcare.v1.dataset.DeidentifySummary
  successResourceCount: 'SUCCESS_RESOURCE_COUNT'
  successStoreCount: 'SUCCESS_STORE_COUNT'
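To wait for completion with the gcloud CLI instead of checking by hand, you can repeat the describe command and read only the done field. The following is a minimal sketch. It assumes that the global --format flag with the value(done) projection prints a true value once the operation finishes; the comparison is case-insensitive because the exact capitalization is an assumption. The function name, interval, and attempt limit are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait for an operation by polling "gcloud healthcare operations
# describe". Assumes SOURCE_DATASET_ID and REGION are set as shown earlier.
# The function name, interval, and attempt limit are hypothetical.

# Usage: wait_for_operation_gcloud OPERATION_ID [INTERVAL_SECONDS] [MAX_ATTEMPTS]
wait_for_operation_gcloud() {
  local op_id interval max_attempts attempt done_value
  op_id="$1"
  interval="${2:-10}"
  max_attempts="${3:-60}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Ask gcloud to print only the "done" field of the operation.
    done_value="$(gcloud healthcare operations describe "$op_id" \
      --dataset "$SOURCE_DATASET_ID" --location "$REGION" \
      --format 'value(done)')"
    # Lowercase the value before comparing, so True and true both match.
    if [ "$(printf '%s' "$done_value" | tr 'A-Z' 'a-z')" = "true" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Operation ${op_id} did not finish after ${max_attempts} attempts" >&2
  return 1
}
```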

Use case II: Modifying metadata and redacting sensitive burned-in text

This use case shows how to de-identify a dataset containing DICOM stores and DICOM data by using the filterProfile tag filtering method to remove some metadata, modify other metadata, and redact sensitive burned-in text associated with images. The goal is to redact the PERSON_NAME value, replace the PHONE_NUMBER value with asterisks, and shift the DATE and DATE_OF_BIRTH values to dates within 100 days of the original values.

In this use case, the provided crypto key, U2FsdGVkX19bS2oZsdbK9X5zi2utBn22uY+I2Vo0zOU=, is a base64-encoded, AES-encrypted 256-bit key that was generated by using the following command. When the command prompts for a password, an empty password is provided:

 echo -n "test" | openssl enc -e -aes-256-ofb -a -salt
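The key shown in this tutorial is published, so it is suitable only for demonstration. For your own data, you would normally generate a fresh random key. The following is a minimal sketch of one way to do that, assuming that the cryptoKey field accepts any base64-encoded 256-bit value. The function name generate_crypto_key is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: generate a random 256-bit key and base64-encode it.
# The function name is hypothetical.
generate_crypto_key() {
  # 32 random bytes are 256 bits; their base64 form is 44 characters long.
  head -c 32 /dev/urandom | base64
}

# Usage (not run automatically):
#   CRYPTO_KEY="$(generate_crypto_key)"
```

Running openssl rand -base64 32 produces a value of the same form. Store the key securely, because anyone who holds it and the patient identifiers might be able to work out the date shift that was applied.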

To do so, you perform the following steps:

  • Create a POST request and provide the name of the destination dataset and an access token.
  • Remove some metadata and modify other metadata in DICOM tags using the DEIDENTIFY_TAG_CONTENTS filter profile with appropriate combinations of info types and primitive transformations.
  • Redact burned-in text from a DICOM image by setting image.text_redaction_mode to REDACT_SENSITIVE_TEXT.

You can perform all of these steps in a single curl command, like the following:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/$PROJECT_ID/locations/$REGION/datasets/$DESTINATION_DATASET_ID',
      'config': {
        'dicom': {'filterProfile': 'DEIDENTIFY_TAG_CONTENTS'},
        'text': {
          'transformations': [
            {'infoTypes': ['PERSON_NAME'], 'redactConfig': {}},
            {'infoTypes': ['PHONE_NUMBER'], 'characterMaskConfig': {'maskingCharacter': '*'}},
            {'infoTypes': ['DATE', 'DATE_OF_BIRTH'], 'dateShiftConfig': {'cryptoKey': 'U2FsdGVkX19bS2oZsdbK9X5zi2utBn22uY+I2Vo0zOU='}}
          ]
        },
        'image': {'textRedactionMode': 'REDACT_SENSITIVE_TEXT'}
      }
    }" \
"https://healthcare.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns a response in JSON format like the following:

{
  "name": "projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
"https://healthcare.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME"

If the request is successful, the server returns the following response in JSON format:

{
  "name": "projects/$PROJECT_ID/locations/$REGION/datasets/$SOURCE_DATASET_ID/operations/OPERATION_NAME",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "2018-01-01T00:00:00Z",
    "endTime": "2018-01-01T00:00:00Z"
  },
  "done": true,
  "response": {
    "@type": "...",
    "successStoreCount": "SUCCESS_STORE_COUNT"
  }
}
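After the operation reports "done": true, one way to sanity-check the result is to list the DICOM stores in the destination dataset. The following is a minimal sketch, assuming that the operation created the destination dataset and that the environment variables from this tutorial are set. The function name list_deidentified_stores is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the DICOM stores in the de-identified destination dataset.
# Assumes PROJECT_ID, REGION, and DESTINATION_DATASET_ID are set as shown
# earlier. The function name is hypothetical.
list_deidentified_stores() {
  gcloud healthcare dicom-stores list \
    --dataset "$DESTINATION_DATASET_ID" \
    --location "$REGION" \
    --project "$PROJECT_ID"
}

# Usage (not run automatically):
#   list_deidentified_stores
```

You can then open the listed store in a medical image viewer to confirm that metadata and burned-in text were handled as expected.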

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

  • Delete the destination datasets. If needed, add the --location parameter and specify the region for your dataset.

    gcloud healthcare datasets delete $DESTINATION_DATASET_ID
    

What's next