Annotating de-identified data

This page explains how to configure annotation stores and annotation records when de-identifying sensitive FHIR and DICOM data.

Annotating de-identified data overview

Each time you de-identify sensitive FHIR or DICOM data, you can output information about the sensitive data that was removed to an annotation store. This information is stored as one or more annotation records inside the annotation store.

You can create the annotation store in an existing dataset or create it in the new dataset created during the de-identify operation. If you create the annotation store in an existing dataset, an annotation store with the same name cannot already exist in that dataset.

The created annotation store must be in the same project as the de-identified source data. For example, you cannot simultaneously de-identify data in one project and output annotation records to an annotation store in a different project.
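The same-project requirement can be checked locally before you send a request. The following is a minimal sketch, not part of the Cloud Healthcare API: `project_of` and `same_project` are hypothetical helper names, and the parsing assumes resource names of the form projects/PROJECT_ID/... as used throughout this page.

```python
def project_of(resource_name: str) -> str:
    """Return the project ID from a name such as 'projects/P/locations/...'."""
    parts = resource_name.split("/")
    if len(parts) < 2 or parts[0] != "projects" or not parts[1]:
        raise ValueError(f"not a project-scoped resource name: {resource_name!r}")
    return parts[1]


def same_project(source_dataset: str, annotation_store: str) -> bool:
    """True when both resource names belong to the same project."""
    return project_of(source_dataset) == project_of(annotation_store)
```

A caller could reject a request early when `same_project` returns False instead of waiting for the server to refuse it.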

To specify an annotation store and its behavior during de-identification, set the annotation_store_name field inside an annotation object in the DeidentifyConfig object.

You can optionally set the store_quote field, depending on your use case. Information about setting the store_quote field is available in the next section.
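As a sketch of how these fields nest, the following Python assembles a config dictionary with the same shape as the request bodies in the curl samples on this page. `build_deidentify_config` is a hypothetical helper, not a client-library function, and it sends nothing. It passes store_quote as a JSON boolean, which is an assumption; the curl samples pass the string 'true'.

```python
import json


def build_deidentify_config(annotation_store_name, store_quote=False, **type_configs):
    """Build the 'config' object for a de-identify request.

    type_configs carries the data-type sections, for example fhir={} or
    dicom={"filterProfile": "DEIDENTIFY_TAG_CONTENTS"}.
    """
    config = dict(type_configs)
    config["annotation"] = {
        "annotation_store_name": annotation_store_name,
        "store_quote": store_quote,
    }
    return config


config = build_deidentify_config(
    "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID"
    "/annotationStores/ANNOTATION_STORE_ID",
    store_quote=True,
    fhir={},
)
print(json.dumps({"config": config}, indent=2))
```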

Using the store_quote field

The following information applies to both FHIR and DICOM data.

When the store_quote field inside the annotation object in the request is set to true, the annotation record stores the original value of each de-identified item in its quote field. For example:

  • If a DATE is de-identified, and if store_quote is set to true, then the following information displays in the annotation record:

    • The value of the date (such as 1980-12-05), displayed in the quote field
    • The DATE infoType
    • The start and end locations of where the data was found. Both locations use a zero-based index; the start location is inclusive and the end location is exclusive.
  • If store_quote is set to false, then the date (1980-12-05) does not display in the annotation record, and only the following information displays:

    • The DATE infoType
    • The start and end locations of where the data was found. Both locations use a zero-based index; the start location is inclusive and the end location is exclusive.
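In the sample FHIR annotation record later on this page, the finding for 1980-12-05 has a start of 81 and an end of 91, a span of ten characters, so a finding's value corresponds to the slice text[start:end] of the inspected text. The following sketch illustrates that relationship for a record written with store_quote set to false. The `quote_from_finding` helper and the sample text are hypothetical. The sketch converts start and end from strings because that is how they appear in the sample responses.

```python
def quote_from_finding(text: str, finding: dict) -> str:
    """Recover the flagged value from the original text using start and end."""
    start = int(finding.get("start", 0))  # treat a missing start as 0 (assumption)
    end = int(finding["end"])
    return text[start:end]


# Hypothetical source text and a finding without a quote field.
text = "Born 1980-12-05 in Springfield"
finding = {"infoType": "DATE", "start": "5", "end": "15"}
print(quote_from_finding(text, finding))  # prints 1980-12-05
```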

Annotations for de-identified FHIR data

This section builds on concepts explained in De-identifying FHIR data using the Cloud Healthcare API.

Annotation record structure

The de-identify operation creates one annotation record for each de-identified FHIR resource. Each annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed. For a de-identified field to appear in the annotation record, it must have the INSPECT_AND_TRANSFORM Action applied to it.

Configuring annotations for de-identified FHIR data

The following samples use the Default FHIR data de-identification as their starting point. The samples show how to de-identify a Patient resource using the FHIR default method and store information about the de-identified data in an annotation record in a new annotation store. In the samples, the store_quote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.

The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.

curl

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'fhir': {},
        'annotation': {
          'annotation_store_name': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'store_quote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns the response in JSON format:

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"

If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}
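The status check above can be repeated until the response contains "done": true. The following is a minimal polling sketch under stated assumptions: `wait_for_operation` is a hypothetical helper, and `get_operation` is any callable you supply that performs the GET request and returns the parsed JSON as a dictionary. The check for an error field follows the usual long-running operation shape and is not shown in the samples on this page.

```python
import time


def wait_for_operation(get_operation, interval_seconds=5.0, max_attempts=60,
                       sleep=time.sleep):
    """Call get_operation() until the operation reports done, then return its response."""
    for _ in range(max_attempts):
        operation = get_operation()
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation.get("response", {})
        sleep(interval_seconds)
    raise TimeoutError("operation did not finish within the allowed attempts")
```

Passing the sleep function as a parameter keeps the helper easy to exercise without real delays.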

After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores"

If the request is successful, the server returns the response in JSON format:

{
  "annotationStores": [
    {
      "name": "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID"
    },
    {
      ...
    }
  ]
}

Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"

If the request is successful, the server returns the response in JSON format:

{
  "annotations": [
    "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID",
    ...
  ]
}
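Each entry in the annotations list is a full resource name, and the record ID is its final path segment. As a small sketch (`record_ids` is a hypothetical helper name), the IDs can be pulled out of a parsed list response like this:

```python
def record_ids(list_response: dict) -> list:
    """Return the final path segment of each listed annotation record name."""
    return [name.rsplit("/", 1)[-1] for name in list_response.get("annotations", [])]
```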

Use the ANNOTATION_RECORD_ID value to view the annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

The textAnnotation object contains information about sensitive text that the de-identification operation removed. In the details field, you can see that the operation searched the patient.text.div object and reported four findings (three PERSON_NAME and one DATE), along with their values and the locations where the values were found.

With the default FHIR de-identification, only the data in the patient.text.div object was inspected and transformed. All other de-identified data was transformed without being inspected because its infoType was already declared in the original FHIR resource.

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/fhirStores/FHIR_STORE_ID/fhir/Patient/PATIENT_ID"
    }
  },
  "textAnnotation": {
    "details": {
      "patient.text.div": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "start": "42",
            "end": "54",
            "quote": "Smith, Darcy"
          },
          {
            "infoType": "PERSON_NAME",
            "start": "42",
            "end": "47",
            "quote": "Smith"
          },
          {
            "infoType": "PERSON_NAME",
            "start": "49",
            "end": "54",
            "quote": "Darcy"
          },
          {
            "infoType": "DATE",
            "start": "81",
            "end": "91",
            "quote": "1980-12-05"
          }
        ]
      }
    }
  }
}
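A record like the one above can be flattened into one row per finding for review or reporting. The following is a minimal sketch that assumes the record has already been parsed from JSON into a dictionary; `flatten_findings` is a hypothetical helper, and start and end are converted from the string form shown in the sample.

```python
from collections import Counter


def flatten_findings(record: dict) -> list:
    """Return one dict per finding, tagged with the field it was found in."""
    rows = []
    details = record.get("textAnnotation", {}).get("details", {})
    for field_path, detail in details.items():
        for finding in detail.get("findings", []):
            rows.append({
                "field": field_path,
                "infoType": finding["infoType"],
                "start": int(finding.get("start", 0)),  # missing start treated as 0
                "end": int(finding["end"]),
                "quote": finding.get("quote"),  # absent when store_quote is false
            })
    return rows


def count_by_info_type(rows: list) -> Counter:
    """Count findings per infoType."""
    return Counter(row["infoType"] for row in rows)
```

Note that findings can overlap: in the sample, the span for "Smith, Darcy" (42 to 54) contains the spans for "Smith" (42 to 47) and "Darcy" (49 to 54), so a count of findings is not a count of distinct values.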

Annotations for de-identified DICOM data

This section builds on concepts explained in De-identifying DICOM data using the Cloud Healthcare API.

Annotation record structure

The de-identify operation creates two types of annotation records for de-identified DICOM data. The two types of annotation records are:

  • Text annotation records: Contain metadata, such as DICOM tags, from the de-identified data. Each text annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed. For a de-identified tag to appear in the annotation record, it must have been inspected for protected health information (PHI) based on the configuration provided in the TagFilterProfile field. For example, the samples in Configuring annotations for de-identified DICOM data use the DEIDENTIFY_TAG_CONTENTS configuration.
  • Image annotation records: Contain the location of sensitive information in individual DICOM frames. Each image annotation record contains an imageAnnotation object that holds the coordinates for the found sensitive information.

The de-identify operation creates one text annotation record for a DICOM instance and one image annotation record for each frame in the instance. For example, if a DICOM instance has three frames, the de-identify operation creates the following annotation records:

  • One text annotation record, containing a textAnnotation, for the DICOM tags in the DICOM instance.
  • Three image annotation records, one for each of the three frames, each containing an imageAnnotation. Each image annotation record contains a frame_index field to indicate the frame that the record corresponds to.

All four of these annotation records have the same cloudHealthcareSource.name value, which is the DICOM instance path in the format: projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID.
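Because the records for one instance share a cloudHealthcareSource.name value, they can be grouped by that value to confirm the expected mix of text and image records. The following sketch assumes the records have been fetched and parsed into dictionaries; `summarize_by_source` is a hypothetical helper, and it classifies a record by whether a textAnnotation or imageAnnotation key is present.

```python
from collections import defaultdict


def summarize_by_source(records: list) -> dict:
    """Count text and image annotation records per source DICOM instance."""
    summary = defaultdict(lambda: {"text": 0, "image": 0})
    for record in records:
        source = record["annotationSource"]["cloudHealthcareSource"]["name"]
        if "textAnnotation" in record:
            summary[source]["text"] += 1
        elif "imageAnnotation" in record:
            summary[source]["image"] += 1
    return dict(summary)
```

For the three-frame instance described above, the summary for its instance path would be one text record and three image records.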

Configuring annotations for de-identified DICOM data

The following samples use Combining tag de-identification and burnt-in text redaction as their starting point. The samples show how to de-identify a DICOM instance to redact all burnt-in text in the image and inspect and transform sensitive text. The samples also show how to store information about the de-identified data in an annotation record in a new annotation store. In the samples, the store_quote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.

The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.

curl

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'dicom': {
          'filterProfile': 'DEIDENTIFY_TAG_CONTENTS'
        },
        'image': {
          'textRedactionMode': 'REDACT_ALL_TEXT'
        },
        'annotation': {
          'annotation_store_name': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'store_quote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns the response in JSON format:

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"

If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}

After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the new annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores"

If the request is successful, the server returns the response in JSON format:

{
  "annotationStores": [
    {
      "name": "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID"
    },
    {
      ...
    }
  ]
}

Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"

If the request is successful, the server returns the response in JSON format:

{
  "annotations": [
    "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
    "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
    ...
  ]
}

You can see that two annotation records were created: a text annotation record and an image annotation record.

First, use the TEXT_ANNOTATION_RECORD_ID value to view the text annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

The textAnnotation object contains information about the sensitive text that the de-identification operation removed. In the details field, you can see that the operation provided a list of DICOM tags. For each tag in which sensitive data was found, the findings object shows the infoType, the original value (in the quote field), and the location where the value was found.

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "textAnnotation": {
    "details": {
      "00080070": {},
      "00080090": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "end": "8",
            "quote": "John Doe"
          }
        ]
      },
      "00081090": {},
      "00100010": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "end": "11",
            "quote": "Ann Johnson"
          }
        ]
      },
      "00100020": {},
      "00100030": {
        "findings": [
          {
            "infoType": "DATE",
            "end": "8",
            "quote": "19880812"
          }
        ]
      },
      "00020013": {
        "findings": [
          {
            "infoType": "LOCATION",
            "end": "5",
            "quote": "OFFIS"
          }
        ]
      },
      "00080020": {
        "findings": [
          {
            "infoType": "DATE",
            "end": "8",
            "quote": "20110909"
          }
        ]
      }
    }
  }
}
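The details object above mixes tags that have findings with tags that map to an empty object. The following sketch separates the two groups. `split_tags` is a hypothetical helper that works on the parsed record. The findings in the sample have no start value, and each end equals the length of its quote, which is consistent with a start of 0, so the sketch treats a missing start as 0; that default is an assumption.

```python
def split_tags(record: dict):
    """Return (tags with findings, tags without findings) from a text annotation record."""
    details = record.get("textAnnotation", {}).get("details", {})
    with_findings = {}
    without_findings = []
    for tag, detail in details.items():
        findings = detail.get("findings", [])
        if findings:
            with_findings[tag] = [
                (f["infoType"], int(f.get("start", 0)), int(f["end"]), f.get("quote"))
                for f in findings
            ]
        else:
            without_findings.append(tag)
    return with_findings, without_findings
```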

Next, use the IMAGE_ANNOTATION_RECORD_ID value to view the image annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

Inside the imageAnnotation object, the boundingPolys field contains multiple bounding polygons. Each polygon has four vertices, given as X/Y points, that bound a location where the de-identification operation detected sensitive image data or burnt-in text.

{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "imageAnnotation": {
    "boundingPolys": [
      {
        "vertices": [
          {
            "x": 439,
            "y": 919
          },
          {
            "x": 495,
            "y": 919
          },
          {
            "x": 495,
            "y": 970
          },
          {
            "x": 439,
            "y": 970
          }
        ]
      },
      {
        "vertices": [
          {
            "x": 493,
            "y": 919
          },
          {
            "x": 610,
            "y": 919
          },
          {
            "x": 610,
            "y": 972
          },
          {
            "x": 493,
            "y": 972
          }
        ]
      },
      {
        "vertices": [
        ...
        ]
      },
      ...
    ]
  }
}
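For drawing or auditing the redacted regions, each polygon in the record above can be reduced to an axis-aligned box. The following is a minimal sketch that works on the parsed record; `bounding_boxes` is a hypothetical helper, and the (x, y, width, height) tuple layout is a choice made for this example rather than anything defined by the API.

```python
def bounding_boxes(record: dict) -> list:
    """Convert each bounding polygon into an (x, y, width, height) tuple."""
    boxes = []
    polys = record.get("imageAnnotation", {}).get("boundingPolys", [])
    for poly in polys:
        vertices = poly.get("vertices", [])
        if not vertices:
            continue  # skip polygons that carry no points
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        boxes.append((min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)))
    return boxes
```

Applied to the first two polygons in the sample, this yields (439, 919, 56, 51) and (493, 919, 117, 53).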