Annotating de-identified data

This page explains how to configure annotation stores and annotation records when de-identifying sensitive FHIR and DICOM data.

Annotating de-identified data overview

Each time you de-identify sensitive FHIR or DICOM data, you can output information about the sensitive data that was removed to an annotation store. This information is stored as one or more annotation records inside the annotation store.

You can create the annotation store in an existing dataset or in the new dataset that the de-identify operation creates. If you create the annotation store in an existing dataset, an annotation store with the same name cannot already exist in that dataset.

The created annotation store must be in the same project as the de-identified source data. For example, you cannot simultaneously de-identify data in one project and output annotation records to an annotation store in a different project.
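
The same-project requirement can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Cloud Healthcare API: the helper names project_of and same_project are hypothetical, the project and dataset IDs are made up, and the parsing assumes resource names in the projects/PROJECT_ID/... format used on this page.

```python
def project_of(resource_name: str) -> str:
    """Return the project ID from a name like 'projects/PROJECT_ID/locations/...'."""
    parts = resource_name.split("/")
    if len(parts) < 2 or parts[0] != "projects" or not parts[1]:
        raise ValueError(f"not a project-scoped resource name: {resource_name!r}")
    return parts[1]


def same_project(source_dataset: str, annotation_store: str) -> bool:
    """Return True when both resource names point at the same project."""
    return project_of(source_dataset) == project_of(annotation_store)


# Hypothetical resource names, used only for illustration.
source = "projects/my-project/locations/us-central1/datasets/source"
store = "projects/my-project/locations/us-central1/datasets/dest/annotationStores/notes"
print(same_project(source, store))
```

A check like this only catches mismatched names early; the service still enforces the constraint.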

To specify an annotation store and its behavior during de-identification, set the annotationStoreName field inside an annotation object in the DeidentifyConfig object.

You can optionally set the storeQuote field, depending on your use case. Information about setting the storeQuote field is available in the next section.
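
As an illustration of where these fields sit in a request, the following sketch assembles a request body with an annotation object inside config. The function name build_deidentify_body is hypothetical, and storeQuote is written as a JSON boolean here, which is an assumption; the curl samples on this page pass it as the string 'true'.

```python
import json


def build_deidentify_body(destination_dataset, annotation_store_name,
                          store_quote=False, **config_sections):
    """Assemble a deidentify request body that includes an annotation object.

    config_sections carries the other config sections, for example
    fhir={} as in the FHIR sample on this page.
    """
    config = dict(config_sections)
    config["annotation"] = {
        "annotationStoreName": annotation_store_name,
        "storeQuote": store_quote,  # assumption: sent as a JSON boolean
    }
    return {"destinationDataset": destination_dataset, "config": config}


dataset = "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID"
body = build_deidentify_body(
    dataset,
    dataset + "/annotationStores/ANNOTATION_STORE_ID",
    store_quote=True,
    fhir={},
)
print(json.dumps(body, indent=2))
```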

Using the `storeQuote` field

The following information applies to both FHIR and DICOM data.

When the storeQuote field inside the annotation object in the request is set to true, the original values of the de-identified data appear in the quote field of the annotation record. For example:

If a DATE is de-identified and storeQuote is set to true, then the annotation record contains the following information:
- The value of the date (such as 1980-12-05), in the quote field
- The DATE infoType
- The start and end locations of where the data was found. The locations use a zero-based index. In the sample records on this page, the start is inclusive and the end is exclusive, so end minus start equals the length of the value.

If storeQuote is set to false, then the date (1980-12-05) does not appear in the annotation record, and only the following information is recorded:
- The DATE infoType
- The start and end locations of where the data was found.
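
The two cases can be pictured as the same finding with and without a quote entry. This sketch uses the field names and values from the sample annotation record on this page. The describe_finding helper is hypothetical, and it assumes that the quote key is simply absent when storeQuote is false.

```python
def describe_finding(finding: dict) -> str:
    """Summarize one finding, with or without a stored quote."""
    summary = (
        f"{finding['infoType']} start={finding['start']} end={finding['end']}"
    )
    # Assumption: with storeQuote set to false, the quote key is absent.
    if "quote" in finding:
        summary += f" quote={finding['quote']!r}"
    return summary


with_quote = {"infoType": "DATE", "start": "81", "end": "91", "quote": "1980-12-05"}
without_quote = {"infoType": "DATE", "start": "81", "end": "91"}

print(describe_finding(with_quote))
print(describe_finding(without_quote))
```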

Annotations for de-identified FHIR data

This section builds on concepts explained in De-identifying FHIR data using the Cloud Healthcare API.

Annotation record structure

The de-identify operation creates one annotation record for each de-identified FHIR resource. Each annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed. For a de-identified field to appear in the annotation record, it must have the INSPECT_AND_TRANSFORM Action applied to it.

Configuring annotations for de-identified FHIR data

The following samples use the Default FHIR data de-identification as their starting point. The samples show how to de-identify a Patient resource using the FHIR default method and store information about the de-identified data in an annotation record in a new annotation store. In the samples, the storeQuote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.

The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.

curl

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'fhir': {},
        'annotation': {
          'annotationStoreName': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'storeQuote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns the response in JSON format:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"

If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}
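
Tracking the operation usually means repeating the get request until done is true. The following sketch shows one way to structure that loop. The wait_for_operation name and its parameters are hypothetical, the fetch callable stands in for whatever performs the GET request shown above, and the check for an error field assumes the standard long-running operation shape, which this page does not show.

```python
import time
from typing import Callable


def wait_for_operation(fetch: Callable[[], dict], interval_s: float = 5.0,
                       max_polls: int = 60) -> dict:
    """Call fetch() until the returned operation reports done: true."""
    for _ in range(max_polls):
        operation = fetch()
        if operation.get("done"):
            # Assumption: a failed operation carries an "error" field.
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation
        time.sleep(interval_s)
    raise TimeoutError(f"operation not done after {max_polls} polls")


# Stubbed responses in place of real HTTP calls: first pending, then done.
responses = iter([
    {"name": "operations/OPERATION_ID"},
    {"name": "operations/OPERATION_ID", "done": True,
     "response": {"successStoreCount": "1", "successResourceCount": "1"}},
])
result = wait_for_operation(lambda: next(responses), interval_s=0)
print(result["response"]["successResourceCount"])
```

Passing the fetch step in as a callable keeps the polling logic independent of the HTTP client and credentials in use.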

After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores"

If the request is successful, the server returns the response in JSON format:

{
  "annotationStores": [
    {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID"
    },
    {
      ...
    }
  ]
}

Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"

If the request is successful, the server returns the response in JSON format:

{
  "annotations": [
    "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID",
    ...
  ]
}
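
Each entry in the list is a full resource name, and the record ID is its last path segment. The following sketch extracts the IDs, assuming the response shape shown above, where annotations is a list of name strings. The annotation_record_ids helper and the short sample name are hypothetical.

```python
def annotation_record_ids(list_response: dict) -> list:
    """Return the last path segment of each annotation resource name."""
    names = list_response.get("annotations", [])
    return [name.rsplit("/", 1)[-1] for name in names]


# Hypothetical response, shortened for illustration.
sample = {
    "annotations": [
        "projects/p/locations/l/datasets/d/annotationStores/s/annotations/abc123",
    ]
}
print(annotation_record_ids(sample))
```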

Use the ANNOTATION_RECORD_ID value to view the annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

The textAnnotation object contains information about sensitive text that the de-identification operation removed. In the details field, you can see that the operation searched the patient.text.div object and recorded four findings, each with its infoType, its value, and the location where the value was found.

Using the default FHIR de-identification, the only data that was inspected and transformed was the data in the patient.text.div object; all other de-identified data was transformed without being inspected because its infoType was already declared in the original FHIR resource.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/fhirStores/FHIR_STORE_ID/fhir/Patient/PATIENT_ID"
    }
  },
  "textAnnotation": {
    "details": {
      "patient.text.div": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "start": "42",
            "end": "54",
            "quote": "Smith, Darcy"
          },
          {
            "infoType": "PERSON_NAME",
            "start": "42",
            "end": "47",
            "quote": "Smith"
          },
          {
            "infoType": "PERSON_NAME",
            "start": "49",
            "end": "54",
            "quote": "Darcy"
          },
          {
            "infoType": "DATE",
            "start": "81",
            "end": "91",
            "quote": "1980-12-05"
          }
        ]
      }
    }
  }
}
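
To work with a record like this one in code, you can walk the details map and collect the findings for each field. The following sketch groups the quoted values by infoType. The findings are copied from the sample above, the name and annotationSource fields are left out for brevity, and the quotes_by_info_type helper is hypothetical.

```python
from collections import defaultdict

record = {
    "textAnnotation": {"details": {"patient.text.div": {"findings": [
        {"infoType": "PERSON_NAME", "start": "42", "end": "54", "quote": "Smith, Darcy"},
        {"infoType": "PERSON_NAME", "start": "42", "end": "47", "quote": "Smith"},
        {"infoType": "PERSON_NAME", "start": "49", "end": "54", "quote": "Darcy"},
        {"infoType": "DATE", "start": "81", "end": "91", "quote": "1980-12-05"},
    ]}}}
}


def quotes_by_info_type(annotation_record: dict) -> dict:
    """Map each infoType to the quotes recorded for it, across all fields."""
    grouped = defaultdict(list)
    details = annotation_record.get("textAnnotation", {}).get("details", {})
    for detail in details.values():
        # A field with no findings may have no "findings" key at all.
        for finding in detail.get("findings", []):
            grouped[finding["infoType"]].append(finding.get("quote"))
    return dict(grouped)


print(quotes_by_info_type(record))
```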

gcloud

The following sample uses the gcloud beta healthcare datasets deidentify command. The storeQuote field is set to true by default, and cannot be changed when using the Google Cloud CLI.

gcloud beta healthcare datasets deidentify SOURCE_DATASET_ID \
    --destination-dataset=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID \
    --default-fhir-config \
    --annotation-store=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID

The command line displays the operation ID and, after the operation completes, done:

Request issued for: [SOURCE_DATASET_ID]
Waiting for operation [projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID] to complete...done.

To view more details about the operation, run the gcloud beta healthcare operations describe command, providing the OPERATION_ID from the response:

gcloud beta healthcare operations describe --dataset=SOURCE_DATASET_ID \
    OPERATION_ID

The response includes done: true:

done: true
metadata:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata
  apiMethodName: google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset
  counter: {COUNTER}
  createTime: 'CREATE_TIME'
  endTime: 'END_TIME'
  logsUrl: CLOUD_LOGGING_URL
name: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID
response:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary

Run the following command to list the annotation stores in the dataset and see that the operation created the new annotation store:

gcloud beta healthcare annotation-stores list --dataset=DESTINATION_DATASET_ID

If the request is successful, the server returns the new annotation store:

ID                   LOCATION
ANNOTATION_STORE_ID  LOCATION

It is not possible to view details about an individual annotation using the gcloud CLI. To view details about an individual annotation, follow the instructions in the curl sample.

Annotations for de-identified DICOM data

This section builds on concepts explained in De-identifying DICOM data using the Cloud Healthcare API.

Annotation record structure

The de-identify operation creates two types of annotation records for de-identified DICOM data. The two types of annotation records are:

- Text annotation records: Contain metadata, such as DICOM tags, from the de-identified data. Each text annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed. For a de-identified tag to appear in the annotation record, it must have been inspected for protected health information (PHI) based on the configuration provided in the TagFilterProfile field. For example, the samples in Configuring annotations for de-identified DICOM data use the DEIDENTIFY_TAG_CONTENTS configuration.
- Image annotation records: Contain the location of sensitive information in individual DICOM frames. Each image annotation record contains an imageAnnotation object that holds the coordinates for the found sensitive information.

The de-identify operation creates annotation records for each frame in a DICOM instance. For example, if a DICOM instance has three frames, the de-identify operation creates the following annotation records:

- One text annotation record, containing a textAnnotation, for the DICOM tags in the DICOM instance.
- Three image annotation records, each containing an imageAnnotation, one for each of the three frames. Each image annotation record contains a frame_index field to indicate the frame that the record corresponds to.

All four of these annotation records have the same cloudHealthcareSource.name value, which is the DICOM instance path in the format: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID.
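
Because the records for one instance share a source name, they can be grouped by that value after retrieval. The following sketch counts text and image records per instance for the three-frame example. The kind_of and count_by_source helpers are hypothetical, and the records are stripped down to the fields that the sketch reads.

```python
from collections import Counter, defaultdict


def kind_of(record: dict) -> str:
    """Classify a record by which annotation object it carries."""
    if "textAnnotation" in record:
        return "text"
    if "imageAnnotation" in record:
        return "image"
    return "other"


def count_by_source(records) -> dict:
    """Count record kinds per cloudHealthcareSource.name value."""
    counts = defaultdict(Counter)
    for record in records:
        source = record["annotationSource"]["cloudHealthcareSource"]["name"]
        counts[source][kind_of(record)] += 1
    return {source: dict(kinds) for source, kinds in counts.items()}


instance = (
    "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/"
    "dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/"
    "instances/INSTANCE_UID"
)
source_ref = {"cloudHealthcareSource": {"name": instance}}

# One text record plus one image record per frame of a three-frame instance.
records = [{"annotationSource": source_ref, "textAnnotation": {}}]
records += [{"annotationSource": source_ref, "imageAnnotation": {}} for _ in range(3)]

print(count_by_source(records))
```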

Configuring annotations for de-identified DICOM data

The following samples use Combining tag de-identification and burnt-in text redaction as their starting point. The samples show how to de-identify a DICOM instance to redact all burnt-in text in the image and to inspect and transform sensitive text. The samples also show how to store information about the de-identified data in an annotation record in a new annotation store. In the samples, the storeQuote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.

The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.

curl

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'dicom': {
          'filterProfile': 'DEIDENTIFY_TAG_CONTENTS'
        },
        'image': {
          'textRedactionMode': 'REDACT_ALL_TEXT'
        },
        'annotation': {
          'annotationStoreName': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'storeQuote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns the response in JSON format:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"

If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}

After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the new annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores"

If the request is successful, the server returns the response in JSON format:

{
  "annotationStores": [
    {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID"
    },
    {
      ...
    }
  ]
}

Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"

If the request is successful, the server returns the response in JSON format:

{
  "annotations": [
    "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
    "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
    ...
  ]
}

You can see that two annotation records were created: a text annotation record and an image annotation record.

First, use the TEXT_ANNOTATION_RECORD_ID value to view the text annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

The textAnnotation object contains information about the sensitive text that the de-identification operation removed. The details field lists the DICOM tags that the operation inspected. When sensitive data was found in a tag, the tag's findings object shows the infoType, the original value, and the location where the value was found. Tags in which nothing was found have an empty object.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "textAnnotation": {
    "details": {
      "00080070": {},
      "00080090": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "end": "8",
            "quote": "John Doe"
          }
        ]
      },
      "00081090": {},
      "00100010": {
        "findings": [
          {
            "infoType": "PERSON_NAME",
            "end": "11",
            "quote": "Ann Johnson"
          }
        ]
      },
      "00100020": {},
      "00100030": {
        "findings": [
          {
            "infoType": "DATE",
            "end": "8",
            "quote": "19880812"
          }
        ]
      },
      "00020013": {
        "findings": [
          {
            "infoType": "LOCATION",
            "end": "5",
            "quote": "OFFIS"
          }
        ]
      },
      "00080020": {
        "findings": [
          {
            "infoType": "DATE",
            "end": "8",
            "quote": "20110909"
          }
        ]
      }
    }
  }
}
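
The findings in this sample have no start field. The following sketch reads the offsets under the assumption that an omitted start means zero, which matches the sample ("Ann Johnson" is 11 characters long and its end is 11) but is not stated on this page. The finding_span helper is hypothetical, and the details map is an excerpt of the record above.

```python
def finding_span(finding: dict) -> tuple:
    """Return (start, end) as integers, treating a missing start as 0."""
    # Assumption: an omitted "start" stands for offset zero.
    return int(finding.get("start", 0)), int(finding["end"])


details = {
    "00080070": {},
    "00100010": {"findings": [
        {"infoType": "PERSON_NAME", "end": "11", "quote": "Ann Johnson"},
    ]},
    "00100030": {"findings": [
        {"infoType": "DATE", "end": "8", "quote": "19880812"},
    ]},
}

for tag, detail in details.items():
    # Tags without findings are empty objects, so default to an empty list.
    for finding in detail.get("findings", []):
        print(tag, finding["infoType"], finding_span(finding))
```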

Next, use the IMAGE_ANNOTATION_RECORD_ID value to view the image annotation record:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID"

If the request is successful, the server returns the response in JSON format.

Inside the imageAnnotation object, the boundingPolys field contains multiple entries. Each entry has four vertices, given as X/Y points, that bound a location where the de-identification operation detected sensitive image data and burnt-in text.

{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "imageAnnotation": {
    "boundingPolys": [
      {
        "vertices": [
          {
            "x": 439,
            "y": 919
          },
          {
            "x": 495,
            "y": 919
          },
          {
            "x": 495,
            "y": 970
          },
          {
            "x": 439,
            "y": 970
          }
        ]
      },
      {
        "vertices": [
          {
            "x": 493,
            "y": 919
          },
          {
            "x": 610,
            "y": 919
          },
          {
            "x": 610,
            "y": 972
          },
          {
            "x": 493,
            "y": 972
          }
        ]
      },
      {
        "vertices": [
        ...
        ]
      },
      ...
    ]
  }
}
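
Downstream tools often expect rectangles as an origin plus a size rather than as four corners. The following sketch converts each boundingPolys entry to (left, top, width, height), using the first two polygons from the sample above. The bounding_box helper is hypothetical, and treating a missing coordinate as zero is an assumption.

```python
def bounding_box(poly: dict) -> tuple:
    """Return (left, top, width, height) for one boundingPolys entry."""
    # Assumption: a coordinate omitted from a vertex stands for zero.
    xs = [vertex.get("x", 0) for vertex in poly["vertices"]]
    ys = [vertex.get("y", 0) for vertex in poly["vertices"]]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


bounding_polys = [
    {"vertices": [{"x": 439, "y": 919}, {"x": 495, "y": 919},
                  {"x": 495, "y": 970}, {"x": 439, "y": 970}]},
    {"vertices": [{"x": 493, "y": 919}, {"x": 610, "y": 919},
                  {"x": 610, "y": 972}, {"x": 493, "y": 972}]},
]

for poly in bounding_polys:
    print(bounding_box(poly))
```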

gcloud

The following sample uses the gcloud beta healthcare datasets deidentify command. The storeQuote field is set to true by default, and cannot be changed when using the Google Cloud CLI.

gcloud beta healthcare datasets deidentify SOURCE_DATASET_ID \
    --destination-dataset=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID \
    --text-redaction-mode=all \
    --annotation-store=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID

The command line displays the operation ID and, after the operation completes, done:

Request issued for: [SOURCE_DATASET_ID]
Waiting for operation [projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID] to complete...done.

To view more details about the operation, run the gcloud beta healthcare operations describe command, providing the OPERATION_ID from the response:

gcloud beta healthcare operations describe --dataset=SOURCE_DATASET_ID \
    OPERATION_ID

The response includes done: true:

done: true
metadata:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata
  apiMethodName: google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset
  counter: {COUNTER}
  createTime: 'CREATE_TIME'
  endTime: 'END_TIME'
  logsUrl: CLOUD_LOGGING_URL
name: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID
response:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary

Run the following command to list the annotation stores in the dataset and see that the operation created the new annotation store:

gcloud beta healthcare annotation-stores list --dataset=DESTINATION_DATASET_ID

If the request is successful, the server returns the new annotation store:

ID                   LOCATION
ANNOTATION_STORE_ID  LOCATION

It is not possible to view details about an individual annotation using the gcloud CLI. To view details about an individual annotation, follow the instructions in the curl sample.
