This page explains how to configure annotation stores and annotation records when de-identifying sensitive FHIR and DICOM data.
Annotating de-identified data overview
Each time you de-identify sensitive FHIR or DICOM data, you can output information about the sensitive data that was removed to an annotation store. This information is stored as one or more annotation records inside the annotation store.
You can create the annotation store in an existing dataset or create it in the new dataset created during the de-identify operation. If you create the annotation store in an existing dataset, an annotation store with the same name cannot already exist in that dataset.
The created annotation store must be in the same project as the de-identified source data. For example, you cannot simultaneously de-identify data in one project and output annotation records to an annotation store in a different project.
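The same-project requirement can be checked locally before a request is sent. The following is a minimal sketch and not part of the Cloud Healthcare API: the helper names project_of and same_project are hypothetical, the example IDs are made up, and the check relies only on the resource-name format used on this page.

```python
def project_of(resource_name: str) -> str:
    """Return the project ID from a name like 'projects/P/locations/L/...'."""
    parts = resource_name.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "projects" or not parts[1]:
        raise ValueError(f"not a project-scoped resource name: {resource_name!r}")
    return parts[1]


def same_project(source_dataset: str, annotation_store: str) -> bool:
    """True when both resource names belong to the same project."""
    return project_of(source_dataset) == project_of(annotation_store)


# Hypothetical resource names used only for illustration.
source = "projects/my-project/locations/us-central1/datasets/source-dataset"
store = ("projects/my-project/locations/us-central1/datasets/dest-dataset"
         "/annotationStores/my-annotation-store")
print(same_project(source, store))
```

A check like this only catches a mismatch early; the service remains the authority on whether a request is valid.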
To specify an annotation store and its behavior during de-identification, set the annotationStoreName field inside an annotation object in the DeidentifyConfig object.
You can optionally set the storeQuote field, depending on your use case. Information about setting the storeQuote field is available in the next section.
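As an illustration of where these fields sit in a request body, the following sketch builds the JSON with Python's standard json module. The function name build_deidentify_body is hypothetical, and the body is incomplete on purpose: a real request also carries the FHIR or DICOM settings shown in the samples on this page. The sketch emits storeQuote as a JSON boolean, whereas the curl samples on this page pass it as the string 'true'.

```python
import json


def build_deidentify_body(destination_dataset, annotation_store_name,
                          store_quote=None):
    """Build a request body that carries only the annotation settings."""
    annotation = {"annotationStoreName": annotation_store_name}
    if store_quote is not None:
        # storeQuote is optional; include it only when the caller sets it.
        annotation["storeQuote"] = store_quote
    return {
        "destinationDataset": destination_dataset,
        "config": {"annotation": annotation},
    }


dataset = "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID"
body = build_deidentify_body(
    dataset, dataset + "/annotationStores/ANNOTATION_STORE_ID", store_quote=True)
print(json.dumps(body, indent=2))
```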
Using the storeQuote field
The following information applies to both FHIR and DICOM data.
When the storeQuote field inside the annotation object in the request is set to true, the original values of the de-identified data appear in the quote field of the annotation record. For example:
- If a DATE is de-identified and storeQuote is set to true, the annotation record contains the following information:
  - The value of the date (such as 1980-12-05), in the quote field
  - The DATE infoType
  - The start and end locations of where the data was found. The locations use a zero-based index. In the sample annotation records on this page, the start location is inclusive and the end location is exclusive: the 10-character value 1980-12-05 has a start of 81 and an end of 91.
- If storeQuote is set to false, the date (1980-12-05) does not appear in the annotation record, and only the following information appears:
  - The DATE infoType
  - The start and end locations of where the data was found, indexed the same way as when storeQuote is true
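The effect of storeQuote on a single finding can be mimicked locally. This sketch does not call the API; make_finding is a hypothetical helper that only reproduces the shape of a finding. It treats the end offset as exclusive, which matches the sample annotation records later on this page, and that choice is an assumption of the sketch.

```python
def make_finding(text, start, end, info_type, store_quote):
    """Mimic the shape of one finding for text[start:end] (end exclusive here)."""
    # Offsets are emitted as strings, as in the sample records on this page.
    finding = {"infoType": info_type, "start": str(start), "end": str(end)}
    if store_quote:
        # The original value is kept only when store_quote is true.
        finding["quote"] = text[start:end]
    return finding


text = "Born 1980-12-05"  # made-up source text for illustration
with_quote = make_finding(text, 5, 15, "DATE", store_quote=True)
without_quote = make_finding(text, 5, 15, "DATE", store_quote=False)
print(with_quote)
print(without_quote)
```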
Annotations for de-identified FHIR data
This section builds on concepts explained in De-identifying FHIR data using the Cloud Healthcare API.
Annotation record structure
The de-identify operation creates one annotation record for each de-identified FHIR resource. Each annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed.
For a de-identified field to appear in the annotation record, it must have the INSPECT_AND_TRANSFORM Action applied to it.
Configuring annotations for de-identified FHIR data
The following samples use the Default FHIR data de-identification as their starting point. The samples show how to de-identify a Patient resource using the FHIR default method and store information about the de-identified data in an annotation record in a new annotation store. In the samples, the storeQuote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.
The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.
curl
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'fhir': {},
        'annotation': {
          'annotationStoreName': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'storeQuote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID:deidentify"
If the request is successful, the server returns the response in JSON format:
{ "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID" }
The response contains an operation name. You can use the Operation get method to track the status of the operation:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.
{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}
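Tracking the operation usually means repeating the get request until "done" is true. The following sketch shows one way to structure that loop without tying it to an HTTP library. The name wait_for_operation and the get_operation callback are hypothetical; the callback stands in for whatever code fetches the operation and returns the parsed JSON. The check for an "error" field assumes the response follows the standard long-running operation shape.

```python
import time


def wait_for_operation(get_operation, name, poll_seconds=5.0, max_polls=60):
    """Poll until the operation reports done, then return it.

    get_operation is a caller-supplied function (hypothetical here) that
    fetches the operation by name and returns the parsed JSON as a dict.
    """
    for _ in range(max_polls):
        operation = get_operation(name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation
        time.sleep(poll_seconds)
    raise TimeoutError(f"{name} did not finish after {max_polls} polls")


# Stand-in for a real HTTP call: not done on the first poll, done on the second.
_responses = iter([
    {"name": "operations/OPERATION_ID"},
    {"name": "operations/OPERATION_ID", "done": True,
     "response": {"successStoreCount": "1", "successResourceCount": "1"}},
])
result = wait_for_operation(lambda name: next(_responses),
                            "operations/OPERATION_ID", poll_seconds=0)
print(result["response"])
```

Passing the fetch function in as a parameter keeps the loop testable and leaves authentication to the caller.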
After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the annotation store:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores"
If the request is successful, the server returns the response in JSON format:
{ "annotationStores": [ { "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID" }, { ... } ] }
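To confirm programmatically that the new store is in the list, the store IDs can be pulled out of the parsed response. This is a small sketch; annotation_store_ids is a hypothetical helper, and it assumes the response has the shape shown above.

```python
def annotation_store_ids(list_response):
    """Return the trailing ID of each store in a list-annotation-stores response."""
    stores = list_response.get("annotationStores", [])
    return [store["name"].rsplit("/", 1)[-1] for store in stores]


response = {
    "annotationStores": [
        {"name": "projects/PROJECT_ID/locations/LOCATION/datasets/"
                 "DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID"},
    ]
}
print("ANNOTATION_STORE_ID" in annotation_store_ids(response))
```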
Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"
If the request is successful, the server returns the response in JSON format:
{ "annotations": [ "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID", ... ] }
Use the ANNOTATION_RECORD_ID value to view the annotation record:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID"
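The list and get requests differ only in the trailing record ID, so both URLs can come from one helper. This is a sketch under the assumption that the v1beta1 base URL used in the curl samples applies; annotations_url is a hypothetical name.

```python
BASE_URL = "https://healthcare.googleapis.com/v1beta1"


def annotations_url(store_name, record_id=None):
    """URL that lists a store's annotation records, or gets one record."""
    # Stripping slashes avoids an accidental '//' in the path.
    url = f"{BASE_URL}/{store_name.strip('/')}/annotations"
    if record_id is not None:
        url += f"/{record_id}"
    return url


store = ("projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID"
         "/annotationStores/ANNOTATION_STORE_ID")
print(annotations_url(store))
print(annotations_url(store, "ANNOTATION_RECORD_ID"))
```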
If the request is successful, the server returns the response in JSON format.
The textAnnotation object contains information about sensitive text that the de-identification operation removed. In the details field, you can see that the operation searched the patient.text.div object and found four instances of sensitive data (three PERSON_NAME findings and one DATE finding), along with their values and the locations where the values were found.
Using the default FHIR de-identification, the only data that was inspected and transformed was the data in the patient.text.div object; all other de-identified data was transformed without being inspected because its infoType was already declared in the original FHIR resource.
{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/fhirStores/FHIR_STORE_ID/fhir/Patient/PATIENT_ID"
    }
  },
  "textAnnotation": {
    "details": {
      "patient.text.div": {
        "findings": [
          { "infoType": "PERSON_NAME", "start": "42", "end": "54", "quote": "Smith, Darcy" },
          { "infoType": "PERSON_NAME", "start": "42", "end": "47", "quote": "Smith" },
          { "infoType": "PERSON_NAME", "start": "49", "end": "54", "quote": "Darcy" },
          { "infoType": "DATE", "start": "81", "end": "91", "quote": "1980-12-05" }
        ]
      }
    }
  }
}
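Once the annotation record is parsed, its findings can be flattened into rows for review or export. The following sketch walks the textAnnotation.details structure shown above; iter_findings is a hypothetical helper. It converts the string offsets to integers, and it assumes that an omitted start means 0 and that quote is absent when storeQuote is false.

```python
def iter_findings(record):
    """Yield (field, infoType, start, end, quote) for each text finding."""
    details = record.get("textAnnotation", {}).get("details", {})
    for field, detail in details.items():
        for finding in detail.get("findings", []):
            yield (
                field,
                finding["infoType"],
                int(finding.get("start", 0)),  # offsets arrive as strings
                int(finding["end"]),
                finding.get("quote"),  # None when no quote was stored
            )


# Two of the findings from the sample record above.
record = {
    "textAnnotation": {"details": {"patient.text.div": {"findings": [
        {"infoType": "PERSON_NAME", "start": "42", "end": "54",
         "quote": "Smith, Darcy"},
        {"infoType": "DATE", "start": "81", "end": "91", "quote": "1980-12-05"},
    ]}}}
}
for row in iter_findings(record):
    print(row)
```

In both sample findings, end minus start equals the length of the quote.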
gcloud
The following sample uses the gcloud beta healthcare datasets deidentify command. The storeQuote field is set to true by default, and cannot be changed when using the Google Cloud CLI.
gcloud beta healthcare datasets deidentify SOURCE_DATASET_ID \
    --destination-dataset=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID \
    --default-fhir-config \
    --annotation-store=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID
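When the command is run from a script, building the argument list in code avoids quoting mistakes. This sketch only assembles the arguments and prints them; it uses the flags from the command above and nothing else. The function name deidentify_fhir_args is hypothetical.

```python
import shlex


def deidentify_fhir_args(source_dataset, destination_dataset, annotation_store):
    """Argument list for the gcloud command shown above."""
    return [
        "gcloud", "beta", "healthcare", "datasets", "deidentify", source_dataset,
        f"--destination-dataset={destination_dataset}",
        "--default-fhir-config",
        f"--annotation-store={annotation_store}",
    ]


dest = "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID"
args = deidentify_fhir_args(
    "SOURCE_DATASET_ID", dest, dest + "/annotationStores/ANNOTATION_STORE_ID")
# Print a shell-safe rendering; subprocess.run(args, check=True) would execute it.
print(shlex.join(args))
```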
The command line displays the operation ID and, after the operation completes, done:
Request issued for: [SOURCE_DATASET_ID]
Waiting for operation [projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID] to complete...done.
To view more details about the operation, run the gcloud beta healthcare operations describe command, providing the OPERATION_ID from the response:
gcloud beta healthcare operations describe --dataset=SOURCE_DATASET_ID \
    OPERATION_ID
The response includes done: true:
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata
  apiMethodName: google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset
  counter: {COUNTER}
  createTime: 'CREATE_TIME'
  endTime: 'END_TIME'
  logsUrl: CLOUD_LOGGING_URL
name: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID
response:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary
Run the following command to list the annotation stores in the dataset and see that the operation created the new annotation store:
gcloud beta healthcare annotation-stores list --dataset=DESTINATION_DATASET_ID
If the request is successful, the server returns the new annotation store:
ID                   LOCATION
ANNOTATION_STORE_ID  LOCATION
It is not possible to view details about an individual annotation using the gcloud CLI. To view details about an individual annotation, follow the instructions in the curl sample.
Annotations for de-identified DICOM data
This section builds on concepts explained in De-identifying DICOM data using the Cloud Healthcare API.
Annotation record structure
The de-identify operation creates two types of annotation records for de-identified DICOM data:
- Text annotation records: Contain metadata, such as DICOM tags, from the de-identified data. Each text annotation record contains a textAnnotation object that holds information about the de-identified data that was inspected and transformed. For a de-identified tag to appear in the annotation record, it must have been inspected for protected health information (PHI) based on the configuration provided in the TagFilterProfile field. For example, the samples in Configuring annotations for de-identified DICOM data use the DEIDENTIFY_TAG_CONTENTS configuration.
- Image annotation records: Contain the location of sensitive information in individual DICOM frames. Each image annotation record contains an imageAnnotation object that holds the coordinates for the found sensitive information.
The de-identify operation creates one text annotation record for each DICOM instance and one image annotation record for each frame in the instance. For example, if a DICOM instance has three frames, the de-identify operation creates the following annotation records:
- One text annotation record, containing a textAnnotation, for the DICOM tags in the DICOM instance.
- Three image annotation records, each containing an imageAnnotation, one for each of the three frames. Each image annotation record contains a frame_index field to indicate the frame that the record corresponds to.
All four of these annotation records have the same cloudHealthcareSource.name value, which is the DICOM instance path in the format: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID.
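Because the text and image annotation records for an instance share one cloudHealthcareSource.name value, that value works as a grouping key. The following sketch groups already-fetched records by instance. The name group_by_instance is hypothetical, the record contents are trimmed to the fields the sketch reads, and the instance path uses made-up IDs.

```python
from collections import defaultdict


def group_by_instance(records):
    """Group annotation records by the DICOM instance they describe."""
    grouped = defaultdict(lambda: {"text": [], "image": []})
    for record in records:
        source = record["annotationSource"]["cloudHealthcareSource"]["name"]
        kind = "text" if "textAnnotation" in record else "image"
        grouped[source][kind].append(record)
    return dict(grouped)


instance = ("projects/P/locations/L/datasets/D/dicomStores/S"
            "/dicomWeb/studies/1/series/2/instances/3")
src = {"cloudHealthcareSource": {"name": instance}}
records = [{"annotationSource": src, "textAnnotation": {"details": {}}}]
# Three frames, so three image annotation records (contents elided here).
records += [{"annotationSource": src, "imageAnnotation": {}} for _ in range(3)]
groups = group_by_instance(records)
print(len(groups[instance]["text"]), len(groups[instance]["image"]))
```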
Configuring annotations for de-identified DICOM data
The following samples use Combining tag de-identification and burnt-in text redaction as their starting point. The samples show how to de-identify a DICOM instance to redact all burnt-in text in the image and inspect and transform sensitive text. The samples also show how to store information about the de-identified data in an annotation record in a new annotation store. In the samples, the storeQuote field is set to true, meaning that the output annotation record contains the original values of the data that was de-identified.
The new annotation store is in the dataset created by the de-identify operation, but you can also create the annotation store in an existing dataset.
curl
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'dicom': {
          'filterProfile': 'DEIDENTIFY_TAG_CONTENTS'
        },
        'image': {
          'textRedactionMode': 'REDACT_ALL_TEXT'
        },
        'annotation': {
          'annotationStoreName': 'projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID',
          'storeQuote': 'true'
        }
      }
    }" "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID:deidentify"
If the request is successful, the server returns the response in JSON format:
{ "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID" }
The response contains an operation name. You can use the Operation get method to track the status of the operation:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID"
If the request is successful, the server returns the response in JSON format. After the de-identification process finishes, the response contains "done": true.
{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset",
    "createTime": "CREATE_TIME",
    "endTime": "END_TIME"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary",
    "successStoreCount": "1",
    "successResourceCount": "1"
  }
}
After checking that the de-identification was successful, you can list the annotation stores in the dataset and see that the operation created the new annotation store:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores"
If the request is successful, the server returns the response in JSON format:
{ "annotationStores": [ { "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID" }, { ... } ] }
Use the ANNOTATION_STORE_ID value to list the annotation records in the annotation store:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations"
If the request is successful, the server returns the response in JSON format:
{
  "annotations": [
    "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
    "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
    ...
  ]
}
You can see that two annotation records were created: a text annotation record and an image annotation record.
First, use the TEXT_ANNOTATION_RECORD_ID value to view the text annotation record:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID"
If the request is successful, the server returns the response in JSON format.
The textAnnotation object contains information about the sensitive text that the de-identification operation removed. In the details field, you can see that the operation provided a list of DICOM tags. When sensitive text was found in a tag, that tag's findings object shows the infoType, the infoType's value, and the location where the value was found. Tags in which nothing was found have an empty object.
{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/TEXT_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "textAnnotation": {
    "details": {
      "00080070": {},
      "00080090": {
        "findings": [
          { "infoType": "PERSON_NAME", "end": "8", "quote": "John Doe" }
        ]
      },
      "00081090": {},
      "00100010": {
        "findings": [
          { "infoType": "PERSON_NAME", "end": "11", "quote": "Ann Johnson" }
        ]
      },
      "00100020": {},
      "00100030": {
        "findings": [
          { "infoType": "DATE", "end": "8", "quote": "19880812" }
        ]
      },
      "00020013": {
        "findings": [
          { "infoType": "LOCATION", "end": "5", "quote": "OFFIS" }
        ]
      },
      "00080020": {
        "findings": [
          { "infoType": "DATE", "end": "8", "quote": "20110909" }
        ]
      }
    }
  }
}
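A quick way to summarize a text annotation record like this one is to count findings per infoType across all tags. The sketch below does that with a Counter. The name info_type_counts is hypothetical, and the record is a trimmed copy of the sample above with the same tag keys and infoTypes.

```python
from collections import Counter


def info_type_counts(record):
    """Count findings per infoType across all DICOM tags in a text annotation record."""
    details = record.get("textAnnotation", {}).get("details", {})
    counts = Counter()
    for detail in details.values():
        # Tags without sensitive text have an empty object and no findings key.
        for finding in detail.get("findings", []):
            counts[finding["infoType"]] += 1
    return counts


record = {"textAnnotation": {"details": {
    "00080070": {},
    "00080090": {"findings": [{"infoType": "PERSON_NAME", "end": "8"}]},
    "00100010": {"findings": [{"infoType": "PERSON_NAME", "end": "11"}]},
    "00100030": {"findings": [{"infoType": "DATE", "end": "8"}]},
    "00020013": {"findings": [{"infoType": "LOCATION", "end": "5"}]},
    "00080020": {"findings": [{"infoType": "DATE", "end": "8"}]},
}}}
counts = info_type_counts(record)
print(dict(counts))
```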
Next, use the IMAGE_ANNOTATION_RECORD_ID value to view the image annotation record:
curl -X GET \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    "https://healthcare.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID"
If the request is successful, the server returns the response in JSON format.
Inside the imageAnnotation object, there are multiple boundingPolys entries. Each entry has a vertices list of four X/Y points that bound a location where the de-identification operation detected sensitive image data and burnt-in text.
{
  "name": "projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID/annotations/IMAGE_ANNOTATION_RECORD_ID",
  "annotationSource": {
    "cloudHealthcareSource": {
      "name": "projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/dicomStores/DICOM_STORE_ID/dicomWeb/studies/STUDY_UID/series/SERIES_UID/instances/INSTANCE_UID"
    }
  },
  "imageAnnotation": {
    "boundingPolys": [
      {
        "vertices": [
          { "x": 439, "y": 919 },
          { "x": 495, "y": 919 },
          { "x": 495, "y": 970 },
          { "x": 439, "y": 970 }
        ]
      },
      {
        "vertices": [
          { "x": 493, "y": 919 },
          { "x": 610, "y": 919 },
          { "x": 610, "y": 972 },
          { "x": 493, "y": 972 }
        ]
      },
      {
        "vertices": [ ... ]
      },
      ...
    ]
  }
}
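For drawing or auditing the redacted regions, each set of four vertices can be reduced to a rectangle. The following sketch computes (left, top, width, height) from the first two polygons in the sample above. The name bounding_box is hypothetical, and treating a missing x or y as 0 is an assumption of the sketch.

```python
def bounding_box(poly):
    """Return (left, top, width, height) for one boundingPolys entry."""
    # A missing coordinate is assumed to mean 0.
    xs = [v.get("x", 0) for v in poly["vertices"]]
    ys = [v.get("y", 0) for v in poly["vertices"]]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


image_annotation = {"boundingPolys": [
    {"vertices": [{"x": 439, "y": 919}, {"x": 495, "y": 919},
                  {"x": 495, "y": 970}, {"x": 439, "y": 970}]},
    {"vertices": [{"x": 493, "y": 919}, {"x": 610, "y": 919},
                  {"x": 610, "y": 972}, {"x": 493, "y": 972}]},
]}
boxes = [bounding_box(p) for p in image_annotation["boundingPolys"]]
print(boxes)
```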
gcloud
The following sample uses the gcloud beta healthcare datasets deidentify command. The storeQuote field is set to true by default, and cannot be changed when using the Google Cloud CLI.
gcloud beta healthcare datasets deidentify SOURCE_DATASET_ID \
    --destination-dataset=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID \
    --text-redaction-mode=all \
    --annotation-store=projects/PROJECT_ID/locations/LOCATION/datasets/DESTINATION_DATASET_ID/annotationStores/ANNOTATION_STORE_ID
The command line displays the operation ID and, after the operation completes, done:
Request issued for: [SOURCE_DATASET_ID]
Waiting for operation [projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID] to complete...done.
To view more details about the operation, run the gcloud beta healthcare operations describe command, providing the OPERATION_ID from the response:
gcloud beta healthcare operations describe --dataset=SOURCE_DATASET_ID \
    OPERATION_ID
The response includes done: true:
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata
  apiMethodName: google.cloud.healthcare.v1beta1.dataset.DatasetService.DeidentifyDataset
  counter: {COUNTER}
  createTime: 'CREATE_TIME'
  endTime: 'END_TIME'
  logsUrl: CLOUD_LOGGING_URL
name: projects/PROJECT_ID/locations/LOCATION/datasets/SOURCE_DATASET_ID/operations/OPERATION_ID
response:
  '@type': type.googleapis.com/google.cloud.healthcare.v1beta1.deidentify.DeidentifySummary
Run the following command to list the annotation stores in the dataset and see that the operation created the new annotation store:
gcloud beta healthcare annotation-stores list --dataset=DESTINATION_DATASET_ID
If the request is successful, the server returns the new annotation store:
ID                   LOCATION
ANNOTATION_STORE_ID  LOCATION
It is not possible to view details about an individual annotation using the gcloud CLI. To view details about an individual annotation, follow the instructions in the curl sample.