De-identifying Sensitive Data


This page explains how to use the deidentify operation to de-identify sensitive data in a dataset. For example, you might want to de-identify data before analyzing or sharing it.

De-identification is available for DICOM instances and FHIR resources. If a dataset contains both DICOM instances and FHIR resources, you can de-identify all of the instances and resources at the same time.

De-identifying DICOM data

When working with DICOM instances, there are three components to a de-identification API call:

  1. The source dataset: A dataset containing DICOM stores with one or more instances that have sensitive data. When you call the deidentify operation, all instances in all DICOM stores in the dataset are de-identified.
  2. What to de-identify: Configuration parameters that specify how to process the dataset. You can configure DICOM de-identification using two methods:

    • Tag whitelist: DICOM tags to retain as is. All other tags will be redacted.
    • Image redaction: Redact burnt-in text in images.
  3. The destination dataset: De-identification does not impact the original dataset or its data. Instead, de-identified copies of the original data are written to a new dataset, called the destination dataset.
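
As a rough illustration of how these three components fit together, the following Python sketch assembles the request URL and JSON body for a deidentify call without sending anything. The function names `dataset_name` and `build_deidentify_call` are hypothetical helpers, not part of any client library. The field names are taken from the samples on this page. Whether both configuration methods can be combined in a single call is not stated here, so treat that as an assumption.

```python
import json

API_ROOT = "https://healthcare.googleapis.com/v1alpha"


def dataset_name(project_id, region, dataset_id):
    """Return the full resource name of a dataset."""
    return "projects/{}/locations/{}/datasets/{}".format(
        project_id, region, dataset_id)


def build_deidentify_call(project_id, region, source_id, destination_id,
                          whitelist_tags=None, redact_all_text=False):
    """Return (url, body) for a deidentify call. Nothing is sent."""
    config = {}
    if whitelist_tags:
        # Component 2, method 1: DICOM tags to retain as is.
        config["dicom"] = {"whitelistTags": list(whitelist_tags)}
    if redact_all_text:
        # Component 2, method 2: redact burnt-in text in images.
        config["image"] = {"redactAllText": True}
    # Component 1: the source dataset is addressed in the URL.
    url = "{}/{}:deidentify".format(
        API_ROOT, dataset_name(project_id, region, source_id))
    # Component 3: the destination dataset is named in the body.
    body = {
        "destinationDataset": dataset_name(project_id, region, destination_id),
        "config": config,
    }
    return url, body


url, body = build_deidentify_call(
    "my-project", "us-central1", "source-ds", "dest-ds",
    whitelist_tags=["PatientID"])
print(url)
print(json.dumps(body, indent=2))
```

Building the body as a dictionary and serializing it with `json.dumps` avoids hand-written quoting mistakes in the request payload.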

De-identification using whitelist tags

When you specify a whitelist tag in the DicomConfig object, the following whitelist tags are added by default:

  • StudyInstanceUID
  • SeriesInstanceUID
  • SOPInstanceUID
  • TransferSyntaxUID
  • MediaStorageSOPInstanceUID
  • MediaStorageSOPClassUID
  • PixelData
  • Rows
  • Columns
  • SamplesPerPixel
  • BitsAllocated
  • BitsStored
  • HighBit
  • PhotometricInterpretation
  • PixelRepresentation
  • NumberOfFrames

The deidentify operation can't redact every DICOM tag. If no whitelist tags are provided, then no DICOM tags in the dataset will be redacted.
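
To make the default behavior concrete, here is a small client-side model in Python of which tags end up retained for a given request. It is only a sketch of the rule described above, not the server's implementation, and `effective_whitelist` is a hypothetical helper name.

```python
# Tags that are added by default once any whitelist tag is specified.
DEFAULT_WHITELIST_TAGS = (
    "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID",
    "TransferSyntaxUID", "MediaStorageSOPInstanceUID",
    "MediaStorageSOPClassUID", "PixelData", "Rows", "Columns",
    "SamplesPerPixel", "BitsAllocated", "BitsStored", "HighBit",
    "PhotometricInterpretation", "PixelRepresentation", "NumberOfFrames",
)


def effective_whitelist(user_tags):
    """Return the retained tags in order, without duplicates.

    An empty result models the case where no whitelist tags are
    provided, in which no DICOM tags are redacted.
    """
    if not user_tags:
        return []
    merged = []
    for tag in list(user_tags) + list(DEFAULT_WHITELIST_TAGS):
        if tag not in merged:
            merged.append(tag)
    return merged


print(effective_whitelist(["PatientID", "PixelData"]))
```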

The following samples show how to de-identify a dataset containing DICOM stores and DICOM data while still leaving some tags unchanged. The samples use whitelist tags with a single DICOM instance, but you can also de-identify multiple instances.

curl command

To de-identify a dataset containing DICOM data, make a POST request and provide the name of the destination dataset, a set of whitelist tags for the data you want to retain, and an access token. The following shows an example of a POST request using curl.

curl -X POST \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'dicom': {
          'whitelistTags': [
            'PatientID',
            'TransferSyntaxUID',
            'SOPInstanceUID',
            'StudyInstanceUID',
            'SeriesInstanceUID',
            'PixelData'
          ]
        }
      }
    }" "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. After the de-identification process finishes, the response contains "done": true.

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NUMBER",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1alpha.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1alpha.DatasetService.DeidentifyDataset",
    "createTime": "2018-01-01T00:00:00Z",
    "endTime": "2018-01-01T00:00:00Z"
  },
  "done": true,
  "response": {
    "@type": "..."
  }
}

PowerShell

To de-identify a dataset containing DICOM data, make a POST request and provide the name of the destination dataset, a set of whitelist tags for the data you want to retain, and an access token. The following shows an example of a POST request using Windows PowerShell.

$cred = gcloud auth print-access-token
$headers = @{ Authorization = "Bearer $cred" }

Invoke-WebRequest `
  -Method Post `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -Body "{
    'destinationDataset': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID',
    'config': {
      'dicom': {
        'whitelistTags': [
          'PatientID',
          'TransferSyntaxUID',
          'SOPInstanceUID',
          'StudyInstanceUID',
          'SeriesInstanceUID',
          'PixelData'
        ]
      }
    }
  }" `
  -Uri "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID:deidentify" | Select-Object -Expand Content

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

$cred = gcloud auth print-access-token
$headers = @{ Authorization = "Bearer $cred" }

Invoke-WebRequest `
  -Method Get `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -Uri "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME" | Select-Object -Expand Content

After the de-identification process finishes, the response contains "done": true.

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NUMBER",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1alpha.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1alpha.DatasetService.DeidentifyDataset",
    "createTime": "2018-01-01T00:00:00Z",
    "endTime": "2018-01-01T00:00:00Z"
  },
  "done": true,
  "response": {
    "@type": "..."
  }
}

Node.js

function deidentifyDataset(
  client,
  projectId,
  cloudRegion,
  sourceDatasetId,
  destinationDatasetId,
  whitelistTags
) {
  // Client retrieved in callback
  // getClient(serviceAccountJson, function(client) {...});
  // const cloudRegion = 'us-central1';
  // const projectId = 'adjective-noun-123';
  // const sourceDatasetId = 'my-dataset';
  // const destinationDatasetId = 'my-destination-dataset';
  // const whitelistTags = 'PatientID';
  const sourceDatasetName = `projects/${projectId}/locations/${cloudRegion}/datasets/${sourceDatasetId}`;
  const destinationDatasetName = `projects/${projectId}/locations/${cloudRegion}/datasets/${destinationDatasetId}`;

  const request = {
    sourceDataset: sourceDatasetName,
    destinationDataset: destinationDatasetName,
    resource: {config: {dicom: {whitelistTags: whitelistTags}}},
  };

  client.projects.locations.datasets
    .deidentify(request)
    .then(() => {
      // Keep the message on one line so the log has no stray newline or indentation.
      console.log(
        `De-identified data written from dataset ${sourceDatasetId} to dataset ${destinationDatasetId}`
      );
    })
    .catch(err => {
      console.error(err);
    });
}

Python

from googleapiclient.errors import HttpError


def deidentify_dataset(
        service_account_json,
        api_key,
        project_id,
        cloud_region,
        dataset_id,
        destination_dataset_id,
        whitelist_tags):
    """Creates a new dataset containing de-identified data
    from the source dataset.
    """
    # get_client is a helper that is not defined in this snippet.
    client = get_client(service_account_json, api_key)
    source_dataset = 'projects/{}/locations/{}/datasets/{}'.format(
        project_id, cloud_region, dataset_id)
    destination_dataset = 'projects/{}/locations/{}/datasets/{}'.format(
        project_id, cloud_region, destination_dataset_id)

    body = {
        'destinationDataset': destination_dataset,
        'config': {
            'dicom': {
                'whitelistTags': whitelist_tags
            }
        }
    }

    request = client.projects().locations().datasets().deidentify(
        sourceDataset=source_dataset, body=body)

    try:
        response = request.execute()
        print(
            'Data in dataset {} de-identified. '
            'De-identified data written to {}'.format(
                dataset_id,
                destination_dataset_id))
        return response
    except HttpError as e:
        print('Error, data could not be deidentified: {}'.format(e))
        return ""
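
The curl and PowerShell samples track the operation with the Operation get method until the response contains "done": true. A minimal Python sketch of that polling loop follows. It assumes a hypothetical zero-argument callable, `get_operation`, that returns the operation as a dictionary (for example, by wrapping whichever client call you use), so the loop can be run and tested without network access. The check for an `error` field reflects the general shape of long-running operations and is an assumption about this API's failure responses.

```python
import time


def wait_for_operation(get_operation, poll_interval=5.0, timeout=600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_operation() until the returned dict has a true "done" value.

    get_operation: hypothetical callable returning the operation as a dict.
    Raises RuntimeError if the finished operation reports an error, and
    TimeoutError if it does not finish before the timeout.
    """
    deadline = clock() + timeout
    while True:
        operation = get_operation()
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(
                    "Operation failed: {}".format(operation["error"]))
            return operation
        if clock() >= deadline:
            raise TimeoutError("Operation did not finish in time")
        sleep(poll_interval)


# Demonstration with canned responses instead of real API calls.
canned = iter([
    {"name": "operations/example"},
    {"name": "operations/example", "done": True, "response": {}},
])
finished = wait_for_operation(lambda: next(canned), poll_interval=0)
print(finished["done"])
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the loop easy to test. In normal use the defaults are sufficient.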

Redacting burnt-in text from images

The following sample shows how to use curl to redact all burnt-in text from DICOM images in a dataset:

curl command

To redact all burnt-in text from a DICOM image, make a POST request and provide the name of the destination dataset, a DeidentifyConfig object with image.redact_all_text set to true, and an access token.

curl -X POST \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/PROJECT_ID/locations/REGION/datasets/DESTINATION_DATASET_ID',
      'config': {
        'image': {
          'redactAllText': true
        }
      }
    }" "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID:deidentify"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME"
}

The response contains an operation name. You can use the Operation get method to track the status of the operation:

curl -X GET \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    -H "Content-Type: application/json; charset=utf-8" \
    "https://healthcare.googleapis.com/v1alpha/projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NAME"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. After the de-identification process finishes, the response contains "done": true.

200 OK
{
  "name": "projects/PROJECT_ID/locations/REGION/datasets/SOURCE_DATASET_ID/operations/OPERATION_NUMBER",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.healthcare.v1alpha.OperationMetadata",
    "apiMethodName": "google.cloud.healthcare.v1alpha.DatasetService.DeidentifyDataset",
    "createTime": "2018-01-01T00:00:00Z",
    "endTime": "2018-01-01T00:00:00Z"
  },
  "done": true,
  "response": {
    "@type": "..."
  }
}

De-identifying FHIR data

When working with a dataset containing FHIR stores, there are three components to a de-identification API call:

  1. The source dataset: A dataset containing FHIR stores with one or more resources that have sensitive data. When you call the deidentify operation, all FHIR resources in all FHIR stores in the dataset are de-identified.
  2. What to de-identify: Configuration parameters that specify how to process the dataset. Custom configurations for the FHIR DeidentifyConfig field are not available. Instead, use an empty FhirConfig inside of the DeidentifyConfig object.
  3. The destination dataset: De-identification does not impact the original dataset or its data. Instead, de-identified copies of the original data are written to a new dataset, called the destination dataset.
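
Because no custom FHIR configuration is available, the request body only needs the destination dataset and an empty FhirConfig. The following Python sketch shows one way to build that body. `build_fhir_deidentify_body` is a hypothetical helper and the identifiers passed to it are placeholders.

```python
import json


def build_fhir_deidentify_body(project_id, region, destination_dataset_id):
    """Return the JSON body for de-identifying a dataset of FHIR stores."""
    destination = "projects/{}/locations/{}/datasets/{}".format(
        project_id, region, destination_dataset_id)
    # An empty "fhir" object stands in for the empty FhirConfig.
    return {"destinationDataset": destination, "config": {"fhir": {}}}


print(json.dumps(
    build_fhir_deidentify_body("my-project", "us-central1", "dest-ds"),
    indent=2))
```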

The following example shows the process of de-identifying a FHIR Patient resource.

Suppose that you have a project with the following settings:

  • Project name: patient-project
  • Project region: us-central1
  • Dataset: patient-dataset
  • FHIR store: patient-fhir-store

Inside the FHIR store is a Patient resource with the following properties:

{
  "birthDate": "1980-12-05",
  "gender": "female",
  "id": "r77433dd-dkeuc-633743nfd-383nfdsjds732",
  "name": [
    {
      "family": "Smith",
      "given": [
        "Darcy"
      ],
      "use": "official"
    }
  ],
  "resourceType": "Patient"
}

To de-identify the resource, you would run the following curl command:

curl -X POST \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    -H "Content-Type: application/json; charset=utf-8" \
    --data "{
      'destinationDataset': 'projects/patient-project/locations/us-central1/datasets/destination-patient-dataset',
      'config': {
        'fhir': {}
      }
    }" "https://healthcare.googleapis.com/v1alpha/projects/patient-project/locations/us-central1/datasets/patient-dataset:deidentify"

Next, using the Patient ID, you can get the details for the Patient resource in the new destination dataset:

curl -X GET \
     -H "Authorization: Bearer "$(gcloud auth print-access-token) \
     -H "Content-Type: application/fhir+json; charset=utf-8" \
     "https://healthcare.googleapis.com/v1alpha/projects/patient-project/locations/us-central1/datasets/destination-patient-dataset/fhirStores/patient-fhir-store/resources/Patient/r77433dd-dkeuc-633743nfd-383nfdsjds732"

The server returns the following response:

{
  "birthDate": "",
  "gender": "unknown",
  "id": "r77433dd-dkeuc-633743nfd-383nfdsjds732",
  "meta": {
    "lastUpdated": "2018-01-01T00:00:00+00:00",
    "versionId": "MTU0MDU4NTcxNjI2MTUxNDAwMA"
  },
  "name": [
    {
      "family": "",
      "given": [
        ""
      ],
      "use": "official"
    }
  ],
  "resourceType": "Patient"
}

You can see that the following values were redacted to de-identify the resource (gender was replaced with "unknown" and the other values were emptied):

  • birthDate
  • gender
  • name.family
  • name.given
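
One way to check this kind of result yourself is to compare the original resource with its de-identified copy. The Python sketch below walks the original resource and reports the dotted paths of fields whose values differ. `changed_fields` is a hypothetical helper written for this example, and it only handles the simple nesting found in the Patient resource above.

```python
def changed_fields(original, deidentified, prefix=""):
    """Return dotted paths of fields in `original` whose values differ."""
    changed = []
    for key, before in original.items():
        path = prefix + key
        after = deidentified.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            changed.extend(changed_fields(before, after, path + "."))
        elif (isinstance(before, list) and isinstance(after, list)
              and len(before) == len(after)
              and all(isinstance(item, dict) for item in before + after)):
            # Lists of objects (such as "name") are compared item by item.
            for old_item, new_item in zip(before, after):
                for sub in changed_fields(old_item, new_item, path + "."):
                    if sub not in changed:
                        changed.append(sub)
        elif before != after:
            changed.append(path)
    return changed


original = {
    "birthDate": "1980-12-05",
    "gender": "female",
    "id": "r77433dd-dkeuc-633743nfd-383nfdsjds732",
    "name": [{"family": "Smith", "given": ["Darcy"], "use": "official"}],
    "resourceType": "Patient",
}
deidentified = {
    "birthDate": "",
    "gender": "unknown",
    "id": "r77433dd-dkeuc-633743nfd-383nfdsjds732",
    "name": [{"family": "", "given": [""], "use": "official"}],
    "resourceType": "Patient",
}
print(changed_fields(original, deidentified))
# prints ['birthDate', 'gender', 'name.family', 'name.given']
```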