Creating a regular custom dictionary detector

Custom dictionaries provide the simple but powerful ability to match a list of words or phrases. You can use a custom dictionary as a detector or as an exception list for built-in detectors. You can also use custom dictionaries to augment built-in infoType detectors to match additional findings.

This section describes how to create a regular custom dictionary detector from a list of words.

Anatomy of a dictionary custom infoType detector

As summarized in API overview, to create a dictionary custom infoType detector, you define a CustomInfoType object that contains the following:

  • The name you want to give the custom infoType detector, within in an InfoType object.
  • An optional Likelihood value. If you omit this field, matches to the dictionary items will return a default likelihood of VERY_LIKELY.
  • Optional DetectionRule objects, or hotword rules. These rules adjust the likelihood of findings within a given proximity of specified hotwords. Learn more about hotword rules in Customizing match likelihood.
  • An optional SensitivityScore value. If you omit this field, matches to the dictionary items will return a default sensitivity level of HIGH.

    Sensitivity scores are used in data profiles. When profiling your data, Cloud DLP uses the sensitivity scores of the infoTypes to calculate the sensitivity level.

  • A Dictionary, as either a WordList containing a list of words to scan for or a CloudStoragePath to a single text file containing a newline-delimited list of words to scan for.

As a JSON object, a dictionary custom infoType detector that includes all optional components looks like the following. This JSON includes a path to a dictionary text file stored in Cloud Storage. To see an inline word list, see the Examples section, later in this topic.

{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"CUSTOM_INFOTYPE_NAME"
      },
      "likelihood":"LIKELIHOOD_LEVEL",
      "detectionRules":[
        {
          "hotwordRule":{
            HOTWORD_RULE
          }
        },
        ...
      ],
      "sensitivityScore":{
          "score": "SENSITIVITY_SCORE"
        },
      "dictionary":
      {
        "cloudStoragePath":
        {
          "path": "gs://PATH_TO_TXT_FILE"
        }
      }
    }
  ],
  ...
}

Dictionary matching specifics

Following is guidance about how Cloud DLP matches dictionary words and phrases. These points apply to both regular and large custom dictionaries:

  • Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
  • All characters—in dictionaries or in content to be scanned—other than letters and digits contained within the Unicode Basic Multilingual Plane are considered as whitespace when scanning for matches. If your dictionary scans for Abby Abernathy, it will match on abby abernathy, Abby, Abernathy, Abby (ABERNATHY), and so on.
  • The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary scans for Abi, it will match the first three characters of Abi904, but not of Abigail.
  • Dictionary words containing many characters that are not letters or digits might result in unexpected findings, because those characters are treated as whitespace.

Examples

Simple word list

Suppose you have data that includes what hospital room a patient was treated in during a visit. These locations may be considered sensitive in a particular data set, but they are not something that would be picked up by Cloud DLP's built-in detectors.

The rooms were listed as:

  • "RM-Orange"
  • "RM-Yellow"
  • "RM-Green"

Protocol

The following example JSON defines a custom dictionary that you could use to de-identify custom room numbers.

JSON input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}

{
  "item":{
    "value":"Patient was seen in RM-YELLOW then transferred to rm green."
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{

            }
          }
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "RM-GREEN",
              "RM-YELLOW",
              "RM-ORANGE"
            ]
          }
        }
      }
    ]
  }
}

JSON output:

When we POST the JSON input to content:deidentify, it returns the following JSON response:

{
  "item":{
    "value":"Patient was seen in [CUSTOM_ROOM_ID] then transferred to [CUSTOM_ROOM_ID]."
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{

          }
        },
        "results":[
          {
            "count":"2",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}

Cloud DLP has correctly identified the room numbers specified in the custom dictionary's WordList message. Note that items are even matched when the case and the hyphen (-) are missing, as in the second example, "rm green."

Java

To learn how to install and use the client library for Cloud DLP, see Cloud DLP client libraries.

To authenticate to Cloud DLP, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.CustomInfoType;
import com.google.privacy.dlp.v2.CustomInfoType.Dictionary;
import com.google.privacy.dlp.v2.CustomInfoType.Dictionary.WordList;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InfoTypeTransformations.InfoTypeTransformation;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.LocationName;
import com.google.privacy.dlp.v2.PrimitiveTransformation;
import com.google.privacy.dlp.v2.ReplaceWithInfoTypeConfig;
import java.io.IOException;

public class DeIdentifyWithSimpleWordList {

  public static void main(String[] args) throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String textToDeIdentify = "Patient was seen in RM-YELLOW then transferred to rm green.";
    deidentifyWithSimpleWordList(projectId, textToDeIdentify);
  }

  public static void deidentifyWithSimpleWordList(String projectId, String textToDeIdentify)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DlpServiceClient dlp = DlpServiceClient.create()) {

      // Specify what content you want the service to DeIdentify.
      ContentItem contentItem = ContentItem.newBuilder().setValue(textToDeIdentify).build();

      // Construct the word list to be detected
      Dictionary wordList =
          Dictionary.newBuilder()
              .setWordList(
                  WordList.newBuilder()
                      .addWords("RM-GREEN")
                      .addWords("RM-YELLOW")
                      .addWords("RM-ORANGE")
                      .build())
              .build();

      // Specify the word list custom info type the inspection will look for.
      InfoType infoType = InfoType.newBuilder().setName("CUSTOM_ROOM_ID").build();
      CustomInfoType customInfoType =
          CustomInfoType.newBuilder().setInfoType(infoType).setDictionary(wordList).build();
      InspectConfig inspectConfig =
          InspectConfig.newBuilder().addCustomInfoTypes(customInfoType).build();

      // Define type of deidentification as replacement.
      PrimitiveTransformation primitiveTransformation =
          PrimitiveTransformation.newBuilder()
              .setReplaceWithInfoTypeConfig(ReplaceWithInfoTypeConfig.getDefaultInstance())
              .build();

      // Associate deidentification type with info type.
      InfoTypeTransformation transformation =
          InfoTypeTransformation.newBuilder()
              .addInfoTypes(infoType)
              .setPrimitiveTransformation(primitiveTransformation)
              .build();

      // Construct the configuration for the Redact request and list all desired transformations.
      DeidentifyConfig deidentifyConfig =
          DeidentifyConfig.newBuilder()
              .setInfoTypeTransformations(
                  InfoTypeTransformations.newBuilder().addTransformations(transformation))
              .build();

      // Combine configurations into a request for the service.
      DeidentifyContentRequest request =
          DeidentifyContentRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString())
              .setItem(contentItem)
              .setInspectConfig(inspectConfig)
              .setDeidentifyConfig(deidentifyConfig)
              .build();

      // Send the request and receive response from the service
      DeidentifyContentResponse response = dlp.deidentifyContent(request);

      // Print the results
      System.out.println(
          "Text after replace with infotype config: " + response.getItem().getValue());
    }
  }
}

C#

To learn how to install and use the client library for Cloud DLP, see Cloud DLP client libraries.

To authenticate to Cloud DLP, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


using System;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Dlp.V2;

public class DeidentifyWithSimpleWordList
{
    public static DeidentifyContentResponse Deidentify(string projectId, string text)
    {
        // Instantiate a client.
        var dlp = DlpServiceClient.Create();

        var contentItem = new ContentItem { Value = text };

        var wordList = new CustomInfoType.Types.Dictionary.Types.WordList
        {
            Words = { new string[] { "RM-GREEN", "RM-YELLOW", "RM-ORANGE" } }
        };

        var infoType = new InfoType
        {
            Name = "CUSTOM_ROOM_ID"
        };

        var customInfoType = new CustomInfoType
        {
            InfoType = infoType,
            Dictionary = new CustomInfoType.Types.Dictionary
            {
                WordList = wordList
            }
        };

        var inspectConfig = new InspectConfig
        {
            CustomInfoTypes =
            {
                customInfoType,
            }
        };
        var primitiveTransformation = new PrimitiveTransformation
        {
            ReplaceWithInfoTypeConfig = new ReplaceWithInfoTypeConfig { }
        };

        var transformation = new InfoTypeTransformations.Types.InfoTypeTransformation
        {
            InfoTypes = { infoType },
            PrimitiveTransformation = primitiveTransformation
        };

        var deidentifyConfig = new DeidentifyConfig
        {
            InfoTypeTransformations = new InfoTypeTransformations
            {
                Transformations = { transformation }
            }
        };

        var request = new DeidentifyContentRequest
        {
            Parent = new LocationName(projectId, "global").ToString(),
            InspectConfig = inspectConfig,
            DeidentifyConfig = deidentifyConfig,
            Item = contentItem
        };

        // Call the API.
        var response = dlp.DeidentifyContent(request);

        // Inspect the results.
        Console.WriteLine($"Deidentified content: {response.Item.Value}");
        return response;
    }
}

Exception list

Suppose you have log data that includes customer identifiers such as email addresses, and you want to redact this information. However, these logs also include the email addresses of internal developers, and you don't want to redact those.

Protocol

The following JSON example creates a custom dictionary that lists a subset of email addresses within the WordList message (jack@example.org and jill@example.org), and assigns them the custom infoType name DEVELOPER_EMAIL. This JSON instructs Cloud DLP to ignore the specified email addresses, while replacing any other email addresses it detects with a string that corresponds to its infoType (in this case, EMAIL_ADDRESS):

JSON input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}

{
  "item":{
    "value":"jack@example.org accessed customer record of user5@example.com"
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{

            }
          },
          "infoTypes":[
            {
              "name":"EMAIL_ADDRESS"
            }
          ]
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"DEVELOPER_EMAIL"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "jack@example.org",
              "jill@example.org"
            ]
          }
        }
      }
    ],
    "infoTypes":[
      {
        "name":"EMAIL_ADDRESS"
      }
    ]
    "ruleSet": [
      {
        "infoTypes": [
          {
            "name": "EMAIL_ADDRESS"
          }
        ],
        "rules": [
          {
            "exclusionRule": {
              "excludeInfoTypes": {
                "infoTypes": [
                  {
                    "name": "DEVELOPER_EMAIL"
                  }
                ]
              },
              "matchingType": "MATCHING_TYPE_FULL_MATCH"
            }
          }
        ]
      }
    ]
  }
}

JSON output:

When we POST this JSON to content:deidentify, it returns the following JSON response:

{
  "item":{
    "value":"jack@example.org accessed customer record of [EMAIL_ADDRESS]"
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"EMAIL_ADDRESS"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{

          }
        },
        "results":[
          {
            "count":"1",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}

The output has correctly identified user1@example.com as matched by the EMAIL_ADDRESS infoType detector and jack@example.org as matched by the DEVELOPER_EMAIL custom infoType detector. Note that because we chose to only transform EMAIL_ADDRESS, jack@example.org was left intact.

Java

To learn how to install and use the client library for Cloud DLP, see Cloud DLP client libraries.

To authenticate to Cloud DLP, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.CustomInfoType;
import com.google.privacy.dlp.v2.CustomInfoType.Dictionary;
import com.google.privacy.dlp.v2.CustomInfoType.Dictionary.WordList;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.ExclusionRule;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InfoTypeTransformations.InfoTypeTransformation;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectionRule;
import com.google.privacy.dlp.v2.InspectionRuleSet;
import com.google.privacy.dlp.v2.LocationName;
import com.google.privacy.dlp.v2.MatchingType;
import com.google.privacy.dlp.v2.PrimitiveTransformation;
import com.google.privacy.dlp.v2.ReplaceWithInfoTypeConfig;
import java.io.IOException;

public class DeIdentifyWithExceptionList {

  public static void main(String[] args) throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String textToDeIdentify = "jack@example.org accessed customer record of user5@example.com";
    deIdentifyWithExceptionList(projectId, textToDeIdentify);
  }

  public static void deIdentifyWithExceptionList(String projectId, String textToDeIdentify)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DlpServiceClient dlp = DlpServiceClient.create()) {

      // Specify what content you want the service to DeIdentify.
      ContentItem contentItem = ContentItem.newBuilder().setValue(textToDeIdentify).build();

      // Construct the custom word list to be detected.
      Dictionary wordList =
          Dictionary.newBuilder()
              .setWordList(
                  WordList.newBuilder()
                      .addWords("jack@example.org")
                      .addWords("jill@example.org")
                      .build())
              .build();

      // Construct the custom dictionary detector associated with the word list.
      InfoType developerEmail = InfoType.newBuilder().setName("DEVELOPER_EMAIL").build();
      CustomInfoType customInfoType =
          CustomInfoType.newBuilder().setInfoType(developerEmail).setDictionary(wordList).build();

      ExclusionRule exclusionRule =
          ExclusionRule.newBuilder()
              .setDictionary(wordList)
              .setMatchingType(MatchingType.MATCHING_TYPE_FULL_MATCH)
              .build();

      InspectionRule inspectionRule =
          InspectionRule.newBuilder()
              .setExclusionRule(exclusionRule)
              .build();

      // Specify the word list custom info type and build-in info type the inspection will look for.
      InfoType emailAddress = InfoType.newBuilder().setName("EMAIL_ADDRESS").build();

      InspectionRuleSet inspectionRuleSet =
          InspectionRuleSet.newBuilder()
              .addInfoTypes(emailAddress)
              .addRules(inspectionRule)
              .build();

      InspectConfig inspectConfig =
          InspectConfig.newBuilder()
              .addInfoTypes(emailAddress)
              .addCustomInfoTypes(customInfoType)
              .addRuleSet(inspectionRuleSet)
              .build();

      // Define type of deidentification as replacement.
      PrimitiveTransformation primitiveTransformation =
          PrimitiveTransformation.newBuilder()
              .setReplaceWithInfoTypeConfig(ReplaceWithInfoTypeConfig.getDefaultInstance())
              .build();

      // Associate de-identification type with info type.
      InfoTypeTransformation transformation =
          InfoTypeTransformation.newBuilder()
              .addInfoTypes(emailAddress)
              .setPrimitiveTransformation(primitiveTransformation)
              .build();

      // Construct the configuration for the de-id request and list all desired transformations.
      DeidentifyConfig deidentifyConfig =
          DeidentifyConfig.newBuilder()
              .setInfoTypeTransformations(
                  InfoTypeTransformations.newBuilder().addTransformations(transformation))
              .build();

      // Combine configurations into a request for the service.
      DeidentifyContentRequest request =
          DeidentifyContentRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString())
              .setItem(contentItem)
              .setInspectConfig(inspectConfig)
              .setDeidentifyConfig(deidentifyConfig)
              .build();

      // Send the request and receive response from the service
      DeidentifyContentResponse response = dlp.deidentifyContent(request);

      // Print the results
      System.out.println(
          "Text after replace with infotype config: " + response.getItem().getValue());
    }
  }
}

C#

To learn how to install and use the client library for Cloud DLP, see Cloud DLP client libraries.

To authenticate to Cloud DLP, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


using System;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Dlp.V2;

public class DeidentifyWithExceptionList
{
    public static DeidentifyContentResponse Deidentify(string projectId, string text)
    {
        // Instantiate a client.
        var dlp = DlpServiceClient.Create();

        var contentItem = new ContentItem { Value = text };

        var wordList = new CustomInfoType.Types.Dictionary.Types.WordList
        {
            Words = { new string[] { "jack@example.org", "jill@example.org" } }
        };

        var exclusionRule = new ExclusionRule
        {
            MatchingType = MatchingType.FullMatch,
            Dictionary = new CustomInfoType.Types.Dictionary
            {
                WordList = wordList
            }
        };

        var infoType = new InfoType { Name = "EMAIL_ADDRESS" };

        var inspectionRuleSet = new InspectionRuleSet
        {
            InfoTypes = { infoType },
            Rules = { new InspectionRule { ExclusionRule = exclusionRule } }
        };

        var inspectConfig = new InspectConfig
        {
            InfoTypes = { infoType },
            RuleSet = { inspectionRuleSet }
        };
        var primitiveTransformation = new PrimitiveTransformation
        {
            ReplaceWithInfoTypeConfig = new ReplaceWithInfoTypeConfig { }
        };

        var transformation = new InfoTypeTransformations.Types.InfoTypeTransformation
        {
            InfoTypes = { infoType },
            PrimitiveTransformation = primitiveTransformation
        };

        var deidentifyConfig = new DeidentifyConfig
        {
            InfoTypeTransformations = new InfoTypeTransformations
            {
                Transformations = { transformation }
            }
        };

        var request = new DeidentifyContentRequest
        {
            Parent = new LocationName(projectId, "global").ToString(),
            InspectConfig = inspectConfig,
            DeidentifyConfig = deidentifyConfig,
            Item = contentItem
        };

        // Call the API.
        var response = dlp.DeidentifyContent(request);

        // Inspect the results.
        Console.WriteLine($"Deidentified content: {response.Item.Value}");
        return response;
    }
}

Augment a built-in infotype detector

Consider a scenario in which a built-in infoType detector isn’t returning the correct values. For example, you want to return matches on person names, but Cloud DLP's built-in PERSON_NAME detector is failing to return matches on some person names that are common in your dataset.

Cloud DLP allows you to augment built-in infoType detectors by including a built-in detector in the declaration for a custom infoType detector, as shown in the following example. This snippet illustrates how to configure Cloud DLP so that the PERSON_NAME built-in infoType detector will additionally match the name “Quasimodo:”

...
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"PERSON_NAME"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "quasimodo"
            ]
          }
        }
      }
    ]
  }
...

What's next

Learn about large custom dictionaries.