Creating a Custom Dictionary Detector

Custom dictionaries let you match content against a list of words or phrases. You can use a custom dictionary as a detector in its own right, or as an exception list for built-in detectors.

Anatomy of a dictionary custom infoType detector

As summarized in API Overview, to create a dictionary custom infoType detector, you define a CustomInfoType object that contains:

  • The name you want to give the custom infoType detector, within an InfoType object.
  • An optional Likelihood value. If you omit this, matches to the dictionary items will return a default likelihood of VERY_LIKELY.
  • Optional DetectionRule objects, or hotword rules. These rules adjust the likelihood of findings within a given proximity of specified hotwords. Learn more about hotword rules in Customizing match likelihood.
  • A Dictionary, as a WordList containing a list of words.

As a JSON object, a dictionary custom infoType detector that includes all optional components looks like this:

{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"custom-infoType-name"
      },
      "likelihood":"likelihood-value",
      "detectionRules":[
        {
          "hotwordRule":{
            HotwordRule-object
          }
        },
        ...
      ],
      "dictionary":{
        "wordList":{
          "words":[
            "dictionary-word1",
            "dictionary-word2",
            "etc."
          ]
        }
      }
    }
  ],
  ...
}
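The structure above can also be assembled programmatically. The following Python sketch is illustrative only: the helper name make_dictionary_info_type and the EXAMPLE_TERM detector name are hypothetical, and the hotword rule contents are passed through untouched because they are not described in this section.

```python
import json


def make_dictionary_info_type(name, words, likelihood=None, hotword_rules=None):
    """Build one entry for the "customInfoTypes" list (hypothetical helper)."""
    custom_info_type = {
        "infoType": {"name": name},
        "dictionary": {"wordList": {"words": list(words)}},
    }
    # Both of these components are optional, so add them only when supplied.
    if likelihood is not None:
        custom_info_type["likelihood"] = likelihood
    if hotword_rules:
        custom_info_type["detectionRules"] = [
            {"hotwordRule": rule} for rule in hotword_rules
        ]
    return custom_info_type


# Assumed example values; replace them with your own detector name and words.
inspect_config = {
    "customInfoTypes": [
        make_dictionary_info_type(
            "EXAMPLE_TERM", ["first phrase", "second phrase"],
            likelihood="VERY_LIKELY",
        )
    ]
}
print(json.dumps(inspect_config, indent=2))
```

Leaving likelihood unset keeps the key out of the JSON entirely, so the default described above applies.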

Stored custom dictionaries

The DLP API also supports stored custom dictionaries for inspecting storage repositories. With a stored custom dictionary, the DLP API can scan for very large lists (for example, hundreds of millions of usernames, email addresses, or any other strings) matched as described in "Dictionary matching specifics." In addition, the DLP API includes built-in functionality for you to programmatically update your stored custom dictionary as needed.

Stored custom dictionaries are built from collections of phrases located in either a Cloud Storage bucket or BigQuery table owned by your organization. The first time you create a stored custom dictionary, you enter the phrases to search on in Cloud Storage or BigQuery, and then use the DLP API to generate the stored custom dictionary. The custom dictionary is stored in Cloud Storage. When you add or remove terms from the Cloud Storage bucket or BigQuery table where you're storing them, you then use the DLP API to update the stored custom dictionary.

For more information about creating stored custom dictionaries, see Creating a Stored Custom Dictionary Detector.

Dictionary matching specifics

The following points describe how the DLP API matches dictionary words and phrases. They apply to both regular and stored custom dictionaries:

  • Dictionary matching is case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
  • All characters (in dictionaries or in content to be scanned) other than letters and digits contained within the Unicode Basic Multilingual Plane are treated as whitespace when scanning for matches. If your dictionary includes Abby Abernathy, it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
  • The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary scans for Abi, it will match the first three characters of Abi904, but not of Abigail.
  • Dictionary words containing a large number of characters that are not letters or digits may result in unexpected findings, because those characters are treated as whitespace.
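To make these rules concrete, here is a rough Python approximation of them. It is a sketch for building intuition, not the DLP API's actual matching algorithm: the function names are hypothetical, and details such as how runs of separator characters are collapsed are assumptions.

```python
def char_kind(ch):
    """Classify a character as 'letter', 'digit', or 'space' (everything else)."""
    if ord(ch) <= 0xFFFF:  # only Basic Multilingual Plane letters/digits count
        if ch.isalpha():
            return "letter"
        if ch.isdigit():
            return "digit"
    return "space"


def normalize(text):
    """Lowercase text and turn each run of non-letter/digit chars into one space.

    Also returns, for every normalized character, its index in the original.
    """
    chars, index = [], []
    for i, ch in enumerate(text):
        if char_kind(ch) == "space":
            if chars and chars[-1] != " ":
                chars.append(" ")
                index.append(i)
        else:
            for low in ch.lower():
                chars.append(low)
                index.append(i)
    if chars and chars[-1] == " ":
        chars.pop()
        index.pop()
    return "".join(chars), index


def find_matches(text, words):
    """Return sorted (start, end, quote) tuples for dictionary hits in text."""
    norm_text, index = normalize(text)
    matches = []
    for word in words:
        norm_word, _ = normalize(word)
        if not norm_word:
            continue
        start = norm_text.find(norm_word)
        while start != -1:
            end = start + len(norm_word)
            # Neighbours must differ in kind from the adjacent matched character.
            before_ok = start == 0 or (
                char_kind(norm_text[start - 1]) != char_kind(norm_word[0]))
            after_ok = end == len(norm_text) or (
                char_kind(norm_text[end]) != char_kind(norm_word[-1]))
            if before_ok and after_ok:
                s, e = index[start], index[end - 1] + 1
                matches.append((s, e, text[s:e]))
            start = norm_text.find(norm_word, start + 1)
    return sorted(matches)


print(find_matches("Abi904 met Abigail", ["Abi"]))
print(find_matches("Seen by Abby (ABERNATHY) today", ["Abby Abernathy"]))
```

In this sketch, "Abi" is reported only inside "Abi904", because the character that follows the match is a digit rather than another letter.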

Dictionary example: Simple word list

Suppose you have data that includes what hospital room a patient was treated in during a visit. These locations may be considered sensitive in a particular data set, but they are not something that would be picked up by the DLP API's built-in detectors.

The rooms were listed as:

  • "RM-Orange"
  • "RM-Yellow"
  • "RM-Green"

The following example JSON defines a custom dictionary that you could use to de-identify these room identifiers.

JSON Input:

{
 "item": {
  "value": "Patient was seen in RM-YELLOW then transferred to rm green."
 },
 "deidentifyConfig": {
  "infoTypeTransformations": {
   "transformations": [
    {
     "primitiveTransformation": {
      "replaceWithInfoTypeConfig": {
      }
     }
    }
   ]
  }
 },
 "inspectConfig": {
  "customInfoTypes": [
   {
    "infoType": {
     "name": "CUSTOM_ROOM_ID"
    },
    "dictionary": {
     "wordList": {
      "words": [
       "RM-GREEN",
       "RM-YELLOW",
       "RM-ORANGE"
      ]
     }
    }
   }
  ]
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:deidentify

JSON Output:

When we POST the JSON input to the projects.content.deidentify method, the DLP API returns the following JSON response:

{
 "item": {
  "value": "Patient was seen in [CUSTOM_ROOM_ID] then transferred to [CUSTOM_ROOM_ID]."
 },
 "overview": {
  "transformedBytes": "17",
  "transformationSummaries": [
   {
    "infoType": {
     "name": "CUSTOM_ROOM_ID"
    },
    "transformation": {
     "replaceWithInfoTypeConfig": {
     }
    },
    "results": [
     {
      "count": "2",
      "code": "SUCCESS"
     }
    ],
    "transformedBytes": "17"
   }
  ]
 }
}

The DLP API has correctly identified the room identifiers specified in the custom dictionary's WordList message. Note that items are matched even when the case differs and the hyphen ("-") is missing, as in the second match, "rm green."
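For readers who want to issue this request from code, the following Python sketch prepares the POST with the standard library only. It builds the request object but does not send it. The project ID and access token are placeholders, the function name is hypothetical, and the use of an OAuth 2.0 bearer token in the Authorization header is an assumption about how your environment authenticates.

```python
import json
import urllib.request


def build_deidentify_request(project_id, access_token, text, info_type_name, words):
    """Prepare (but do not send) a content:deidentify POST (hypothetical helper)."""
    url = f"https://dlp.googleapis.com/v2/projects/{project_id}/content:deidentify"
    body = {
        "item": {"value": text},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {"primitiveTransformation": {"replaceWithInfoTypeConfig": {}}}
                ]
            }
        },
        "inspectConfig": {
            "customInfoTypes": [
                {
                    "infoType": {"name": info_type_name},
                    "dictionary": {"wordList": {"words": list(words)}},
                }
            ]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",  # placeholder credential
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_deidentify_request(
    "my-project",                # assumed project ID
    "ACCESS_TOKEN_PLACEHOLDER",  # assumed token; obtain a real one separately
    "Patient was seen in RM-YELLOW then transferred to rm green.",
    "CUSTOM_ROOM_ID",
    ["RM-GREEN", "RM-YELLOW", "RM-ORANGE"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared object to urllib.request.urlopen would perform the call; that step is left out here because it needs valid credentials.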

Dictionary example: Exception list

Suppose you have log data that includes customer identifiers such as email addresses, and you want to redact this information. However, these logs also include the email addresses of internal developers, and you don't want to redact those.

The following JSON example creates a custom dictionary that lists a subset of email addresses within the WordList message (jack@example.org and jill@example.org), and assigns them the custom infoType name DEVELOPER_EMAIL. This JSON instructs the DLP API to leave the specified email addresses untouched, while replacing any other email address it detects with a string that corresponds to its infoType (in this case, EMAIL_ADDRESS):

JSON Input:

{
 "item": {
  "value": "jack@example.org accessed customer record of user5@example.com"
 },
 "deidentifyConfig": {
  "infoTypeTransformations": {
   "transformations": [
    {
     "primitiveTransformation": {
      "replaceWithInfoTypeConfig": {
      }
     },
     "infoTypes": [
      {
       "name": "EMAIL_ADDRESS"
      }
     ]
    }
   ]
  }
 },
 "inspectConfig": {
  "customInfoTypes": [
   {
    "infoType": {
     "name": "DEVELOPER_EMAIL"
    },
    "dictionary": {
     "wordList": {
      "words": [
       "jack@example.org",
       "jill@example.org"
      ]
     }
    }
   }
  ],
  "infoTypes": [
   {
    "name": "EMAIL_ADDRESS"
   }
  ]
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:deidentify

JSON Output:

When we POST this JSON to the projects.content.deidentify method, the method returns the following JSON response:

{
 "item": {
  "value": "jack@example.org accessed customer record of [EMAIL_ADDRESS]"
 },
 "overview": {
  "transformedBytes": "17",
  "transformationSummaries": [
   {
    "infoType": {
     "name": "EMAIL_ADDRESS"
    },
    "transformation": {
     "replaceWithInfoTypeConfig": {
     }
    },
    "results": [
     {
      "count": "1",
      "code": "SUCCESS"
     }
    ],
    "transformedBytes": "17"
   }
  ]
 }
}

The output shows that the DLP API matched user5@example.com with the EMAIL_ADDRESS infoType detector and jack@example.org with the DEVELOPER_EMAIL custom infoType detector. Because the transformation applies only to EMAIL_ADDRESS, jack@example.org was left intact.
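The exception-list effect can be illustrated locally with a small Python sketch. It does not call the DLP API or model how the service resolves overlapping findings. It takes the two findings reported above as given (hard-coded here) and applies the replacement only to the infoTypes named in the transformation. The function name is hypothetical.

```python
def replace_with_info_type(text, findings, transform_types):
    """Replace findings whose infoType is in transform_types with "[NAME]".

    findings: non-overlapping (start, end, info_type) tuples.
    Returns the new text and the number of UTF-8 bytes that were replaced.
    """
    pieces, cursor, transformed_bytes = [], 0, 0
    for start, end, info_type in sorted(findings):
        if info_type not in transform_types:
            continue  # e.g. DEVELOPER_EMAIL findings are left untouched
        pieces.append(text[cursor:start])
        pieces.append(f"[{info_type}]")
        transformed_bytes += len(text[start:end].encode("utf-8"))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), transformed_bytes


text = "jack@example.org accessed customer record of user5@example.com"
dev, customer = "jack@example.org", "user5@example.com"
# Findings copied from the example above rather than detected by this code.
findings = [
    (text.index(dev), text.index(dev) + len(dev), "DEVELOPER_EMAIL"),
    (text.index(customer), text.index(customer) + len(customer), "EMAIL_ADDRESS"),
]
result = replace_with_info_type(text, findings, {"EMAIL_ADDRESS"})
print(result)
```

Under these assumptions the sketch reproduces the de-identified string and the transformedBytes value of 17 from the response above, since user5@example.com is 17 bytes long.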
