Custom dictionaries provide the simple but powerful ability to match a list of words or phrases. You can use a custom dictionary as a detector or as an exception list for built-in detectors. You can also use custom dictionaries to augment built-in infoType detectors to match additional findings.
This section describes how to create a regular custom dictionary detector from a list of words.
Anatomy of a dictionary custom infoType detector
As summarized in API overview, to create a dictionary custom infoType detector, you define a CustomInfoType object that contains the following:
- The name you want to give the custom infoType detector, within an InfoType object.
- An optional Likelihood value. If you omit this field, matches to the dictionary items return a default likelihood of VERY_LIKELY.
- Optional DetectionRule objects, or hotword rules. These rules adjust the likelihood of findings within a given proximity of specified hotwords. Learn more about hotword rules in Customizing match likelihood.
- An optional SensitivityScore value. If you omit this field, matches to the dictionary items return a default sensitivity level of HIGH. Sensitivity scores are used in data profiles. When profiling your data, Sensitive Data Protection uses the sensitivity scores of the infoTypes to calculate the sensitivity level.
- A Dictionary, as either a WordList containing a list of words to scan for or a CloudStoragePath to a single text file containing a newline-delimited list of words to scan for.
As a JSON object, a dictionary custom infoType detector that includes all optional components looks like the following. This JSON includes a path to a dictionary text file stored in Cloud Storage. To see an inline word list, see the Examples section, later in this topic.
{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"CUSTOM_INFOTYPE_NAME"
      },
      "likelihood":"LIKELIHOOD_LEVEL",
      "detectionRules":[
        {
          "hotwordRule":{
            HOTWORD_RULE
          }
        },
        ...
      ],
      "sensitivityScore":{
        "score": "SENSITIVITY_SCORE"
      },
      "dictionary":{
        "cloudStoragePath":{
          "path": "gs://PATH_TO_TXT_FILE"
        }
      }
    }
  ],
  ...
}
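The structure above can also be assembled programmatically. The following is a minimal Python sketch, not an official sample, that builds the same JSON shape with only the standard library. The function name, the example infoType name, and the bucket path are hypothetical. Hotword rules are passed through unchanged because their contents are not covered in this section, and the string values for likelihood and sensitivity score are assumed to be valid enum names.

```python
import json
from typing import Optional


def build_dictionary_info_type(
    name: str,
    cloud_storage_path: str,
    likelihood: Optional[str] = None,
    sensitivity_score: Optional[str] = None,
    detection_rules: Optional[list] = None,
) -> dict:
    """Return a dict shaped like the customInfoTypes JSON shown above.

    Optional components are included only when a value is supplied, so the
    service defaults described earlier apply to anything left out.
    """
    custom_info_type = {
        "infoType": {"name": name},
        "dictionary": {"cloudStoragePath": {"path": cloud_storage_path}},
    }
    if likelihood is not None:
        custom_info_type["likelihood"] = likelihood
    if sensitivity_score is not None:
        custom_info_type["sensitivityScore"] = {"score": sensitivity_score}
    if detection_rules:
        # Caller-supplied hotword rules, passed through as-is.
        custom_info_type["detectionRules"] = list(detection_rules)
    return {"customInfoTypes": [custom_info_type]}


if __name__ == "__main__":
    # Hypothetical infoType name and bucket path, for illustration only.
    config = build_dictionary_info_type(
        "MY_CUSTOM_TYPE",
        "gs://example-bucket/words.txt",
        likelihood="LIKELY",
    )
    print(json.dumps(config, indent=2))
```

Leaving an optional argument as None omits the corresponding key, which keeps the request minimal and relies on the documented defaults.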
Dictionary matching specifics
Following is guidance about how Sensitive Data Protection matches dictionary words and phrases. These points apply to both regular and large custom dictionaries:
- Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
- All characters, in dictionaries or in content to be scanned, other than letters, digits, and other alphabetic characters contained within the Unicode Basic Multilingual Plane are considered as whitespace when scanning for matches. If your dictionary scans for Abby Abernathy, it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
- The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary scans for Abi, it will match the first three characters of Abi904, but not of Abigail.
- Dictionary words containing characters in the Supplementary Multilingual Plane of the Unicode standard can yield unexpected findings. Examples of such characters are emojis, scientific symbols, and historical scripts.
Letters, digits, and other alphabetic characters are defined as follows:
- Letters: characters with general categories Lu, Ll, Lt, Lm, or Lo in the Unicode specification
- Digits: characters with general category Nd in the Unicode specification
- Other alphabetic characters: characters with general category Nl in the Unicode specification or with contributory property Other_Alphabetic as defined by the Unicode Standard
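To build intuition for these rules, the following Python sketch approximates them locally. It is an illustration under stated assumptions, not the service's implementation: the function names are hypothetical, the Other_Alphabetic property is ignored because Python's unicodedata module does not expose it, and every character outside the Basic Multilingual Plane is simply treated as a separator.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def char_type(ch: str):
    """Classify one character as 'letter', 'digit', 'other', or None (separator)."""
    if ord(ch) > 0xFFFF:  # outside the Basic Multilingual Plane
        return None
    category = unicodedata.category(ch)
    if category in LETTER_CATEGORIES:
        return "letter"
    if category == "Nd":
        return "digit"
    if category == "Nl":
        return "other"
    return None


def normalize(text: str) -> str:
    """Lowercase the text and collapse every run of separators into one space."""
    chars = [ch if char_type(ch) else " " for ch in text.lower()]
    return " ".join("".join(chars).split())


def dictionary_matches(phrase: str, content: str) -> bool:
    """Approximate whether a dictionary phrase would match inside content."""
    needle, haystack = normalize(phrase), normalize(content)
    if not needle:
        return False
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        before = haystack[start - 1] if start > 0 else " "
        after = haystack[end] if end < len(haystack) else " "
        # The neighbors of a match must differ in type from the adjacent
        # characters inside the match (a separator always differs).
        if (char_type(before) != char_type(needle[0])
                and char_type(after) != char_type(needle[-1])):
            return True
        start = haystack.find(needle, start + 1)
    return False


if __name__ == "__main__":
    print(dictionary_matches("Abby Abernathy", "Abby (ABERNATHY)"))
    print(dictionary_matches("Abi", "Abi904"))
    print(dictionary_matches("Abi", "Abigail"))
```

Under these assumptions the sketch reproduces the examples in the list above: the Abby Abernathy variants and Abi904 match, while Abigail does not.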
Examples
Simple word list
Suppose you have data that includes what hospital room a patient was treated in during a visit. These locations may be considered sensitive in a particular data set, but they are not something that would be picked up by Sensitive Data Protection's built-in detectors.
The rooms were listed as:
- "RM-Orange"
- "RM-Yellow"
- "RM-Green"
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
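As a rough Python sketch of this scenario, the code below builds a de-identification request for the room list and sends it with the google-cloud-dlp client library. It is not the official sample and has not been run against the service. The function names are hypothetical, the field names assume the client library's snake_case convention, and the project ID is supplied by the caller. The client library is imported lazily so that the request builder can be inspected without it installed.

```python
from typing import Iterable


def build_deidentify_request(
    project_id: str,
    text: str,
    words: Iterable[str],
    info_type_name: str = "CUSTOM_ROOM_ID",
) -> dict:
    """Build a request that replaces dictionary matches with the infoType name."""
    return {
        "parent": f"projects/{project_id}",
        "inspect_config": {
            "custom_info_types": [
                {
                    "info_type": {"name": info_type_name},
                    "dictionary": {"word_list": {"words": list(words)}},
                }
            ]
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": text},
    }


def deidentify_rooms(project_id: str, text: str) -> str:
    """Send the request; needs google-cloud-dlp and Application Default Credentials."""
    # Imported here so build_deidentify_request works without the library.
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    request = build_deidentify_request(
        project_id, text, ["RM-GREEN", "RM-YELLOW", "RM-ORANGE"]
    )
    response = client.deidentify_content(request=request)
    return response.item.value
```

Calling deidentify_rooms with your project ID and a sentence that mentions one of the rooms should return the text with each matched room replaced by the custom infoType name.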
REST
The following example JSON defines a custom dictionary that you could use to de-identify custom room numbers.
JSON input:
POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}
{
  "item":{
    "value":"Patient was seen in RM-YELLOW then transferred to rm green."
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{
            }
          }
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "RM-GREEN",
              "RM-YELLOW",
              "RM-ORANGE"
            ]
          }
        }
      }
    ]
  }
}
JSON output:
When we POST the JSON input to content:deidentify, it returns the following JSON response:
{
  "item":{
    "value":"Patient was seen in [CUSTOM_ROOM_ID] then transferred to [CUSTOM_ROOM_ID]."
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"CUSTOM_ROOM_ID"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{
          }
        },
        "results":[
          {
            "count":"2",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}
Sensitive Data Protection has correctly identified the room numbers specified in the custom dictionary's WordList message. Note that items are matched even when the case differs and the hyphen (-) is missing, as in the second example, "rm green."
Exception list
Suppose you have log data that includes customer identifiers such as email addresses, and you want to redact this information. However, these logs also include the email addresses of internal developers, and you don't want to redact those.
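The exception-list pattern combines three pieces in one request: a custom dictionary of exempt addresses, the built-in EMAIL_ADDRESS detector, and an exclusion rule that drops EMAIL_ADDRESS findings that fully match a dictionary entry. The following Python sketch assembles a content:deidentify REST request body with that shape using only the standard library. The function name and its parameters are hypothetical, and sending the body to the API is left out.

```python
import json
from typing import Iterable


def build_exception_list_body(
    text: str,
    exempt_emails: Iterable[str],
    exception_name: str = "DEVELOPER_EMAIL",
    target_name: str = "EMAIL_ADDRESS",
) -> dict:
    """Build a request body that redacts target_name findings except exempt ones."""
    return {
        "item": {"value": text},
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "primitiveTransformation": {"replaceWithInfoTypeConfig": {}},
                        # Only findings of the built-in infoType are transformed.
                        "infoTypes": [{"name": target_name}],
                    }
                ]
            }
        },
        "inspectConfig": {
            "customInfoTypes": [
                {
                    "infoType": {"name": exception_name},
                    "dictionary": {"wordList": {"words": list(exempt_emails)}},
                }
            ],
            "infoTypes": [{"name": target_name}],
            "ruleSet": [
                {
                    "infoTypes": [{"name": target_name}],
                    "rules": [
                        {
                            "exclusionRule": {
                                "excludeInfoTypes": {
                                    "infoTypes": [{"name": exception_name}]
                                },
                                "matchingType": "MATCHING_TYPE_FULL_MATCH",
                            }
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    body = build_exception_list_body(
        "jack@example.org accessed customer record of user5@example.com",
        ["jack@example.org", "jill@example.org"],
    )
    print(json.dumps(body, indent=2))
```

Building the body as a Python dict and serializing it with json.dumps avoids hand-editing mistakes such as a missing comma between adjacent keys.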
REST
The following JSON example creates a custom dictionary that lists a subset of email addresses within the WordList message (jack@example.org and jill@example.org), and assigns them the custom infoType name DEVELOPER_EMAIL. This JSON instructs Sensitive Data Protection to ignore the specified email addresses, while replacing any other email addresses it detects with a string that corresponds to its infoType (in this case, EMAIL_ADDRESS):
JSON input:
POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:deidentify?key={YOUR_API_KEY}
{
  "item":{
    "value":"jack@example.org accessed customer record of user5@example.com"
  },
  "deidentifyConfig":{
    "infoTypeTransformations":{
      "transformations":[
        {
          "primitiveTransformation":{
            "replaceWithInfoTypeConfig":{
            }
          },
          "infoTypes":[
            {
              "name":"EMAIL_ADDRESS"
            }
          ]
        }
      ]
    }
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"DEVELOPER_EMAIL"
        },
        "dictionary":{
          "wordList":{
            "words":[
              "jack@example.org",
              "jill@example.org"
            ]
          }
        }
      }
    ],
    "infoTypes":[
      {
        "name":"EMAIL_ADDRESS"
      }
    ],
    "ruleSet": [
      {
        "infoTypes": [
          {
            "name": "EMAIL_ADDRESS"
          }
        ],
        "rules": [
          {
            "exclusionRule": {
              "excludeInfoTypes": {
                "infoTypes": [
                  {
                    "name": "DEVELOPER_EMAIL"
                  }
                ]
              },
              "matchingType": "MATCHING_TYPE_FULL_MATCH"
            }
          }
        ]
      }
    ]
  }
}
JSON output:
When we POST this JSON to content:deidentify, it returns the following JSON response:
{
  "item":{
    "value":"jack@example.org accessed customer record of [EMAIL_ADDRESS]"
  },
  "overview":{
    "transformedBytes":"17",
    "transformationSummaries":[
      {
        "infoType":{
          "name":"EMAIL_ADDRESS"
        },
        "transformation":{
          "replaceWithInfoTypeConfig":{
          }
        },
        "results":[
          {
            "count":"1",
            "code":"SUCCESS"
          }
        ],
        "transformedBytes":"17"
      }
    ]
  }
}
The output has correctly identified user5@example.com as matched by the EMAIL_ADDRESS infoType detector and jack@example.org as matched by the DEVELOPER_EMAIL custom infoType detector. Note that because we chose to only transform EMAIL_ADDRESS, jack@example.org was left intact.
Augment a built-in infoType detector
Consider a scenario in which a built-in infoType detector isn't returning the correct values. For example, you want to return matches on person names, but Sensitive Data Protection's built-in PERSON_NAME detector is failing to return matches on some person names that are common in your dataset.
Sensitive Data Protection allows you to augment built-in infoType detectors by including a built-in detector in the declaration for a custom infoType detector, as shown in the following example. This snippet illustrates how to configure Sensitive Data Protection so that the PERSON_NAME built-in infoType detector will additionally match the name "Quasimodo":
REST
...
"inspectConfig":{
  "customInfoTypes":[
    {
      "infoType":{
        "name":"PERSON_NAME"
      },
      "dictionary":{
        "wordList":{
          "words":[
            "quasimodo"
          ]
        }
      }
    }
  ]
}
...
What's next
Learn about large custom dictionaries.