This topic describes how to create and rebuild large custom dictionaries. It also covers several error scenarios.
When to choose a large custom dictionary over a regular custom dictionary
Regular custom dictionary detectors are sufficient when you have tens of thousands of sensitive words or phrases that you want to scan your content for. If you have more or if your term list changes frequently, consider creating a large custom dictionary, which can support tens of millions of terms.
How large custom dictionaries differ from other custom infoTypes
Large custom dictionaries are different from other custom infoTypes in that each large custom dictionary has two components:
- A list of phrases that you create and define. The list is stored as either a text file within Cloud Storage or a column in a BigQuery table.
- The dictionary files, which Sensitive Data Protection generates and stores in Cloud Storage. Dictionary files are composed of a copy of your term list plus bloom filters, which aid in searching and matching.
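To give a feel for what a bloom filter does, the following is a minimal, illustrative sketch in Python. It is not how Sensitive Data Protection builds its dictionary files; the class name, the bit-array size, and the hashing scheme are all assumptions chosen for brevity. The point is only that a bloom filter can say "definitely not in the list" cheaply, and "possibly in the list" with a small chance of a false positive.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

A term that was added always reports as possibly present, which is why a filter like this is typically paired with a copy of the term list to confirm candidate matches.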
Create a large custom dictionary
This section describes how to create, edit, and rebuild a large custom dictionary.
Create a term list
Create a list that contains all the words and phrases that you want the new infoType detector to search for. Do one of the following:
- Place a text file with each word or phrase on its own line into a Cloud Storage bucket.
- Designate one column of a BigQuery table as the container for the words and phrases. Give each entry its own row in the column. You can use an existing BigQuery table, as long as all dictionary words and phrases are in a single column.
It's possible to assemble a term list that is too large for Sensitive Data Protection to process. If you see an error message, see Troubleshoot errors later in this topic.
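As a rough illustration of the text-file option, the following Python sketch writes a term list with one word or phrase per line. The function name and the choice to drop blank entries and exact duplicates are assumptions made for this example; uploading the resulting file to a Cloud Storage bucket is a separate step that is not shown.

```python
def write_term_list(terms, path):
    """Write one term per line and return how many terms were written."""
    seen = set()
    written = 0
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for term in terms:
            # Collapse inner whitespace so a term never spans multiple lines.
            cleaned = " ".join(term.split())
            if not cleaned or cleaned in seen:
                continue  # skip blanks and exact duplicates
            seen.add(cleaned)
            handle.write(cleaned + "\n")
            written += 1
    return written
```

For example, calling `write_term_list(["Abby Abernathy", "  ", "Quinn"], "terms.txt")` would produce a two-line file.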
Create a stored infoType
After you create your term list, use Sensitive Data Protection to create a dictionary:
Console
1. In a Cloud Storage bucket, create a new folder where Sensitive Data Protection will store the generated dictionary.
   Sensitive Data Protection creates folders containing the dictionary files at the location that you specify.
2. In the Google Cloud console, go to the Create infoType page.
3. For Type, select Large custom dictionary.
4. For InfoType ID, enter an identifier for the stored infoType.
   You will use this identifier when configuring your inspection and de-identification jobs. You can use letters, numbers, hyphens, and underscores in the name.
5. For InfoType display name, enter a name for your stored infoType.
   You can use spaces and punctuation in the name.
6. For Description, enter a description of what your stored infoType detects.
7. For Storage type, select the location of your term list:
   - BigQuery: Enter the project ID, dataset ID, and table ID. In the Field name field, enter the column identifier. You can designate at most one column from the table.
   - Google Cloud Storage: Enter the path to the file.
8. For Output bucket or folder, enter the Cloud Storage location of the folder that you created in step 1.
9. Click Create.
   A summary of the stored infoType appears. When the dictionary is generated and the new stored infoType is ready to use, the status of the infoType shows Ready.
Client libraries (C#, Go, Java, Node.js, PHP, Python)
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
REST
- Create a new folder for the dictionary in a Cloud Storage bucket. Sensitive Data Protection creates folders containing the dictionary files at the location that you specify.
- Create the dictionary using the storedInfoTypes.create method. The create method takes the following parameters:
  - A StoredInfoTypeConfig object, which contains the configuration of the stored infoType. It includes:
    - description: A description of the dictionary.
    - displayName: The name you want to give the dictionary.
    - LargeCustomDictionaryConfig: Contains the configuration of the large custom dictionary. It includes:
      - BigQueryField: Specified if your term list is stored in BigQuery. Includes a reference to the table that your list is stored in, plus the field that contains each dictionary phrase.
      - CloudStorageFileSet: Specified if your term list is stored in Cloud Storage. Includes the URL to the source location in Cloud Storage, in the following form: "gs://[PATH_TO_GS]". Wildcards are supported.
      - outputPath: The path to the location in a Cloud Storage bucket to store the created dictionary.
  - storedInfoTypeId: The identifier for the stored infoType. You use this identifier to refer to the stored infoType when you rebuild it, delete it, or use it in an inspection or de-identification job. If you leave this field empty, the system generates an identifier for you.
Following is example JSON that, when sent to the storedInfoTypes.create method, creates a new stored infoType, specifically a large custom dictionary detector. This example creates a stored infoType from a term list stored in a publicly available BigQuery database (bigquery-public-data.samples.github_nested). The database contains all GitHub usernames used in commits. The output path for the generated dictionary is a Cloud Storage location, shown in the JSON as the placeholder gs://[PATH_TO_GS] (for example, a path in a bucket called dlptesting), and the stored infoType is named github-usernames.
JSON input
POST https://dlp.googleapis.com/v2/projects/PROJECT_ID/storedInfoTypes
{
  "config":{
    "displayName":"GitHub usernames",
    "description":"Dictionary of GitHub usernames used in commits",
    "largeCustomDictionary":{
      "outputPath":{
        "path":"gs://[PATH_TO_GS]"
      },
      "bigQueryField":{
        "table":{
          "datasetId":"samples",
          "projectId":"bigquery-public-data",
          "tableId":"github_nested"
        }
      }
    }
  },
  "storedInfoTypeId":"github-usernames"
}
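The request body above can also be assembled programmatically. The following Python sketch is one hedged way to do that with only the standard library; the function and argument names are hypothetical. Note one assumption in particular: the parameter description earlier says that BigQueryField includes the field that contains each phrase, so the sketch adds a `field` object with a `name` key for the column, which the example JSON above does not show.

```python
import json


def build_create_request(stored_info_type_id, display_name, description,
                         output_path, bq_project, bq_dataset, bq_table, bq_column):
    """Return a dict shaped like the storedInfoTypes.create body shown above."""
    return {
        "config": {
            "displayName": display_name,
            "description": description,
            "largeCustomDictionary": {
                "outputPath": {"path": output_path},
                "bigQueryField": {
                    "table": {
                        "projectId": bq_project,
                        "datasetId": bq_dataset,
                        "tableId": bq_table,
                    },
                    # Assumption: the term column is named under "field";
                    # this key does not appear in the example JSON above.
                    "field": {"name": bq_column},
                },
            },
        },
        "storedInfoTypeId": stored_info_type_id,
    }


def to_json(body):
    """Serialize the body the way it would be sent in the POST request."""
    return json.dumps(body, indent=2)
```

The column name passed as `bq_column` is a placeholder that you would replace with the column holding your terms.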
Rebuild the dictionary
If you want to update your dictionary, you first update your source term list, and then you instruct Sensitive Data Protection to rebuild the stored infoType.
- Update the existing source term list in either Cloud Storage or BigQuery. Add, remove, or change the terms or phrases as needed.
- Create a new version of the stored infoType by "rebuilding" it using either the Google Cloud console or the storedInfoTypes.patch method. Rebuilding creates a new version of the dictionary, which replaces the old dictionary.
When the rebuild finishes, the new version replaces the old version, which is deleted. While Sensitive Data Protection is updating the stored infoType, its status is "pending." During this time, the old version of the stored infoType still exists, and any scans that you run use the old version.
To rebuild the stored infoType:
Console
Update and save your term list in either Cloud Storage or BigQuery.
In the Google Cloud console, go to your list of stored infoTypes.
Click the ID of the stored infoType that you want to update.
On the InfoType details page, click Rebuild data.
Sensitive Data Protection rebuilds the stored infoType with the changes you made to the source term list. Once the status of the stored infoType is "Ready," you can use it. Any templates or job triggers that use the stored infoType will automatically use the rebuilt version.
REST
Update the term list
If you're updating only the list of terms in the large custom dictionary, your storedInfoTypes.patch request requires only the name field. Provide the full resource name of the stored infoType that you want to rebuild.
The following patterns represent valid entries for the name field:
organizations/ORGANIZATION_ID/storedInfoTypes/STORED_INFOTYPE_ID
projects/PROJECT_ID/storedInfoTypes/STORED_INFOTYPE_ID
Replace STORED_INFOTYPE_ID with the identifier of the stored infoType that you want to rebuild.
If you don't know the identifier of the stored infoType, call the storedInfoTypes.list method to view a list of all current stored infoTypes.
Example
PATCH https://dlp.googleapis.com/v2/projects/PROJECT_ID/storedInfoTypes/STORED_INFOTYPE_ID
In this case, a request body isn't required.
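As a small illustration of the name patterns above, the following Python sketch builds a full resource name and the matching rebuild request. The helper names are hypothetical, and the check is deliberately loose: it only verifies the overall shape of the name, not whether the IDs exist.

```python
import re

# Matches the two name patterns listed above.
NAME_PATTERN = re.compile(r"^(organizations|projects)/[^/]+/storedInfoTypes/[^/]+$")


def stored_info_type_name(parent_type, parent_id, stored_info_type_id):
    """Build the full resource name, rejecting anything that breaks the pattern."""
    name = f"{parent_type}/{parent_id}/storedInfoTypes/{stored_info_type_id}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"not a valid stored infoType name: {name!r}")
    return name


def rebuild_request(name):
    """Return (HTTP method, URL, body) for a rebuild with an unchanged config."""
    # No request body is needed when only the term list contents changed.
    return ("PATCH", f"https://dlp.googleapis.com/v2/{name}", None)
```

Sending the request still requires an authenticated HTTP client, which is outside the scope of this sketch.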
Switch the source term list
You can change the source term list for a stored infoType from one stored in BigQuery to one stored in Cloud Storage. Use the storedInfoTypes.patch method, but include a CloudStorageFileSet object in LargeCustomDictionaryConfig where you'd used a BigQueryField object before. Then, set the updateMask parameter to the stored infoType parameter that you changed, in FieldMask format. For instance, the following JSON states in the updateMask parameter that the URL of the Cloud Storage path has been updated (large_custom_dictionary.cloud_storage_file_set.url):
Example
PATCH https://dlp.googleapis.com/v2/projects/PROJECT_ID/storedInfoTypes/github-usernames
{
  "config":{
    "largeCustomDictionary":{
      "cloudStorageFileSet":{
        "url":"gs://[BUCKET_NAME]/[PATH_TO_FILE]"
      }
    }
  },
  "updateMask":"large_custom_dictionary.cloud_storage_file_set.url"
}
Similarly, you can switch your term list from one stored in a Cloud Storage bucket to one stored in a BigQuery table.
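The relationship between the changed config fields and the updateMask value can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, and the camelCase-to-snake_case conversion is inferred from the single example above, not from a documented rule.

```python
import re


def to_snake_case(name):
    """Convert a camelCase key such as cloudStorageFileSet to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def field_mask_paths(config, prefix=""):
    """List the dotted path of every leaf field present in a nested config dict."""
    paths = []
    for key, value in config.items():
        path = prefix + to_snake_case(key)
        if isinstance(value, dict) and value:
            paths.extend(field_mask_paths(value, path + "."))
        else:
            paths.append(path)
    return paths


def build_switch_source_request(cloud_storage_url):
    """Build a patch body that points the dictionary at a Cloud Storage term list."""
    config = {"largeCustomDictionary": {"cloudStorageFileSet": {"url": cloud_storage_url}}}
    return {"config": config, "updateMask": ",".join(field_mask_paths(config))}
```

Deriving the mask from the config keeps the two in step, so a field is never sent without also being named in updateMask.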
Scan content using a large custom dictionary detector
Scanning content using a large custom dictionary detector is similar to scanning content using any other custom infoType detector.
This procedure assumes that you have an existing stored infoType. For more information, see Create a stored infoType on this page.
Console
You can apply a large custom dictionary detector when doing the following:
- Creating a new job
- Creating or editing a job trigger
- Creating or editing a template
- Configuring data profiling
In the Configure detection section of the page, in the InfoTypes subsection, you can specify your large custom dictionary infoType.
- Click Manage infoTypes.
- In the InfoTypes pane, click the Custom tab.
- Click Add custom infoType.
In the Add custom infoType pane, do the following:
- For Type, select Stored infoType.
- For InfoType, enter a name for the custom infoType. You can use letters, numbers, and underscores.
- For Likelihood, select the default likelihood level that you want to assign to all findings that match this custom infoType. You can further fine-tune the likelihood level of individual findings by using hotword rules.
  If you don't specify a default value, the default likelihood level is set to VERY_LIKELY. For more information, see Match likelihood.
- For Sensitivity, select the sensitivity level that you want to assign to all findings that match this custom infoType. If you don't specify a value, the sensitivity levels of those findings are set to HIGH.
  Sensitivity scores are used in data profiles. When profiling your data, Sensitive Data Protection uses the sensitivity scores of the infoTypes to calculate the sensitivity level.
- For Stored infoType name, select the stored infoType that you want to base the new custom infoType on.
Click Done to close the Add custom infoType pane.
Optional: On the Built-in tab, edit your selection of built-in infoTypes.
Click Done to close the InfoTypes pane.
The custom infoType is added to the list of infoTypes that Sensitive Data Protection scans for. However, this selection isn't final until you save the job, job trigger, template, or scan configuration.
When you're done creating or editing the configuration, click Save.
REST
When sent to the content.inspect method, the following example scans the given text using the specified stored infoType detector. The infoType parameter is required because all custom infoTypes must have a name that doesn't conflict with built-in infoTypes or other custom infoTypes. The storedType parameter contains the full resource path of the stored infoType.
JSON input
POST https://dlp.googleapis.com/v2/projects/PROJECT_ID/content:inspect
{
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"GITHUB_LOGINS"
        },
        "storedType":{
          "name":"projects/PROJECT_ID/storedInfoTypes/github-logins"
        }
      }
    ]
  },
  "item":{
    "value":"The commit was made by githubuser."
  }
}
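The inspection request above can be built in code as well. The following Python sketch is a hedged example with hypothetical helper names; it takes a list of (custom infoType name, stored infoType resource path) pairs and rejects duplicate names, since custom infoType names must not conflict with each other. It does not check for conflicts with built-in infoType names.

```python
def build_inspect_request(detectors, text):
    """Build a content.inspect body from (info_type_name, stored_type_path) pairs."""
    names = [name for name, _ in detectors]
    if len(set(names)) != len(names):
        raise ValueError("custom infoType names must be unique")
    return {
        "inspectConfig": {
            "customInfoTypes": [
                {"infoType": {"name": name}, "storedType": {"name": path}}
                for name, path in detectors
            ]
        },
        "item": {"value": text},
    }
```

For instance, passing `[("GITHUB_LOGINS", "projects/PROJECT_ID/storedInfoTypes/github-logins")]` and the sample sentence reproduces the JSON shown above.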
Troubleshoot errors
If you get an error while attempting to create a stored infoType from a term list stored in Cloud Storage, the following are possible causes:
- You've run into an upper limit for stored infoTypes. Depending on the problem, there are several workarounds:
  - If you run into the upper limit for a single input file in Cloud Storage (200 MB), try splitting the file into multiple files. You can use multiple files to assemble a single custom dictionary as long as the combined size of all files doesn't exceed 1 GB.
  - BigQuery doesn't have the same limits as Cloud Storage. Consider moving the terms into a BigQuery table. The maximum size of a custom dictionary column in BigQuery is 1 GB and the maximum number of rows is 5,000,000.
  - If your term list file exceeds all applicable limits for source term lists, you must split the term list file into multiple files and create a dictionary for each file. Then, create a separate scan job for each dictionary.
- One or more of your terms doesn't contain at least one letter or number. Sensitive Data Protection can't scan for terms that are composed solely of spaces or symbols; each term must have at least one letter or number. Look at your term list and see if there are any such terms included, and then fix or delete them.
- Your term list contains a phrase with too many "components." A component in this context is a continuous sequence containing only letters, only numbers, or only non-letter and non-digit characters such as spaces or symbols. Look at your term list and see if there are any such terms included, and then fix or delete them.
- The Sensitive Data Protection service agent does not have access to dictionary source data or to the Cloud Storage bucket for storing dictionary files. To fix this issue, grant the Sensitive Data Protection service agent the Storage Admin (roles/storage.admin) role or BigQuery Data Owner (roles/bigquery.dataOwner) and BigQuery Job User (roles/bigquery.jobUser) roles.
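Two of the causes above can be checked locally before you create the stored infoType. The following Python sketch flags terms with no letter or digit and counts "components" as defined above. The function names are hypothetical, and because this page doesn't state the maximum number of components, `max_components` is a parameter that you would have to set yourself.

```python
import unicodedata
from itertools import groupby


def _char_class(ch):
    """Classify a character as a letter, a digit, or anything else."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    return "other"


def has_letter_or_digit(term):
    return any(_char_class(ch) != "other" for ch in term)


def count_components(term):
    """Count continuous runs of only letters, only digits, or only other characters."""
    return sum(1 for _ in groupby(term, key=_char_class))


def find_problem_terms(terms, max_components):
    """Return (term, reason) pairs for terms that are likely to be rejected."""
    problems = []
    for term in terms:
        if not has_letter_or_digit(term):
            problems.append((term, "no letter or digit"))
        elif count_components(term) > max_components:
            problems.append((term, "too many components"))
    return problems
```

Under this definition, a phrase such as "Abby Abernathy" has three components: a run of letters, a space, and another run of letters.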
API overview
Creating a stored infoType is required if you are creating a large custom dictionary detector.
A stored infoType is represented in Sensitive Data Protection by the
StoredInfoType
object. It consists of the following related objects:
- StoredInfoTypeVersion includes the creation date and time and the last five error messages that occurred when the current version was created.
- StoredInfoTypeConfig contains the configuration of the stored infoType, including its name and description. For a large custom dictionary, the type must be a LargeCustomDictionaryConfig.
- LargeCustomDictionaryConfig specifies both of the following:
  - The location within Cloud Storage or BigQuery where your list of phrases is stored.
  - The location in Cloud Storage to store the generated dictionary files.
- StoredInfoTypeState contains the state of the most current version and any pending versions of the stored infoType. State information includes whether the stored infoType is being rebuilt, is ready to use, or is invalid.
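To make the relationship between versions and state more concrete, here is a loose Python sketch that summarizes a stored infoType from a dictionary shaped like an API response. The key names (`currentVersion`, `pendingVersions`, `state`) and the state values used here are assumptions for illustration and are not confirmed by this page; check them against an actual response before relying on them.

```python
def describe_stored_info_type(resource):
    """Summarize which version is usable, given an assumed response shape."""
    current = resource.get("currentVersion") or {}    # assumed key
    pending = resource.get("pendingVersions") or []   # assumed key
    state = current.get("state", "UNKNOWN")           # assumed key and values
    if pending:
        # While a rebuild is pending, scans keep using the existing version.
        return f"rebuild in progress; scans use the current version (state {state})"
    if state == "READY":
        return "ready to use"
    return f"not usable yet (state {state})"
```

The first branch mirrors the rebuild behavior described earlier: a pending version does not take over until it is ready.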
Dictionary matching specifics
Following is guidance about how Sensitive Data Protection matches dictionary words and phrases. These points apply to both regular and large custom dictionaries:
- Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
- All characters (in dictionaries or in content to be scanned) other than letters, digits, and other alphabetic characters contained within the Unicode Basic Multilingual Plane are considered as whitespace when scanning for matches. If your dictionary scans for "Abby Abernathy", it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
- The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary scans for Abi, it will match the first three characters of Abi904, but not of Abigail.
. - Dictionary words containing characters in the Supplementary Multilingual Plane of the Unicode standard can yield unexpected findings. Examples of such characters are emojis, scientific symbols, and historical scripts.
Letters, digits, and other alphabetic characters are defined as follows:
- Letters: characters with general categories Lu, Ll, Lt, Lm, or Lo in the Unicode specification
- Digits: characters with general category Nd in the Unicode specification
- Other alphabetic characters: characters with general category Nl in the Unicode specification or with contributory property Other_Alphabetic as defined by the Unicode Standard
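The matching rules above can be approximated in a few lines of Python. This is a simplified model for building intuition, not the service's actual matching algorithm: it treats Nl characters as letters, ignores the Other_Alphabetic property (which Python's `unicodedata` module doesn't expose), and treats everything outside the Basic Multilingual Plane as whitespace. The function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Classify a character as letter, digit, or whitespace-equivalent."""
    if ord(ch) > 0xFFFF:
        return "space"  # simplification: outside the Basic Multilingual Plane
    category = unicodedata.category(ch)
    if category.startswith("L") or category == "Nl":
        return "letter"
    if category == "Nd":
        return "digit"
    return "space"


def tokenize(text):
    """Split text into case-folded runs of letters or runs of digits."""
    return [
        "".join(run).casefold()
        for cls, run in groupby(text, key=char_class)
        if cls != "space"
    ]


def dictionary_matches(term, content):
    """Return True if the term's token sequence appears contiguously in content."""
    needle, haystack = tokenize(term), tokenize(content)
    if not needle:
        return False
    width = len(needle)
    return any(
        haystack[i:i + width] == needle
        for i in range(len(haystack) - width + 1)
    )
```

Because letters and digits form separate tokens, Abi matches the start of Abi904 but not Abigail, which is the boundary behavior described above.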
To create, edit, or delete a stored infoType, you use the following methods:
- storedInfoTypes.create: Creates a new stored infoType given the StoredInfoTypeConfig that you specify.
- storedInfoTypes.patch: Rebuilds the stored infoType with a new StoredInfoTypeConfig that you specify. If none is specified, this method creates a new version of the stored infoType with the existing StoredInfoTypeConfig.
- storedInfoTypes.get: Retrieves the StoredInfoTypeConfig and any pending versions of the specified stored infoType.
- storedInfoTypes.list: Lists all current stored infoTypes.
- storedInfoTypes.delete: Deletes the specified stored infoType.
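As a quick reference, the following Python sketch maps each method to an HTTP verb and URL. The POST and PATCH forms come from the examples on this page; the GET and DELETE forms are assumed to follow the same URL pattern and should be verified against the API reference. The function name is hypothetical.

```python
BASE_URL = "https://dlp.googleapis.com/v2"


def rest_call(method, parent, stored_info_type_id=None):
    """Return (HTTP verb, URL) for a storedInfoTypes method.

    parent is a string such as "projects/PROJECT_ID".
    """
    collection = f"{BASE_URL}/{parent}/storedInfoTypes"
    resource = f"{collection}/{stored_info_type_id}"
    table = {
        "create": ("POST", collection),
        "list": ("GET", collection),      # assumed, same URL pattern
        "get": ("GET", resource),         # assumed, same URL pattern
        "patch": ("PATCH", resource),
        "delete": ("DELETE", resource),   # assumed, same URL pattern
    }
    if method not in table:
        raise ValueError(f"unknown method: {method}")
    if method not in ("create", "list") and not stored_info_type_id:
        raise ValueError(f"{method} requires a stored infoType ID")
    return table[method]
```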