Creating a Stored Custom Dictionary Detector

Regular custom dictionary detectors are ideal when you have a short list of sensitive words or phrases that you want to scan your content for. If you have more than a few words or phrases to scan for, or if your list of words or phrases changes frequently, consider creating a stored custom dictionary, which can search for up to tens of millions of words or phrases.

This topic describes how to create and update stored custom dictionaries, and it covers several error scenarios.

Anatomy of a stored custom dictionary

You may be familiar with how to create a regular dictionary custom infoType or a regex custom infoType: You define a CustomInfoType object that you send to the content.deidentify method. Stored custom dictionaries are different. Each stored custom dictionary has two components:

  • A list of phrases that you create and define. The list is stored as either a text file within Cloud Storage or a column in a BigQuery table.
  • The generated dictionary files, which the DLP API generates from your phrase list. The dictionary files are stored in Cloud Storage and consist of a copy of the source phrase data plus bloom filters, which aid in searching and matching. You can't edit these files directly.

Create a new stored dictionary

This section describes how to create, edit, and update a new stored dictionary.

Create a phrase list

The first step to creating a stored custom dictionary is to create a word and phrase list. You have two choices:

  • Place a text file with each word or phrase on its own line into a Cloud Storage bucket.
  • Designate one column of a BigQuery table as the container for the phrases. Give each phrase its own row in the column. You can use an existing BigQuery table, as long as all of the dictionary words and phrases are in a single column.

Be aware that it is possible to assemble a term list that is too large for the DLP API to process. If you see an error message, see "Troubleshooting errors," later in this topic.
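As a rough illustration of the Cloud Storage option, the following Python sketch prepares the text of a phrase list with one word or phrase per line. The function names are hypothetical, and dropping duplicates and symbol-only entries is an optional cleanup step assumed here, not something the DLP API requires you to do beforehand.

```python
def has_letter_or_digit(term):
    """Return True if the term contains at least one letter or digit."""
    return any(ch.isalnum() for ch in term)


def format_phrase_list(terms):
    """Return phrase-list text with one term per line.

    Blank entries, symbol-only entries, and case-insensitive duplicates
    are skipped; the remaining terms keep their original order.
    """
    seen = set()
    kept = []
    for raw in terms:
        term = raw.strip()
        key = term.lower()  # dictionary matching is case-insensitive
        if not term or key in seen or not has_letter_or_digit(term):
            continue
        seen.add(key)
        kept.append(term)
    return "".join(term + "\n" for term in kept)


phrase_text = format_phrase_list(["Abby", "abby", "  ", "***", "Abby Abernathy"])
print(phrase_text, end="")
```

You would then save the resulting text to a file and upload that file to your Cloud Storage bucket with whatever tool you normally use.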

Create the dictionary

After you've created your term list, use the DLP API to create a dictionary:

  1. Create a new directory for the dictionary in a Cloud Storage bucket. The DLP API creates directories containing the dictionary files at the location you specify.
  2. Use the DLP API's storedInfoTypes.create method to create the dictionary. The create method takes the following parameters:
    • A StoredInfoTypeConfig object, which contains the stored infoType's configuration. It includes:
      • description: A description of the dictionary.
      • displayName: The name you want to give the dictionary.
      • LargeCustomDictionaryConfig: Contains the configuration of the stored custom dictionary. It includes:
        • BigQueryField: Specified if your term list is stored in BigQuery. Includes a reference to the table your list is stored in, plus the field that contains each dictionary phrase.
        • CloudStorageFileSet: Specified if your term list is stored in Cloud Storage. Includes the URL to the source location in Cloud Storage, in the following form: "gs://path-to-gs". Wildcards are supported.
        • outputPath: The path to the location in a Cloud Storage bucket to store the created dictionary.
    • storedInfoTypeId: The identifier for the stored infoType. This value is how you will refer to the stored infoType when you update or delete it, or use it in an inspection or de-identification job. If you leave this field empty, the system generates an identifier for you.

The following example JSON, when sent to the storedInfoTypes.create method, creates a new stored custom dictionary. In this example, the DLP API is instructed to create a stored custom dictionary from a term list stored in a publicly available BigQuery table of GitHub usernames used in commits (bigquery-public-data.samples.github_nested). The output path for the generated dictionary is set to a Cloud Storage bucket called dlptesting, and the stored custom dictionary is given the identifier github-usernames.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/project-id/storedInfoTypes?key={YOUR_API_KEY}

{
 "config": {
  "displayName": "GitHub usernames",
  "description": "Dictionary of github usernames used in commits",
  "largeCustomDictionary": {
   "outputPath": {
    "path": "gs://dlptesting"
   },
   "bigQueryField": {
    "table": {
     "datasetId": "samples",
     "projectId": "bigquery-public-data",
     "tableId": "github_nested"
    },
    "field": {
     "name": "actor"
    }
   }
  }
 },
 "storedInfoTypeId": "github-usernames"
}
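If you assemble requests programmatically, a small helper can keep the body's structure in one place. The following Python sketch only builds and prints the JSON body; it does not call the API. The helper name is hypothetical, and the column name passed as field_name (actor) is an assumption about the table's schema that you should verify before use.

```python
import json


def build_create_body(stored_info_type_id, display_name, description,
                      output_path, project_id, dataset_id, table_id, field_name):
    """Return a storedInfoTypes.create body for a BigQuery-backed term list."""
    return {
        "config": {
            "displayName": display_name,
            "description": description,
            "largeCustomDictionary": {
                "outputPath": {"path": output_path},
                "bigQueryField": {
                    "table": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_id,
                    },
                    # Column that holds one dictionary phrase per row.
                    "field": {"name": field_name},
                },
            },
        },
        "storedInfoTypeId": stored_info_type_id,
    }


body = build_create_body(
    stored_info_type_id="github-usernames",
    display_name="GitHub usernames",
    description="Dictionary of github usernames used in commits",
    output_path="gs://dlptesting",
    project_id="bigquery-public-data",
    dataset_id="samples",
    table_id="github_nested",
    field_name="actor",  # assumed column name; check the table schema
)
print(json.dumps(body, indent=1))
```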

Update the dictionary

To add, remove, or change terms or phrases in your dictionary, first update your source term list, and then instruct the DLP API to update the dictionary.

  1. Update the existing source term list in either Cloud Storage or BigQuery. Add, remove, or change the terms or phrases as needed.
  2. Use the DLP API's storedInfoTypes.patch method to create a new version of the dictionary, which replaces the old dictionary.

If you only need to update the dictionary with new, deleted, or changed terms, call the storedInfoTypes.patch method by itself. Populate the name field with the resource name of the organization or project and of the stored infoType to be updated. The stored infoType's identifier is the value you set in the storedInfoTypeId parameter when you created it, or the identifier that the system generated if you left that parameter empty. If you don't remember the identifier, call the storedInfoTypes.list method to view a list of all current stored infoTypes.

The following patterns represent valid entries for the name field:

  • organizations/org-id/storedInfoTypes/stored-infoType-id
  • projects/project-id/storedInfoTypes/stored-infoType-id
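The two patterns above can be sketched as a small Python helper that builds and checks resource names. The function names and the regular expression are illustrative assumptions; the expression checks only the overall shape of the name, not which characters the service accepts in each identifier.

```python
import re

# Matches organizations/<id>/storedInfoTypes/<id> or projects/<id>/storedInfoTypes/<id>.
_NAME_PATTERN = re.compile(
    r"^(organizations|projects)/([^/]+)/storedInfoTypes/([^/]+)$"
)


def is_valid_name(name):
    """Return True if the name has one of the two documented shapes."""
    return _NAME_PATTERN.match(name) is not None


def stored_info_type_name(parent_type, parent_id, stored_info_type_id):
    """Build a resource name, raising ValueError if the result is malformed."""
    name = f"{parent_type}/{parent_id}/storedInfoTypes/{stored_info_type_id}"
    if not is_valid_name(name):
        raise ValueError(f"not a valid stored infoType name: {name!r}")
    return name


print(stored_info_type_name("projects", "project-id", "github-usernames"))
```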

When you update a stored custom dictionary, the DLP API builds a new version, and the dictionary's status is "pending" while that build is in progress. The old version of the dictionary still exists during this time, and any scans that you run while the dictionary is being updated use the old version. After the update to the new version completes, the old version of the stored custom dictionary is deleted.

You can change the source term list for an existing stored custom dictionary from one stored in a BigQuery table to one stored in a Cloud Storage bucket, and vice versa. Use the storedInfoTypes.patch method, but include a CloudStorageFileSet object in LargeCustomDictionaryConfig where you'd used a BigQueryField object before, or vice versa. You must also specify the updateMask parameter, setting it to the stored custom dictionary parameter that you updated, in FieldMask format. For instance, the following JSON, when sent to the storedInfoTypes.patch method, states in the updateMask parameter that the URL of the Cloud Storage path has been updated (large_custom_dictionary.cloud_storage_file_set.url):

PATCH https://dlp.googleapis.com/v2/projects/project-id/storedInfoTypes/github-usernames?key={YOUR_API_KEY}

{
 "config": {
  "largeCustomDictionary": {
   "cloudStorageFileSet": {
    "url": "gs://bucket-name/path-to-file"
   }
  }
 },
 "updateMask": "large_custom_dictionary.cloud_storage_file_set.url"
}
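In the request above, the updateMask path is the snake_case form of the camelCase keys nested under config. The following Python sketch derives the mask from the same list of keys used to build the body, so the two can't drift apart. The helper names are hypothetical, and the sketch only constructs the body; it does not send the request.

```python
import re


def to_mask_path(json_keys):
    """Convert camelCase JSON keys into a dotted snake_case FieldMask path."""
    snake = [re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower() for key in json_keys]
    return ".".join(snake)


def build_source_url_patch(url):
    """Return a storedInfoTypes.patch body that changes the Cloud Storage URL."""
    keys = ["largeCustomDictionary", "cloudStorageFileSet", "url"]
    return {
        "config": {keys[0]: {keys[1]: {keys[2]: url}}},
        "updateMask": to_mask_path(keys),
    }


patch_body = build_source_url_patch("gs://bucket-name/path-to-file")
print(patch_body["updateMask"])
```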

Scan content using stored custom dictionary detectors

Scanning content using a stored custom dictionary detector is similar to scanning content using any other custom infoType detector.

The following JSON, when sent to the content.inspect method, scans the given snippet of text using the specified stored custom dictionary detector. Note that the infoType parameter is required because all custom infoTypes, including stored custom dictionaries, must have a name that does not conflict with built-in infoTypes or other custom infoTypes. The storedType parameter contains the full resource path of the stored infoType.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/project-id/content:inspect?key={YOUR_API_KEY}

{
 "inspectConfig": {
  "customInfoTypes": [
   {
    "infoType": {
     "name": "GITHUB_LOGINS"
    },
    "storedType": {
     "name": "projects/dlapi-test/storedInfoTypes/github-logins"
    }
   }
  ]
 },
 "item": {
  "value": "The commit was made by githubuser."
 }
}
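The same request body can be assembled in code. This Python sketch uses a hypothetical helper name and only builds the dictionary that you would serialize as JSON for content.inspect; sending the request and handling the response are left out.

```python
def build_inspect_body(text, info_type_name, stored_type_name):
    """Return a content.inspect body that scans text with one stored infoType.

    info_type_name is the name reported for findings; stored_type_name is
    the full resource path of the stored infoType.
    """
    return {
        "inspectConfig": {
            "customInfoTypes": [
                {
                    "infoType": {"name": info_type_name},
                    "storedType": {"name": stored_type_name},
                }
            ]
        },
        "item": {"value": text},
    }


inspect_body = build_inspect_body(
    text="The commit was made by githubuser.",
    info_type_name="GITHUB_LOGINS",
    stored_type_name="projects/dlapi-test/storedInfoTypes/github-logins",
)
```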

Troubleshooting errors

If you see an error when you attempt to create a stored custom dictionary, and the error explains that the DLP API can't create a dictionary from your Cloud Storage-based term list, there are a few possible causes:

  • You've run into an upper limit for stored custom dictionaries. Depending on the problem, there are several workarounds:
    • If you run into the upper limit for a single stored custom dictionary in Cloud Storage (200 MB), you can try splitting the file into multiple files. You can still use these files to assemble a single custom dictionary, as long as the total size of the files doesn't exceed the maximum combined size for all stored custom dictionary files in Cloud Storage (1 GB).
    • BigQuery does not have the same limits as Cloud Storage. Consider moving the terms into a BigQuery table, but be aware of both the maximum size of a stored custom dictionary column in BigQuery (1 GB) and the maximum number of rows (5,000,000).
    • If your term list file exceeds all of the applicable limits for stored custom dictionary source term lists, you must split the term list file into multiple files in Cloud Storage, create a dictionary for each file, and then create a separate scan job for each dictionary that's created.
  • One or more of your terms doesn't contain at least one letter or number. The DLP API can't scan for terms that consist solely of spaces or symbols; each term must contain at least one letter or number. Look through your term list for any such terms, and then fix or delete them.
  • Your term list contains a phrase with too many "components." A component in this context is a continuous sequence containing only letters, only numbers, or only non-letter and non-digit characters such as spaces or symbols. Look at your term list and see if there are any such terms included, and then fix or delete them.
  • The DLP service account does not have access to dictionary source data or to the Cloud Storage bucket for storing dictionary files. To fix this issue, grant the DLP service account the Cloud Storage admin role or BigQuery dataOwner and jobUser roles.
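Some of these causes can be checked locally before you create the dictionary. The following Python sketch flags terms with no letter or digit, counts components as defined above, and splits a term list into chunks that stay under a byte limit. All function names are hypothetical. The component limit is not stated in this topic, so max_components is a placeholder that you must supply, and the sketch's notion of "letter" and "digit" (Python's str.isalpha and str.isdigit) may not match the service's exactly.

```python
from itertools import groupby


def _kind(ch):
    """Classify a character as a letter, a digit, or anything else."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


def count_components(term):
    """Count maximal runs of letters, digits, or other characters."""
    return sum(1 for _ in groupby(term, key=_kind))


def find_problem_terms(terms, max_components):
    """Return (term, reason) pairs for terms likely to be rejected."""
    problems = []
    for term in terms:
        if all(_kind(ch) == "other" for ch in term):
            problems.append((term, "no letter or digit"))
        elif count_components(term) > max_components:
            problems.append((term, "too many components"))
    return problems


def split_terms_by_size(terms, max_bytes):
    """Group terms into chunks whose newline-terminated UTF-8 size fits max_bytes.

    A single term longer than max_bytes still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for term in terms:
        needed = len(term.encode("utf-8")) + 1  # +1 for the trailing newline
        if current and size + needed > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(term)
        size += needed
    if current:
        chunks.append(current)
    return chunks


# max_components=5 is an arbitrary value chosen only for this demonstration.
print(find_problem_terms(["Abby", "***", "a-b-c-d"], max_components=5))
```

For the 200 MB per-file limit, you might call split_terms_by_size with max_bytes set to 200 * 1000 * 1000, which is the more conservative reading of "200 MB," and then write each chunk to its own file.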

API overview

A stored custom dictionary is considered a stored infoType because of its size and complexity. At this time, stored custom dictionaries are the only type of stored infoType.

A stored infoType is represented in the DLP API by the StoredInfoType object. It's accompanied by the following related objects:

  • StoredInfoTypeConfig contains the stored infoType's configuration, which includes details such as name and description.
  • StoredInfoTypeVersion contains more information about the stored infoType, such as its creation date and time and the last five error messages that occurred when the current version of the stored infoType was created.
  • StoredInfoTypeState contains the state of the most current version and any pending versions of the stored infoType. State information includes whether the stored infoType is being updated, is ready to use, or is invalid.

To create, edit, or delete a stored infoType, you use the following methods:

  • storedInfoTypes.create: Creates a new stored infoType given the StoredInfoTypeConfig that you specify.
  • storedInfoTypes.patch: Updates the stored infoType with a new StoredInfoTypeConfig that you specify, or if none is specified, creates a new version of the stored infoType with the existing StoredInfoTypeConfig.
  • storedInfoTypes.get: Retrieves the StoredInfoTypeConfig and any pending versions of the specified stored infoType.
  • storedInfoTypes.list: Lists all current stored infoTypes.
  • storedInfoTypes.delete: Deletes the specified stored infoType.

In addition to the stored infoType APIs described here, the following object applies specifically to stored custom dictionaries:

  • LargeCustomDictionaryConfig specifies both of the following:
    • The location within Cloud Storage or BigQuery where your list of phrases is stored.
    • The location in Cloud Storage to store the generated dictionary files.

Dictionary matching specifics

The following guidance describes how the DLP API matches dictionary words and phrases. These points apply to both regular and stored custom dictionaries:

  • Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
  • All characters other than letters and digits contained within the Unicode Basic Multilingual Plane are treated as whitespace when scanning for matches, both in dictionaries and in the content being scanned. If your dictionary contains Abby Abernathy, it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
  • The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary contains Abi, it will match the first three characters of Abi904, but not of Abigail.
  • Dictionary words containing a large number of characters that are not letters or digits may result in unexpected findings, because those characters are treated as whitespace.
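
To make these rules concrete, the following Python sketch models them locally. It is a simplified approximation for building intuition, not the DLP API's actual matching algorithm: the function names are hypothetical, letters and digits are identified with str.isalpha and str.isdecimal, and edge cases may differ from the service.

```python
def char_type(ch):
    """Return 'L' for a BMP letter, 'D' for a BMP digit, 'S' for anything else."""
    if ord(ch) > 0xFFFF:
        return "S"  # outside the Basic Multilingual Plane: treated as whitespace
    if ch.isalpha():
        return "L"
    if ch.isdecimal():
        return "D"
    return "S"


def normalize(text):
    """Lowercase letters and collapse each run of other characters to one space."""
    out = []
    for ch in text:
        if char_type(ch) == "S":
            if out and out[-1] != " ":
                out.append(" ")
        else:
            out.append(ch.lower())
    return "".join(out).strip()


def matches(phrase, text):
    """Return True if phrase occurs in text under the modeled rules."""
    needle, hay = normalize(phrase), normalize(text)
    if not needle:
        return False
    start = hay.find(needle)
    while start != -1:
        end = start + len(needle)
        # Neighbors must differ in type from the match's edge characters.
        before_ok = start == 0 or char_type(hay[start - 1]) != char_type(needle[0])
        after_ok = end == len(hay) or char_type(hay[end]) != char_type(needle[-1])
        if before_ok and after_ok:
            return True
        start = hay.find(needle, start + 1)
    return False


assert matches("Abby", "ABBY")
assert matches("Abby Abernathy", "Abby (ABERNATHY)")
assert matches("Abi", "Abi904")
assert not matches("Abi", "Abigail")
```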