Creating a stored custom dictionary detector

Regular custom dictionary detectors are sufficient when you have a list of up to several tens of thousands of sensitive words or phrases that you want to scan your content for. If you have more than this many words or phrases to scan for, or if your list of words or phrases changes frequently, consider creating a stored custom dictionary, which you can use to scan for up to tens of millions of words or phrases.

This topic describes how to create and rebuild stored custom dictionaries, and it covers several error scenarios.

Anatomy of a stored custom dictionary

You may be familiar with how to create a regular dictionary custom infoType or a regex custom infoType: You configure the custom infoType, and then use it when you set up your inspection or de-identification scan. Stored custom dictionaries are different from these custom infoTypes, though, in that each stored custom dictionary has two components:

  • A list of phrases that you create and define. The list is stored as either a text file within Cloud Storage or a column in a BigQuery table.
  • The generated dictionary files, which Cloud DLP generates from your phrase list. The dictionary files are stored in Cloud Storage and consist of a copy of the source phrase data plus bloom filters, which aid in searching and matching. You can't edit these files directly.

Create a new stored custom dictionary

This section describes how to create, edit, and rebuild a stored custom dictionary.

Create a phrase list

The first step to creating a stored custom dictionary is to create a word and phrase list. You have two choices:

  • Place a text file with each word or phrase on its own line into a Cloud Storage bucket.
  • Designate one column of a BigQuery table as the container for the phrases. Give each phrase its own row in the column. You can use an existing BigQuery table, as long as all of the dictionary words and phrases are in a single column.

Be aware that it is possible to assemble a term list that is too large for Cloud DLP to process. If you see an error message, see "Troubleshooting errors," later in this topic.
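
As a minimal sketch of preparing a Cloud Storage term list, the following Python helper renders terms into the one-phrase-per-line format described above. The function name render_term_list is hypothetical, and trimming whitespace and dropping blank or duplicate entries are conveniences assumed here, not Cloud DLP requirements. Uploading the resulting file to a bucket is left to your usual tooling.

```python
def render_term_list(terms):
    """Return file contents with one word or phrase per line.

    Assumption: trimming whitespace and dropping blank or duplicate
    entries is a convenience added here, not a Cloud DLP requirement.
    """
    seen = set()
    lines = []
    for term in terms:
        cleaned = term.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    # End the file with a newline when there is at least one term.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Print the rendered list; redirect it to a file before uploading.
    print(render_term_list(["Abby Abernathy", " abi ", "", "abi"]), end="")
```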

Create the dictionary

After you've created your term list, use Cloud DLP to create a dictionary:

Console

  1. Create a new folder for the dictionary in a Cloud Storage bucket. Cloud DLP creates folders containing the dictionary files at the location you specify.
  2. In the Cloud Console, open Cloud DLP.

  3. From the Create menu, choose Stored infoType.

    Screenshot of DLP UI with Create menu > Stored infoType selected.

The Create infoType page contains the following sections:

Configure infoType

The Configure infoType section is where you name and describe your stored custom dictionary infoType.

  • In the InfoType ID field, enter an identifier for the custom infoType. This will be how you refer to the infoType when configuring your inspection and de-identification jobs. You can use letters, numbers, hyphens, and underscores in the name.
  • In the InfoType display name field, enter a name for your custom infoType. You can use spaces and punctuation in the name.
  • In the Description field, enter a description for what your custom infoType detects.

Choose data location

The Choose data location section is where you specify where to find the list of words and phrases from which to create your stored custom dictionary.

  • Choose BigQuery if the words and phrases to search for are listed in a BigQuery table. Remember that you can designate at most one column from the table. Enter the project ID, dataset ID, and table ID in the specified fields. In the Field name field, enter the column identifier.
  • Choose Google Cloud Storage if the words and phrases to search for are listed in a text file in Cloud Storage. Enter the path to the file in the specified field.

The final section is where you specify a location for Cloud DLP to save the compacted stored custom dictionary.

In the Output bucket or folder field, enter the path to which you want the dictionary saved.

Click Create to create the stored custom dictionary. The infoType details screen appears.

When the status is "Ready," the stored custom dictionary has been generated, and the new custom infoType is ready to use.

Protocol

  1. Create a new folder for the dictionary in a Cloud Storage bucket. Cloud DLP creates folders containing the dictionary files at the location you specify.
  2. Use the Cloud DLP API's storedInfoTypes.create method to create the dictionary. The create method takes the following parameters:
    • A StoredInfoTypeConfig object, which contains the stored infoType's configuration. It includes:
      • description: A description of the dictionary.
      • displayName: The name you want to give the dictionary.
      • LargeCustomDictionaryConfig: Contains the configuration of the stored custom dictionary. It includes:
        • BigQueryField: Specified if your term list is stored in BigQuery. Includes a reference to the table your list is stored in, plus the field that contains each dictionary phrase.
        • CloudStorageFileSet: Specified if your term list is stored in Cloud Storage. Includes the URL to the source location in Cloud Storage, in the following form: "gs://[PATH_TO_GS]". Wildcards are supported.
        • outputPath: The path to the location in a Cloud Storage bucket to store the created dictionary.
    • storedInfoTypeId: The identifier for the stored custom infoType. This value is how you will refer to the stored custom infoType when you rebuild or delete it, or use it in an inspection or de-identification job. If you leave this field empty, the system generates an identifier for you.

The following example JSON, when sent to the storedInfoTypes.create method, creates a new stored custom dictionary. The example instructs Cloud DLP to create the dictionary from a term list stored in a publicly available BigQuery table of GitHub data (bigquery-public-data.samples.github_nested). The output path for the generated dictionary is shown as the placeholder gs://[PATH_TO_GS]; replace it with a path in your own Cloud Storage bucket (for example, a bucket called dlptesting). The stored custom dictionary is given the identifier github-usernames.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/storedInfoTypes?key={YOUR_API_KEY}

{
  "config":{
    "displayName":"GitHub usernames",
    "description":"Dictionary of github usernames used in commits",
    "largeCustomDictionary":{
      "outputPath":{
        "path":"gs://[PATH_TO_GS]"
      },
      "bigQueryField":{
        "table":{
          "datasetId":"samples",
          "projectId":"bigquery-public-data",
          "tableId":"github_nested"
        },
        "field":{
          "name":"actor"
        }
      }
    }
  },
  "storedInfoTypeId":"github-usernames"
}
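
The request body above can also be assembled programmatically. The following Python sketch builds the same JSON structure for either source type, using only the standard library. The helper name build_create_request and its parameters are hypothetical, and the column name "actor" in the usage example is an assumed phrase column. Sending the request (authentication, HTTP client) is out of scope here.

```python
import json


def build_create_request(stored_info_type_id, display_name, description,
                         output_path, *, bigquery=None, gcs_url=None):
    """Build a JSON body for storedInfoTypes.create (illustrative helper).

    bigquery: optional dict with projectId, datasetId, tableId, and field.
    gcs_url:  optional "gs://..." URL of the term list file(s).
    Exactly one of the two sources must be given.
    """
    if (bigquery is None) == (gcs_url is None):
        raise ValueError("specify exactly one of bigquery or gcs_url")

    dictionary = {"outputPath": {"path": output_path}}
    if bigquery is not None:
        dictionary["bigQueryField"] = {
            "table": {
                "projectId": bigquery["projectId"],
                "datasetId": bigquery["datasetId"],
                "tableId": bigquery["tableId"],
            },
            # Assumption: the phrase column is identified by its name.
            "field": {"name": bigquery["field"]},
        }
    else:
        dictionary["cloudStorageFileSet"] = {"url": gcs_url}

    return {
        "config": {
            "displayName": display_name,
            "description": description,
            "largeCustomDictionary": dictionary,
        },
        "storedInfoTypeId": stored_info_type_id,
    }


if __name__ == "__main__":
    body = build_create_request(
        "github-usernames", "GitHub usernames",
        "Dictionary of github usernames used in commits",
        "gs://[PATH_TO_GS]",
        bigquery={"projectId": "bigquery-public-data", "datasetId": "samples",
                  "tableId": "github_nested", "field": "actor"},
    )
    print(json.dumps(body, indent=2))
```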

Rebuild the dictionary

If you want to add terms or phrases to your dictionary, or remove them from it, first update your source term list, and then instruct Cloud DLP to rebuild the dictionary.

  1. Update the existing source term list in either Cloud Storage or BigQuery. Add, remove, or change the terms or phrases as needed.
  2. Create a new version of the dictionary by "rebuilding" it using either the Cloud Console or the Cloud DLP API's storedInfoTypes.patch method. The new version replaces the old dictionary.

To rebuild the stored custom dictionary:

Console

  1. Update and save your term list in either Cloud Storage or BigQuery.
  2. In the Cloud Console, open Cloud DLP.
  3. Click the Configuration tab, and then InfoTypes.
  4. On the infoTypes screen, click the Custom tab. Your stored custom infoTypes appear here.

  5. Click the row with the stored infoType you want to update.

  6. On the infoType details screen, click Rebuild data.

Cloud DLP rebuilds the stored custom dictionary with the changes you made to the source term list. Once the status of the custom infoType is "Ready," you can use it. Any templates or job triggers that use the custom infoType will automatically use the rebuilt custom infoType.

Protocol

If you only want to add, delete, or change terms in the stored custom dictionary, update the source term list and then call the storedInfoTypes.patch method by itself to rebuild the dictionary. Populate the name field with the full resource name of the stored custom infoType to be rebuilt, which includes the organization or project. The identifier of the stored infoType is the value you provided in the storedInfoTypeId parameter when you created it. If you don't remember the identifier, call the storedInfoTypes.list method to view a list of all current stored infoTypes.

The following patterns represent valid entries for the name field:

  • organizations/[ORG_ID]/storedInfoTypes/[STORED_INFOTYPE_ID]
  • projects/[PROJECT_ID]/storedInfoTypes/[STORED_INFOTYPE_ID]
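
The two patterns above can be produced with a small helper. This is a sketch only; the function name stored_info_type_name is hypothetical, and the checks are limited to what the patterns themselves show.

```python
def stored_info_type_name(parent_type, parent_id, stored_info_type_id):
    """Build a resource name for the name field of storedInfoTypes.patch.

    parent_type must be "organizations" or "projects", matching the two
    patterns listed above.
    """
    if parent_type not in ("organizations", "projects"):
        raise ValueError("parent_type must be 'organizations' or 'projects'")
    if not parent_id or not stored_info_type_id:
        raise ValueError("parent_id and stored_info_type_id are required")
    return f"{parent_type}/{parent_id}/storedInfoTypes/{stored_info_type_id}"
```

For example, stored_info_type_name("projects", "my-project", "github-usernames") yields a name that follows the second pattern, where my-project is a placeholder project ID.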

While Cloud DLP is rebuilding a stored custom dictionary, the status of the new version is "pending," and the old version of the dictionary still exists. Any scans that you run while the dictionary is being rebuilt use the old version. When the rebuild finishes, the new version replaces the old version, and the old version is deleted.

You can change the source term list for an existing stored custom dictionary from one stored in a BigQuery table to one stored in a Cloud Storage bucket, and vice versa. Use the storedInfoTypes.patch method, but include a CloudStorageFileSet object in LargeCustomDictionaryConfig where you previously used a BigQueryField object, or vice versa. You must also specify the updateMask parameter, setting it to the stored custom dictionary parameter that you changed, in FieldMask format. For instance, the following JSON, when sent to the storedInfoTypes.patch method, states in the updateMask parameter that the URL of the Cloud Storage path has been updated (large_custom_dictionary.cloud_storage_file_set.url):

PATCH https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/storedInfoTypes/github-usernames?key={YOUR_API_KEY}

{
  "config":{
    "largeCustomDictionary":{
      "cloudStorageFileSet":{
        "url":"gs://[BUCKET_NAME]/[PATH_TO_FILE]"
      }
    }
  },
  "updateMask":"large_custom_dictionary.cloud_storage_file_set.url"
}
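
As a sketch of the two patch scenarios described in this section, the following Python helper returns a request body for either a plain rebuild or a switch to a Cloud Storage term list. The helper name build_patch_body is hypothetical. Sending an empty body for a plain rebuild is an assumption based on the statement that patch can be called by itself; check the API reference before relying on it.

```python
def build_patch_body(gcs_url=None):
    """Build a JSON body for storedInfoTypes.patch (illustrative helper).

    With no arguments, return an empty body to request a rebuild from the
    existing configuration (assumption: an empty body is accepted).
    With gcs_url, point the dictionary at a Cloud Storage term list and
    name the changed field in updateMask, in FieldMask format.
    """
    if gcs_url is None:
        return {}
    return {
        "config": {
            "largeCustomDictionary": {
                "cloudStorageFileSet": {"url": gcs_url},
            },
        },
        "updateMask": "large_custom_dictionary.cloud_storage_file_set.url",
    }
```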

Scan content using stored custom dictionary detectors

Scanning content using a stored custom dictionary detector is similar to scanning content using any other custom infoType detector.

Console

You can use a custom dictionary detector when creating a new job, job trigger, or template. In the Configure detection section of the job or template creation workflow, you can specify your stored custom dictionary infoType in the Custom infoTypes subsection:

  1. Click Add custom infoType, and then click Stored infoType.

    Screenshot of DLP UI's create job trigger workflow, in the Custom infoTypes section.

  2. In the Add custom infoType section, type an infoType name in the InfoType field. You can use letters, numbers, and underscores.

  3. Click in the Stored infoType name field, and a menu will appear under the field with the paths to your stored custom dictionary infoTypes, as shown here:

    Screenshot of DLP UI's create job trigger workflow, in the Add custom infoType section.

  4. Choose the stored custom dictionary you want, and then click Done.

The custom infoType you added appears as in the following screenshot. You can add more custom infoTypes if you want.

Screenshot of DLP UI's create job trigger workflow, with custom infoType added.

You can now continue with the job, job trigger, or template creation process.

Protocol

The following JSON, when sent to the content.inspect method, scans the given snippet of text using the specified stored custom dictionary detector. Note that the infoType parameter is required because all custom infoTypes, including stored custom dictionaries, must have a name that does not conflict with built-in infoTypes or other custom infoTypes. The storedType parameter contains the full resource path of the stored custom infoType.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:inspect?key={YOUR_API_KEY}

{
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"GITHUB_LOGINS"
        },
        "storedType":{
          "name":"projects/[PROJECT_ID]/storedInfoTypes/github-usernames"
        }
      }
    ]
  },
  "item":{
    "value":"The commit was made by githubuser."
  }
}
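
The inspection request above can be generated for one or more stored dictionaries. The following Python sketch is illustrative; the helper name build_inspect_request is hypothetical. It rejects duplicate infoType names, reflecting the requirement that custom infoType names must not conflict with each other. It does not check for conflicts with built-in infoTypes.

```python
def build_inspect_request(detectors, text):
    """Build a JSON body for content.inspect (illustrative helper).

    detectors: list of (info_type_name, stored_info_type_resource_name)
               pairs, one per stored custom dictionary.
    text:      the string to scan.
    """
    names = [name for name, _ in detectors]
    if len(set(names)) != len(names):
        raise ValueError("custom infoType names must be unique")

    custom_info_types = [
        {"infoType": {"name": name}, "storedType": {"name": stored}}
        for name, stored in detectors
    ]
    return {
        "inspectConfig": {"customInfoTypes": custom_info_types},
        "item": {"value": text},
    }
```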

Troubleshooting errors

If, when you attempt to create a stored custom dictionary, you see an error that explains that Cloud DLP can't create a dictionary from your Cloud Storage-based term list, there are a few possible causes:

  • You've run into an upper limit for stored custom dictionaries. Depending on the problem, there are several workarounds:
    • If you run into the upper limit for a single term list file in Cloud Storage (200 MB), try splitting the file into multiple files. You can still use these files to assemble a single custom dictionary, as long as the total size of the files doesn't exceed the maximum combined size for all of a stored custom dictionary's source files in Cloud Storage (1 GB).
    • BigQuery does not have the same limits as Cloud Storage. Consider moving the terms into a BigQuery table, but be aware of both the maximum size of a custom dictionary column in BigQuery (1 GB) and the maximum number of rows (5,000,000).
    • If your term list file exceeds all of the applicable limits for custom dictionary source term lists, you must split the term list file into multiple files in Cloud Storage, create a dictionary for each file, and then create a separate scan job for each dictionary that's created.
  • One or more of your terms doesn't contain at least one letter or number. Cloud DLP can't scan for terms that consist solely of spaces or symbols; each term must have at least one letter or number. Check your term list for any such terms, and then fix or delete them.
  • Your term list contains a phrase with too many "components." A component in this context is a continuous sequence containing only letters, only numbers, or only non-letter and non-digit characters such as spaces or symbols. Check your term list for any such phrases, and then fix or delete them.
  • The DLP service account does not have access to dictionary source data or to the Cloud Storage bucket for storing dictionary files. To fix this issue, grant the DLP service account the Cloud Storage admin role or BigQuery dataOwner and jobUser roles.
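
Some of these problems can be caught before you submit the term list. The following Python sketch checks file sizes against the Cloud Storage limits quoted above and flags terms that have no letter or digit or that have many components. All function names are hypothetical. The component limit is a caller-supplied assumption because this topic does not state it, and reading "MB" and "GB" as decimal units is also an assumption.

```python
from itertools import groupby

# Limits quoted in the text above. Assumption: decimal units.
MAX_FILE_BYTES = 200 * 10**6
MAX_TOTAL_BYTES = 10**9


def _kind(ch):
    """Classify a character as a letter, a digit, or anything else."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


def count_components(term):
    """Count runs of only letters, only digits, or only other characters."""
    return sum(1 for _ in groupby(term, key=_kind))


def find_problem_terms(terms, max_components):
    """Return (term, reason) pairs for terms that may be rejected.

    max_components is an assumed threshold supplied by the caller.
    """
    problems = []
    for term in terms:
        if not any(_kind(ch) != "other" for ch in term):
            problems.append((term, "no letter or digit"))
        elif count_components(term) > max_components:
            problems.append((term, "too many components"))
    return problems


def check_file_sizes(sizes_in_bytes):
    """Return messages for Cloud Storage size limits that are exceeded."""
    problems = []
    for index, size in enumerate(sizes_in_bytes):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index} exceeds the single-file limit")
    if sum(sizes_in_bytes) > MAX_TOTAL_BYTES:
        problems.append("combined size exceeds the total limit")
    return problems
```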

API overview

A stored custom dictionary is considered a stored infoType because of its size and complexity. At this time, stored custom dictionaries are the only type of stored infoType.

A stored infoType is represented in Cloud DLP by the StoredInfoType object. It's accompanied by the following related objects:

  • StoredInfoTypeConfig contains the stored infoType's configuration, which includes details such as name and description.
  • StoredInfoTypeVersion contains more information about the stored infoType, such as its creation date and time and the last five error messages that occurred when the current version of the stored infoType was created.
  • StoredInfoTypeState contains the state of the most current version and any pending versions of the stored infoType. State information includes whether the stored infoType is being rebuilt, is ready to use, or is invalid.

To create, edit, or delete a stored infoType, you use the following methods:

  • storedInfoTypes.create: Creates a new stored infoType given the StoredInfoTypeConfig that you specify.
  • storedInfoTypes.patch: Rebuilds the stored infoType with a new StoredInfoTypeConfig that you specify, or if none is specified, creates a new version of the stored infoType with the existing StoredInfoTypeConfig.
  • storedInfoTypes.get: Retrieves the StoredInfoTypeConfig and any pending versions of the specified stored infoType.
  • storedInfoTypes.list: Lists all current stored infoTypes.
  • storedInfoTypes.delete: Deletes the specified stored infoType.

In addition to the stored infoType APIs described here, the following object applies specifically to stored custom dictionaries:

  • LargeCustomDictionaryConfig specifies both of the following:
    • The location within Cloud Storage or BigQuery where your list of phrases is stored.
    • The location in Cloud Storage to store the generated dictionary files.

Dictionary matching specifics

Following is guidance about how Cloud DLP matches dictionary words and phrases. These points apply to both regular and stored custom dictionaries:

  • Dictionary words are case-insensitive. If your dictionary includes Abby, it will match on abby, ABBY, Abby, and so on.
  • All characters, whether in dictionaries or in content to be scanned, other than letters and digits contained within the Unicode Basic Multilingual Plane are treated as whitespace when scanning for matches. If your dictionary includes Abby Abernathy, it will match on "abby abernathy", "Abby, Abernathy", "Abby (ABERNATHY)", and so on.
  • The characters surrounding any match must be of a different type (letters or digits) than the adjacent characters within the word. If your dictionary includes Abi, it will match the first three characters of Abi904, because the character that follows the match is a digit, but not the first three characters of Abigail, because the character that follows is another letter.
  • Dictionary words containing a large number of characters that are not letters or digits may result in unexpected findings, because those characters are treated as whitespace.
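
To make these rules concrete, the following Python sketch approximates them locally. It is a simplified illustration of the points above, not Cloud DLP's actual matching algorithm, and the function names are hypothetical. Text is split into runs of letters, runs of digits, and separators, and a phrase matches when its runs appear consecutively in the content.

```python
from itertools import groupby


def _kind(ch):
    """Letters and digits in the Basic Multilingual Plane keep their type;
    every other character is treated as a separator."""
    if ord(ch) <= 0xFFFF:
        if ch.isalpha():
            return "letter"
        if ch.isdecimal():
            return "digit"
    return "sep"


def components(text):
    """Split lowercased text into letter runs, digit runs, and single
    " " markers that stand for any run of separator characters."""
    parts = []
    for kind, run in groupby(text.lower(), key=_kind):
        parts.append(" " if kind == "sep" else "".join(run))
    return parts


def phrase_matches(phrase, content):
    """Return True if the phrase's components occur in order in content."""
    needle = components(phrase)
    # Ignore separators at the edges of the dictionary phrase.
    while needle and needle[0] == " ":
        needle.pop(0)
    while needle and needle[-1] == " ":
        needle.pop()
    if not needle:
        return False
    haystack = components(content)
    size = len(needle)
    return any(haystack[i:i + size] == needle
               for i in range(len(haystack) - size + 1))
```

Because whole runs are compared, Abi matches in Abi904 (a letter run followed by a digit run) but not in Abigail (a single, longer letter run).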