Modifying infoType detectors to refine scan results

The Cloud Data Loss Prevention (DLP) API's built-in infoType detectors are effective at finding common types of sensitive data. Custom infoType detectors enable you to fully customize your own sensitive data detector. Inspection rules help refine the scan results that the DLP API returns by modifying the detection mechanism of a given infoType detector.

If you want to exclude values from, or add values to, the results that a built-in infoType detector returns, you can create a new custom infoType from scratch and define all the criteria that the DLP API should look for. Alternatively, you can refine the findings that the DLP API’s built-in or custom detectors return according to criteria that you specify. You do this by adding inspection rules, which can help reduce noise, increase precision and recall, or adjust the likelihood of scan findings.

This topic discusses how to use the two types of inspection rules to either exclude certain findings or add findings, all according to custom criteria that you specify. It presents several scenarios in which you might want to alter an existing infoType detector.

The two types of inspection rules are exclusion rules and hotword rules.

Exclusion rules

Exclusion rules are useful in situations like the following:

  • You want to exclude duplicate scan matches in results that are caused by overlapping infoType detectors. For example, you’re scanning for email addresses and phone numbers, but you are receiving two hits for email addresses with phone numbers in them, such as “206-555-0764@example.org.”
  • You’re experiencing noise in your scan findings. For example, you’re seeing the same dummy email address (such as “example@example.com”) or domain (such as “example.com”) returned an inordinate number of times by a scan for legitimate email addresses.
  • You have a list of terms, phrases, or combinations of characters that you want to exclude from findings.

Exclusion rules API overview

The DLP API defines an exclusion rule in the ExclusionRule object. Within ExclusionRule you specify one of the following:

  • A Dictionary object, which indicates that the exclusion rule is a regular dictionary rule.
  • A Regex object, which indicates that the exclusion rule is a regular expression rule.
  • An ExcludeInfoTypes object, which contains an array of infoType detectors. If a finding is matched by any of the infoType detectors listed here, the finding will be excluded from the scan results.
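
As a rough illustration of the structure described above, the following Python sketch assembles an exclusion rule as a plain dictionary and serializes it to JSON. The helper name make_exclusion_rule and its arguments are hypothetical and are not part of the DLP API; only the JSON field names are taken from the snippets in this topic.

```python
import json


def make_exclusion_rule(matching_type, words=None, pattern=None, info_types=None):
    """Build an exclusionRule dict with exactly one matcher (hypothetical helper)."""
    matchers = {}
    if words is not None:
        matchers["dictionary"] = {"wordList": {"words": list(words)}}
    if pattern is not None:
        matchers["regex"] = {"pattern": pattern}
    if info_types is not None:
        matchers["excludeInfoTypes"] = {
            "infoTypes": [{"name": name} for name in info_types]
        }
    # An exclusion rule carries one of the three matcher objects, not several.
    if len(matchers) != 1:
        raise ValueError("specify exactly one of words, pattern, or info_types")
    rule = dict(matchers)
    rule["matchingType"] = matching_type
    return {"exclusionRule": rule}


rule = make_exclusion_rule("MATCHING_TYPE_FULL_MATCH", words=["example@example.com"])
print(json.dumps(rule, indent=2))
```

The resulting object is intended to sit inside the rules array of a rule set, as in the scenarios that follow.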

Exclusion rule example scenarios

Each of the following JSON code snippets illustrates how to configure the DLP API for the given scenario.

Omit specific email address from EMAIL_ADDRESS detector scan

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should avoid matching on “example@example.com” in a scan that uses the infoType detector EMAIL_ADDRESS:

...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"EMAIL_ADDRESS"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "dictionary":{
                  "wordList":{
                    "words":[
                      "example@example.com"
                    ]
                  }
                },
                "matchingType": "MATCHING_TYPE_FULL_MATCH"
              }
            }
          ]
        }
      ]
    }
...

Omit email addresses ending with a specific domain from EMAIL_ADDRESS detector scan

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should avoid matching on any email addresses that end with “@example.com” in a scan that uses the infoType detector EMAIL_ADDRESS:

...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"EMAIL_ADDRESS"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "regex":{
                  "pattern":".+@example\\.com"
                },
                "matchingType": "MATCHING_TYPE_FULL_MATCH"
              }
            }
          ]
        }
      ]
    }
...
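
The two scenarios above rely on the matchingType field. The following Python sketch approximates its effect locally for a regex exclusion rule. It assumes that MATCHING_TYPE_FULL_MATCH requires the pattern to cover the entire finding and that MATCHING_TYPE_PARTIAL_MATCH requires only part of the finding to match. The function apply_regex_exclusion and the sample findings are hypothetical, and Python's re module stands in for the API's own regex engine, so treat this as a mental model rather than the DLP API's implementation.

```python
import re


def apply_regex_exclusion(findings, pattern, matching_type):
    """Return the findings that survive a regex exclusion rule (local approximation)."""
    compiled = re.compile(pattern)
    kept = []
    for quote in findings:
        if matching_type == "MATCHING_TYPE_FULL_MATCH":
            # Assumed: the pattern must match the finding from start to end.
            excluded = compiled.fullmatch(quote) is not None
        elif matching_type == "MATCHING_TYPE_PARTIAL_MATCH":
            # Assumed: the pattern only needs to match somewhere in the finding.
            excluded = compiled.search(quote) is not None
        else:
            raise ValueError(f"unsupported matching type: {matching_type}")
        if not excluded:
            kept.append(quote)
    return kept


findings = ["alice@example.com", "bob@example.org", "carol@example.com.au"]
full = apply_regex_exclusion(findings, r".+@example\.com", "MATCHING_TYPE_FULL_MATCH")
partial = apply_regex_exclusion(findings, r".+@example\.com", "MATCHING_TYPE_PARTIAL_MATCH")
print(full)     # findings kept under full matching
print(partial)  # findings kept under partial matching
```

Under these assumptions, full matching keeps the address at example.com.au because the pattern does not cover the whole string, while partial matching drops it.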

Omit any scan matches that include the substring “REDACTED”

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should exclude any findings that include the substring “REDACTED”:

...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"ALL_BASIC"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "dictionary":{
                  "wordList":{
                    "words":[
                      "REDACTED"
                    ]
                  }
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    }
...

Omit scan matches that include the substring “Jimmy” from a custom infoType detector scan

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should avoid matching on the name “Jimmy” in a scan that uses the specified custom regex detector:

...
    "inspectConfig":{
      "customInfoTypes":[
        {
          "infoType":{
            "name":"CUSTOM_NAME_DETECTOR"
          },
          "regex":{
            "pattern":"\\d{2,6}\\s\\w.\\s(\\b\\w*\\b\\s){2,4}\\w*\\."
          }
        }
      ],
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"CUSTOM_NAME_DETECTOR"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "dictionary":{
                  "wordList":{
                    "words":[
                      "jimmy"
                    ]
                  }
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    }
...
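
Regular expression patterns that contain backslashes need each backslash doubled when they are placed in a JSON request body, because JSON uses the backslash as its own escape character. The following Python sketch shows one way to let the standard json module handle that escaping. The pattern is a simplified, hypothetical one chosen for illustration.

```python
import json
import re

# Hypothetical pattern, written as a raw Python string with single backslashes.
pattern = r"\d{2,6}\s\w+"

# json.dumps doubles each backslash so that the request body is valid JSON.
body = json.dumps({"regex": {"pattern": pattern}})
print(body)

# Parsing the body restores the single-backslash pattern the regex engine sees.
restored = json.loads(body)["regex"]["pattern"]
print(restored == pattern)
print(re.fullmatch(restored, "1600 Main") is not None)
```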

Omit scan matches from a PERSON_NAME detector scan that overlap with a custom detector

In this scenario, the user does not want the built-in PERSON_NAME detector to return a match if the custom regex detector defined in the first part of the snippet would also match it. The following JSON snippet specifies both a custom regex detector and an exclusion rule in the InspectConfig. The custom regex detector specifies the names to exclude from results. The exclusion rule specifies that any PERSON_NAME results that the custom regex detector would also match are omitted. Note that VIP_DETECTOR is marked as EXCLUSION_TYPE_EXCLUDE in this case, so it does not produce results itself. It only affects results produced by the PERSON_NAME detector.

...
    "inspectConfig":{
      "customInfoTypes":[
        {
          "infoType":{
            "name":"VIP_DETECTOR"
          },
          "regex":{
            "pattern":"Larry Page|Sergey Brin"
          },
          "exclusionType":"EXCLUSION_TYPE_EXCLUDE"
        }
      ],
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "excludeInfoTypes":{
                  "infoTypes":[
                    {
                      "name":"VIP_DETECTOR"
                    }
                  ]
                },
                "matchingType": "MATCHING_TYPE_FULL_MATCH"
              }
            }
          ]
        }
      ]
    }
...

Omit matches on PERSON_NAME detector if also matched by EMAIL_ADDRESS detector

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should only return one match in the case that matches for the PERSON_NAME detector overlap with matches for the EMAIL_ADDRESS detector. This avoids the situation in which an email address such as “james@example.com” matches on both the PERSON_NAME and EMAIL_ADDRESS detectors.

...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "excludeInfoTypes":{
                  "infoTypes":[
                    {
                      "name":"EMAIL_ADDRESS"
                    }
                  ]
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    }
...
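
To make the overlap behavior more concrete, the following Python sketch models findings as character ranges and decides whether an excludeInfoTypes rule would drop one. It assumes that MATCHING_TYPE_FULL_MATCH means the finding lies completely inside a finding of the excluding infoType and that MATCHING_TYPE_PARTIAL_MATCH means the two findings intersect. The Finding class, the locate and is_excluded helpers, and the sample text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    info_type: str
    start: int  # inclusive character offset
    end: int  # exclusive character offset


def locate(text, quote, info_type):
    """Build a Finding for the first occurrence of `quote` in `text`."""
    start = text.index(quote)
    return Finding(info_type, start, start + len(quote))


def is_excluded(finding, excluders, matching_type):
    """Approximate how an excludeInfoTypes rule decides to drop a finding."""
    for other in excluders:
        if matching_type == "MATCHING_TYPE_FULL_MATCH":
            # Assumed: the finding lies completely inside the excluding finding.
            hit = other.start <= finding.start and finding.end <= other.end
        elif matching_type == "MATCHING_TYPE_PARTIAL_MATCH":
            # Assumed: the two findings share at least one character.
            hit = finding.start < other.end and other.start < finding.end
        else:
            raise ValueError(f"unsupported matching type: {matching_type}")
        if hit:
            return True
    return False


text = "Contact james@example.com or James Smith"
email = locate(text, "james@example.com", "EMAIL_ADDRESS")
names = [
    locate(text, "james", "PERSON_NAME"),
    locate(text, "James Smith", "PERSON_NAME"),
]
kept = [f for f in names if not is_excluded(f, [email], "MATCHING_TYPE_PARTIAL_MATCH")]
print([text[f.start:f.end] for f in kept])
```

In this model, the PERSON_NAME candidate inside the email address is dropped and the standalone name is kept.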

Omit matches on domain names that are part of email addresses in a DOMAIN_NAME detector scan

The following JSON snippet illustrates how to indicate to the DLP API using an InspectConfig that it should only return matches for a DOMAIN_NAME detector scan if the match does not overlap with a match in an EMAIL_ADDRESS detector scan. In this scenario, the main scan is a DOMAIN_NAME detector scan. The user does not want a domain name match returned in findings if the domain name is used in an email address:

...
    "inspectConfig":{
      "infoTypes":[
        {
          "name":"DOMAIN_NAME"
        },
        {
          "name":"EMAIL_ADDRESS"
        }
      ],
      "customInfoTypes":[
        {
          "infoType":{
            "name":"EMAIL_ADDRESS"
          },
          "exclusionType":"EXCLUSION_TYPE_EXCLUDE"
        }
      ],
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"DOMAIN_NAME"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "excludeInfoTypes":{
                  "infoTypes":[
                    {
                      "name":"EMAIL_ADDRESS"
                    }
                  ]
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    }
...

Hotword rules

Hotword rules are useful in a situation such as the following:

  • You want to change likelihood values assigned to scan matches based on the match’s proximity to a hotword. For example, you want to set the likelihood value higher for matches on patient names depending on the names’ proximity to the word “patient.”

Hotword rules API overview

Within the DLP API's InspectionRule object, you specify a HotwordRule object, which adjusts the likelihood of findings within a certain proximity of hotwords.

InspectionRule objects are grouped as a “rule set” in an InspectionRuleSet object, along with a list of infoType detectors the rule set applies to. Rules within a rule set are applied in the order specified.

Hotword rule example scenarios

The following JSON code snippet illustrates how to configure the DLP API for the given scenario.

Increase the likelihood of a PERSON_NAME match if there is the hotword “patient” nearby

The following JSON snippet uses an InspectConfig to illustrate a scenario in which you want to scan a medical database for patient names. You can use the DLP API’s built-in PERSON_NAME infoType detector, but that causes the DLP API to match on all names of people, not just names of patients. To fix this, you can include a hotword rule that looks for the word “patient” within a certain number of characters before the first character of potential matches. You can then assign findings matching this pattern a likelihood of “very likely,” since they correspond to your special criteria. Setting minLikelihood to VERY_LIKELY within InspectConfig ensures that only matches to this configuration are returned in findings.

...
  "inspectConfig":{
    "ruleSet":[
      {
        "infoTypes":[
          {
            "name":"PERSON_NAME"
          }
        ],
        "rules":[
          {
            "hotwordRule":{
              "hotwordRegex":{
                "pattern":"patient"
              },
              "proximity":{
                "windowBefore":50
              },
              "likelihoodAdjustment":{
                "fixedLikelihood":"VERY_LIKELY"
              }
            }
          }
        ]
      }
    ],
    "minLikelihood":"VERY_LIKELY"
  }
...
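
The following Python sketch is a simplified local model of the hotword rule above, not the DLP API's implementation. It assumes that the rule searches the windowBefore characters immediately preceding a candidate match for the hotword pattern and, if the hotword is found, replaces the candidate's likelihood with the fixed value. The apply_hotword function, the sample text, and the starting likelihood of POSSIBLE are hypothetical.

```python
import re

# Likelihood values ordered from lowest to highest.
LIKELIHOODS = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def apply_hotword(text, start, likelihood, hotword, window_before, fixed_likelihood):
    """Return fixed_likelihood if the hotword occurs in the window before `start`."""
    window = text[max(0, start - window_before):start]
    return fixed_likelihood if re.search(hotword, window) else likelihood


text = (
    "The patient Jane Doe arrived at noon. "
    "Her appointment had been booked weeks earlier by John Roe."
)
# Hypothetical candidate names with an assumed starting likelihood.
candidates = {"Jane Doe": "POSSIBLE", "John Roe": "POSSIBLE"}

adjusted = {
    name: apply_hotword(text, text.index(name), base, "patient", 50, "VERY_LIKELY")
    for name, base in candidates.items()
}

# Keep only findings at or above the minimum likelihood, as minLikelihood does.
min_rank = LIKELIHOODS.index("VERY_LIKELY")
reported = [name for name, level in adjusted.items() if LIKELIHOODS.index(level) >= min_rank]
print(reported)
```

Only the name that follows “patient” within 50 characters is raised to VERY_LIKELY in this model, so it is the only one that passes the minimum likelihood filter.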

For more detailed information about hotwords, see Customizing match likelihood.

Multiple inspection rules scenario

The following InspectConfig JSON snippet illustrates applying both exclusion and hotword rules. This snippet’s rule set includes both hotword rules and dictionary and regex exclusion rules. Notice that the four rules are specified in an array within the rules element.

...
  "inspectConfig":{
    "ruleSet":[
      {
        "infoTypes":[
          {
            "name":"PERSON_NAME"
          }
        ],
        "rules":[
          {
            "hotwordRule":{
              "hotwordRegex":{
                "pattern":"patient"
              },
              "proximity":{
                "windowBefore":5
              },
              "likelihoodAdjustment":{
                "fixedLikelihood":"VERY_LIKELY"
              }
            }
          },
          {
            "hotwordRule":{
              "hotwordRegex":{
                "pattern":"doctor"
              },
              "proximity":{
                "windowBefore":5
              },
              "likelihoodAdjustment":{
                "fixedLikelihood":"UNLIKELY"
              }
            }
          },
          {
            "exclusionRule":{
              "dictionary":{
                "wordList":{
                  "words":[
                    "Quasimodo"
                  ]
                }
              },
              "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
            }
          },
          {
            "exclusionRule":{
              "regex":{
                "pattern":"REDACTED"
              },
              "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
            }
          }
        ]
      }
    ]
  }
...
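
When a rule set mixes rule types like this, a quick local check before sending the request can catch malformed entries. The following Python sketch assumes that each element of the rules array holds exactly one of hotwordRule or exclusionRule, and lists the rule kinds in the order given. The rule_kinds and hotword_rule helpers are hypothetical; the field names and values mirror the snippet above.

```python
def rule_kinds(rule_set):
    """List each rule's kind in order, checking that each rule holds exactly one kind."""
    kinds = []
    for index, rule in enumerate(rule_set["rules"]):
        present = [key for key in ("hotwordRule", "exclusionRule") if key in rule]
        if len(present) != 1:
            raise ValueError(f"rule {index} must contain exactly one rule kind")
        kinds.append(present[0])
    return kinds


def hotword_rule(pattern, fixed_likelihood, window_before=5):
    """Build a hotwordRule entry shaped like the ones in the snippet above."""
    return {
        "hotwordRule": {
            "hotwordRegex": {"pattern": pattern},
            "proximity": {"windowBefore": window_before},
            "likelihoodAdjustment": {"fixedLikelihood": fixed_likelihood},
        }
    }


rule_set = {
    "infoTypes": [{"name": "PERSON_NAME"}],
    "rules": [
        hotword_rule("patient", "VERY_LIKELY"),
        hotword_rule("doctor", "UNLIKELY"),
        {
            "exclusionRule": {
                "dictionary": {"wordList": {"words": ["Quasimodo"]}},
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH",
            }
        },
        {
            "exclusionRule": {
                "regex": {"pattern": "REDACTED"},
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH",
            }
        },
    ],
}
print(rule_kinds(rule_set))
```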

Overlapping infoType detectors

It is possible to define a custom infoType detector that has the same name as a built-in infoType detector. The "Omit matches on domain names that are part of email addresses in a DOMAIN_NAME detector scan" example declares a custom infoType named EMAIL_ADDRESS in this way. When you create a custom infoType detector with the same name as a built-in infoType, any findings detected by the new infoType detector are added to those detected by the built-in detector. This is only true as long as the built-in infoType is specified in the list of infoTypes in the InspectConfig object.

When creating new custom infoType detectors, test them thoroughly on example content to ensure they work as you intend.
