Customizing match likelihood

Hotword rules allow you to further extend built-in and custom infoType detectors with powerful context rules. Hotword rules allow you to add a regex and proximity detector to an existing infoType detector, and to adjust the match likelihood value appropriately. A hotword rule is a kind of inspection rule, which is specified in rule sets. Each rule set is applied to a set of infoTypes, which can be either custom or built-in.

Anatomy of a hotword rule

An infoType detector can have zero or more hotword rules. You define each hotword rule (HotwordRule object) inside an inspection rule (InspectionRule object). Each inspection rule is specified within a (InspectionRuleSet object, which in turn is contained in an InspectConfig object.

As a JSON object, a single hotword rule inside a "inspectionRules" array looks like this:

"inspectionRules":[
  {
    "hotwordRule":{
      "hotwordRegex":{
        "pattern":"[REGEX_PATTERN]"
      },
      "proximity":{
        "windowAfter":"[NUM_CHARS_TO_CONSIDER_AFTER_FINDING]",
        "windowBefore":"[NUM_CHARS_TO_CONSIDER_BEFORE_FINDING]"
      }
      "likelihoodAdjustment":{
        "fixedLikelihood":"[LIKELIHOOD_VALUE]"
             -- OR --
        "relativeLikelihood":"[LIKELIHOOD_ADJUSTMENT]"
      },
    }
  },
  ...
]

Each hotword rule has three components:

  • "hotwordRegex": A regex pattern (Regex object) defining what qualifies as a hotword.
  • "proximity": The proximity of the finding within which the entire hotword must be contained. This field contains a Proximity object, which is comprised of two values:

    • "windowBefore": Number of characters before the finding to consider.
    • "windowAfter": Number of characters after the finding to consider.
  • "likelihoodAdjustment": The adjustment to the likelihood of a finding. This field contains a LikelihoodAdjustment object, which can be set to one of two values:

    • "fixedLikelihood": A fixed Likelihood value to set the finding to.
    • "relativeLikelihood": A number that indicates the levels by which to increase or decrease the likelihood of the finding. For example, if a finding would be POSSIBLE without the detection rule and relativeLikelihood is 1, then it is upgraded to LIKELY, while a value of -1 would downgrade it to UNLIKELY. Likelihood may never drop below VERY_UNLIKELY or exceed VERY_LIKELY, so applying an adjustment of 1 followed by an adjustment of -1 when base likelihood is VERY_LIKELY will result in a final likelihood of LIKELY.

Hotword example: Match medical record numbers

Suppose you wanted to detect a custom infoType such as a medical record number in the form "###-#-#####", and you wanted to boost Cloud DLP finding's match likelihood when the hotword "MRN" was before—but not after—this number. Therefore:

  • 123-4-56789 would match as POSSIBLE.
  • MRN 123-4-56789 would match as VERY_LIKELY.

The JSON example and code snippets below shows the custom regex defined as explained in Creating a regex infoType detector, but with the appropriate hotword rule added on:

Protocol

See the JSON quickstart for more information about using the Cloud DLP API with JSON.

JSON input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:inspect?key={YOUR_API_KEY}

{
  "item":{
    "value":"Patient's MRN 444-5-22222 and just a number 333-2-33333"
  },
  "inspectConfig":{
    "customInfoTypes":[
      {
        "infoType":{
          "name":"C_MRN"
        },
        "regex":{
          "pattern":"[0-9]{3}-[0-9]{1}-[0-9]{5}"
        },
        "likelihood":"POSSIBLE",
      }
    ],
    "ruleSet":[
        {
        "infoTypes": [{"name" : "C_MRN"}],
        "rules":[
          {
            "hotwordRule":{
              "hotwordRegex":{
                "pattern":"(?i)(mrn|medical)(?-i)"
              },
              "likelihoodAdjustment":{
                "fixedLikelihood":"VERY_LIKELY"
              },
              "proximity":{
                "windowBefore":10
              }
            }
          }
        ]
      }
    ]
  }
}

JSON output (abbreviated):

{
  "result": {
    "findings": [
      {
        "infoType": {
          "name": "C_MRN"
        },
        "likelihood": "VERY_LIKELY",
        "location": {
          "byteRange": {
            "start": "14",
            "end": "25"
          },
          "codepointRange": { ... }
        }
      },
      {
        "infoType": {
          "name": "C_MRN"
        },
        "likelihood": "POSSIBLE",
          "byteRange": {
            "start": "44",
            "end": "55"
          },
          "codepointRange": { ... }
        }
      }
    ]
  }
}

The output shows that, using the custom infoType detector we gave the name C_MRN and the custom regex, Cloud DLP has correctly identified the medical record number. Further, because of the context matching in the hotword rule, Cloud DLP assigned the first result (which had MRN close by) a certainty of VERY_LIKELY, as configured. Second finding lacked the context, thus the certainty stayed at POSSIBLE.

Java


import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ByteContentItem;
import com.google.privacy.dlp.v2.ByteContentItem.BytesType;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.CustomInfoType;
import com.google.privacy.dlp.v2.CustomInfoType.DetectionRule.HotwordRule;
import com.google.privacy.dlp.v2.CustomInfoType.DetectionRule.LikelihoodAdjustment;
import com.google.privacy.dlp.v2.CustomInfoType.DetectionRule.Proximity;
import com.google.privacy.dlp.v2.CustomInfoType.Regex;
import com.google.privacy.dlp.v2.Finding;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectContentRequest;
import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.privacy.dlp.v2.InspectionRule;
import com.google.privacy.dlp.v2.InspectionRuleSet;
import com.google.privacy.dlp.v2.Likelihood;
import com.google.privacy.dlp.v2.LocationName;
import com.google.protobuf.ByteString;
import java.io.IOException;

public class InspectWithHotwordRules {

  public static void main(String[] args) throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String textToInspect = "Patient's MRN 444-5-22222 and just a number 333-2-33333";
    String customRegexPattern = "[1-9]{3}-[1-9]{1}-[1-9]{5}";
    String hotwordRegexPattern = "(?i)(mrn|medical)(?-i)";
    inspectWithHotwordRules(projectId, textToInspect, customRegexPattern, hotwordRegexPattern);
  }

  // Inspects a BigQuery Table
  public static void inspectWithHotwordRules(
      String projectId, String textToInspect, String customRegexPattern, String hotwordRegexPattern)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      // Specify the type and content to be inspected.
      ByteContentItem byteItem =
          ByteContentItem.newBuilder()
              .setType(BytesType.TEXT_UTF8)
              .setData(ByteString.copyFromUtf8(textToInspect))
              .build();
      ContentItem item = ContentItem.newBuilder().setByteItem(byteItem).build();

      // Specify the regex pattern the inspection will look for.
      Regex regex = Regex.newBuilder().setPattern(customRegexPattern).build();

      // Construct the custom regex detector.
      InfoType infoType = InfoType.newBuilder().setName("C_MRN").build();
      CustomInfoType customInfoType =
          CustomInfoType.newBuilder().setInfoType(infoType).setRegex(regex).build();

      // Specify hotword likelihood adjustment.
      LikelihoodAdjustment likelihoodAdjustment =
          LikelihoodAdjustment.newBuilder().setFixedLikelihood(Likelihood.VERY_LIKELY).build();

      // Specify a window around a finding to apply a detection rule.
      Proximity proximity = Proximity.newBuilder().setWindowBefore(10).build();

      // Construct hotword rule.
      HotwordRule hotwordRule =
          HotwordRule.newBuilder()
              .setHotwordRegex(Regex.newBuilder().setPattern(hotwordRegexPattern).build())
              .setLikelihoodAdjustment(likelihoodAdjustment)
              .setProximity(proximity)
              .build();

      // Construct rule set for the inspect config.
      InspectionRuleSet inspectionRuleSet =
          InspectionRuleSet.newBuilder()
              .addInfoTypes(infoType)
              .addRules(InspectionRule.newBuilder().setHotwordRule(hotwordRule))
              .build();

      // Construct the configuration for the Inspect request.
      InspectConfig config =
          InspectConfig.newBuilder()
              .addCustomInfoTypes(customInfoType)
              .setIncludeQuote(true)
              .setMinLikelihood(Likelihood.POSSIBLE)
              .addRuleSet(inspectionRuleSet)
              .build();

      // Construct the Inspect request to be sent by the client.
      InspectContentRequest request =
          InspectContentRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString())
              .setItem(item)
              .setInspectConfig(config)
              .build();

      // Use the client to send the API request.
      InspectContentResponse response = dlp.inspectContent(request);

      // Parse the response and process results
      System.out.println("Findings: " + response.getResult().getFindingsCount());
      for (Finding f : response.getResult().getFindingsList()) {
        System.out.println("\tQuote: " + f.getQuote());
        System.out.println("\tInfo type: " + f.getInfoType().getName());
        System.out.println("\tLikelihood: " + f.getLikelihood());
      }
    }
  }
}

Python

def inspect_with_medical_record_number_w_custom_hotwords(
    project,
    content_string,
):
    """Uses the Data Loss Prevention API to analyze string with medical record
       number custom regex detector, with custom hotwords rules to boost finding
       certainty under some circumstances.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
    Returns:
        None; the response from the API is printed to the terminal.
    """

    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct a custom regex detector info type called "C_MRN",
    # with ###-#-##### pattern, where each # represents a digit from 1 to 9.
    # The detector has a detection likelihood of POSSIBLE.
    custom_info_types = [
        {
            "info_type": {"name": "C_MRN"},
            "regex": {"pattern": "[1-9]{3}-[1-9]{1}-[1-9]{5}"},
            "likelihood": "POSSIBLE",
        }
    ]

    # Construct a rule set with hotwords "mrn" and "medical", with a likelohood
    # boost to VERY_LIKELY when hotwords are present within the 10 character-
    # window preceding the PII finding.
    hotword_rule = {
        "hotword_regex": {
            "pattern": "(?i)(mrn|medical)(?-i)"
        },
        "likelihood_adjustment": {
            "fixed_likelihood": "VERY_LIKELY"
        },
        "proximity": {
            "window_before": 10
        }
    }

    rule_set = [
        {
            "info_types": [{"name": "C_MRN"}],
            "rules": [{"hotword_rule": hotword_rule}],
        }
    ]

    # Construct the configuration dictionary with the custom regex info type.
    inspect_config = {
        "custom_info_types": custom_info_types,
        "rule_set": rule_set,
    }

    # Construct the `item`.
    item = {"value": content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print(f"Quote: {finding.quote}")
            except AttributeError:
                pass
            print(f"Info type: {finding.info_type.name}")
            print(f"Likelihood: {finding.likelihood}")
    else:
        print("No findings.")