Match likelihood

Scan results are categorized by how likely they are to represent a match. Sensitive Data Protection reports likelihood as one of a fixed set of buckets, each of which indicates how likely it is that a piece of data matches a given infoType.

How likelihood works

When you configure a Sensitive Data Protection scan, you set the infoTypes that you want Sensitive Data Protection to scan for. To narrow the scan results, you can set a minimum likelihood level in your request.

For each potential match (finding) that is detected during the scan, Sensitive Data Protection assigns a likelihood level. The likelihood level of a finding describes how likely it is that the finding matches an infoType that you're scanning for. For example, Sensitive Data Protection might assign a likelihood of LIKELY to a finding that looks like an email address.

When Sensitive Data Protection returns the results, it filters out any findings that have a lower likelihood than the minimum likelihood level that you set in your request. For example, if you set the minimum likelihood to POSSIBLE, you get only the findings that were evaluated as POSSIBLE, LIKELY, and VERY_LIKELY. If you set the minimum likelihood to VERY_LIKELY, you get the smallest number of findings.
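The filtering step described above can be sketched locally as follows. This is a minimal illustration, not the service's implementation: the `Finding` class and the `filter_findings` function are hypothetical names, the sample findings are invented, and the numeric values of the `Likelihood` enum are an assumption chosen only so that the levels compare from least to most likely.

```python
from dataclasses import dataclass
from enum import IntEnum


class Likelihood(IntEnum):
    """Likelihood buckets, ordered from least to most likely (assumed values)."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


@dataclass
class Finding:
    """Hypothetical stand-in for a single scan finding."""
    quote: str
    info_type: str
    likelihood: Likelihood


def filter_findings(findings, min_likelihood):
    """Keep only the findings at or above the minimum likelihood level."""
    return [f for f in findings if f.likelihood >= min_likelihood]


# Invented sample findings, for illustration only.
findings = [
    Finding("jane@example.com", "EMAIL_ADDRESS", Likelihood.LIKELY),
    Finding("555-0100", "PHONE_NUMBER", Likelihood.POSSIBLE),
    Finding("1234", "PHONE_NUMBER", Likelihood.VERY_UNLIKELY),
]

kept = filter_findings(findings, Likelihood.POSSIBLE)
print([f.quote for f in kept])
```

With a minimum of POSSIBLE, the sketch keeps the first two findings and drops the VERY_UNLIKELY one. Raising the minimum to VERY_LIKELY would drop all three.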

Likelihood levels

The following table lists the possible likelihood values that Sensitive Data Protection can assign to a finding.

ENUM           Description
VERY_UNLIKELY  Characterized by the following:
                 • A weak signal.
                 • Absence of contextual clues.
                 • Negative signals for a given infoType.
UNLIKELY       Characterized by the following:
                 • One or more weak signals.
                 • A stronger signal for another infoType.
POSSIBLE       Characterized by the following:
                 • One or more signals toward a given infoType. Signals can include passing checksums.
                 • Lack of a strong contextual clue and unique, specific formatting.
LIKELY         Characterized by one or more strong signals for a given infoType. Signals can include passing checksums, strong contextual clues, and unique, specific formatting.
VERY_LIKELY    Characterized by having many strong signals for a given infoType. Signals can include passing checksums, strong contextual clues, and unique, specific formatting.

Choosing a minimum likelihood level for the scan results

In general, when you set a higher minimum likelihood level in your Sensitive Data Protection request, the results contain fewer false positives (sometimes called noise). However, the results can also exclude more true positives. Choosing a minimum likelihood level involves finding the right balance between recall and precision.

For example, suppose that a document contains 10 street addresses and Sensitive Data Protection reports 5 findings as street addresses. However, only 4 of those 5 findings are actually street addresses.

  • Recall is the number of true positive instances out of the total number of relevant instances. In this example, the recall is 4/10.
  • Precision is the number of true positive instances out of the total number of instances that Sensitive Data Protection identifies. In this example, the precision is 4/5.

In this example, the precision is high (80%) but the recall is relatively low (40%).
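The arithmetic in the street-address example can be checked with a short sketch. The function and variable names are illustrative; the numbers come directly from the example.

```python
def precision(true_positives, identified):
    """Fraction of the identified instances that are true positives."""
    return true_positives / identified


def recall(true_positives, relevant):
    """Fraction of the relevant instances that were found."""
    return true_positives / relevant


relevant = 10        # street addresses actually present in the document
identified = 5       # findings reported as street addresses
true_positives = 4   # reported findings that really are street addresses

print(f"precision = {precision(true_positives, identified):.2f}")  # 0.80
print(f"recall    = {recall(true_positives, relevant):.2f}")       # 0.40
```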

The minimum likelihood level that you set affects the level of recall and precision that you get in your scan results. The following table describes when each minimum likelihood level is useful and how recall and precision vary at each level.

Minimum likelihood level  Description
LIKELIHOOD_UNSPECIFIED    Default value; same as POSSIBLE.
VERY_UNLIKELY             Useful if you need the highest recall. This minimum likelihood level generates the most noise.
UNLIKELY                  Useful if you need higher recall. This minimum likelihood level generates some noise.
POSSIBLE                  Useful if you want a balance of precision and recall.
LIKELY                    Useful if you need a higher precision at the expense of some recall.
VERY_LIKELY               Useful if you want the highest precision at the expense of recall.
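The trade-off in the table above can be illustrated with a toy sweep over the minimum likelihood level. Everything in this sketch is an assumption made for illustration: the labeled findings and their counts are invented, the enum values are chosen only to give the levels an order, and recall is computed as if every true instance were detected at some likelihood level. Real scans will not necessarily behave this neatly.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Likelihood buckets, ordered from least to most likely (assumed values)."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


# Hypothetical labeled findings: (assigned likelihood, is it a real match?).
# The counts are invented for illustration only.
labeled = (
    [(Likelihood.VERY_LIKELY, True)] * 3
    + [(Likelihood.LIKELY, True)] * 2 + [(Likelihood.LIKELY, False)] * 1
    + [(Likelihood.POSSIBLE, True)] * 2 + [(Likelihood.POSSIBLE, False)] * 2
    + [(Likelihood.UNLIKELY, True)] * 1 + [(Likelihood.UNLIKELY, False)] * 3
    + [(Likelihood.VERY_UNLIKELY, True)] * 1
    + [(Likelihood.VERY_UNLIKELY, False)] * 4
)

# Simplification: treat every real match in the toy data as the relevant set.
total_relevant = sum(1 for _, is_match in labeled if is_match)


def precision_recall(min_likelihood):
    """Return (precision, recall) for findings at or above min_likelihood."""
    kept = [is_match for level, is_match in labeled if level >= min_likelihood]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    return precision, true_positives / total_relevant


# Iterates from VERY_UNLIKELY (lowest threshold) to VERY_LIKELY (highest).
for level in Likelihood:
    p, r = precision_recall(level)
    print(f"{level.name:<14} precision={p:.2f} recall={r:.2f}")
```

In this toy data, each step up in the minimum level raises precision and lowers recall, which mirrors the pattern the table describes.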

Default minimum likelihood

If you don't set a minimum likelihood in your request, or if you set it to LIKELIHOOD_UNSPECIFIED, Sensitive Data Protection returns only the findings with a likelihood of POSSIBLE and higher.
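The default behavior can be expressed as a small resolution step. This is a sketch under stated assumptions, not the service's code: `effective_min_likelihood` is a hypothetical helper, and the enum values are assumed for ordering purposes.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Likelihood values, including the unspecified default (assumed values)."""
    LIKELIHOOD_UNSPECIFIED = 0
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def effective_min_likelihood(requested=None):
    """Resolve an omitted or unspecified minimum likelihood to POSSIBLE."""
    if requested is None or requested == Likelihood.LIKELIHOOD_UNSPECIFIED:
        return Likelihood.POSSIBLE
    return requested


print(effective_min_likelihood().name)                                   # POSSIBLE
print(effective_min_likelihood(Likelihood.LIKELIHOOD_UNSPECIFIED).name)  # POSSIBLE
print(effective_min_likelihood(Likelihood.LIKELY).name)                  # LIKELY
```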