InfoTypes and infoType detectors

Sensitive Data Protection uses information types—or infoTypes—to define what it scans for. An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, credit card number, and so on. An infoType detector is the corresponding detection mechanism that matches on an infoType's matching criteria.

Best practices for selecting infoTypes

Understanding your data is one of the first critical steps in protecting it. As a best practice, you should collect, store, and process only the information that you have a business need for. By identifying the data you are handling, you can make informed decisions for your business, users, and data security and privacy posture.

Some of your business use cases might require certain sensitive information, and others might not. There is no single solution that supports all use cases. For this reason, Sensitive Data Protection offers flexible control over the types of data to scan for. If you're using infoTypes for de-identification or masking, you also have control of when and how data is transformed.

General guidelines

Consider the following general guidelines when selecting infoTypes.

Sensitive information that you don't need to collect

Each service in your business should collect only the data that the service needs. For example, certain services in your business don't need to collect financial information. For those services, consider enabling infoType detectors like CREDIT_CARD_NUMBER, FINANCIAL_ACCOUNT_NUMBER, and other infoTypes in the industry category FINANCE.

Information that you need to collect but don't want to share broadly with your team

There might be valid use cases for collecting personal information, but you shouldn't share it broadly with your team. For example, a customer who files a support ticket might give you contact information, so that you can contact them to resolve any issues. You don't want everyone on the team who views the ticket to see the personally identifiable information (PII). Consider enabling infoType detectors like PHONE_NUMBER, EMAIL_ADDRESS, and other infoTypes in the type category PII.

Categories of sensitive data that are under industry, data privacy, or jurisdictional regulations

Certain information types are considered sensitive because of how they are issued or what they can be used for. In other cases, contextual and demographic information are considered a protected category. These types of information might have additional restrictions on how they are collected, used, and managed. Consider enabling infoType detectors in the following categories:

Type category SPII, GOVERNMENT_ID, and DEMOGRAPHIC
Industry category HEALTH

Choosing between similar infoTypes

Consider the following when choosing between similar infoType detectors.

Passports

If you don't need to scan for passport identifiers from a specific country, then choose the generalized detector: PASSPORT.

Certain country-specific passport detectors, like UK_PASSPORT, are available. However, some country-specific passport detectors can only identify passports with specific formats or with the presence of contextual clues.

Person names

When scanning for people's names, use PERSON_NAME for most use cases instead of FIRST_NAME or LAST_NAME.

PERSON_NAME is a detector for people's names. It includes single-word names and full names. This detector attempts to detect, for example, names like Jane, Jane Smith, and Jane Marie Smith using various technologies, including natural language understanding. FIRST_NAME and LAST_NAME are subsets of this detector that attempt to identify parts of a name. Findings from these detectors are always subsets of findings from PERSON_NAME.

Dates and times

If you don't need to scan for all dates, consider using a targeted date detector like DATE_OF_BIRTH. This detector attempts to identify context indicating that the date is related to when a person is born.

The DATE detector attempts to find all dates regardless of context. It also flags relative dates, like today or yesterday. Similarly, TIME attempts to find all timestamps.

Locations

If you don't need to scan for all locations, consider using STREET_ADDRESS instead of the LOCATION detector. The STREET_ADDRESS detector attempts to find fully qualified addresses, which are usually more precise than generic locations and can be considered more sensitive.

The LOCATION infoType detector attempts to find any location regardless of context—for example, Paris or Canada.

InfoType detectors that require context

Many infoType detectors require contextual clues to be present before they identify a match. If a built-in infoType detector isn't flagging items that you expect to be flagged, because no contextual clues occur in close proximity to those items, then consider using GENERIC_ID or a custom infoType detector instead.

Information types lacking a common industry definition

Some information types lack a common industry definition. Examples are medical record numbers, account numbers, PINs, and security codes. For these types, consider using infoTypes like GENERIC_ID, FINANCIAL_ACCOUNT_NUMBER, and MEDICAL_RECORD_NUMBER. These detectors use a combination of entity detection and context to find potentially sensitive elements.

Higher-latency infoType detectors

Avoid enabling infoType detectors that you don't need. Although the following are useful in certain scenarios, these infoTypes can make requests run much more slowly than requests that don't include them:
- PERSON_NAME
- FEMALE_NAME
- MALE_NAME
- FIRST_NAME
- LAST_NAME
- DATE_OF_BIRTH
- LOCATION
- STREET_ADDRESS
- ORGANIZATION_NAME

Always specify infoType detectors explicitly. Don't use an empty infoTypes list.
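
The two recommendations above can be enforced before a request is ever sent. The following is a minimal sketch, assuming you build your inspection configuration as a Python dictionary; `build_inspect_config` and `HIGH_LATENCY_INFO_TYPES` are hypothetical names for illustration and are not part of the client library.

```python
# Detectors that this page lists as higher latency.
HIGH_LATENCY_INFO_TYPES = frozenset({
    "PERSON_NAME", "FEMALE_NAME", "MALE_NAME", "FIRST_NAME", "LAST_NAME",
    "DATE_OF_BIRTH", "LOCATION", "STREET_ADDRESS", "ORGANIZATION_NAME",
})


def build_inspect_config(info_type_names, include_quote=True):
    """Return (inspect_config, slow_detectors) for an explicit infoType list.

    Hypothetical helper: it refuses an empty list and reports which of the
    requested detectors are on the higher-latency list.
    """
    # De-duplicate while keeping the caller's order.
    names = list(dict.fromkeys(info_type_names))
    if not names:
        raise ValueError("Specify at least one infoType detector explicitly.")

    slow_detectors = [name for name in names if name in HIGH_LATENCY_INFO_TYPES]
    inspect_config = {
        "info_types": [{"name": name} for name in names],
        "include_quote": include_quote,
    }
    return inspect_config, slow_detectors


if __name__ == "__main__":
    config, slow = build_inspect_config(["PHONE_NUMBER", "PERSON_NAME"])
    print(config)
    print("Higher-latency detectors requested:", slow)
```

The returned dictionary has the same shape as the `inspect_config` used in the Python sample later on this page, so it could be passed to `inspect_content` unchanged.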

How to use infoTypes

Sensitive Data Protection uses infoType detectors in the configuration for its scans to determine what to inspect for and how to transform findings. InfoType names are also used when displaying or reporting scan results.

For example, if you wanted to look for email addresses in a block of text, you would specify the EMAIL_ADDRESS infoType detector in the inspection configuration. If you wanted to redact email addresses from the text block, you would specify EMAIL_ADDRESS in both the inspection configuration and the de-identification configuration to indicate how to redact or transform that type.
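
As a rough sketch of the redaction case, the request below names EMAIL_ADDRESS in both the inspection configuration and the de-identification configuration. It only builds the request dictionary; `build_email_redaction_request` is a hypothetical helper, and replacing each finding with its infoType name is just one possible transformation chosen for illustration.

```python
def build_email_redaction_request(project_id, text):
    """Build a de-identification request that replaces email addresses.

    Hypothetical helper: the dictionary mirrors the shape the Python client
    accepts for its `request` argument.
    """
    # What to look for.
    inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}

    # How to transform what was found: replace each finding with the
    # name of its infoType.
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}
                    },
                }
            ]
        }
    }

    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": {"value": text},
    }


if __name__ == "__main__":
    request = build_email_redaction_request(
        "my-project-id", "My email is gary@example.com"
    )
    print(request)
    # With credentials configured, the request could be sent like this:
    #   import google.cloud.dlp
    #   client = google.cloud.dlp_v2.DlpServiceClient()
    #   response = client.deidentify_content(request=request)
    #   print(response.item.value)
```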

Further, you could use a combination of built-in and custom infoType detectors to exclude a subset of email addresses from scan findings. First, create a custom infoType called INTERNAL_EMAIL_ADDRESS and configure it to exclude internal test email addresses. Then, you can set up your scan to include findings for EMAIL_ADDRESS, but include an exclusion rule that excludes any findings that match INTERNAL_EMAIL_ADDRESS. For more information about exclusion rules and other features of custom infoType detectors, see Creating custom infoType detectors.
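
The exclusion setup described above might look like the following sketch. It assumes that internal test addresses can be recognized by a regular expression; the pattern shown is invented for illustration, and `build_config_excluding_internal_emails` is a hypothetical helper. Enum values are written as strings, as in the REST JSON; if your client library version does not accept strings, use the corresponding enum members instead.

```python
def build_config_excluding_internal_emails(internal_pattern):
    """Inspect for EMAIL_ADDRESS but drop findings that match the
    INTERNAL_EMAIL_ADDRESS custom infoType (hypothetical helper)."""
    return {
        "info_types": [{"name": "EMAIL_ADDRESS"}],
        "custom_info_types": [
            {
                "info_type": {"name": "INTERNAL_EMAIL_ADDRESS"},
                "regex": {"pattern": internal_pattern},
                # The custom infoType is used only for exclusion, so its own
                # matches are not returned as findings.
                "exclusion_type": "EXCLUSION_TYPE_EXCLUDE",
            }
        ],
        "rule_set": [
            {
                # The rule applies to EMAIL_ADDRESS findings.
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "rules": [
                    {
                        "exclusion_rule": {
                            "exclude_info_types": {
                                "info_types": [
                                    {"name": "INTERNAL_EMAIL_ADDRESS"}
                                ]
                            },
                            "matching_type": "MATCHING_TYPE_FULL_MATCH",
                        }
                    }
                ],
            }
        ],
        "include_quote": True,
    }


if __name__ == "__main__":
    # Assumed pattern for internal test addresses; adjust to your own data.
    config = build_config_excluding_internal_emails(r".+@test\.example\.com")
    print(config)
```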

Sensitive Data Protection provides a set of built-in infoType detectors that you specify by name, each of which is listed in InfoType detector reference. These detectors use a variety of techniques to discover and classify each type. For example, some types will require a pattern match, some may have mathematical checksums, some have special digit restrictions, and others may have specific prefixes or context around the findings.
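
To make the idea of combining a pattern match with a checksum concrete, here is a small local sketch that uses the Luhn checksum carried by payment card numbers. It is not Sensitive Data Protection's implementation, which this page does not describe; `luhn_valid`, `looks_like_card_number`, and the candidate pattern are all hypothetical.

```python
import re

# Assumed candidate shape: 13 to 19 digits, optionally separated by
# single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"(?:\d[ -]?){12,18}\d")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card_number(candidate):
    """Pattern check first, checksum second."""
    if CARD_CANDIDATE.fullmatch(candidate) is None:
        return False
    return luhn_valid(candidate)


if __name__ == "__main__":
    for value in ("4111 1111 1111 1111", "4111 1111 1111 1112", "not a card"):
        print(value, looks_like_card_number(value))
```

A value that has the right shape but fails the checksum is rejected, which is the same reason the "Certainty and testing" section gives for made-up sample data sometimes not being reported.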

Examples

When you set up Sensitive Data Protection to scan your content, you include the infoType detectors to use in the scan configuration.

For example, the following JSON and code samples demonstrate a simple scan request to the DLP API. Notice that the PHONE_NUMBER detector is specified in inspectConfig, which instructs Sensitive Data Protection to scan the given string for a phone number.

C#

To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.

To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


using System;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Dlp.V2;

public class InspectPhoneNumber
{
    public static InspectContentResponse Inspect(
        string projectId,
        string text,
        Likelihood minLikelihood = Likelihood.Possible)
    {
        // Instantiate a client.
        var dlp = DlpServiceClient.Create();

        // Set content item.
        var contentItem = new ContentItem { Value = text };

        // Construct inspect config.
        var inspectConfig = new InspectConfig
        {
            InfoTypes = { new InfoType { Name = "PHONE_NUMBER" } },
            IncludeQuote = true,
            MinLikelihood = minLikelihood
        };

        // Construct a request.
        var request = new InspectContentRequest
        {
            ParentAsLocationName = new LocationName(projectId, "global"),
            InspectConfig = inspectConfig,
            Item = contentItem,
        };

        // Call the API.
        var response = dlp.InspectContent(request);

        // Inspect the results.
        var resultFindings = response.Result.Findings;

        Console.WriteLine($"Findings: {resultFindings.Count}");

        foreach (var f in resultFindings)
        {
            Console.WriteLine("\tQuote: " + f.Quote);
            Console.WriteLine("\tInfo type: " + f.InfoType.Name);
            Console.WriteLine("\tLikelihood: " + f.Likelihood);
        }

        return response;
    }
}

Go

import (
	"context"
	"fmt"
	"io"

	dlp "cloud.google.com/go/dlp/apiv2"
	"cloud.google.com/go/dlp/apiv2/dlppb"
)

// inspectPhoneNumber demonstrates a simple scan request to the Cloud DLP API.
// Notice that the PHONE_NUMBER detector is specified in inspectConfig,
// which instructs Cloud DLP to scan the given string for a phone number.
func inspectPhoneNumber(w io.Writer, projectID, textToInspect string) error {
	// projectID := "my-project-id"
	// textToInspect := "My phone number is (123) 555-6789"

	ctx := context.Background()

	// Initialize a client once and reuse it to send multiple requests. Clients
	// are safe to use across goroutines. When the client is no longer needed,
	// call the Close method to cleanup its resources.
	client, err := dlp.NewClient(ctx)
	if err != nil {
		return err
	}
	// Closing the client safely cleans up background resources.
	defer client.Close()

	// Create and send the request.
	req := &dlppb.InspectContentRequest{
		Parent: fmt.Sprintf("projects/%s/locations/global", projectID),
		Item: &dlppb.ContentItem{
			DataItem: &dlppb.ContentItem_Value{
				Value: textToInspect,
			},
		},
		InspectConfig: &dlppb.InspectConfig{
			// Specify the type of info the inspection will look for.
			// See https://cloud.google.com/dlp/docs/infotypes-reference
			// for complete list of info types
			InfoTypes: []*dlppb.InfoType{
				{Name: "PHONE_NUMBER"},
			},
			IncludeQuote: true,
		},
	}

	// Send the request.
	resp, err := client.InspectContent(ctx, req)
	if err != nil {
		fmt.Fprintf(w, "receive: %v", err)
		return err
	}

	// Process the results.
	result := resp.Result
	fmt.Fprintf(w, "Findings: %d\n", len(result.Findings))
	for _, f := range result.Findings {
		fmt.Fprintf(w, "\tQuote: %s\n", f.Quote)
		fmt.Fprintf(w, "\tInfo type: %s\n", f.InfoType.Name)
		fmt.Fprintf(w, "\tLikelihood: %s\n", f.Likelihood)
	}
	return nil
}

Java

import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.Finding;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectContentRequest;
import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.privacy.dlp.v2.Likelihood;
import com.google.privacy.dlp.v2.LocationName;
import java.io.IOException;

public class InspectPhoneNumber {

  public static void main(String[] args) throws Exception {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String textToInspect = "My name is Gary and my phone number is (415) 555-0890";
    inspectString(projectId, textToInspect);
  }

  // Inspects the provided text.
  public static void inspectString(String projectId, String textToInspect) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      // Specify the type and content to be inspected.
      ContentItem item = ContentItem.newBuilder().setValue(textToInspect).build();

      // Specify the type of info the inspection will look for.
      // See https://cloud.google.com/dlp/docs/infotypes-reference for complete list of info types
      InfoType infoType = InfoType.newBuilder().setName("PHONE_NUMBER").build();

      // Construct the configuration for the Inspect request.
      InspectConfig config =
          InspectConfig.newBuilder()
              .setIncludeQuote(true)
              .setMinLikelihood(Likelihood.POSSIBLE)
              .addInfoTypes(infoType)
              .build();

      // Construct the Inspect request to be sent by the client.
      InspectContentRequest request =
          InspectContentRequest.newBuilder()
              .setParent(LocationName.of(projectId, "global").toString())
              .setItem(item)
              .setInspectConfig(config)
              .build();

      // Use the client to send the API request.
      InspectContentResponse response = dlp.inspectContent(request);

      // Parse the response and process results
      System.out.println("Findings: " + response.getResult().getFindingsCount());
      for (Finding f : response.getResult().getFindingsList()) {
        System.out.println("\tQuote: " + f.getQuote());
        System.out.println("\tInfo type: " + f.getInfoType().getName());
        System.out.println("\tLikelihood: " + f.getLikelihood());
      }
    }
  }
}

Node.js

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The project ID to run the API call under
// const projectId = 'my-project';

// The string to inspect
// const string = 'My email is gary@example.com and my phone number is (223) 456-7890.';

// The minimum likelihood required before returning a match
// const minLikelihood = 'LIKELIHOOD_UNSPECIFIED';

// The maximum number of findings to report per request (0 = server maximum)
// const maxFindings = 0;

// The infoTypes of information to match
// See https://cloud.google.com/dlp/docs/concepts-infotypes for more information
// about supported infoTypes.
// const infoTypes = [{ name: 'PHONE_NUMBER' }];

// The customInfoTypes of information to match
// const customInfoTypes = [{ infoType: { name: 'DICT_TYPE' }, dictionary: { wordList: { words: ['foo', 'bar', 'baz']}}},
//   { infoType: { name: 'REGEX_TYPE' }, regex: {pattern: '\\(\\d{3}\\) \\d{3}-\\d{4}'}}];

// Whether to include the matching string
// const includeQuote = true;

async function inspectPhoneNumber() {
  // Construct item to inspect
  const item = {value: string};

  // Construct request
  const request = {
    parent: `projects/${projectId}/locations/global`,
    inspectConfig: {
      infoTypes: infoTypes,
      customInfoTypes: customInfoTypes,
      minLikelihood: minLikelihood,
      includeQuote: includeQuote,
      limits: {
        maxFindingsPerRequest: maxFindings,
      },
    },
    item: item,
  };

  // Run request
  const [response] = await dlp.inspectContent(request);
  const findings = response.result.findings;
  if (findings.length > 0) {
    console.log('Findings:');
    findings.forEach(finding => {
      if (includeQuote) {
        console.log(`\tQuote: ${finding.quote}`);
      }
      console.log(`\tInfo type: ${finding.infoType.name}`);
      console.log(`\tLikelihood: ${finding.likelihood}`);
    });
  } else {
    console.log('No findings.');
  }
}
inspectPhoneNumber();

PHP

use Google\Cloud\Dlp\V2\Client\DlpServiceClient;
use Google\Cloud\Dlp\V2\ContentItem;
use Google\Cloud\Dlp\V2\InfoType;
use Google\Cloud\Dlp\V2\InspectConfig;
use Google\Cloud\Dlp\V2\InspectContentRequest;
use Google\Cloud\Dlp\V2\Likelihood;

/**
 * Inspect data for phone numbers
 * Demonstrates a simple scan request to the Cloud DLP API. Notice that the PHONE_NUMBER detector is specified in inspectConfig, which instructs Cloud DLP to scan the given string for a phone number.
 *
 * @param string $projectId         The Google Cloud project id to use as a parent resource.
 * @param string $textToInspect     The string to inspect.
 */
function inspect_phone_number(
    // TODO(developer): Replace sample parameters before running the code.
    string $projectId,
    string $textToInspect = 'My name is Gary and my phone number is (415) 555-0890'
): void {
    // Instantiate a client.
    $dlp = new DlpServiceClient();

    $parent = "projects/$projectId/locations/global";

    // Specify what content you want the service to Inspect.
    $item = (new ContentItem())
        ->setValue($textToInspect);

    $inspectConfig = (new InspectConfig())
        // The infoTypes of information to match
        ->setInfoTypes([
            (new InfoType())->setName('PHONE_NUMBER'),
        ])
        // Whether to include the matching string
        ->setIncludeQuote(true)
        ->setMinLikelihood(Likelihood::POSSIBLE);

    // Run request
    $inspectContentRequest = (new InspectContentRequest())
        ->setParent($parent)
        ->setInspectConfig($inspectConfig)
        ->setItem($item);
    $response = $dlp->inspectContent($inspectContentRequest);

    // Print the results
    $findings = $response->getResult()->getFindings();
    if (count($findings) == 0) {
        printf('No findings.' . PHP_EOL);
    } else {
        printf('Findings:' . PHP_EOL);
        foreach ($findings as $finding) {
            printf('  Quote: %s' . PHP_EOL, $finding->getQuote());
            printf('  Info type: %s' . PHP_EOL, $finding->getInfoType()->getName());
            printf('  Likelihood: %s' . PHP_EOL, Likelihood::name($finding->getLikelihood()));
        }
    }
}

Python

import google.cloud.dlp


def inspect_phone_number(
    project: str,
    content_string: str,
) -> None:
    """Uses the Data Loss Prevention API to analyze strings for protected data.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect phone number from.
    """

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{"name": "PHONE_NUMBER"}]

    # Construct the configuration dictionary.
    inspect_config = {
        "info_types": info_types,
        "include_quote": True,
    }

    # Construct the `item`.
    item = {"value": content_string}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.inspect_content(
        request={"parent": parent, "inspect_config": inspect_config, "item": item}
    )

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            print(f"Quote: {finding.quote}")
            print(f"Info type: {finding.info_type.name}")
            print(f"Likelihood: {finding.likelihood}")
    else:
        print("No findings.")

REST

JSON input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT-ID]/content:inspect?key={YOUR_API_KEY}

{
  "item":{
    "value":"My phone number is (415) 555-0890"
  },
  "inspectConfig":{
    "includeQuote":true,
    "minLikelihood":"POSSIBLE",
    "infoTypes":[
      {
        "name":"PHONE_NUMBER"
      }
    ]
  }
}

When you send the preceding request to the specified endpoint, Sensitive Data Protection returns the following:

JSON output:

{
  "result":{
    "findings":[
      {
        "quote":"(415) 555-0890",
        "infoType":{
          "name":"PHONE_NUMBER"
        },
        "likelihood":"VERY_LIKELY",
        "location":{
          "byteRange":{
            "start":"19",
            "end":"33"
          },
          "codepointRange":{
            "start":"19",
            "end":"33"
          }
        },
        "createTime":"2018-10-29T23:46:34.535Z"
      }
    ]
  }
}

You must specify particular infoTypes listed in the reference in your inspection configuration. If you don't specify any infoTypes, Sensitive Data Protection uses a default infoTypes list that is intended for testing purposes only. The default list might not be suitable for your use cases.

For more information on how to use infoType detectors to scan your content, see one of the how-to topics about inspecting, redacting, or de-identifying.

Certainty and testing

Findings are reported with a certainty score called likelihood. The likelihood score indicates how likely it is that a finding matches the corresponding infoType. For example, a detector might return a lower likelihood if a value only matches the pattern, and a higher likelihood if the value matches the pattern and has positive context around it. For this reason, a single finding can match several types at lower likelihood. A finding might also not appear, or might have lower certainty, if it doesn't match properly or if it has negative context around it. For example, a finding might not be reported if it matches the structure for the specified infoType but fails the infoType's checksum. Or a finding could match more than one infoType but have context that boosts one of them, and thus be reported only for that type.

If you are testing various detectors, you might notice that fake or sample data isn't reported, because it doesn't pass enough checks to be reported.
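
When you experiment with detectors, it can help to request findings at a low threshold and then filter them locally to see how the likelihood scores are distributed. The sketch below assumes findings in the JSON shape shown in the REST output earlier on this page; `filter_findings` is a hypothetical helper, and the ordering list reflects the likelihood values from lowest to highest certainty.

```python
# Likelihood values ordered from lowest to highest certainty.
LIKELIHOOD_ORDER = [
    "LIKELIHOOD_UNSPECIFIED",
    "VERY_UNLIKELY",
    "UNLIKELY",
    "POSSIBLE",
    "LIKELY",
    "VERY_LIKELY",
]


def filter_findings(findings, min_likelihood="POSSIBLE"):
    """Keep findings whose likelihood is at or above `min_likelihood`."""
    threshold = LIKELIHOOD_ORDER.index(min_likelihood)
    kept = []
    for finding in findings:
        likelihood = finding.get("likelihood", "LIKELIHOOD_UNSPECIFIED")
        if LIKELIHOOD_ORDER.index(likelihood) >= threshold:
            kept.append(finding)
    return kept


if __name__ == "__main__":
    # Sample data invented for illustration.
    sample = [
        {"quote": "(415) 555-0890", "infoType": {"name": "PHONE_NUMBER"},
         "likelihood": "VERY_LIKELY"},
        {"quote": "555-0890", "infoType": {"name": "PHONE_NUMBER"},
         "likelihood": "UNLIKELY"},
    ]
    for finding in filter_findings(sample, "LIKELY"):
        print(finding["quote"], finding["likelihood"])
```

In production you would usually set the minimum likelihood in the inspection configuration instead, as the samples on this page do, so that low-certainty findings are never returned.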

Kinds of infoType detectors

Sensitive Data Protection includes several kinds of infoType detectors, all of which are summarized here:

Built-in infoType detectors are built into Sensitive Data Protection. They include detectors for country- or region-specific sensitive data types as well as globally applicable data types.
Custom infoType detectors are detectors that you create yourself. There are three kinds of custom infoType detectors:
- Small custom dictionary detectors are simple word lists that Sensitive Data Protection matches on. Use small custom dictionary detectors when you have a list of up to several tens of thousands of words or phrases. Small custom dictionary detectors are preferred if you don't anticipate your word list changing significantly.
- Large custom dictionary detectors are generated by Sensitive Data Protection using large lists of words or phrases stored in either Cloud Storage or BigQuery. Use large custom dictionary detectors when you have a large list of words or phrases—up to tens of millions.
- Regular expressions (regex) detectors enable Sensitive Data Protection to detect matches based on a regular expression pattern.

In addition, Sensitive Data Protection includes the concept of inspection rules, which enable you to fine-tune scan results using the following:

Exclusion rules enable you to decrease the number of findings returned by adding rules to a built-in or custom infoType detector.
Hotword rules enable you to increase the quantity or change the likelihood value of findings returned by adding rules to a built-in or custom infoType detector.

Built-in infoType detectors

Built-in infoType detectors are built into Sensitive Data Protection, and include detectors for country- or region-specific sensitive data types such as the French Numéro d'Inscription au Répertoire (NIR) (FRANCE_NIR), UK driver's license number (UK_DRIVERS_LICENSE_NUMBER), and US Social Security number (US_SOCIAL_SECURITY_NUMBER). They also include globally applicable data types such as person names (PERSON_NAME), telephone numbers (PHONE_NUMBER), email addresses (EMAIL_ADDRESS), and credit card numbers (CREDIT_CARD_NUMBER). To detect content that corresponds to infoTypes, Sensitive Data Protection uses various techniques, including pattern matching, checksums, machine learning, and context analysis.

The list of built-in infoType detectors is always being updated. For a complete list of currently supported built-in infoType detectors, see InfoType detector reference.

You can also view a complete list of all built-in infoType detectors by calling Sensitive Data Protection's infoTypes.list method.
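
A minimal sketch of calling the list method from Python follows. It assumes the `google-cloud-dlp` client library and its `list_info_types` method; `list_info_type_names` is a hypothetical wrapper, and the client is passed in as a parameter so that the function can be exercised without credentials.

```python
def list_info_type_names(client, language_code="en-US"):
    """Return the sorted names of the built-in infoTypes.

    `client` is expected to behave like DlpServiceClient: it needs a
    list_info_types(request=...) method whose response has an `info_types`
    sequence of objects with a `name` attribute.
    """
    response = client.list_info_types(request={"language_code": language_code})
    return sorted(info_type.name for info_type in response.info_types)


# With Application Default Credentials configured, usage might look like:
#
#   import google.cloud.dlp
#   client = google.cloud.dlp_v2.DlpServiceClient()
#   for name in list_info_type_names(client):
#       print(name)
```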

Language support

Country-specific infoTypes support the English language and the respective country's languages. Most global infoTypes work with multiple languages. Test Sensitive Data Protection with your data to verify that it meets your requirements.

Custom infoType detectors

There are three kinds of custom infoType detectors:

Small custom dictionary detectors
Large custom dictionary detectors
Regular expressions (regex)

In addition, Sensitive Data Protection includes inspection rules, which enable you to fine-tune scan results by adding the following to existing detectors:

Exclusion rules
Hotword rules

Small custom dictionary detectors

Use small custom dictionary detectors (also referred to as "regular custom dictionary detectors") to match a short (up to several tens of thousands) list of words or phrases. A small custom dictionary can act as its own unique detector.

Custom dictionary detectors are useful when you want to scan for a list of words or phrases that are not easily matched by a regular expression or a built-in detector. For example, suppose you want to scan for conference rooms that are commonly referred to by their assigned room names rather than their room numbers, such as state or region names, landmarks, fictional characters, and so on. You can make a small custom dictionary detector that contains a list of these room names. Sensitive Data Protection can scan your content for each of the room names and return a match when it encounters one of them in context. Learn more about how Sensitive Data Protection matches dictionary words and phrases in the "Dictionary matching specifics" section of Creating a regular custom dictionary detector.

For more details about how small dictionary custom infoType detectors work, as well as examples in action, see Creating a regular custom dictionary detector.
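
The conference-room scenario could be configured roughly as follows. This is a sketch only: `build_dictionary_inspect_config`, the infoType name ROOM_NAME, and the room names are invented for illustration. The dictionary shape follows the `customInfoTypes` comment in the Node.js sample on this page, written with the snake_case keys that the Python client uses.

```python
def build_dictionary_inspect_config(info_type_name, words):
    """Build an inspection configuration around a small custom dictionary.

    Hypothetical helper: blank entries are dropped, and an empty word list
    is rejected because it would never match anything.
    """
    cleaned = [word.strip() for word in words if word and word.strip()]
    if not cleaned:
        raise ValueError("The word list must contain at least one entry.")
    return {
        "custom_info_types": [
            {
                "info_type": {"name": info_type_name},
                "dictionary": {"word_list": {"words": cleaned}},
            }
        ],
        "include_quote": True,
    }


if __name__ == "__main__":
    rooms = ["Yosemite", "Golden Gate", "Sherlock Holmes"]
    print(build_dictionary_inspect_config("ROOM_NAME", rooms))
```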

Large custom dictionary detectors

Use large custom dictionary detectors (also referred to as "stored custom dictionary detectors") when you have more than a few words or phrases to scan for, or if your list of words or phrases changes frequently. Large custom dictionary detectors can match on up to tens of millions of words or phrases.

Large custom dictionary detectors are created differently from both regular expression custom detectors and small custom dictionary detectors. Each large custom dictionary has two components:

A list of phrases that you create and define. The list is stored as either a text file within Cloud Storage or a column in a BigQuery table.
The generated dictionary files, which are built by Sensitive Data Protection based on your phrase list. The dictionary files are stored in Cloud Storage and consist of a copy of the source phrase data plus bloom filters, which aid in searching and matching. You can't edit these files directly.

Once you've created a word list and then used Sensitive Data Protection to generate a custom dictionary, you initiate or schedule a scan using a large custom dictionary detector in a similar way as other infoType detectors.

For more details about how large custom dictionary detectors work, as well as examples in action, see Creating a stored custom dictionary detector.
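
The two components described above map onto two pieces of configuration: one that tells Sensitive Data Protection where the phrase list lives and where to write the generated dictionary, and one that references the resulting stored infoType in a scan. The sketch below reflects one reading of the v2 API field names and should be checked against the API reference; both helper functions and all resource names are hypothetical.

```python
def build_stored_dictionary_config(display_name, source_url, output_path):
    """Configuration for generating a large custom dictionary.

    `source_url` points at the phrase list in Cloud Storage; `output_path`
    is the Cloud Storage location for the generated dictionary files.
    """
    return {
        "display_name": display_name,
        "large_custom_dictionary": {
            "output_path": {"path": output_path},
            "cloud_storage_file_set": {"url": source_url},
        },
    }


def build_inspect_config_with_stored_type(info_type_name, stored_info_type_name):
    """Inspection configuration that references a stored infoType by name."""
    return {
        "custom_info_types": [
            {
                "info_type": {"name": info_type_name},
                "stored_type": {"name": stored_info_type_name},
            }
        ],
        "include_quote": True,
    }


if __name__ == "__main__":
    # All bucket, project, and infoType names below are placeholders.
    print(build_stored_dictionary_config(
        "Customer IDs",
        "gs://my-bucket/phrases/customer-ids.txt",
        "gs://my-bucket/dictionaries/",
    ))
    print(build_inspect_config_with_stored_type(
        "CUSTOMER_ID",
        "projects/my-project-id/storedInfoTypes/customer-ids",
    ))
```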

Regular expressions

A regular expression (regex) custom infoType detector allows you to create your own infoType detectors that enable Sensitive Data Protection to detect matches based on a regex pattern. For example, suppose that you had medical record numbers in the form ###-#-#####. You could define a regex pattern such as the following:

[1-9]{3}-[1-9]{1}-[1-9]{5}

Sensitive Data Protection would then match items like this:

123-4-56789

You can also specify a likelihood to assign to each custom infoType match. That is, when Sensitive Data Protection matches the sequence you specify, it assigns the likelihood that you indicated. This is useful when your custom regex defines a sequence that is common enough to match unrelated text by chance: you would not want Sensitive Data Protection to label every such match as VERY_LIKELY. Doing so would erode confidence in scan results and potentially cause the wrong information to be matched or de-identified.

For more information about regular expression custom infoType detectors, and to see them in action, see Creating a custom regex detector.
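
As a quick local check of the medical record number pattern, and a sketch of how it might be attached to a custom infoType with an explicit likelihood, consider the following. Python's `re` module is used here only to try the pattern out; its syntax is not identical to the regular expression syntax the service accepts, so treat the local result as a sanity check. The infoType name C_MRN and both helper functions are hypothetical.

```python
import re

# The pattern from the example above: digits 1-9 only, in the form ###-#-#####.
MRN_PATTERN = r"[1-9]{3}-[1-9]{1}-[1-9]{5}"


def find_medical_record_numbers(text):
    """Return every substring of `text` that matches the MRN pattern."""
    return re.findall(MRN_PATTERN, text)


def build_mrn_custom_info_type(likelihood="POSSIBLE"):
    """Custom infoType whose matches are all assigned `likelihood`."""
    return {
        "info_type": {"name": "C_MRN"},
        "regex": {"pattern": MRN_PATTERN},
        "likelihood": likelihood,
    }


if __name__ == "__main__":
    print(find_medical_record_numbers("Patient MRN: 123-4-56789."))
    print(build_mrn_custom_info_type())
```

Because the character class is `[1-9]`, a value that contains a zero, such as 100-4-56789, does not match this pattern.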

Inspection rules

You use inspection rules to refine the results returned by existing infoType detectors, whether built-in or custom. Inspection rules are useful when the results that Sensitive Data Protection returns need to be adjusted in some way, either by adding to or excluding from the findings of the existing infoType detector.

The two types of inspection rules are:

Exclusion rules
Hotword rules

For more information about inspection rules, see Modifying infoType detectors to refine scan results.

Exclusion rules

Exclusion rules enable you to decrease the number of findings returned by adding rules to a built-in or custom infoType detector. Exclusion rules can help you reduce noise and other unwanted findings from an infoType detector.

For example, if you scan a database for email addresses, you can add an exclusion rule in the form of a custom regex that instructs Sensitive Data Protection to exclude any findings ending in "@example.com."

For more information about exclusion rules, see Modifying infoType detectors to refine scan results.
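
The email example above could be expressed as a regex exclusion rule along these lines. This is a sketch under assumptions: `build_email_config_excluding_domain` is a hypothetical helper, and enum values are written as strings as they appear in the REST JSON.

```python
import re


def build_email_config_excluding_domain(domain):
    """Inspect for EMAIL_ADDRESS but exclude findings at `domain`."""
    # Escape the domain so that "." is matched literally.
    pattern = ".+@" + re.escape(domain)
    return {
        "info_types": [{"name": "EMAIL_ADDRESS"}],
        "rule_set": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "rules": [
                    {
                        "exclusion_rule": {
                            "regex": {"pattern": pattern},
                            # Exclude only when the whole finding matches.
                            "matching_type": "MATCHING_TYPE_FULL_MATCH",
                        }
                    }
                ],
            }
        ],
        "include_quote": True,
    }


if __name__ == "__main__":
    print(build_email_config_excluding_domain("example.com"))
```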

Hotword rules

Hotword rules enable you to increase the quantity or accuracy of findings returned by adding rules to a built-in or custom infoType detector. Hotword rules can effectively help you loosen an existing infoType detector's rules.

For example, suppose you want to scan a medical database for patient names. You can use Sensitive Data Protection's built-in PERSON_NAME infoType detector, but that will cause Sensitive Data Protection to match on all names of people, not just names of patients. To fix this, you can include a hotword rule in the form of a regex custom infoType that looks for the word "patient" within a certain character proximity from the first character of potential matches. You can then assign findings matching this pattern a likelihood of "very likely," since they correspond to your special criteria.

For more information about hotword rules, see Modifying infoType detectors to refine scan results.
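
A hotword rule for the patient-name scenario might be configured as in the sketch below. The 50-character window is an arbitrary value chosen for illustration, `build_patient_name_config` is a hypothetical helper, and enum values are written as strings as they appear in the REST JSON.

```python
def build_patient_name_config(window_before=50):
    """Boost PERSON_NAME findings that follow the word "patient".

    `window_before` is the number of characters before a finding in which
    the hotword must appear.
    """
    return {
        "info_types": [{"name": "PERSON_NAME"}],
        "rule_set": [
            {
                "info_types": [{"name": "PERSON_NAME"}],
                "rules": [
                    {
                        "hotword_rule": {
                            "hotword_regex": {"pattern": "patient"},
                            "proximity": {"window_before": window_before},
                            "likelihood_adjustment": {
                                "fixed_likelihood": "VERY_LIKELY"
                            },
                        }
                    }
                ],
            }
        ],
        "include_quote": True,
    }


if __name__ == "__main__":
    print(build_patient_name_config())
```

If you want only the boosted findings, one option is to also set a minimum likelihood of VERY_LIKELY in the same configuration, so that names without the nearby hotword are filtered out.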

Examples

To get a better idea of how infoTypes match on findings, look at the following examples of matching on a series of digits to determine whether they constitute a US Social Security number or a US Individual Taxpayer Identification Number. Keep in mind that these examples are for built-in infoType detectors. When you create a custom infoType detector, you specify the criteria that determine the likelihood of a scan match.

Example 1

"SSN 222-22-2222"

Reports a high likelihood score of VERY_LIKELY for a US_SOCIAL_SECURITY_NUMBER because:

It is in the standard Social Security number format, which raises the certainty.
It has context nearby ("SSN") that boosts towards US_SOCIAL_SECURITY_NUMBER.

Example 2

"999-99-9999"

Reports a low likelihood score of VERY_UNLIKELY for a US_SOCIAL_SECURITY_NUMBER because:

It is in the standard format, which raises the certainty.
It starts with a 9, which is not allowed in Social Security numbers and thus lowers the certainty.
It lacks context, which lowers the certainty.

Example 3

"999-98-9999"

Reports a likelihood score of POSSIBLE for a US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER and VERY_UNLIKELY for US_SOCIAL_SECURITY_NUMBER because:

It has the standard format for both US_SOCIAL_SECURITY_NUMBER and US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER.
It starts with a 9 and has another digit check, which boosts certainty for US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER.
It lacks any context, which lowers the certainty for both.

What's next

The Sensitive Data Protection team releases new infoType detectors and groups periodically. To learn how to get the latest list of built-in infoTypes, see Listing built-in infoType detectors.
