De-identification

Cloud Data Loss Prevention uses information types—or infoTypes—to define what it scans for. An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, credit card number, and so on.

Every infoType defined in Cloud DLP has a corresponding detector. Cloud DLP uses infoType detectors in the configuration for its scans to determine what to inspect for and how to transform findings. InfoType names are also used when displaying or reporting scan results.

This topic describes infoTypes and infoType detectors in detail, and provides guidance for how to use infoType detectors when scanning content for sensitive data using Cloud DLP.

Specifying infoType detectors

When you set up Cloud DLP to scan your content, you include the infoType detectors to use in the scan configuration.

For example, the following JSON demonstrates a simple scan request to the DLP API. Notice that the PHONE_NUMBER detector is specified in inspectConfig, which instructs Cloud DLP to scan the given string for a phone number.

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/content:inspect?key={YOUR_API_KEY}

{
  "item":{
    "value":"My phone number is (415) 555-0890"
  },
  "inspectConfig":{
    "includeQuote":true,
    "minLikelihood":"POSSIBLE",
    "infoTypes":[
      {
        "name":"PHONE_NUMBER"
      }
    ]
  }
}

The preceding request returns the following:

{
  "result":{
    "findings":[
      {
        "quote":"(415) 555-0890",
        "infoType":{
          "name":"PHONE_NUMBER"
        },
        "likelihood":"VERY_LIKELY",
        "location":{
          "byteRange":{
            "start":"19",
            "end":"33"
          },
          "codepointRange":{
            "start":"19",
            "end":"33"
          }
        },
        "createTime":"2018-10-29T23:46:34.535Z"
      }
    ]
  }
}
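
As a rough illustration of how the location fields relate to the input, the following Python sketch slices the original string using the codepointRange offsets from the response above. The function and variable names are illustrative only, and the response dictionary is the JSON above trimmed to the fields the sketch reads.

```python
# The string that was sent as "item.value" in the request.
text = "My phone number is (415) 555-0890"

# The response shown above, trimmed to the fields this sketch uses.
response = {
    "result": {
        "findings": [
            {
                "quote": "(415) 555-0890",
                "infoType": {"name": "PHONE_NUMBER"},
                "likelihood": "VERY_LIKELY",
                "location": {"codepointRange": {"start": "19", "end": "33"}},
            }
        ]
    }
}


def finding_spans(text, response):
    """Return (infoType name, matched substring) pairs for each finding."""
    spans = []
    for finding in response.get("result", {}).get("findings", []):
        cp_range = finding["location"]["codepointRange"]
        # Offsets are strings in the JSON above, so convert them to int.
        # Defaulting a missing "start" to 0 is a defensive assumption.
        start = int(cp_range.get("start", 0))
        end = int(cp_range["end"])
        spans.append((finding["infoType"]["name"], text[start:end]))
    return spans


spans = finding_spans(text, response)
print(spans)
```

The slice is taken from the code point offsets rather than the byte offsets, because Python strings are indexed by code point. For this ASCII-only input the two ranges are identical.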

Always specify infoTypes in your scan configuration. If you don't specify any infoTypes, Cloud DLP uses a default infoTypes list. Depending on the amount of content to scan, scanning for default infoTypes can be prohibitively time-consuming or expensive.

For more information on how to use infoType detectors to scan your content, see one of the how-to topics about inspecting, redacting, or de-identifying.

Kinds of infoType detectors

Information type (or "infoType") detectors are the mechanisms that Cloud DLP uses to find sensitive data.

Cloud DLP includes several kinds of infoType detectors, all of which are summarized here:

  • Built-in infoType detectors are built into Cloud DLP. They include detectors for country- or region-specific sensitive data types as well as globally applicable data types.
  • Custom infoType detectors are detectors that you create yourself. There are three kinds of custom infoType detectors:
    • Regular custom dictionary detectors are simple word lists that Cloud DLP matches on. Use regular custom dictionary detectors when you have a list of up to several tens of thousands of words or phrases. Regular custom dictionary detectors are preferred if you don't anticipate your word list changing significantly.
    • Stored custom dictionary detectors are generated by Cloud DLP using large lists of words or phrases stored in either Cloud Storage or BigQuery. Use stored custom dictionary detectors when you have a large list of words or phrases—up to tens of millions.
    • Regular expressions (regex) detectors enable Cloud DLP to detect matches based on a regular expression pattern.

In addition, Cloud DLP includes the concept of inspection rules, which enable you to fine-tune scan results using the following:

  • Exclusion rules enable you to decrease the number of findings returned by adding rules to a built-in or custom infoType detector.
  • Hotword rules enable you to increase the quantity or change the likelihood value of findings returned by adding rules to a built-in or custom infoType detector.

Built-in infoType detectors

Built-in infoType detectors are built into Cloud DLP, and include detectors for country- or region-specific sensitive data types such as the French Numéro d'Inscription au Répertoire (NIR) (FRANCE_NIR), UK driver's license number (UK_DRIVERS_LICENSE_NUMBER), and US Social Security number (US_SOCIAL_SECURITY_NUMBER). They also include globally applicable data types such as person names (PERSON_NAME), telephone numbers (PHONE_NUMBER), email addresses (EMAIL_ADDRESS), and credit card numbers (CREDIT_CARD_NUMBER).

To detect content that corresponds to infoTypes, Cloud DLP uses various techniques, including pattern matching, checksums, machine learning, and context analysis.

The list of built-in infoType detectors is always being updated. For a complete list of currently supported built-in infoType detectors, see InfoType detector reference.

You can also view a complete list of all built-in infoType detectors by calling Cloud DLP's infoTypes.list method.

Built-in infoType detectors are not a 100% accurate detection method. For example, they can't guarantee compliance with regulatory requirements. You must decide what data is sensitive and how to best protect it. Google recommends that you test your settings to make sure your configuration meets your requirements.

Custom infoType detectors

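There are three kinds of custom infoType detectors:

  • Regular custom dictionary detectors
  • Stored custom dictionary detectors
  • Regular expression (regex) detectors

In addition, Cloud DLP includes inspection rules, which enable you to fine-tune scan results by adding the following to existing detectors:

  • Exclusion rules
  • Hotword rules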

Regular custom dictionary detectors

Use regular custom dictionary detectors to match a short (up to several tens of thousands) list of words or phrases. A regular custom dictionary can act as its own unique detector.

Custom dictionary detectors are useful when you want to scan for a list of words or phrases that are not easily matched by a regular expression or a built-in detector. For example, suppose you want to scan for conference rooms that are commonly referred to by their assigned room names rather than their room numbers, such as state or region names, landmarks, fictional characters, and so on. You can make a regular custom dictionary detector that contains a list of these room names. Cloud DLP can scan your content for each of the room names and return a match when it encounters one of them in context. Learn more about how Cloud DLP matches dictionary words and phrases in the "Dictionary matching specifics" section of Creating a Regular Custom Dictionary Detector.

For more details about how regular custom dictionary detectors work, as well as examples in action, see Creating a Regular Custom Dictionary Detector.
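
As a minimal sketch of the conference room scenario, the following Python builds a content:inspect request body containing one dictionary-based custom infoType. The infoType name CONFERENCE_ROOM, the room names, and the helper function are all made up for illustration. The field names (customInfoTypes, dictionary, wordList, words) follow the DLP API's JSON representation, but check them against the API reference before relying on them.

```python
import json

# Hypothetical room names; replace with your own list.
ROOM_NAMES = ["Yosemite", "Golden Gate", "Sherlock Holmes"]


def build_room_name_request(text, words):
    """Build a request body with one dictionary-based custom infoType."""
    return {
        "item": {"value": text},
        "inspectConfig": {
            "includeQuote": True,
            "customInfoTypes": [
                {
                    # CONFERENCE_ROOM is an invented name for this example.
                    "infoType": {"name": "CONFERENCE_ROOM"},
                    # A regular custom dictionary is an inline word list.
                    "dictionary": {"wordList": {"words": list(words)}},
                }
            ],
        },
    }


body = build_room_name_request("Let's meet in Golden Gate at 3pm.", ROOM_NAMES)
print(json.dumps(body, indent=2))
```

The sketch only assembles the request body. Sending it requires an authenticated POST to the content:inspect endpoint, as in the earlier request example.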

Stored custom dictionary detectors

Use stored custom dictionary detectors when your list of words or phrases is too large for a regular custom dictionary, or if the list changes frequently. Stored custom dictionary detectors can match on up to tens of millions of words or phrases.

Because they are designed for very large lists, stored custom dictionary detectors are created differently from both regular expression custom detectors and regular custom dictionary detectors. Each stored custom dictionary has two components:

  • A list of phrases that you create and define. The list is stored as either a text file within Cloud Storage or a column in a BigQuery table.
  • The generated dictionary files, which are built by Cloud DLP based on your phrase list. The dictionary files are stored in Cloud Storage, and consist of a copy of the source phrase data plus bloom filters, which aid in searching and matching. You can't edit these files directly.

After you've created a word list and used Cloud DLP to generate a custom dictionary, you initiate or schedule a scan with the stored custom dictionary detector in much the same way as with other infoType detectors.

For more details about how stored custom dictionary detectors work, as well as examples in action, see Creating a Stored Custom Dictionary Detector.
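
Once a stored custom dictionary exists, a scan configuration refers to it by resource name rather than embedding the words. The Python sketch below shows one way that reference might be assembled. The project ID, stored infoType ID, infoType name, and helper function are placeholders, and the storedType field and resource name format should be confirmed against the API reference.

```python
import json


def stored_dictionary_info_type(project_id, stored_info_type_id, info_type_name):
    """Build a customInfoTypes entry that points at a stored infoType."""
    resource_name = f"projects/{project_id}/storedInfoTypes/{stored_info_type_id}"
    return {
        "infoType": {"name": info_type_name},
        # The words live in the generated dictionary files, not in the request.
        "storedType": {"name": resource_name},
    }


# All three arguments are hypothetical placeholder values.
entry = stored_dictionary_info_type("my-project", "customer-names", "CUSTOMER_NAME")
inspect_config = {"customInfoTypes": [entry]}
print(json.dumps(inspect_config, indent=2))
```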

Regular expressions

A regular expression (regex) custom infoType detector allows you to create your own infoType detectors that enable Cloud DLP to detect matches based on a regex pattern. For example, suppose that you had medical record numbers in the form ###-#-#####. You could define a regex pattern such as the following:

[1-9]{3}-[1-9]{1}-[1-9]{5}

Cloud DLP would then match items like this:

123-4-56789

You can also specify a likelihood to assign to each custom infoType match. That is, when Cloud DLP matches the sequence you specify, it assigns the likelihood that you have indicated. This is useful when your custom regex defines a sequence common enough that it could easily match some unrelated sequence: in that case, you would not want Cloud DLP to label every match as VERY_LIKELY. Doing so would erode confidence in scan results and potentially cause the wrong information to be de-identified.
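
To try the pattern locally, here is a small Python sketch that applies the regex above with the standard re module and then wraps it in a custom infoType entry with an explicit likelihood. The infoType name MEDICAL_RECORD_NUMBER and the choice of POSSIBLE are assumptions for illustration. Python's re engine is not the engine Cloud DLP uses, so treat the local check as an approximation, although a pattern this simple should behave the same way.

```python
import re

# The pattern from the text above.
MRN_PATTERN = r"[1-9]{3}-[1-9]{1}-[1-9]{5}"


def find_mrns(text):
    """Return every substring of text that matches the pattern."""
    return re.findall(MRN_PATTERN, text)


# A custom infoType entry for inspectConfig.customInfoTypes.
custom_info_type = {
    # MEDICAL_RECORD_NUMBER is an invented name for this example.
    "infoType": {"name": "MEDICAL_RECORD_NUMBER"},
    "regex": {"pattern": MRN_PATTERN},
    # A deliberately modest likelihood, because short digit patterns
    # can match unrelated numbers.
    "likelihood": "POSSIBLE",
}

print(find_mrns("Record 123-4-56789 is on file."))
print(find_mrns("Record 100-2-30405 is on file."))
```

Note that the character class [1-9] excludes the digit 0, so the second call finds nothing. If your record numbers can contain zeros, [0-9] would be the class to use.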

For more information about regular expression custom infoType detectors, and to see them in action, see Creating a Custom Regex Detector.

Inspection rules

You use inspection rules to refine the results returned by existing infoType detectors, whether built-in or custom. Inspection rules are useful when the results that Cloud DLP returns need to be adjusted, either by adding to or excluding from the findings of an existing infoType detector.

The two types of inspection rules are:

  • Exclusion rules
  • Hotword rules

For more information about inspection rules, see Modifying InfoType Detectors to Refine Scan Results.

Exclusion rules

Exclusion rules enable you to decrease the number of findings returned by adding rules to a built-in or custom infoType detector. Exclusion rules can help you reduce noise and keep other unwanted findings from being returned by an infoType detector.

For example, if you scan a database for email addresses, you can add an exclusion rule in the form of a custom regex that instructs Cloud DLP to exclude any findings ending in "@example.com".
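
The email example might be expressed roughly as follows. This Python sketch builds an inspectConfig with a rule set that attaches a regex exclusion rule to EMAIL_ADDRESS, and includes a local helper that approximates a full-match exclusion with the re module. The regex, the helper, and the choice of MATCHING_TYPE_FULL_MATCH are assumptions made for illustration; confirm field names and matching types against the API reference.

```python
import re

# Matches any finding that ends in "@example.com" (the dot is escaped).
EXCLUDE_PATTERN = r".+@example\.com"

inspect_config = {
    "infoTypes": [{"name": "EMAIL_ADDRESS"}],
    "ruleSet": [
        {
            # The rules below apply only to EMAIL_ADDRESS findings.
            "infoTypes": [{"name": "EMAIL_ADDRESS"}],
            "rules": [
                {
                    "exclusionRule": {
                        "regex": {"pattern": EXCLUDE_PATTERN},
                        "matchingType": "MATCHING_TYPE_FULL_MATCH",
                    }
                }
            ],
        }
    ],
}


def is_excluded(finding_text):
    """Local approximation: the whole finding must match the pattern."""
    return re.fullmatch(EXCLUDE_PATTERN, finding_text) is not None


print(is_excluded("alice@example.com"))
print(is_excluded("bob@example.org"))
```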

For more information about exclusion rules, see Modifying InfoType Detectors to Refine Scan Results.

Hotword rules

Hotword rules enable you to increase the quantity or accuracy of findings returned by adding rules to a built-in or custom infoType detector. Hotword rules can effectively help you loosen an existing infoType detector's rules.

For example, suppose you want to scan a medical database for patient names. You can use Cloud DLP's built-in PERSON_NAME infoType detector, but that causes Cloud DLP to match on all names of people, not just names of patients. To fix this, you can include a hotword rule with a regex that looks for the word "patient" within a certain character proximity of the first character of potential matches. You can then assign findings matching this pattern a likelihood of VERY_LIKELY, since they correspond to your special criteria.
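
A hedged sketch of that configuration in Python follows. It builds an inspectConfig whose rule set attaches a hotword rule to PERSON_NAME. The helper function and the 50-character window are arbitrary choices for illustration, and the field names (hotwordRegex, proximity, windowBefore, likelihoodAdjustment, fixedLikelihood) should be verified against the API reference.

```python
import json


def patient_name_config(window_before=50):
    """Build an inspectConfig that boosts PERSON_NAME findings near 'patient'."""
    hotword_rule = {
        # The hotword to look for near a potential match.
        "hotwordRegex": {"pattern": "patient"},
        # How many characters before the finding to search (assumed value).
        "proximity": {"windowBefore": window_before},
        # Findings with the hotword nearby get this fixed likelihood.
        "likelihoodAdjustment": {"fixedLikelihood": "VERY_LIKELY"},
    }
    return {
        "infoTypes": [{"name": "PERSON_NAME"}],
        "ruleSet": [
            {
                "infoTypes": [{"name": "PERSON_NAME"}],
                "rules": [{"hotwordRule": hotword_rule}],
            }
        ],
    }


config = patient_name_config()
print(json.dumps(config, indent=2))
```

Combining this rule with a minLikelihood setting in the same inspectConfig is one way to keep only the names that appear near the hotword.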

For more information about hotword rules, see Modifying InfoType Detectors to Refine Scan Results.