De-identifying Sensitive Data in Text Content

The Data Loss Prevention API can de-identify sensitive data in text content, including text stored in container structures such as tables. De-identification is the process of removing identifying information from data. The API detects sensitive data such as personally identifiable information (PII), and then uses a de-identification transformation to mask, delete, or otherwise obscure the data. For example, de-identification techniques can include any of the following:

  • "Masking" sensitive data by partially or fully replacing characters with a symbol, such as an asterisk (*) or hash (#).
  • Replacing each instance of sensitive data with a "token," or surrogate, string.
  • Encrypting and replacing sensitive data using a randomly generated or pre-determined key.

You can send content to the API as JSON over HTTP, through the CLI, or from several programming languages by using the DLP client libraries. To set up the CLI, refer to the quickstart. For more information about submitting information in JSON format, see the JSON quickstart.

To de-identify sensitive data, use the DLP API’s content.deidentify method. It takes the following as arguments:

  • One or more text strings or table structures (ContentItem objects within an items[] array argument) for the API to inspect. All items are treated as content type text/*.
  • A deidentifyConfig argument, which specifies de-identification configuration information (DeidentifyConfig). This argument is covered in more detail in the following section.
  • Optionally, an inspectConfig argument, which specifies detection configuration information (InspectConfig), such as which types of data (or “infoTypes,” for example, phone numbers) to look for and whether to return only findings at or above a certain likelihood threshold.

The API returns the same items you gave it, in the same format, but any text identified as containing sensitive information according to your criteria has been de-identified.
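
The following Node.js sketch shows one way these three arguments might fit together in a single request object. buildDeidentifyRequest is a hypothetical helper, not part of the client library, and the minLikelihood field with its 'LIKELY' value is an assumption about InspectConfig. The items and deidentifyConfig shapes follow the Node.js samples later on this page. The sketch only builds and prints the object; it does not call the API.

```javascript
// Hypothetical helper: assembles a content.deidentify request body as a plain object.
function buildDeidentifyRequest(text, infoTypeNames, primitiveTransformation, minLikelihood) {
  const infoTypes = infoTypeNames.map(name => ({name}));
  return {
    // What to inspect: a single text item.
    items: [{type: 'text/plain', value: text}],
    // How to transform whatever is found.
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [{infoTypes, primitiveTransformation}],
      },
    },
    // (Optional) What to look for and how likely a finding must be (assumed field names).
    inspectConfig: {infoTypes, minLikelihood},
  };
}

const request = buildDeidentifyRequest(
  'John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.',
  ['PHONE_NUMBER'],
  {replaceWithInfoTypeConfig: {}},
  'LIKELY'
);
console.log(JSON.stringify(request, null, 2));
```

Keeping the infoType list in one place means the detection settings and the transformation target the same infoTypes.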

Transformations

To use the de-identification feature effectively, you must specify one or more transformations. There are two categories of transformations:

  • InfoTypeTransformations: Transformations that are only applied to values within submitted text that are identified as a specific infoType.
  • RecordTransformations: Transformations that are only applied to values within submitted tabular text data that are identified as a specific infoType, or on an entire column of tabular data.

InfoType Transformations

You can specify one or more infoType transformations per request. Within each InfoTypeTransformation object, you specify the following:

  • One or more infoTypes to which the transformation should be applied (the infoTypes[] array object). Specifying infoTypes is optional; if you omit them, the API matches all available infoTypes.
  • A primitive transformation (the PrimitiveTransformation object).

Primitive Transformations

You must specify at least one primitive transformation to apply to input text, regardless of whether you're applying it only to certain infoTypes or to the entire text string. You have several transformation options, which are summarized here:

Transformation | Object | Description
Replacement | ReplaceValueConfig | Replaces each input value with a given value.
Redaction | RedactConfig | Redacts a value by removing it.
Mask with character | CharacterMaskConfig | Masks a string either fully or partially by replacing a given number of characters with a specified fixed character.
Pseudonymization by replacing input value with cryptographic hash | CryptoHashConfig | Replaces input values with a 32-byte hexadecimal string generated using a given data encryption key.
Pseudonymization by replacing with cryptographic format preserving token | CryptoReplaceFfxFpeConfig | Replaces an input value with a “token,” or surrogate value, of the same length using format-preserving encryption (FPE) with the FFX mode of operation.
Bucket values based on fixed size ranges | FixedSizeBucketingConfig | Masks input values by replacing them with “buckets,” or ranges within which the input value falls.
Bucket values based on custom size ranges | BucketingConfig | Buckets input values based on user-configurable ranges and replacement values.
Replace with infoType | ReplaceWithInfoTypeConfig | Replaces an input value with the name of its infoType.
Extract time data | TimePartConfig | Extracts or preserves a portion of Date, Timestamp, and TimeOfDay values.

replaceConfig

Setting replaceConfig to a ReplaceValueConfig object replaces matched input values with a value you specify.

For example, suppose you’ve set replaceConfig to "<phone number>" for all PHONE_NUMBER infoTypes, and the following string is sent to the DLP API:

John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.

The returned string will be the following:

John Smith, 123 Main St, Seattle, WA 98122, <phone number>.
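
To make the effect concrete, here is a small local sketch that reproduces the example without calling the API. The regular expression is only a stand-in for the API's PHONE_NUMBER detection, and replaceFindings is a hypothetical helper name; neither reflects how the service itself is implemented.

```javascript
// Stand-in detector (assumption): matches phone numbers written as NNN-NNN-NNNN.
const PHONE_PATTERN = /\b\d{3}-\d{3}-\d{4}\b/g;

// Replaces every match with the configured replacement value.
// A replacer function is used so that "$" in newValue is never treated specially.
function replaceFindings(text, pattern, newValue) {
  return text.replace(pattern, () => newValue);
}

const input = 'John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.';
const output = replaceFindings(input, PHONE_PATTERN, '<phone number>');
console.log(output);
```

In the same sketch, passing an empty string as newValue removes each match, and passing 'PHONE_NUMBER' substitutes the infoType name, which is roughly what the redactConfig and replaceWithInfoTypeConfig transformations do.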

redactConfig

Specifying redactConfig redacts a given value by removing it completely. The redactConfig message has no arguments; specifying it enables its transformation.

For example, suppose you’ve specified redactConfig for all PHONE_NUMBER infoTypes, and the following string is sent to the DLP API:

John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.

The returned string will be the following:

John Smith, 123 Main St, Seattle, WA 98122, .

characterMaskConfig

Setting characterMaskConfig to a CharacterMaskConfig object fully or partially masks a string by replacing a given number of characters with a fixed character. Masking can start from the beginning or the end of the string. This transformation also works with number types such as long integers.

The CharacterMaskConfig object has several of its own arguments:

  • maskingCharacter: The character to use to mask each character of a sensitive value. For example, you could specify an asterisk (*) or hash (#) to mask a series of numbers such as those in a credit card number.
  • numberToMask: The number of characters to mask. If you don’t set this value, all matching characters will be masked.
  • reverseOrder: Whether to mask characters in reverse order. Setting reverseOrder to true causes characters in matched values to be masked from the end toward the beginning of the value. Setting it to false causes masking to begin at the start of the value.
  • charactersToIgnore[]: One or more characters to skip when masking values. For example, specify a hyphen here to leave the hyphens in place when masking a telephone number. You can also specify a group of common characters (CharacterGroup) to ignore when masking.

For example, suppose you've set characterMaskConfig to the following values for all PHONE_NUMBER infoTypes:

{
  "maskingCharacter": "#",
  "numberToMask": 5,
  "reverseOrder": true,
  "charactersToIgnore": [
    {
      "charactersToSkip": "-"
    }
  ]
}

Then you send the following string to the DLP API:

John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.

The returned string will be the following:

John Smith, 123 Main St, Seattle, WA 98122, 206-55#-####.
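
The way the four arguments interact can be traced with a short local sketch. maskValue is a hypothetical function written for this illustration; it applies the rules described above to a single matched value and is not the API's implementation.

```javascript
// Local sketch of the masking rules; operates on one matched value.
function maskValue(value, {maskingCharacter = '*', numberToMask = 0, reverseOrder = false, charactersToSkip = ''} = {}) {
  const chars = Array.from(value);
  // Candidate positions: every character that is not in the skip set.
  const positions = [];
  chars.forEach((ch, i) => {
    if (!charactersToSkip.includes(ch)) positions.push(i);
  });
  // With reverseOrder, masking starts at the end of the value.
  if (reverseOrder) positions.reverse();
  // A numberToMask of 0 (or omitted) means "mask every candidate character".
  const limit = numberToMask > 0 ? numberToMask : positions.length;
  for (const i of positions.slice(0, limit)) {
    chars[i] = maskingCharacter;
  }
  return chars.join('');
}

const masked = maskValue('206-555-0123', {
  maskingCharacter: '#',
  numberToMask: 5,
  reverseOrder: true,
  charactersToSkip: '-',
});
console.log(masked); // 206-55#-####
```

Because hyphens are skipped, they do not count toward numberToMask, which is why five digits are masked rather than four digits and a hyphen.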

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The string to deidentify
const string = 'My SSN is 372819127';

// (Optional) The maximum number of sensitive characters to mask in a match
// If omitted from the request or set to 0, the API will mask any matching characters
const numberToMask = 5;

// (Optional) The character to mask matching sensitive data with
const maskingCharacter = 'x';

// Construct deidentification request
const items = [{type: 'text/plain', value: string}];
const request = {
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [
        {
          primitiveTransformation: {
            characterMaskConfig: {
              maskingCharacter: maskingCharacter,
              numberToMask: numberToMask,
            },
          },
        },
      ],
    },
  },
  items: items,
};

// Run deidentification request
dlp
  .deidentifyContent(request)
  .then(response => {
    const deidentifiedItems = response[0].items;
    console.log(deidentifiedItems[0].value);
  })
  .catch(err => {
    console.log(`Error in deidentifyWithMask: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Deidentify a string by masking sensitive information with a character using the DLP API.
 * @param string The string to deidentify.
 * @param maskingCharacter (Optional) The character to mask sensitive data with.
 * @param numberToMask (Optional) The number of characters' worth of sensitive data to mask.
 *                     Omitting this value or setting it to 0 masks all sensitive chars.
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // Example values; replace with your own input
  String string = "My SSN is 372819127";
  int numberToMask = 5;
  Character maskingCharacter = 'x';

  ContentItem contentItem =
      ContentItem.newBuilder()
          .setType("text/plain")
          .setValue(string)
          .build();

  CharacterMaskConfig characterMaskConfig =
      CharacterMaskConfig.newBuilder()
          .setMaskingCharacter(maskingCharacter.toString())
          .setNumberToMask(numberToMask)
          .build();

  // Create the deidentification transformation configuration
  PrimitiveTransformation primitiveTransformation =
      PrimitiveTransformation.newBuilder()
          .setCharacterMaskConfig(characterMaskConfig)
          .build();

  InfoTypeTransformation infoTypeTransformationObject =
      InfoTypeTransformation.newBuilder()
          .setPrimitiveTransformation(primitiveTransformation)
          .build();

  InfoTypeTransformations infoTypeTransformationArray =
      InfoTypeTransformations.newBuilder()
          .addTransformations(infoTypeTransformationObject)
          .build();

  // Create the deidentification request object
  DeidentifyConfig deidentifyConfig =
      DeidentifyConfig.newBuilder()
          .setInfoTypeTransformations(infoTypeTransformationArray)
          .build();

  DeidentifyContentRequest request =
      DeidentifyContentRequest.newBuilder()
          .setDeidentifyConfig(deidentifyConfig)
          .addItems(contentItem)
          .build();

  // Execute the deidentification request
  DeidentifyContentResponse response = dlpServiceClient.deidentifyContent(request);

  // Print the character-masked input value
  // e.g. "My SSN is 123456789" --> "My SSN is *********"
  for (ContentItem item : response.getItemsList()) {
    System.out.println(item.getValue());
  }
} catch (Exception e) {
  System.out.println("Error in deidentifyWithMask: " + e.getMessage());
}

cryptoHashConfig

Setting cryptoHashConfig to a CryptoHashConfig object pseudonymizes an input value by replacing it with a keyed "digest," or hash value. The digest is computed by taking the SHA-256 hash of the input value, using the cryptographic key supplied in the CryptoKey object. The digest is encoded as a 32-byte hexadecimal string.

For example, suppose you’ve specified cryptoHashConfig for all PHONE_NUMBER infoTypes, and the CryptoKey object consists of a TransientCryptoKey, which is a randomly generated key. Then, the following string is sent to the DLP API:

John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.

The returned string will look like the following:

John Smith, 123 Main St, Seattle, WA 98122, 41D1567F7F99F1DC2A5FAB886DEE5BEE.

Of course, the hex string will be cryptographically generated and different from the one shown here.
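
The property that matters for pseudonymization is that the same input and key always give the same digest, while a different key gives an unrelated one. The sketch below illustrates this locally with Node's built-in crypto module. It assumes HMAC-SHA-256 as a generic keyed hash, which may differ from the construction the API uses, and keyedDigest is a hypothetical helper name.

```javascript
const crypto = require('crypto');

// Assumption: HMAC-SHA-256 stands in for the API's keyed hash.
// Returns the digest as 64 uppercase hexadecimal characters.
function keyedDigest(value, key) {
  return crypto.createHmac('sha256', key).update(value, 'utf8').digest('hex').toUpperCase();
}

// A transient key: random bytes that exist only for this run.
const transientKey = crypto.randomBytes(32);

const first = keyedDigest('206-555-0123', transientKey);
const second = keyedDigest('206-555-0123', transientKey);
const other = keyedDigest('206-555-0123', crypto.randomBytes(32));

console.log(first);
console.log(first === second); // true: same key and same input
console.log(first === other);  // false, barring a vanishingly unlikely collision
```

Because a transient key is discarded after the run, digests produced with it cannot be reproduced or correlated with digests from a later run.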

cryptoReplaceFfxFpeConfig

Setting cryptoReplaceFfxFpeConfig to a CryptoReplaceFfxFpeConfig object pseudonymizes an input value by replacing it with a token. This token is:

  • The encrypted input value.
  • The same length as the input value.
  • Computed using format-preserving encryption (FPE) in FFX mode keyed on the cryptographic key specified by cryptoKey.
  • Composed of the characters specified by alphabet.

The input value:

  • Must be at least two characters long (or the empty string).
  • Must be composed of the characters specified by alphabet.

When transforming structured data (tabular data with records and fields), a context may be specified. When specified, context defines a field whose value in a given record will be taken as the "tweak." For example, in the table below, "Patient ID" may be chosen to be the context, in which case the tweak is taken as "4672" for the first record, "3246" for the second record, and so on.

To understand the purpose of specifying a context, suppose first that the "Name" field in the table below is transformed without a context. In this case, identical names are replaced with the same token, so two matching tokens imply the same name. This may be unwanted because it can reveal sensitive relationships between records. Specifying a context breaks these relationships: two identical names produce the same token only when their records share the same tweak.

For example, consider the table below.

Bill Number Patient ID Name ...
223 4672 John  
224 3246 Debra  
225 3529 Nate  
226 4098 Debra  
...      

Applying this transformation to "Name" without specifying a context results in the following transformed table (the exact token values depend on the specified cryptoKey):

Bill Number Patient ID Name ...
223 4672 gCUv  
224 3246 Eusyv  
225 3529 dsla  
226 4098 Eusyv  
...      

Note that in the table above the tokens for records with name "Debra" are the same. To break this relationship, we can specify "Patient ID" as the context and run the transformation over the original table. This yields the following transformed table (the exact token values depend on the specified cryptoKey):

Bill Number Patient ID Name ...
223 4672 Agca  
224 3246 vSHig  
225 3529 kqHX  
226 4098 CUgvh  
...      

Notice now how "Debra" is replaced with different tokens since "Patient ID" was different between the two records.
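
The effect of the tweak can be reproduced locally with a toy stand-in. The sketch below is not FFX and is not reversible: toyToken is a hypothetical function that derives a same-length token from an HMAC over the tweak and the value, purely to show how a context changes the result. The alphabet, key, and record shape are assumptions made for the illustration.

```javascript
const crypto = require('crypto');

const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// NOT FFX and not reversible: a keyed, length-preserving stand-in that only
// demonstrates how mixing a tweak into the computation changes the token.
function toyToken(value, key, tweak = '') {
  if (value.length > 32) throw new Error('toy sketch supports at most 32 characters');
  const mac = crypto.createHmac('sha256', key).update(`${tweak}\u0000${value}`, 'utf8').digest();
  let token = '';
  for (let i = 0; i < value.length; i++) {
    token += ALPHABET[mac[i] % ALPHABET.length];
  }
  return token;
}

const key = crypto.randomBytes(32);
const records = [
  {patientId: '4672', name: 'John'},
  {patientId: '3246', name: 'Debra'},
  {patientId: '3529', name: 'Nate'},
  {patientId: '4098', name: 'Debra'},
];

// No context: the token depends only on the key and the name.
const withoutContext = records.map(r => toyToken(r.name, key));
// "Patient ID" as context: the tweak differs per record.
const withContext = records.map(r => toyToken(r.name, key, r.patientId));

console.log(withoutContext); // both "Debra" rows share one token
console.log(withContext);    // the "Debra" rows differ, barring an unlikely collision
```

Two records that share both the name and the Patient ID would still receive the same token, which matches the rule that relationships hold only for tokens generated with an identical tweak.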

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The string to deidentify
// const string = 'My SSN is 372819127';

// The set of characters to replace sensitive ones with
// For more information, see https://cloud.google.com/dlp/docs/reference/rest/v2beta1/content/deidentify#FfxCommonNativeAlphabet
// const alphabet = 'ALPHA_NUMERIC';

// The name of the Cloud KMS key used to encrypt ('wrap') the AES-256 key
// const keyName = 'projects/YOUR_GCLOUD_PROJECT/locations/YOUR_LOCATION/keyRings/YOUR_KEYRING_NAME/cryptoKeys/YOUR_KEY_NAME';

// The encrypted ('wrapped') AES-256 key to use
// This key should be encrypted using the Cloud KMS key specified above
// const wrappedKey = 'YOUR_ENCRYPTED_AES_256_KEY'

// Construct deidentification request
const items = [{type: 'text/plain', value: string}];
const request = {
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [
        {
          primitiveTransformation: {
            cryptoReplaceFfxFpeConfig: {
              cryptoKey: {
                kmsWrapped: {
                  wrappedKey: wrappedKey,
                  cryptoKeyName: keyName,
                },
              },
              commonAlphabet: alphabet,
            },
          },
        },
      ],
    },
  },
  items: items,
};

// Run deidentification request
dlp
  .deidentifyContent(request)
  .then(response => {
    const deidentifiedItems = response[0].items;
    console.log(deidentifiedItems[0].value);
  })
  .catch(err => {
    console.log(`Error in deidentifyWithFpe: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Deidentify a string by encrypting sensitive information while preserving format.
 * @param string The string to deidentify.
 * @param alphabet The set of characters to use when encrypting the input. For more information,
 *                 see cloud.google.com/dlp/docs/reference/rest/v2beta1/content/deidentify
 * @param keyName The name of the Cloud KMS key to use when decrypting the wrapped key.
 * @param wrappedKey The encrypted (or "wrapped") AES-256 encryption key.
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // string = "My SSN is 372819127";
  // alphabet = FfxCommonNativeAlphabet.ALPHA_NUMERIC;
  // keyName = "projects/GCP_PROJECT/locations/REGION/keyRings/KEYRING_ID/cryptoKeys/KEY_NAME";
  // wrappedKey = "YOUR_ENCRYPTED_AES_256_KEY"

  ContentItem contentItem =
      ContentItem.newBuilder()
          .setType("text/plain")
          .setValue(string)
          .build();

  // Create the format-preserving encryption (FPE) configuration
  KmsWrappedCryptoKey kmsWrappedCryptoKey =
      KmsWrappedCryptoKey.newBuilder()
          .setWrappedKey(ByteString.copyFrom(BaseEncoding.base64().decode(wrappedKey)))
          .setCryptoKeyName(keyName)
          .build();

  CryptoKey cryptoKey =
      CryptoKey.newBuilder()
          .setKmsWrapped(kmsWrappedCryptoKey)
          .build();

  CryptoReplaceFfxFpeConfig cryptoReplaceFfxFpeConfig =
      CryptoReplaceFfxFpeConfig.newBuilder()
          .setCryptoKey(cryptoKey)
          .setCommonAlphabet(alphabet)
          .build();

  // Create the deidentification transformation configuration
  PrimitiveTransformation primitiveTransformation =
      PrimitiveTransformation.newBuilder()
          .setCryptoReplaceFfxFpeConfig(cryptoReplaceFfxFpeConfig)
          .build();

  InfoTypeTransformation infoTypeTransformationObject =
      InfoTypeTransformation.newBuilder()
          .setPrimitiveTransformation(primitiveTransformation)
          .build();

  InfoTypeTransformations infoTypeTransformationArray =
      InfoTypeTransformations.newBuilder()
          .addTransformations(infoTypeTransformationObject)
          .build();

  // Create the deidentification request object
  DeidentifyConfig deidentifyConfig =
      DeidentifyConfig.newBuilder()
          .setInfoTypeTransformations(infoTypeTransformationArray)
          .build();

  DeidentifyContentRequest request =
      DeidentifyContentRequest.newBuilder()
          .setDeidentifyConfig(deidentifyConfig)
          .addItems(contentItem)
          .build();

  // Execute the deidentification request
  DeidentifyContentResponse response = dlpServiceClient.deidentifyContent(request);

  // Print the deidentified input value
  // e.g. "My SSN is 123456789" --> "My SSN is 726129862" (the token keeps the input's length)
  for (ContentItem item : response.getItemsList()) {
    System.out.println(item.getValue());
  }
} catch (Exception e) {
  System.out.println("Error in deidentifyWithFpe: " + e.getMessage());
}

fixedSizeBucketingConfig

The bucketing transformations—this one and bucketingConfig—serve to mask numerical data by “bucketing” it into ranges. The resulting number range is a hyphenated string consisting of a lower bound, a hyphen, and an upper bound.

Setting fixedSizeBucketingConfig to the FixedSizeBucketingConfig object buckets input values based on fixed size ranges. The FixedSizeBucketingConfig object consists of the following:

  • lowerBound: The lower bound value of all of the buckets. Values less than this one are grouped together in a single bucket.
  • upperBound: The upper bound value of all of the buckets. Values greater than this one are grouped together in a single bucket.
  • bucketSize: The size of each bucket other than the minimum and maximum buckets.

For example, if lowerBound is set to 10, upperBound is set to 89, and bucketSize is set to 10, then the following buckets would be used: -10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-89, 89+.
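
The bucket assignment in this example can be sketched locally as follows. fixedSizeBucket is a hypothetical function, and its handling of boundary values (a value on a bucket edge goes to the higher bucket, and upperBound itself stays in the last regular bucket) is an assumption, since the text above does not say which side is inclusive.

```javascript
// Local sketch of fixed-size bucketing; boundary handling is an assumption.
function fixedSizeBucket(value, lowerBound, upperBound, bucketSize) {
  // Everything below the lower bound shares one open-ended bucket.
  if (value < lowerBound) return `-${lowerBound}`;
  // Everything above the upper bound shares one open-ended bucket.
  if (value > upperBound) return `${upperBound}+`;
  let start = lowerBound + Math.floor((value - lowerBound) / bucketSize) * bucketSize;
  // Keep a value equal to upperBound inside the last regular bucket.
  if (start >= upperBound && start > lowerBound) start -= bucketSize;
  // The last bucket is cut short when the range is not a multiple of bucketSize.
  const end = Math.min(start + bucketSize, upperBound);
  return `${start}-${end}`;
}

for (const v of [5, 10, 42, 85, 120]) {
  console.log(v, '->', fixedSizeBucket(v, 10, 89, 10));
}
```

With the bounds from the example, 42 falls in 40-50, 85 falls in the shortened 80-89 bucket, and 120 falls in 89+.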

bucketingConfig

The bucketingConfig transformation offers more flexibility than the other bucketing transformation, fixedSizeBucketingConfig. Instead of specifying upper and lower bounds and an interval value with which to create equal-sized buckets, you specify the maximum and minimum values for each bucket you want created. Each maximum and minimum value pair must have the same type.

Setting bucketingConfig to the BucketingConfig object specifies custom buckets. The BucketingConfig object consists of a buckets[] array of Bucket objects. Each Bucket object consists of the following:

  • min: The lower bound of the bucket’s range. Omit this value to create a bucket that has no lower bound.
  • max: The upper bound of the bucket’s range. Omit this value to create a bucket that has no upper bound.
  • replacementValue: The value with which to replace values that fall within the lower and upper bounds. If you don’t provide a replacementValue, a hyphenated min-max range will be used instead.

If a value falls outside of the defined ranges, the TransformationSummary returned will contain an error message.

For example, consider the following configuration for the bucketingConfig transformation:

"bucketingConfig":
{
  "buckets":
  [
    {
      "min":
      {
        "integerValue": "1"
      },
      "max":
      {
        "integerValue": "30"
      },
      "replacementValue":
      {
        "stringValue": "LOW"
      }
    },
    {
      "min":
      {
        "integerValue": "31"
      },
      "max":
      {
        "integerValue": "65"
      },
      "replacementValue":
      {
        "stringValue": "MEDIUM"
      }
    },
    {
      "min":
      {
        "integerValue": "66"
      },
      "max":
      {
        "integerValue": "100"
      },
      "replacementValue":
      {
        "stringValue": "HIGH"
      }
    }
  ]
}

This defines the following behavior:

  • Integer values falling between 1 and 30 are masked by being replaced with LOW.
  • Integer values falling between 31 and 65 are masked by being replaced with MEDIUM.
  • Integer values falling between 66 and 100 are masked by being replaced with HIGH.
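
A local sketch of the same lookup is shown below. It uses plain numbers instead of the integerValue and stringValue wrappers, treats both bounds as inclusive (an assumption), and returns null where the API would instead report an error in the TransformationSummary. bucketValue is a hypothetical helper name.

```javascript
// Simplified buckets: plain numbers and strings instead of typed value wrappers.
const buckets = [
  {min: 1, max: 30, replacementValue: 'LOW'},
  {min: 31, max: 65, replacementValue: 'MEDIUM'},
  {min: 66, max: 100, replacementValue: 'HIGH'},
];

// Returns the replacement for the first bucket containing value, or null if none does.
function bucketValue(value, bucketList) {
  for (const {min, max, replacementValue} of bucketList) {
    // An omitted bound leaves that side of the bucket open.
    const aboveMin = min === undefined || value >= min;
    const belowMax = max === undefined || value <= max;
    if (aboveMin && belowMax) {
      // Without a replacementValue, fall back to a hyphenated min-max range.
      return replacementValue !== undefined ? replacementValue : `${min ?? ''}-${max ?? ''}`;
    }
  }
  return null;
}

for (const v of [17, 31, 100, 101]) {
  console.log(v, '->', bucketValue(v, buckets));
}
```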

replaceWithInfoTypeConfig

Specifying replaceWithInfoTypeConfig replaces each matched value with the name of the infoType. The replaceWithInfoTypeConfig message has no arguments; specifying it enables its transformation.

For example, suppose you’ve specified replaceWithInfoTypeConfig for all PHONE_NUMBER infoTypes, and the following string is sent to the DLP API:

John Smith, 123 Main St, Seattle, WA 98122, 206-555-0123.

The returned string will be the following:

John Smith, 123 Main St, Seattle, WA 98122, PHONE_NUMBER.

timePartConfig

Setting timePartConfig to a TimePartConfig object preserves a portion of a matched value that includes Date, Timestamp, and TimeOfDay values. The TimePartConfig object consists of a partToExtract argument, which can be set to any of the TimePart enumerated values, including year, month, day of the month, and so on.

For example, suppose you’ve configured a timePartConfig transformation by setting partToExtract to YEAR. After sending the data in the first column below to the DLP API, you’d end up with the transformed values in the second column:

Original values Transformed values
9/21/1976 1976
6/7/1945 1945
1/20/2009 2009
7/4/1776 1776
8/1/1984 1984
4/21/1982 1982
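
The table can be reproduced with a small local sketch. extractTimePart is a hypothetical function that parses only M/D/YYYY strings like those shown above, and the MONTH and DAY_OF_MONTH names are assumed to mirror the TimePart enumerated values; the API itself works on typed Date, Timestamp, and TimeOfDay values rather than on this string format.

```javascript
// Local sketch: keeps one part of an M/D/YYYY date string and drops the rest.
function extractTimePart(dateString, partToExtract) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(dateString);
  if (!match) throw new Error(`Unrecognized date: ${dateString}`);
  const [, month, day, year] = match;
  switch (partToExtract) {
    case 'YEAR':
      return year;
    case 'MONTH':
      return month;
    case 'DAY_OF_MONTH':
      return day;
    default:
      throw new Error(`Unsupported part: ${partToExtract}`);
  }
}

for (const d of ['9/21/1976', '6/7/1945', '1/20/2009']) {
  console.log(d, '->', extractTimePart(d, 'YEAR'));
}
```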

Record Transformations

Record transformations (the RecordTransformations object) are only applied to values within tabular data that are identified as a specific infoType. Within RecordTransformations, there are two further subcategories of transformations:

  • fieldTransformations[]: Rules that apply a transformation to one or more fields of the tabular data, optionally only when a condition is met.
  • recordSuppressions[]: Rules defining which records get suppressed completely. Records that match any suppression rule within recordSuppressions[] are omitted from the output.

Field Transformations

Each FieldTransformation object includes three arguments:

  • fields: One or more input fields (FieldID objects) to apply the transformation to.
  • condition: A condition (a RecordCondition object) that must evaluate to true for the transformation to be applied. For example, apply a bucket transformation to an age column of a record only if the ZIP code column for the same record is within a specific range. Or, redact a field only if the birthdate field puts a person's age at 85 or above.
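  • One of the following two transformation type arguments. Specifying one is required:
      • infoTypeTransformations: Treats the contents of the field as free text and applies a primitive transformation only to content that matches an infoType.
      • primitiveTransformation: Applies the specified primitive transformation to the entire field.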

Record Suppressions

In addition to applying transformations to field data, you can also instruct the DLP API to de-identify data by simply suppressing records when certain suppression conditions evaluate to true. You can apply both field transformations and record suppressions in the same request.

You set the recordSuppressions message of the RecordTransformations object to an array of one or more RecordSuppression objects.

Each RecordSuppression object contains a single RecordCondition object, which in turn contains a single Expressions object.

An Expressions object contains:

  • logicalOperator: One of the LogicalOperator enumerated types.
  • conditions: An array of one or more Condition objects. A Condition is a comparison of a field value and another value, both of which must be of type string, boolean, integer, double, Timestamp, or TimeOfDay.

If the comparison evaluates to true, the record is suppressed; otherwise, it is kept. If the compared values are not the same type, a warning is given and the condition evaluates to false.
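
The suppression rule can be sketched locally as a filter over rows. Everything in this sketch is a simplification for illustration: the condition shape ({field, operator, value}), the operator names, and the suppressRecords helper are hypothetical, and only an AND combination of conditions is modeled.

```javascript
// Hypothetical comparison operators for this sketch.
const OPERATORS = {
  EQUAL_TO: (a, b) => a === b,
  NOT_EQUAL_TO: (a, b) => a !== b,
  GREATER_THAN: (a, b) => a > b,
  LESS_THAN: (a, b) => a < b,
};

// Evaluates one condition against one record.
function conditionHolds(record, {field, operator, value}) {
  const actual = record[field];
  // Mismatched types: warn and treat the condition as false.
  if (typeof actual !== typeof value) {
    console.warn(`Type mismatch on "${field}"; condition treated as false`);
    return false;
  }
  return OPERATORS[operator](actual, value);
}

// Drops every record for which all conditions hold (AND semantics only).
function suppressRecords(records, expressions) {
  return records.filter(
    record => !expressions.conditions.every(c => conditionHolds(record, c))
  );
}

const rows = [
  {name: 'John', age: 42},
  {name: 'Debra', age: 91},
  {name: 'Nate', age: '95'}, // a string, so the numeric comparison is false
];
const kept = suppressRecords(rows, {
  logicalOperator: 'AND',
  conditions: [{field: 'age', operator: 'GREATER_THAN', value: 89}],
});
console.log(kept.map(r => r.name)); // [ 'John', 'Nate' ]
```

The third row survives even though its age looks larger than 89, because the type mismatch makes the condition evaluate to false.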
