Pseudonymization

Concept overview

Pseudonymization (also known as tokenization) is a de-identification technique in which sensitive data are replaced with surrogates (also known as tokens) in a consistent manner. This contrasts with other de-identification techniques such as redaction or generalization in that the surrogates retain their referential properties within the de-identified dataset. Furthermore, depending on which transformation is used to produce the surrogates, these surrogates may be reversed back into their original sensitive values by an authorized user.
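To make the consistency property concrete, here is a minimal Python sketch. It uses a keyed hash (HMAC-SHA256 from the standard library) purely for illustration: it is one-way and is not the algorithm the DLP API uses. The function name, the key, and the sample values are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 12) -> str:
    """Return a deterministic, one-way surrogate for `value` (illustrative only)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


demo_key = b"abcdefghijklmnop"  # hypothetical demo key, not a real secret
records = ["alice", "bob", "alice"]
tokens = [pseudonymize(name, demo_key) for name in records]

# Identical inputs map to identical surrogates, so references between
# records survive even though the original values are gone.
print(tokens)
```

The first and third surrogates match because the inputs match, which is the referential property described above.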

Transformation example

To understand how pseudonymization can be useful, consider the following example.

Suppose we have the following data:

Employee ID  Date  Compensation
11111        2015  $10
11111        2016  $20
22222        2016  $15

Perhaps we would like to know how employee compensation changed over time in order to identify any employees who are outliers. When we find an outlier, we'd like to correct the earning disparity. Furthermore, we would like to do this without exposing the sensitive compensation data for any individual.

Suppose that the only way to identify an individual is through the "Employee ID" column. That column then needs to be de-identified.

We can then apply a pseudonymization transformation to "Employee ID". Consider two options for pseudonymization transformations: CryptoHashConfig and CryptoReplaceFfxFpeConfig. Since we'd like to be able to recover the actual employee who is an outlier, we'll need a reversible transformation. That narrows our selection down to CryptoReplaceFfxFpeConfig.
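What "reversible" and "format-preserving" mean together can be sketched with a toy cipher. The Python below is a simplified Feistel network over digit strings, written only to illustrate the idea; it is not FFX, it is not what CryptoReplaceFfxFpeConfig runs internally, and it should not be used to protect real data. All names in it are hypothetical.

```python
import hashlib
import hmac

ROUNDS = 10  # even, so the left/right split is the same before and after


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed pseudo-random number in [0, modulus) derived from one half."""
    msg = f"{round_no}:{modulus}:{half}".encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two ASCII digits")


def toy_encrypt(digits: str, key: bytes) -> str:
    """Map a digit string to another digit string of the same length."""
    _check(digits)
    split = len(digits) // 2
    left, right = digits[:split], digits[split:]
    for round_no in range(ROUNDS):
        modulus = 10 ** len(left)
        mixed = (int(left) + _round_value(key, round_no, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(len(left))
    return left + right


def toy_decrypt(digits: str, key: bytes) -> str:
    """Invert toy_encrypt by undoing the rounds in reverse order."""
    _check(digits)
    split = len(digits) // 2
    left, right = digits[:split], digits[split:]
    for round_no in reversed(range(ROUNDS)):
        # The current left half was the right half going into this round.
        modulus = 10 ** len(right)
        original = (int(right) - _round_value(key, round_no, left, modulus)) % modulus
        left, right = str(original).zfill(len(right)), left
    return left + right


demo_key = b"abcdefghijklmnop"  # hypothetical demo key
token = toy_encrypt("11111", demo_key)
print(token, toy_decrypt(token, demo_key))
```

A hash-based transformation gives consistency only; a keyed permutation like this one also lets a holder of the key map a token back to the original value.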

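Next we configure the transformation as follows. The key is the 16-byte string "abcdefghijklmnop", which appears in JSON in base64-encoded form:

"cryptoReplaceFfxFpeConfig": {
  "cryptoKey": {
    "unwrapped": {
      "key": "YWJjZGVmZ2hpamtsbW5vcA=="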
    }
  },
  "commonAlphabet": "NUMERIC"
}
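The requests in the rest of this example carry the key as "YWJjZGVmZ2hpamtsbW5vcA==". That string is the base64 encoding of the 16 bytes "abcdefghijklmnop", which can be checked with a short Python sketch (the variable names are illustrative):

```python
import base64

raw_key = b"abcdefghijklmnop"  # 16 bytes, i.e. a 128-bit key
encoded_key = base64.b64encode(raw_key).decode("ascii")

print(encoded_key)  # YWJjZGVmZ2hpamtsbW5vcA==
```

A hard-coded demo key like this one is only suitable for examples.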

We apply this transformation to "Employee ID" via the content.deidentify method as follows:

JSON Input:

{
 "deidentifyConfig": {
  "recordTransformations": {
   "fieldTransformations": [
    {
     "primitiveTransformation": {
      "cryptoReplaceFfxFpeConfig": {
       "cryptoKey": {
        "unwrapped": {
         "key": "YWJjZGVmZ2hpamtsbW5vcA=="
        }
       },
       "commonAlphabet": "NUMERIC"
      }
     },
     "fields": [
      {
       "name": "Employee ID"
      }
     ]
    }
   ]
  }
 },
 "item": {
  "table": {
   "headers": [
    {
     "name": "Employee ID"
    },
    {
     "name": "Date"
    },
    {
     "name": "Compensation"
    }
   ],
   "rows": [
    {
     "values": [
      {
       "stringValue": "11111"
      },
      {
       "stringValue": "2015"
      },
      {
       "stringValue": "$10"
      }
     ]
    },
    {
     "values": [
      {
       "stringValue": "11111"
      },
      {
       "stringValue": "2016"
      },
      {
       "stringValue": "$20"
      }
     ]
    },
    {
     "values": [
      {
       "stringValue": "22222"
      },
      {
       "stringValue": "2016"
      },
      {
       "stringValue": "$15"
      }
     ]
    }
   ]
  }
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:deidentify

JSON Output:

{
 "item": {
  "table": {
   "headers": [
    {
     "name": "Employee ID"
    },
    {
     "name": "Date"
    },
    {
     "name": "Compensation"
    }
   ],
   "rows": [
    {
     "values": [
      {
       "stringValue": "34668"
      },
      {
       "stringValue": "2015"
      },
      {
       "stringValue": "$10"
      }
     ]
    },
    {
     "values": [
      {
       "stringValue": "34668"
      },
      {
       "stringValue": "2016"
      },
      {
       "stringValue": "$20"
      }
     ]
    },
    {
     "values": [
      {
       "stringValue": "82979"
      },
      {
       "stringValue": "2016"
      },
      {
       "stringValue": "$15"
      }
     ]
    }
   ]
  }
 },
 "overview": {
  "transformedBytes": "15",
  "transformationSummaries": [
   {
    "field": {
     "name": "Employee ID"
    },
    "results": [
     {
      "count": "3",
      "code": "SUCCESS"
     }
    ],
    "fieldTransformations": [
     {
      "fields": [
       {
        "name": "Employee ID"
       }
      ],
      "primitiveTransformation": {
       "cryptoReplaceFfxFpeConfig": {
        "cryptoKey": {
         "unwrapped": {
          "key": "YWJjZGVmZ2hpamtsbW5vcA=="
         }
        },
        "commonAlphabet": "NUMERIC"
       }
      }
     }
    ],
    "transformedBytes": "15"
   }
  ]
 }
}

The table post-transformation is as follows:

Employee ID  Date  Compensation
34668        2015  $10
34668        2016  $20
82979        2016  $15

This transformation suits our needs: it removes the real employee IDs while replacing identical IDs consistently, so records belonging to a given employee remain linked. For example, employee ID "11111" has been replaced by "34668" in both of the first two records. The data remains useful for analysis, and the employees' identities are protected.
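The request body used in this example is verbose when written by hand. As a rough sketch, it could be assembled from plain Python lists as follows. The helper name build_deidentify_request is hypothetical, and the code only builds the JSON; sending it to the API (with authentication) is left out.

```python
import json


def build_deidentify_request(headers, rows, field, key_b64):
    """Assemble a content.deidentify body that FPE-transforms one column."""
    transformation = {
        "primitiveTransformation": {
            "cryptoReplaceFfxFpeConfig": {
                "cryptoKey": {"unwrapped": {"key": key_b64}},
                "commonAlphabet": "NUMERIC",
            }
        },
        "fields": [{"name": field}],
    }
    table = {
        "headers": [{"name": h} for h in headers],
        "rows": [
            {"values": [{"stringValue": cell} for cell in row]} for row in rows
        ],
    }
    return {
        "deidentifyConfig": {
            "recordTransformations": {"fieldTransformations": [transformation]}
        },
        "item": {"table": table},
    }


request_body = build_deidentify_request(
    headers=["Employee ID", "Date", "Compensation"],
    rows=[
        ["11111", "2015", "$10"],
        ["11111", "2016", "$20"],
        ["22222", "2016", "$15"],
    ],
    field="Employee ID",
    key_b64="YWJjZGVmZ2hpamtsbW5vcA==",
)
print(json.dumps(request_body, indent=1))
```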

Reversal example

To illustrate reversal, we pick up from the previous example. Suppose that analysis determines that the employee with ID "34668" should have been paid a higher wage, and the data owners (perhaps managers at the company) would like to adjust that employee's compensation. To do this, they need the real employee ID. To reverse "34668" into the real ID, we use the content.reidentify method as shown in the following JSON example:

JSON Input:

{
 "reidentifyConfig": {
  "recordTransformations": {
   "fieldTransformations": [
    {
     "primitiveTransformation": {
      "cryptoReplaceFfxFpeConfig": {
       "cryptoKey": {
        "unwrapped": {
         "key": "YWJjZGVmZ2hpamtsbW5vcA=="
        }
       },
       "commonAlphabet": "NUMERIC"
      }
     },
     "fields": [
      {
       "name": "Employee ID"
      }
     ]
    }
   ]
  }
 },
 "item": {
  "table": {
   "headers": [
    {
     "name": "Employee ID"
    }
   ],
   "rows": [
    {
     "values": [
      {
       "stringValue": "34668"
      }
     ]
    }
   ]
  }
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:reidentify

JSON Output:

{
 "item": {
  "table": {
   "headers": [
    {
     "name": "Employee ID"
    }
   ],
   "rows": [
    {
     "values": [
      {
       "stringValue": "11111"
      }
     ]
    }
   ]
  }
 },
 "overview": {
  "transformedBytes": "5",
  "transformationSummaries": [
   {
    "field": {
     "name": "Employee ID"
    },
    "results": [
     {
      "count": "1",
      "code": "SUCCESS"
     }
    ],
    "fieldTransformations": [
     {
      "fields": [
       {
        "name": "Employee ID"
       }
      ],
      "primitiveTransformation": {
       "cryptoReplaceFfxFpeConfig": {
        "cryptoKey": {
         "unwrapped": {
          "key": "YWJjZGVmZ2hpamtsbW5vcA=="
         }
        },
        "commonAlphabet": "NUMERIC"
       }
      }
     }
    ],
    "transformedBytes": "5"
   }
  ]
 }
}

As shown in the JSON output, on return we obtain the real employee ID: "11111".
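Note that the reversal request reuses the same key and alphabet as the original transformation; only the top-level config name and the item change. A minimal Python sketch of that relationship is shown here. The helper name to_reidentify_request is hypothetical, and the sketch only constructs the JSON body.

```python
import copy

# The field transformation used when the data was de-identified.
deidentify_config = {
    "recordTransformations": {
        "fieldTransformations": [
            {
                "primitiveTransformation": {
                    "cryptoReplaceFfxFpeConfig": {
                        "cryptoKey": {
                            "unwrapped": {"key": "YWJjZGVmZ2hpamtsbW5vcA=="}
                        },
                        "commonAlphabet": "NUMERIC",
                    }
                },
                "fields": [{"name": "Employee ID"}],
            }
        ]
    }
}


def to_reidentify_request(config, field, tokens):
    """Build a content.reidentify body that reverses `tokens` in one column."""
    return {
        # Same transformation settings, placed under "reidentifyConfig".
        "reidentifyConfig": copy.deepcopy(config),
        "item": {
            "table": {
                "headers": [{"name": field}],
                "rows": [{"values": [{"stringValue": t}]} for t in tokens],
            }
        },
    }


reidentify_body = to_reidentify_request(deidentify_config, "Employee ID", ["34668"])
print(reidentify_body["item"])
```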

Contexts

When transforming structured data (tabular data with records and fields) with a pseudonymization transformation such as CryptoReplaceFfxFpeConfig, a context may be specified. When specified, context defines a field whose value in a given record will be taken as the "tweak." For example, in the table below, "Patient ID" may be chosen to be the context, in which case the tweak is taken as "4672" for the first record, "3246" for the second record, and so on.

To understand the purpose of specifying a context, suppose first that the "Name" field in the table below is transformed without using a context. In this case, each identical name will be replaced with the same token. This implies that two matching tokens refer to the same name. However, this implication may be unwanted since it may reveal sensitive relationships between records. This is where we can make use of a context to break these relationships, in which case these relationships will only hold for the tokens generated using an identical tweak.

For example, consider the table below.

Bill Number  Patient ID  Name   ...
223          4672        John
224          3246        Debra
225          3529        Nate
226          4098        Debra
...

Applying this transformation to "Name" without specifying a context results in the following transformed table (the exact token values depend on the specified cryptoKey):

Bill Number  Patient ID  Name   ...
223          4672        gCUv
224          3246        Eusyv
225          3529        dsla
226          4098        Eusyv
...

Note that in the table above the tokens for records with name "Debra" are the same. To break this relationship, we can specify "Patient ID" as the context and run the transformation over the original table. This yields the following transformed table (the exact token values depend on the specified cryptoKey):

Bill Number  Patient ID  Name   ...
223          4672        Agca
224          3246        vSHig
225          3529        kqHX
226          4098        CUgv
...

Notice now how "Debra" is replaced with different tokens since "Patient ID" was different between the two records.
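The effect of a tweak can be imitated with a short Python sketch. It mixes the tweak into a keyed hash, which is only a stand-in for the real transformation (it is one-way and is not FFX), and the tokenize function is hypothetical. The final dictionary shows how the context might be expressed in the configuration, assuming the context field of CryptoReplaceFfxFpeConfig takes a field name and that an alphanumeric alphabet is appropriate for names.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes, tweak: str = "") -> str:
    """Illustrative keyed token; the tweak is mixed into the hashed message."""
    message = tweak.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:8]


demo_key = b"abcdefghijklmnop"  # hypothetical demo key
rows = [("4672", "John"), ("3246", "Debra"), ("3529", "Nate"), ("4098", "Debra")]

# No context: both "Debra" records receive the same token.
without_context = [tokenize(name, demo_key) for _, name in rows]

# "Patient ID" as context: each record's ID acts as the tweak.
with_context = [tokenize(name, demo_key, tweak=pid) for pid, name in rows]

print(without_context)
print(with_context)

# Assumed shape of the corresponding configuration (not taken from the text above).
fpe_config_with_context = {
    "cryptoKey": {"unwrapped": {"key": "YWJjZGVmZ2hpamtsbW5vcA=="}},
    "commonAlphabet": "ALPHA_NUMERIC",
    "context": {"name": "Patient ID"},
}
```

Tokens produced with the same tweak still match for equal inputs, so relationships hold only within a single context value.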

Surrogate annotations

In the examples above, structured (tabular) data is used. This makes reversal easy since the token to be reversed can be easily identified. However, when the token exists in free text, performing reversal requires locating the token first. Reversible transformations provide for this via the surrogateInfoType field. For details, see CryptoReplaceFfxFpeConfig.

This field works together with the custom info type SurrogateType to facilitate inspecting free text for tokens.

De-identification in free text code example

Consider the following example, which uses the content.deidentify method to replace a phone number found in free text with a token.

JSON Input:

{
 "deidentifyConfig": {
  "infoTypeTransformations": {
   "transformations": [
    {
     "infoTypes": [
      {
       "name": "PHONE_NUMBER"
      }
     ],
     "primitiveTransformation": {
      "cryptoReplaceFfxFpeConfig": {
       "cryptoKey": {
        "unwrapped": {
         "key": "YWJjZGVmZ2hpamtsbW5vcA=="
        }
       },
       "commonAlphabet": "NUMERIC",
       "surrogateInfoType": {
        "name": "PHONE_TOKEN"
       }
      }
     }
    }
   ]
  }
 },
 "inspectConfig": {
  "infoTypes": [
   {
    "name": "PHONE_NUMBER"
   }
  ],
  "minLikelihood": "UNLIKELY"
 },
 "item": {
  "value": "My phone number is 4359916732"
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:deidentify

After the JSON is sent to this URL, the DLP API returns the following output.

JSON Output:

{
 "item": {
  "value": "My phone number is PHONE_TOKEN(10):0328896717"
 },
 "overview": {
  "transformedBytes": "10",
  "transformationSummaries": [
   {
    "infoType": {
     "name": "PHONE_NUMBER"
    },
    "transformation": {
     "cryptoReplaceFfxFpeConfig": {
      "cryptoKey": {
       "unwrapped": {
        "key": "YWJjZGVmZ2hpamtsbW5vcA=="
       }
      },
      "commonAlphabet": "NUMERIC",
      "surrogateInfoType": {
       "name": "PHONE_TOKEN"
      }
     }
    },
    "results": [
     {
      "count": "1",
      "code": "SUCCESS"
     }
    ],
    "transformedBytes": "10"
   }
  ]
 }
}

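The DLP API has successfully de-identified the phone number, replacing it with the token 0328896717. In the text, the token is prefixed with the surrogate annotation PHONE_TOKEN(10):, which gives the surrogate info type name and the token's length.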
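The output above suggests that a surrogate is annotated as the info type name, the token length in parentheses, a colon, and then the token itself. Assuming that layout, the following Python sketch shows how such tokens could be located in free text. It is only an illustration of why the annotation makes tokens findable; the find_surrogates helper is hypothetical, and the API performs this inspection itself when a SurrogateType custom info type is configured.

```python
import re

# Assumed layout: NAME(length):token
_ANNOTATION = re.compile(r"(\w+)\((\d+)\):")


def find_surrogates(text: str, info_type: str):
    """Return (start_index, token) pairs for surrogates of `info_type`."""
    found = []
    for match in _ANNOTATION.finditer(text):
        name, length = match.group(1), int(match.group(2))
        if name != info_type:
            continue
        token = text[match.end():match.end() + length]
        if len(token) == length:  # ignore truncated tokens
            found.append((match.start(), token))
    return found


text = "My phone number is PHONE_TOKEN(10):0328896717"
print(find_surrogates(text, "PHONE_TOKEN"))
```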

Re-identification in free text code example

In this second example, we use the content.reidentify method to reverse the transformed text from the first example back into the original number.

JSON Input:

{
 "reidentifyConfig": {
  "infoTypeTransformations": {
   "transformations": [
    {
     "infoTypes": [
      {
       "name": "PHONE_TOKEN"
      }
     ],
     "primitiveTransformation": {
      "cryptoReplaceFfxFpeConfig": {
       "cryptoKey": {
        "unwrapped": {
         "key": "YWJjZGVmZ2hpamtsbW5vcA=="
        }
       },
       "commonAlphabet": "NUMERIC",
       "surrogateInfoType": {
        "name": "PHONE_TOKEN"
       }
      }
     }
    }
   ]
  }
 },
 "inspectConfig": {
  "customInfoTypes": [
   {
    "infoType": {
     "name": "PHONE_TOKEN"
    },
    "surrogateType": {
    }
   }
  ]
 },
 "item": {
  "value": "My phone number is PHONE_TOKEN(10):0328896717"
 }
}

URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/content:reidentify

JSON Output:

{
 "item": {
  "value": "My phone number is 4359916732"
 },
 "overview": {
  "transformedBytes": "26",
  "transformationSummaries": [
   {
    "infoType": {
     "name": "PHONE_TOKEN"
    },
    "transformation": {
     "cryptoReplaceFfxFpeConfig": {
      "cryptoKey": {
       "unwrapped": {
        "key": "YWJjZGVmZ2hpamtsbW5vcA=="
       }
      },
      "commonAlphabet": "NUMERIC",
      "surrogateInfoType": {
       "name": "PHONE_TOKEN"
      }
     }
    },
    "results": [
     {
      "count": "1",
      "code": "SUCCESS"
     }
    ],
    "transformedBytes": "26"
   }
  ]
 }
}

The DLP API has successfully re-identified the phone number (4359916732).

Resources

For more information about how to use the DLP API to pseudonymize, de-identify, and re-identify sensitive data, see De-identifying Sensitive Data in Text Content.
