Measuring Re-identification Risk in De-identified Content

De-identification is the process of removing identifying information from data. The Data Loss Prevention API can detect and de-identify sensitive data for you, according to how you’ve configured it to meet your business’s requirements.

Conversely, re-identification is the process of matching up de-identified data with other available data to determine the person to whom the data belongs. Re-identification is most often talked about in the context of sensitive personal information, such as medical or financial data.

This topic describes several techniques for measuring re-identification risk, and gives a brief overview of how the DLP API enables you to use them with your de-identified data.

Re-identification risk metrics

If you don’t correctly or adequately de-identify sensitive data, you risk an attacker re-identifying the data, which can have serious privacy implications. The DLP API can help compute the likelihood that de-identified data will be re-identified, according to several metrics.

Before diving into the metrics, we’ll first define a few common terms:

  • Identifiers: Identifiers can be used to uniquely identify an individual. For example, someone’s full name or government ID number are considered identifiers.
  • Quasi-identifiers: Quasi-identifiers don’t uniquely identify an individual, but, when combined and cross-referenced with individual records, they can substantially increase the likelihood that an attacker will be able to re-identify an individual. For example, ZIP codes and ages are considered quasi-identifiers.
  • Sensitive data: Sensitive data is data that is protected against unauthorized exposure. Attributes like health conditions, salary, criminal offenses, and geographic location are considered sensitive data. Note that there can be overlap between identifiers and sensitive data.
  • Equivalence classes: An equivalence class is a group of rows with identical quasi-identifiers.

There are three techniques that the DLP API can use to determine re-identification risk:

  • k-anonymity: A property of a dataset that indicates the re-identifiability of its records. A dataset is k-anonymous if the quasi-identifiers for each person in the dataset are identical to those of at least k – 1 other people in the dataset.
  • l-diversity: An extension of k-anonymity that additionally measures the diversity of sensitive values for each column in which they occur. A dataset has l-diversity if, for every set of rows with identical quasi-identifiers, there are at least l distinct values for each sensitive attribute.
  • k-map: Computes re-identifiability risk by comparing a given de-identified dataset of subjects with a larger re-identification—or “attack”—dataset. The DLP API doesn’t know the attack dataset, but it statistically models it by using publicly available data like the US Census, by using a custom statistical model (indicated as one or more BigQuery tables), or by extrapolating from the distribution of values in the input dataset. Each dataset—the sample dataset and the re-identification dataset—shares one or more quasi-identifier columns.

About k-anonymity

When collecting data for research purposes, de-identification can be essential for helping maintain people's privacy. At the same time, de-identification can strip a dataset of its practical usefulness. K-anonymity was born out of a desire both to quantify the re-identifiability of a dataset and to balance the usefulness of de-identified data with the privacy of the people it describes. It is a property of a dataset that you can use to assess the re-identifiability of records within that dataset.

As an example, consider a set of patient data:

Patient ID | Full Name             | ZIP Code | Age | Condition         | ...
746572     | John J. Jacobsen      | 98122    | 29  | Heart disease     |
652978     | Debra D. Dreb         | 98115    | 29  | Diabetes, Type II |
075321     | Abraham A. Abernathy  | 98122    | 54  | Cancer, Liver     |
339012     | Karen K. Krakow       | 98115    | 88  | Heart disease     |
995212     | William W. Wertheimer | 98115    | 54  | Asthma            |

This dataset contains all three types of data we described previously: identifiers, quasi-identifiers, and sensitive data.

If sensitive data like health conditions isn't masked or redacted, an attacker could cross-reference the quasi-identifiers attached to each record with another dataset that contains similar quasi-identifiers, and re-identify the people to whom that sensitive data applies.

A dataset is said to be k-anonymous if every combination of quasi-identifier values in the dataset appears for at least k different records. Recall that a group of rows with identical quasi-identifiers is called an "equivalence class." For example, if you've generalized the quasi-identifiers enough that every equivalence class contains at least four rows, the dataset has a k-anonymity value of 4.
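
The definition above can be sketched in a few lines of Python. The `k_anonymity` helper below is a toy illustration, not part of the DLP API; it computes the size of the smallest equivalence class over a chosen set of quasi-identifier columns, using the quasi-identifiers from the patient table:

```python
from collections import Counter

def k_anonymity(rows, quasi_ids):
    """Return the k-anonymity value of `rows`: the size of the
    smallest equivalence class over the given quasi-identifiers."""
    classes = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(classes.values())

# The patient table above, keeping only the quasi-identifier columns.
patients = [
    {"zip": "98122", "age": 29},
    {"zip": "98115", "age": 29},
    {"zip": "98122", "age": 54},
    {"zip": "98115", "age": 88},
    {"zip": "98115", "age": 54},
]

# Every (ZIP, age) pair is unique, so the raw table is only 1-anonymous.
print(k_anonymity(patients, ["zip", "age"]))  # 1

# Masking age entirely (so ZIP is the only quasi-identifier) leaves
# equivalence classes of size 2 and 3, so the dataset becomes 2-anonymous.
print(k_anonymity(patients, ["zip"]))  # 2
```

This also illustrates the utility trade-off mentioned earlier: generalizing or masking quasi-identifiers raises k, but discards information.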

For more information about k-anonymity, see "Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression," from the Harvard University Data Privacy Lab.

For information about how to compute k-anonymity with the DLP API, see Risk analysis using the DLP API, later in this topic.

About l-diversity

L-diversity is closely related to k-anonymity, and was created to help address a de-identified dataset’s susceptibility to attacks such as:

  • Homogeneity attacks, in which attackers predict sensitive values for a set of k-anonymized data by taking advantage of the homogeneity of values within a set of k records.
  • Background knowledge attacks, in which attackers take advantage of associations between quasi-identifier values that have a certain sensitive attribute to narrow down the attribute’s possible values.

L-diversity attempts to measure how much an attacker can learn about people in terms of k-anonymity and equivalence classes (sets of rows with identical quasi-identifier values). A dataset has an l-diversity value of l if, for every equivalence class, there are at least l distinct values for each sensitive attribute. For example, if l-diversity = 1, all rows in at least one equivalence class share the same sensitive value; if l-diversity = 2, every equivalence class contains at least two distinct sensitive values; and so on.
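
This can be sketched with a toy helper analogous to the k-anonymity example; `l_diversity` below is an illustration, not a DLP API call. It finds the equivalence class with the fewest distinct sensitive values:

```python
from collections import defaultdict

def l_diversity(rows, quasi_ids, sensitive):
    """Return the l-diversity value: the smallest number of distinct
    sensitive values found in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        classes[key].add(row[sensitive])
    return min(len(values) for values in classes.values())

# Patient rows generalized so that ZIP code is the only quasi-identifier.
patients = [
    {"zip": "98122", "condition": "Heart disease"},
    {"zip": "98115", "condition": "Diabetes, Type II"},
    {"zip": "98122", "condition": "Cancer, Liver"},
    {"zip": "98115", "condition": "Heart disease"},
    {"zip": "98115", "condition": "Asthma"},
]

# The 98122 class has 2 distinct conditions and the 98115 class has 3,
# so the dataset is 2-diverse.
print(l_diversity(patients, ["zip"], "condition"))  # 2
```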

For more information about l-diversity, see "l-Diversity: Privacy Beyond k-Anonymity," from the Cornell University Department of Computer Science.

For information about how to compute l-diversity with the DLP API, see Risk analysis using the DLP API, later in this topic.

About k-map

K-map is very similar to k-anonymity, except that it assumes that the attacker most likely doesn’t know who is in the dataset. Use k-map if your dataset is relatively small, or if the level of effort involved in generalizing attributes would be too high.

Just like k-anonymity, k-map requires you to determine which columns of your database are quasi-identifiers. In doing this, you are stating what data an attacker will most likely use to re-identify subjects. In addition, computing a k-map value requires a re-identification dataset: a larger table with which to compare rows in the original dataset.

A small sample dataset

Consider the following example dataset. This sample data is part of a larger hypothetical database of people with a certain genetic disease.

ZIP code | Age
85535    | 79
60629    | 42

Taken on its own, this table appears to reveal the same amount of information about both individuals. In fact, a k-anonymity analysis of the larger dataset might flag the subject in the second row as highly identifiable, because that combination of ZIP code and age is likely unique within the dataset. Considering outside data, however, shows that it isn't so. The United States ZIP code 85535 is home to about 20 people, so there is probably only one person of exactly 79 years of age living there. Compare this to ZIP code 60629, which is part of the Chicago metropolitan area and houses over 100,000 people; approximately 1,000 of them are exactly 42 years old.

The first row in our small dataset was easily re-identified, but not the second. According to k-anonymity, however, both rows might be completely unique in the larger dataset.

Enter k-map

K-map, like k-anonymity, requires you to determine which columns of your database are quasi-identifiers. The DLP API’s risk analysis methods simulate a re-identification dataset to approximate the steps an attacker might take to compare it against the original dataset and re-identify the data. For our previous example, since it deals in US locations (ZIP codes) and personal data (ages), and since we assume that the attacker doesn’t know who has the genetic disease, the re-identification dataset could be everyone living in the US.

Now that you have quasi-identifiers and a re-identification dataset, you can compute the k-map value: Your data satisfies the k-map value k if every combination of values for the quasi-identifiers appears at least k times in the re-identification dataset.
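
Assuming, purely for illustration, that the re-identification dataset is available as an explicit table (in practice the DLP API models it statistically), the computation can be sketched like this. The `k_map` helper and the toy "population" are hypothetical:

```python
from collections import Counter

def k_map(sample_rows, reid_rows, quasi_ids):
    """Return the k-map value: for each sample row, count how often its
    quasi-identifier combination appears in the re-identification
    dataset; the k-map value is the smallest such count."""
    reid_counts = Counter(tuple(r[q] for q in quasi_ids) for r in reid_rows)
    return min(reid_counts[tuple(r[q] for q in quasi_ids)] for r in sample_rows)

sample = [{"zip": "85535", "age": 79}, {"zip": "60629", "age": 42}]

# A tiny stand-in for "everyone living in the US": (85535, 79) occurs
# once, (60629, 42) occurs three times, plus some unrelated rows.
population = (
    [{"zip": "85535", "age": 79}]
    + [{"zip": "60629", "age": 42}] * 3
    + [{"zip": "85535", "age": 33}] * 5
)

# The minimum is driven by the 85535 row, which matches only one person.
print(k_map(sample, population, ["zip", "age"]))  # 1
```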

Given this definition, and that the first row in our database likely only corresponds to one person in the US, the example dataset doesn’t satisfy a k-map value requirement of 2 or more. To get a larger k-map value, we could remove age values like we’ve done here:

ZIP code | Age
85535    | **
60629    | **

As previously mentioned, the 85535 ZIP code has about 20 people and 60629 has over 100,000. Therefore, we can estimate that this new, generalized dataset has a k-map value of around 20.

Risk analysis using the DLP API

The DLP API can analyze your structured data stored in BigQuery tables and compute the k-anonymity, l-diversity, and k-map privacy metrics, as described in the following sections.

Computing k-anonymity with the DLP API

You can compute the k-anonymity value based on one or more columns, or fields, by sending a request to projects.dataSource.analyze with the kAnonymityConfig field set to a KAnonymityConfig object. Within the KAnonymityConfig object, you specify the following:

  • quasiIds[]: One or more quasi-identifiers (FieldId objects) to scan and use to compute k-anonymity. When you specify multiple quasi-identifiers, they are considered a single composite key. Structs and repeated data types are not supported, but nested fields are supported as long as they are not structs themselves or nested within a repeated field.
  • entityId: An optional entity identifier (EntityId object), containing a field ID (FieldId object). In this context, an "entity" is the set of one or more rows that all correspond to the same single person in the dataset. Specifying an entity ID ensures that generalizations and analysis are consistent across all rows pertaining to the same entity; otherwise, a single entity could contribute to the k-anonymity computation more than once.
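
As a concrete illustration, the JSON representation of such a configuration might look like the following sketch, written here as a Python dictionary. The column names (zip_code, age, patient_id) are placeholders, and the exact field spelling should be checked against the API reference:

```python
# Illustrative KAnonymityConfig sketch as a Python dict mirroring the
# JSON structure; column names are placeholders, not real fields.
k_anonymity_config = {
    # Multiple quasi-identifiers are treated as a single composite key.
    "quasiIds": [
        {"name": "zip_code"},
        {"name": "age"},
    ],
    # Optional: groups rows belonging to the same person, so that one
    # person doesn't contribute to the computation more than once.
    "entityId": {"field": {"name": "patient_id"}},
}
```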

Computing l-diversity with the DLP API

You can compute the l-diversity value for one or more columns, or fields, by setting the lDiversityConfig field to an LDiversityConfig object. Within the LDiversityConfig object, you specify the following:

  • quasiIds[]: A set of quasi-identifiers (FieldId objects) that indicate how equivalence classes are defined for the l-diversity computation. As with KAnonymityConfig, when you specify multiple fields, they are considered a single composite key.
  • sensitiveAttribute: Sensitive field (FieldId object) for computing the l-value.
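
The corresponding configuration might look like the following sketch, again as a Python dictionary mirroring the JSON structure, with placeholder column names:

```python
# Illustrative LDiversityConfig sketch; column names are placeholders.
l_diversity_config = {
    # Quasi-identifiers defining the equivalence classes (treated as a
    # single composite key, as with KAnonymityConfig).
    "quasiIds": [
        {"name": "zip_code"},
        {"name": "age"},
    ],
    # Sensitive field for which the l-value is computed.
    "sensitiveAttribute": {"name": "condition"},
}
```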

Computing k-map estimates with the DLP API

You can estimate k-map values using the DLP API, which uses a statistical model to estimate a re-identification dataset. This is in contrast to the other risk analysis methods, in which the attack dataset is explicitly known. Depending on the type of data, the DLP API uses publicly available datasets (for example, from the US Census) or a custom statistical model (for example, one or more BigQuery tables that you specify), or it extrapolates from the distribution of values in your input dataset.

To compute a k-map estimate using the DLP API, set the kMapEstimationConfig field to a KMapEstimationConfig object. Within the KMapEstimationConfig object, you specify the following:

  • quasiIds[]: Required. Fields (TaggedField objects) considered to be quasi-identifiers to scan and use to compute k-map. No two columns can have the same tag. These can be any of the following:

    • An infoType: This causes the DLP API to use the relevant public dataset as a statistical model of population, including US ZIP codes, region codes, ages, and genders.
    • A custom infoType: A custom tag wherein you indicate an auxiliary table (an AuxiliaryTable object) that contains statistical information about the possible values of this column.
    • The inferred tag: If no semantic tag is indicated, specify inferred. The DLP API infers the statistical model from the distribution of values in the input data.
  • regionCode: An ISO 3166-1 alpha-2 region code for the DLP API to use in statistical modeling. This value is required if no column is tagged with a region-specific infoType (for example, a US ZIP code) or a region code.

  • auxiliaryTables[]: Auxiliary tables (AuxiliaryTable objects) to use in the analysis. Each custom tag used to tag a quasi-identifier column (from quasiIds[]) must appear in exactly one column of one auxiliary table.
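
Putting these pieces together, a KMapEstimationConfig might look like the following sketch, written as a Python dictionary mirroring the JSON structure. The column names, infoType names, the auxiliary table, and the exact spelling of nested fields are illustrative assumptions to be checked against the API reference:

```python
# Illustrative KMapEstimationConfig sketch; all names are placeholder
# assumptions, not verbatim API values.
k_map_estimation_config = {
    "quasiIds": [
        # Tagged with an infoType: the DLP API models this column using
        # the relevant public dataset (here, US ZIP codes).
        {"field": {"name": "zip_code"}, "infoType": {"name": "US_ZIP_5"}},
        # No semantic tag applies, so the statistical model is inferred
        # from the distribution of values in the input data.
        {"field": {"name": "age"}, "inferred": {}},
        # A custom tag pointing at the auxiliary statistics table below.
        {"field": {"name": "job_title"}, "customTag": "occupation"},
    ],
    # Optional here, since zip_code already carries a region-specific
    # infoType; shown for completeness.
    "regionCode": "US",
    "auxiliaryTables": [
        {
            # Placeholder BigQuery table with frequency statistics for
            # the column tagged "occupation".
            "table": {
                "projectId": "my-project",
                "datasetId": "risk_stats",
                "tableId": "occupations",
            },
            "quasiIds": [
                {"field": {"name": "job_title"}, "customTag": "occupation"}
            ],
            "relativeFrequency": {"name": "frequency"},
        }
    ],
}
```

Note that each custom tag ("occupation" here) appears both on a quasi-identifier column and in exactly one column of one auxiliary table, as required above.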
