Re-identification risk analysis (or just risk analysis) is the process of analyzing sensitive data to find properties that might increase the risk of subjects being identified, or of sensitive information about individuals being revealed. You can use risk analysis methods before de-identification to help determine an effective de-identification strategy, or after de-identification to monitor for any changes or outliers.
De-identification is the process of removing identifying information from data. Cloud Data Loss Prevention (DLP) can detect and de-identify sensitive data for you according to how you've configured it to conform to your organization's requirements.
Conversely, re-identification is the process of matching up de-identified data with other available data to determine the person to whom the data belongs. Re-identification is most often talked about in the context of sensitive personal information, such as medical or financial data.
For more information about using Cloud DLP to measure various types of risk, see Measuring re-identification and disclosure risk.
Risk analysis terms and techniques
If you don't correctly or adequately de-identify sensitive data, you risk an attacker re-identifying the data or learning sensitive information about individuals, which can have serious privacy implications. Cloud DLP can help quantify this risk, according to several metrics.
Before diving into the metrics, we'll first define a few common terms:
- Identifiers: Identifiers can be used to uniquely identify an individual. For example, someone's full name or government ID number are considered identifiers.
- Quasi-identifiers: Quasi-identifiers don't uniquely identify an individual, but, when combined and cross-referenced with individual records, they can substantially increase the likelihood that an attacker will be able to re-identify an individual. For example, ZIP codes and ages are considered quasi-identifiers.
- Sensitive data: Sensitive data is data that is protected against unauthorized exposure. Attributes like health conditions, salary, criminal offenses, and geographic location are typically considered sensitive data. Note that there can be overlap between identifiers and sensitive data.
- Equivalence classes: An equivalence class is a group of rows with identical quasi-identifiers.
There are four techniques that Cloud DLP can use to quantify the level of risk associated with a dataset:
- k-anonymity: A property of a dataset that indicates the re-identifiability of its records. A dataset is k-anonymous if quasi-identifiers for each person in the dataset are identical to at least k – 1 other people also in the dataset.
- l-diversity: An extension of k-anonymity that additionally measures the diversity of sensitive values for each column in which they occur. A dataset has l-diversity if, for every set of rows with identical quasi-identifiers, there are at least l distinct values for each sensitive attribute.
- k-map: Computes re-identifiability risk by comparing a given de-identified dataset of subjects with a larger re-identification—or "attack"—dataset. Cloud DLP doesn't know the attack dataset, but it statistically models it by using publicly available data like the US Census, by using a custom statistical model (indicated as one or more BigQuery tables), or by extrapolating from the distribution of values in the input dataset. Each dataset—the sample dataset and the re-identification dataset—shares one or more quasi-identifier columns.
- Delta-presence (δ-presence): Estimates the probability that a given user in a larger population is present in the dataset. This is used when membership in the dataset is itself sensitive information. Similarly to k-map, Cloud DLP doesn't know the attack dataset, but statistically models it using publicly available data, user-specified distributions, or extrapolation from the input dataset.
When collecting data for research purposes, de-identification can be essential for helping maintain participants' privacy. At the same time, de-identification may result in a dataset losing its practical usefulness. k-anonymity was created from a need both to quantify the re-identifiability of a dataset and to balance the usefulness of de-identified data about people against the privacy of the people whose data is being used. It is a property of a dataset that can be used to assess the re-identifiability of records within the dataset.
As an example, consider a set of patient data:
| Patient ID | Full Name | ZIP Code | Age | Condition | ... |
|---|---|---|---|---|---|
| 746572 | John J. Jacobsen | 98122 | 29 | Heart disease | ... |
| 652978 | Debra D. Dreb | 98115 | 29 | Diabetes, Type II | ... |
| 075321 | Abraham A. Abernathy | 98122 | 54 | Cancer, Liver | ... |
| 339012 | Karen K. Krakow | 98115 | 88 | Heart disease | ... |
| 995212 | William W. Wertheimer | 98115 | 54 | Asthma | ... |
This dataset contains all three types of data we described previously: identifiers, quasi-identifiers, and sensitive data.
If sensitive data like health conditions isn't masked or redacted, an attacker could use the quasi-identifiers attached to each record, possibly cross-referencing them with another dataset that contains similar quasi-identifiers, to re-identify the people to whom that sensitive data applies.
A dataset is said to be k-anonymous if every combination of values for its quasi-identifier columns appears in at least k different records. Recall that a group of rows with identical quasi-identifiers is called an "equivalence class." For example, if you've de-identified the quasi-identifiers enough that the smallest equivalence class contains four rows, the dataset's k-anonymity value is 4.
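The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than how Cloud DLP computes the value: the function name `k_anonymity` and the ZIP-prefix generalization are assumptions made for the example, and the rows reuse the ZIP code, age, and condition columns from the patient table.

```python
from collections import Counter

# (zip_code, age, condition) taken from the patient example table.
rows = [
    ("98122", 29, "Heart disease"),
    ("98115", 29, "Diabetes, Type II"),
    ("98122", 54, "Cancer, Liver"),
    ("98115", 88, "Heart disease"),
    ("98115", 54, "Asthma"),
]

def k_anonymity(rows, quasi_id_indexes):
    """Return the size of the smallest equivalence class.

    An equivalence class is the set of rows that share the same values
    in every quasi-identifier column.
    """
    classes = Counter(tuple(row[i] for i in quasi_id_indexes) for row in rows)
    return min(classes.values())

# Raw data: every (ZIP code, age) pair is unique, so k is 1.
k_raw = k_anonymity(rows, (0, 1))

# Hypothetical generalization: drop age and keep only a 3-digit ZIP prefix.
generalized = [(zip_code[:3] + "**", condition) for zip_code, _, condition in rows]
k_generalized = k_anonymity(generalized, (0,))

print(k_raw, k_generalized)
```

With this sample, the raw rows form five equivalence classes of size one, while the generalized rows collapse into a single class of five, which shows the trade-off between k and the detail retained in the data.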
Entity IDs and computing k-anonymity
When calculating k-anonymity, Cloud DLP supports an optional entity identifier (ID). An entity ID enables you to more accurately determine k-anonymity in the common scenario wherein several rows of your dataset correspond to the same user. Otherwise, if every row, regardless of user, is counted separately, the total user count used for calculating the dataset's k-anonymity value is made artificially high. This makes computed k-anonymity values inaccurate.
Consider the following simple set of data:
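| User ID | ZIP code |
|---|---|
| 01 | 42000 |
| 02 | 17000 |
| 02 | 42000 |
| 03 | 17000 |
| 03 | 42000 |
| 03 | 42000 |
| 04 | 42000 |
| 04 | 17000 |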
Without using an entity ID to note when different rows belong to the same user, the total user count that is used when calculating k-anonymity is 8, even though the actual number of users is 4. In this dataset, using traditional k-anonymity calculation methods (without an entity ID), 3 people have a k-anonymity value of 3, and 5 people have a k-anonymity value of 5, even though there are just 4 actual people in the database.
Using an entity ID causes Cloud DLP to consider the multiset of ZIP codes that a user is associated with as a quasi-identifier when calculating k-anonymity. In the case of our example, there are actually three "composite" quasi-identifier values because there are three distinct combinations of quasi-identifiers that are assigned to users: 42000, the multiset of 17000 and 42000, and the multiset of 17000, 42000, and 42000. They correspond to users as follows:
- [42000] is associated with 1 unique user (01).
- [17000, 42000] is associated with 2 unique users (02 and 04).
- [17000, 42000, 42000] is associated with 1 unique user (03).
As you can see, this method takes into account that users may occur more than once in our ZIP code database, and it treats them accordingly when calculating k-anonymity.
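The difference between the two counting methods can be illustrated with a short Python sketch. It is an approximation of the idea, not Cloud DLP's implementation; the rows are chosen to match the counts described in the text (8 rows, 4 users), and representing a multiset as a sorted tuple is an assumption made for the example.

```python
from collections import Counter, defaultdict

# (user_id, zip_code) rows consistent with the example above.
rows = [
    ("01", "42000"),
    ("02", "17000"), ("02", "42000"),
    ("03", "17000"), ("03", "42000"), ("03", "42000"),
    ("04", "42000"), ("04", "17000"),
]

# Without an entity ID, every row is treated as a separate person.
row_level = Counter(zip_code for _, zip_code in rows)

# With an entity ID, all ZIP codes belonging to one user are collected into a
# multiset (here a sorted tuple), and that multiset is the quasi-identifier.
zips_per_user = defaultdict(list)
for user_id, zip_code in rows:
    zips_per_user[user_id].append(zip_code)

composite = Counter(tuple(sorted(zips)) for zips in zips_per_user.values())
k_with_entity_id = min(composite.values())

print(dict(row_level))
print(dict(composite))
print(k_with_entity_id)
```

Counting by row suggests a smallest group of 3, whereas counting by user shows that users 01 and 03 each have a composite quasi-identifier that nobody else shares.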
For more information about k-anonymity, see Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression, by Pierangela Samarati and Latanya Sweeney of the Harvard University Data Privacy Lab.
To learn how to compute k-anonymity with Cloud DLP, with or without entity IDs, see Computing k-anonymity for a dataset.
l-diversity is closely related to k-anonymity, and was created to help address a de-identified dataset's susceptibility to attacks such as:
- Homogeneity attacks, in which attackers predict sensitive values for a set of k-anonymized data by taking advantage of the homogeneity of values within a set of k records.
- Background knowledge attacks, in which attackers take advantage of associations between quasi-identifier values that have a certain sensitive attribute to narrow down the attribute's possible values.
l-diversity attempts to measure how much an attacker can learn about people in terms of k-anonymity and equivalence classes (sets of rows with identical quasi-identifier values). A dataset has l-diversity if, for every equivalence class, there are at least l unique values for each sensitive attribute. In other words, it asks how many distinct sensitive values occur within each equivalence class. For example, if l-diversity = 1, at least one equivalence class contains only a single sensitive value, so everyone in that class is known to share it; if l-diversity = 2, every equivalence class contains at least two distinct sensitive values; and so on.
For more information about l-diversity, see l-Diversity: Privacy Beyond k-Anonymity, by Ashwin Machanavajjhala, Johannes Gehrke, and Daniel Kifer of the Cornell University Department of Computer Science.
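As a rough sketch of this definition, the following Python computes l for a single sensitive column. The function name `l_diversity` and the generalized rows are hypothetical and chosen only to show a homogeneity problem; this is not Cloud DLP's implementation.

```python
from collections import defaultdict

def l_diversity(rows, quasi_id_indexes, sensitive_index):
    """Return the smallest number of distinct sensitive values
    found in any equivalence class."""
    values_by_class = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in quasi_id_indexes)
        values_by_class[key].add(row[sensitive_index])
    return min(len(values) for values in values_by_class.values())

# Hypothetical generalized rows: (zip_prefix, age_range, condition).
rows = [
    ("981**", "20-29", "Heart disease"),
    ("981**", "20-29", "Diabetes, Type II"),
    ("981**", "50-59", "Asthma"),
    ("981**", "50-59", "Asthma"),
]

l_value = l_diversity(rows, quasi_id_indexes=(0, 1), sensitive_index=2)
print(l_value)
```

These rows are 2-anonymous, yet the second equivalence class holds only one condition, so l is 1: an attacker who knows someone falls in that class learns their condition without re-identifying a specific row.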
To learn how to compute l-diversity with Cloud DLP, see Computing l-diversity for a dataset.
k-map is very similar to k-anonymity, except that it assumes that the attacker most likely doesn't know who is in the dataset. Use k-map if your dataset is relatively small, or if the level of effort involved in generalizing attributes would be too high.
Just like k-anonymity, k-map requires you to determine which columns of your database are quasi-identifiers. In doing this, you are stating what data an attacker will most likely use to re-identify subjects. In addition, computing a k-map value requires a re-identification dataset: a larger table with which to compare rows in the original dataset.
Consider the following small example dataset. This sample data is part of a larger hypothetical database, gathered from a survey whose answers included sensitive information.
| ZIP code | Age |
|---|---|
| 85535 | 79 |
| 60629 | 42 |
Taken on its own, this appears to be the same amount of information for both individuals. In fact, considering k-anonymity for the larger dataset might lead to the assertion that the subject corresponding to the second row is highly identifiable. However, if you back up and consider the data, you'll realize it's not. In particular, consider the United States ZIP code 85535, in which about 20 people currently live. There is probably only one person of exactly 79 years of age living in the 85535 ZIP code. Compare this to ZIP code 60629, which is part of the Chicago metropolitan area and houses over 100,000 people. There are approximately 1,000 people of exactly 42 years of age in that ZIP code.
The first row in our small dataset was easily re-identified, but not the second. According to k-anonymity, however, both rows might be completely unique in the larger dataset.
k-map, like k-anonymity, requires you to determine which columns of your database are quasi-identifiers. Cloud DLP's risk analysis APIs simulate a re-identification dataset to approximate the data an attacker might compare against the original dataset in order to re-identify it. For our previous example, since it deals in US locations (ZIP codes) and personal data (ages), and since we assume that the attacker doesn't know who participated in the survey, the re-identification dataset could be everyone living in the US.
Now that you have quasi-identifiers and a re-identification dataset, you can compute the k-map value: Your data satisfies k-map with value k if every combination of values for the quasi-identifiers appears at least k times in the re-identification dataset.
Given this definition, and that the first row in our database likely only corresponds to one person in the US, the example dataset doesn't satisfy a k-map value requirement of 2 or more. To get a larger k-map value, we could remove age values like we've done here:
| ZIP code | Age |
|---|---|
| 85535 | ** |
| 60629 | ** |
As previously mentioned, the 85535 ZIP code has about 20 people and 60629 has over 100,000. Therefore, we can estimate that this new, generalized dataset has a k-map value of around 20.
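The k-map reasoning above can be sketched in Python. The population counts below are the rough figures quoted in the text, hard-coded as assumptions; Cloud DLP instead estimates such counts from a statistical model, and the `k_map` helper is a hypothetical name used only for this illustration.

```python
def k_map(quasi_id_rows, population_counts):
    """Return the smallest number of people in the re-identification
    dataset who match any row's quasi-identifier values."""
    return min(population_counts.get(tuple(row), 0) for row in quasi_id_rows)

# Assumed re-identification dataset counts, using the estimates from the text.
population_by_zip_and_age = {
    ("85535", 79): 1,      # probably only one 79-year-old in ZIP 85535
    ("60629", 42): 1000,   # roughly 1,000 42-year-olds in ZIP 60629
}
population_by_zip = {
    ("85535",): 20,        # about 20 residents
    ("60629",): 100_000,   # over 100,000 residents
}

# Original rows: ZIP code and age are both quasi-identifiers.
k_before = k_map([("85535", 79), ("60629", 42)], population_by_zip_and_age)

# Generalized rows: age removed, so only ZIP code remains.
k_after = k_map([("85535",), ("60629",)], population_by_zip)

print(k_before, k_after)
```

Note that the result is driven entirely by the least-populated combination: removing age raises the estimate from 1 to about 20 because of ZIP code 85535 alone.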
For more information about k-map and its relationship to k-anonymity, see Protecting Privacy Using k-Anonymity, by Khaled El Emam and Fida Kamal Dankar, in the Journal of the American Medical Informatics Association.
To learn how to compute k-map estimates with Cloud DLP, see Computing k-map for a dataset.
Delta-presence (δ-presence) estimates the risk associated with an attacker who wants to find out whether their target is in the dataset. This is slightly different from re-identification risk in that the goal is not to find which exact record corresponds to which individual, only to know whether an individual is part of the dataset. Using this metric is particularly appropriate if all individuals in the dataset share a common sensitive attribute; for example, they all have the same medical diagnosis.
Like the other risk metrics, δ-presence requires you to determine which columns of your database are quasi-identifiers. In doing this, you are stating what data an attacker will most likely use to find out which individuals are in the dataset. Like k-map, computing δ-presence requires an attack dataset: a larger table with which to compare rows in the original dataset.
Consider the following small example dataset. This sample data is part of a larger hypothetical database of people with a certain genetic disease.
| ZIP code | Age |
|---|---|
| 85942 | 72 |
| 85942 | 72 |
| 77970 | 53 |
In the United States ZIP code 85942, there are approximately 2 people aged 72, and in ZIP code 77970, there are approximately 5 people aged 53. The first two records are not exactly re-identifiable because both have the same quasi-identifiers. But since only two individuals share these quasi-identifiers in the larger population, an attacker can deduce that both of them suffer from the genetic disease. δ-presence quantifies this particular risk by computing the ratio of people with certain quasi-identifiers that are in the dataset.
δ-presence, like the other risk metrics, requires you to determine which columns of your database are quasi-identifiers. As with k-map estimation, Cloud DLP's risk analysis APIs simulate a population dataset to approximate the dataset that an attacker might use to find out who is in the dataset. For our previous example, since it deals in US locations (ZIP codes) and personal data (ages), and since we assume that the attacker doesn't know who has the genetic disease, this population dataset could be everyone living in the US.
Now that you have quasi-identifiers and a population dataset, you can compute the δ-presence value: your data satisfies δ-presence with value δ if every combination of values for the quasi-identifiers appears at most δ * k times in your dataset, where k is the total number of people with these quasi-identifier values in the population dataset. Unlike k in k-anonymity or k-map, the δ in δ-presence is a real number between 0 and 1.
Given this definition, and that both people of age 72 in ZIP code 85942 in the general population are also in our database, this dataset doesn't satisfy δ-presence for any δ strictly smaller than 1. To get a lower δ-presence value, we could remove the age value of the first two rows:
| ZIP code | Age |
|---|---|
| 85942 | ** |
| 85942 | ** |
| 77970 | 53 |
Now, since 80 people live in ZIP code 85942, the δ value for the first two records is approximately 2 / 80 = 2.5%; and the δ value for the third record is approximately 1 / 5 = 20%. Therefore, we can estimate that this new, generalized dataset has a δ-presence value of around 20%.
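The arithmetic above can be reproduced with a small Python sketch. The population counts are the approximate figures from the text, hard-coded as assumptions, and `delta_presence` is a hypothetical helper name; Cloud DLP estimates the population side from a statistical model rather than from a lookup table like this one.

```python
from collections import Counter

def delta_presence(quasi_id_rows, population_counts):
    """Return the largest ratio, over all quasi-identifier combinations,
    of people in the dataset to people in the population."""
    in_dataset = Counter(tuple(row) for row in quasi_id_rows)
    return max(count / population_counts[key] for key, count in in_dataset.items())

# Assumed population counts, using the estimates from the text.
# None stands for a removed (generalized) age value.
population = {
    ("85942", 72): 2,
    ("77970", 53): 5,
    ("85942", None): 80,
}

original = [("85942", 72), ("85942", 72), ("77970", 53)]
generalized = [("85942", None), ("85942", None), ("77970", 53)]

delta_before = delta_presence(original, population)    # max(2/2, 1/5)
delta_after = delta_presence(generalized, population)  # max(2/80, 1/5)

print(delta_before, delta_after)
```

After generalization the third record becomes the limiting case, which is why the dataset's overall δ is about 20% rather than 2.5%.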
For more information about δ-presence estimation based on statistical data, see δ-Presence Without Complete World Knowledge, by Mehmet Ercan Nergiz and Chris Clifton from the Purdue University Department of Computer Science Technical Reports.
To learn how to compute δ-presence estimates with Cloud DLP, see Computing δ-presence for a dataset.