Re-identification risk analysis, or just risk analysis, is the process of analyzing sensitive data to find properties that might increase the risk of subjects being identified. You can use risk analysis methods before de-identification to help determine an effective de-identification strategy or after de-identification to monitor for any changes or outliers.
Cloud Data Loss Prevention (DLP) can compute four re-identification risk metrics: k-anonymity, l-diversity, k-map, and δ-presence. If you're not familiar with risk analysis or these metrics, see the risk analysis concept topic before continuing.
This section gives an overview of how to use Cloud DLP to analyze the risk in structured data with any of these metrics, and covers related topics.
Calculate re-identification risk
Cloud DLP can analyze your structured data stored in BigQuery tables and compute the following re-identification risk metrics. Click the link for the metric you want to calculate to learn more.
|k-anonymity||A property of a dataset that indicates the re-identifiability of its records. A dataset is k-anonymous if the quasi-identifiers for each person in the dataset are identical to those of at least k – 1 other people in the dataset.|
|l-diversity||An extension of k-anonymity that additionally measures the diversity of sensitive values for each column in which they occur. A dataset has l-diversity if, for every set of rows with identical quasi-identifiers, there are at least l distinct values for each sensitive attribute.|
|k-map||Computes re-identifiability risk by comparing a given de-identified dataset of subjects with a larger re-identification—or "attack"—dataset.|
|δ-presence||Estimates the probability that a given user in a larger population is present in the dataset. This is used when membership in the dataset is itself sensitive information.|
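
The k-anonymity and l-diversity definitions in the table above can be illustrated with a small local sketch. This is not how Cloud DLP computes the metrics, and it does not call the Cloud DLP API; the function names, column names, and sample rows are hypothetical and chosen only to show the definitions on a list of in-memory records.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_ids):
    """Group rows that share identical values for every quasi-identifier."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        classes[key].append(row)
    return classes


def k_anonymity(rows, quasi_ids):
    """Return the size of the smallest equivalence class (0 if no rows)."""
    classes = equivalence_classes(rows, quasi_ids)
    return min((len(members) for members in classes.values()), default=0)


def l_diversity(rows, quasi_ids, sensitive_attr):
    """Return the fewest distinct sensitive values found in any class."""
    classes = equivalence_classes(rows, quasi_ids)
    return min(
        (len({r[sensitive_attr] for r in members}) for members in classes.values()),
        default=0,
    )


# Hypothetical sample data: two quasi-identifiers and one sensitive attribute.
rows = [
    {"age_range": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_range": "30-39", "zip3": "941", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_range": "40-49", "zip3": "100", "diagnosis": "diabetes"},
    {"age_range": "40-49", "zip3": "100", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["age_range", "zip3"])
l = l_diversity(rows, ["age_range", "zip3"], "diagnosis")
print(f"k-anonymity: {k}, l-diversity: {l}")
```

In this sample the smallest group of rows with identical quasi-identifiers has two members, so the dataset is 2-anonymous. That same group holds only one distinct diagnosis, so the l-diversity value is 1: anyone known to be in the group has their diagnosis revealed even though they cannot be singled out.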
Calculate other statistics
Cloud DLP can also compute numerical and categorical statistics for data stored in BigQuery tables using the same DlpJob resource as the risk analysis APIs.
|Numerical statistics||Determines minimum, maximum, and quantile values for an individual BigQuery column.|
|Categorical statistics||Computes statistics for the individual histogram buckets of categorical values within a BigQuery column.|
For more information, see Computing numerical and categorical statistics.
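
As a rough illustration of what these two kinds of statistics describe, the following sketch computes them locally for a single column of values. It is only an approximation under stated assumptions: the helper names and sample columns are hypothetical, quartiles stand in for whatever quantiles the service returns, and a plain value count stands in for histogram buckets. It does not call Cloud DLP or BigQuery.

```python
import statistics
from collections import Counter


def numerical_stats(values):
    """Return the minimum, maximum, and quartile cut points of a numeric column."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        # Three cut points that split the sorted values into four groups.
        "quartiles": statistics.quantiles(ordered, n=4, method="inclusive"),
    }


def categorical_stats(values):
    """Count how often each distinct value occurs in a categorical column."""
    return Counter(values)


# Hypothetical column contents.
ages = [23, 35, 35, 41, 58]
states = ["CA", "NY", "CA", "WA", "CA"]

print(numerical_stats(ages))
print(categorical_stats(states).most_common())
```

Values that appear only once in a categorical column, or numeric values far from the rest of the distribution, are the kind of outliers worth reviewing before choosing a de-identification strategy.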
Visualize re-identification risk
|Data Studio||After calculating k-anonymity values for a dataset using Cloud DLP, you can visualize the results in Google Data Studio. Doing so helps you better understand re-identification risk and evaluate the trade-offs in utility you might be making if you redact or de-identify data.|