This topic shows how to measure k-anonymity of a dataset using Sensitive Data Protection and visualize it in Looker Studio. By doing so, you'll also be able to better understand risk and help evaluate the trade-offs in utility you might be making if you redact or de-identify data.
Though the focus of this topic is on visualizing the k-anonymity re-identification risk analysis metric, you can also visualize the l-diversity metric using the same methods.
This topic assumes you're already familiar with the concept of k-anonymity and its utility for assessing the re-identifiability of records within a dataset. It will also be helpful to be at least somewhat familiar with how to compute k-anonymity using Sensitive Data Protection, and with using Looker Studio.
Introduction
De-identification techniques can be helpful in protecting your subjects' privacy while you process or use data. But how do you know if a dataset has been sufficiently de-identified? And how will you know whether your de-identification has resulted in too much data loss for your use case? That is, how can you compare re-identification risk with the utility of the data to help make data-driven decisions?
Calculating the k-anonymity value of a dataset helps answer these questions by assessing the re-identifiability of the dataset's records. Sensitive Data Protection contains built-in functionality to calculate a k-anonymity value on a dataset based on quasi-identifiers that you specify. This helps enable you to quickly evaluate whether de-identifying a certain column or combination of columns will result in a dataset that is more or less likely to be re-identified.
Example dataset
Following are the first few rows of a large example dataset.
user_id | age | title | score
---|---|---|---
602-61-8588 | 24 | Biostatistician III | 733
771-07-8231 | 46 | Executive Secretary | 672
618-96-2322 | 69 | Programmer I | 514
... | ... | ... | ...
For the purposes of this tutorial, user_id will not be addressed, as the focus is on quasi-identifiers. In a real-world scenario, you would want to ensure that user_id is redacted or tokenized appropriately. The score column is proprietary to this dataset, and it's unlikely an attacker would be able to learn it by other means, so you will not include it in the analysis. Your focus will be on the remaining age and title columns, through which an attacker could potentially learn about an individual from other sources of data. The questions you're trying to answer for the dataset are:
- What effect will the two quasi-identifiers, age and title, have on the overall re-identification risk of the de-identified data?
- How will applying a de-identification transformation affect this risk?
You want to be sure that the combination of age and title won't map to a small number of users. For example, suppose there is only one user in the dataset whose title is Programmer I and who is age 69. An attacker might be able to cross-reference that information with demographics or other available information, figure out who the person is, and learn the value of their score. For more information about this phenomenon, see the "Entity IDs and computing k-anonymity" section in the Risk analysis conceptual topic.
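To make this risk concrete, the following sketch (plain Python, not part of Sensitive Data Protection) counts how many records share each age and title combination on a few toy rows modeled after the example dataset. Any combination with a count of 1 is a group that an attacker could potentially single out:

```python
from collections import Counter

# Toy rows mirroring the example dataset: (age, title)
rows = [
    (24, "Biostatistician III"),
    (46, "Executive Secretary"),
    (69, "Programmer I"),
    (24, "Biostatistician III"),
]

# Count the records in each quasi-identifier group.
group_sizes = Counter(rows)

# The k-anonymity value is the size of the smallest group: every record
# is indistinguishable from at least k-1 others on the quasi-identifiers.
k = min(group_sizes.values())

# Combinations that map to exactly one record are the riskiest.
unique_combos = [combo for combo, n in group_sizes.items() if n == 1]

print(k)              # -> 1
print(unique_combos)  # includes the lone 69-year-old Programmer I
```

In this toy data the lone 69-year-old Programmer I makes k equal to 1. Sensitive Data Protection performs this same style of grouping at scale, with additional handling for entity IDs.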
Step 1: Calculate k-anonymity on the dataset
First, use Sensitive Data Protection to calculate k-anonymity on the dataset by sending the following JSON to the DlpJob resource. Within this JSON, you set the entity ID to the user_id column and identify the age and title columns as the two quasi-identifiers. You also instruct Sensitive Data Protection to save the results to a new BigQuery table.
JSON input:
POST https://dlp.googleapis.com/v2/projects/dlp-demo-2/dlpJobs

{
  "riskJob": {
    "sourceTable": {
      "projectId": "dlp-demo-2",
      "datasetId": "dlp_testing",
      "tableId": "dlp_test_data_kanon"
    },
    "privacyMetric": {
      "kAnonymityConfig": {
        "entityId": {
          "field": {
            "name": "user_id"
          }
        },
        "quasiIds": [
          { "name": "age" },
          { "name": "title" }
        ]
      }
    },
    "actions": [
      {
        "saveFindings": {
          "outputConfig": {
            "table": {
              "projectId": "dlp-demo-2",
              "datasetId": "dlp_testing",
              "tableId": "test_results"
            }
          }
        }
      }
    ]
  }
}
Once the k-anonymity job has completed, Sensitive Data Protection sends the job results to a BigQuery table named dlp-demo-2.dlp_testing.test_results.
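If you prefer to build the request programmatically before sending it (for example, with any HTTP client or the google-cloud-dlp library), you can assemble the same body in code. The helper below is a hypothetical convenience function for illustration; the project, dataset, and table names are the example values from this tutorial:

```python
# Hypothetical helper that assembles the k-anonymity risk-job request
# body shown above. Substitute your own project, dataset, and tables.
def k_anonymity_risk_job(project, dataset, source_table,
                         results_table, quasi_ids, entity_id):
    return {
        "riskJob": {
            "sourceTable": {
                "projectId": project,
                "datasetId": dataset,
                "tableId": source_table,
            },
            "privacyMetric": {
                "kAnonymityConfig": {
                    "entityId": {"field": {"name": entity_id}},
                    "quasiIds": [{"name": q} for q in quasi_ids],
                }
            },
            "actions": [
                {
                    "saveFindings": {
                        "outputConfig": {
                            "table": {
                                "projectId": project,
                                "datasetId": dataset,
                                "tableId": results_table,
                            }
                        }
                    }
                }
            ],
        }
    }

body = k_anonymity_risk_job(
    "dlp-demo-2", "dlp_testing",
    "dlp_test_data_kanon", "test_results",
    quasi_ids=["age", "title"], entity_id="user_id",
)
```

Building the body this way makes it easy to rerun the analysis with different quasi-identifier sets and compare the resulting charts.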
Step 2: Connect results to Looker Studio
Next, you'll connect the BigQuery table you produced in Step 1 to a new report in Looker Studio.
- Open Looker Studio.
- Click Create > Report.
- In the Add data to report pane, under Connect to data, click BigQuery. You may need to authorize Looker Studio to access your BigQuery tables.
- In the column picker, select My projects. Then choose the project, dataset, and table. When you're done, click Add. If you see a notice that you're about to add data to this report, click Add to report.
The k-anonymity scan results have now been added to the new Looker Studio report. In the next step, you'll create the chart.
Step 3: Create the chart
Do the following to insert and configure the chart:
- In Looker Studio, if a table of values appears, select it and press Delete to remove it.
- On the Insert menu, click Combo chart.
- Click and draw a rectangle on the canvas where you want the chart to appear.
Next, configure the chart data under the Data tab so that the chart shows the effect of varying the size and value ranges of buckets:
- Clear the fields under the following headings by pointing to each field and clicking the X:
  - Date Range Dimension
  - Dimension
  - Metric
  - Sort
- With all fields cleared, drag the upper_endpoint field from the Available fields column to the Dimension heading.
- Drag the upper_endpoint field to the Sort heading, and then select Ascending.
- Drag both the bucket_size and bucket_value_count fields to the Metric heading.
- Point to the icon to the left of the bucket_size metric so that an Edit icon appears. Click the Edit icon, and then do the following:
  - In the Name field, type Unique row loss.
  - Under Type, choose Percent.
  - Under Comparison calculation, choose Percent of total.
  - Under Running calculation, choose Running sum.
- Repeat the previous step for the bucket_value_count metric, but in the Name field, type Unique quasi-identifier combination loss.

Once you're done, both renamed metrics appear under the Metric heading.
Finally, configure the chart to display a line chart for both metrics:
- Click the Style tab in the pane on the right of the window.
- For both Series #1 and Series #2, choose Line.
- To view the final chart on its own, click the View button in the upper-right corner of the window.
Following is an example chart after completing the preceding steps.
Interpreting the chart
The generated chart plots, on the y-axis, the potential percentage of data loss for both unique rows and unique quasi-identifier combinations to achieve, on the x-axis, a k-anonymity value.
Higher k-anonymity values indicate less risk of re-identification. To achieve higher k-anonymity values, however, you would need to remove higher percentages of the total rows and of the unique quasi-identifier combinations, which might decrease the utility of the data.
Fortunately, dropping data is not your only option for reducing re-identification risk. Other de-identification techniques can strike a better balance between loss and utility. For example, to address the kind of data loss associated with higher k-anonymity values in this dataset, you could bucket ages or job titles to reduce the uniqueness of age and job title combinations, such as generalizing ages into ranges of 20-25, 25-30, 30-35, and so on. For more information about how to do this, see Generalization and bucketing and De-identifying sensitive data in text content.
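The following sketch (plain Python, toy data for illustration) shows how generalizing exact ages into ranges can raise the k-anonymity value without dropping any rows. The bucket_age helper is hypothetical; Sensitive Data Protection provides its own bucketing transformations:

```python
from collections import Counter

def bucket_age(age, width=5, low=20):
    # Generalize an exact age into a half-open range like "20-25".
    start = low + ((age - low) // width) * width
    return f"{start}-{start + width}"

rows = [(24, "Biostatistician III"), (46, "Executive Secretary"),
        (69, "Programmer I"), (21, "Biostatistician III")]

# Before bucketing: every exact (age, title) pair here is unique, so k = 1.
raw_k = min(Counter(rows).values())

# After bucketing: ages 24 and 21 both generalize to "20-25", merging two
# records into one quasi-identifier group of size 2.
bucketed = [(bucket_age(age), title) for age, title in rows]
bucketed_sizes = Counter(bucketed)

print(raw_k)                                             # -> 1
print(bucketed_sizes[("20-25", "Biostatistician III")])  # -> 2
```

The same trade-off the chart visualizes applies here: wider buckets mean larger groups and higher k, but coarser, less precise data.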