Visualizing re-identification risk using k‑anonymity

This topic shows how to measure the k-anonymity of a dataset using Cloud Data Loss Prevention (DLP) and visualize it in Google Data Studio. Doing so also helps you better understand re-identification risk and evaluate the trade-offs in utility that you might be making if you redact or de-identify data. Additional metrics such as l-diversity are also available in the API, but this topic focuses on k-anonymity.

Introduction

De-identification techniques can be very helpful in protecting your subjects' privacy while you process or use data. But how do you know if a dataset has been sufficiently de-identified? And how will you know whether your de-identification has resulted in too much data loss for your use case? That is, how can you compare re-identification risk with the utility of the data to help make data-driven decisions?

Calculating the k-anonymity value of a dataset helps answer these questions by assessing the re-identifiability of the dataset's records. Cloud DLP contains built-in functionality to calculate a k-anonymity value on a dataset based on quasi-identifiers that you specify. This enables you to quickly evaluate whether de-identifying a certain column or combination of columns will make the dataset more or less likely to be re-identified.

Example dataset

Following are the first few rows of a large example dataset.

user_id     zip_code   age   score
121317473   94043      25    52
121317474   92104      43    87
...         ...        ...   ...

For the purposes of this tutorial, user_id is not addressed, because the focus is on quasi-identifiers. In a real-world scenario, you would want to ensure that it is redacted or tokenized appropriately. The score column is proprietary to this dataset, and it's unlikely that an attacker could learn it by other means, so you don't include it in the analysis. Your focus is on the remaining zip_code and age columns, which an attacker could potentially link to an individual using other sources of data. The questions you're trying to answer for the dataset are:

  • What effect will the two quasi-identifiers—zip_code and age—have on the overall re-identification risk of the de-identified data?
  • How will applying a de-identification transformation affect this risk?

You want to be sure that the combination of zip_code and age won't map to a small number of users. For example, suppose there is only one user in the dataset who lives in ZIP code 94043 and who is age 25. An attacker might be able to cross-reference that information with demographics about the area or other available information, figure out who the person is, and learn the value of their score. For more information about this phenomenon, see the "Entity IDs and computing k-anonymity" section in the Risk analysis conceptual topic.
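
To make this concrete, the following is a minimal sketch, independent of Cloud DLP, of how equivalence class sizes relate to k-anonymity. It assumes the example dataset is available locally as a CSV file (the file name is illustrative; the column names mirror the table above) and counts how many rows share each zip_code/age combination. The smallest count is the dataset's k-anonymity value for those quasi-identifiers.

import pandas as pd

# Illustrative file name; columns mirror the example dataset above.
df = pd.read_csv("example_dataset.csv")  # user_id, zip_code, age, score

# Size of each quasi-identifier equivalence class: the number of rows that
# share a given zip_code/age combination.
class_sizes = df.groupby(["zip_code", "age"]).size()

# The smallest class size is the dataset's k-anonymity value for these
# quasi-identifiers.
print("k-anonymity:", class_sizes.min())

# Combinations shared by only one user are the easiest to re-identify.
print(class_sizes[class_sizes == 1])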

Step 1: Calculate k-anonymity on the dataset

First, use Cloud DLP to calculate k-anonymity on the dataset by sending the following JSON to the dlpJobs resource. Within this JSON, you set the entity ID to the user_id column and identify zip_code and age as the two quasi-identifiers. You also instruct Cloud DLP to save the results to a BigQuery table.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/dlpJobs?key={YOUR_API_KEY}

{
  "riskJob":{
    "privacyMetric":{
      "kAnonymityConfig":{
        "entityId":{
          "field":{
            "name":"user_id"
          }
        },
        "quasiIds":[
          {
            "name":"zip_code"
          },
          {
            "name":"age"
          }
        ]
      }
    },
    "actions":[
      {
        "saveFindings":{
          "outputConfig":{
            "table":{
              "projectId":"dlp-demo-2",
              "datasetId":"risk",
              "tableId":"risk1"
            }
          }
        }
      }
    ],
    "sourceTable":{
      "projectId":"dlp-demo-2",
      "datasetId":"deid",
      "tableId":"source1"
    }
  }
}
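
If you prefer to start the job from code rather than with a raw REST call, the following sketch uses the google-cloud-dlp Python client library to submit the same k-anonymity risk job. It mirrors the JSON request above; the project, dataset, and table names are the same example values and should be replaced with your own.

from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

# Example project from the JSON request above; replace with your own.
parent = "projects/dlp-demo-2"

risk_job = {
    "privacy_metric": {
        "k_anonymity_config": {
            "entity_id": {"field": {"name": "user_id"}},
            "quasi_ids": [{"name": "zip_code"}, {"name": "age"}],
        }
    },
    "source_table": {
        "project_id": "dlp-demo-2",
        "dataset_id": "deid",
        "table_id": "source1",
    },
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "dlp-demo-2",
                        "dataset_id": "risk",
                        "table_id": "risk1",
                    }
                }
            }
        }
    ],
}

job = client.create_dlp_job(request={"parent": parent, "risk_job": risk_job})
print("Started risk analysis job:", job.name)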

Once the k-anonymity job has completed, Cloud DLP sends the job results to a BigQuery table named dlp-demo-2.risk.risk1.

Step 2: Connect results to Google Data Studio

Next, you'll connect the BigQuery table you produced in Step 1 to a new report in Google Data Studio.

  1. Open Google Data Studio, and then click the Blank option under Start a new report.
  2. In the Add a data source pane on the right, click Create New Data Source at the bottom.
  3. In the Google Connectors section, point to BigQuery, and then click Select.
  4. On the BigQuery data source page, choose the project, dataset, and table from the column picker. For this example, choose dlp-demo-2 for the project, risk for the dataset, and risk1 for the table.
  5. Click the Connect button, which turns blue once you've chosen a value in all three columns. After you connect, a color-coded list of the table's fields appears in Data Studio.
  6. In the Field column, find the upper_endpoint field. In its row under Aggregation, choose Sum from the drop-down menu.
  7. Click Add to Report.

The k-anonymity scan results have now been added to the new Data Studio report. In the next step, you'll create the chart.

Step 3: Create the chart

Finally, you create the chart based on the imported fields. Do the following to insert the chart:

  1. In Data Studio, on the Insert menu, click Combo chart.
  2. Click and draw a rectangle on the editor page where you want the chart to appear.

Next, configure the chart data so that the chart shows the effect of varying the size and value ranges of buckets:

  1. Under the Data tab on the right, remove the Date Range Dimension by pointing to timestamp and then clicking the circled X.
  2. Drag the upper_endpoint field into the Dimension and Sort fields in the right column, and then select Ascending from the drop-down menu under the Sort field.
  3. Drag the bucket_size and bucket_value_count fields into the Metric field, and then remove any other Metric selections in the right column.
  4. Point to the icon to the left of the bucket_size metric and an edit (pencil) icon appears. Click the edit icon, and then select both of the following from the corresponding drop-down menus:

    • Display as > Percent of total
    • Apply running calculation > Running sum
  5. Repeat the previous step, but for the bucket_value_count metric.

Finally, configure the chart to display a line chart for both metrics:

  1. Click the STYLE tab in the pane on the right of the window.
  2. For both series (#1 and #2, which represent bucket_size and bucket_value_count), choose Line.
  3. To view the final chart on its own, click the View button in the upper-right corner of the window.

Final chart, with k-anonymity = 10 highlighted.

Interpreting the chart

The generated chart plots k-anonymity values on its x-axis and the percentage of data loss on its y-axis. For example, the highlighted data point in the final chart is a k-anonymity value of 10. This can be interpreted as follows: if you drop every row whose combination of age and ZIP code is shared by 10 or fewer rows (that is, every equivalence class of size at most 10), you lose 82% of the rows in the dataset. You also lose 92% of the unique age/ZIP code combinations. The chart illustrates that it is difficult to achieve a k-anonymity value above 2 or 3 in this dataset without dropping a significant number of rows and values.
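
The percentages in the chart come from a straightforward running-sum calculation over the histogram buckets that Cloud DLP writes to BigQuery. The following sketch reproduces that calculation in Python. It assumes the results table exposes the upper_endpoint, bucket_size, and bucket_value_count fields as flat columns, as the Data Studio field list in Step 2 suggests; adjust the query to the actual schema of your table.

from google.cloud import bigquery

# Read the k-anonymity results written by the job in Step 1. This assumes the
# fields used in Data Studio are available as flat columns in the table.
client = bigquery.Client(project="dlp-demo-2")
results = client.query(
    "SELECT upper_endpoint, bucket_size, bucket_value_count "
    "FROM `dlp-demo-2.risk.risk1`"
).to_dataframe()

results = results.sort_values("upper_endpoint")

# Running sum as a percent of total, matching the chart configuration in
# Step 3: rows lost, and unique age/ZIP code combinations lost, if every
# equivalence class up to that k value is dropped.
results["rows_lost_pct"] = (
    100 * results["bucket_size"].cumsum() / results["bucket_size"].sum()
)
results["combos_lost_pct"] = (
    100 * results["bucket_value_count"].cumsum() / results["bucket_value_count"].sum()
)

print(results)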

Fortunately, dropping data is not the only option. Other de-identification techniques can strike a better balance between data loss and utility. To address the kind of data loss associated with achieving higher k-anonymity values in this dataset, you could try bucketing ages or ZIP codes to reduce the uniqueness of age/ZIP code combinations. For example, you could bucket ages into ranges of 20-25, 25-30, 30-35, and so on, as in the sketch below. For more information about how to do this, see Generalization and bucketing and De-identifying sensitive data in text content.
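
As a rough illustration of why generalization helps, the following sketch (reusing the hypothetical local CSV from the earlier example) buckets ages into 5-year ranges and recomputes the equivalence class sizes. Larger classes mean a higher k-anonymity value without dropping rows.

import pandas as pd

df = pd.read_csv("example_dataset.csv")  # illustrative file name, as above

# Generalize exact ages into 5-year buckets such as [20, 25), [25, 30), ...
df["age_bucket"] = pd.cut(df["age"], bins=range(0, 105, 5), right=False)

k_before = df.groupby(["zip_code", "age"]).size().min()
k_after = df.groupby(["zip_code", "age_bucket"], observed=True).size().min()
print(f"k before bucketing: {k_before}, after bucketing: {k_after}")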
