Visualizing re-identification risk using Data Studio

This topic shows how to measure the k-anonymity of a dataset using Cloud Data Loss Prevention (DLP) and visualize it in Google Data Studio. Doing so helps you better understand re-identification risk and evaluate the utility trade-offs you make when you redact or de-identify data.

Though the focus of this topic is on visualizing the k-anonymity re-identification risk analysis metric, you can also visualize the l-diversity metric using the same methods.

This topic assumes you're already familiar with the concept of k-anonymity and its utility for assessing the re-identifiability of records within a dataset. It will also be helpful to be at least somewhat familiar with how to compute k-anonymity using Cloud DLP, and with using Data Studio.

Introduction

De-identification techniques can be helpful in protecting your subjects' privacy while you process or use data. But how do you know if a dataset has been sufficiently de-identified? And how will you know whether your de-identification has resulted in too much data loss for your use case? That is, how can you compare re-identification risk with the utility of the data to help make data-driven decisions?

Calculating the k-anonymity value of a dataset helps answer these questions by assessing the re-identifiability of the dataset's records. Cloud DLP includes built-in functionality to calculate the k-anonymity value of a dataset based on quasi-identifiers that you specify. This lets you quickly evaluate whether de-identifying a certain column, or combination of columns, will make the dataset more or less likely to be re-identified.

Example dataset

Following are the first few rows of a large example dataset.

user_id age title score
602-61-8588 24 Biostatistician III 733
771-07-8231 46 Executive Secretary 672
618-96-2322 69 Programmer I 514
... ... ... ...

For the purposes of this tutorial, user_id is not addressed, because the focus is on quasi-identifiers. In a real-world scenario, you would ensure that user_id is appropriately redacted or tokenized. The score column is proprietary to this dataset, and it's unlikely an attacker could learn it by other means, so you will not include it in the analysis. Your focus is on the remaining age and title columns, which an attacker could potentially cross-reference with other data sources to learn about an individual. The questions you're trying to answer for the dataset are:

  • What effect will the two quasi-identifiers—age and title—have on the overall re-identification risk of the de-identified data?
  • How will applying a de-identification transformation affect this risk?

You want to be sure that the combination of age and title won't map to a small number of users. For example, suppose there is only one user in the dataset whose title is Programmer I and who is age 69. An attacker might be able to cross-reference that information with demographics or other available information, figure out who the person is, and learn the value of their score. For more information about this phenomenon, see the "Entity IDs and computing k-anonymity" section in the Risk analysis conceptual topic.
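Conceptually, the k-anonymity of these quasi-identifiers is the size of the smallest group of records sharing the same (age, title) pair. A minimal Python sketch of that idea, using a few hypothetical rows shaped like the table above:

```python
from collections import Counter

# Hypothetical sample rows shaped like the example dataset:
# (user_id, age, title, score)
rows = [
    ("602-61-8588", 24, "Biostatistician III", 733),
    ("771-07-8231", 46, "Executive Secretary", 672),
    ("618-96-2322", 69, "Programmer I", 514),
    ("555-10-0000", 24, "Biostatistician III", 801),
]

# Group records by the (age, title) quasi-identifier combination.
combo_sizes = Counter((age, title) for _, age, title, _ in rows)

# k-anonymity is the size of the smallest equivalence class: every record
# is indistinguishable from at least k - 1 other records on the quasi-ids.
k = min(combo_sizes.values())
print(k)  # 1 -- (69, "Programmer I") maps to a single user
```

A k of 1 means at least one record is unique on its quasi-identifiers, which is exactly the cross-referencing risk described above.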

Step 1: Calculate k-anonymity on the dataset

First, use Cloud DLP to calculate k-anonymity on the dataset by sending the following JSON to the DlpJob resource. In this JSON, you set the entity ID to the user_id column and identify the age and title columns as the two quasi-identifiers. You also instruct Cloud DLP to save the results to a new BigQuery table.

JSON input:

POST https://dlp.googleapis.com/v2/projects/dlp-demo-2/dlpJobs

{
  "riskJob": {
    "sourceTable": {
      "projectId": "dlp-demo-2",
      "datasetId": "dlp_testing",
      "tableId": "dlp_test_data_kanon"
    },
    "privacyMetric": {
      "kAnonymityConfig": {
        "entityId": {
          "field": {
            "name": "user_id"
          }
        },
        "quasiIds": [
          {
            "name": "age"
          },
          {
            "name": "title"
          }
        ]
      }
    },
    "actions": [
      {
        "saveFindings": {
          "outputConfig": {
            "table": {
              "projectId": "dlp-demo-2",
              "datasetId": "dlp_testing",
              "tableId": "test_results"
            }
          }
        }
      }
    ]
  }
}
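If you are scripting the call rather than pasting raw JSON, you can assemble the same request body programmatically. A minimal Python sketch, using the project, dataset, and table names from this tutorial; actually sending the request (with an OAuth bearer token) is left out:

```python
import json

project_id = "dlp-demo-2"
endpoint = f"https://dlp.googleapis.com/v2/projects/{project_id}/dlpJobs"

def risk_job_body(source, quasi_ids, entity_field, output):
    """Build a k-anonymity riskJob request body for the DlpJob resource."""
    return {
        "riskJob": {
            "sourceTable": source,
            "privacyMetric": {
                "kAnonymityConfig": {
                    "entityId": {"field": {"name": entity_field}},
                    "quasiIds": [{"name": q} for q in quasi_ids],
                }
            },
            "actions": [{"saveFindings": {"outputConfig": {"table": output}}}],
        }
    }

body = risk_job_body(
    source={"projectId": project_id, "datasetId": "dlp_testing",
            "tableId": "dlp_test_data_kanon"},
    quasi_ids=["age", "title"],
    entity_field="user_id",
    output={"projectId": project_id, "datasetId": "dlp_testing",
            "tableId": "test_results"},
)
payload = json.dumps(body)  # POST this to `endpoint` with an Authorization header
```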

Once the k-anonymity job has completed, Cloud DLP sends the job results to a BigQuery table named dlp-demo-2.dlp_testing.test_results.

Step 2: Connect results to Data Studio

Next, you'll connect the BigQuery table you produced in Step 1 to a new report in Google Data Studio.

  1. Open Data Studio.

  2. Click Create > Report.

  3. In the Add data to report pane under Connect to data, click BigQuery. You may need to authorize Data Studio to access your BigQuery tables.

  4. In the column picker, select My projects. Then choose the project, dataset, and table. When you're done, click Add. If you see a notice that you're about to add data to this report, click Add to report.

The k-anonymity scan results have now been added to the new Data Studio report. In the next step, you'll create the chart.

Step 3: Create the chart

Do the following to insert and configure the chart:

  1. In Data Studio, if a table of values appears, select it and press Delete to remove it.
  2. On the Insert menu, click Combo chart.
  3. Click and draw a rectangle on the canvas where you want the chart to appear.

Next, configure the chart data under the Data tab so that the chart shows the effect of varying the size and value ranges of buckets:

  1. Clear the fields under the following headings by pointing to each field and clicking the X, as shown here:
    Detail of timestamp field with delete button enabled.
    • Date Range Dimension
    • Dimension
    • Metric
    • Sort
  2. With all fields cleared, drag the upper_endpoint field from the Available fields column to the Dimension heading.
  3. Drag the upper_endpoint field to the Sort heading, and then select Ascending.
  4. Drag both the bucket_size and bucket_value_count fields to the Metric heading.
  5. Point to the icon to the left of the bucket_size metric; an Edit icon appears. Click the Edit icon, and then do the following:
    1. In the Name field, type Unique row loss.
    2. Under Type, choose Percent.
    3. Under Comparison calculation, choose Percent of total.
    4. Under Running calculation, choose Running sum.
  6. Repeat the previous step for the bucket_value_count metric, but in the Name field, type Unique quasi-identifier combination loss.

Once you're done, the column should appear as shown here:

Screen shot of fields list.

Finally, configure the chart to display a line chart for both metrics:

  1. Click the Style tab in the pane on the right of the window.
  2. For both Series #1 and Series #2, choose Line.
  3. To view the final chart on its own, click the View button in the upper-right corner of the window.

Following is an example chart after completing the preceding steps.

Final chart

Interpreting the chart

The generated chart plots k-anonymity values on the x-axis against, on the y-axis, the percentage of data (both unique rows and unique quasi-identifier combinations) that you would need to drop to achieve each value.

Higher k-anonymity values indicate less risk of re-identification. To achieve higher k-anonymity values, however, you would need to remove higher percentages of the total rows and of the unique quasi-identifier combinations, which might decrease the utility of the data.
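To make this trade-off concrete: to guarantee a minimum k, every row in an equivalence class smaller than k must be dropped. A sketch with hypothetical class sizes:

```python
# Hypothetical equivalence-class sizes: how many rows share each
# (age, title) combination in the dataset.
class_sizes = [1, 1, 2, 3, 5, 8, 40, 140]

def rows_lost_for_k(sizes, k):
    """Percent of rows removed so every remaining class has at least k rows."""
    total = sum(sizes)
    lost = sum(s for s in sizes if s < k)
    return 100 * lost / total

print(rows_lost_for_k(class_sizes, 2))   # 1.0  -- only the two singleton rows
print(rows_lost_for_k(class_sizes, 10))  # 10.0 -- every class under 10 rows
```

Raising the target k from 2 to 10 costs ten times as much data in this example, which is the kind of curve the chart makes visible.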

Fortunately, dropping data is not your only option for reducing re-identification risk. Other de-identification techniques can strike a better balance between data loss and utility. For example, to address the data loss associated with higher k-anonymity values in this dataset, you could bucket ages or job titles to reduce the uniqueness of age and job title combinations, such as bucketing ages into ranges of 20-25, 25-30, 30-35, and so on. For more information about how to do this, see Generalization and bucketing and De-identifying sensitive data in text content.
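As an illustration of the bucketing idea, assuming simple five-year ranges, generalizing exact ages collapses distinct values into shared quasi-identifier groups and raises k:

```python
from collections import Counter

def age_bucket(age, width=5):
    """Generalize an exact age into a range label like '20-25'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

# Hypothetical (age, title) pairs: every age is unique, so k = 1 before bucketing.
records = [(24, "Biostatistician III"), (22, "Biostatistician III"),
           (69, "Programmer I"), (67, "Programmer I")]

before = Counter(records)
after = Counter((age_bucket(age), title) for age, title in records)

print(min(before.values()), min(after.values()))  # 1 2
```

Here bucketing alone doubles the smallest equivalence class without dropping any rows, trading age precision for lower re-identification risk.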