K-anonymity is a property of a dataset that indicates how re-identifiable its records are. A dataset is k-anonymous if the quasi-identifiers for each person in the dataset are identical to those of at least k – 1 other people in the dataset.
You can compute the k-anonymity value based on one or more columns, or fields, of a dataset. This topic demonstrates how to compute k-anonymity values for a dataset using Sensitive Data Protection. For more information about k-anonymity or risk analysis in general, see the risk analysis concept topic before continuing.
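To make the definition concrete, the following is a minimal local sketch in Python. It is not how Sensitive Data Protection computes the metric; it only groups a small in-memory list of records by their quasi-identifier values and reports the size of the smallest group. The function name `k_anonymity`, the column names, and the records are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_ids):
    """Return the size of the smallest group of rows that share the same
    combination of quasi-identifier values (the dataset's k value)."""
    if not rows:
        raise ValueError("rows must not be empty")
    # Count how many rows share each quasi-identifier combination.
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(groups.values())


# Hypothetical toy records; the column names are illustrative only.
rows = [
    {"zip": "94103", "age_range": "30-39", "diagnosis": "flu"},
    {"zip": "94103", "age_range": "30-39", "diagnosis": "cold"},
    {"zip": "94103", "age_range": "40-49", "diagnosis": "flu"},
    {"zip": "94103", "age_range": "40-49", "diagnosis": "asthma"},
    {"zip": "94110", "age_range": "30-39", "diagnosis": "flu"},
]

print(k_anonymity(rows, ["zip", "age_range"]))      # 1: the last row is unique
print(k_anonymity(rows[:4], ["zip", "age_range"]))  # 2: every combination occurs twice
```

Dropping the single unique row raises k from 1 to 2, which is the trade-off between data loss and re-identification risk that the risk analysis results describe.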
Before you begin
Before continuing, be sure you've done the following:
- Sign in to your Google Account.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
- Enable Sensitive Data Protection.
- Select a BigQuery dataset to analyze. Sensitive Data Protection calculates the k-anonymity metric by scanning a BigQuery table.
- Determine an identifier (if applicable) and at least one quasi-identifier in the dataset. For more information, see Risk analysis terms and techniques.
Compute k-anonymity
Sensitive Data Protection performs risk analysis whenever a risk analysis job runs. You must create the job first, either by using the Google Cloud console, sending a DLP API request, or using a Sensitive Data Protection client library.
Console
In the Google Cloud console, go to the Create risk analysis page.
In the Choose input data section, specify the BigQuery table to scan by entering the project ID of the project containing the table, the dataset ID of the table, and the name of the table.
Under Privacy metric to compute, select k-anonymity.
In the Job ID section, you can optionally give the job a custom identifier and select a resource location in which Sensitive Data Protection will process your data. When you're done, click Continue.
In the Define fields section, you specify identifiers and quasi-identifiers for the k-anonymity risk job. Sensitive Data Protection accesses the metadata of the BigQuery table you specified in the previous step and attempts to populate the list of fields.
- Select the appropriate checkbox to specify a field as either an identifier (ID) or quasi-identifier (QI). You can select at most one identifier, and you must select at least one quasi-identifier.
- If Sensitive Data Protection isn't able to populate the fields, click Enter field name to manually enter one or more fields and set each one as identifier or quasi-identifier.
When you're done, click Continue.
In the Add actions section, you can add optional actions to perform when the risk job is complete. The available options are:
- Save to BigQuery: Saves the results of the risk analysis scan to a BigQuery table.
- Publish to Pub/Sub: Publishes a notification to a Pub/Sub topic.
- Notify by email: Sends you an email with results.
When you're done, click Create.
The k-anonymity risk analysis job starts immediately.
C#, Go, Java, Node.js, PHP, Python
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
REST
To run a new risk analysis job to compute k-anonymity, send a POST request to the projects.dlpJobs resource, where PROJECT_ID indicates your project identifier:
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs
The request contains a RiskAnalysisJobConfig object, which is composed of the following:
- A PrivacyMetric object. This is where you specify that you're calculating k-anonymity by including a KAnonymityConfig object.
- A BigQueryTable object. Specify the BigQuery table to scan by including all of the following:
  - projectId: The project ID of the project containing the table.
  - datasetId: The dataset ID of the table.
  - tableId: The name of the table.
- A set of one or more Action objects, which represent actions to run, in the order given, at the completion of the job. Each Action object can contain one of the following actions:
  - SaveFindings object: Saves the results of the risk analysis scan to a BigQuery table.
  - PublishToPubSub object: Publishes a notification to a Pub/Sub topic.
  - JobNotificationEmails object: Sends you an email with results.
Within the KAnonymityConfig object, you specify the following:
- quasiIds[]: One or more quasi-identifiers (FieldId objects) to scan and use to compute k-anonymity. When you specify multiple quasi-identifiers, they are considered a single composite key. Structs and repeated data types are not supported, but nested fields are supported as long as they are not structs themselves or nested within a repeated field.
- entityId: Optional identifier value that, when set, indicates that all rows corresponding to each distinct entityId should be grouped together for k-anonymity computation. Typically, an entityId is a column that represents a unique user, like a customer ID or a user ID. When an entityId appears on several rows with different quasi-identifier values, these rows are joined to form a multiset that is used as the quasi-identifiers for that entity. For more information about entity IDs, see Entity IDs and computing k-anonymity in the Risk analysis conceptual topic.
As soon as you send a request to the DLP API, it starts the risk analysis job.
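The request structure described above can be sketched in Python as a plain dictionary that serializes to the JSON body. This is a hedged illustration that only builds the URL and body and sends nothing. The helper name `build_k_anonymity_request` and all sample values are hypothetical. The JSON field names (`riskJob`, `privacyMetric`, `kAnonymityConfig`, `quasiIds`, `entityId`, `sourceTable`, `actions`, `pubSub`) reflect one reading of the DLP API v2 reference and should be verified against it before use.

```python
import json

DLP_ENDPOINT = "https://dlp.googleapis.com/v2"


def build_k_anonymity_request(project_id, table_project_id, dataset_id,
                              table_id, quasi_ids, entity_id=None,
                              pubsub_topic=None):
    """Return (url, body) for a k-anonymity risk analysis job request.

    Field names are assumptions based on the DLP API v2 reference.
    """
    # KAnonymityConfig: the quasi-identifiers form one composite key.
    k_config = {"quasiIds": [{"name": name} for name in quasi_ids]}
    if entity_id is not None:
        # Optional: group rows per entity before computing k-anonymity.
        k_config["entityId"] = {"field": {"name": entity_id}}

    risk_job = {
        "privacyMetric": {"kAnonymityConfig": k_config},
        "sourceTable": {
            "projectId": table_project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
    }
    if pubsub_topic is not None:
        # Actions run in the order given when the job completes.
        risk_job["actions"] = [{"pubSub": {"topic": pubsub_topic}}]

    url = f"{DLP_ENDPOINT}/projects/{project_id}/dlpJobs"
    return url, {"riskJob": risk_job}


# Hypothetical values for illustration only.
url, body = build_k_anonymity_request(
    project_id="my-project",
    table_project_id="my-project",
    dataset_id="my_dataset",
    table_id="my_table",
    quasi_ids=["zip", "age_range"],
)
print(url)
print(json.dumps(body, indent=2))
```

Sending the request additionally requires an authenticated HTTP client, which is omitted here to keep the sketch self-contained.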
List completed risk analysis jobs
You can view a list of the risk analysis jobs that have been run in the current project.
Console
To list running and previously run risk analysis jobs in the Google Cloud console, do the following:
In the Google Cloud console, open Sensitive Data Protection.
Click the Jobs & job triggers tab at the top of the page.
Click the Risk jobs tab.
The risk job listing appears.
Protocol
To list running and previously run risk analysis jobs, send a GET request to the projects.dlpJobs resource. Adding a job type filter (?type=RISK_ANALYSIS_JOB) narrows the response to only risk analysis jobs.
https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs?type=RISK_ANALYSIS_JOB
The response you receive contains a JSON representation of all current and previous risk analysis jobs.
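As a rough sketch, the listing URL can be assembled and the job IDs pulled out of a response in Python as follows. The helper names and the sample response are hypothetical. The `jobs` and `name` fields, and the assumption that a job name ends in `dlpJobs/JOB_ID`, reflect one reading of the API reference rather than output captured from the service.

```python
from urllib.parse import urlencode


def list_risk_jobs_url(project_id):
    """Build the GET URL that lists only risk analysis jobs."""
    query = urlencode({"type": "RISK_ANALYSIS_JOB"})
    return f"https://dlp.googleapis.com/v2/projects/{project_id}/dlpJobs?{query}"


def job_ids(response):
    """Extract job IDs, assuming each job name ends in '.../dlpJobs/JOB_ID'."""
    return [job["name"].rsplit("/", 1)[-1] for job in response.get("jobs", [])]


# Hand-written sample response for illustration; not real service output.
sample_response = {
    "jobs": [
        {"name": "projects/my-project/dlpJobs/r-1234",
         "type": "RISK_ANALYSIS_JOB", "state": "DONE"},
        {"name": "projects/my-project/dlpJobs/r-5678",
         "type": "RISK_ANALYSIS_JOB", "state": "RUNNING"},
    ]
}

print(list_risk_jobs_url("my-project"))
print(job_ids(sample_response))
```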
View k-anonymity job results
Sensitive Data Protection in the Google Cloud console features built-in visualizations for completed k-anonymity jobs. After following the instructions in the previous section, from the risk analysis job listing, select the job whose results you want to view. If the job ran successfully, the Risk analysis details page shows its results.
At the top of the page is information about the k-anonymity risk job, including its job ID and, under Container, its resource location.
To view the results of the k-anonymity calculation, click the K-anonymity tab. To view the risk analysis job's configuration, click the Configuration tab.
The K-anonymity tab first lists the entity ID (if any) and the quasi-identifiers used to calculate k-anonymity.
Risk chart
The Re-identification risk chart plots k-anonymity values on the x-axis. For each value, the y-axis shows the potential percentage of data loss, for both unique rows and unique quasi-identifier combinations, required to achieve it. The chart's color also indicates risk potential. Darker shades of blue indicate a higher risk, while lighter shades indicate less risk.
Higher k-anonymity values indicate less risk of re-identification. To achieve higher k-anonymity values, however, you would need to remove higher percentages of the total rows and of the unique quasi-identifier combinations, which might decrease the utility of the data. To see the specific potential percentage loss for a certain k-anonymity value, hover your cursor over the chart. A tooltip appears on the chart.
To view more detail about a specific k-anonymity value, click the corresponding data point. A detailed explanation is shown under the chart and a sample data table appears further down the page.
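The relationship the chart describes can be approximated locally. The sketch below is an interpretation of the chart, not the console's actual computation: for each target k, it treats every quasi-identifier combination shared by fewer than k rows as data that would have to be dropped, and reports the loss as a percentage of rows and of combinations. The function name `data_loss_by_k` and the records are hypothetical.

```python
from collections import Counter


def data_loss_by_k(rows, quasi_ids, k_values):
    """Map each target k to (percent of rows lost, percent of
    quasi-identifier combinations lost) needed to reach that k."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    total_rows = sum(groups.values())
    total_combos = len(groups)
    result = {}
    for k in k_values:
        # Groups smaller than k violate k-anonymity and would be removed.
        too_small = [size for size in groups.values() if size < k]
        result[k] = (
            100.0 * sum(too_small) / total_rows,
            100.0 * len(too_small) / total_combos,
        )
    return result


# Hypothetical toy records: group sizes are 5, 2, and 1.
rows = (
    [{"zip": "94103", "age_range": "30-39"}] * 5
    + [{"zip": "94110", "age_range": "40-49"}] * 2
    + [{"zip": "94117", "age_range": "50-59"}] * 1
)

for k, (row_loss, combo_loss) in data_loss_by_k(
        rows, ["zip", "age_range"], [1, 2, 3, 6]).items():
    print(f"k={k}: {row_loss:.1f}% of rows, {combo_loss:.1f}% of combinations")
```

In this toy data, reaching k=3 means dropping 3 of 8 rows, which illustrates why higher k values cost more data.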
Risk sample data table
The second component of the risk job results page is the sample data table. It displays quasi-identifier combinations for a given target k-anonymity value.
The first column of the table lists the k-anonymity values. Click a k-anonymity value to view corresponding sample data that would need to be dropped to achieve that value.
The second column displays the respective potential data loss of unique rows and quasi-identifier combinations, as well as the number of groups with at least k records and the total number of records.
The last column displays a sample of groups that share a quasi-identifier combination, along with the number of records that exist for that combination.
Retrieve job details using REST
To retrieve the results of the k-anonymity risk analysis job using the REST API, send the following GET request to the projects.dlpJobs resource. Replace PROJECT_ID with your project ID and JOB_ID with the identifier of the job you want to obtain results for. The job ID was returned when you started the job, and can also be retrieved by listing all jobs.
GET https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs/JOB_ID
The request returns a JSON object containing an instance of the job. The results of the analysis are inside the "riskDetails" key, in an AnalyzeDataSourceRiskDetails object. For more information, see the API reference for the DlpJob resource.
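The following Python sketch shows one way to read a k-anonymity result out of such a job object. Only the "riskDetails" key comes from this topic. The nested names `kAnonymityResult`, `equivalenceClassHistogramBuckets`, and `equivalenceClassSizeLowerBound` are assumptions based on the API reference and should be checked there. The sample job is hand-written for illustration and is not real service output.

```python
def smallest_equivalence_class(job):
    """Return the smallest equivalence class size reported in the job's
    k-anonymity histogram, or None if no buckets are present.

    Field names below are assumptions based on the DLP API v2 reference.
    """
    buckets = (
        job.get("riskDetails", {})
        .get("kAnonymityResult", {})
        .get("equivalenceClassHistogramBuckets", [])
    )
    # 64-bit integers are typically serialized as strings in JSON,
    # so convert before comparing.
    lower_bounds = [
        int(b["equivalenceClassSizeLowerBound"])
        for b in buckets
        if "equivalenceClassSizeLowerBound" in b
    ]
    return min(lower_bounds) if lower_bounds else None


# Hand-written sample for illustration; not real service output.
sample_job = {
    "name": "projects/my-project/dlpJobs/r-1234",
    "type": "RISK_ANALYSIS_JOB",
    "state": "DONE",
    "riskDetails": {
        "kAnonymityResult": {
            "equivalenceClassHistogramBuckets": [
                {"equivalenceClassSizeLowerBound": "2",
                 "equivalenceClassSizeUpperBound": "4"},
                {"equivalenceClassSizeLowerBound": "5",
                 "equivalenceClassSizeUpperBound": "9"},
            ]
        }
    },
}

print(smallest_equivalence_class(sample_job))  # 2
```

Under these assumptions, the smallest lower bound across buckets corresponds to the k value of the scanned table for the chosen quasi-identifiers.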
Code sample: Compute k-anonymity with an entity ID
This example creates a risk analysis job that computes k-anonymity with an entity ID.
For more information about entity IDs, see Entity IDs and computing k-anonymity.
C#, Go, Java, Node.js, PHP, Python
To learn how to install and use the client library for Sensitive Data Protection, see Sensitive Data Protection client libraries.
To authenticate to Sensitive Data Protection, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
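To illustrate the entity ID grouping semantics locally, the Python sketch below joins all rows that share an entity ID into a multiset of quasi-identifier values, then counts how many entities share each multiset. It is a simplified illustration of the behavior described for entityId, not the service's implementation and not a client library call. The function name, column names, and records are hypothetical.

```python
from collections import Counter, defaultdict


def k_anonymity_with_entity_id(rows, entity_id, quasi_ids):
    """Compute k-anonymity over entities instead of rows.

    Each entity's quasi-identifier is the multiset of quasi-identifier
    tuples found on its rows.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    per_entity = defaultdict(list)
    for row in rows:
        per_entity[row[entity_id]].append(tuple(row[q] for q in quasi_ids))
    # A sorted tuple is a hashable stand-in for a multiset: order is
    # ignored but duplicates are kept.
    groups = Counter(tuple(sorted(values)) for values in per_entity.values())
    return min(groups.values())


# Hypothetical toy records: u1 and u2 were each seen in the same two ZIP codes.
rows = [
    {"user_id": "u1", "zip": "94103"},
    {"user_id": "u1", "zip": "94110"},
    {"user_id": "u2", "zip": "94110"},
    {"user_id": "u2", "zip": "94103"},
    {"user_id": "u3", "zip": "94103"},
]

print(k_anonymity_with_entity_id(rows, "user_id", ["zip"]))      # 1: u3 is unique
print(k_anonymity_with_entity_id(rows[:4], "user_id", ["zip"]))  # 2: u1 and u2 match
```

Without the entity ID, the same rows would be grouped per row rather than per user, which can overstate how well each individual blends in.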
What's next
- Learn how to calculate the l-diversity value for a dataset.
- Learn how to calculate the k-map value for a dataset.
- Learn how to calculate the δ-presence value for a dataset.