Computing Risk Analysis Metrics for BigQuery Content

Understanding your data is crucial to properly managing it. This guide introduces a set of metrics and tools that you can use to help understand potential risks or outliers in your data. There are several techniques available, and the ones presented here are just a few metrics that you can use as part of a comprehensive analysis of your data and how it is exposed and used.

The Data Loss Prevention API helps you compute the likelihood that de-identified data will be "re-identified," according to several metrics.

The analyze method of the dataSource resource schedules risk analysis jobs over content stored in Google BigQuery. This topic discusses both what risk analysis looks for and how to implement it.

You create a risk analysis job by giving the dataSource.analyze method two pieces of information:

  • privacyMetric: The privacy metric to compute, given as a PrivacyMetric object.
  • sourceTable: The location of the BigQuery table to scan, given as a BigQueryTable object.

Both are described in the sections that follow.
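As a rough orientation, a Node.js request takes the following shape (a minimal sketch only; the metric shown and all identifiers are placeholders):

// A sketch of a risk analysis request: one privacy metric plus the
// BigQuery table to scan. All identifiers here are placeholders.
const request = {
  privacyMetric: {
    numericalStatsConfig: {
      field: {columnName: 'age'},
    },
  },
  sourceTable: {
    projectId: 'my-project',
    datasetId: 'my_dataset',
    tableId: 'my_table',
  },
};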

PrivacyMetric

The DLP API can analyze your structured data and compute the following privacy metrics:

NumericalStatsConfig

You can compute numerical statistics for an individual BigQuery column by setting the NumericalStatsConfig privacy metric.

Set the field within NumericalStatsConfig to the name of the column to scan. The returned scan operation, when run, causes the DLP API to compute statistics for columns of the following number types:

  • integer
  • float
  • date
  • datetime
  • timestamp
  • time

The statistics that a scan run returns include the minimum value, the maximum value, and percentile (or "quantile") values; for example, the value at the 50% quantile is the median.

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// (Optional) The project ID to run the API call under
// const projectId = process.env.GCLOUD_PROJECT;

// The ID of the dataset to inspect, e.g. 'my_dataset'
// const datasetId = 'my_dataset';

// The ID of the table to inspect, e.g. 'my_table'
// const tableId = 'my_table';

// The name of the column to compute risk metrics for, e.g. 'age'
// Note that this column must be a numeric data type
// const columnName = 'age';

const sourceTable = {
  projectId: projectId,
  datasetId: datasetId,
  tableId: tableId,
};

// Construct request for creating a risk analysis job
const request = {
  privacyMetric: {
    numericalStatsConfig: {
      field: {
        columnName: columnName,
      },
    },
  },
  sourceTable: sourceTable,
};

// Create helper function for unpacking values
const getValue = obj => obj[Object.keys(obj)[0]];

// Run risk analysis job
dlp
  .analyzeDataSourceRisk(request)
  .then(response => {
    const operation = response[0];
    return operation.promise();
  })
  .then(completedJobResponse => {
    const results = completedJobResponse[0].numericalStatsResult;

    console.log(
      `Value Range: [${getValue(results.minValue)}, ${getValue(
        results.maxValue
      )}]`
    );

    // Print unique quantile values
    let tempValue = null;
    results.quantileValues.forEach((result, percent) => {
      const value = getValue(result);

      // Only print new values
      if (
        tempValue !== value &&
        !(tempValue && tempValue.equals && tempValue.equals(value))
      ) {
        console.log(`Value at ${percent}% quantile: ${value}`);
        tempValue = value;
      }
    });
  })
  .catch(err => {
    console.log(`Error in numericalRiskAnalysis: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Calculate numerical statistics for a column in a BigQuery table using the DLP API.
 * @param projectId The Google Cloud Platform project ID to run the API call under.
 * @param datasetId The BigQuery dataset to analyze.
 * @param tableId The BigQuery table to analyze.
 * @param columnName The name of the column to analyze, which must contain only numerical data.
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // projectId = "my-project-id";
  // datasetId = "my_dataset";
  // tableId = "my_table";
  // columnName = "age";

  FieldId fieldId =
      FieldId.newBuilder()
          .setColumnName(columnName)
          .build();

  NumericalStatsConfig numericalStatsConfig =
      NumericalStatsConfig.newBuilder()
          .setField(fieldId)
          .build();

  BigQueryTable bigQueryTable =
      BigQueryTable.newBuilder()
          .setProjectId(projectId)
          .setDatasetId(datasetId)
          .setTableId(tableId)
          .build();

  PrivacyMetric privacyMetric =
      PrivacyMetric.newBuilder()
          .setNumericalStatsConfig(numericalStatsConfig)
          .build();

  AnalyzeDataSourceRiskRequest request =
      AnalyzeDataSourceRiskRequest.newBuilder()
          .setPrivacyMetric(privacyMetric)
          .setSourceTable(bigQueryTable)
          .build();

  // asynchronously submit a risk analysis operation
  OperationFuture<RiskAnalysisOperationResult, RiskAnalysisOperationMetadata, Operation>
      responseFuture = dlpServiceClient.analyzeDataSourceRiskAsync(request);

  // ...
  // block on response
  RiskAnalysisOperationResult response = responseFuture.get();
  NumericalStatsResult results =
      response.getNumericalStatsResult();

  System.out.println(
      "Value range: [" + results.getMinValue() + ", " + results.getMaxValue() + "]");

  // Print out unique quantiles
  String previousValue = "";
  for (int i = 0; i < results.getQuantileValuesCount(); i++) {
    Value valueObj = results.getQuantileValues(i);
    String value = valueObj.toString();

    if (!previousValue.equals(value)) {
      System.out.println("Value at " + i + "% quantile: " + value.toString());
      previousValue = value;
    }
  }
} catch (Exception e) {
  System.out.println("Error in numericalStatsAnalysis: " + e.getMessage());
}

CategoricalStatsConfig

You can compute categorical statistics for an individual BigQuery column by setting the CategoricalStatsConfig privacy metric.

Set the field within CategoricalStatsConfig to the name of the column to scan. The returned scan operation, when run, causes the DLP API to compute statistics for the given column. The CategoricalStatsConfig metric can be applied to all column types except arrays and structs. If a column consists solely of the number types listed under NumericalStatsConfig, scanning for that metric may be more informative.

A scan run is an Operation resource, and contains the following fields:

  • name: The name given to the specific scan run operation by the server.
  • metadata: A RiskAnalysisOperationMetadata object, which contains metadata about the scan run operation, including the time it was created, the privacy metric to compute (a PrivacyMetric object), and the input dataset on which metrics are being computed (a BigQueryTable object).
  • done: A Boolean value indicating whether the operation has finished running.
  • error: The error result of the operation if it failed or was cancelled.
  • response: A RiskAnalysisOperationResult object containing a CategoricalStatsResult object, which is a histogram of the number of occurrences of each of the values contained in the column. For more information about what is contained in the response, see the CategoricalStatsHistogramBucket object.
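Expressed as a plain JavaScript object, a completed scan run has roughly the following shape (a sketch only; values are placeholders, and the metadata field names are illustrative of the contents listed above):

// A sketch of a completed scan run (Operation resource). Values are
// placeholders mirroring the fields described above; exactly one of
// `error` and `response` is set.
const exampleScanRun = {
  name: 'operations/...', // server-assigned name for this scan run
  metadata: {
    // RiskAnalysisOperationMetadata
    createTime: '...', // when the operation was created
    requestedPrivacyMetric: {/* the PrivacyMetric being computed */},
    requestedSourceTable: {/* the BigQueryTable being scanned */},
  },
  done: true,
  // error: {...}, // set instead of `response` on failure or cancellation
  response: {
    categoricalStatsResult: {/* histogram of value occurrences */},
  },
};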

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// (Optional) The project ID to run the API call under
// const projectId = process.env.GCLOUD_PROJECT;

// The ID of the dataset to inspect, e.g. 'my_dataset'
// const datasetId = 'my_dataset';

// The ID of the table to inspect, e.g. 'my_table'
// const tableId = 'my_table';

// The name of the column to compute risk metrics for, e.g. 'firstName'
// const columnName = 'firstName';

const sourceTable = {
  projectId: projectId,
  datasetId: datasetId,
  tableId: tableId,
};

// Construct request for creating a risk analysis job
const request = {
  privacyMetric: {
    categoricalStatsConfig: {
      field: {
        columnName: columnName,
      },
    },
  },
  sourceTable: sourceTable,
};

// Create helper function for unpacking values
const getValue = obj => obj[Object.keys(obj)[0]];

// Run risk analysis job
dlp
  .analyzeDataSourceRisk(request)
  .then(response => {
    const operation = response[0];
    return operation.promise();
  })
  .then(completedJobResponse => {
    const results =
      completedJobResponse[0].categoricalStatsResult
        .valueFrequencyHistogramBuckets[0];
    console.log(
      `Most common value occurs ${results.valueFrequencyUpperBound} time(s)`
    );
    console.log(
      `Least common value occurs ${results.valueFrequencyLowerBound} time(s)`
    );
    console.log(`${results.bucketSize} unique values total.`);
    results.bucketValues.forEach(bucket => {
      console.log(
        `Value ${getValue(bucket.value)} occurs ${bucket.count} time(s).`
      );
    });
  })
  .catch(err => {
    console.log(`Error in categoricalRiskAnalysis: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Calculate categorical statistics for a column in a BigQuery table using the DLP API.
 * @param projectId The Google Cloud Platform project ID to run the API call under.
 * @param datasetId The BigQuery dataset to analyze.
 * @param tableId The BigQuery table to analyze.
 * @param columnName The name of the column to analyze, which need not contain numerical data.
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // projectId = "my-project-id";
  // datasetId = "my_dataset";
  // tableId = "my_table";
  // columnName = "firstName";

  FieldId fieldId =
      FieldId.newBuilder()
          .setColumnName(columnName)
          .build();

  CategoricalStatsConfig categoricalStatsConfig =
      CategoricalStatsConfig.newBuilder()
          .setField(fieldId)
          .build();

  BigQueryTable bigQueryTable =
      BigQueryTable.newBuilder()
          .setProjectId(projectId)
          .setDatasetId(datasetId)
          .setTableId(tableId)
          .build();

  PrivacyMetric privacyMetric =
      PrivacyMetric.newBuilder()
          .setCategoricalStatsConfig(categoricalStatsConfig)
          .build();

  AnalyzeDataSourceRiskRequest request =
      AnalyzeDataSourceRiskRequest.newBuilder()
          .setPrivacyMetric(privacyMetric)
          .setSourceTable(bigQueryTable)
          .build();

  // asynchronously submit a risk analysis operation
  OperationFuture<RiskAnalysisOperationResult, RiskAnalysisOperationMetadata, Operation>
      responseFuture = dlpServiceClient.analyzeDataSourceRiskAsync(request);

  // ...
  // block on response
  RiskAnalysisOperationResult response = responseFuture.get();
  CategoricalStatsHistogramBucket results =
      response.getCategoricalStatsResult().getValueFrequencyHistogramBuckets(0);

  System.out.println(
      "Most common value occurs " + results.getValueFrequencyUpperBound() + " time(s)");
  System.out.println(
      "Least common value occurs " + results.getValueFrequencyLowerBound() + " time(s)");

  for (ValueFrequency valueFrequency : results.getBucketValuesList()) {
    System.out.println("Value "
        + valueFrequency.getValue().toString()
        + " occurs "
        + valueFrequency.getCount()
        + " time(s)."
    );
  }

} catch (Exception e) {
  System.out.println("Error in categoricalStatsAnalysis: " + e.getMessage());
}

KAnonymityConfig

K-anonymity can be used to help assess re-identification probability. If you're already familiar with k-anonymity and just want to see how to compute it using the DLP API, see Computing k-anonymity with the DLP API.

About k-anonymity

When collecting data for research purposes, de-identification can be essential for helping maintain people's privacy. At the same time, de-identification may result in a dataset losing its practical usefulness. K-anonymity was born out of a need to balance the usefulness of de-identified data with the privacy of the people it describes, by helping to reduce the risk of re-identification. It is a property of a dataset that can be used to assess the re-identifiability of records within that dataset.

As an example, consider a set of patient data:

Patient ID | Full Name             | ZIP Code | Age | Condition         | ...
-----------|-----------------------|----------|-----|-------------------|----
746572     | John J. Jacobsen      | 98122    | 29  | Heart disease     |
652978     | Debra D. Dreb         | 98115    | 29  | Diabetes, Type II |
075321     | Abraham A. Abernathy  | 98122    | 54  | Cancer, Liver     |
339012     | Karen K. Krakow       | 98115    | 88  | Heart disease     |
995212     | William W. Wertheimer | 98115    | 54  | Asthma            |
...

In this dataset, there are three types of data:

  • Identifiers: Patient ID and Full Name are considered identifiers because either can be used to uniquely identify an individual.
  • Quasi-identifiers: ZIP Code and Age are considered quasi-identifiers because, individually, they do not uniquely identify an individual. As quasi-identifiers are combined and associated to individual records, however, the likelihood that an attacker will be able to re-identify an individual can increase substantially.
  • Sensitive data: Health conditions are considered "sensitive data," as are attributes like salary, criminal offenses, and geographic location. Note that there can be overlap between identifiers and sensitive data.

If sensitive data like health conditions isn't masked or redacted, an attacker could use the quasi-identifiers attached to each record, potentially cross-referencing them with another dataset that contains similar quasi-identifiers, to re-identify the people to whom that sensitive data applies.

Data is considered to have k-anonymity if, after de-identification, each row of data has at least k-1 other rows with the same values for every quasi-identifier. A group of rows with identical quasi-identifier values is called an "equivalence class." For example, if you've de-identified the quasi-identifiers enough that there is a minimum of four rows whose quasi-identifier values are identical, the dataset's k-anonymity value is 4.
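To make the definition concrete, the following Node.js sketch (independent of the DLP API; the helper function and sample rows are hypothetical) computes the k-anonymity value of an in-memory table for a chosen set of quasi-identifier columns:

// A minimal sketch: compute k-anonymity for in-memory rows, given a
// list of quasi-identifier column names. Not part of the DLP API.
function kAnonymity(rows, quasiIds) {
  const classSizes = new Map();
  for (const row of rows) {
    // Rows with identical quasi-identifier values share one key,
    // and therefore one equivalence class.
    const key = quasiIds.map(q => String(row[q])).join('|');
    classSizes.set(key, (classSizes.get(key) || 0) + 1);
  }
  // k is the size of the smallest equivalence class.
  return Math.min(...classSizes.values());
}

// The patient rows above, reduced to their quasi-identifiers:
const rows = [
  {zipCode: '98122', age: 29},
  {zipCode: '98115', age: 29},
  {zipCode: '98122', age: 54},
  {zipCode: '98115', age: 88},
  {zipCode: '98115', age: 54},
];
console.log(kAnonymity(rows, ['zipCode', 'age'])); // 1

Every (ZIP code, age) pair in the raw data is unique, so k = 1; generalizing the quasi-identifiers (for example, bucketing ages into ranges) grows the equivalence classes and raises k.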

Computing k-anonymity with the DLP API

You can compute the k-anonymity value based on one or more columns, or fields, by setting the privacy metric to a KAnonymityConfig object. Within the KAnonymityConfig object, you specify the following:

  • quasiIds[]: One or more quasi-identifiers to scan and use to compute k-anonymity. When you specify multiple quasi-identifiers, they are considered a single composite key. Structs and repeated data types are not supported, but nested fields are supported as long as they are not structs themselves or nested within a repeated field.
  • entityId: An optional entity identifier (EntityId object), containing a field ID (FieldId object). In this context, "entity" means one or more rows that correspond to a single person within the dataset. Specifying an entity ID indicates that generalizations or analysis must be consistent across multiple rows pertaining to the same entity; otherwise, that entity would contribute to the k-anonymity computation more than once.

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// (Optional) The project ID to run the API call under
// const projectId = process.env.GCLOUD_PROJECT;

// The ID of the dataset to inspect, e.g. 'my_dataset'
// const datasetId = 'my_dataset';

// The ID of the table to inspect, e.g. 'my_table'
// const tableId = 'my_table';

// A set of columns that form a composite key ('quasi-identifiers')
// const quasiIds = [{ columnName: 'age' }, { columnName: 'city' }];

const sourceTable = {
  projectId: projectId,
  datasetId: datasetId,
  tableId: tableId,
};

// Construct request for creating a risk analysis job
const request = {
  privacyMetric: {
    kAnonymityConfig: {
      quasiIds: quasiIds,
    },
  },
  sourceTable: sourceTable,
};

// Create helper function for unpacking values
const getValue = obj => obj[Object.keys(obj)[0]];

// Run risk analysis job
dlp
  .analyzeDataSourceRisk(request)
  .then(response => {
    const operation = response[0];
    return operation.promise();
  })
  .then(completedJobResponse => {
    const results =
      completedJobResponse[0].kAnonymityResult
        .equivalenceClassHistogramBuckets[0];
    console.log(
      `Bucket size range: [${results.equivalenceClassSizeLowerBound}, ${results.equivalenceClassSizeUpperBound}]`
    );

    results.bucketValues.forEach(bucket => {
      const quasiIdValues = bucket.quasiIdsValues.map(getValue).join(', ');
      console.log(`  Quasi-ID values: {${quasiIdValues}}`);
      console.log(`  Class size: ${bucket.equivalenceClassSize}`);
    });
  })
  .catch(err => {
    console.log(`Error in kAnonymityAnalysis: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Calculate k-anonymity for quasi-identifiers in a BigQuery table using the DLP API.
 * @param projectId The Google Cloud Platform project ID to run the API call under.
 * @param datasetId The BigQuery dataset to analyze.
 * @param tableId The BigQuery table to analyze.
 * @param quasiIds The names of columns that form a composite key ('quasi-identifiers').
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // projectId = "my-project-id";
  // datasetId = "my_dataset";
  // tableId = "my_table";
  // List<String> quasiIds = Arrays.asList("age", "city");

  List<FieldId> quasiIdFields =
      quasiIds
          .stream()
          .map(columnName -> FieldId.newBuilder().setColumnName(columnName).build())
          .collect(Collectors.toList());

  KAnonymityConfig kanonymityConfig =
      KAnonymityConfig.newBuilder()
          .addAllQuasiIds(quasiIdFields)
          .build();

  BigQueryTable bigQueryTable =
      BigQueryTable.newBuilder()
          .setProjectId(projectId)
          .setDatasetId(datasetId)
          .setTableId(tableId)
          .build();

  PrivacyMetric privacyMetric =
      PrivacyMetric.newBuilder()
          .setKAnonymityConfig(kanonymityConfig)
          .build();

  AnalyzeDataSourceRiskRequest request =
      AnalyzeDataSourceRiskRequest.newBuilder()
          .setPrivacyMetric(privacyMetric)
          .setSourceTable(bigQueryTable)
          .build();

  // asynchronously submit a risk analysis operation
  OperationFuture<RiskAnalysisOperationResult, RiskAnalysisOperationMetadata, Operation>
      responseFuture = dlpServiceClient.analyzeDataSourceRiskAsync(request);

  // ...
  // block on response
  RiskAnalysisOperationResult response = responseFuture.get();
  KAnonymityHistogramBucket results =
      response.getKAnonymityResult().getEquivalenceClassHistogramBuckets(0);

  System.out.println("Bucket size range: ["
      + results.getEquivalenceClassSizeLowerBound()
      + ", "
      + results.getEquivalenceClassSizeUpperBound()
      + "]"
  );

  for (KAnonymityEquivalenceClass bucket : results.getBucketValuesList()) {
    List<String> quasiIdValues = bucket.getQuasiIdsValuesList()
        .stream()
        .map(v -> v.toString())
        .collect(Collectors.toList());

    System.out.println("\tQuasi-ID values: " + String.join(", ", quasiIdValues));
    System.out.println("\tClass size: " + bucket.getEquivalenceClassSize());
  }
} catch (Exception e) {
  System.out.println("Error in kAnonymityAnalysis: " + e.getMessage());
}

LDiversityConfig

L-diversity is another measure of de-identification that is used to help preserve privacy. If you’re already familiar with l-diversity and just want to see how to compute it using the DLP API, see Computing l-diversity with the DLP API.

About l-diversity

L-diversity is closely related to k-anonymity, and was created to help address a de-identified dataset’s susceptibility to attacks such as:

  • A homogeneity attack, in which attackers predict sensitive values for a set of k-anonymized data by taking advantage of the homogeneity of values within a set of k records.
  • A background knowledge attack, in which attackers take advantage of associations between quasi-identifier values that have a certain sensitive attribute to narrow down the attribute’s possible values.

L-diversity attempts to measure how much an attacker can learn about people in terms of k-anonymity and equivalence classes (sets of rows with identical quasi-identifier values). For each equivalence class, how many distinct sensitive values occur? For example, an l-diversity value of 1 means that everyone in an equivalence class has the same sensitive attribute; a value of 2 means that everyone has one of two sensitive attributes; and so on.
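As a concrete illustration, the following Node.js sketch (again independent of the DLP API; the helper function and sample rows are hypothetical) computes an l-diversity value by counting the distinct sensitive values in each equivalence class and taking the minimum:

// A minimal sketch: compute l-diversity for in-memory rows, given
// quasi-identifier columns and a sensitive column. Not the DLP API.
function lDiversity(rows, quasiIds, sensitiveAttribute) {
  const classes = new Map();
  for (const row of rows) {
    const key = quasiIds.map(q => String(row[q])).join('|');
    if (!classes.has(key)) classes.set(key, new Set());
    classes.get(key).add(row[sensitiveAttribute]);
  }
  // l is the smallest number of distinct sensitive values in any class.
  let l = Infinity;
  for (const values of classes.values()) {
    l = Math.min(l, values.size);
  }
  return l;
}

// Two equivalence classes: the first holds two distinct conditions,
// the second only one, so the l-diversity value is 1.
const rows = [
  {zipCode: '981**', age: '20-30', condition: 'Heart disease'},
  {zipCode: '981**', age: '20-30', condition: 'Diabetes, Type II'},
  {zipCode: '981**', age: '50-60', condition: 'Asthma'},
  {zipCode: '981**', age: '50-60', condition: 'Asthma'},
];
console.log(lDiversity(rows, ['zipCode', 'age'], 'condition')); // 1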

For more information about l-diversity, see "l-Diversity: Privacy Beyond k-Anonymity," from the Cornell University Department of Computer Science.

Computing l-diversity with the DLP API

You can compute the l-diversity value for one or more columns, or fields, by setting the privacy metric to an LDiversityConfig object. Within the LDiversityConfig object, you specify the following:

  • quasiIds[]: A set of quasi-identifiers that indicate how equivalence classes are defined for the l-diversity computation. As with KAnonymityConfig, when you specify multiple fields, they are considered a single composite key.
  • sensitiveAttribute: Sensitive field for computing the l-value.

Node.js

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

// Imports the Google Cloud Data Loss Prevention library
const DLP = require('@google-cloud/dlp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// (Optional) The project ID to run the API call under
// const projectId = process.env.GCLOUD_PROJECT;

// The ID of the dataset to inspect, e.g. 'my_dataset'
// const datasetId = 'my_dataset';

// The ID of the table to inspect, e.g. 'my_table'
// const tableId = 'my_table';

// The column to measure l-diversity relative to, e.g. 'name'
// const sensitiveAttribute = 'name';

// A set of columns that form a composite key ('quasi-identifiers')
// const quasiIds = [{ columnName: 'age' }, { columnName: 'city' }];

const sourceTable = {
  projectId: projectId,
  datasetId: datasetId,
  tableId: tableId,
};

// Construct request for creating a risk analysis job
const request = {
  privacyMetric: {
    lDiversityConfig: {
      quasiIds: quasiIds,
      sensitiveAttribute: {
        columnName: sensitiveAttribute,
      },
    },
  },
  sourceTable: sourceTable,
};

// Create helper function for unpacking values
const getValue = obj => obj[Object.keys(obj)[0]];

// Run risk analysis job
dlp
  .analyzeDataSourceRisk(request)
  .then(response => {
    const operation = response[0];
    return operation.promise();
  })
  .then(completedJobResponse => {
    const results =
      completedJobResponse[0].lDiversityResult
        .sensitiveValueFrequencyHistogramBuckets[0];

    console.log(
      `Bucket size range: [${results.sensitiveValueFrequencyLowerBound}, ${results.sensitiveValueFrequencyUpperBound}]`
    );
    results.bucketValues.forEach(bucket => {
      const quasiIdValues = bucket.quasiIdsValues.map(getValue).join(', ');
      console.log(`  Quasi-ID values: {${quasiIdValues}}`);
      console.log(`  Class size: ${bucket.equivalenceClassSize}`);
      bucket.topSensitiveValues.forEach(valueObj => {
        console.log(
          `    Sensitive value ${getValue(
            valueObj.value
          )} occurs ${valueObj.count} time(s).`
        );
      });
    });
  })
  .catch(err => {
    console.log(`Error in lDiversityAnalysis: ${err.message || err}`);
  });

Java

For more on installing and creating a DLP API client, refer to DLP API Client Libraries.

/**
 * Calculate l-diversity for an attribute relative to quasi-identifiers in a BigQuery table.
 * @param projectId The Google Cloud Platform project ID to run the API call under.
 * @param datasetId The BigQuery dataset to analyze.
 * @param tableId The BigQuery table to analyze.
 * @param sensitiveAttribute The name of the attribute to compare the quasi-IDs against.
 * @param quasiIds A set of column names that form a composite key ('quasi-identifiers').
 */

// instantiate a client
try (DlpServiceClient dlpServiceClient = DlpServiceClient.create()) {

  // projectId = "my-project-id";
  // datasetId = "my_dataset";
  // tableId = "my_table";
  // sensitiveAttribute = "name";
  // List<String> quasiIds = Arrays.asList("age", "city");

  FieldId sensitiveAttributeField =
      FieldId.newBuilder()
          .setColumnName(sensitiveAttribute)
          .build();

  List<FieldId> quasiIdFields =
      quasiIds
          .stream()
          .map(columnName -> FieldId.newBuilder().setColumnName(columnName).build())
          .collect(Collectors.toList());

  LDiversityConfig ldiversityConfig =
      LDiversityConfig.newBuilder()
          .addAllQuasiIds(quasiIdFields)
          .setSensitiveAttribute(sensitiveAttributeField)
          .build();

  BigQueryTable bigQueryTable =
      BigQueryTable.newBuilder()
          .setProjectId(projectId)
          .setDatasetId(datasetId)
          .setTableId(tableId)
          .build();

  PrivacyMetric privacyMetric =
      PrivacyMetric.newBuilder()
          .setLDiversityConfig(ldiversityConfig)
          .build();

  AnalyzeDataSourceRiskRequest request =
      AnalyzeDataSourceRiskRequest.newBuilder()
          .setPrivacyMetric(privacyMetric)
          .setSourceTable(bigQueryTable)
          .build();

  // asynchronously submit a risk analysis operation
  OperationFuture<RiskAnalysisOperationResult, RiskAnalysisOperationMetadata, Operation>
      responseFuture = dlpServiceClient.analyzeDataSourceRiskAsync(request);

  // ...
  // block on response
  RiskAnalysisOperationResult response = responseFuture.get();
  LDiversityHistogramBucket results =
      response.getLDiversityResult().getSensitiveValueFrequencyHistogramBuckets(0);

  for (LDiversityEquivalenceClass bucket : results.getBucketValuesList()) {
    List<String> quasiIdValues = bucket.getQuasiIdsValuesList()
        .stream()
        .map(v -> v.toString())
        .collect(Collectors.toList());

    System.out.println("\tQuasi-ID values: " + String.join(", ", quasiIdValues));
    System.out.println("\tClass size: " + bucket.getEquivalenceClassSize());

    for (ValueFrequency valueFrequency : bucket.getTopSensitiveValuesList()) {
      System.out.println("\t\tSensitive value "
          + valueFrequency.getValue().toString()
          + " occurs "
          + valueFrequency.getCount()
          + " time(s).");
    }
  }
} catch (Exception e) {
  System.out.println("Error in lDiversityAnalysis: " + e.getMessage());
}

BigQueryTable

The sourceTable argument consists of a BigQueryTable object, which defines the location of the BigQuery table to scan. The BigQueryTable object is used across the DLP API, and consists of:

  • projectId: The project ID of the Google Cloud Platform project that contains the table. If you omit this value, the project ID is inferred from the API call.
  • datasetId: The dataset ID of the table.
  • tableId: The name of the table.
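Putting these together, a fully specified sourceTable in a Node.js request might look like the following sketch (all identifiers are placeholders):

// A sketch of a BigQueryTable value; all identifiers are placeholders.
const sourceTable = {
  projectId: 'my-project', // optional; inferred from the API call if omitted
  datasetId: 'my_dataset',
  tableId: 'my_table',
};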
