Deploying a solution for migrating sensitive data

Last reviewed 2021-06-29 UTC

This document shows you how to implement the Migrate Sensitive Data in BigQuery Using Dataflow and Cloud Data Loss Prevention reference pattern to solve some of the key challenges in migrating and processing sensitive data for real-time analytics use cases.

Overview

To use the advanced data and analytics capabilities that cloud services provide, many organizations are migrating their on-premises data platforms to the cloud. This migration often includes sensitive data like credit card numbers, medical identifiers and history, and vehicle identification numbers (VINs). The Migrate Sensitive Data in BigQuery Using Dataflow and Cloud Data Loss Prevention reference pattern can help you migrate large datasets that contain this type of sensitive data.

This article gives you a guided tour through using the reference pattern. It discusses key design and implementation details that can help you deploy the pattern in your organization.

Data migration tasks

This pattern addresses the following key data migration tasks:

  • Inspecting petabyte-scale datasets to see if they contain sensitive information.
  • Migrating and de-identifying the sensitive data.
  • Re-identifying the de-identified results, to enable downstream applications that require the original data.

De-identifying data

De-identification transforms sensitive data so that you can still use it while reducing the risk of exposure. This transformation uses both non-reversible methods, like cryptographic hashing and masking, and reversible methods, like deterministic encryption.

For example, consider how you could analyze an issue affecting multiple customer accounts without exposing personally identifiable information (PII). If you use de-identification, you can obscure this data while retaining the ability to work with it.

Compare the tables below:

Comparison of two tables, one with sensitive data and one with de-identified data.

In the table on the left, PII is exposed. In the de-identified table on the right, this sensitive data is obscured, but you can still see that the account_id values in rows 1 and 3 are the same while the email addresses are different. You can also still get an idea of the individual's age range without exposing their specific age.
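
For reference, the following sketch shows how a reversible de-identification transformation can be expressed with the Cloud DLP Java client library, using deterministic encryption backed by a Cloud KMS-wrapped key. The key name, wrapped key, and surrogate infoType name are placeholders; in practice, you would typically supply this kind of configuration through a de-identification template that you create in Cloud DLP.

import com.google.privacy.dlp.v2.CryptoDeterministicConfig;
import com.google.privacy.dlp.v2.CryptoKey;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InfoTypeTransformations.InfoTypeTransformation;
import com.google.privacy.dlp.v2.KmsWrappedCryptoKey;
import com.google.privacy.dlp.v2.PrimitiveTransformation;
import com.google.protobuf.ByteString;

public class DeidConfigExample {
  // Builds a reversible (deterministic encryption) transformation for EMAIL_ADDRESS values.
  // The KMS key name and wrapped key are placeholders.
  static DeidentifyConfig buildDeidentifyConfig(String kmsKeyName, ByteString wrappedKey) {
    CryptoKey cryptoKey =
        CryptoKey.newBuilder()
            .setKmsWrapped(
                KmsWrappedCryptoKey.newBuilder()
                    .setCryptoKeyName(kmsKeyName)
                    .setWrappedKey(wrappedKey))
            .build();

    PrimitiveTransformation transformation =
        PrimitiveTransformation.newBuilder()
            .setCryptoDeterministicConfig(
                CryptoDeterministicConfig.newBuilder()
                    .setCryptoKey(cryptoKey)
                    // Surrogate infoType that labels the tokenized value for later re-identification.
                    .setSurrogateInfoType(InfoType.newBuilder().setName("EMAIL_TOKEN")))
            .build();

    return DeidentifyConfig.newBuilder()
        .setInfoTypeTransformations(
            InfoTypeTransformations.newBuilder()
                .addTransformations(
                    InfoTypeTransformation.newBuilder()
                        .addInfoTypes(InfoType.newBuilder().setName("EMAIL_ADDRESS"))
                        .setPrimitiveTransformation(transformation)))
        .build();
  }
}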

Reference pattern architecture

The reference pattern enables you to build an automated processing pipeline that accomplishes these data migration tasks.

The pattern architecture is illustrated in the following diagram:

An illustration that shows the products used by the pattern, and the way work flows between them.

A directed acyclic graph (DAG) for an implemented pipeline looks similar to the following:

A directed acyclic graph (DAG) for a pipeline created by implementing the pattern.

Specifying data sources

The reference pattern supports ingesting data from different sources. You can import your data into Cloud Storage, or you can leave it in its original location if that is required for compliance or cost reasons. In that case, you can process the data where it is, whether that is on-premises or in another cloud.

The reference pattern is implemented using the Apache Beam open-source framework, so you can choose to execute the pipeline using any of Beam's available runners. You can also integrate a variety of data sources by using different Beam I/O transforms. This lets you load data from an Amazon S3 bucket, or extend the pattern to work with other connectors as needed. This flexible approach also allows you to use the data tokenization product that works best for you.
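
As a rough sketch of this flexibility, the following pipeline fragment reads input files through Beam's FileIO, which accepts both gs:// and s3:// file patterns when the corresponding I/O modules are on the classpath. The bucket name is a placeholder, and the AWS region and credentials are assumed to be supplied through Beam's AwsOptions.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.aws.options.AwsOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadFromAnySource {
  public static void main(String[] args) {
    // AwsOptions supplies the region and credentials needed for s3:// file patterns.
    AwsOptions options = PipelineOptionsFactory.fromArgs(args).as(AwsOptions.class);
    Pipeline p = Pipeline.create(options);

    // The same match/readMatches stages work for gs:// and s3:// patterns,
    // so the source data can stay wherever compliance or cost requires.
    p.apply("MatchInputFiles", FileIO.match().filepattern("s3://my-bucket/input/*.csv"))
        .apply("ReadMatchedFiles", FileIO.readMatches());

    p.run().waitUntilFinish();
  }
}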

Processing data files

The reference pattern lets you process data in multiple formats, like CSV, Avro, and JSON. It automatically creates an appropriate table schema in BigQuery by using Beam's DynamicDestinations feature. For large CSV files, it uses a splittable DoFn to split each file into chunks of a configurable byte size, using a record delimiter so that rows stay intact. The code snippet below illustrates this approach:

@ProcessElement
public void processElement(ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker)
    throws IOException {
  String fileName = c.element().getKey();
  try (SeekableByteChannel channel = getReader(c.element().getValue())) {
    // Start reading at the beginning of the byte range claimed by this restriction.
    FileReader reader =
        new FileReader(
            channel, tracker.currentRestriction().getFrom(), recordDelimiter.getBytes());
    // Claim one record at a time until the restriction's byte range is exhausted.
    while (tracker.tryClaim(reader.getStartOfNextRecord())) {
      reader.readNextRecord();
      String contents = reader.getCurrent().toStringUtf8();
      // Assign a random key within the configured key range to spread records across workers.
      String key = String.format("%s~%d", fileName, new Random().nextInt(keyRange));
      numberOfRowsRead.inc();
      c.outputWithTimestamp(KV.of(key, contents), Instant.now());
    }
  }
}

If your dataset contains a large number of small files, you might not need file chunking. For such a dataset, you can use the ContextualTextIO transform, which provides metadata such as the file name and row number that you can use for tracking and reporting purposes. The code snippets below illustrate this approach:

case CSV:
  records =
      inputFiles
          .apply(
              "SplitCSVFile",
              ParDo.of(
                  new CSVFileReaderSplitDoFn(
                      options.getKeyRange(),
                      options.getRecordDelimiter(),
                      options.getSplitSize())))
          .apply(
              "ConvertToDLPRow",
              ParDo.of(new ConvertCSVRecordToDLPRow(options.getColumnDelimiter(), header))
                  .withSideInputs(header));
  break;

@ProcessElement
public void processElement(ProcessContext context) throws IOException {
  Table.Row.Builder rowBuilder = Table.Row.newBuilder();
  Row nonTokenizedRow = context.element();
  // ContextualTextIO emits Beam Rows that carry the source file as a logical type value.
  String filename = Util.sanitizeCSVFileName(
      nonTokenizedRow.getLogicalTypeValue("resourceId", ResourceId.class).getFilename());
  String input = nonTokenizedRow.getString("value");
  LOG.debug("File Name: {} Value: {}", filename, input);
  List<String> csvHeader = context.sideInput(header);
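
The preceding snippets are excerpts from the pattern's CSV handling. As additional context, the following standalone sketch shows how a pipeline can read many small files with ContextualTextIO and combine each record with its file name; the file pattern is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.contextualtextio.ContextualTextIO;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadSmallFilesExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Each Row produced by ContextualTextIO carries the record value plus
    // metadata such as the source file, which is useful for tracking and reporting.
    PCollection<Row> records =
        pipeline.apply(
            "ReadSmallFiles", ContextualTextIO.read().from("gs://my-bucket/input/*.csv"));

    // Prefix each record with the name of the file it came from.
    records.apply(
        "PrefixValueWithFileName",
        MapElements.into(TypeDescriptors.strings())
            .via(
                row -> {
                  String fileName =
                      row.getLogicalTypeValue("resourceId", ResourceId.class).getFilename();
                  return fileName + "," + row.getString("value");
                }));

    pipeline.run();
  }
}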

Managing latency and quota

As the pipeline ingests individual records, it groups the records into small batches by using Beam's BatchRequestForDLP. The batch size is determined by a configurable threshold that you specify in bytes. When the batch size threshold is met, the grouped records are sent together to the Cloud Data Loss Prevention API in a single request. Cloud DLP then analyzes all records in the submitted batch and returns the results for that batch in a single response.

Using batches allows you to reduce the number of DLP API requests, which helps you to remain within your quota allocation. It also allows you to increase throughput and reduce overall latency by processing multiple batches in parallel.
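
For reference, the following sketch shows a pipeline step that delegates this batching to the DLPInspectText transform in Beam's beam-sdks-java-extensions-ml module, which wraps BatchRequestForDLP. The project ID, inspect template name, and batch size value are placeholders.

import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.privacy.dlp.v2.Table;
import org.apache.beam.sdk.extensions.ml.DLPInspectText;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class InspectBatchingExample {
  // Groups keyed Table.Row elements into batches of roughly 500 KB before each
  // call to the DLP API, then emits one InspectContentResponse per batch.
  static PCollection<KV<String, InspectContentResponse>> inspect(
      PCollection<KV<String, Table.Row>> rows) {
    return rows.apply(
        "InspectWithDLP",
        DLPInspectText.newBuilder()
            .setProjectId("my-project")
            .setInspectTemplateName(
                "projects/my-project/locations/global/inspectTemplates/my-template")
            .setBatchSizeBytes(500 * 1024) // batch size threshold in bytes
            .build());
  }
}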

Using the key range parameter

To balance latency with DLP API quota, use the --keyRange pipeline parameter. Increasing the --keyRange value increases parallelism in the pipeline, which can reduce job latency but also uses more resources, causing you to approach your usage limits more quickly. If you have a small number of large files, consider setting a higher value for --keyRange. If you instead have many small files, increasing --keyRange might be less effective.
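
An invocation tuned for a small number of large files might raise the key range while keeping the other defaults. The following parameter set is illustrative; the bucket name and worker count are placeholders:

--filePattern=gs://<myBucket>/large_file.csv --DLPMethod=INSPECT --keyRange=1000 --numWorkers=5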

For example, when we tested the reference pattern, we processed a 400 MB CSV file containing 2 million rows, using the default pipeline settings of 5 workers, a 0.5 MB batch size threshold, and a key range of 100. The screenshot below shows that the pipeline remained within the project's DLP API quota of 10,000 calls per minute and successfully processed the file in less than 10 minutes:

Graph showing pipeline throughput compared to quota.

We then processed the same data faster by increasing the key range parameter to 1000. This required a maximum of 800 API calls (400 MB / 0.5 MB = 800), which was still well below the project quota of 10,000 API calls per minute. In the screenshot below, you can see that the same data was processed in less than a minute:

Graph showing pipeline throughput when the key range is increased to 1000.

The results of this benchmark analysis depend on the data that was processed and the Cloud DLP configuration that was used. You might have to adjust other pipeline parameters available in the reference pattern to fine-tune performance for your specific requirements.

Using the batch size parameter

Cloud DLP provides a set of built-in information types called infoTypes. Some infoType detectors, such as DATE, DATE_OF_BIRTH, FIRST_NAME, LAST_NAME, and LOCATION, can increase operation latency. For better performance when processing requests that contain these latency-sensitive infoTypes, reduce the --batchSize parameter value. A smaller batch size ensures that Cloud DLP processes the requests in small, distributed chunks; the trade-off is that the overall processing takes longer.

For example, inspecting 1 TB of data for non-latency-sensitive infoTypes using a batch size of 500 KB per payload takes about 8 minutes. Inspecting the same data for latency-sensitive infoTypes using a batch size of 50 KB per payload takes about 15 minutes.
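
For an inspection run that targets latency-sensitive infoTypes, you might pass a smaller batch size. The following parameter set is illustrative and assumes that the --batchSize value is specified in bytes, as described earlier:

--filePattern=gs://<myBucket>/file.csv --DLPMethod=INSPECT --batchSize=50000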

Executing the pipeline using a Flex Template

The reference pattern uses a Dataflow Flex Template so that you can configure the pipeline's structure at runtime. This allows you to reuse the same pipeline for different purposes. The following examples show the parameters you can pass to trigger the pipeline for different use cases; a sample launch command follows the list:

  • To inspect a CSV file in an Amazon S3 bucket:

    --filePattern=s3://<myBucket>/file.csv --DLPMethod=INSPECT --awsRegion=<myAWSRegion> --awsCredentialsProvider=$AWS_CRED
    
  • To de-identify data in an Avro file in a Cloud Storage bucket:

    --filePattern=gs://<myBucket>/file.avro --DLPMethod=DEID
    
  • To re-identify data in BigQuery:

    --tableRef=<myProjectId>:<myDataset>.<myTable> --queryPath=gs://<myBucket>/reid_query.sql
    
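These parameters are passed when you launch the Flex Template. The following sample launch command is illustrative; the job name, project, region, and template spec path are placeholders:

gcloud dataflow flex-template run "dlp-migration-job" \
    --project=<myProjectId> \
    --region=<myRegion> \
    --template-file-gcs-location=gs://<myTemplateBucket>/<templateSpec>.json \
    --parameters="filePattern=gs://<myBucket>/file.avro,DLPMethod=DEID"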

Handling inspection accuracy

When doing inspection, it is important to monitor the accuracy of your results by validating the number of findings against your input data. Although the built-in Cloud DLP infoTypes provide sufficient accuracy in most cases, you can further improve the results by using custom infoTypes.

One example of when you might want to do this comes from the financial services industry. Some organizations in that sector need to classify credit card numbers as internal or external during inspection. To address this, you can create custom infoType detectors that use a regular expression (regex) to detect matches based on a pattern for a given set of credit card numbers. Additionally, you can use inspection rulesets to customize both built-in and custom infoType detectors by using context rules. In this example, you could use exclusion rules to do the following (a configuration sketch follows this list):

  • Inspect valid credit card numbers for a given set of card ranges, for example, CardType#1 and CardType#2. You can do this by combining custom regex-based infoType detectors with the built-in CREDIT_CARD_NUMBER infoType detector, using inverse match:

    Shows the inspection ruleset to do an inverse match.

  • Reduce the number of false positives caused by overlapping custom regex-based infoType detectors. You can do this by using the built-in CREDIT_CARD_NUMBER infoType detector with full match to identify cards that don't belong to the given set of card ranges:

    Shows the inspection ruleset to do a full match.
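
The following sketch shows how such a configuration can be expressed with the Cloud DLP Java client library. The CARD_TYPE_1 custom infoType and its regex are hypothetical; the exclusion rule drops built-in CREDIT_CARD_NUMBER findings that fully match the custom detector, so in-range cards surface only under the custom infoType.

import com.google.privacy.dlp.v2.CustomInfoType;
import com.google.privacy.dlp.v2.CustomInfoType.Regex;
import com.google.privacy.dlp.v2.ExcludeInfoTypes;
import com.google.privacy.dlp.v2.ExclusionRule;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.InspectionRule;
import com.google.privacy.dlp.v2.InspectionRuleSet;
import com.google.privacy.dlp.v2.MatchingType;

public class CardRangeInspectConfig {
  static InspectConfig buildInspectConfig() {
    // Hypothetical custom detector for an internal card range; the regex is a placeholder.
    InfoType cardType1 = InfoType.newBuilder().setName("CARD_TYPE_1").build();
    CustomInfoType customCard =
        CustomInfoType.newBuilder()
            .setInfoType(cardType1)
            .setRegex(Regex.newBuilder().setPattern("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9]{4}"))
            .build();

    // Exclude built-in CREDIT_CARD_NUMBER findings that fully match the custom detector,
    // so each card is reported either as CARD_TYPE_1 or as an external CREDIT_CARD_NUMBER.
    InspectionRuleSet ruleSet =
        InspectionRuleSet.newBuilder()
            .addInfoTypes(InfoType.newBuilder().setName("CREDIT_CARD_NUMBER"))
            .addRules(
                InspectionRule.newBuilder()
                    .setExclusionRule(
                        ExclusionRule.newBuilder()
                            .setExcludeInfoTypes(
                                ExcludeInfoTypes.newBuilder().addInfoTypes(cardType1))
                            .setMatchingType(MatchingType.MATCHING_TYPE_FULL_MATCH)))
            .build();

    return InspectConfig.newBuilder()
        .addInfoTypes(InfoType.newBuilder().setName("CREDIT_CARD_NUMBER"))
        .addCustomInfoTypes(customCard)
        .addRuleSet(ruleSet)
        .build();
  }
}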

In pipelines where you expect to have dense Cloud DLP findings, you can use a custom Counter metric to identify truncated findings from the Cloud DLP inspection content response. The code snippet below illustrates this approach:

@ProcessElement
public void processElement(
    @Element KV<String, InspectContentResponse> element, MultiOutputReceiver out) {
  String fileName = element.getKey().split("\\~")[0];
  String timeStamp = Util.getTimeStamp();

  // Increment the custom Counter metric whenever Cloud DLP reports truncated findings.
  if (element.getValue().getResult().getFindingsTruncated()) {
    numberofTimesFindingsTruncated.inc();
  }

Use the custom Counter metric to see how often findings are being truncated, and reduce the pipeline batch size if necessary to minimize this.
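
A sketch of how such a counter can be defined and read after the job finishes, using Beam's Metrics API; the class and metric names mirror the snippet above but are otherwise placeholders:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class TruncatedFindingsMetric {
  // Custom counter, incremented whenever an InspectContentResponse reports truncated findings.
  static final Counter numberofTimesFindingsTruncated =
      Metrics.counter(TruncatedFindingsMetric.class, "numberofTimesFindingsTruncated");

  // After the pipeline finishes, query the counter to decide whether to lower the batch size.
  static void logTruncatedFindings(PipelineResult result) {
    MetricQueryResults metrics =
        result
            .metrics()
            .queryMetrics(
                MetricsFilter.builder()
                    .addNameFilter(
                        MetricNameFilter.named(
                            TruncatedFindingsMetric.class, "numberofTimesFindingsTruncated"))
                    .build());
    metrics
        .getCounters()
        .forEach(c -> System.out.println(c.getName() + ": " + c.getAttempted()));
  }
}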

What's next