This tutorial shows you how to run an automated Dataflow pipeline to de-identify the sample dataset used in the Creating Cloud DLP de-identification transformation templates for PII datasets tutorial. The sample dataset contains large-scale personally identifiable information (PII).
This document is part of a series:
- De-identification and re-identification of PII in large-scale datasets using Cloud DLP
- Creating Cloud DLP de-identification transformation templates for PII datasets
- Running an automated Dataflow pipeline to de-identify a PII dataset (this document)
- Validate PII dataset in BigQuery and re-identify using an automated pipeline
This tutorial assumes that you are familiar with shell scripting and Dataflow pipelines.
Reference architecture
This tutorial demonstrates the data de-identification pipeline that is illustrated in the following diagram.
The data de-identification streaming pipeline de-identifies sensitive data in text content using Dataflow. You can reuse the pipeline for multiple transformations and use cases.
Objectives
- Trigger and monitor the Dataflow pipeline to de-identify a sample dataset.
- Understand the code behind the pipeline.
Costs
This tutorial uses the following billable components of Google Cloud:
- Dataflow
- Cloud Storage
- Cloud DLP
- BigQuery
To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.
Before you begin
- Complete part 2 of the series.
Reviewing the pipeline parameters
This tutorial uses a Dataflow pipeline developed with the Apache Beam Java SDK. To help you repeatedly solve common data-related tasks at scale, Dataflow provides a set of Google-provided templates. If you use these templates, you don't need to write or maintain any pipeline code. In this tutorial, you trigger the automated Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP) pipeline with the following parameters.
Pipeline parameter | Value | Notes |
---|---|---|
numWorkers | 5 | Set by default |
maxNumWorkers | 10 | |
machineType | n1-standard-4 | |
pollingInterval | 30 seconds | |
windowInterval | 30 seconds | |
inputFilePattern | gs://${DATA_STORAGE_BUCKET}/CCRecords_*.csv | Created during part 2 of the series. |
deidentifyTemplateName | ${DEID_TEMPLATE_NAME} | |
inspectTemplateName | ${INSPECT_TEMPLATE_NAME} | |
datasetName | deid_dataset | |
batchSize | 500 | For a total of 100,000 records with a batch size of 500, there are 200 API calls in parallel. By default, the batch size is set to 100. |
dlpProjectId | ${PROJECT_ID} | For this tutorial, your default Google Cloud project is used. |
jobId | my-deid-job | The Dataflow job ID |
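These values are passed to the template as runtime options. The following is a minimal sketch of how such options are typically declared in an Apache Beam template; the interface and option names here are illustrative assumptions, not the template's actual source code.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.ValueProvider;

// Illustrative options interface; the real Google-provided template defines its own.
public interface DeidPipelineOptions extends DataflowPipelineOptions {

  @Description("File pattern to poll, for example gs://bucket/CCRecords_*.csv")
  ValueProvider<String> getInputFilePattern();
  void setInputFilePattern(ValueProvider<String> value);

  @Description("Cloud DLP de-identify template name")
  ValueProvider<String> getDeidentifyTemplateName();
  void setDeidentifyTemplateName(ValueProvider<String> value);

  @Description("Cloud DLP inspect template name")
  ValueProvider<String> getInspectTemplateName();
  void setInspectTemplateName(ValueProvider<String> value);

  @Description("BigQuery dataset for the de-identified output tables")
  ValueProvider<String> getDatasetName();
  void setDatasetName(ValueProvider<String> value);

  @Description("Number of rows batched into a single DLP API request")
  ValueProvider<Integer> getBatchSize();
  void setBatchSize(ValueProvider<Integer> value);
}
```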
Running the pipeline
In Cloud Shell, set up application default credentials.
```bash
gcloud auth activate-service-account \
    ${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
    --key-file=service-account-key.json --project=${PROJECT_ID}

export GOOGLE_APPLICATION_CREDENTIALS=service-account-key.json
```
Run the pipeline:
```bash
export JOB_ID=my-deid-job

gcloud dataflow jobs run ${JOB_ID} \
    --gcs-location gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery \
    --region ${REGION} \
    --parameters \
    "inputFilePattern=gs://${DATA_STORAGE_BUCKET}/CCRecords_1564602825.csv,dlpProjectId=${PROJECT_ID},deidentifyTemplateName=${DEID_TEMPLATE_NAME},inspectTemplateName=${INSPECT_TEMPLATE_NAME},datasetName=deid_dataset,batchSize=500"
```
To monitor the pipeline, in the Google Cloud Console, go to the Dataflow page.

Click the job ID (my-deid-job) to see the job graph.

To validate the amount of data processed by the pipeline, click Process Tokenized Data. The number of elements added for the DLP-Tokenization transformation is 200, and the number of elements added for the Process Tokenized Data transformation is 100,000.

To validate the total number of records inserted into the BigQuery tables, click Write To BQ. The number of elements added for the Process Tokenized Data transformation is 100,000.
Handling exceptions in the pipeline
The DLP API has a default limit of 600 API calls per minute. The pipeline processes requests in parallel based on the batch size that you specify.

The pipeline is configured to use a maximum of ten n1-standard-4 workers. If you need to process a large dataset faster than the default configuration of the Google-provided template allows, you can customize the pipeline to update the number of workers and the machine type. If you increase the number of workers, you might also need to increase the default quota for the number of vCPUs, in-use IP addresses, and SSDs in your Cloud project.
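If you run the open-source version of the pipeline yourself instead of the hosted template, you can set the worker configuration programmatically. The following is a minimal sketch under that assumption; it shows only the Dataflow runner options, not the full pipeline.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public final class WorkerConfigSketch {
  public static void main(String[] args) {
    // Parse command-line flags, then override the worker settings.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    options.setNumWorkers(5);                 // workers the job starts with
    options.setMaxNumWorkers(10);             // autoscaling ceiling
    options.setWorkerMachineType("n1-standard-4");

    // The rest of the pipeline definition would follow here.
    Pipeline pipeline = Pipeline.create(options);
    pipeline.run();
  }
}
```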
Reviewing the pipeline code
The full pipeline code is in the GitHub repository.
The pipeline uses a built-in Beam FileIO transform to poll for new files every 30 seconds, the interval configured for the automated version of the pipeline. The pipeline continuously looks for new files until it's stopped or terminated.
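The following is a minimal sketch of that polling pattern, assuming a Pipeline object named pipeline and the inputFilePattern option from the parameter table; it is not the template's exact source.

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Match the input file pattern every 30 seconds and keep watching
// indefinitely; each newly arriving file becomes a ReadableFile element.
PCollection<FileIO.ReadableFile> files =
    pipeline
        .apply(
            "Poll Input Files",
            FileIO.match()
                .filepattern(options.getInputFilePattern())
                .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
        .apply("Read Matched Files", FileIO.readMatches());
```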
All of the files that the pipeline reads can share the same header. Instead of creating the header for every element, the pipeline uses a design pattern known as a side input: the header is created only once per window and is passed as an input to the other transformations.
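A minimal sketch of that side-input pattern is shown below. The csvLines collection, the isHeaderLine check, and the DlpTokenizeFn DoFn are illustrative placeholders, not names from the template.

```java
import java.util.List;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the header lines once per window as a side input...
PCollectionView<List<String>> headerView =
    csvLines
        .apply("Keep Header Lines", Filter.by((String line) -> isHeaderLine(line)))
        .apply("Header As Side Input", View.asList());

// ...and read the header from the side input inside the tokenization DoFn,
// instead of attaching a copy of the header to every record.
csvLines.apply(
    "Tokenize With DLP",
    ParDo.of(new DlpTokenizeFn(headerView)).withSideInputs(headerView));
```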
Cloud DLP has a maximum payload size of 512 KB per API request and a default quota of 600 API calls per minute. To manage these limits, the pipeline uses the user-defined batch size as a parameter. For example, the sample dataset has 100,000 rows; with a batch size of 500, the pipeline makes 200 DLP API calls in parallel, assuming that each request doesn't exceed the maximum payload size. A lower batch size results in a higher number of API calls, which might cause a quota resource exception. If you need to increase the quota limit, see Quota increases.

The pipeline uses a built-in BigQuery IO transform connector to write to BigQuery. This connector automatically creates a BigQuery table and schema by using the dynamic destinations feature. To achieve low latency, the pipeline also uses BigQuery streaming inserts.
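A minimal sketch of that write step is shown below, assuming a PCollection<TableRow> named tokenizedRows, a hypothetical tableNameFor helper that derives a table name from a row, and a schemaView side input that maps each destination table to its JSON schema; the template's actual code differs.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Route each row to a table chosen at runtime, create missing tables,
// and stream the rows in for low latency.
tokenizedRows.apply(
    "Write To BQ",
    BigQueryIO.writeTableRows()
        .to(
            (ValueInSingleWindow<TableRow> row) ->
                new TableDestination(
                    "deid_dataset." + tableNameFor(row.getValue()), // hypothetical helper
                    "De-identified records"))
        .withSchemaFromView(schemaView) // side input: table spec -> JSON schema
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withMethod(Method.STREAMING_INSERTS));
```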
You have successfully completed this tutorial. You've triggered an automated de-identification pipeline using Cloud DLP templates to process a large dataset. In the next tutorial, you validate the de-identified dataset in BigQuery and re-identify the data using another Dataflow pipeline.
Cleaning up
If you don't intend to continue with the tutorials in the series, the easiest way to eliminate billing is to delete the Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.
Delete the project
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Validating de-identified data in BigQuery and re-identifying PII data.
- Creating Cloud DLP de-identification transformation templates for PII datasets.
- De-identification and re-identification of PII in large-scale datasets using Cloud DLP.
- Review the sample code in the Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP repo on GitHub.
- Learn about other pattern recognition solutions.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.