Redacting confidential data

This tutorial shows you how to use the Cloud Data Fusion plugin for Cloud DLP to redact sensitive data.

Scenario

Consider the following scenario, in which some sensitive customer information needs to be redacted:

Your support team documents the details of each support case they handle in a support ticket. All of the information in the support tickets is pulled into a CSV file. The support technicians are not supposed to document any customer information that's considered sensitive, but sometimes they mistakenly do so. You notice that some customers' phone numbers appear in the CSV file.

You want to go through the CSV file and hide all phone numbers. You create a Cloud Data Fusion pipeline that redacts the sensitive customer data by using the Cloud DLP plugin.

In this tutorial, you create a pipeline that does the following:

  • Redacts customer phone numbers by masking them with the # character.
  • Stores the masked sensitive data and the non-sensitive data in a Cloud Storage bucket.

Objectives

  • Connect Cloud Data Fusion to a Cloud Storage source.
  • Deploy the Cloud DLP plugin.
  • Create a custom Cloud DLP template.
  • Use the Redact transform plugin to mask sensitive customer data.
  • Write the output data to Cloud Storage.

Costs

This tutorial uses billable components of Google Cloud.

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Data Fusion, Cloud Storage, BigQuery, and Cloud Dataproc APIs.

    Enable the APIs

  5. Create a Cloud Data Fusion instance.

Get Cloud DLP permissions

  1. In the Cloud Console, go to the IAM page.

    Open the IAM page

  2. In the permissions table, in the Member column, find the service account that matches the format service-project-number@gcp-sa-datafusion.iam.gserviceaccount.com, where project-number is the number of your Google Cloud project.

  3. Click the Edit button to the right of the service account.

  4. Click Add Another Role.

  5. Click the dropdown that appears.

  6. Use the search bar to search for DLP Administrator, and then select it.

  7. Click Save.

  8. Check that DLP Administrator appears in the Role column.
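The role grant in the steps above can be pictured as an IAM policy binding. The following Python sketch only builds the member string and the binding as plain data; it doesn't call any Google Cloud API. The project number 123456789012 is a made-up placeholder, and roles/dlp.admin is assumed to be the role ID behind the DLP Administrator display name.

```python
def datafusion_service_account(project_number: str) -> str:
    """Return the Cloud Data Fusion service account email for a project number."""
    return f"service-{project_number}@gcp-sa-datafusion.iam.gserviceaccount.com"


def dlp_admin_binding(project_number: str) -> dict:
    """Build an IAM-style binding that grants DLP Administrator to that account."""
    member = "serviceAccount:" + datafusion_service_account(project_number)
    # Assumption: roles/dlp.admin is the role ID shown as "DLP Administrator".
    return {"role": "roles/dlp.admin", "members": [member]}


# 123456789012 is a hypothetical project number used only for illustration.
binding = dlp_admin_binding("123456789012")
print(binding["members"][0])
```

Comparing the printed member with the Member column on the IAM page is a quick way to confirm that you edited the right service account.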

When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI. In the Cloud Console, you can create a Google Cloud project and create and delete Cloud Data Fusion instances. In the Cloud Data Fusion UI, you can use the various pages, such as Studio or Wrangler, to use Cloud Data Fusion features.

  1. In the Cloud Console, open the Instances page.

    Open the Instances page

  2. In the Actions column for the instance, click the View Instance link. The Cloud Data Fusion UI opens in a new browser tab.

Create the pipeline

Create a pipeline that redacts sensitive customer data. The pipeline you build does the following:

  • Reads the input data using the Cloud Storage source plugin.
  • Deploys the Cloud DLP plugin from the Hub and applies the Redact transform plugin.
  • Writes the output data using a Cloud Storage sink plugin.

Load the customer data

This tutorial uses the input dataset, CallCenterRecords.csv, provided in a publicly available Cloud Storage bucket.

  1. In the Cloud Data Fusion UI, click the menu and navigate to the Studio page.

  2. In the Source menu, click the GCS plugin.

  3. Hold the pointer over the GCS node that appears and click Properties.

  4. Under Reference Name, enter a reference name.

  5. Under Path, enter gs://datafusion-sample-datasets/CallCenterRecords.csv.

  6. Under Format, select csv.

  7. Under Output Schema, under Name, enter the following field names, clicking the add button to add each one:

    • Date
    • Bank
    • State
    • Zip
    • Notes

  8. Make sure all fields are of type string. To change a field's type, click Type and select String from the dropdown.

  9. Check the Null box for each field. This ensures that the pipeline doesn't fail when it encounters a null (empty) value.

  10. Click Validate to ensure that there are no errors.

  11. Click the X button in the upper-right corner of the dialog box.
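To make the output schema from the steps above concrete, here is a minimal Python sketch that reads CSV text into records with the same five fields, all treated as nullable strings. The sample rows are invented for illustration (the real contents of CallCenterRecords.csv are not reproduced here), and the sketch assumes the file has no header row.

```python
import csv
import io
from typing import Dict, List, Optional

FIELDS = ["Date", "Bank", "State", "Zip", "Notes"]

# Hypothetical sample rows; not taken from CallCenterRecords.csv.
SAMPLE = (
    "2020-01-15,Example Bank,CA,94043,Customer called from 555-010-0199\n"
    "2020-01-16,Example Bank,NY,,\n"
)


def read_records(text: str) -> List[Dict[str, Optional[str]]]:
    """Parse CSV text into records whose fields are nullable strings."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    records = []
    for row in reader:
        # Treat empty cells as null, mirroring the Null checkbox in the schema.
        records.append({name: (row[name] or None) for name in FIELDS})
    return records


records = read_records(SAMPLE)
print(records[1])
```

The second sample row has empty Zip and Notes cells. Marking the fields as nullable is what lets a row like this pass through instead of failing the pipeline.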

Redact sensitive data

The Redact transform plugin identifies sensitive records in your input stream of data and applies transformations that you define to those records. A record of data is considered sensitive if it matches pre-defined Cloud DLP filters you choose or a custom template you define.

In this tutorial, you want to redact customer phone numbers that some support technicians on your team accidentally recorded. They entered the sensitive information in the Notes section of the support tickets, which appears as the Notes column in the CSV file. You create a custom Cloud DLP template, and then provide the template ID in the properties menu of the Redact transform plugin.
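As a rough picture of what masking does to a Notes value, the following Python sketch replaces every character of a phone-number-like match with the # character. The regular expression is a simplified stand-in chosen for this example. It is not how Cloud DLP detects phone numbers, and the function name is hypothetical.

```python
import re

# Simplified pattern for numbers such as 555-010-0199 or (555) 010-0199.
# Illustrative stand-in only; Cloud DLP's PHONE_NUMBER detection differs.
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")


def mask_phone_numbers(text: str, mask_char: str = "#") -> str:
    """Replace every character of each matched phone number with mask_char."""
    return PHONE_PATTERN.sub(lambda m: mask_char * len(m.group(0)), text)


print(mask_phone_numbers("Call 555-010-0199 back"))
```

The masked value keeps the length of the original match, so the surrounding note text stays readable while the number itself is hidden.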

Deploy the Cloud DLP plugin

  1. In the Cloud Data Fusion UI, click Hub in the upper right.

  2. Click the Data Loss Prevention plugin.

  3. Click Deploy.

  4. Click Finish.

  5. Click the X button in the upper-right corner of the Cloud DLP | Deploy dialog box.

  6. Click the X button to exit the Hub.

Create a custom template

  1. In the Cloud Console, open Cloud DLP.

    Open the Cloud DLP page

  2. From the Create menu, choose Template.

  3. Under Define template, in the Template ID field, enter an ID for your template. You will need the template ID later in the tutorial.

  4. Click Continue.

  5. Under Configure detection, click Manage infotypes.

  6. In the Built-in tab, use the filter to search for "phone number".

  7. Select PHONE_NUMBER.

  8. Click Done.

  9. Click Create.

Learn more about creating Cloud DLP templates.
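For readers who prefer to see the template as data, the following Python sketch builds a request body that describes the same template: one built-in infoType, PHONE_NUMBER, under a template ID. The field names (templateId, inspectTemplate, inspectConfig, infoTypes) are assumed to follow the Cloud DLP REST API for creating inspect templates; check them against the API reference before relying on them. The template ID shown is a placeholder.

```python
import json
from typing import List


def build_inspect_template_request(template_id: str, info_types: List[str]) -> dict:
    """Build a request body that mirrors the template created in the console."""
    return {
        "templateId": template_id,
        "inspectTemplate": {
            "inspectConfig": {
                "infoTypes": [{"name": name} for name in info_types],
            },
        },
    }


# "phone-redaction-template" is a hypothetical ID; use the one you entered.
request_body = build_inspect_template_request(
    "phone-redaction-template", ["PHONE_NUMBER"]
)
print(json.dumps(request_body, indent=2))
```

The sketch only prints JSON and doesn't send anything to Cloud DLP.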

Apply the Redact transform

  1. Back in the Cloud Data Fusion UI, on the Studio page, click to expand the Transform menu.

  2. Click the Redact transform plugin.

  3. Drag a connection arrow from the GCS node to the Redact node.

  4. Hold the pointer over the Redact node and click Properties.

    1. Set Custom Template to Yes.

    2. Under Template ID, enter the template ID of the custom template you created.

    3. Under Matching, apply Masking on Custom template within Notes.

    4. Under Masking Character, enter #.

    5. Click Validate to ensure that there are no errors.

    6. Click the X button in the upper-right corner of the dialog box.
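The properties set in the steps above scope the masking to a single field. The following Python sketch illustrates that idea under stated assumptions: RedactSettings and apply_redact are hypothetical names, not part of the plugin, and the digit_runs detector is a toy replacement for the matches that Cloud DLP would report.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) offsets of one sensitive match


@dataclass
class RedactSettings:
    """Hypothetical stand-in for the Redact properties set in the steps above."""
    template_id: str
    field: str = "Notes"
    masking_character: str = "#"


def apply_redact(record: Dict[str, Optional[str]], settings: RedactSettings,
                 find_spans: Callable[[str], List[Span]]) -> Dict[str, Optional[str]]:
    """Mask the spans reported by find_spans, but only inside settings.field."""
    value = record.get(settings.field)
    if not value:
        return dict(record)  # null or empty field: nothing to mask
    chars = list(value)
    for start, end in find_spans(value):
        chars[start:end] = settings.masking_character * (end - start)
    return {**record, settings.field: "".join(chars)}


def digit_runs(text: str) -> List[Span]:
    """Toy detector standing in for Cloud DLP: flags runs of 7 or more digits."""
    return [m.span() for m in re.finditer(r"\d{7,}", text)]


settings = RedactSettings(template_id="phone-redaction-template")
row = {"Zip": "94043000", "Notes": "Reach me at 5550100199"}
result = apply_redact(row, settings, digit_runs)
print(result)
```

The Zip value also contains a long digit run, but it passes through unchanged because only the Notes field is selected for masking.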

Store the output data

Store the results of your pipeline in a Cloud Storage file.

  1. In the Cloud Data Fusion UI, on the Studio page, click to expand the Sink menu.

  2. Click GCS.

  3. Drag a connection arrow from the Redact node to the GCS2 node.

  4. Hold the pointer over the GCS2 node and click Properties.

    1. Under Reference Name, enter a reference name.

    2. Under Path, enter the path of a Cloud Storage bucket where you'd like to store the pipeline results. Cloud Data Fusion creates the Cloud Storage bucket for you. Make sure you follow the bucket naming guidelines.

    3. Under Format, select CSV.

    4. Click Validate to ensure that there are no errors.

    5. Click the X button in the upper-right corner of the dialog box.
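Before you enter the sink path, it can help to check the bucket name locally. The following Python sketch extracts the bucket from a gs:// path and checks a subset of the bucket naming guidelines: length, allowed characters, and the restrictions on "goog" and "google". It is not a complete validator, and the function names and the example path are hypothetical.

```python
import re

# Lowercase letters, digits, dashes, underscores, and dots; must start and end
# with a letter or digit. Length is checked separately.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]*[a-z0-9]$")


def bucket_from_path(path: str) -> str:
    """Extract the bucket name from a gs://bucket/optional/prefix path."""
    if not path.startswith("gs://"):
        raise ValueError("path must start with gs://")
    return path[len("gs://"):].split("/", 1)[0]


def is_plausible_bucket_name(name: str) -> bool:
    """Check a subset of the bucket naming guidelines (not the full list)."""
    if not 3 <= len(name) <= 63:
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return bool(_BUCKET_RE.match(name))


# Hypothetical sink path used only for illustration.
bucket = bucket_from_path("gs://my-redacted-output/results")
print(bucket, is_plausible_bucket_name(bucket))
```

A name that passes this check can still be rejected, for example because bucket names must be globally unique.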

Run the pipeline in preview mode

Run the pipeline in preview mode before you deploy it.

  1. Click Preview, and then click Run.

    The Run button shows the pipeline status: it first reads Starting, changes to Stop while the preview runs, and returns to Run when the preview finishes.

  2. When the preview run completes, on the Redact node, click Preview Data to see a side-by-side comparison of the input and output data. Check that phone numbers have been masked with the # character.

Redact another data type

While examining the preview run results, you notice that there's still sensitive information that appears in the Notes column: email addresses. You go back and edit the Cloud DLP template to redact email addresses as well.

  1. In the Cloud Console, go to the Cloud DLP page.

    Open the Cloud DLP page

  2. In the Configuration tab, select your template.

  3. Click Edit.

  4. Click Manage infotypes.

  5. In the Built-in tab, use the filter to search for "phone number" OR "email address".

  6. Select all and click Done.

  7. Click Save.

  8. Once again, run your pipeline in preview mode. Cloud Data Fusion will automatically use the updated Cloud DLP template.

  9. Check that both phone numbers and email addresses have been masked with the # character.
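Editing the template amounts to adding one more infoType to its detection settings. The following Python sketch shows that change on a plain dictionary. The add_info_types helper is hypothetical, the dictionary shape is assumed to match the infoTypes list of a Cloud DLP inspect configuration, and EMAIL_ADDRESS is assumed to be the built-in infoType that the "email address" filter selects.

```python
from typing import Dict, List

InspectConfig = Dict[str, List[Dict[str, str]]]


def add_info_types(inspect_config: InspectConfig, names: List[str]) -> InspectConfig:
    """Return a copy of inspect_config with the given infoTypes added once each."""
    merged = [entry["name"] for entry in inspect_config.get("infoTypes", [])]
    for name in names:
        if name not in merged:
            merged.append(name)
    return {**inspect_config, "infoTypes": [{"name": name} for name in merged]}


config = {"infoTypes": [{"name": "PHONE_NUMBER"}]}
updated = add_info_types(config, ["EMAIL_ADDRESS", "PHONE_NUMBER"])
print(updated)
```

Because the helper skips names that are already present, selecting PHONE_NUMBER again alongside EMAIL_ADDRESS doesn't create a duplicate entry.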

Deploy and run the pipeline

  1. Make sure Preview mode is unchecked.

  2. Click Save. When prompted, name your pipeline, and then click OK.

  3. Click Deploy.

  4. When deployment completes, click Run. Running your pipeline can take a few minutes. While you wait, you can observe the Status of the pipeline transition from Provisioning to Starting to Running to Deprovisioning to Succeeded.

View the results

  1. In the Cloud Console, go to the Cloud Storage page.

    Open the Cloud Storage page

  2. In the Storage browser, navigate to the sink Cloud Storage bucket you specified in the sink Cloud Storage plugin properties.

  3. In Link URL, click the link to download the CSV file with the results. Check that the phone numbers and email addresses have been masked with the # character.
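If the output file is large, checking it by eye is tedious. The following Python sketch scans CSV text and reports rows whose Notes column still contains something that looks like a phone number or an email address. The patterns are deliberately simple and are not equivalent to Cloud DLP detection. The sketch assumes Notes is the fifth column, and the sample rows are invented.

```python
import csv
import io
import re

# Simplified patterns for illustration; they will miss some real-world formats.
PHONE_RE = re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_unmasked_rows(csv_text: str, notes_index: int = 4) -> list:
    """Return 1-based row numbers whose Notes column still looks sensitive."""
    flagged = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        notes = row[notes_index] if len(row) > notes_index else ""
        if PHONE_RE.search(notes) or EMAIL_RE.search(notes):
            flagged.append(row_number)
    return flagged


# Hypothetical rows: the first is fully masked, the second leaks an email address.
masked = "2020-01-15,Example Bank,CA,94043,Call ############ or ################\n"
leaky = "2020-01-16,Example Bank,NY,10001,Email pat@example.com\n"
print(find_unmasked_rows(masked + leaky))
```

An empty result means that these simple patterns found nothing; it doesn't prove that the file is free of sensitive data.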

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the Cloud Data Fusion instance or delete the project, as described in the following sections.

Delete the Cloud Data Fusion instance

Follow these instructions to delete your Cloud Data Fusion instance.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.

  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next