Automating the Classification of Data Uploaded to Cloud Storage

This tutorial shows how to implement an automated data quarantine and classification system using Cloud Storage and other Google Cloud Platform (GCP) products. The tutorial assumes that you are familiar with GCP and basic shell programming.

In every organization, the people who protect data and ensure that it is classified and treated appropriately face an ever-increasing amount of data. These data protection officers find that quarantining and classifying data can be complicated and time-consuming, especially when hundreds or thousands of files arrive each day.

What if you could take each file, upload it to a quarantine location, and have it automatically classified and moved to the appropriate location based on the classification result? This tutorial shows you how to implement such a system by using Cloud Functions, Cloud Storage, and the Cloud Data Loss Prevention API.

Objectives

  • Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
  • Create a Cloud Pub/Sub topic and subscription to notify you when file processing is completed.
  • Create a simple Cloud Function that invokes the DLP API when files are uploaded.
  • Upload some sample files to the quarantine bucket to invoke the Cloud Function. The function uses the DLP API to inspect and classify the files and move them to the appropriate bucket.

Costs

This tutorial uses billable GCP components, including:

  • Cloud Storage
  • Cloud Functions
  • Cloud Data Loss Prevention API

You can use the Pricing Calculator to generate a cost estimate based on your projected usage.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Functions, Cloud Storage, and Cloud Data Loss Prevention APIs.

    Enable the APIs

Granting permissions to service accounts

Your first step is to grant permissions to two service accounts: the Cloud Functions service account and the DLP service account.

Grant permissions to the Cloud Functions service account

  1. In the GCP Console, open the IAM & Admin page and select the project you created:

    GO TO THE IAM & ADMIN PAGE

  2. Locate the Cloud Functions service account. This account has the format [PROJECT_ID]@appspot.gserviceaccount.com. Replace [PROJECT_ID] with your project ID.

  3. From the Role(s) drop-down list next to [PROJECT_ID]@appspot.gserviceaccount.com, select the following roles:

    • Project > Owner
    • Cloud DLP > DLP Administrator
    • Service Management > DLP API service agent
  4. Click Save.

Grant permissions to the DLP service account

  1. In the GCP Console, open the IAM & Admin page and select the project you created:

    GO TO THE IAM & ADMIN PAGE

  2. Locate the DLP API Service Agent service account. This account has the format service-[PROJECT_NUMBER]@dlp-api.iam.gserviceaccount.com. Replace [PROJECT_NUMBER] with your project number.

  3. From the Role(s) drop-down list next to service-[PROJECT_NUMBER]@dlp-api.iam.gserviceaccount.com, select the Project > Viewer role, and then click Save.
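
The two service account addresses follow the fixed formats given in the steps above. A minimal sketch of deriving them from your project ID and project number; the function names are hypothetical helpers and the sample values are placeholders, not real identifiers:

```javascript
// Hypothetical helpers (not part of the tutorial's code). The address
// formats are the ones given in the steps above.
function cloudFunctionsServiceAccount(projectId) {
  return `${projectId}@appspot.gserviceaccount.com`;
}

function dlpServiceAgent(projectNumber) {
  return `service-${projectNumber}@dlp-api.iam.gserviceaccount.com`;
}

// Placeholder values for illustration only.
console.log(cloudFunctionsServiceAccount('my-project'));
console.log(dlpServiceAgent('123456789012'));
```

Note that the first address is built from the project ID and the second from the project number, which are different values.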

Building the quarantine and classification pipeline

In this section, you build the quarantine and classification pipeline shown in the following diagram.

[Diagram: Quarantine and classification workflow]

The numbers in this pipeline correspond to these steps:

  1. You upload files to Cloud Storage.
  2. The upload invokes a Cloud Function.
  3. The Cloud Data Loss Prevention API inspects and classifies the data.
  4. The file is moved to the appropriate bucket.
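
Steps 3 and 4 come down to one routing decision: a file with any findings goes to the sensitive bucket, and a file with none goes to the non-sensitive bucket. A minimal sketch of that decision, assuming a hypothetical helper named chooseDestinationBucket that receives the infoTypeStats array from a finished inspection job:

```javascript
// Hypothetical helper (not part of the tutorial's code).
// infoTypeStats is assumed to be the array of per-infoType counts that a
// finished DLP inspection job reports.
function chooseDestinationBucket(infoTypeStats, sensitiveBucket, nonSensitiveBucket) {
  // Only an explicit, empty result counts as "nothing found". Anything
  // else (findings present, or a missing result) is treated as sensitive.
  const noFindings = Array.isArray(infoTypeStats) && infoTypeStats.length === 0;
  return noFindings ? nonSensitiveBucket : sensitiveBucket;
}

// Bucket names are placeholders for illustration only.
const findings = [{ infoType: { name: 'EMAIL_ADDRESS' }, count: 2 }];
console.log(chooseDestinationBucket(findings, 'sensitive', 'non-sensitive'));
console.log(chooseDestinationBucket([], 'sensitive', 'non-sensitive'));
```

Defaulting to the sensitive bucket is a design choice in this sketch: when the result is unclear, the file stays in the more restricted location.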

Create Cloud Storage buckets

Following the guidance outlined in the Bucket and object naming guidelines, create three uniquely named buckets, which you use throughout this tutorial:

  • Bucket 1: Replace [YOUR_QUARANTINE_BUCKET] with a unique name of your choosing.
  • Bucket 2: Replace [YOUR_SENSITIVE_DATA_BUCKET] with a unique name of your choosing.
  • Bucket 3: Replace [YOUR_NON_SENSITIVE_DATA_BUCKET] with a unique name of your choosing.

Console

  1. Open the Cloud Storage browser in the GCP Console:

    GO TO Cloud Storage BROWSER

  2. Click Create bucket.

  3. In the Bucket name text box, enter the name you selected for [YOUR_QUARANTINE_BUCKET], and then click Create.
  4. Repeat for the [YOUR_SENSITIVE_DATA_BUCKET] and [YOUR_NON_SENSITIVE_DATA_BUCKET] buckets.

gcloud

  1. Open Cloud Shell:

    GO TO Cloud Shell

  2. Create three buckets using the following commands:

    gsutil mb gs://[YOUR_QUARANTINE_BUCKET]
    gsutil mb gs://[YOUR_SENSITIVE_DATA_BUCKET]
    gsutil mb gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]
    

Create a Cloud Pub/Sub topic and subscription

Console

  1. Open the Cloud Pub/Sub Topics page:

    GO TO Cloud Pub/Sub TOPICS

  2. Click Create a topic.

  3. In the text box that has an entry of the format PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/, append the topic name, like this:

    PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB TOPIC]
    
  4. Click Create.

  5. Select the newly created topic, click the three dots (...) that follow the topic name, and then select New subscription.
  6. In the text box that has an entry of the format PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB TOPIC], append the subscription name, like this:

    PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB TOPIC]/[PUB/SUB SUBSCRIPTION]
    
  7. Click Create.

gcloud

  1. Open Cloud Shell:

    GO TO Cloud Shell

  2. Create a topic, replacing [PUB/SUB TOPIC] with a name of your choosing:

    gcloud pubsub topics create [PUB/SUB TOPIC]

  3. Create a subscription, replacing [PUB/SUB SUBSCRIPTION] with a name of your choosing:

    gcloud pubsub subscriptions create [PUB/SUB SUBSCRIPTION] --topic [PUB/SUB TOPIC]
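
The short names you chose here are expanded into fully qualified resource names when they are used in API requests. A minimal sketch of those names, assuming hypothetical helpers topicPath and subscriptionPath and placeholder values:

```javascript
// Hypothetical helpers (not part of the tutorial's code) that build the
// fully qualified Cloud Pub/Sub resource names from short names.
function topicPath(projectId, topicId) {
  return `projects/${projectId}/topics/${topicId}`;
}

function subscriptionPath(projectId, subscriptionId) {
  return `projects/${projectId}/subscriptions/${subscriptionId}`;
}

// Placeholder values for illustration only.
console.log(topicPath('my-project', 'dlp-results'));
console.log(subscriptionPath('my-project', 'dlp-results-sub'));
```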

Create a Cloud Function

Console

  1. Open the Cloud Functions Overview page:

    GO TO THE Cloud Functions OVERVIEW PAGE

  2. Select the project for which you enabled Cloud Functions.

  3. Click Create function.
  4. In the Name box, replace the default name with dlpQuarantineGCS.
  5. In the Trigger field, check Cloud Storage bucket.
  6. In the Bucket field, click browse, select your quarantine bucket by highlighting the bucket in the drop-down list, and then click Select.
  7. Under Source code, check Inline editor.
  8. Paste the following code into the index.js box, replacing the existing text:

    /**
     * Copyright 2018, Google, Inc.
     * Licensed under the Apache License, Version 2.0 (the "License");
     * you may not use this file except in compliance with the License.
     * You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
    'use strict';
    
    // Start the debug agent. Remove or comment out if not required
    require('@google-cloud/debug-agent').start();
    
    
    // User-configurable constants:
    
    // The minimum likelihood required before returning a match
    const MIN_LIKELIHOOD = 'LIKELIHOOD_UNSPECIFIED';
    
    // The maximum number of findings to report (0 = server maximum)
    const MAX_FINDINGS = 0;
    
    // The infoTypes of information to match
    const INFO_TYPES = [
      { name: 'PHONE_NUMBER' },
      { name: 'EMAIL_ADDRESS' },
      { name: 'US_SOCIAL_SECURITY_NUMBER' }
    ];
    
    // The bucket the to-be-scanned files are uploaded to
    const STAGING_BUCKET = '[YOUR_QUARANTINE_BUCKET]';
    // The bucket to move "safe" files to
    const NONSENSITIVE_BUCKET = '[YOUR_NON_SENSITIVE_DATA_BUCKET]';
    
    // The bucket to move "unsafe" files to
    const SENSITIVE_BUCKET = '[YOUR_SENSITIVE_DATA_BUCKET]';
    
    // The project ID to run the DLP API call under
    const PROJECT_ID = '[PROJECT_ID HOSTING STAGING_BUCKET]';
    
    // Pub/Sub topic to notify once the DLP job completes
    const PUB_SUB_TOPIC = '[PUB/SUB TOPIC]';
    
    // Pub/Sub subscription to use when listening for job complete notifications
    const PUB_SUB_SUBSCR = '[PUB/SUB SUBSCRIPTION]';
    
    
    // Initialize the Google Cloud client libraries
    const DLP = require('@google-cloud/dlp');
    const dlp = new DLP.DlpServiceClient();
    
    const Storage = require('@google-cloud/storage');
    const storage = Storage();
    
    const Pubsub = require('@google-cloud/pubsub');
    const pubsub = new Pubsub();
    
    /**
     * Background Cloud Function to scan a GCS file using the DLP API and move it
     * to another bucket based on the DLP API's findings
     *
     * @param {object} event The Google Cloud Storage event.
     * @param {function} callback Called at completion of processing the file.
     */
    exports.dlpQuarantineGCS = (event, callback) => {
        var file = event.data;
        console.log('Processing file: ' + file.name);
        setTimeout(() => inspectGCSFile(
          PROJECT_ID,
          file.bucket,
          file.name,
          PUB_SUB_TOPIC,
          PUB_SUB_SUBSCR,
          MIN_LIKELIHOOD,
          MAX_FINDINGS,
          INFO_TYPES,
          callback
        ), 2000);
    };
    
    function inspectGCSFile(
        projectId,
        bucketName,
        fileName,
        topicId,
        subscriptionId,
        minLikelihood,
        maxFindings,
        infoTypes,
        done) {
      // Get reference to the file to be inspected
      const storageItem = {
        cloudStorageOptions: {
          fileSet: {url: `gs://${bucketName}/${fileName}`},
        },
      };
    
      // Construct REST request for creating an inspect job
      const request = {
        parent: dlp.projectPath(projectId),
        inspectJob: {
          inspectConfig: {
            infoTypes: infoTypes,
            minLikelihood: minLikelihood,
            limits: {
              maxFindingsPerRequest: maxFindings,
            },
          },
          storageConfig: storageItem,
          actions: [
            {
              pubSub: {
                topic: `projects/${projectId}/topics/${topicId}`,
              },
            },
          ],
        },
      };
    
      // Verify the Pub/Sub topic and listen for job notifications via an
      // existing subscription.
      let subscription;
      pubsub.topic(topicId)
          .get()
          .then(topicResponse => {
            return topicResponse[0].subscription(subscriptionId);
          })
          .then(subscriptionResponse => {
            subscription = subscriptionResponse;
            // Create a DLP GCS File inspection job and wait for it to complete
            // (using promises)
            return dlp.createDlpJob(request);
          })
          .then(jobsResponse => {
            // Get the DLP job ID
            return jobsResponse[0].name;
          })
          .then(jobName => {
            // Watch the Pub/Sub topic until the DLP job completes processing file
            return new Promise((resolve, reject) => {
              const messageHandler = message => {
                if (message.attributes &&
                    message.attributes.DlpJobName === jobName) {
                  message.ack();
                  subscription.removeListener('message', messageHandler);
                  subscription.removeListener('error', errorHandler);
                  resolve(jobName);
                } else {
                  message.nack();
                }
              };
    
              const errorHandler = err => {
                subscription.removeListener('message', messageHandler);
                subscription.removeListener('error', errorHandler);
                reject(err);
              };
    
              subscription.on('message', messageHandler);
              subscription.on('error', errorHandler);
            });
          })
          .then(jobName => {
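            // Wait briefly so the DLP job can fully complete. The callback is
            // wrapped in a function so that the delay actually takes effect.
            return new Promise(
                resolve => setTimeout(() => resolve(jobName), 500));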
          })
          .then(jobName => dlp.getDlpJob({name: jobName}))
          .then(wrappedJob => {
            const job = wrappedJob[0];
            console.log(`Job ${job.name} status: ${job.state}`);
    
            // set default destination to "sensitive" bucket
            var destBucketName = SENSITIVE_BUCKET;
    
            const infoTypeStats = job.inspectDetails.result.infoTypeStats;
            if (infoTypeStats.length > 0) {
              infoTypeStats.forEach(infoTypeStat => {
                console.log(
                    `  Found ${infoTypeStat.count} instance(s)` +
                    ` of infoType ${infoTypeStat.infoType.name}.`);
              });
            } else {
              // If no infoType matches, set destination to "non sensitive" bucket
              destBucketName = NONSENSITIVE_BUCKET;
              console.log('No Matching infoType.');
            }
            // set destination to target bucket
            const destBucket = storage.bucket(destBucketName);
            // Move file to appropriate bucket
            // NOTE: No atomic "move" option exists in GCS, so this may fail to
            // delete the quarantined file
            return storage.bucket(bucketName).file(fileName).move(destBucket);
          })
          .then(() => {
            // Signal successful completion to Cloud Functions
            done();
          })
          .catch((err) => {
            if (err.message.toLowerCase().indexOf('not found') > -1) {
              console.error('[Fail] Error in inspectGCSFile:' + err);
              done();
            } else {
              console.error('[Retry] Error in inspectGCSFile:' + err);
              done(err);
            }
          });
    }
    

  9. Adjust the following lines in the code that you pasted into the index.js box, replacing the variables with the project ID of your project, the corresponding buckets, and the Cloud Pub/Sub topic and subscription names that you created earlier.

    [YOUR_QUARANTINE_BUCKET]
    [YOUR_SENSITIVE_DATA_BUCKET]
    [YOUR_NON_SENSITIVE_DATA_BUCKET]
    [PROJECT_ID HOSTING STAGING_BUCKET]
    [PUB/SUB TOPIC]
    [PUB/SUB SUBSCRIPTION]
    
  10. In the Function to execute text box, replace processFile with dlpQuarantineGCS.

  11. Paste the following code into the package.json text box, replacing the existing text:

    {
      "name": "gcp-functions-classification-dlp",
      "version": "0.0.1",
      "private": true,
      "license": "Apache-2.0",
      "author": "Google Inc.",
      "repository": {
        "type": "git",
        "url": "https://github.com/GoogleCloudPlatform/nodejs-docs-samples.git"
      },
      "engines": {
        "node": ">=4.3.2"
      },
      "scripts": {
        "lint": "repo-tools lint",
        "pretest": "npm run lint",
        "test": "ava -T 20s --verbose test/*.test.js"
      },
      "dependencies": {
        "@google-cloud/debug-agent": "^2.3.0",
        "@google-cloud/dlp": "^0.3.0",
        "@google-cloud/storage": "^1.5.1",
        "@google-cloud/pubsub": "0.16.4",
        "pug": "2.0.0-rc.4",
        "safe-buffer": "5.1.1"
      },
      "devDependencies": {
        "@google-cloud/nodejs-repo-tools": "2.1.3",
        "ava": "0.24.0",
        "proxyquire": "1.8.0",
        "sinon": "4.1.2",
        "supertest": "^3.0.0",
        "uuid": "^3.1.0"
      },
      "cloud-repo-tools": {
        "requiresKeyFile": true,
        "requiresProjectId": true,
        "requiredEnvVars": [
          "BASE_URL"
        ]
      }
    }
    

    The following screenshot shows the code in the package.json text box:

    [Screenshot: package.json settings]

  12. Click Save.

    A green checkmark beside the function indicates a successful deployment.

    [Screenshot: successful deployment]

When the Cloud Function has successfully deployed, continue to the next section.
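
The catch block at the end of inspectGCSFile separates two kinds of failure: errors whose message contains "not found" are logged as [Fail] and the callback is called without an error, while all other errors are logged as [Retry] and passed to the callback. A minimal sketch of that rule in isolation, assuming hypothetical helpers isPermanentError and completeWithError; whether a retry actually happens depends on how the function is configured:

```javascript
// Hypothetical helpers (not part of the tutorial's code) that mirror the
// error handling in inspectGCSFile.
function isPermanentError(err) {
  // Guard against errors that carry no message at all.
  const message = String((err && err.message) || '').toLowerCase();
  return message.includes('not found');
}

function completeWithError(err, done) {
  if (isPermanentError(err)) {
    // Retrying cannot help (for example, the file no longer exists), so
    // finish without reporting an error.
    done();
  } else {
    // Report the error so that the invocation is marked as failed.
    done(err);
  }
}

// Example with a stand-in callback.
completeWithError(new Error('Object not found'), err => {
  console.log(err ? 'reported failure' : 'finished without error');
});
```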

gcloud

  1. Open a Cloud Shell session and clone the GitHub repository that contains the code and some sample data files:

    OPEN IN Cloud Shell

  2. Change directories to the folder the repository has been cloned to:

    cd ~/dlp-cloud-functions-tutorials/gcs-dlp-classification/src

  3. In the index.js file in this folder, replace the following placeholders with your project ID, the buckets you created earlier, and the topic and subscription names you chose:

    [YOUR_QUARANTINE_BUCKET]
    [YOUR_SENSITIVE_DATA_BUCKET]
    [YOUR_NON_SENSITIVE_DATA_BUCKET]
    [PROJECT_ID HOSTING STAGING_BUCKET]
    [PUB/SUB TOPIC]
    [PUB/SUB SUBSCRIPTION]
    
  4. Deploy the function, replacing [YOUR_QUARANTINE_BUCKET] with your bucket name:

    gcloud beta functions deploy dlpQuarantineGCS --trigger-bucket [YOUR_QUARANTINE_BUCKET]

  5. Validate that the function has successfully deployed:

    gcloud beta functions describe dlpQuarantineGCS

    A successful deployment is indicated by a ready status similar to the following:

    status:  READY
    timeout:  60s
    

When the Cloud Function has successfully deployed, continue to the next section.
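
Before uploading real files, it can help to see what the function reads from the trigger event. In the code above, the handler takes the bucket and object name from event.data and builds a gs:// URL for the DLP API. A minimal sketch of that step, assuming a hypothetical helper gcsUrlFromEvent and a hand-written sample event that contains only the two fields the tutorial's code uses:

```javascript
// Hypothetical helper (not part of the tutorial's code). It builds the
// gs:// URL from the same fields that dlpQuarantineGCS reads.
function gcsUrlFromEvent(event) {
  const file = event.data;
  return `gs://${file.bucket}/${file.name}`;
}

// Hand-written sample event; a real event carries more fields.
const sampleEvent = {
  data: { bucket: 'my-quarantine-bucket', name: 'sample.txt' },
};

console.log(gcsUrlFromEvent(sampleEvent)); // gs://my-quarantine-bucket/sample.txt
```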

Upload sample files to the quarantine bucket

The GitHub repository associated with this article includes sample data files. The sample data folder contains two types of files: those containing nonsensitive data and those containing sensitive data. A file is considered sensitive if it contains one or more of the following infoTypes:

US_SOCIAL_SECURITY_NUMBER
EMAIL_ADDRESS
PERSON_NAME
LOCATION
PHONE_NUMBER

The data types that are used to classify the sample files are defined in the INFO_TYPES constant in the index.js file. In the code listing earlier in this tutorial, INFO_TYPES is set to PHONE_NUMBER, EMAIL_ADDRESS, and US_SOCIAL_SECURITY_NUMBER.
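
In the index.js listing, INFO_TYPES is an array of objects that each have a name property. A minimal sketch of building that array from plain infoType names, assuming a hypothetical helper named toInfoTypes; the names used here are taken from the list above:

```javascript
// Hypothetical helper (not part of the tutorial's code): wraps each
// infoType name in the { name: ... } shape used by INFO_TYPES.
function toInfoTypes(names) {
  return names.map(name => ({ name }));
}

// Example: also match person names and locations.
const INFO_TYPES = toInfoTypes([
  'PHONE_NUMBER',
  'EMAIL_ADDRESS',
  'US_SOCIAL_SECURITY_NUMBER',
  'PERSON_NAME',
  'LOCATION',
]);

console.log(INFO_TYPES.length); // 5
```

Adding infoTypes widens what counts as sensitive, so more of the sample files may end up in the sensitive bucket.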

  1. If you have not already cloned the repository, open Cloud Shell and clone the GitHub repository that contains the code and some sample data files:

    OPEN IN Cloud Shell

  2. Change folders to the sample data files:

    cd ~/dlp-cloud-functions-tutorials/sample_data/

  3. Copy the sample data files to the quarantine bucket by using the gsutil command, replacing [YOUR_QUARANTINE_BUCKET] with the name of your quarantine bucket:

    gsutil -m cp * gs://[YOUR_QUARANTINE_BUCKET]/

    The Cloud Data Loss Prevention API inspects and classifies each file uploaded to the quarantine bucket, and the Cloud Function moves the file to the appropriate target bucket based on its classification.

  4. In the Cloud Storage console, open the Storage Browser page:

    GO TO Cloud Storage BROWSER

  5. Select one of the target buckets that you created earlier and review the uploaded files. Also review the other buckets that you created.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources that you created after you finish. The following section describes how to delete these resources by deleting the project.

Delete the project

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
