This tutorial shows how to implement an automated data quarantine and classification system using Cloud Storage and other Google Cloud products. The tutorial assumes that you are familiar with Google Cloud and basic shell programming.
In every organization, data protection officers like you face an ever-increasing amount of data that must be protected and treated appropriately. Quarantining and classifying that data can be complicated and time-consuming, especially when hundreds or thousands of files arrive each day.
What if you could take each file, upload it to a quarantine location, and have it automatically classified and moved to the appropriate location based on the classification result? This tutorial shows you how to implement such a system by using Cloud Functions, Cloud Storage, and Cloud Data Loss Prevention.
Objectives
- Create Cloud Storage buckets to be used as part of the quarantine and classification pipeline.
- Create a Pub/Sub topic and subscription to notify you when file processing is completed.
- Create a simple Cloud Function that invokes the DLP API when files are uploaded.
- Upload some sample files to the quarantine bucket to invoke the Cloud Function. The function uses the DLP API to inspect and classify the files and move them to the appropriate bucket.
Costs
This tutorial uses billable Google Cloud components, including:
- Cloud Storage
- Cloud Functions
- Cloud Data Loss Prevention
You can use the Pricing Calculator to generate a cost estimate based on your projected usage.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Functions, Cloud Storage, and Cloud Data Loss Prevention APIs.
Granting permissions to service accounts
Your first step is to grant permissions to two service accounts: the App Engine default service account, which Cloud Functions uses by default, and the Cloud DLP service account.
Grant permissions to the App Engine default service account
In the Cloud Console, open the IAM & Admin page and select the project you created:
Locate the App Engine service account. This account has the format [PROJECT_ID]@appspot.gserviceaccount.com. Replace [PROJECT_ID] with your project ID.
Select the edit icon next to the service account.
Add the following roles:
- Project > Owner
- Cloud DLP > DLP Administrator
- Service Management > DLP API Service Agent
Click Save.
Grant permissions to the DLP service account
In the Cloud Console, open the IAM & Admin page and select the project you created:
Locate the Cloud DLP Service Agent service account. This account has the format service-[PROJECT_NUMBER]@dlp-api.iam.gserviceaccount.com. Replace [PROJECT_NUMBER] with your project number.
Select the edit icon next to the service account.
Add the role Project > Viewer, and then click Save.
Building the quarantine and classification pipeline
In this section, you build the quarantine and classification pipeline shown in the following diagram.
The numbers in this pipeline correspond to these steps:
- You upload files to Cloud Storage.
- The upload invokes a Cloud Function.
- Cloud DLP inspects and classifies the data.
- The file is moved to the appropriate bucket.
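The last step of the pipeline, choosing a target bucket, can be sketched as a small pure function. This is a minimal illustration, not the tutorial repository's code: the bucket names are hypothetical placeholders, and the rule that any finding at all marks a file as sensitive is an assumption made for the sketch.

```python
from typing import Mapping

# Hypothetical bucket names; substitute the buckets you create in this tutorial.
SENSITIVE_BUCKET = "my-sensitive-data-bucket"
NON_SENSITIVE_BUCKET = "my-non-sensitive-data-bucket"


def choose_target_bucket(findings_by_info_type: Mapping[str, int]) -> str:
    """Return the bucket a file should move to, given per-infoType finding counts."""
    total_findings = sum(findings_by_info_type.values())
    # Assumption for this sketch: a single finding is enough to treat the file as sensitive.
    return SENSITIVE_BUCKET if total_findings > 0 else NON_SENSITIVE_BUCKET
```

For example, a file with two EMAIL_ADDRESS findings would be routed to the sensitive bucket, while a file with no findings would go to the non-sensitive bucket.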
Create Cloud Storage buckets
Following the guidance outlined in the bucket naming guidelines, create three uniquely named buckets, which you use throughout this tutorial:
- Bucket 1: Replace [YOUR_QUARANTINE_BUCKET] with a unique name.
- Bucket 2: Replace [YOUR_SENSITIVE_DATA_BUCKET] with a unique name.
- Bucket 3: Replace [YOUR_NON_SENSITIVE_DATA_BUCKET] with a unique name.
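Before you create the buckets, you might want to sanity-check the names you picked. The following sketch covers only a subset of the bucket naming guidelines (length, allowed characters, and the reserved "goog" prefix and "google" substring) and deliberately rejects names that contain dots, which have additional rules. The function name is hypothetical, and a local check cannot tell you whether a name is globally unique.

```python
import re

# Subset of the naming rules for dot-free names: 3-63 characters, lowercase letters,
# digits, dashes, and underscores, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes the partial checks described above."""
    if _BUCKET_NAME_RE.fullmatch(name) is None:
        return False
    # Names cannot begin with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True
```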
console
In the Cloud Console, open the Cloud Storage browser:
Click Create bucket.
In the Bucket name text box, enter the name you selected for [YOUR_QUARANTINE_BUCKET], and then click Create.
Repeat for the [YOUR_SENSITIVE_DATA_BUCKET] and [YOUR_NON_SENSITIVE_DATA_BUCKET] buckets.
gcloud
Open Cloud Shell:
Create three buckets using the following commands:
gsutil mb gs://[YOUR_QUARANTINE_BUCKET]
gsutil mb gs://[YOUR_SENSITIVE_DATA_BUCKET]
gsutil mb gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]
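If you prefer to script bucket creation in Python rather than with gsutil, a sketch along the following lines might work. It assumes the google-cloud-storage client library, whose Client.create_bucket method accepts a bucket name. The helper names and the DryRunClient stand-in are hypothetical and exist only so that you can try the logic without credentials.

```python
from typing import Iterable, List


class DryRunClient:
    """Stand-in that records requests instead of calling Cloud Storage."""

    def __init__(self) -> None:
        self.requested: List[str] = []

    def create_bucket(self, name: str) -> None:
        self.requested.append(name)


def create_buckets(client, bucket_names: Iterable[str]) -> List[str]:
    """Create each named bucket with `client` and return the names in order."""
    created = []
    for name in bucket_names:
        client.create_bucket(name)
        created.append(name)
    return created


def create_with_real_client(bucket_names: Iterable[str]) -> List[str]:
    """Needs the google-cloud-storage package and application default credentials."""
    # Imported lazily so that the dry run above works without the library installed.
    from google.cloud import storage

    return create_buckets(storage.Client(), bucket_names)
```

Passing a DryRunClient to create_buckets lets you confirm the list of names before calling create_with_real_client, which creates billable resources.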
Create a Pub/Sub topic and subscription
console
Open the Pub/Sub Topics page:
Click Create a topic.
In the text box that has an entry of the format PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/, append the topic name, like this:
PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB_TOPIC]
Click Create.
Select the newly created topic, click the three dots (...) that follow the topic name, and then select New subscription.
In the text box that has an entry of the format PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB_TOPIC], append the subscription name, like this:
PROJECTS/[YOUR_PROJECT_NAME]/TOPICS/[PUB/SUB_TOPIC]/[PUB/SUB_SUBSCRIPTION]
Click Create.
gcloud
Open Cloud Shell:
Create a topic, replacing [PUB/SUB_TOPIC] with a name of your choosing:
gcloud pubsub topics create [PUB/SUB_TOPIC]
Create a subscription, replacing [PUB/SUB_SUBSCRIPTION] with a name of your choosing:
gcloud pubsub subscriptions create [PUB/SUB_SUBSCRIPTION] --topic [PUB/SUB_TOPIC]
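When code refers to a topic or subscription through the Pub/Sub API, it uses a fully qualified resource name rather than the short name passed to gcloud. The following sketch builds those names from a project ID. The helper names are hypothetical and are not taken from the tutorial repository.

```python
def topic_path(project_id: str, topic: str) -> str:
    """Return the fully qualified Pub/Sub topic name."""
    return f"projects/{project_id}/topics/{topic}"


def subscription_path(project_id: str, subscription: str) -> str:
    """Return the fully qualified Pub/Sub subscription name."""
    return f"projects/{project_id}/subscriptions/{subscription}"
```

For a project called my-project and a topic called dlp-results, topic_path returns projects/my-project/topics/dlp-results.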
Create the Cloud Functions
This section steps through deploying the Python script containing the following two Cloud Functions:
- A function that is invoked when an object is uploaded to Cloud Storage.
- A function that is invoked when a message is received in the Pub/Sub queue.
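To make the first function more concrete, the following sketch shows how an upload event might be turned into a Cloud DLP inspection job configuration. It is an illustration under stated assumptions, not the repository's implementation: the helper name and parameters are hypothetical, the event is assumed to be the dictionary that a Cloud Storage trigger passes to a Python background function (with bucket and name keys), and the returned dictionary follows the inspect job fields of the DLP API as understood here.

```python
from typing import Dict, List


def build_inspect_job(event: Dict, topic_path: str, info_types: List[str]) -> Dict:
    """Build an inspect job configuration for the uploaded object in `event`."""
    bucket = event["bucket"]    # bucket that received the upload
    file_name = event["name"]   # object name of the uploaded file
    return {
        # Which infoTypes Cloud DLP should look for.
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
        },
        # Which object to scan.
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": f"gs://{bucket}/{file_name}"},
            },
        },
        # Publish a Pub/Sub message when the job finishes.
        "actions": [{"pub_sub": {"topic": topic_path}}],
    }
```

The Pub/Sub action is what connects the two functions: the first function starts the job and returns, and the message published on completion invokes the second function.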
Creating the first function
console
Open the Cloud Functions Overview page:
Select the project for which you enabled Cloud Functions.
Click Create function.
In the Name box, replace the default name with create_DLP_job.
In the Trigger field, select Cloud Storage.
In the Bucket field, click browse, select your quarantine bucket by highlighting the bucket in the drop-down list, and then click Select.
Under Runtime, select Python 3.7.
Under Source code, check Inline editor.
Paste the following code into the main.py box, replacing the existing text:
Adjust the following lines in the code that you pasted into the main.py box, replacing the variables with the project ID of your project, the corresponding buckets, and the Pub/Sub topic and subscription names that you created earlier.
[YOUR_QUARANTINE_BUCKET]
[YOUR_SENSITIVE_DATA_BUCKET]
[YOUR_NON_SENSITIVE_DATA_BUCKET]
[PROJECT_ID_HOSTING_STAGING_BUCKET]
[PUB/SUB_TOPIC]
In the Function to execute text box, replace hello_gcs with create_DLP_job.
Paste the following code into the requirements.txt text box, replacing the existing text:
Click Save.
A green checkmark beside the function indicates a successful deployment.
gcloud
Open a Cloud Shell session and clone the GitHub repository that contains the code and some sample data files:
Change directories to the folder the repository has been cloned to:
cd gcs-dlp-classification-python/
Adjust the following lines in the main.py file, replacing the bucket variables with the corresponding buckets you created earlier. Also replace the project ID and Pub/Sub topic variables with your own values.
[YOUR_QUARANTINE_BUCKET]
[YOUR_SENSITIVE_DATA_BUCKET]
[YOUR_NON_SENSITIVE_DATA_BUCKET]
[PROJECT_ID_HOSTING_STAGING_BUCKET]
[PUB/SUB_TOPIC]
Deploy the function, replacing [YOUR_QUARANTINE_BUCKET] with your bucket name:
gcloud functions deploy create_DLP_job --runtime python37 \
    --trigger-resource [YOUR_QUARANTINE_BUCKET] \
    --trigger-event google.storage.object.finalize
Validate that the function has successfully deployed:
gcloud functions describe create_DLP_job
A successful deployment is indicated by a ready status similar to the following:
status: READY
timeout: 60s
When the Cloud Function has successfully deployed, continue to the next section to create the second Cloud Function.
Creating the second function
console
Open the Cloud Functions Overview page:
Select the project for which you enabled Cloud Functions.
Click Create function.
In the Name box, replace the default name with resolve_DLP.
In the Trigger field, select Pub/Sub.
In the Topic field, enter [PUB/SUB_TOPIC].
Under Source code, check Inline editor.
Under Runtime, select Python 3.7.
Paste the following code into the main.py box, replacing the existing text:
Adjust the following lines in the code that you pasted into the main.py box, replacing the variables with the project ID of your project, the corresponding buckets, and the Pub/Sub topic and subscription names that you created earlier.
[YOUR_QUARANTINE_BUCKET]
[YOUR_SENSITIVE_DATA_BUCKET]
[YOUR_NON_SENSITIVE_DATA_BUCKET]
[PROJECT_ID_HOSTING_STAGING_BUCKET]
[PUB/SUB_TOPIC]
In the Function to execute text box, replace helloPubSub with resolve_DLP.
Paste the following into the requirements.txt text box, replacing the existing text:
Click Save.
A green checkmark beside the function indicates a successful deployment.
gcloud
Open (or reopen) a Cloud Shell session and clone the GitHub repository that contains the code and some sample data files:
Change directories to the folder with the Python code:
cd gcs-dlp-classification-python
Adjust the following lines in the main.py file, replacing the bucket variables with the corresponding buckets you created earlier. Also replace the project ID and Pub/Sub topic variables with your own values.
[YOUR_QUARANTINE_BUCKET]
[YOUR_SENSITIVE_DATA_BUCKET]
[YOUR_NON_SENSITIVE_DATA_BUCKET]
[PROJECT_ID_HOSTING_STAGING_BUCKET]
[PUB/SUB_TOPIC]
Deploy the function, replacing [PUB/SUB_TOPIC] with your Pub/Sub topic:
gcloud functions deploy resolve_DLP --runtime python37 --trigger-topic [PUB/SUB_TOPIC]
Validate that the function has successfully deployed:
gcloud functions describe resolve_DLP
A successful deployment is indicated by a ready status similar to the following:
status: READY
timeout: 60s
When the Cloud Function has successfully deployed, continue to the next section.
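The second function has to work out which job finished and which file the job inspected. The following sketch shows two helpers that such a function might use. Both helper names are hypothetical. The sketch assumes that the event is the dictionary passed to a Pub/Sub-triggered Python background function and that the Cloud DLP notification carries the job name in a DlpJobName message attribute, which you should confirm against the repository code and the Cloud DLP documentation.

```python
from typing import Dict, Optional, Tuple


def dlp_job_name_from_event(event: Dict) -> Optional[str]:
    """Return the DLP job name carried in the message attributes, if present."""
    attributes = event.get("attributes") or {}
    # Assumption: the notification stores the job name under "DlpJobName".
    return attributes.get("DlpJobName")


def split_gcs_url(url: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object name)."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URL: {url!r}")
    bucket, _, object_name = url[len(prefix):].partition("/")
    return bucket, object_name
```

With the job name in hand, the function can look up the job's results, and split_gcs_url recovers the object name from the URL that the job scanned so that the file can be moved out of the quarantine bucket.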
Upload sample files to the quarantine bucket
The GitHub repository associated with this article includes sample data files. The folder contains some files that have sensitive data and other files that have nonsensitive data. Sensitive data is classified as containing one or more of the following INFO_TYPES values:
US_SOCIAL_SECURITY_NUMBER
EMAIL_ADDRESS
PERSON_NAME
LOCATION
PHONE_NUMBER
The data types that are used to classify the sample files are defined in the INFO_TYPES constant in the main.py file, which is initially set to ['PHONE_NUMBER', 'EMAIL_ADDRESS'].
If you have not already cloned the repository, open Cloud Shell and clone the GitHub repository that contains the code and some sample data files:
Change folders to the sample data files:
cd ~/dlp-cloud-functions-tutorials/sample_data/
Copy the sample data files to the quarantine bucket by using the gsutil command, replacing [YOUR_QUARANTINE_BUCKET] with the name of your quarantine bucket:
gsutil -m cp * gs://[YOUR_QUARANTINE_BUCKET]/
Cloud DLP inspects and classifies each file uploaded to the quarantine bucket and moves it to the appropriate target bucket based on its classification.
In the Cloud Storage console, open the Storage Browser page:
Select one of the target buckets that you created earlier and review the uploaded files. Also review the other buckets that you created.
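If you uploaded many files, reviewing the buckets by eye can be tedious. The following sketch compares the names you uploaded with the names found in the two target buckets and reports any file that has not been routed. The function name is hypothetical, and the sketch assumes that you have already collected the object names, for example from the output of gsutil ls for each bucket.

```python
from typing import Iterable, List


def find_unrouted_files(
    uploaded: Iterable[str],
    sensitive: Iterable[str],
    non_sensitive: Iterable[str],
) -> List[str]:
    """Return uploaded file names that appear in neither target bucket."""
    routed = set(sensitive) | set(non_sensitive)
    return sorted(name for name in set(uploaded) if name not in routed)
```

An empty result suggests that every file was classified and moved. A non-empty result may simply mean that some inspection jobs are still running, so it can be worth checking again after a short wait.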
Cleaning up
After you've finished the current tutorial, you can clean up the resources that you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
Delete the project
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Try setting different valid data type values for INFO_TYPES.
- Learn more about inspecting storage and databases for sensitive data using Cloud DLP.
- Learn more about Cloud Functions.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.