Analyzing and reporting on DLP API scan findings

The Cloud Data Loss Prevention (DLP) API helps you find, understand, and manage the sensitive data that exists within your infrastructure. Once you've scanned your content for sensitive data using the DLP API, you have several options for what to do with that data intelligence. This topic shows you how to use other Google Cloud Platform features such as BigQuery and Google Data Studio to:

  • Store DLP API scan results directly in BigQuery.
  • Generate reports on where sensitive data resides in your infrastructure.
  • Run rich SQL analytics to understand where sensitive data is stored and what kind it is.
  • Automate alerts or other actions that trigger based on a single finding or a combination of findings.

This topic also contains a complete example of how to use the DLP API along with other GCP features to accomplish all of these things.

Scan a storage bucket

First, run a scan on your data. The following is basic information about how to scan storage repositories using the DLP API. For full instructions on scanning storage repositories, including the use of client libraries, see Inspecting Storage and Databases for Sensitive Data.

To run a scan operation on a GCP storage repository, assemble a JSON object that includes the following configuration objects:

  • InspectJobConfig: Configures the DLP scan job, and consists of:

    • StorageConfig: The storage repository to scan.
    • InspectConfig: How and what to scan for. You can also use an inspection template to define the inspection configuration.
    • Action: Task(s) to execute on the completion of the job. This can include saving findings to a BigQuery table or publishing a notification to Cloud Pub/Sub.

In our example, we're scanning a Cloud Storage bucket for person names, phone numbers, US Social Security numbers, and email addresses, and then sending the findings to a BigQuery table dedicated to storing DLP output. The following JSON can be saved to a file or sent directly to the create method of the DLP API's DlpJob resource.

JSON Input:

POST https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/dlpJobs?key={YOUR_API_KEY}

{
  "inspectJob":{
    "inspectConfig":{
      "infoTypes":[
        {
          "name":"PERSON_NAME"
        },
        {
          "name":"PHONE_NUMBER"
        },
        {
          "name":"US_SOCIAL_SECURITY_NUMBER"
        },
        {
          "name":"EMAIL_ADDRESS"
        }
      ],
      "includeQuote":true
    },
    "storageConfig":{
      "cloudStorageOptions":{
        "fileSet":{
          "url":"gs://[BUCKET_NAME]/**"
        }
      }
    },
    "actions":[
      {
        "saveFindings":{
          "outputConfig":{
            "table":{
              "projectId":"[PROJECT_ID]",
              "datasetId":"[DATASET_ID]",
              "tableId":"[TABLE_ID]"
            }
          }
        }
      }
    ]
  }
}

Note that by specifying two asterisks (**) after the Cloud Storage bucket address (gs://[BUCKET_NAME]/**), we're instructing the scan job to scan recursively. Placing a single asterisk (*) would instruct the job to scan only the specified directory level and no deeper.

The output will be saved to the specified table within the given dataset and project. Subsequent jobs that specify the same table ID will append their findings to that table. Alternatively, you can leave out the "tableId" key to instruct the DLP API to create a new table each time the scan is run.
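If you prefer to use a client library instead of sending the raw JSON, the following is a minimal sketch of the same job using the google-cloud-dlp Python client library (this assumes version 2.x or later, which takes a request dictionary); the bracketed values are placeholders for your own project, bucket, dataset, and table.

# Minimal sketch: create the same inspection job with the google-cloud-dlp
# Python client library. Replace the bracketed placeholders with real values.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()

project_id = "[PROJECT_ID]"
inspect_job = {
    "inspect_config": {
        "info_types": [
            {"name": "PERSON_NAME"},
            {"name": "PHONE_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
            {"name": "EMAIL_ADDRESS"},
        ],
        "include_quote": True,
    },
    "storage_config": {
        "cloud_storage_options": {"file_set": {"url": "gs://[BUCKET_NAME]/**"}}
    },
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "[PROJECT_ID]",
                        "dataset_id": "[DATASET_ID]",
                        "table_id": "[TABLE_ID]",
                    }
                }
            }
        }
        # To also publish a notification when the job finishes, you could add
        # an action such as:
        # {"pub_sub": {"topic": "projects/[PROJECT_ID]/topics/[TOPIC_ID]"}}
    ],
}

# Create the job; the returned name identifies it for later status checks.
job = dlp.create_dlp_job(
    request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
)
print(job.name)  # e.g. projects/[PROJECT_ID]/dlpJobs/[JOB_ID]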

After we send this JSON in a request to the projects.dlpJobs.create method via the specified URL, we get the following response:

JSON Output:

{
  "name":"projects/[PROJECT_ID]/dlpJobs/[JOB_ID]",
  "type":"INSPECT_JOB",
  "state":"PENDING",
  "inspectDetails":{
    "requestedOptions":{
      "snapshotInspectTemplate":{

      },
      "jobConfig":{
        "storageConfig":{
          "cloudStorageOptions":{
            "fileSet":{
              "url":"gs://[BUCKET_NAME]/**"
            }
          }
        },
        "inspectConfig":{
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            },
            {
              "name":"PHONE_NUMBER"
            },
            {
              "name":"US_SOCIAL_SECURITY_NUMBER"
            },
            {
              "name":"EMAIL_ADDRESS"
            }
          ],
          "minLikelihood":"POSSIBLE",
          "limits":{

          },
          "includeQuote":true
        },
        "actions":[
          {
            "saveFindings":{
              "outputConfig":{
                "table":{
                  "projectId":"[PROJECT_ID]",
                  "datasetId":"[DATASET_ID]",
                  "tableId":"[TABLE_ID]"
                }
              }
            }
          }
        ]
      }
    }
  },
  "createTime":"2018-11-19T21:09:07.926Z"
}

Once the job has completed, it saves its findings to the given BigQuery table.

To get the status of the job, call the projects.dlpJobs.get method, or send a GET request to the following URL, replacing [PROJECT_ID] with your project ID and [JOB_ID] with the job identifier returned in the Cloud Data Loss Prevention API's response to the job creation request (the job identifier is preceded by "i-"):

GET https://dlp.googleapis.com/v2/projects/[PROJECT_ID]/dlpJobs/[JOB_ID]?key={YOUR_API_KEY}
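Equivalently, the following is a minimal sketch that polls the job with the google-cloud-dlp Python client library (again assuming the request-dictionary call style) and, once the job is done, prints the per-infoType counts from the result:

# Minimal sketch: check the job's status with the google-cloud-dlp Python
# client library. The job name is the "name" value returned at creation time.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
job_name = "projects/[PROJECT_ID]/dlpJobs/[JOB_ID]"  # placeholder

job = dlp.get_dlp_job(request={"name": job_name})
print(job.state)  # e.g. JobState.RUNNING or JobState.DONE

if job.state == google.cloud.dlp_v2.DlpJob.JobState.DONE:
    # Summarize how many findings of each infoType the scan produced.
    for stat in job.inspect_details.result.info_type_stats:
        print(stat.info_type.name, stat.count)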

For the job we just created, this request returns the following JSON. Notice that a summary of the scan results is returned after the inspection details. If the scan hadn't yet completed, the "state" key would specify "RUNNING".

JSON Output:

{
  "name":"projects/[PROJECT_ID]/dlpJobs/[JOB_ID]",
  "type":"INSPECT_JOB",
  "state":"DONE",
  "inspectDetails":{
    "requestedOptions":{
      "snapshotInspectTemplate":{

      },
      "jobConfig":{
        "storageConfig":{
          "cloudStorageOptions":{
            "fileSet":{
              "url":"gs://[BUCKET_NAME]/**"
            }
          }
        },
        "inspectConfig":{
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            },
            {
              "name":"PHONE_NUMBER"
            },
            {
              "name":"US_SOCIAL_SECURITY_NUMBER"
            },
            {
              "name":"EMAIL_ADDRESS"
            }
          ],
          "minLikelihood":"POSSIBLE",
          "limits":{

          },
          "includeQuote":true
        },
        "actions":[
          {
            "saveFindings":{
              "outputConfig":{
                "table":{
                  "projectId":"[PROJECT_ID]",
                  "datasetId":"[DATASET_ID]",
                  "tableId":"[TABLE_ID]"
                }
              }
            }
          }
        ]
      }
    },
    "result":{
      "processedBytes":"536734051",
      "totalEstimatedBytes":"536734051",
      "infoTypeStats":[
        {
          "infoType":{
            "name":"PERSON_NAME"
          },
          "count":"269679"
        },
        {
          "infoType":{
            "name":"EMAIL_ADDRESS"
          },
          "count":"256"
        },
        {
          "infoType":{
            "name":"PHONE_NUMBER"
          },
          "count":"7"
        }
      ]
    }
  },
  "createTime":"2018-11-19T21:09:07.926Z",
  "startTime":"2018-11-19T21:10:20.660Z",
  "endTime":"2018-11-19T22:07:39.725Z"
}

Run analytics in BigQuery

Now that we've created a new BigQuery table with the results of our DLP API scan, the next step is to run analytics on the table.

On the left side of the Google Cloud Console under Big Data, click BigQuery. Open your project and your dataset, and then locate the new table that was created.

You can run SQL queries on this table to find out more about what the DLP API found within your data bucket. For example, run the following to count all the scan results by infoType, replacing the placeholders with the appropriate real values:

SELECT
  info_type.name,
  COUNT(*) AS iCount
FROM
  `[PROJECT_ID].[DATASET_ID].[TABLE_ID]`
GROUP BY
  info_type.name

This query results in a summary of findings for that bucket that might look something like the following:

Example summary of DLP API findings.
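If you'd rather run the same query programmatically, the following is a minimal sketch using the google-cloud-bigquery Python client library; the table reference uses the same placeholders as above:

# Minimal sketch: run the infoType summary query from Python with the
# google-cloud-bigquery client library.
from google.cloud import bigquery

client = bigquery.Client(project="[PROJECT_ID]")

query = """
    SELECT info_type.name, COUNT(*) AS iCount
    FROM `[PROJECT_ID].[DATASET_ID].[TABLE_ID]`
    GROUP BY info_type.name
"""

# Execute the query and print one line per infoType with its finding count.
for row in client.query(query).result():
    print(f"{row['name']}: {row['iCount']}")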

Create a report in Data Studio

Data Studio enables you to create custom reports that can be based on BigQuery tables. In this section, we create a simple table report in Data Studio that is based on DLP API findings stored in BigQuery.

  1. Open Data Studio and start a new report.
  2. Click Create New Data Source.
  3. From the list of Connectors, click BigQuery. If necessary, authorize Data Studio to connect to your BigQuery projects by clicking Authorize.
  4. Choose which table to connect to: click My Projects or Shared Projects, depending on where your project resides, and then find your project, dataset, and table in the lists on the page.
  5. Click Connect to create the data source.
  6. Click Add to Report.

Now we'll create a table that displays the frequency of each infoType. Select the field info_type.name as the Dimension. The resulting table will look similar to the following:

An example table in Data Studio.

Next steps

This is just the start of what you can visualize using Data Studio and the output from the DLP API. You can add other charting elements and drill-down filters to create dashboards and reports. For more information about what is available in Data Studio, see the Data Studio Product Overview.
