Analyzing and reporting on DLP API findings

The Cloud Data Loss Prevention (DLP) API helps you find, understand, and manage the sensitive data that exists within your infrastructure. Once you've scanned your content for sensitive data using the DLP API, you have several options for what to do with that data intelligence. This topic shows you how to use other Google Cloud Platform features such as BigQuery and Google Data Studio to:

  • Store DLP API scan results directly in BigQuery.
  • Generate reports on where sensitive data resides in your infrastructure.
  • Run rich SQL analytics to understand where sensitive data is stored and what kind it is.
  • Automate alerts or actions that trigger based on a single finding or a combination of findings.

This topic also contains a complete example of how to use the DLP API along with other GCP features to accomplish all of these things.

Scan a storage bucket

First, run a scan on your data. To run a scan operation on a GCP storage repository, assemble a JSON object that includes the following configuration objects:

  • InspectJobConfig: Configures the DLP scan job, and consists of:

    • StorageConfig: The storage repository to scan.
    • InspectConfig: How and what to scan for. You can also use an inspection template to define the inspection configuration.
    • Action: Task(s) to execute on the completion of the job. This can include saving findings to a BigQuery table or publishing a notification to Cloud Pub/Sub.

In our example, we're scanning a Cloud Storage bucket for person names, phone numbers, US Social Security numbers, and email addresses; and then sending the findings to a BigQuery table dedicated to storing DLP output. The following JSON can be saved to a file or sent directly to the create method of the DlpJob DLP API resource:

{
  "inspectJob":
  {
    "inspectConfig":
    {
      "infoTypes":
      [
        {
          "name": "PERSON_NAME"
        },
        {
          "name": "PHONE_NUMBER"
        },
        {
          "name": "US_SOCIAL_SECURITY_NUMBER"
        },
        {
          "name": "EMAIL_ADDRESS"
        }
      ],
      "includeQuote": true
    },
    "storageConfig":
    {
      "cloudStorageOptions":
      {
        "fileSet":
        {
          "url": "gs://bucket-name/**"
        }
      }
    },
    "actions":
    [
      {
        "saveFindings":
        {
          "outputConfig":
          {
            "table":
            {
              "projectId": "project-id",
              "datasetId": "bq-dataset-id"
            }
          }
        }
      }
    ]
  }
}

Note that by specifying two asterisks after the Cloud Storage bucket address (gs://bucket-name/**), we're instructing the scan job to scan recursively. Also, the output will be saved to the dataset bq-dataset-id under the project project-id. By leaving out a "tableId" key, we're instructing the DLP API to create a new table every time the scan is run.

Now we send this JSON in a request to the projects.dlpJobs.create method via the following URL:

POST https://dlp.googleapis.com/v2/{parent=projects/*}/dlpJobs
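
If you'd rather build and submit the same request in code, the following is a minimal sketch that uses the google-cloud-dlp Python client library. The project, bucket, and dataset IDs are the same placeholders used above, and the exact method surface may differ slightly between client library versions.

# Minimal sketch: create the inspection job described above with the
# google-cloud-dlp Python client library. All IDs are placeholders.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/project-id"

inspect_job = {
    "inspect_config": {
        "info_types": [
            {"name": "PERSON_NAME"},
            {"name": "PHONE_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
            {"name": "EMAIL_ADDRESS"},
        ],
        "include_quote": True,
    },
    "storage_config": {
        "cloud_storage_options": {
            "file_set": {"url": "gs://bucket-name/**"}
        }
    },
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "project-id",
                        "dataset_id": "bq-dataset-id",
                        # No table_id, so the DLP API creates a new
                        # table for every run.
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(job.name)  # projects/project-id/dlpJobs/<generated-job-id>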

Once the job has completed, the DLP API has created the new BigQuery table (its auto-generated name appears as the value of the "tableId" key in the job details) and the findings are ready to use.

To get the status of the job, call the projects.dlpJobs.get method, or send a GET request to the following URL:

GET https://dlp.googleapis.com/v2/{name=projects/*/dlpJobs/*}

For the job we just created, this request returns the following JSON. Notice that the results of the scan are returned after the inspection details:

{
  "name":"projects/project-id/dlpJobs/job-id",
  "type":"INSPECT_JOB",
  "state":"DONE",
  "inspectDetails":{
    "requestedOptions":{
      "snapshotInspectTemplate":{

      },
      "jobConfig":{
        "storageConfig":{
          "cloudStorageOptions":{
            "fileSet":{
              "url":"gs://bucket-name/**"
            }
          }
        },
        "inspectConfig":{
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            },
            {
              "name":"PHONE_NUMBER"
            },
            {
              "name":"US_SOCIAL_SECURITY_NUMBER"
            },
            {
              "name":"EMAIL_ADDRESS"
            }
          ],
          "minLikelihood":"POSSIBLE",
          "limits":{

          },
          "includeQuote":true
        },
        "actions":[
          {
            "saveFindings":{
              "outputConfig":{
                "table":{
                  "projectId":"project-id",
                  "datasetId":"bq-dataset-id",
                  "tableId":"table-id"
                }
              }
            }
          }
        ]
      }
    },
    "result":{
      "processedBytes":"219845353",
      "totalEstimatedBytes":"219845353",
      "infoTypeStats":[
        {
          "infoType":{
            "name":"PERSON_NAME"
          },
          "count":"20594"
        },
        {
          "infoType":{
            "name":"PHONE_NUMBER"
          },
          "count":"4279"
        },
        {
          "infoType":{
            "name":"US_SOCIAL_SECURITY_NUMBER"
          },
          "count":"3089"
        }
      ]
    }
  },
  "createTime":"2018-04-11T20:48:00.273Z"
}
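
The same status check can be scripted. Below is a minimal sketch using the google-cloud-dlp Python client library that polls until the job reaches a terminal state and then prints the per-infoType counts; the job name is a placeholder, and enum and field names may vary slightly between client library versions.

# Minimal sketch: poll the job created above until it finishes, then
# print the infoTypeStats summary. The job name is a placeholder.
import time

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
job_name = "projects/project-id/dlpJobs/job-id"

terminal_states = (
    dlp_v2.DlpJob.JobState.DONE,
    dlp_v2.DlpJob.JobState.CANCELED,
    dlp_v2.DlpJob.JobState.FAILED,
)

job = dlp.get_dlp_job(request={"name": job_name})
while job.state not in terminal_states:
    time.sleep(30)  # check again every 30 seconds
    job = dlp.get_dlp_job(request={"name": job_name})

# Mirror the infoTypeStats section of the JSON response above.
for stat in job.inspect_details.result.info_type_stats:
    print(stat.info_type.name, stat.count)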

Run analytics in BigQuery

Now that we've created a new BigQuery table with the results of our DLP API scan, the next step is to run analytics on the table.

On the left side of the Google Cloud Console under Big Data, click BigQuery. Open your project and your dataset, and then locate the new table that was created.

You can run SQL queries on this table to find out more about what the DLP API found within your data bucket. For example, run the following to count all the scan results by infoType, replacing the placeholders with the appropriate real values:

SELECT
  info_type.name,
  COUNT(location.record_key.cloud_storage_key.file_path) AS icount
FROM
  [project-id:bq-dataset-id.table-name]
GROUP BY
  info_type.name

This query results in a summary of findings for that bucket that might look something like the following:

Example summary of DLP API findings.
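
You can also run this kind of aggregation programmatically. The following is a minimal sketch using the google-cloud-bigquery Python client library; the project, dataset, and table names are placeholders, and the query is written in standard SQL rather than the legacy SQL form shown above.

# Minimal sketch: run the same per-infoType count with the
# google-cloud-bigquery Python client library. IDs are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="project-id")

query = """
    SELECT
      info_type.name,
      COUNT(location.record_key.cloud_storage_key.file_path) AS icount
    FROM
      `project-id.bq-dataset-id.table-name`
    GROUP BY
      info_type.name
"""

for row in client.query(query).result():
    print(row["name"], row["icount"])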

Incorporate multiple scans into a single query

If you find yourself running multiple scans on different buckets or tables, or running the same scans multiple times or on a recurring basis, remember that you're creating a new BigQuery table for each scan. While the example in the previous section showed how to analyze results from just one of these scans, you can also run a BigQuery analysis across multiple scans.

If you haven't specified a "tableId" for your scans, the DLP API automatically creates tables in the dataset specified in "datasetId". Note that, in the example JSON from the first section, we specified a dataset with the placeholder name "bq-dataset-id." To query across all tables that were created within that dataset, we can use the following SQL query.

SELECT
  info_type.name,
  COUNT(location.record_key.cloud_storage_key.file_path) AS icount
FROM
  `project-id.bq-dataset-id.*`
GROUP BY
  info_type.name

Note that the table name is now expressed as project-id.bq-dataset-id.*. The wildcard character (*) indicates that the query should search all tables inside the dataset.

You can also analyze scans produced only in a certain year, month, or day. The DLP API names tables based on the date the scan was run, so it's easy to query only the time period you're looking for. For example, the following query analyzes only those scans run in 2018:

SELECT
  info_type.name,
  COUNT(location.record_key.cloud_storage_key.file_path) AS icount
FROM
  `project-id.bq-dataset-id.dlp_googleapis_2018*`
GROUP BY
  info_type.name

The following query analyzes only those scans run in January 2018:

SELECT
  info_type.name,
  COUNT(location.record_key.cloud_storage_key.file_path) AS icount
FROM
  `project-id.bq-dataset-id.dlp_googleapis_2018_01*`
GROUP BY
  info_type.name

The following query analyzes only those scans run on January 23, 2018:

SELECT
  info_type.name,
  COUNT(location.record_key.cloud_storage_key.file_path) AS icount
FROM
  `project-id.bq-dataset-id.dlp_googleapis_2018_01_23*`
GROUP BY
  info_type.name

Create a report in Data Studio

Data Studio enables you to create custom reports that can be based on BigQuery tables. In this section, we create a simple pie chart in Data Studio that is based on DLP API findings stored in BigQuery.

  1. Open Data Studio and start a new report.
  2. Click Create New Data Source.
  3. From the list of Connectors, click BigQuery. If necessary, authorize Data Studio to connect to your BigQuery projects by clicking Authorize.
  4. Now, choose which table or tables to search:
    1. To search just one table: Click My Projects or Shared Projects, depending on where your project resides. Find your project, dataset, and table in the lists on the page.
    2. To search multiple tables: Click Custom Query, and then choose the project that contains your scan results. In the Enter Custom Query box, enter your custom query. Make sure Use Legacy SQL is unchecked.
  5. Click Connect to run the report.

The following custom query selects the infoType name, file path, and byte range from all tables within the bq-dataset-id dataset:

SELECT
  info_type.name,
  location.record_key.cloud_storage_key.file_path,
  location.byte_range.start,
  location.byte_range.END
FROM
  `project-id.bq-dataset-id.*`

Once the report is run, we add a calculated field that counts the file paths recorded for each of the infoTypes we're looking for.

  1. At the top of the Field column, click Add.
  2. If you want, give the field a name in Name; otherwise, a name is generated automatically.
  3. In the Formula field, enter COUNT(file_path).
  4. On the right, click Create Field.
  5. Set the Type to Number and the Aggregation to Auto.
  6. Click Add to Report.

Now we'll create a pie chart that will display the frequency of each infoType. Follow the instructions in About charts to add a pie chart to your report, selecting the "Count file_path" field as the Metric and the "name" (infoType) as the Dimension. The resulting pie chart will look similar to the following:

An example pie chart in Data Studio.
