Query Sensitive Data Protection findings in BigQuery

This page provides example queries that you can use to analyze Sensitive Data Protection findings that were exported to BigQuery.

You can configure an inspection job or job trigger to save the findings to BigQuery. Doing so lets you query the findings for further analysis. When your findings are exported to BigQuery, the data is written to either a new or existing table.

For more information about all the actions that Sensitive Data Protection can perform after inspection, see the Actions conceptual topic.

For more information about running queries, see the BigQuery documentation on running interactive queries and on using the bq command-line tool.

Columns of the BigQuery table

The columns of the table of exported findings are based on the attributes of the Finding object.

If you're configuring inspection of a BigQuery table, you can set up the job or job trigger so that the exported findings include each row's identifiers. Doing so lets you link the inspection findings back to the rows that contain them.

In the inspection job or job trigger, set the following fields to the names of the columns that uniquely identify each row in the table—that is, the columns that serve the purpose of a primary key:

  • If you're using the Google Cloud console, set the Identifying fields (comma separated) field.
  • If you're using the DLP API, set the identifyingFields property.

When the inspection is done and the findings are exported to BigQuery, each finding contains the corresponding values of the columns that you specified. Those values are in the location.content_locations.record_location.record_key.id_values field. You can then use those values to link each finding back to the specific row in the inspected BigQuery table.
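For example, the following query sketches how you might join findings back to the rows of the inspected table. It assumes a hypothetical inspected table named INSPECTED_TABLE whose single identifying field is a hypothetical column named customer_id; adjust both names, and the join condition, to match your configuration.

SELECT
  findings.info_type.name AS info_type,
  source.*
FROM
  `PROJECT_ID.DATASET.TABLE_ID` AS findings,
  -- Each finding can have multiple content locations; unnest them.
  UNNEST(findings.location.content_locations) AS locations,
  -- id_values holds one string per identifying field; this sketch assumes one.
  UNNEST(locations.record_location.record_key.id_values) AS id_value
JOIN
  `PROJECT_ID.DATASET.INSPECTED_TABLE` AS source
ON
  CAST(source.customer_id AS STRING) = id_value;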

Sample queries

You can use the following sample queries to analyze your findings, or to feed a visualization tool such as Looker Studio. These queries are provided to help you get started querying your findings data.

In each of the following queries, replace the following:

  • PROJECT_ID: the ID of the project that contains the findings table
  • DATASET: the name of the BigQuery dataset that contains the findings table
  • TABLE_ID: the ID of the findings table
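For example, with hypothetical values substituted (a project named my-project, a dataset named dlp_results, and a table named findings), a minimal query against the findings table looks like this:

SELECT COUNT(*) AS total_findings
FROM `my-project.dlp_results.findings`;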

Select the count of each infoType

Google Cloud console

SELECT info_type.name,
COUNT(info_type.name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`
GROUP BY info_type.name
ORDER BY count DESC;

Command-line

bq query --use_legacy_sql=false '
SELECT info_type.name,
COUNT(info_type.name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`
GROUP BY info_type.name
ORDER BY count DESC;'

Select the count of each infoType by day

Google Cloud console

SELECT info_type.name, CAST(TIMESTAMP_SECONDS(create_time.seconds) AS DATE) AS day,
COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
UNNEST(location.content_locations) AS locations
GROUP BY info_type.name, day
ORDER BY count DESC;

Command-line

bq query --use_legacy_sql=false '
SELECT info_type.name,
CAST(TIMESTAMP_SECONDS(create_time.seconds) AS DATE) AS day,
COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
UNNEST(location.content_locations) AS locations
GROUP BY info_type.name, day
ORDER BY count DESC;'

Select the count of each infoType in each container

Google Cloud console

SELECT info_type.name, locations.container_name,
COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
UNNEST(location.content_locations) AS locations
GROUP BY locations.container_name, info_type.name
ORDER BY count DESC;

Command-line

bq query --use_legacy_sql=false '
SELECT info_type.name, locations.container_name,
COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
UNNEST(location.content_locations) AS locations
GROUP BY locations.container_name, info_type.name
ORDER BY count DESC;'

Select the finding types found for each column of a table

This query groups all the findings by column name and is intended to work on findings from a BigQuery inspection job. It's useful if you're trying to identify the likely infoTypes for a given column. You can adjust the scope by modifying the WHERE and HAVING clauses. For example, if your findings table contains results from multiple jobs or tables, you can limit the results to a single job run or a single table name, as shown in the variant after the sample output below.

Google Cloud console

SELECT
  table_counts.field_name,
  STRING_AGG(CONCAT(" ", table_counts.name, " [count: ", CAST(table_counts.count_total AS STRING), "]")
  ORDER BY
    table_counts.count_total DESC) AS infoTypes
FROM (
  SELECT
    locations.record_location.field_id.name AS field_name,
    info_type.name,
    COUNT(*) AS count_total
  FROM
    `PROJECT_ID.DATASET.TABLE_ID`,
    UNNEST(location.content_locations) AS locations
  WHERE
    (likelihood = 'LIKELY'
      OR likelihood = 'VERY_LIKELY'
      OR likelihood = 'POSSIBLE')
  GROUP BY
    locations.record_location.field_id.name,
    info_type.name
  HAVING
    count_total > 200) AS table_counts
GROUP BY
  table_counts.field_name
ORDER BY
  table_counts.field_name;

The preceding query might produce a result like the following for a sample table, where the infoTypes column shows how many instances of each infoType were found in the given column.

field_name   infoTypes
field1       CUSTOM_USER_US [count: 7004], CUSTOM_USER_EU [count: 2996]
field2       US_VEHICLE_IDENTIFICATION_NUMBER [count: 9597]
field3       EMAIL_ADDRESS [count: 10000]
field4       IP_ADDRESS [count: 10000]
field5       PHONE_NUMBER [count: 7815]
field6       US_SOCIAL_SECURITY_NUMBER [count: 10000]
field7       CREDIT_CARD_NUMBER [count: 10000]
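As a sketch of the adjustment described earlier, the following variant of the inner query restricts the counts to a single job run and a single inspected table. The job name and container name shown are hypothetical placeholders; copy real values from the job_name and location.content_locations.container_name columns of your findings table.

SELECT
  locations.record_location.field_id.name AS field_name,
  info_type.name,
  COUNT(*) AS count_total
FROM
  `PROJECT_ID.DATASET.TABLE_ID`,
  UNNEST(location.content_locations) AS locations
WHERE
  -- Hypothetical job resource name; replace with a job_name value from your table.
  job_name = 'projects/PROJECT_ID/dlpJobs/i-1234567890'
  -- Hypothetical container name of the inspected BigQuery table.
  AND locations.container_name = 'PROJECT_ID:DATASET.INSPECTED_TABLE'
  AND likelihood IN ('POSSIBLE', 'LIKELY', 'VERY_LIKELY')
GROUP BY
  field_name,
  info_type.name
ORDER BY
  count_total DESC;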