This page provides example queries that you can use to analyze Sensitive Data Protection findings that were exported to BigQuery.
You can configure an inspection job or job trigger to save the findings to BigQuery. Doing so lets you query the findings for further analysis. When your findings are exported to BigQuery, the data is written to either a new or existing table.
For more information about all the actions that Sensitive Data Protection can perform after inspection, see the Actions conceptual topic.
For more information about running queries, see the BigQuery documentation.
Columns of the BigQuery table
The columns of the table of exported findings are based on the attributes of the Finding object.
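To see the exact schema of your export table, you can query BigQuery's standard INFORMATION_SCHEMA views. The following sketch uses the same PROJECT_ID, DATASET, and TABLE_ID placeholders as the sample queries later on this page:

-- List the columns of the exported findings table, in schema order.
SELECT column_name, data_type
FROM `PROJECT_ID.DATASET`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'TABLE_ID'
ORDER BY ordinal_position;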
Link results back to the rows containing the findings
If you are configuring inspection of a BigQuery table, you can set up the job or job trigger such that the exported findings will contain the row's identifiers. Doing so lets you link the inspection findings back to the rows that contain them.
In the inspection job or job trigger, set the following fields to the names of the columns that uniquely identify each row in the table (that is, the columns that serve as a primary key):
- If you're using the Google Cloud console, set the Identifying fields (comma separated) field.
- If you're using the DLP API, set the identifyingFields property.
When the inspection is done and the findings are exported to BigQuery, each finding contains the corresponding values of the columns that you specified. Those values are in the location.content_locations.record_location.record_key.id_values field. You can then use those values to link each finding back to the specific row in the inspected BigQuery table.
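For example, the following sketch joins exported findings back to the rows they were found in. It assumes a hypothetical source table PROJECT_ID.DATASET.SOURCE_TABLE whose single identifying column is named row_id, and that id_values holds the key values as strings, in the same order as the identifying fields you configured; adjust the names and offsets to match your setup:

-- Join each finding back to the source row that contains it.
-- SOURCE_TABLE and row_id are hypothetical placeholders.
SELECT findings.info_type.name AS info_type, src.*
FROM `PROJECT_ID.DATASET.TABLE_ID` AS findings,
  UNNEST(findings.location.content_locations) AS locations
JOIN `PROJECT_ID.DATASET.SOURCE_TABLE` AS src
  ON CAST(src.row_id AS STRING) = locations.record_location.record_key.id_values[OFFSET(0)]
LIMIT 100;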
Sample queries
You can use the following sample queries to analyze your findings or to feed a visualization tool such as Looker Studio. These queries are intended as starting points for querying your findings data.
In each of the following queries, replace the following:
- PROJECT_ID: the project identifier
- DATASET: the BigQuery dataset name
- TABLE_ID: the table ID
Select the count of each infoType
Google Cloud console
SELECT info_type.name, COUNT(info_type.name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`
GROUP BY info_type.name
ORDER BY count DESC;
Command-line
bq query --use_legacy_sql=false '
SELECT info_type.name, COUNT(info_type.name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`
GROUP BY info_type.name
ORDER BY count DESC;'
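If you only care about higher-confidence findings, you can add a filter on the likelihood column, which also comes from the Finding object (the per-column query later on this page uses it the same way). This is a sketch of one possible variant:

-- Count only findings rated LIKELY or VERY_LIKELY.
SELECT info_type.name, COUNT(*) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`
WHERE likelihood IN ('LIKELY', 'VERY_LIKELY')
GROUP BY info_type.name
ORDER BY count DESC;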
Select the count of each infoType by day
Google Cloud console
SELECT info_type.name,
  CAST(TIMESTAMP_SECONDS(create_time.seconds) AS DATE) AS day,
  COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
  UNNEST(location.content_locations) AS locations
GROUP BY info_type.name, day
ORDER BY count DESC;
Command-line
bq query --use_legacy_sql=false '
SELECT info_type.name,
  CAST(TIMESTAMP_SECONDS(create_time.seconds) AS DATE) AS day,
  COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
  UNNEST(location.content_locations) AS locations
GROUP BY info_type.name, day
ORDER BY count DESC;'
Select the count of each infoType in each container
Google Cloud console
SELECT info_type.name, locations.container_name, COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
  UNNEST(location.content_locations) AS locations
GROUP BY locations.container_name, info_type.name
ORDER BY count DESC;
Command-line
bq query --use_legacy_sql=false '
SELECT info_type.name, locations.container_name, COUNT(locations.container_name) AS count
FROM `PROJECT_ID.DATASET.TABLE_ID`,
  UNNEST(location.content_locations) AS locations
GROUP BY locations.container_name, info_type.name
ORDER BY count DESC;'
Select the finding types found for each column of a table
This query groups all findings by column name and is intended to work on findings from a BigQuery inspection job. It is useful when you're trying to identify the likely infoTypes of a given column. You can adjust the scope by modifying the WHERE and HAVING clauses. For example, if your findings table contains results from multiple tables or job runs, you can limit the output to a single job run or table name, as shown in the sketch after the sample output below.
Google Cloud console
SELECT table_counts.field_name,
  STRING_AGG(
    CONCAT(" ", table_counts.name, " [count: ", CAST(table_counts.count_total AS STRING), "]")
    ORDER BY table_counts.count_total DESC) AS infoTypes
FROM (
  SELECT locations.record_location.field_id.name AS field_name,
    info_type.name,
    COUNT(*) AS count_total
  FROM `PROJECT_ID.DATASET.TABLE_ID`,
    UNNEST(location.content_locations) AS locations
  WHERE likelihood IN ('LIKELY', 'VERY_LIKELY', 'POSSIBLE')
  GROUP BY locations.record_location.field_id.name, info_type.name
  HAVING count_total > 200
) AS table_counts
GROUP BY table_counts.field_name
ORDER BY table_counts.field_name;
For a sample table, the preceding query might produce a result like the following, where the infoTypes column shows how many instances of each infoType were found in the given column.
| field_name | infoTypes |
|---|---|
| field1 | CUSTOM_USER_US [count: 7004], CUSTOM_USER_EU [count: 2996] |
| field2 | US_VEHICLE_IDENTIFICATION_NUMBER [count: 9597] |
| field3 | EMAIL_ADDRESS [count: 10000] |
| field4 | IP_ADDRESS [count: 10000] |
| field5 | PHONE_NUMBER [count: 7815] |
| field6 | US_SOCIAL_SECURITY_NUMBER [count: 10000] |
| field7 | CREDIT_CARD_NUMBER [count: 10000] |
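For example, to limit the preceding query to findings from a single job run, you could extend the inner query's WHERE clause with a filter on the job_name column, assuming your export includes that column (it corresponds to the Finding object's jobName attribute). The job resource name here is a hypothetical placeholder:

-- Sketch: restrict the inner query to one inspection job run.
-- Replace JOB_ID with the ID of your own inspection job.
WHERE likelihood IN ('LIKELY', 'VERY_LIKELY', 'POSSIBLE')
  AND job_name = 'projects/PROJECT_ID/dlpJobs/JOB_ID'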