Create and use data profile scans

This page shows you how to create and manage data profile scans. Data profiling lets you identify common statistical characteristics of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.

For more information about Dataplex Universal Catalog data profile scans, see About data profiling.

Before you begin

Enable the Dataplex API.

Enable the API

Required roles

To profile BigQuery tables, you need the following permissions:

  • To run a data profile scan on a BigQuery table, you need permission to read the BigQuery table and permission to create a BigQuery job in the project used to scan the table.

  • If the BigQuery table and the data profile scan are in different projects, then you need to give the Dataplex Universal Catalog service account read permission on the corresponding BigQuery table.

  • If the BigQuery data is organized in a Dataplex Universal Catalog lake, then to create a data profile scan, you need the roles/dataplex.metadataReader and roles/dataplex.viewer roles. These roles grant the following permissions:

    • dataplex.lakes.list
    • dataplex.lakes.get
    • dataplex.zones.list
    • dataplex.zones.get
    • dataplex.entities.list
    • dataplex.entities.get
    • dataplex.operations.get
  • If you're scanning a BigQuery external table from Cloud Storage, then assign the Dataplex Universal Catalog service account either the Storage Object Viewer (roles/storage.objectViewer) role or the following permissions on the bucket:

    • storage.buckets.get
    • storage.objects.get
  • If you want to publish the data profile scan results in the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console for the source tables, you must be granted the BigQuery Data Editor (roles/bigquery.dataEditor) role on the table. Alternatively, you need all of the following permissions:

    • bigquery.tables.get
    • bigquery.tables.update
    • bigquery.tables.updateData
    • bigquery.tables.delete
  • To export the scan results to a BigQuery table, your Dataplex Universal Catalog service account needs the BigQuery Data Editor (roles/bigquery.dataEditor) role. This grants the following permissions:

    • bigquery.datasets.get
    • bigquery.tables.create
    • bigquery.tables.get
    • bigquery.tables.getData
    • bigquery.tables.update
    • bigquery.tables.updateData
  • If you need to access columns protected by BigQuery column-level access policies, then assign the Dataplex Universal Catalog service account permissions on those columns. The user creating or updating a data scan also needs permissions on the columns.

  • If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Dataplex Universal Catalog service account. Note that the individual user's access privileges are not evaluated for row-level policies.

Data scan roles and permissions

To use data profiling, ask your administrator to grant you one of the following IAM roles:

  • roles/dataplex.dataScanAdmin: Full access to DataScan resources.
  • roles/dataplex.dataScanEditor: Write access to DataScan resources.
  • roles/dataplex.dataScanViewer: Read access to DataScan resources, excluding the results.
  • roles/dataplex.dataScanDataViewer: Read access to DataScan resources, including the results.

The following list describes the data scan permissions:

  • dataplex.datascans.create: Create a DataScan
  • dataplex.datascans.delete: Delete a DataScan
  • dataplex.datascans.get: View DataScan details, excluding results
  • dataplex.datascans.getData: View DataScan details, including results
  • dataplex.datascans.list: List DataScans
  • dataplex.datascans.run: Run a DataScan
  • dataplex.datascans.update: Update the description of a DataScan
  • dataplex.datascans.getIamPolicy: View the current IAM permissions on the scan
  • dataplex.datascans.setIamPolicy: Set IAM permissions on the scan

Create a data profile scan

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click Create data profile scan.

  3. Optional: Enter a Display name.

  4. Enter an ID. See the Resource naming conventions.

  5. Optional: Enter a Description.

  6. In the Table field, click Browse. Choose the table to scan, and then click Select.

    For tables in multi-region datasets, choose the region in which to create the data scan.

    To browse the tables organized within Dataplex Universal Catalog lakes, click Browse within Dataplex Lakes.

  7. In the Scope field, choose Incremental or Entire data.

    • If you choose Incremental data, in the Timestamp column field, select a column of type DATE or TIMESTAMP from your BigQuery table that increases as new records are added, and that can be used to identify new records. For tables partitioned on a column of type DATE or TIMESTAMP, we recommend using the partition column as the timestamp field.
  8. Optional: To filter your data, do any of the following:

    • To filter by rows, select the Filter rows checkbox. Enter a valid SQL expression that can be used in a WHERE clause in GoogleSQL syntax. For example: col1 >= 0.

      The filter can be a combination of SQL conditions over multiple columns. For example: col1 >= 0 AND col2 < 10.

    • To filter by columns, select the Filter columns checkbox.

      • To include columns in the profile scan, in the Include columns field, click Browse. Select the columns to include, and then click Select.

      • To exclude columns from the profile scan, in the Exclude columns field, click Browse. Select the columns to exclude, and then click Select.
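
    The behavior of a row filter can be sketched locally. The following snippet is illustrative only: it uses SQLite in place of BigQuery, and the table and column values are made up. It shows how a filter expression such as col1 >= 0 AND col2 < 10 behaves when placed in a WHERE clause:

    ```python
    import sqlite3

    # Made-up sample data standing in for a BigQuery table.
    rows = [(-1, 5), (0, 3), (2, 10), (7, 4)]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

    # A row filter is a SQL expression usable in a WHERE clause.
    row_filter = "col1 >= 0 AND col2 < 10"
    filtered = conn.execute(f"SELECT col1, col2 FROM t WHERE {row_filter}").fetchall()

    # Only rows that satisfy both conditions survive the filter.
    print(filtered)  # [(0, 3), (7, 4)]
    ```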

  9. To apply sampling to your data profile scan, in the Sampling size list, select a sampling percentage. Choose a percentage value that ranges between 0.0% and 100.0% with up to 3 decimal digits.

    • For larger datasets, choose a lower sampling percentage. For example, for a 1 PB table, a value between 0.1% and 1.0% samples between 1 TB and 10 TB of data.

    • There must be at least 100 records in the sampled data to return a result.

    • For incremental data scans, the data profile scan applies sampling to the latest increment.
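
    The sampling arithmetic in the 1 PB example can be checked with a short sketch. The helper name and decimal byte units here are illustrative, not Dataplex API fields:

    ```python
    PB = 10 ** 15  # 1 PB in bytes (decimal units)
    TB = 10 ** 12  # 1 TB in bytes

    def sampled_bytes(table_bytes: int, sampling_percent: float) -> float:
        """Estimate how much data a profile scan reads at a given sampling rate."""
        return table_bytes * sampling_percent / 100

    low = sampled_bytes(PB, 0.1) / TB   # ~1 TB at 0.1%
    high = sampled_bytes(PB, 1.0) / TB  # ~10 TB at 1.0%
    print(low, high)
    ```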

  10. Optional: Publish the data profile scan results in the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console for the source table. Select the Publish results to the BigQuery and Dataplex Catalog UI checkbox.

    You can view the latest scan results in the Data profile tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data profile scan results section of this document.

    The publishing option might not be available in the following cases:

    • You don't have the required permissions on the table.
    • Another data profile scan is already set to publish results for this table.
  11. In the Schedule section, choose one of the following options:

    • Repeat: Run the data profile scan on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scan should run and at what time. If you choose custom, use cron format to specify the schedule.

    • On-demand: Run the data profile scan on demand.
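
    If you choose a custom schedule, the expression uses unix-cron format. A few illustrative examples (the times are made up):

    ```
    0 3 * * *     # every day at 03:00
    0 */6 * * *   # every 6 hours, on the hour
    0 9 * * 1     # every Monday at 09:00
    ```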

  12. Click Continue.

  13. Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, do the following:

    1. In the Select BigQuery dataset field, click Browse. Select a BigQuery dataset to store the data profile scan results.

    2. In the BigQuery table field, specify the table to store the data profile scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.

  14. Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.

  15. To create the scan, click Create.

    If you set the schedule to on-demand, you can also run the scan now by clicking Run scan.

gcloud

To create a data profile scan, use the gcloud dataplex datascans create data-profile command.

If the source data is organized in a Dataplex Universal Catalog lake, include the --data-source-entity flag:

gcloud dataplex datascans create data-profile DATASCAN \
--location=LOCATION \
--data-source-entity=DATA_SOURCE_ENTITY

If the source data isn't organized in a Dataplex Universal Catalog lake, include the --data-source-resource flag:

gcloud dataplex datascans create data-profile DATASCAN \
--location=LOCATION \
--data-source-resource=DATA_SOURCE_RESOURCE

Replace the following variables:

  • DATASCAN: The name of the data profile scan.
  • LOCATION: The Google Cloud region in which to create the data profile scan.
  • DATA_SOURCE_ENTITY: The Dataplex Universal Catalog entity that contains the data for the data profile scan. For example, projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity.
  • DATA_SOURCE_RESOURCE: The name of the resource that contains the data for the data profile scan. For example, //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table.

REST

To create a data profile scan, use the dataScans.create method.

Export table schema

If you want to export the data profile scan results to an existing BigQuery table, make sure that it is compatible with the following table schema:

  • data_profile_scan (struct/record):
    • resource_name (string, nullable). Example: //dataplex.googleapis.com/projects/test-project/locations/europe-west2/datascans/test-datascan
    • project_id (string, nullable). Example: test-project
    • location (string, nullable). Example: us-central1
    • data_scan_id (string, nullable). Example: test-datascan
  • data_source (struct/record):
    • resource_name (string, nullable). Entity case example: //dataplex.googleapis.com/projects/test-project/locations/europe-west2/lakes/test-lake/zones/test-zone/entities/test-entity. Table case example: //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table
    • dataplex_entity_project_id (string, nullable). Example: test-project
    • dataplex_entity_project_number (integer, nullable). Example: 123456789012
    • dataplex_lake_id (string, nullable; valid only if the source is an entity). Example: test-lake
    • dataplex_zone_id (string, nullable; valid only if the source is an entity). Example: test-zone
    • dataplex_entity_id (string, nullable; valid only if the source is an entity). Example: test-entity
    • table_project_id (string, nullable). Example: dataplex-table
    • table_project_number (int64, nullable). Example: 345678901234
    • dataset_id (string, nullable; valid only if the source is a table). Example: test-dataset
    • table_id (string, nullable; valid only if the source is a table). Example: test-table
  • data_profile_job_id (string, nullable). Example: caeba234-cfde-4fca-9e5b-fe02a9812e38
  • data_profile_job_configuration (json):
    • trigger (string, nullable). Example: ondemand/schedule
    • incremental (boolean, nullable). Example: true/false
    • sampling_percent (float, nullable; 0-100). Example: 20.0 (indicates 20%)
    • row_filter (string, nullable). Example: col1 >= 0 AND col2 < 10
    • column_filter (json, nullable). Example: {"include_fields":["col1","col2"], "exclude_fields":["col3"]}
  • job_labels (json, nullable). Example: {"key1":value1}
  • job_start_time (timestamp, nullable). Example: 2023-01-01 00:00:00 UTC
  • job_end_time (timestamp, nullable). Example: 2023-01-01 00:00:00 UTC
  • job_rows_scanned (integer, nullable). Example: 7500
  • column_name (string, nullable). Example: column-1
  • column_type (string, nullable). Example: string
  • column_mode (string, nullable). Example: repeated
  • percent_null (float, nullable; 0.0-100.0). Example: 20.0 (indicates 20%)
  • percent_unique (float, nullable; 0.0-100.0). Example: 92.5
  • min_string_length (integer, nullable; valid only if the column type is string). Example: 10
  • max_string_length (integer, nullable; valid only if the column type is string). Example: 4
  • average_string_length (float, nullable; valid only if the column type is string). Example: 7.2
  • min_value (float, nullable; valid only if the column type is numeric: integer/float)
  • max_value (float, nullable; valid only if the column type is numeric: integer/float)
  • average_value (float, nullable; valid only if the column type is numeric: integer/float)
  • standard_deviation (float, nullable; valid only if the column type is numeric: integer/float)
  • quartile_lower (integer, nullable; valid only if the column type is numeric: integer/float)
  • quartile_median (integer, nullable; valid only if the column type is numeric: integer/float)
  • quartile_upper (integer, nullable; valid only if the column type is numeric: integer/float)
  • top_n (struct/record, repeated):
    • value (string, nullable). Example: "4009"
    • count (integer, nullable). Example: 20
    • percent (float, nullable). Example: 10 (indicates 10%)
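
As an illustration of the per-column statistics in this schema (percent_null, percent_unique, top_n), here is a small local sketch. The column values are made up, and percent_unique is computed here under one plausible definition (the share of non-null values that occur exactly once), which may not match the service's exact definition:

```python
from collections import Counter

# Made-up column values, including some NULLs.
values = ["4009", "4009", "1001", None, "7777", "4009", None, "1001", "2002", "3003"]

non_null = [v for v in values if v is not None]

# Share of rows that are NULL.
percent_null = 100.0 * values.count(None) / len(values)

# Share of non-null values that occur exactly once (assumed definition).
counts = Counter(non_null)
percent_unique = 100.0 * sum(1 for c in counts.values() if c == 1) / len(non_null)

# Most frequent values, with their counts and percentages.
top_n = [
    {"value": v, "count": c, "percent": 100.0 * c / len(non_null)}
    for v, c in counts.most_common(2)
]

print(percent_null)    # 20.0
print(percent_unique)  # 37.5
print(top_n[0])        # {'value': '4009', 'count': 3, 'percent': 37.5}
```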

Export table setup

When you set up the export table, follow these guidelines:

  • For the field resultsTable, use the format: //bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}.
  • Use a BigQuery standard table.
  • If the table doesn't exist when the scan is created or updated, Dataplex Universal Catalog creates the table for you.
  • By default, the table is partitioned on the job_start_time column daily.
  • If you want a different partitioning configuration, or no partitioning at all, create the table yourself with the required schema and configuration, and then provide the pre-created table as the results table.
  • Make sure the results table is in the same location as the source table.
  • If VPC-SC is configured on the project, then the results table must be in the same VPC-SC perimeter as the source table.
  • If the table is modified during the scan execution stage, then the current running job exports to the previous results table and the table change takes effect from the next scan job.
  • Don't modify the table schema. If you need customized columns, create a view upon the table.
  • To reduce costs, set an expiration on the partition based on your use case. For more information, see how to set the partition expiration.
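
A quick way to sanity-check a resultsTable value against the format above is a small parser. This helper is hypothetical, not part of any Google Cloud library:

```python
import re

# Expected shape (from the guidelines above):
# //bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}
RESULTS_TABLE_RE = re.compile(
    r"^//bigquery\.googleapis\.com/projects/(?P<project>[^/]+)"
    r"/datasets/(?P<dataset>[^/]+)/tables/(?P<table>[^/]+)$"
)

def parse_results_table(path: str) -> dict:
    """Return the project, dataset, and table IDs, or raise if malformed."""
    m = RESULTS_TABLE_RE.match(path)
    if not m:
        raise ValueError(f"not a valid resultsTable path: {path!r}")
    return m.groupdict()

print(parse_results_table(
    "//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table"
))  # {'project': 'test-project', 'dataset': 'test-dataset', 'table': 'test-table'}
```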

Create multiple data profile scans

You can configure data profile scans for multiple tables in a BigQuery dataset at the same time by using the Google Cloud console.

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click Create data profile scan.

  3. Select the Multiple data profile scans option.

  4. Enter an ID prefix. Dataplex Universal Catalog automatically generates scan IDs by using the provided prefix and unique suffixes.

  5. Enter a Description for all of the data profile scans.

  6. In the Dataset field, click Browse. Select a dataset to pick tables from. Click Select.

  7. If the dataset is multi-regional, select a Region in which to create the data profile scans.

  8. Configure the common settings for the scans:

    1. In the Scope field, choose Incremental or Entire data.

    2. To apply sampling to the data profile scans, in the Sampling size list, select a sampling percentage.

      Choose a percentage value between 0.0% and 100.0% with up to 3 decimal digits.

    3. Optional: Publish the data profile scan results in the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console for the source table. Select the Publish results to the BigQuery and Dataplex Catalog UI checkbox.

      You can view the latest scan results in the Data profile tab in the BigQuery and Dataplex Universal Catalog pages for the source table. To enable users to access the published scan results, see the Grant access to data profile scan results section of this document.

    4. In the Schedule section, choose one of the following options:

      • Repeat: Run the data profile scans on a schedule: hourly, daily, weekly, monthly, or custom. Specify how often the scans should run and at what time. If you choose custom, use cron format to specify the schedule.

      • On-demand: Run the data profile scans on demand.

  9. Click Continue.

  10. In the Choose tables field, click Browse. Choose one or more tables to scan, and then click Select.

  11. Click Continue.

  12. Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, do the following:

    1. In the Select BigQuery dataset field, click Browse. Select a BigQuery dataset to store the data profile scan results.

    2. In the BigQuery table field, specify the table to store the data profile scan results. If you're using an existing table, make sure that it is compatible with the export table schema. If the specified table doesn't exist, Dataplex Universal Catalog creates it for you.

      Dataplex Universal Catalog uses the same results table for all of the data profile scans.

  13. Optional: Add labels. Labels are key-value pairs that let you group related objects together or with other Google Cloud resources.

  14. To create the scans, click Create.

    If you set the schedule to on-demand, you can also run the scans now by clicking Run scan.

Run a data profile scan

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the data profile scan to run.
  3. Click Run now.

gcloud

To run a data profile scan, use the gcloud dataplex datascans run command:

gcloud dataplex datascans run DATASCAN \
--location=LOCATION

Replace the following variables:

  • DATASCAN: The name of the data profile scan.
  • LOCATION: The Google Cloud region in which the data profile scan was created.

REST

To run a data profile scan, use the dataScans.run method.

View data profile scan results

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the name of a data profile scan.

    • The Overview section displays information about the most recent jobs, including when the scan was run, the number of table records scanned, and the job status.

    • The Data profile scan configuration section displays details about the scan.

  3. To see detailed information about a job, such as the scanned table's columns, statistics about the columns that were found in the scan, and the job logs, click the Jobs history tab. Then, click a job ID.

gcloud

To view the results of a data profile scan job, use the gcloud dataplex datascans jobs describe command:

gcloud dataplex datascans jobs describe JOB \
--location=LOCATION \
--datascan=DATASCAN \
--view=FULL

Replace the following variables:

  • JOB: The job ID of the data profile scan job.
  • LOCATION: The Google Cloud region in which the data profile scan was created.
  • DATASCAN: The name of the data profile scan the job belongs to.
  • --view=FULL: To see the scan job result, specify FULL.

REST

To view the results of a data profile scan, use the dataScans.get method.

View published results

If the data profile scan results are published to the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console, then you can see the latest scan results on the source table's Data profile tab.

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Search page.

    Go to Search

  2. Search for and then select the table.

  3. Click the Data profile tab.

    The latest published results are displayed.

View the most recent data profile scan job

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the name of a data profile scan.

  3. Click the Latest job results tab.

    When there is at least one successfully completed run, the Latest job results tab provides information about the most recent job. It lists the scanned table's columns and statistics about the columns that were found in the scan.

gcloud

To view the most recent successful data profile scan, use the gcloud dataplex datascans describe command:

gcloud dataplex datascans describe DATASCAN \
--location=LOCATION \
--view=FULL

Replace the following variables:

  • DATASCAN: The name of the data profile scan to view the most recent job for.
  • LOCATION: The Google Cloud region in which the data profile scan was created.
  • --view=FULL: To see the scan job result, specify FULL.

REST

To view the most recent scan job, use the dataScans.get method.

View historical scan results

Dataplex Universal Catalog saves the data profile scan history for the last 300 jobs or for the past year, whichever limit is reached first.

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the name of a data profile scan.

  3. Click the Jobs history tab.

    The Jobs history tab provides information about past jobs, such as the number of records scanned in each job, the job status, and the time the job was run.

  4. To view detailed information about a job, click any of the jobs in the Job ID column.

gcloud

To view historical data profile scan jobs, use the gcloud dataplex datascans jobs list command:

gcloud dataplex datascans jobs list \
--location=LOCATION \
--datascan=DATASCAN

Replace the following variables:

  • LOCATION: The Google Cloud region in which the data profile scan was created.
  • DATASCAN: The name of the data profile scan to view jobs for.

REST

To view historical data profile scan jobs, use the dataScans.jobs.list method.

Grant access to data profile scan results

To enable the users in your organization to view the scan results, do the following:

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the data profile scan that you want to share the results of.

  3. Click the Permissions tab.

  4. Do the following:

    • To grant access to a principal, click Grant access. Grant the Dataplex DataScan DataViewer role to the associated principal.
    • To remove access from a principal, select the principal that you want to remove the Dataplex DataScan DataViewer role from. Click Remove access, and then confirm when prompted.

Manage data profile scans for a specific table

The steps in this document show how to manage data profile scans across your project by using the Dataplex Universal Catalog Data profiling & quality page in the Google Cloud console.

You can also create and manage data profile scans when working with a specific table. In the Google Cloud console, on the Dataplex Universal Catalog page for the table, use the Data profile tab. Do the following:

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Search page.

    Go to Search

    Search for and then select the table.

  2. Click the Data profile tab.

  3. Depending on whether the table has a data profile scan whose results are published, you can work with the table's data profile scans in the following ways:

    • Data profile scan results are published: the latest published scan results are displayed on the page.

      To manage the data profile scans for this table, click Data profile scan, and then select from the following options:

      • Create new scan: create a new data profile scan. For more information, see the Create a data profile scan section of this document. When you create a scan from a table's details page, the table is preselected.

      • Run now: run the scan.

      • Edit scan configuration: edit settings including the display name, filters, sampling size, and schedule.

      • Manage scan permissions: control who can access the scan results. For more information, see the Grant access to data profile scan results section of this document.

      • View historical results: view detailed information about previous data profile scan jobs. For more information, see the View data profile scan results and View historical scan results sections of this document.

      • View all scans: view a list of data profile scans that apply to this table.

    • Data profile scan results aren't published: click the menu next to Quick data profile, and then select from the following options:

      • Customize data profiling: create a new data profile scan. For more information, see the Create a data profile scan section of this document. When you create a scan from a table's details page, the table is preselected.

      • View previous profiles: view a list of data profile scans that apply to this table.

Update a data profile scan

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the name of a data profile scan.

  3. Click Edit, and then edit the values.

  4. Click Save.

gcloud

To update a data profile scan, use the gcloud dataplex datascans update data-profile command:

gcloud dataplex datascans update data-profile DATASCAN \
--location=LOCATION \
--description=DESCRIPTION

Replace the following variables:

  • DATASCAN: The name of the data profile scan to update.
  • LOCATION: The Google Cloud region in which the data profile scan was created.
  • DESCRIPTION: The new description for the data profile scan.

REST

To edit a data profile scan, use the dataScans.patch method.

Delete a data profile scan

Console

  1. In the Google Cloud console, go to the Dataplex Universal Catalog Data profiling & quality page.

    Go to Data profiling & quality

  2. Click the scan you want to delete.

  3. Click Delete, and then confirm when prompted.

gcloud

To delete a data profile scan, use the gcloud dataplex datascans delete command:

gcloud dataplex datascans delete DATASCAN \
--location=LOCATION --async

Replace the following variables:

  • DATASCAN: The name of the data profile scan to delete.
  • LOCATION: The Google Cloud region in which the data profile scan was created.

REST

To delete a data profile scan, use the dataScans.delete method.

What's next?