This page shows you how to create a data profile scan using the Google Cloud console, Google Cloud CLI, or REST API.
For more information about Dataplex data profile scans, see About data profiling.
Before you begin
In the Google Cloud console, enable the Dataplex API.
Permissions
To profile BigQuery tables, you need the following permissions:
To run a data profile scan on a BigQuery table, you need permission to read the BigQuery table and permission to create a BigQuery job in the project used to scan the table.
If the BigQuery table and the data profile scan are in different projects, then you need to give the Dataplex service account read permission on the corresponding BigQuery table.
If the BigQuery data is organized in a Dataplex lake, then to create a data profile scan, you need the Dataplex roles
roles/dataplex.metadataReader
androles/dataplex.viewer
. This grants the following permissions:dataplex.lakes.list
dataplex.lakes.get
dataplex.zones.list
dataplex.zones.get
dataplex.entities.list
dataplex.entities.get
dataplex.operations.get
If you're scanning a BigQuery external table from Cloud Storage, then assign the Dataplex service account either the Cloud Storage Object Viewer (
roles/storage.objectViewer
) role or the following permissions for the bucket:storage.buckets.get
storage.objects.get
If you want to publish the data profile scan results in the BigQuery and Data Catalog pages in the Google Cloud console for the source tables, you must be granted the BigQuery Data Editor (
roles/bigquery.dataEditor
) IAM role on the table. Alternatively, you need all of the following permissions:bigquery.tables.get
bigquery.tables.update
bigquery.tables.updateData
bigquery.tables.delete
To export the scan results to a BigQuery table, your Dataplex service account needs the BigQuery Data Editor (
roles/bigquery.dataEditor
) role. This grants the following permissions:bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.update
bigquery.tables.updateData
If you need to access columns protected by BigQuery column-level access policies, then assign the Dataplex service account permissions for those columns. The user creating or updating a data scan also needs permissions for the columns.
If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Dataplex service account. Note that the individual user's access privileges are not evaluated for row-level policies.
Data scan roles and permissions
To use data profiling, a project administrator either assigns a predefined role with permissions already granted, or grants individual permissions. The roles are as follows:
roles/dataplex.dataScanAdmin
: Full access toDataScan
resources.roles/dataplex.dataScanEditor
: Write access toDataScan
resources.roles/dataplex.dataScanViewer
: Read access toDataScan
resources, excluding the results.roles/dataplex.dataScanDataViewer
: Read access toDataScan
resources, including the results.
The following table lists the data scan permissions:
Permission name | Grants permission to do the following: |
---|---|
dataplex.datascans.create |
Create a DataScan |
dataplex.datascans.delete |
Delete a DataScan |
dataplex.datascans.get |
View DataScan details excluding results |
dataplex.datascans.getData |
View DataScan details including results |
dataplex.datascans.list |
List DataScan s |
dataplex.datascans.run |
Execute a DataScan |
dataplex.datascans.update |
Update the description of a DataScan |
dataplex.datascans.getIamPolicy |
View the current IAM permissions on the scan |
dataplex.datascans.setIamPolicy |
Set IAM permissions on the scan |
Create a data profile scan
Console
In the Google Cloud console, go to the Profile page.
Click Create data profile scan.
Enter a Display name.
To change the automatically generated scan ID, provide your own. See Resource naming convention.
Optional: Enter a Description.
In the Table field, click Browse.
Select a table and click Select.
In the Scope field, choose Incremental or Entire data.
- If you choose Incremental data, in the Timestamp column field,
select a column of type
DATE
orTIMESTAMP
from your BigQuery table that increases monotonically and can be used to identify new records. For tables partitioned on a column of typeDATE
orTIMESTAMP
, we recommend using the partition column as the timestamp field.
- If you choose Incremental data, in the Timestamp column field,
select a column of type
To apply sampling to your data profile scan, in the Sampling size list, select a sampling percentage.
- Choose a percentage value that ranges between 0.0% and 100.0% with up to 3 decimal digits.
- For larger datasets, choose a lower sampling percentage. For example, for a ~1 PB table, if you enter a value between 0.1% and 1.0%, Dataplex samples between 1-10 TB of data.
- You need at least 100 records in the sampled data to return a result.
- For incremental data scans, Dataplex applies sampling to the latest increment.
To filter by row, click Filters, and select Filter rows.
Enter a valid SQL expression that can be used in a
WHERE
clause in GoogleSQL syntax. For example:col1 >= 0
.The filter can be a combination of SQL conditions over multiple columns. For example:
col1 >= 0 AND col2 < 10
.
Optional: Click Filters. Select the checkbox Filter columns.
a. In the Include columns field, click Browse.
- Specify any columns to include in the profile scan. Select the columns of your choice by checking the boxes and clicking Select.
b. In the Exclude columns field, click Browse.
- Specify any columns to exclude from the profile scan. Select the columns of your choice by checking the boxes and clicking Select.
Optional: Publish the data profile scan results in the BigQuery and Data Catalog pages in the Google Cloud console for the source table. Click the Publish results to the BigQuery and Dataplex Catalog UI checkbox.
You can view the latest scan results in the Data Profile tab in the BigQuery and Data Catalog pages for the source table. To enable users to access the published scan results, see Share the published results.
The publishing option might not be available in the following cases:
- You don't have the required permissions on the table.
- Another data quality scan is set to publish results.
For more information about the permissions required to view the published results, see Permissions.
Optional: Export the scan results to a BigQuery standard table. Click Browse to select an existing BigQuery dataset to store the data profile scan results.
If the specified table doesn't exist, Dataplex creates it for you. If you are using an existing table, make sure that it is compatible with the table schema described later in this section.
Optional: Add labels. Labels are
key:value
pairs that allow you to group related objects together or with other Google Cloud resources.Under Schedule options, choose one of the following options:
Repeat: Run your data profile scan job on a schedule: daily, weekly, monthly, or custom. Specify how often the scan should run and at what time. If you choose custom, use cron format to specify the schedule.
On-demand: Create your data profile scan and run it at any time using the run now action.
Click Create.
gcloud
To create a data profile scan, run the following command:
gcloud dataplex datascans create data-profile DATASCAN \ --location=LOCATION \ --data-source-entity=DATA_SOURCE_ENTITY | --data-source-resource=DATA_SOURCE_RESOURCE
Replace the following variables:
DATASCAN
: The name of the data profile scan.LOCATION
: The Google Cloud region in which to create the data profile scan.DATA_SOURCE_ENTITY
: The Dataplex entity that contains the data for the data profile scan. For example,projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity
.DATA_SOURCE_RESOURCE
: The name of the resource that contains the data for the data profile scan. For example,//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table
.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to create a data profile scan.
Create multiple data profile scans
Console
In the Google Cloud console, go to the Profile page.
Click Create multiple profile scans.
Enter an ID prefix. Dataplex automatically generates scan IDs by using the provided prefix and unique suffixes.
Enter a Description for all of the data profile scans.
In the Dataset field, click Browse. Select a dataset to pick tables from. Click Select.
If the dataset is multi-regional, select a Region in which to create the data profile scans.
Select Common configuration options:
In the Scope field, choose Incremental or Entire data.
To apply sampling to your data profile scans, in the Sampling size list, select a sampling percentage.
Choose a percentage value between 0.0% and 100.0% with up to 3 decimal digits.
To display the results of all the scans, select Publishing. You can view the results in Profile tab of the BigQuery or Data Catalog table details. Make sure you have the
bigquery.tables.update
permissions on the source tables.Under Schedule options, choose one of the following options:
Repeat: Run your data profile scan jobs on a schedule. Specify how often to run the scan (daily, weekly, monthly, or custom) and at what time. If you choose custom, use cron format to specify the schedule.
On-demand: Create your data profile scan jobs and run them at any time by clicking Run.
In the Choose tables option, click Browse. Choose one or more of the tables to be scanned. Click Select.
Select Additional settings:
To save the results of your data profile scans to a BigQuery table of your choice, choose a table in Export scan results to BigQuery table. Dataplex automatically copies and saves the results to this table for every scan job.
Click Browse to select a dataset.
Enter a BigQuery table to save results to. This can be an existing table, used by other Dataplex data profile scans to save results. If there is no such table with the specified name, Dataplex creates the table.
Add Labels to annotate your data profile scan.
Click Run scan to create and run all the scans. This option is only available for on-demand scans.
Click Create to create all the scans.
gcloud
Not supported.
REST
Not supported.
Export table schema
If you want to export the data profile scan results to an existing BigQuery table, make sure that it is compatible with the following table schema:
Column name | Column data type | Sub field name(if applicable) | Sub field data type | Mode | Example |
---|---|---|---|---|---|
data_profile_scan | struct/record |
resource_name |
string |
nullable | //dataplex.googleapis.com/projects/test-project/locations/europe-west2/datascans/test-datascan |
project_id |
string |
nullable | test-project |
||
location |
string |
nullable | us-central1 |
||
data_scan_id |
string |
nullable | test-datascan |
||
data_source | struct/record |
resource_name |
string |
nullable | Entity case:
//dataplex.googleapis.com/projects/test-project/locations/europe-west2/lakes/test-lake/zones/test-zone/entities/test-entity
Table case:
//bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table
|
dataplex_entity_project_id |
string |
nullable | test-project |
||
dataplex_entity_project_number |
integer |
nullable | 123456789012 |
||
dataplex_lake_id |
string |
nullable | (Valid only if source is entity)
test-lake
|
||
dataplex_zone_id |
string |
nullable | (Valid only if source is entity)
test-zone |
||
dataplex_entity_id |
string |
nullable | (Valid only if source is entity)
test-entity |
||
table_project_id |
string |
nullable | dataplex-table |
||
table_project_number |
int64 |
nullable | 345678901234 |
||
dataset_id |
string |
nullable | (Valid only if source is table)
test-dataset |
||
table_id |
string |
nullable | (Valid only if source is table)
test-table |
||
data_profile_job_id | string |
nullable | caeba234-cfde-4fca-9e5b-fe02a9812e38 |
||
data_profile_job_configuration | json |
trigger |
string |
nullable | ondemand /schedule |
incremental |
boolean |
nullable | true /false |
||
sampling_percent |
float |
nullable | (0-100)
20.0 (indicates 20%) |
||
row_filter |
string |
nullable | col1 >= 0 AND col2 < 10 |
||
column_filter |
json |
nullable | {"include_fields":["col1","col2"], "exclude_fields":["col3"]} |
||
job_labels | json |
nullable | {"key1":value1} |
||
job_start_time | timestamp |
nullable | 2023-01-01 00:00:00 UTC |
||
job_end_time | timestamp |
nullable | 2023-01-01 00:00:00 UTC |
||
job_rows_scanned | integer |
nullable | 7500 |
||
column_name | string |
nullable | column-1 |
||
column_type | string |
nullable | string |
||
column_mode | string |
nullable | repeated |
||
percent_null | float |
nullable | (0.0-100.0)
20.0 (indicates 20%) |
||
percent_unique | float |
nullable | (0.0-100.0)
92.5 |
||
min_string_length | integer |
nullable | (Valid only if column type is string)
10 |
||
max_string_length | integer |
nullable | (Valid only if column type is string)
4 |
||
average_string_length | float |
nullable | (Valid only if column type is string)
7.2 |
||
min_value | float |
nullable | (Valid only if column type is numeric - integer/float) | ||
max_value | float |
nullable | (Valid only if column type is numeric - integer/float) | ||
average_value | float |
nullable | (Valid only if column type is numeric - integer/float) | ||
standard_deviation | float |
nullable | (Valid only if column type is numeric - integer/float) | ||
quartile_lower | integer |
nullable | (Valid only if column type is numeric - integer/float) | ||
quartile_median | integer |
nullable | (Valid only if column type is numeric - integer/float) | ||
quartile_upper | integer |
nullable | (Valid only if column type is numeric - integer/float) | ||
top_n | struct/record - repeated |
value |
string |
nullable | "4009" |
count |
integer |
nullable | 20 |
||
percent |
float |
nullable | 10 (indicates 10%) |
Export table setup
When you export to BigQueryExport tables, follow these guidelines:
- For the field
resultsTable
, use the format://bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}
. - Use a BigQuery standard table.
- If the table doesn't exist when the scan is created or updated, Dataplex creates the table for you.
- By default, the table is partitioned on the
job_start_time
column daily. - If you want the table to be partitioned in other configurations or if you don't want the partition, then recreate the table with the required schema and configurations and then provide the pre-created table as the results table.
- Make sure the results table is in the same location as the source table.
- If VPC-SC is configured on the project, then the results table must be in the same VPC-SC perimeter as the source table.
- If the table is modified during the scan execution stage, then the current running job exports to the previous results table and the table change takes effect from the next scan job.
- Don't modify the table schema. If you need customized columns, create a view upon the table.
- To reduce costs, set an expiration on the partition based on your use case. For more information, see how to set the partition expiration.
Run a data profile scan
Console
- In the Google Cloud console, go to the Dataplex Profile page. Go to Profile
- Click the data profile scan to run.
- Click Run now.
gcloud
To run a data profile scan, run the following command:
gcloud dataplex datascans run DATASCAN \ --location=LOCATION
Replace the following variables:
DATASCAN
: The name of the data profile scan.LOCATION
: The Google Cloud region in which the data profile scan was created.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to run your data profile scan.
View the data profile scan job results
Console
All of the data profile scans you create appear in the Profile page.
To see the detailed results of a scan, click the name of the scan.
The Overview section displays the scan runs, the time of each run, the number of table records scanned, and the job status.
The Profile scan configuration section contains details about the scan.
gcloud
To view the results of a data profile scan job, run the following command:
gcloud dataplex datascans jobs describe JOB \ --location=LOCATION \ --datascan=DATASCAN \ --view=FULL
Replace the following variables:
JOB
: The job ID of the data profile scan job.LOCATION
: The Google Cloud region in which the data profile scan was created.DATASCAN
: The name of the data profile scan the job belongs to.--view=FULL
: To see the scan job result, specifyFULL
.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to view the results of a data profile scan.
View the most recent data profile scan job
Console
The Latest job results tab, when there is at least one successfully completed run, provides information about the latest job. It lists the scanned table's columns and statistics about the columns that were found in the scan.
gcloud
To view the most recent successful data profile scan, run the following command:
gcloud dataplex datascans describe DATASCAN \ --location=LOCATION \ --view=FULL
Replace the following variables:
DATASCAN
: The name of the data profile scan to view the most recent job for.LOCATION
: The Google Cloud region in which the data profile scan was created.--view=FULL
: To see the scan job result, specifyFULL
.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to view the most recent scan job.
View all data profile scan jobs
Dataplex saves the data profile scan history of the last 300 jobs or for the past year, whichever occurs first.
Console
The Jobs history tab provides information about past jobs. It lists all of the jobs, the number of records scanned in each job, the job status, job execution time, and more.
To view the detailed information about a job, click any of the jobs under Job ID.
gcloud
To view all jobs of a data profile scan, run the following command:
gcloud dataplex datascans jobs list \ --location=LOCATION \ --datascan=DATASCAN
Replace the following variables:
LOCATION
: The Google Cloud region in which the data profile scan was created.DATASCAN
: The name of the data profile scan to view all jobs for.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to view all scan jobs.
Share the published results
When creating a data profile scan, if you chose to publish the scan results in the BigQuery and Data Catalog pages in the Google Cloud console, then the latest scan results will be available in the Data profile tab in those pages.
You can enable the users in your organization to access the published scan results. To grant access to the scan results, follow these steps:
In the Google Cloud console, go to the Profile page.
Click the data profile scan you want to share the results of.
Go to the Permissions tab.
Click Grant access.
In the New principals field, add the principal to which you want to grant access.
In the Select a role field, select Dataplex DataScan DataViewer.
Click Save.
To remove access to the published scan results for a principal, follow these steps:
In the Google Cloud console, go to the Profile page.
Click the data profile scan you want to share the results of.
Go to the Permissions tab.
Select the principal for which you want to remove the Dataplex DataScan DataViewer role.
Click Remove access.
Click Confirm.
Update a data profile scan
Console
In the Google Cloud console, go to the Profile page.
In the row with the scan you'd like to edit, click > Edit.
Edit the values.
Click Save.
gcloud
To update a data profile scan, run the following command:
gcloud dataplex datascans update data-profile DATASCAN \ --location=LOCATION \ --description=DESCRIPTION
Replace the following variables:
DATASCAN
: The name of the data profile scan to update.LOCATION
: The Google Cloud region in which the data profile scan was created.DESCRIPTION
: The new description for the data profile scan.
For specification fields to update, see the gcloud CLI reference.
REST
Use the APIs Explorer to edit a data profile scan.
Delete a data profile scan
Console
In the Google Cloud console, go to the Profile page. Go to Dataplex Profile
Click the scan you want to delete.
Click Delete.
gcloud
To delete a data profile scan, run the following command:
gcloud dataplex datascans delete \ DATASCAN --location=LOCATION \ --async
Replace the following variables:
DATASCAN
: The name of the data profile scan to delete.LOCATION
: The Google Cloud region in which the data profile scan was created.
For optional arguments, see the gcloud CLI reference.
REST
Use the APIs Explorer to delete your data profile scan.
What's next?
- Learn about data profiling.
- Learn about auto data quality.
- Learn how to use auto data quality.