Dataplex data profiling lets you identify common statistical characteristics of the columns in your BigQuery tables. This information helps you to understand and analyze your data more effectively.
Information like typical data values, data distribution, and null counts can accelerate analysis. When combined with data classification, data profiling can detect data classes or sensitive information that, in turn, can enable access control policies.
Dataplex also uses this information to recommend rules for data quality checks.
Conceptual model
Dataplex lets you better understand the profile of your data by creating a data profiling scan.
The following diagram shows how Dataplex scans data to report on statistical characteristics.
A data profiling scan is associated with one BigQuery table and scans the table to generate the data profiling results. A data profiling scan supports several configuration options.
Configuration options
This section describes the configuration options available for running data profiling scans.
Scheduling options
You can schedule a data profiling scan with a defined frequency or on demand through the API or the Google Cloud console.
Scope
As part of the specification of a data profiling scan, you can specify the scope of a job as one of the following options:
Full table: The entire table is scanned in the data profiling scan. Sampling, row filters, and column filters are applied on the entire table before calculating the profiling statistics.
Incremental: Incremental data that you specify is scanned in the data profile scan. Specify a
Date
orTimestamp
column in the table to be used as an increment. Typically, this is the column on which the table is partitioned. Sampling, row filters, and column filters are applied on the incremental data before calculating the profiling statistics.
Filter data
You can filter data to be scanned for profiling by using row filters and column filters. Using filters helps you reduce the execution time and cost, and exclude sensitive and unuseful data.
Row filters: Row filters let you focus on data within a specific time period or from a specific segment, such as region. For example, you can filter out data with a timestamp before a certain date.
Column filters: Column filters lets you include and exclude specific columns from your table to run the data profiling scan.
Sample data
Dataplex lets you specify a percentage of records from your data to sample for running a data profiling scan. Creating data profiling scans on a smaller sample of data can reduce the execution time and cost of querying the entire dataset.
Multiple data profiling scans
Dataplex lets you create multiple data profiling scans at a time using the Google Cloud console. You can select up to 100 tables from one dataset and create a data profiling scan for each dataset. Learn more.
Export scan results to BigQuery table
You can export the data profiling scan results to a BigQuery table for further analysis. To customize reporting, you can connect the BigQuery table data to a Looker dashboard. You can build an aggregated report by using the same results table across multiple scans.
Data profiling results
The data profiling results include the following values:
Column type | Data profiling results |
---|---|
Numeric column |
|
String column |
|
Other non-nested columns (date, time, timestamp, binary, etc.) |
|
All other nested or complex data-type columns (such as Record, Array, JSON) or any column with repeated mode. |
|
The results include the number of records scanned in every execution.
Reporting and monitoring
You can monitor and analyze the data profiling results using the following reports and methods:
Reports published with the source table in the BigQuery and Data Catalog pages
If you have configured a data profiling scan to publish the results in the BigQuery and Data Catalog pages in the Google Cloud console, then you can view the latest data profiling scan results in these pages in the Data Profile tab, from any project.
Historical, per job report in Dataplex
On the Dataplex Profile page, you can view the detailed reports for the latest and historical jobs. This includes column-level profile information and the configuration that was used.
Analysis tab
On the Dataplex Profile page, you can use the Analysis tab to view the trends for a given statistic of a column over multiple profile jobs. For example, if you have an incremental scan, you can view how the average of a value has been trending over time.
Build your own dashboard or analytics
If you have configured a data profiling scan to export or save results to a BigQuery table, then you can build your own dashboards using tools, such as Looker Studio.
Limitations
- Data profiling results aren't published to Data Catalog as tags.
- Data profiling is supported for BigQuery tables with all column
types except
BIGNUMERIC
. A scan created for a table with aBIGNUMERIC
column results in a validation error and isn't successfully created. - The BigQuery tables to be scanned must have 300 columns or less.
Pricing
Dataplex uses the premium processing SKU to charge for data profiling. For more information, see Pricing.
Publishing data profiling results to Data Catalog isn't yet available. When it becomes available, it will be charged at the same rate as Catalog metadata storage pricing. For more information, see Pricing.
Dataplex premium processing for data profiling is billed per second with a one-minute minimum.
You aren't charged for failed profiling scans.
The charge depends on the number of rows, numbers of columns, the amount of data scanned, partitioning and clustering settings on the table, and the frequency of the scan.
There are several options to reduce the cost of data profiling scans:
- Sampling
- Incremental scans
- Column filtering
- Row filtering
To separate data profiling charges from other charges in Dataplex premium processing SKU, on the Cloud Billing report, use the label
goog-dataplex-workload-type
with valueDATA_PROFILE
.To filter aggregate charges, use the following labels:
goog-dataplex-datascan-data-source-dataplex-entity
goog-dataplex-datascan-data-source-dataplex-lake
goog-dataplex-datascan-data-source-dataplex-zone
goog-dataplex-datascan-data-source-project
goog-dataplex-datascan-data-source-region
goog-dataplex-datascan-id
goog-dataplex-datascan-job-id
What's next?
- Learn how to use data profiling.
- Learn about auto data quality.
- Learn how to use auto data quality.