Dataplex data profiling lets you identify common statistical characteristics of the columns of your BigQuery tables. This information helps data consumers understand their data better, which makes it possible to analyze data more effectively.
Dataplex also uses this information to recommend rules for data quality.
Conceptual model
To understand the profile of your data, you create a data profiling scan in Dataplex.
A data profiling scan is associated with one BigQuery table and scans the table to generate the profiling results. You can configure the scan to run on the entire table or incrementally on the newer data, and you can choose to run the scan on a schedule or on demand. The results of the scan are available as part of every scan execution.
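The two choices described above (full or incremental scope, on-demand or scheduled execution) can be sketched as a request body for creating a scan. This is a minimal illustration, not the definitive API shape: the field names (`data.entity`, `executionSpec.trigger`, `executionSpec.field`, `dataProfileSpec`) are assumed to follow the Dataplex v1 `DataScan` REST resource and should be checked against the current API reference. The entity path, the column name `order_date`, and the helper `build_profile_scan_body` are hypothetical.

```python
import json


def build_profile_scan_body(entity, incremental_field=None, cron=None):
    """Build a request body for a data profiling scan (illustrative sketch)."""
    # Run on demand unless a cron schedule is supplied.
    trigger = {"schedule": {"cron": cron}} if cron else {"onDemand": {}}
    execution_spec = {"trigger": trigger}
    # An incremental scan needs a column that identifies the newer data.
    # Leaving it out means the entire table is scanned.
    if incremental_field:
        execution_spec["field"] = incremental_field
    return {
        "data": {"entity": entity},  # exactly one table per scan
        "executionSpec": execution_spec,
        "dataProfileSpec": {},
    }


# Hypothetical Dataplex entity path, for illustration only.
ENTITY = (
    "projects/my-project/locations/us-central1/lakes/my-lake/"
    "zones/my-zone/entities/orders"
)

# Entire table, run on demand.
full_on_demand = build_profile_scan_body(ENTITY)

# Newer data only, every day at 03:00.
incremental_daily = build_profile_scan_body(
    ENTITY, incremental_field="order_date", cron="0 3 * * *"
)
print(json.dumps(incremental_daily, indent=2))
```

Keeping scope and trigger as independent arguments mirrors the text: any scope can be combined with any execution mode.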
To use data profiling, the BigQuery table and its parent BigQuery dataset must be part of a Dataplex data zone.
The data profiling results include the following values:
Column type | Profile scan results |
---|---|
Numeric column | |
String column | |
Other non-nested columns (date, time, timestamp, binary) | |
All other nested or complex data-type columns (such as Record, Array, JSON, or any column with its mode set to **repeated**) | Percentage of null values. |
The results also include the number of records scanned in every execution.
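To make the two values named above concrete, the following sketch computes a record count and a percentage of null values for a single column held in memory. It only illustrates what the numbers mean; it is not how Dataplex computes them. The function `profile_column` and the sample data are hypothetical.

```python
def profile_column(values):
    """Return the record count and the percentage of null values."""
    records = len(values)
    nulls = sum(1 for value in values if value is None)
    # Guard against an empty column to avoid dividing by zero.
    null_percentage = 100.0 * nulls / records if records else 0.0
    return {"records_scanned": records, "null_percentage": null_percentage}


# Hypothetical sample of a Record column; None stands for NULL.
address = [
    {"city": "Oslo"},
    None,
    {"city": "Lima"},
    None,
    None,
    {"city": "Pune"},
    {"city": "Kyiv"},
    {"city": "Cork"},
]
result = profile_column(address)
print(result)  # {'records_scanned': 8, 'null_percentage': 37.5}
```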
Limitations (in Public Preview)
- Data profiling does not support data sampling.
- The BigQuery table to be scanned must be part of a Dataplex lake.
- Data profiling results cannot be published to Data Catalog.
- Data profiling is supported for BigQuery tables with all column types except BIGNUMERIC. A scan created for a table with a BIGNUMERIC column will show a validation error and won't be successfully created.
- Currently there is a limit of 5 TB on data scanned per project per day. This limit is across data profiling and data quality scans.
- The BigQuery tables to be scanned can have 300 columns or fewer.
- The BigQuery tables to be scanned must have at least 100 rows.
- BigQuery tables with the Require partition filter setting are not currently supported.
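The table-level limits above can be checked before you create a scan. The sketch below assumes you have already collected the table's column types, row count, and partition filter setting by some other means; `TableInfo` and `profiling_blockers` are hypothetical names, not part of any Dataplex or BigQuery library. It does not cover lake membership or the daily scan volume limit.

```python
from dataclasses import dataclass

MAX_COLUMNS = 300  # tables can have 300 columns or fewer
MIN_ROWS = 100     # tables must have at least 100 rows


@dataclass
class TableInfo:
    """Hypothetical summary of a BigQuery table's metadata."""
    column_types: dict  # column name -> BigQuery type name
    row_count: int
    require_partition_filter: bool = False


def profiling_blockers(table):
    """Return the reasons a table can't be profiled (empty list if none)."""
    problems = []
    bignumeric = sorted(
        name
        for name, col_type in table.column_types.items()
        if col_type.upper() == "BIGNUMERIC"
    )
    if bignumeric:
        problems.append("BIGNUMERIC columns: " + ", ".join(bignumeric))
    if len(table.column_types) > MAX_COLUMNS:
        problems.append(f"more than {MAX_COLUMNS} columns")
    if table.row_count < MIN_ROWS:
        problems.append(f"fewer than {MIN_ROWS} rows")
    if table.require_partition_filter:
        problems.append("Require partition filter is enabled")
    return problems


ok_table = TableInfo({"id": "INT64", "name": "STRING"}, row_count=500)
bad_table = TableInfo(
    {"amount": "BIGNUMERIC", "id": "INT64"},
    row_count=50,
    require_partition_filter=True,
)
print(profiling_blockers(ok_table))   # []
print(profiling_blockers(bad_table))
```

Returning every blocker at once, rather than stopping at the first, lets you fix a table in a single pass.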
Pricing
Dataplex data profiling is currently free while it is in Public Preview. Data profiling will be charged under the Dataplex Premium Processing SKU. In Public Preview, publishing data quality results to Data Catalog is not currently available. When it becomes available, it will be charged at the Data Catalog metadata storage rate. See Pricing for more details.
What's next?
- Learn how to use data profiling.
- Learn about auto data quality.
- Learn how to use auto data quality.