About data profiling

Dataplex data profiling lets you identify common statistical characteristics of the columns of your BigQuery tables. This information helps data consumers understand their data better, which makes it possible to analyze data more effectively.

Dataplex also uses this information to recommend rules for data quality.

Conceptual model

To understand the profile of your data, you create a data profiling scan in Dataplex.

Data profile scan.

A data profiling scan is associated with one BigQuery table and scans the table to generate the profiling results. You can configure the scan to run on the entire table or incrementally on the newer data, and you can choose to run the scan on a schedule or on demand. The results of the scan are available as part of every scan execution.

To use data profiling, both the BigQuery table and its parent BigQuery dataset must be part of a Dataplex data zone.

The data profiling results include the following values:

Column type Profile scan results
Numeric column
  • Percentage of null values
  • Percentage of unique (distinct) values
  • Top 10 most common values in the column, excluding null values (fewer than 10 if the column has fewer than 10 unique values). The percentage shown for each of these values is relative to the total number of rows scanned in the current scan.
  • Average, standard deviation, minimum, lower quartile, median, upper quartile, and maximum values.
String column
  • Percentage of null values
  • Percentage of unique (distinct) values
  • Top 10 most common values in the column (fewer than 10 if the column has fewer than 10 unique values).
  • Average, minimum, and maximum length of the string.
Other non-nested columns (date, time, timestamp, binary)
  • Percentage of null values
  • Percentage of unique (distinct) values
  • Top 10 most common values in the column (fewer than 10 if the column has fewer than 10 unique values).
All other nested or complex data-type columns (such as Record, Array, or JSON, or any column with its mode set to repeated)
  • Percentage of null values

The results also include the number of records scanned in every execution.
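
To make the numeric-column statistics above concrete, here is a minimal Python sketch that computes the same kinds of values for an in-memory list. It is an illustration only, not how Dataplex computes its results: the function name `profile_numeric_column` is hypothetical, `None` stands in for a null value, the distinct percentage is assumed to be measured against all rows scanned, and the choice of population standard deviation and the "inclusive" quartile method are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev, quantiles


def profile_numeric_column(values, top_n=10):
    """Illustrative profile of one numeric column; None represents NULL."""
    total = len(values)
    if total == 0:
        raise ValueError("cannot profile an empty column")

    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)  # null values are excluded from the top values

    profile = {
        "rows_scanned": total,
        "null_pct": 100.0 * (total - len(non_null)) / total,
        # Assumption: distinct values are measured against all rows scanned.
        "distinct_pct": 100.0 * len(counts) / total,
        # Each percentage is relative to the total number of rows scanned.
        "top_values": [
            (value, 100.0 * n / total) for value, n in counts.most_common(top_n)
        ],
    }

    # Quartiles need at least two data points.
    if len(non_null) >= 2:
        q1, median, q3 = quantiles(non_null, n=4, method="inclusive")
        profile.update(
            average=mean(non_null),
            stddev=pstdev(non_null),  # assumption: population standard deviation
            min=min(non_null),
            lower_quartile=q1,
            median=median,
            upper_quartile=q3,
            max=max(non_null),
        )
    return profile


if __name__ == "__main__":
    for key, value in profile_numeric_column([1, 2, 2, 3, None, 4, 4, 4]).items():
        print(f"{key}: {value}")
```

A string column could be handled the same way by replacing the numeric summary with the average, minimum, and maximum of `len(v)` over the non-null values.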

Limitations (in Public Preview)

  • Data profiling does not support data sampling.
  • The BigQuery table to be scanned must be part of a Dataplex lake.
  • Data profiling results cannot be published to Data Catalog.
  • Data profiling is supported for BigQuery tables with all column types except BIGNUMERIC. A scan created for a table with a BIGNUMERIC column will show a validation error and won't be successfully created.
  • Data scanned is currently limited to 5 TB per project per day. This limit is shared across data profiling and data quality scans.
  • The BigQuery tables to be scanned can have at most 300 columns.
  • The BigQuery tables to be scanned must have at least 100 rows.
  • BigQuery tables with the Require partition filter setting are not currently supported.
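
Several of the limitations above can be checked before you try to create a scan. The following Python sketch shows one way to do that against table metadata you have already fetched. The function `check_profiling_limits` and its parameters are hypothetical and not part of any Dataplex or BigQuery API. It covers only the column-type, column-count, row-count, and partition-filter limits listed here.

```python
MAX_COLUMNS = 300                   # tables can have at most 300 columns
MIN_ROWS = 100                      # tables must have at least 100 rows
UNSUPPORTED_TYPES = {"BIGNUMERIC"}  # column types that fail scan validation


def check_profiling_limits(column_types, row_count, require_partition_filter=False):
    """Return a list of problems; an empty list means no listed limit is violated.

    column_types maps column name -> BigQuery type name (hypothetical input shape).
    """
    problems = []

    if len(column_types) > MAX_COLUMNS:
        problems.append(
            f"table has {len(column_types)} columns; the limit is {MAX_COLUMNS}"
        )

    if row_count < MIN_ROWS:
        problems.append(f"table has {row_count} rows; at least {MIN_ROWS} are required")

    unsupported = sorted(
        name
        for name, type_name in column_types.items()
        if type_name.upper() in UNSUPPORTED_TYPES
    )
    if unsupported:
        problems.append("unsupported column types in: " + ", ".join(unsupported))

    if require_partition_filter:
        problems.append("tables with 'Require partition filter' are not supported")

    return problems


if __name__ == "__main__":
    schema = {"order_id": "INT64", "amount": "BIGNUMERIC"}
    for problem in check_profiling_limits(schema, row_count=50):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets a caller report every violated limit at once.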

Pricing

Dataplex data profiling is currently free during Public Preview. Data profiling will be charged under the Dataplex Premium Processing SKU. Publishing data profiling results to Data Catalog is not available in Public Preview. When it becomes available, it will be charged at Data Catalog metadata storage pricing. See Pricing for more details.

What's next?