Dataplex auto data quality (AutoDQ) lets BigQuery users to define and measure data quality. You can get started by using the rule recommendations powered by Dataplex data profiling, or by building new rules in the Google Cloud console. Dataplex provides integrated monitoring, troubleshooting, and Cloud Logging alerting.
For measuring the quality of a table through AutoDQ, you need to create a
DataScan of type data quality. One data scan is associated with one table
only. To use AutoDQ, the BigQuery table (and the
parent BigQuery dataset) must be a part of a
Dataplex data zone. The scan is executed on resources
in Google's tenant project, so you don't need to handle any infrastructure
Creating and using a data quality scan can be broken down into four parts:
- Rule definition
- Rule execution
- Monitoring and alerting
Data quality rules define expectations from the data. In Dataplex, rules are associated with a data quality scan.
You create data quality rules in the following ways:
- Generate rule recommendations (from an existing data profiling scan)
- Use predefined rules
- Create custom SQL rules
Dataplex currently supports the following two broad categories of predefined rules: row-level or aggregate.
For the row-level category of rules, the expectation is applied against every individual row in the data. Each row independently passes or fails for the condition. For example,
Row-level checks require users to specify a passing threshold. When the percentage of rows passing the rule falls below the threshold value, the rule is considered as failed.
For the aggregate rules category, the provided expectation is applied against a single value aggregated across the entire data. For example,
Avg (someCol) >= 10. To pass, this check should evaluate to boolean
true. Aggregate rules will not provide an independent pass or fail count for each row.
For both of these rule categories, Dataplex allows users to set the following common parameters:
- Column to which the rule applies.
- Dimension, which must be from a set of predefined dimensions.
The following table shows the different rule types that are supported in row-level or aggregate categories:
(Name in Google Cloud console)
|Row-level or aggregate rule||Description||Supported column types||Rule-specific parameters|
|Row-level||Check if the value is between min and max.||All numeric, date, and timestamp type columns.||Required:
|Row-level||Validate that column values are not NULL.||All supported column types.||Required:
Passing threshold percentage.
|Row-level||Check if the values in a column are one of the specified values in a set.||All supported column types, except
|Row-level||Check the values again a specified regular expression.||String||Required:|
|Aggregate||Check if all the values in a column are unique.||All supported column types, except
Column and dimension from the supported parameters.
|Aggregate||Check if the given statistical measure matches the range expectation.||All supported numeric column types.||Required:
Supported custom SQL rule types
SQL rules provide flexibility to expand the validation with custom logic. These rules come in two types.
|Rule type||Row-level or aggregate rule||Description||Supported column types||Rule-specific parameters||Example|
|Row condition||Row-level||You can specify an expectation from every row by defining a SQL
expression in a
Dataplex computes the percentage of rows that pass this expectation and compare that against the passing threshold percentage to determine the success or failure of the rule.
This can include a reference to another table, for example, to create referential integrity checks.
Optional: Column to associate this rule with.
|Aggregate SQL expression||Aggregate||These rules are executed once per table. User should provide a SQL expression that evaluates to boolean true (PASS) or false (FAIL)
The SQL can include reference to another table using expression subqueries.
SQL condition to use
Column to associate this rule with
|Simple aggregate example
avg(price) > 100
Using an expression subquery to compare values across a different table
Dimensions allow aggregation of results of multiple data quality rules for monitoring and alerting. Every rule in Dataplex AutoDQ must be associated with a dimension. Dataplex supports the following dimensions:
Typed input in rules
All value parameters are passed as string values to the API. Dataplex requires inputs to follow the BigQuery format specified.
Binary-typed parameters can be passed as a
|Binary||Base64 encoded value||YXBwbGU=|
|Timestamp||YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]] [time_zone]
OR YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time_zone_offset]
|DateTime||YYYY-[M]M-[D]D [[H]H:[M]M:[S]S[.DDDDDD]]||2014-09-27 12:30:00.45|
A data quality scan can be scheduled with a defined frequency or on-demand through the API or Google Cloud console.
When a data quality scan executes, it creates a job. As part of the specification of a data quality scan, you can specify the scope of a job to be either of the following:
- Full Table
- In this case, every job will validate the entire table.
- In this case, every job will validate incremental data. To
determine increments, provide a
Timestampcolumn in the table that can be used as a marker. Typically this is the column on which the table is partitioned.
Monitoring and alerting
Every successfully executed data quality job will store the following information for further monitoring:
- Job execution details: Start time, end time, incremental start and end (in
case of incremental scans).
- Rules: The details of rules used for the execution of a job.
- Rule level success or failure.
- Dimension success or failure. Since dimension is made up of multiple rules, any rule failure in a given dimension fails the entire dimension.
- Data quality job success or failure. Any failure of an evaluated dimension is recorded as the quality failure for the entire job.
This information is available through the API and Google Cloud console.
Alerting: Job-level and dimension-level failures are logged into Cloud Logging
with additional information such as
job-id. This can be used to set up further
Data quality failures: Every failed rule (other than aggregate rules) includes a query to get the failed records. This query returns all the columns of the table (not just the failed column).
Limitations (in Public Preview)
- Data quality scan does not support data sampling.
- The BigQuery table to be scanned must be part of a Dataplex lake.
- Data quality results cannot be published to Data Catalog.
- Rule recommendations are only supported in the Google Cloud console.
- The data scanned per project per day is limited to
5 TB. This limit is across data profiling and data quality scans.
- The BigQuery tables to be scanned can have 300 columns or less.
- The choice of dimensions is fixed to one of the pre-defined seven dimensions.
- The number of rules per data quality scan is limited to 1000.
- BigQuery tables with the
Requirepartition filter setting are not currently supported.
Dataplex data quality Public Preview is currently free. Billing will be enabled sometime during the Public Preview phase with at least a month of notice in advance. In Public Preview, publishing data quality results to Data Catalog is not currently available. When it becomes available, it will be charged with Data Catalog metadata storage pricing. See Pricing for more details.
- Learn how to use auto data quality.
- Learn about data profiling.
- Learn how to use data profiling.