About auto data quality


Dataplex auto data quality (AutoDQ) lets BigQuery users define and measure the quality of their data. You can get started by using the rule recommendations powered by Dataplex data profiling, or by building new rules in the Google Cloud console. Dataplex provides integrated monitoring, troubleshooting, and alerting through Cloud Logging.

Conceptual model

Data quality scan.

To measure the quality of a table with AutoDQ, you create a DataScan of type data quality. Each data scan is associated with exactly one table. To use AutoDQ, the BigQuery table (and its parent BigQuery dataset) must be part of a Dataplex data zone. The scan runs on resources in Google's tenant project, so you don't need to manage any infrastructure.

Creating and using a data quality scan can be broken down into four parts:

  • Rule definition
  • Rule execution
  • Monitoring and alerting
  • Troubleshooting

Rule definition

Data quality rules define expectations from the data. In Dataplex, rules are associated with a data quality scan.

You create data quality rules in the following ways:

  • Generate rule recommendations (from an existing data profiling scan)
  • Use predefined rules
  • Create custom SQL rules

Predefined rules

Dataplex currently supports two broad categories of predefined rules: row-level and aggregate.


For the row-level category of rules, the expectation is applied to every individual row in the data. Each row independently passes or fails the condition. For example, column_A_value < 1.

Row-level checks require you to specify a passing threshold. When the percentage of rows that pass the rule falls below the threshold value, the rule is considered failed.
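To illustrate the mechanics (this is a conceptual sketch, not the query Dataplex actually runs), a row-level check such as column_A_value < 1 effectively compares the share of passing rows against the threshold. In BigQuery SQL, the pass percentage could be computed as follows, where `my_table` and `column_A_value` are hypothetical names:

```sql
-- Hypothetical sketch: share of rows that satisfy a row-level rule.
-- The rule passes when this percentage meets the passing threshold
-- (for example, 95).
SELECT
  100 * COUNTIF(column_A_value < 1) / COUNT(*) AS pass_percentage
FROM
  `example_project.example_dataset.my_table`;
```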


For the aggregate category of rules, the expectation is applied to a single value aggregated across the entire data set. For example, Avg(someCol) >= 10. To pass, the check must evaluate to boolean true. Aggregate rules don't provide an independent pass or fail count for each row.

For both rule categories, you can set the following common parameters:

  • Column to which the rule applies.
  • Dimension, which must be from a set of predefined dimensions.

Dataplex supports the following rule types in the row-level and aggregate categories:

Range check (row-level)
Checks whether the value is between min and max. Supported column types: all numeric, date, and timestamp columns.
Required:
  • Passing threshold percentage.
  • min or max values: Specify at least one value.
Optional:
  • Enable strict min: If enabled, the rule check uses ">" instead of ">=".
  • Enable strict max: If enabled, the rule check uses "<" instead of "<=".
  • Enable ignore null: If enabled, null values are ignored in the rule check.

Null check (row-level)
Validates that column values are not NULL. Supported column types: all supported column types.
Required:
  • Passing threshold percentage.

Set check (row-level)
Checks whether the values in a column are one of the specified values in a set. Supported column types: all supported column types, except Record and Struct.
Required:
  • Set of string values to check against.
  • Passing threshold percentage.
Optional:
  • Enable ignore null: If enabled, null values are ignored in the rule check.
  • Enable invert condition: If enabled, the rule checks that the values are not from the set.

Regex check (row-level)
Checks the values against a specified regular expression. Supported column types: String.
Required:
  • Regex pattern to check against.
  • Passing threshold percentage.
Note: Google Standard SQL provides regular expression support using the re2 library; see that documentation for its regular expression syntax.

Uniqueness check (aggregate)
Checks whether all the values in a column are unique. Supported column types: all supported column types, except Record and Struct.
Required:
  • Column and dimension from the supported parameters.

Statistic check (aggregate)
Checks whether the given statistical measure matches the range expectation. Supported column types: all supported numeric column types.
Required:
  • mean, min, or max values: Specify at least one value.
Optional:
  • Enable strict min: If enabled, the rule check uses ">" instead of ">=".
  • Enable strict max: If enabled, the rule check uses "<" instead of "<=".
  • Enable ignore null: If enabled, null values are ignored in the rule check.
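As a point of reference for the regex check described above, Google Standard SQL matches re2 patterns with the REGEXP_CONTAINS function. The following is an illustrative sketch of the check's semantics, with a hypothetical table and column; it is not the query Dataplex generates:

```sql
-- Hypothetical sketch of regex-check semantics: rows whose value matches
-- the re2 pattern pass; compare the percentage against the passing threshold.
SELECT
  100 * COUNTIF(REGEXP_CONTAINS(product_code, r'^[A-Z]{2}[0-9]{4}$'))
    / COUNT(*) AS pass_percentage
FROM
  `example_project.example_dataset.products`;
```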

Supported custom SQL rule types

SQL rules provide the flexibility to expand validation with custom logic. These rules come in two types: row condition and aggregate SQL expression.

Row condition (row-level)
You specify an expectation for every row by defining a SQL expression in a WHERE clause. The SQL expression evaluates to true (PASS) or false (FAIL) per row. Dataplex computes the percentage of rows that pass this expectation and compares it against the passing threshold percentage to determine whether the rule succeeds or fails. The expression can include a reference to another table, for example, to create referential integrity checks.
Supported column types: all columns.
Required:
  • SQL condition to use.
  • Passing threshold percentage.
  • Dimension.
Optional:
  • Column to associate this rule with.
Example: grossWeight <= netWeight

Aggregate SQL expression (aggregate)
These rules run once per table. You provide a SQL expression that evaluates to boolean true (PASS) or false (FAIL). The SQL expression can include a reference to another table using expression subqueries.
Supported column types: all columns.
Required:
  • SQL condition to use.
Optional:
  • Column to associate this rule with.
Examples:
  • Simple aggregate: avg(price) > 100
  • Expression subquery comparing values across a different table: (SELECT COUNT(*) FROM `example_project.example_dataset.different-table`) < COUNT(*)
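A referential integrity check of the kind mentioned above can be written as a row condition with an expression subquery. The following sketch uses hypothetical table and column names; the expression is the rule's SQL condition, evaluated per row:

```sql
-- Hypothetical row condition: each customer_id in the scanned table
-- must also exist in the referenced customers table.
customer_id IN (
  SELECT customer_id
  FROM `example_project.example_dataset.customers`
)
```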


Dimensions

Dimensions let you aggregate the results of multiple data quality rules for monitoring and alerting. Every rule in Dataplex AutoDQ must be associated with a dimension. Dataplex supports the following dimensions:

  • Freshness
  • Volume
  • Completeness
  • Validity
  • Consistency
  • Accuracy
  • Uniqueness

Typed input in rules

All value parameters are passed to the API as strings. Dataplex requires these inputs to follow the BigQuery format for the corresponding type.

Binary-typed parameters can be passed as a base64-encoded string.

The following formats are supported for each type:

  • Binary: Base64-encoded value. Example: YXBwbGU=
  • Timestamp: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]] [time_zone] or YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time_zone_offset]. Example: 2014-09-27 12:30:00.45-08
  • Date: YYYY-M[M]-D[D]. Example: 2014-09-27
  • Time: [H]H:[M]M:[S]S[.DDDDDD]. Example: 12:30:00.45
  • DateTime: YYYY-[M]M-[D]D [[H]H:[M]M:[S]S[.DDDDDD]]. Example: 2014-09-27 12:30:00.45

Rule execution

You can schedule a data quality scan to run at a defined frequency, or run it on demand through the API or the Google Cloud console.

When a data quality scan runs, it creates a job. As part of the data quality scan specification, you can set the scope of a job to either of the following:

Full table
Every job validates the entire table.

Incremental
Every job validates only incremental data. To determine increments, provide a Date or Timestamp column in the table that can be used as a marker. Typically, this is the column on which the table is partitioned.
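Conceptually (this is an illustration, not the service's actual query), an incremental job restricts validation to rows whose marker column falls within the new interval. With a hypothetical timestamp column `event_ts`:

```sql
-- Hypothetical sketch: incremental scope limits rule evaluation to rows
-- that arrived since the previous job's end marker.
SELECT *
FROM `example_project.example_dataset.events`
WHERE event_ts > TIMESTAMP('2014-09-27 00:00:00')  -- end marker of the previous job
  AND event_ts <= CURRENT_TIMESTAMP();
```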

Monitoring and alerting

Every successfully executed data quality job stores the following information for further monitoring:

  • Job execution details: Start time, end time, incremental start and end (in
    case of incremental scans).
  • Rules: The details of rules used for the execution of a job.
  • Rule level success or failure.
  • Dimension success or failure. Because a dimension is made up of multiple rules, any rule failure within a dimension fails the entire dimension.
  • Data quality job success or failure. Any failure of an evaluated dimension is recorded as a quality failure for the entire job.

This information is available through the API and Google Cloud console.

Alerting: Job-level and dimension-level failures are logged to Cloud Logging with additional information, such as the job ID. You can use these logs to set up alerts.

Troubleshooting failures

Data quality failures: For every failed rule (other than aggregate rules), Dataplex provides a query to get the failed records. This query returns all the columns of the table, not just the column that failed.
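For example, for a NULL check on a hypothetical column, the returned query might resemble the following (this is an illustrative shape, not the exact query Dataplex generates):

```sql
-- Hypothetical shape of a failed-records query for a NULL check on order_date:
-- all columns of the failing rows are returned, not only the checked column.
SELECT *
FROM `example_project.example_dataset.orders`
WHERE order_date IS NULL;
```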

Limitations (in Public Preview)

  • Data quality scan does not support data sampling.
  • The BigQuery table to be scanned must be part of a Dataplex lake.
  • Data quality results cannot be published to Data Catalog.
  • Rule recommendations are only supported in the Google Cloud console.
  • The data scanned per project per day is limited to 5 TB. This limit is across data profiling and data quality scans.
  • The BigQuery tables to be scanned must have 300 columns or fewer.
  • The choice of dimensions is fixed to one of the seven predefined dimensions.
  • The number of rules per data quality scan is limited to 1000.
  • BigQuery tables with the Require partition filter setting are not currently supported.


Dataplex data quality is currently free during Public Preview. Billing will be enabled during the Public Preview phase, with at least one month's advance notice. During Public Preview, publishing data quality results to Data Catalog is not available. When it becomes available, it will be charged at Data Catalog metadata storage pricing. See Pricing for more details.

What's next?