About auto data quality


Dataplex auto data quality (AutoDQ) lets BigQuery users define and measure the quality of their data. You can get started by using the rule recommendations powered by Dataplex data profiling, or by building new rules in the Google Cloud console. Dataplex provides integrated monitoring, troubleshooting, and alerting through Cloud Logging.

Conceptual model

Data quality scan.

To measure the quality of a table with AutoDQ, you create a DataScan of type data quality. Each data scan is associated with exactly one table. To use AutoDQ, the BigQuery table (and its parent BigQuery dataset) must be part of a Dataplex data zone. The scan runs on resources in Google's tenant project, so you don't need to manage any infrastructure.

Creating and using a data quality scan can be broken down into four parts:

  • Rule definition
  • Rule execution
  • Monitoring and alerting
  • Troubleshooting

Rule definition

Data quality rules define expectations from the data. In Dataplex, rules are associated with a data quality scan.

You create data quality rules in the following ways:

  • Generate rule recommendations (from an existing data profiling scan)
  • Use predefined rules
  • Create custom SQL rules

Predefined rules

Dataplex currently supports two broad categories of predefined rules: row-level and aggregate.


For the row-level category of rules, the expectation is applied to every individual row in the data. Each row independently passes or fails the condition. For example, column_A_value < 1.

Row-level checks require you to specify a passing threshold. When the percentage of rows that pass the rule falls below the threshold value, the rule is considered failed.
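To illustrate the mechanics (this is a conceptual sketch, not the query Dataplex actually runs), a row-level check such as column_A_value < 1 effectively compares the share of passing rows against the threshold. In BigQuery SQL, the pass percentage could be computed as follows, where `my_table` and `column_A_value` are hypothetical names:

```sql
-- Hypothetical sketch: share of rows that satisfy a row-level rule.
-- The rule passes when this percentage meets the passing threshold
-- (for example, 95).
SELECT
  100 * COUNTIF(column_A_value < 1) / COUNT(*) AS pass_percentage
FROM
  `example_project.example_dataset.my_table`;
```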


For the aggregate category of rules, the expectation is applied to a single value aggregated across the entire data set. For example, Avg(someCol) >= 10. To pass, the check must evaluate to boolean true. Aggregate rules don't provide an independent pass or fail count for each row.

For both rule categories, you can set the following common parameters:

  • Column to which the rule applies.
  • Dimension, which must be from a set of predefined dimensions.

Dataplex supports the following rule types in the row-level and aggregate categories:

Range check (row-level)
Checks whether the value is between min and max. Supported column types: all numeric, date, and timestamp columns.
Required:
  • Passing threshold percentage.
  • min or max values: Specify at least one value.
Optional:
  • Enable strict min: If enabled, the rule check uses ">" instead of ">=".
  • Enable strict max: If enabled, the rule check uses "<" instead of "<=".
  • Enable ignore null: If enabled, null values are ignored in the rule check.

Null check (row-level)
Validates that column values are not NULL. Supported column types: all supported column types.
Required:
  • Passing threshold percentage.

Set check (row-level)
Checks whether the values in a column are one of the specified values in a set. Supported column types: all supported column types, except Record and Struct.
Required:
  • Set of string values to check against.
  • Passing threshold percentage.
Optional:
  • Enable ignore null: If enabled, null values are ignored in the rule check.
  • Enable invert condition: If enabled, the rule checks that the values are not from the set.

Regex check (row-level)
Checks the values against a specified regular expression. Supported column types: String.
Required:
  • Regex pattern to check against.
  • Passing threshold percentage.
Note: Google Standard SQL provides regular expression support using the re2 library; see that documentation for its regular expression syntax.

Uniqueness check (aggregate)
Checks whether all the values in a column are unique. Supported column types: all supported column types, except Record and Struct.
Required:
  • Column and dimension from the supported parameters.

Statistic check (aggregate)
Checks whether the given statistical measure matches the range expectation. Supported column types: all supported numeric column types.
Required:
  • mean, min, or max values: Specify at least one value.
Optional:
  • Enable strict min: If enabled, the rule check uses ">" instead of ">=".
  • Enable strict max: If enabled, the rule check uses "<" instead of "<=".
  • Enable ignore null: If enabled, null values are ignored in the rule check.
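As a point of reference for the regex check described above, Google Standard SQL matches re2 patterns with the REGEXP_CONTAINS function. The following is an illustrative sketch of the check's semantics, with a hypothetical table and column; it is not the query Dataplex generates:

```sql
-- Hypothetical sketch of regex-check semantics: rows whose value matches
-- the re2 pattern pass; compare the percentage against the passing threshold.
SELECT
  100 * COUNTIF(REGEXP_CONTAINS(product_code, r'^[A-Z]{2}[0-9]{4}$'))
    / COUNT(*) AS pass_percentage
FROM
  `example_project.example_dataset.products`;
```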

Supported custom SQL rule types

SQL rules provide the flexibility to expand validation with custom logic. These rules come in two types: row condition and aggregate SQL expression.

Row condition (row-level)
You specify an expectation for every row by defining a SQL expression in a WHERE clause. The SQL expression evaluates to true (PASS) or false (FAIL) per row. Dataplex computes the percentage of rows that pass this expectation and compares it against the passing threshold percentage to determine whether the rule succeeds or fails. The expression can include a reference to another table, for example, to create referential integrity checks.
Supported column types: all columns.
Required:
  • SQL condition to use.
  • Passing threshold percentage.
  • Dimension.
Optional:
  • Column to associate this rule with.
Example: grossWeight <= netWeight

Aggregate SQL expression (aggregate)
These rules run once per table. You provide a SQL expression that evaluates to boolean true (PASS) or false (FAIL). The SQL expression can include a reference to another table using expression subqueries.
Supported column types: all columns.
Required:
  • SQL condition to use.
Optional:
  • Column to associate this rule with.
Examples:
  • Simple aggregate: avg(price) > 100
  • Expression subquery comparing values across a different table: (SELECT COUNT(*) FROM `example_project.example_dataset.different-table`) < COUNT(*)
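A referential integrity check of the kind mentioned above can be written as a row condition with an expression subquery. The following sketch uses hypothetical table and column names; the expression is the rule's SQL condition, evaluated per row:

```sql
-- Hypothetical row condition: each customer_id in the scanned table
-- must also exist in the referenced customers table.
customer_id IN (
  SELECT customer_id
  FROM `example_project.example_dataset.customers`
)
```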


Dimensions

Dimensions let you aggregate the results of multiple data quality rules for monitoring and alerting. Every rule in Dataplex AutoDQ must be associated with a dimension. Dataplex supports the following dimensions:

  • Freshness
  • Volume
  • Completeness
  • Validity
  • Consistency
  • Accuracy
  • Uniqueness

Typed input in rules

All value parameters are passed to the API as strings. Dataplex requires these inputs to follow the BigQuery format for the corresponding type.

Binary-typed parameters can be passed as a base64-encoded string.

The following formats are supported for each type:

  • Binary: Base64-encoded value. Example: YXBwbGU=
  • Timestamp: YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]] [time_zone] or YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.F]][time_zone_offset]. Example: 2014-09-27 12:30:00.45-08
  • Date: YYYY-M[M]-D[D]. Example: 2014-09-27
  • Time: [H]H:[M]M:[S]S[.DDDDDD]. Example: 12:30:00.45
  • DateTime: YYYY-[M]M-[D]D [[H]H:[M]M:[S]S[.DDDDDD]]. Example: 2014-09-27 12:30:00.45

Rule execution

You can schedule a data quality scan to run at a defined frequency, or run it on demand through the API or the Google Cloud console.

When a data quality scan runs, it creates a job. As part of the data quality scan specification, you can set the scope of a job to either of the following:

Full table
Every job validates the entire table.

Incremental
Every job validates only incremental data. To determine increments, provide a Date or Timestamp column in the table that can be used as a marker. Typically, this is the column on which the table is partitioned.
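Conceptually (this is an illustration, not the service's actual query), an incremental job restricts validation to rows whose marker column falls within the new interval. With a hypothetical timestamp column `event_ts`:

```sql
-- Hypothetical sketch: incremental scope limits rule evaluation to rows
-- that arrived since the previous job's end marker.
SELECT *
FROM `example_project.example_dataset.events`
WHERE event_ts > TIMESTAMP('2014-09-27 00:00:00')  -- end marker of the previous job
  AND event_ts <= CURRENT_TIMESTAMP();
```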

Monitoring and alerting

Every successfully executed data quality job stores the following information for further monitoring:

  • Job execution details: Start time, end time, incremental start and end (in
    case of incremental scans).
  • Rules: The details of rules used for the execution of a job.
  • Rule level success or failure.
  • Dimension success or failure. Because a dimension is made up of multiple rules, any rule failure within a dimension fails the entire dimension.
  • Data quality job success or failure. Any failure of an evaluated dimension is recorded as a quality failure for the entire job.

This information is available through the API and Google Cloud console.

Alerting: Job-level and dimension-level failures are logged to Cloud Logging with additional information, such as the job ID. You can use these logs to set up alerts.

Troubleshooting failures

Data quality failures: For every failed rule (other than aggregate rules), Dataplex provides a query to get the failed records. This query returns all the columns of the table, not just the column that failed.
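For example, for a NULL check on a hypothetical column, the returned query might resemble the following (this is an illustrative shape, not the exact query Dataplex generates):

```sql
-- Hypothetical shape of a failed-records query for a NULL check on order_date:
-- all columns of the failing rows are returned, not only the checked column.
SELECT *
FROM `example_project.example_dataset.orders`
WHERE order_date IS NULL;
```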

Limitations (in Public Preview)

  • Data quality scan does not support data sampling.
  • The BigQuery table to be scanned must be part of a Dataplex lake.
  • Data quality results cannot be published to Data Catalog.
  • Rule recommendations are only supported in the Google Cloud console.
  • The data scanned per project per day is limited to 5 TB. This limit is across data profiling and data quality scans.
  • The BigQuery tables to be scanned must have 300 columns or fewer.
  • The choice of dimensions is fixed to one of the seven predefined dimensions.
  • The number of rules per data quality scan is limited to 1000.
  • BigQuery tables with the Require partition filter setting are not currently supported.


Dataplex data quality is currently free during Public Preview. Billing will be enabled during the Public Preview phase, with at least one month's advance notice. During Public Preview, publishing data quality results to Data Catalog is not available. When it becomes available, it will be charged at Data Catalog metadata storage pricing. See Pricing for more details.

What's next?