Dataplex data quality tasks let you define and run data quality checks across tables in BigQuery and Cloud Storage, so you can apply regular data controls in BigQuery environments.
When to create Dataplex data quality tasks
- You want to validate data as part of the data production pipeline.
- You want to routinely monitor the quality of datasets against your expectations.
- You want to build data quality reports for regulatory requirements.
Dataplex data quality tasks provide several benefits.
- Highly flexible specification
- Dataplex provides a highly flexible YAML syntax for declaring your data quality rules.
- Entirely serverless execution
- Dataplex handles execution. You don't need to set up any infrastructure.
- Zero-copy and native push-down
- YAML checks are converted to SQL and pushed down to BigQuery, so no data is copied.
- Schedule data quality checks
- You can schedule data quality checks through Dataplex's serverless scheduler or use the Dataplex API through external schedulers like Cloud Composer for pipeline integration.
- Managed experience
- Dataplex uses an open source data quality engine, CloudDQ, to run data quality checks. However, Dataplex provides a one-click, managed experience for operationalizing your data quality checks.
How it works
The following diagram depicts how Dataplex data quality tasks work.
- Input from users
- YAML specification: A set of one or more YAML files, stored in a Cloud Storage bucket in your project, that define data quality rules according to the specification syntax. You can run multiple rules at a time, and those rules can be applied to different BigQuery tables, including tables across different datasets or Google Cloud projects. The syntax supports incremental execution, which validates only newer data. To create a YAML specification, see Create a specification file.
- BigQuery results table: A user-specified table where the data quality validation results are stored. The Google Cloud project in which this table resides can be different from the project in which the Dataplex data quality task is used.
- Tables to validate
- Within the YAML specification, you specify which tables to validate against which rules (known as a rule binding). The tables can be BigQuery native tables or BigQuery external tables in Cloud Storage. The YAML specification is flexible enough to reference tables inside or outside a Dataplex zone.
- BigQuery and Cloud Storage tables validated in a single execution can belong to different projects.
- Dataplex data quality task
- A Dataplex data quality task is configured with a prebuilt CloudDQ PySpark binary and takes the YAML specification and BigQuery results table as input. Like any other Dataplex task, the Dataplex data quality task runs in a serverless Spark environment, converts the YAML specification to BigQuery queries, and executes those queries against the tables specified in the specification file.
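The rule bindings described above are declared in the open source CloudDQ YAML syntax. The following is an illustrative sketch only: the project, dataset, table, and column names are placeholders, and the exact schema is defined by the CloudDQ specification.

```yaml
# Reusable rule definitions.
rules:
  VALUE_NOT_NULL:
    rule_type: NOT_NULL

# Row filters restrict which rows a binding validates; "True" means all rows.
row_filters:
  NONE:
    filter_sql_expr: "True"

# A rule binding ties a table and column to one or more rules.
rule_bindings:
  TRANSACTIONS_ID_NOT_NULL:
    # Placeholder table; it can be a BigQuery native table or an
    # external table over Cloud Storage, in any Google Cloud project.
    entity_uri: bigquery://projects/my-project/datasets/my_dataset/tables/transactions
    column_id: transaction_id
    row_filter_id: NONE
    rule_ids:
      - VALUE_NOT_NULL
```

At execution time, the task converts each rule binding into a BigQuery query and writes the validation outcome to the user-specified results table.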
When executing Dataplex data quality tasks, you are charged for BigQuery and Dataproc Serverless (Batches) usage.
The Dataplex data quality task converts the specification file to BigQuery queries and executes them in the user project. See BigQuery pricing.
There is no charge for using Dataplex to organize data and using Dataplex's serverless scheduler to schedule data quality checks.