Use Dataplex data quality tasks with BigQuery
This document provides a conceptual overview of how to use BigQuery and Dataplex to run data quality tasks.
About data quality tasks with BigQuery
Dataplex lets you define, schedule, and run data quality checks on BigQuery tables. These tables can be BigQuery native tables, external tables, or BigLake tables on other clouds.
For directions on using Dataplex with BigQuery, see Create data quality tasks with Dataplex.
When to create Dataplex data quality tasks with BigQuery
Dataplex data quality tasks can help you with the following scenarios:
- Build data quality tools. Validate data as part of a data production pipeline.
- Maintain data quality management. Routinely monitor the quality of datasets against your expectations.
- Track data quality metrics. Build data quality reports for regulatory requirements.
Dataplex data quality tasks offer the following benefits:
- Customizable specifications. You can use the flexible YAML syntax to declare your data quality rules.
- Serverless implementation. Dataplex does not need any infrastructure setup.
- Zero-copy and automatic pushdown. YAML checks are converted to SQL and pushed down to BigQuery, resulting in no data copy.
- Schedulable data quality checks. You can schedule data quality checks through the serverless scheduler in Dataplex, or use the Dataplex API through external schedulers like Cloud Composer for pipeline integration.
- Managed experience. Dataplex runs data quality checks by using CloudDQ, an open source data quality engine, while providing a seamless managed experience for performing those checks.
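The zero-copy pushdown model described above can be illustrated with a small sketch: a declarative rule is compiled into a SQL check that runs where the data lives, so no rows are copied out. The rule shapes and the generated SQL below are illustrative only, not CloudDQ's actual rule format or query plan.

```python
# Illustrative sketch only: shows how a declarative quality rule can be
# "pushed down" by compiling it to SQL that runs in the warehouse,
# instead of copying data out. This is NOT CloudDQ's real rule format.

def rule_to_sql(table: str, column: str, rule_type: str) -> str:
    """Compile a simple declarative rule into a SQL validation query."""
    predicates = {
        # Rows violate NOT_NULL when the column is NULL.
        "NOT_NULL": f"{column} IS NULL",
        # Rows violate NOT_BLANK when the trimmed value is empty.
        "NOT_BLANK": f"TRIM({column}) = ''",
    }
    predicate = predicates[rule_type]
    # Count violations in place; no data leaves the warehouse.
    return f"SELECT COUNT(*) AS failed_count FROM `{table}` WHERE {predicate}"

print(rule_to_sql("my_project.my_dataset.orders", "order_id", "NOT_NULL"))
```

Because the check is expressed as a query, BigQuery's own engine does the scanning and aggregation; the task only collects the resulting counts.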
How it works
Dataplex data quality tasks work as follows:
- Input from users
- YAML specification: A set of one or more YAML files that define data quality rules based on the specification syntax. You store the YAML files in a Cloud Storage bucket in your project. Users can run multiple rules simultaneously, and those rules can be applied to different BigQuery tables, including tables across different datasets or Google Cloud projects. The specification supports incremental runs, which validate only new data. To create a YAML specification, see Create a specification file.
- BigQuery result table: A user-specified table where the data quality validation results are stored. The Google Cloud project in which this table resides can be a different project than the one in which the Dataplex data quality task is used.
- Tables to validate
- Within the YAML specification, you specify which rules to validate against which tables; this mapping is known as a rule binding. The tables can be BigQuery native tables or BigQuery external tables in Cloud Storage. The YAML specification lets you specify tables inside or outside a Dataplex zone.
- BigQuery and Cloud Storage tables that are validated in a single run can belong to different projects.
- Dataplex data quality task: A Dataplex data quality task is configured with a prebuilt, maintained CloudDQ PySpark binary and takes the YAML specification and the BigQuery result table as inputs. Similar to other Dataplex tasks, the Dataplex data quality task runs on a serverless Spark environment, converts the YAML specification to BigQuery queries, and then runs those queries on the tables that are defined in the specification file.
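To make the inputs above concrete, a minimal specification file might look like the following sketch. The project, dataset, table, and rule names are placeholders, and the keys follow the general shape of CloudDQ rule bindings; see the specification syntax documentation for the authoritative format.

```yaml
# Illustrative sketch of a specification file; all names are placeholders.
rules:
  VALUE_NOT_NULL:
    rule_type: NOT_NULL

row_filters:
  NONE:
    filter_sql_expr: "True"

rule_bindings:
  ORDER_ID_VALID:
    entity_uri: bigquery://projects/my-project/datasets/sales/tables/orders
    column_id: order_id
    row_filter_id: NONE
    rule_ids:
      - VALUE_NOT_NULL
```

Each rule binding ties one column of one table to a set of rules, which is why a single run can validate many tables across projects.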
When you run Dataplex data quality tasks, you are charged for BigQuery and Dataproc Serverless (Batches) usage.
The Dataplex data quality task converts the specification file to BigQuery queries and runs them in the user project. See BigQuery pricing.
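After results land in the BigQuery result table, downstream reporting can aggregate them, for example into the data quality reports mentioned earlier. The sketch below assumes hypothetical result rows with `rule_binding_id`, `success_count`, and `failed_count` fields; the actual result-table schema is defined by CloudDQ.

```python
# Illustrative sketch: aggregate validation results into a pass-rate
# report. The row fields used here are assumptions, not the real
# CloudDQ result-table schema.

def pass_rates(rows):
    """Return {rule_binding_id: pass_rate} from validation result rows."""
    rates = {}
    for row in rows:
        total = row["success_count"] + row["failed_count"]
        # An empty result counts as fully passing.
        rates[row["rule_binding_id"]] = row["success_count"] / total if total else 1.0
    return rates

results = [
    {"rule_binding_id": "ORDER_ID_VALID", "success_count": 98, "failed_count": 2},
    {"rule_binding_id": "AMOUNT_POSITIVE", "success_count": 100, "failed_count": 0},
]
print(pass_rates(results))  # {'ORDER_ID_VALID': 0.98, 'AMOUNT_POSITIVE': 1.0}
```

In practice you would read these rows from the result table with a BigQuery query rather than hard-coding them.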
There are no charges for using Dataplex to organize data or using the serverless scheduler in Dataplex to schedule data quality checks. See Dataplex pricing.
What's next
- Learn how to create data quality tasks with Dataplex.