Data quality tasks overview

Dataplex data quality tasks let you define and run data quality checks across tables in BigQuery and Cloud Storage, so that you can apply regular data controls in your BigQuery environments.

When to create Dataplex data quality tasks

  • You want to validate data as part of the data production pipeline.
  • You want to routinely monitor the quality of datasets against your expectations.
  • You want to build data quality reports for regulatory requirements.

Benefits

Dataplex data quality tasks provide the following benefits:

Highly flexible specification
Provides a flexible YAML syntax for declaring your data quality rules (see the sketch after this list).
Entirely serverless execution
Dataplex handles execution. You don't need to do any infrastructure setup.
Zero-copy and native push-down
YAML checks are converted to SQL and pushed down to BigQuery, so no data is copied.
Schedule data quality checks
You can schedule data quality checks with the Dataplex serverless scheduler, or trigger them through the Dataplex API from external schedulers like Cloud Composer for pipeline integration.
Managed experience
Dataplex uses an open source data quality engine, CloudDQ, to run data quality checks. However, Dataplex provides a one-click, managed experience for operationalizing your data quality checks.
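For illustration, the following is a minimal sketch of what a rule and a rule binding can look like in the CloudDQ YAML syntax. The rule, filter, project, dataset, table, and column names are placeholders, and exact field names can vary across CloudDQ versions:

    rules:
      VALUE_NOT_NULL:
        rule_type: NOT_NULL          # built-in CloudDQ rule type

    row_filters:
      NONE:
        filter_sql_expr: "True"      # validate every row

    rule_bindings:
      TRANSACTION_ID_NOT_NULL:
        # Placeholder table reference; datasets and projects can differ per binding.
        entity_uri: bigquery://projects/my-project/datasets/my_dataset/tables/transactions
        column_id: transaction_id
        row_filter_id: NONE
        rule_ids:
          - VALUE_NOT_NULL

When the task runs, a binding like this is compiled into a SQL query that executes inside BigQuery, which is what makes the zero-copy push-down possible.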

How it works

The following diagram depicts how Dataplex data quality tasks work.

[Diagram: Dataplex data quality task workflow]

Input from users
YAML specification: A set of one or more YAML files, stored in a Cloud Storage bucket in your project, that define data quality rules according to the specification syntax. You can run multiple rules at a time, and those rules can be applied to different BigQuery tables, including tables across different datasets or Google Cloud projects. The syntax supports incremental execution, so you can validate only newer data (see the sketch after these inputs). To create a YAML specification, see Create a specification file.
BigQuery result table: A user-specified table where the data quality validation results are stored. This table can reside in a different Google Cloud project than the one in which the Dataplex data quality task runs.
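As a sketch of incremental execution, a rule binding can name a timestamp column to use as a high-water mark, so that each run validates only rows added since the previous run. The binding below reuses the placeholder names from the earlier sketch, and the field shown reflects the CloudDQ syntax as we understand it:

    rule_bindings:
      NEW_TRANSACTIONS_NOT_NULL:
        entity_uri: bigquery://projects/my-project/datasets/my_dataset/tables/transactions
        column_id: transaction_id
        row_filter_id: NONE
        # Only rows whose created_at value is newer than the last successful
        # run's high-water mark are validated (placeholder column name).
        incremental_time_filter_column_id: created_at
        rule_ids:
          - VALUE_NOT_NULL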
Tables to validate
Within the YAML specification, you specify which tables you want to validate against which rules (known as a rule binding). The tables can be BigQuery native tables or BigQuery external tables in Cloud Storage. The specification is flexible enough to reference tables inside or outside of a Dataplex zone (see the sketch below).
BigQuery and Cloud Storage tables validated in a single execution can belong to different projects.
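For example, a specification can mix bindings that reference a BigQuery table directly with bindings that reference an entity registered in a Dataplex zone. The URI forms below are illustrative sketches (all names are placeholders), and the rule and row filter are assumed to be declared elsewhere in the specification:

    rule_bindings:
      ORDERS_ID_NOT_NULL:
        # BigQuery native table referenced directly; it can live in another project.
        entity_uri: bigquery://projects/other-project/datasets/sales/tables/orders
        column_id: order_id
        row_filter_id: NONE
        rule_ids:
          - VALUE_NOT_NULL
      RAW_EVENTS_ID_NOT_NULL:
        # Entity in a Dataplex zone, such as a Cloud Storage-backed external
        # table discovered by Dataplex.
        entity_uri: dataplex://projects/my-project/locations/us-central1/lakes/my-lake/zones/raw-zone/entities/events
        column_id: event_id
        row_filter_id: NONE
        rule_ids:
          - VALUE_NOT_NULL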
Dataplex data quality task
A Dataplex data quality task is configured with a prebuilt, Google-maintained CloudDQ PySpark binary and takes the YAML specification and the BigQuery result table as input. Like any other Dataplex task, the Dataplex data quality task runs in a serverless Spark environment, converts the YAML specification into BigQuery queries, and executes those queries against the tables specified in the specification file.

Costs

When executing Dataplex data quality tasks, you are charged for BigQuery and Dataproc Serverless (Batches) usage.

  • The Dataplex data quality task converts the specification file into BigQuery queries and executes them in the user project. See BigQuery pricing.

  • Dataplex uses Spark to run the prebuilt, Google-maintained, open source CloudDQ driver program, which converts the user specification into BigQuery queries. See Dataproc Serverless pricing.

There is no charge for using Dataplex to organize data, or for using the Dataplex serverless scheduler to schedule data quality checks.

What's next