Building trust in the data with Dataplex
Sandeep Karmarkar
Product Manager, Google Cloud
Kanhu Badtia
Sr. Data Engineer, American Eagle Outfitters
Analytics data is growing exponentially, and so is the dependence on that data for critical business and product decisions. Indeed, the best decisions are said to be the ones backed by data. In data we trust!
But do we trust the data?
As data volumes have grown, one of the key challenges organizations face is maintaining data quality in a scalable, consistent way across the organization. Data quality is not a newfound need, but it used to be a contained one: when the data footprint was small and data consumers were few, consumers knew who the producers were, and producers knew what the consumers needed. Today, data ownership is increasingly distributed, and data consumption keeps finding new users and use cases. Existing data quality approaches thus find themselves limited, isolated in certain pockets of the organization. This often exposes data consumers to inconsistent and inaccurate data, which ultimately impacts the decisions made from that data. As a result, organizations today are losing tens of millions of dollars to low-quality data.
These organizations are looking for solutions that empower their data producers to consistently create high-quality data at cloud scale.
Building Trust with Dataplex data quality
Earlier this year at Google Cloud, we launched Dataplex, an intelligent data fabric that enables governance and data management across distributed data at scale. One of the key capabilities Dataplex provides out of the box is built-in data quality, which lets data producers build trust in their data.
The Dataplex data quality task delivers a declarative, DataOps-centric experience for validating data across BigQuery and Google Cloud Storage. Producers can easily build and publish quality reports, or include data validations as part of their data production pipelines. Reports can be aggregated across various data quality dimensions, and execution is entirely serverless.
The Dataplex data quality task provides:
A declarative approach for defining “what good looks like” that can be managed as part of a CI/CD workflow (see the YAML sketch after this list).
A serverless and managed execution with no infrastructure to provision.
Ability to validate across data quality dimensions like freshness, completeness, accuracy and validity.
Flexibility in execution - either by using Dataplex serverless scheduler (at no extra cost) or executing the data validations as part of a pipeline (e.g. Apache Airflow).
Incremental execution - so you save time and money by validating new data only.
Secure and performant execution with zero data copied out of BigQuery environments and projects.
Programmatic consumption of quality metrics for DataOps workflows.
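To make the declarative approach concrete, here is a minimal sketch of what a rule specification can look like, in the CloudDQ-style YAML the task consumes. All project, dataset, table, rule, and pattern values below are illustrative placeholders:

```yaml
# Illustrative CloudDQ-style spec; names, URIs, and the regex are placeholders.
rule_dimensions:
  - completeness
  - validity

rules:
  VALUE_NOT_NULL:
    rule_type: NOT_NULL
    dimension: completeness
  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: validity
    params:
      pattern: "^[0-9]{8}$"

row_filters:
  NONE:
    filter_sql_expr: "True"

rule_bindings:
  ITEM_ID_CHECKS:
    # Tables discovered by Dataplex can instead be referenced with a
    # dataplex:// entity URI.
    entity_uri: bigquery://projects/my-project/datasets/retail/tables/item_master
    column_id: item_id
    row_filter_id: NONE
    rule_ids:
      - VALUE_NOT_NULL
      - VALID_ITEM_ID
```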
Users can also run these checks on data that is stored in BigQuery and Google Cloud Storage but is not yet organized with Dataplex. For Google Cloud Storage data that is managed by Dataplex, Dataplex auto-detects and auto-creates tables for structured and semi-structured data; these tables can be referenced in the Dataplex data quality task as well.
Behind the scenes, Dataplex uses an open source data quality engine, the Cloud Data Quality Engine (CloudDQ), to run these checks. Providing an open platform is one of our key goals, and we have contributed to this engine so that it integrates seamlessly with Dataplex’s metadata and serverless environment.
You can learn more about this in our product documentation.
Building enterprise trust at American Eagle Outfitters
One of our enterprise customers, American Eagle Outfitters (AEO), is continuing to build trust in their critical data using the Dataplex data quality task. Kanhu Badtia, senior data engineer at AEO, shares their rationale and experience:
“AEO is a leading global specialty retailer offering high-quality & on-trend clothing under its American Eagle® and Aerie® brands. Our company operates stores in the United States, Canada, Mexico, and Hong Kong, and ships to 81 countries worldwide through its websites.
We are a data-driven organization that utilizes data from physical and digital storefronts, social media channels, logistics and delivery partners, and many other sources through established, compliant processes. We have a team of data scientists and analysts who create models, reports, and dashboards that inform responsible business decision-making on matters such as inventory, promotions, new product launches, and other internal business reviews. As the data engineering team at AEO, our goal is to provide highly trusted data for our internal data consumers.
Before Dataplex, AEO had methods for maintaining data quality that were effective for their purpose. However, those methods could not scale with the continual expansion of data volume and the demand for quality results from our data consumers. Internal data consumers identified and reported quality issues where ‘bad data’ was impacting business-critical dashboards and reports. As a result, our teams were often in “fire-fighting” mode, finding and fixing bad data. We were looking for a solution that would standardize and scale data quality across the production data pipelines.
The majority of AEO’s business data is in Google’s BigQuery or in Google Cloud Storage (GCS). When Dataplex launched its data quality capabilities, we immediately started a proof of concept. After a careful evaluation, we decided to use it as the central data quality framework for our production pipelines. We liked that:
- It provides an easy, flexible, declarative (YAML) way of defining data quality. We were able to parameterize it for use across multiple tables.
- It allows validating data in any BigQuery table with completely serverless, native execution using existing slot reservations.
- It allows executing these checks as part of ETL pipelines using the Dataplex Airflow operators. This is a huge win, as pipelines can now pause further processing if critical rules do not pass.
- Data quality checks are executed in parallel, which gives us the execution efficiency our pipelines require.
- Data quality results are stored centrally in BigQuery and can be queried to identify which rules failed or succeeded and how many rows failed. This enables defining custom thresholds for success.
- Organizing data in Dataplex Lakes is optional when using Dataplex data quality.
Our team truly believes that data quality is an integral part of any data-driven organization and Dataplex DQ capabilities align perfectly with that fundamental principle.
For example, here is a sample Google Cloud Composer / Airflow DAG that loads and validates the “item_master” table and stops downstream processing if the validation fails.
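(The sketch below is illustrative rather than AEO’s exact DAG; project, lake, table, and bucket names are placeholders, and the Dataplex task body is abbreviated.)

```python
from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCheckOperator,
    BigQueryInsertJobOperator,
)
from airflow.providers.google.cloud.operators.dataplex import (
    DataplexCreateTaskOperator,
)

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder
LAKE_ID = "retail-lake"     # placeholder

# Dataplex task body that runs the CloudDQ engine over YAML rule specs staged
# in Cloud Storage (paths are placeholders; see the Dataplex docs for the full
# data quality task configuration).
DQ_TASK_BODY = {
    "trigger_spec": {"type_": "ON_DEMAND"},
    "execution_spec": {
        "service_account": f"dq-runner@{PROJECT_ID}.iam.gserviceaccount.com",
    },
    "spark": {
        "python_script_file": "gs://my-bucket/clouddq_pyspark_driver.py",
        "file_uris": ["gs://my-bucket/clouddq-configs.zip"],
    },
}

with models.DAG(
    dag_id="item_master_load_and_validate",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 1) Load the day's data into item_master (placeholder load step).
    load_item_master = BigQueryInsertJobOperator(
        task_id="load_item_master",
        configuration={
            "query": {
                "query": "CALL retail.load_item_master()",
                "useLegacySql": False,
            }
        },
    )

    # 2) Trigger the Dataplex data quality task. In practice the task ID must
    # be unique per run, and a DataplexTaskStateSensor can be used to wait for
    # the job to finish before the gate below runs.
    validate_item_master = DataplexCreateTaskOperator(
        task_id="validate_item_master",
        project_id=PROJECT_ID,
        region=REGION,
        lake_id=LAKE_ID,
        dataplex_task_id="dq-item-master",
        body=DQ_TASK_BODY,
    )

    # 3) Gate: fail the pipeline if any rule reported failing rows.
    # BigQueryCheckOperator fails the task when the first row is falsy.
    check_critical_rules = BigQueryCheckOperator(
        task_id="check_critical_rules",
        sql="""
            SELECT COUNT(*) = 0
            FROM `my-project.dq_results.dq_summary`  -- placeholder results table
            WHERE failed_count > 0
        """,
        use_legacy_sql=False,
    )

    load_item_master >> validate_item_master >> check_critical_rules
```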
The rule specification includes simple rules for uniqueness and completeness, and more complex rules for referential integrity and business logic, such as checking daily price variance. We publish all data quality results centrally to a BigQuery table.
We query this output table for data quality issues and fail the pipeline in case of a critical rule failure. This stops low-quality data from flowing downstream.
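(As an illustrative sketch of that gate, not AEO’s production code: the summary table and its failed_count column below follow the CloudDQ-style output, and all names are placeholders.)

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# dq_summary is the central results table the data quality task writes to
# (placeholder name); failed_count comes from the CloudDQ-style output.
query = """
    SELECT rule_binding_id, rule_id, failed_count
    FROM `my-project.dq_results.dq_summary`
    WHERE failed_count > 0
"""

failures = list(client.query(query).result())
if failures:
    # Surface the failing rules, then stop downstream processing.
    for row in failures:
        print(row.rule_binding_id, row.rule_id, row.failed_count)
    raise RuntimeError(f"{len(failures)} critical data quality rule(s) failed")
```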
We now have a repeatable process for data validation that can be used across the key data production pipelines. It standardizes the data production process and effectively ensures that bad data doesn’t break downstream reports and analytics.”
Learn more
Here at Google, we are excited to enable our customers’ journey to high-quality, trusted data. To learn more about our current data quality capabilities, please refer to our product documentation.