Understand and trust data with Dataplex data lineage

March 13, 2023
George Verghese

Product Manager, Google Cloud

Today, we are excited to announce the general availability of Dataplex data lineage — a fully managed Dataplex capability that helps you understand how data is sourced and transformed within the organization. Dataplex data lineage automatically tracks data movement across BigQuery, BigLake, Cloud Data Fusion (Preview), and Cloud Composer (Preview), eliminating operational hassles around manual curation of lineage metadata. 

With rising data volume spread across data silos, it can be challenging for organizations to ensure users have a self-service mechanism to discover, understand and trust the data. Organizations constantly struggle with questions such as:

Is the data extracted from an authoritative source?

What is the impact if I drop this table?

The data in this table seems corrupted - where did this data come from, and when was it last refreshed?

How is sensitive information being moved or copied? Is it in adherence to data governance practices?

To answer these questions, organizations need to track how data is sourced and transformed, which can be complex and require significant effort.

Dataplex data lineage describes each lineage relationship, detailing what happened and when it happened, in an interactive lineage graph that provides data observability.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/Dataplex_mFWT6XE.gif

Data analysts who want to know whether a table originates from an authoritative source can now answer the question themselves by looking up the lineage of that table, which is available in Dataplex and in BigQuery for in-context analysis.

Data engineers can reduce the time needed to identify and resolve data issues by performing root cause analysis with the operational metadata trace that asserts each lineage relationship. Data lineage also aids deterministic change management: engineers can evaluate the impact of a change and collaborate with the relevant stakeholders to minimize any adverse impact.
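To make the two analyses concrete, here is a minimal sketch of impact analysis and root cause analysis as graph traversals. It does not call Dataplex. It assumes lineage has already been exported as a list of (source, target) pairs, and the table names and the helpers `downstream` and `upstream` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage links: each pair means "data flowed from source to target".
LINKS = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "marts.daily_revenue"),
    ("raw.customers", "marts.daily_revenue"),
    ("marts.daily_revenue", "reports.exec_dashboard"),
]


def downstream(links, start):
    """Impact analysis: every entity reachable from `start` along source -> target links."""
    children = defaultdict(set)
    for source, target in links:
        children[source].add(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            # Skip entities already visited, and the start itself if there is a cycle.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(links, start):
    """Root cause analysis: every entity that feeds `start`, found by reversing each link."""
    return downstream([(target, source) for source, target in links], start)


if __name__ == "__main__":
    print("Affected if staging.orders is dropped:", sorted(downstream(LINKS, "staging.orders")))
    print("Possible origins of marts.daily_revenue:", sorted(upstream(LINKS, "marts.daily_revenue")))
```

Walking downstream answers "what is the impact if I drop this table?", and walking upstream answers "where did this data come from?". The lineage graph in the console presents the same relationships visually.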

Finally, data lineage provides a map of data movement which can become the foundation for data governance practice. It enables data stewards and owners to evaluate and enforce adherence to governance requirements, especially when tracking the movement of sensitive information. 

Dataplex data lineage provides APIs for extensibility so that organizations can report lineage from various systems and have a single map of how data entries are related.
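As a rough illustration of reporting lineage from another system, the sketch below builds the JSON body for a custom lineage event using only the Python standard library. It sends nothing. The field names (`links`, `source`, `target`, `fullyQualifiedName`, `startTime`, `endTime`) and the `bigquery:` name format reflect our reading of the Data Lineage API's REST resources and should be checked against the current API reference. The project, dataset, and table names and both helper functions are hypothetical.

```python
import json
from datetime import datetime, timezone


def bigquery_fqn(project, dataset, table):
    """Build a fully qualified name for a BigQuery table (format assumed; verify in the docs)."""
    return f"bigquery:{project}.{dataset}.{table}"


def lineage_event_body(sources, target, start, end):
    """Return a dict shaped like a lineage event: one link per source, all pointing at `target`."""
    def rfc3339(moment):
        # Normalize to UTC and use the "Z" suffix for the timestamp string.
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {
        "links": [
            {
                "source": {"fullyQualifiedName": source},
                "target": {"fullyQualifiedName": target},
            }
            for source in sources
        ],
        "startTime": rfc3339(start),
        "endTime": rfc3339(end),
    }


if __name__ == "__main__":
    # Hypothetical job: two source tables were combined into one target table.
    body = lineage_event_body(
        sources=[
            bigquery_fqn("my-project", "raw", "orders"),
            bigquery_fqn("my-project", "raw", "customers"),
        ],
        target=bigquery_fqn("my-project", "marts", "daily_revenue"),
        start=datetime(2023, 3, 13, 12, 0, 0, tzinfo=timezone.utc),
        end=datetime(2023, 3, 13, 12, 5, 0, tzinfo=timezone.utc),
    )
    print(json.dumps(body, indent=2))
```

In a real integration, a body like this would be posted under a process and run that describe the job, using an authenticated client. Those steps are omitted here because they depend on your environment.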

What our customers are saying

L’Oréal, the world’s largest cosmetics company, is on a mission to ‘create the beauty that moves the world.’ “Dataplex data lineage helps us understand how data moves across our organization,” said Sébastien Morand, Head of Data Engineering team, L’Oréal. “As a fully managed solution, it becomes the main entry point to diagnose data issues and evaluate the impact of a change or incident — providing insight on what happened and when it happened, including reference to the execution metadata. Directly integrated into our beauty tech data platform, data lineage helps us reduce data issues and also enables us to mitigate issues faster when it does happen.” 

“At Wayfair, we treat data-as-a-product and are building a robust data platform that provides self-service access and compliance constructs,” said Vinit Rajopadhye, Associate Director on Data Infrastructure & Data Enablement at Wayfair. “We are excited about Dataplex data lineage as it helps our data consumers trust data based on where it originates and the transformations applied.”

Hurb is an online travel agency in Brazil with a mission to optimize travel through technology. “Hurb has a rapidly growing data platform, with new data assets created and registered daily to support business decision-making and Machine Learning models,” said Vinícius dos Santos Mello, Senior Data Engineer. “Thanks to Dataplex data lineage features, we have end-to-end data observability across data in BigQuery. We can proactively address schema changes, data quality issues, and asset depreciation that could otherwise negatively affect the business.”

“As a company with many business domains and services, we handle a large volume of data and use it to power our decision making, so it is crucial to ensure data quality. Dataplex data lineage provides a visual understanding of the flow of data across our organization, improving efficiency of impact investigations when problems occur and increasing the reliability of the data,” said Mitsunori Fukase, Data Platform Department Group Manager, DeNA.

Get started with Dataplex data lineage

You can get started with Dataplex data lineage by enabling the Data Lineage API on your project. You can learn more in the Dataplex documentation.

Additional Resources:

Dataplex data lineage labs

Quickstart - track lineage for BigQuery table copy
