Data lineage helps you track how data moves through your systems. You can see the origin, destinations, and transformations applied to a data asset.
You can view lineage information in the Google Cloud console for Dataplex Universal Catalog, BigQuery, and Vertex AI assets, or you can retrieve it by using the Data Lineage API.
Why you need data lineage
Large datasets often require transforming data into various formats for specific projects, such as text files, tables, reports, dashboards, and models.
For example, an online store might have a data pipeline with the following flow:
1. A Dataflow job reads raw purchase events from a Pub/Sub topic, product details from Cloud Storage files, and customer information from a BigQuery table. The job joins this information and creates a `purchases` table in BigQuery.
2. Subsequent BigQuery jobs transform the `purchases` table to create smaller, aggregated tables, such as `region` or `brand`, and calculate new columns, such as `total_profit`.
3. Analysts use these tables to generate reports and dashboards in Looker.
This common scenario can present several challenges:
- Data consumers lack a self-service method to verify whether data originates from an authoritative source.
- Data engineers struggle to find the root cause of issues because they can't reliably track all data transformations. For example, if an analyst finds an error in a `total_profit` column, tracing the error back to its origin is difficult.
- Data engineers and analysts can't fully assess the potential impact of modifying or deleting tables. For example, before deprecating a `product_id` column, they must identify all dependent downstream columns to avoid breaking reports.
- Data governors lack visibility into how sensitive data is used across the organization, making it difficult to ensure compliance with regulatory requirements.
Data lineage solves these problems by providing a clear, visual map of your data's journey. With data lineage, you can do the following:
- Understand how data is sourced and transformed by using lineage graphs.
- Trace errors in data entries and operations back to their root causes.
- Enable better change management through impact analysis to avoid downtime or unexpected errors, understand dependencies, and collaborate with stakeholders.
Data lineage workflow
The data lineage workflow includes the following steps:
- Data sources and ingestion: lineage information from your data sources initiates the entire process. For more information, see Lineage sources.
  - Google Cloud services: when the Data Lineage API is enabled, supported services like BigQuery and Dataflow automatically report lineage events whenever data is moved or transformed.
  - Custom sources: for any systems not automatically supported by Google Cloud integrations, you can use the Data Lineage API to manually record lineage information. We recommend importing events formatted according to the OpenLineage standard.
- Lineage platform: this central platform ingests, models, and stores all lineage data. For more information, see Lineage information model and granularity.
  - Data Lineage API: this API acts as the single entry point for all incoming lineage information. It uses a hierarchical data model consisting of three core concepts: process, run, and event.
  - Processing and storage: the platform processes incoming data and stores it in reliable, query-optimized databases.
- User experience: you can interact with the stored lineage information in two primary ways:
  - Visual exploration: in the Google Cloud console, a frontend service fetches and renders the lineage data as an interactive graph or list. This is supported for Dataplex Universal Catalog, BigQuery, and Vertex AI (for models, datasets, feature store views, and feature groups). This is ideal for visually exploring your data's journey. For more information, see Lineage views in the Google Cloud console.
  - Programmatic access: using an API client, you can directly communicate with the Data Lineage API to automate lineage management. This lets you write lineage information from custom sources. It also lets you read and query the stored lineage data for use in other applications or for building custom reports.
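The hierarchical model behind the API (a process owns runs, and each run owns lineage events) can be pictured as plain request bodies. The following Python sketch builds illustrative payload dictionaries; the field names mirror the shape of the Data Lineage API's REST resources, but every display name, timestamp, and fully qualified name here is hypothetical.

```python
# Illustrative sketch of the Data Lineage API's three core concepts:
# process -> run -> lineage event. All values below are made up.

def build_lineage_payloads():
    """Return example process, run, and event bodies as plain dicts."""
    process = {
        "displayName": "Nightly purchases aggregation",  # hypothetical name
        "attributes": {"owner": "data-eng"},             # free-form metadata
    }
    run = {
        "displayName": "2024-06-01 run",                 # hypothetical
        "state": "COMPLETED",
        "startTime": "2024-06-01T02:00:00Z",
        "endTime": "2024-06-01T02:15:00Z",
    }
    # An event links source and target entities by fully qualified name.
    event = {
        "startTime": "2024-06-01T02:00:00Z",
        "links": [
            {
                "source": {"fullyQualifiedName": "bigquery:proj.shop.purchases"},
                "target": {"fullyQualifiedName": "bigquery:proj.shop.region"},
            }
        ],
    }
    return process, run, event

process, run, event = build_lineage_payloads()
```

In a real integration, each body would be sent to the corresponding create call (process, then run, then lineage event) through an API client.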
Lineage sources
You can populate lineage information in Dataplex Universal Catalog in the following ways:
- Automatically from integrated Google Cloud services
- Manually, by using the Data Lineage API for custom sources
- By importing events from OpenLineage
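For the OpenLineage path, each imported event is a JSON document following the OpenLineage run-event schema (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`). A minimal sketch in Python, with hypothetical job and dataset names:

```python
import json
import uuid
from datetime import datetime, timezone

def make_openlineage_event():
    """Build a minimal OpenLineage run event as a JSON-serializable dict.
    Namespaces, names, and the producer URL are illustrative."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime(2024, 6, 1, 2, 15, tzinfo=timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "my-pipelines", "name": "purchases_aggregation"},
        "inputs": [{"namespace": "bigquery", "name": "proj.shop.purchases"}],
        "outputs": [{"namespace": "bigquery", "name": "proj.shop.region"}],
        "producer": "https://example.com/my-lineage-producer",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }

event = make_openlineage_event()
payload = json.dumps(event, indent=2)  # body to submit for import
```

Events in this shape, produced by an existing OpenLineage integration, can then be imported into Dataplex Universal Catalog as described in Integrate with OpenLineage.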
Automated data lineage tracking
When you enable the Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources.
BigQuery
When you enable data lineage in your BigQuery project, Dataplex Universal Catalog automatically records lineage information for the following:
New tables created as a result of the following BigQuery jobs:
- Copy jobs
- Load jobs that use a Cloud Storage URI
- Query jobs that use the following data definition language (DDL) in GoogleSQL:
Existing tables when you use the following data manipulation language (DML) statements in GoogleSQL:
- `INSERT SELECT` in relation to any of the listed table types
- `MERGE`
- `UPDATE`
- `DELETE`
BigQuery copy, query, and load jobs are represented as processes.
To view the process details, on the lineage graph, click the process.
Each process contains the BigQuery `job_id` in the attributes list for the most recent BigQuery job.
Other services
Data lineage supports integration with the following Google Cloud services:
Data lineage for custom data sources
You can use the Data Lineage API to manually record lineage information for any data source that isn't supported by the integrated systems.
Dataplex Universal Catalog can create lineage graphs for manually recorded lineage if you use a `fullyQualifiedName` that matches the fully qualified names of existing Dataplex Universal Catalog entries. If you want to record lineage for a custom data source, you must first create a custom entry.
Each process for a custom data source can contain a `sql` key in the attributes list. The value of this key is used to render a code highlight in the details panel of the data lineage graph. The SQL statement is displayed as it was provided. You are responsible for filtering out sensitive information. The key name `sql` is case-sensitive.
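As a sketch, a custom process body carrying the `sql` attribute might be assembled like this; the display name and statement are hypothetical, and the key must be the lowercase string `sql`:

```python
def build_custom_process(display_name: str, sql_text: str) -> dict:
    """Build a process body whose "sql" attribute is rendered as a code
    highlight in the lineage details panel."""
    # The key name is case-sensitive: "SQL" or "Sql" would be treated as
    # ordinary attributes. The statement is displayed verbatim, so strip
    # sensitive literals before recording it.
    return {
        "displayName": display_name,
        "attributes": {"sql": sql_text},
    }

process = build_custom_process(
    "Warehouse export",  # hypothetical process name
    "SELECT region, SUM(profit) AS total_profit FROM purchases GROUP BY region",
)
```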
OpenLineage
If you already use OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Dataplex Universal Catalog and view these events in the Google Cloud console. For more information, see Integrate with OpenLineage.
Limitations
The following are limitations for data lineage:
- All lineage information is retained in the system for only 30 days.
- Lineage information persists after you delete the related data source. For example, if you delete a BigQuery table, you can still view its lineage through the API and the console for up to 30 days.
Column-level lineage limitations
Column-level lineage has the following additional limitations:
- Column-level lineage isn't collected for BigQuery load jobs or for routines.
- Upstream column-level lineage isn't collected for external tables.
- Column-level lineage isn't collected if a job creates more than 1,500 column-level links. In these cases, only table-level lineage is collected.
- There is no API to create, read, update, delete, or search for column-level lineage.
- Support for partitioned tables is limited, because partitioning columns like `_PARTITIONDATE` and `_PARTITIONTIME` aren't recognized in the lineage graph.
- Console limitations:
  - The lineage graph traversal is limited to a depth of 20 levels and 10,000 links in each direction.
  - Column-level lineage is only fetched from the region where the root table is located. There is no support for cross-region lineage in the graph view.
Pricing
Dataplex Universal Catalog uses the premium processing SKU to charge for data lineage. For more information, see Pricing.
To separate data lineage charges from other charges in the Dataplex Universal Catalog premium processing SKU, on the Cloud Billing report, use the label `goog-dataplex-workload-type` with the value `LINEAGE`.

If you call the Data Lineage API with an `Origin` `sourceType` value other than `CUSTOM`, you incur additional costs.
What's next
Learn how to track data lineage for a BigQuery table copy and query jobs.
Learn how to use data lineage with Google Cloud systems.
Learn about lineage views in the Google Cloud console.
Explore the Data Lineage API.
For administrative information, see Lineage considerations and data lineage audit logging.