Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Why do you need data lineage?
Dealing with big datasets often involves transforming data into entities tailored to the needs of a specific project: text files, tables, reports, dashboards, models.
For example, imagine you have an online store where you record every purchase in a single SQL table. To make it easier for your analysts to work with the data, you start running jobs that extract information from this single table and produce smaller tables by region, brand, or sale price. Your analysts then start doing the same: they perform further transformations, merging these smaller tables with other data sources to produce even more tables.
This can become a big challenge for your stakeholders:
- Data consumers can't use a self-service tool to understand whether data comes from an authoritative source.
- Data engineers can't root cause issues due to a lack of reliable way to track all the data transformations.
- Data engineers and analysts can't fully assess possible impact before modifying or deleting tables.
- Data governors can't easily understand how sensitive data is used throughout the organization and ensure adherence to regulatory requirements.
Data lineage is a solution that provides a practical way to:
- Understand how data is sourced and transformed with the help of lineage graph visualizations.
- Trace errors related to entries and data operations back to their root causes.
- Enable better change management through impact analysis: avoid downtime or unexpected errors, understand dependent entries and collaborate with relevant stakeholders.
Lineage visualization graph
Lineage graphs represent information gathered by Data Lineage API for a particular Data Catalog entry:

Dataplex works with the Data Lineage API to identify entries whose fully qualified name matches entities recognized by data lineage. For matched Dataplex entries, you can access the Lineage tab on their details page and view the graph.
Lineage graphs display two types of elements:
- Wide, rectangular buttons that represent entities involved in constructing lineage information as sources or targets of a lineage event.
- Smaller, square buttons representing processes responsible for creating or updating
the source or target entities. The process buttons use icons specific to the source system that
reported them to Data Lineage API. For example, BigQuery jobs
use the
icon.
Data lineage information model
In its basic form, lineage is a record of data being transformed from sources to targets. Data Lineage API collects that information and organizes it into a hierarchical data model using the concepts of processes, runs, and events.
Process
A process is the definition of a data transformation operation supported for
a specific system. In the context of BigQuery lineage,
a process
is one of the supported job types.
Run
A run is an execution of a process. Processes can have multiple runs.
Runs contain details such as start and end times, state, or additional attributes.
For more information, see the
run
resource reference.
Event
An event represents a point in time when a data transformation operation took place and resulted in data moving between a source and a target entity.
Events contain a list of links that define which entry was the source and which was the target in a particular event. While events are used to compute lineage visualization graphs, they are not directly exposed on the Cloud Console. You can create, read, and delete (but not update) them via Data Lineage API.
Example
Consider the following example where data is copied between BigQuery tables:

How data moves between the tables is described by the lineage process
(represented on the graph by the
icon): it could be a SQL
CREATE TABLE AS SELECT
query or an INSERT
statement.
Each execution of that SQL statement would constitute an individual run.
Runs contain events - these record which tables were used as the sources and
which as the targets. In this example, the tables
customer_year
and customers
are both the source
for the target top_customer
table.
Automated data lineage tracking
When you enable Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources. See the following sections for more details on every supported product.
BigQuery
Enabling data lineage in your BigQuery project causes Dataplex to automatically record lineage information for:
- New tables as a result of the following BigQuery jobs:
- Copy jobs
- Load jobs that use the Cloud Storage URI to load data in any allowed format from Cloud Storage*
- Query jobs that use the following data definition language (DDL) in Google Standard SQL:
- Existing tables as a result of using following data manipulation
language (DML) statements in Google Standard SQL:
- SELECT in relation to any of the listed table types:
- INSERT SELECT
- MERGE
- UPDATE
- DELETE
BigQuery copy, query, and load jobs are represented
as processes (click the looking-glass icon
on the lineage visualization graph to see process
details). Each process contains the BigQuery job_id in the
attributes
list for the most recent BigQuery job.
Other services
Data lineage supports integration with the following Google Cloud services:
Data lineage for custom data sources
You can use the Data Lineage API in Dataplex to record lineage information manually for any data source that isn't supported by the integrated systems.
Dataplex can create visualization graphs for manually recorded
lineage if you use a
fullyQualifiedNames
that match the fully
qualified names of existing Data Catalog entries. If you want to record
lineage for a custom data source, first create a
custom Data Catalog entry.
Each process for custom data source may contain sql
key in the attributes list. The value of such key will be used to render code highlight in details panel of the data lineage graph. SQL statement will be displayed as it was provided. The user is responsible for filtering out sensitive information. The key name sql
is case-sensitive.
OpenLineage
If you're already using OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Dataplex and display these events in the Google Cloud console. For details, see Integrate with OpenLineage.
Current feature limitations
Retention rate
- All lineage information is retained in the system for 30 days only.
- Lineage information persists after you remove its related data source. That is, if you remove a BigQuery table and its Data Catalog entry, you can still read the lineage for that table via API for up to 30 days.
Access data lineage
You can access data lineage functionalities using:
- Entry details pages in Dataplex UI in the Google Cloud console. See View lineage graphs.
- Table details page in BigQuery UI in the Google Cloud console. See View lineage graphs.
- Data Lineage API
What's next
To get started with data lineage, see Track data lineage for a BigQuery table copy and query jobs.
For administrative information, see updated IAM sections, lineage considerations, and data lineage audit logging.