Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Why do you need data lineage?
Dealing with big datasets often involves transforming data into entities tailored to the needs of a specific project: text files, tables, reports, dashboards, models.
For example, imagine you have an online store where you record every purchase in a single SQL table. To make it easier for your analysts to work with the data, you start running jobs that extract information from this single table and produce smaller tables by region, brand, or sale price. Your analysts then start doing the same: they perform further transformations, merging these smaller tables with other data sources to produce even more tables.
This can become a big challenge for your stakeholders:
- Data consumers can't use a self-service tool to understand whether data comes from an authoritative source.
- Data engineers can't root cause issues due to a lack of reliable way to track all the data transformations.
- Data engineers and analysts can't fully assess possible impact before modifying or deleting tables.
- Data governors can't easily understand how sensitive data is used throughout the organization and ensure adherence to regulatory requirements.
Data lineage is a solution that provides a practical way to:
- Understand how data is sourced and transformed with the help of lineage graph visualizations.
- Trace errors related to entries and data operations back to their root causes.
- Enable better change management through impact analysis: avoid downtime or unexpected errors, understand dependent entries and collaborate with relevant stakeholders.
Lineage visualization graph
Lineage graphs represent information gathered by Data Lineage API for a particular Data Catalog entry:
Dataplex works with the Data Lineage API to identify entries whose fully qualified name matches entities recognized by data lineage. For matched Dataplex entries, you can access the Lineage tab on their details page and view the graph.
Lineage graphs display two types of elements:
- Wide, rectangular buttons that represent entities involved in constructing lineage information as sources or targets of a lineage event.
- Smaller, square buttons representing processes responsible for creating or updating the source or target entities. The process buttons use icons specific to the source system that reported them to Data Lineage API. For example, BigQuery jobs use the icon.
Data lineage information model
In its basic form, lineage is a record of data being transformed from sources to targets. Data Lineage API collects that information and organizes it into a hierarchical data model using the concepts of processes, runs, and events.
A process is the definition of a data transformation operation supported for
a specific system. In the context of BigQuery lineage,
process is one of the supported job types.
A run is an execution of a process. Processes can have multiple runs.
Runs contain details such as start and end times, state, or additional attributes.
For more information, see the
run resource reference.
An event represents a point in time when a data transformation operation took place and resulted in data moving between a source and a target entity.
Events contain a list of links that define which entry was the source and which was the target in a particular event. While events are used to compute lineage visualization graphs, they are not directly exposed on the Cloud Console. You can create, read, and delete (but not update) them via Data Lineage API.
Consider the following example where data is copied between BigQuery tables:
How data moves between the tables is described by the lineage process
(represented on the graph by the
icon): it could be a SQL
CREATE TABLE AS SELECT query or an
Each execution of that SQL statement would constitute an individual run.
Runs contain events - these record which tables were used as the sources and
which as the targets. In this example, the tables
customers are both the source
for the target
Automated data lineage tracking
When you enable Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources. See the following sections for more details on every supported product.
Enabling data lineage in your BigQuery project causes Dataplex to automatically record lineage information for:
- New tables as a result of the following BigQuery jobs:
- Copy jobs
- Load jobs that use the Cloud Storage URI to load data in any allowed format from Cloud Storage*
- Query jobs that use the following data definition language (DDL) in Google Standard SQL:
- Existing tables as a result of using following data manipulation language (DML) statements in Google Standard SQL:
BigQuery copy, query, and load jobs are represented as processes (click the looking-glass icon on the lineage visualization graph to see process details). Each process contains the BigQuery job_id in the attributes list for the most recent BigQuery job.
Data lineage supports integration with the following Google Cloud services:
Data lineage for custom data sources
You can use the Data Lineage API in Dataplex to record lineage information manually for any data source that isn't supported by the integrated systems.
Dataplex can create visualization graphs for manually recorded
lineage if you use a
fullyQualifiedNames that match the fully
qualified names of existing Data Catalog entries. If you want to record
lineage for a custom data source, first create a
custom Data Catalog entry.
Each process for custom data source may contain
sql key in the attributes list. The value of such key will be used to render code highlight in details panel of the data lineage graph. SQL statement will be displayed as it was provided. The user is responsible for filtering out sensitive information. The key name
sql is case-sensitive.
If you're already using OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Dataplex and display these events in the Google Cloud console. For details, see Integrate with OpenLineage.
Current feature limitations
- All lineage information is retained in the system for 30 days only.
- Lineage information persists after you remove its related data source. That is, if you remove a BigQuery table and its Data Catalog entry, you can still read the lineage for that table via API for up to 30 days.
Access data lineage
You can access data lineage functionalities using:
- Entry details pages in Dataplex UI in the Google Cloud console. See View lineage graphs.
- Table details page in BigQuery UI in the Google Cloud console. See View lineage graphs.
- Data Lineage API
To get started with data lineage, see Track data lineage for a BigQuery table copy and query jobs.