About data lineage

Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Why do you need data lineage?

Dealing with big datasets often involves transforming data into entities tailored to the needs of a specific project: text files, tables, reports, dashboards, models.

For example, imagine you have an online store where you record every purchase in a single SQL table. To make it easier for your analysts to work with the data, you start running jobs that extract information from this single table and produce smaller tables by region, brand, or sale price. Your analysts then start doing the same: they perform further transformations, merging these smaller tables with other data sources to produce even more tables.

This can become a big challenge for your stakeholders:

  • Data consumers can't use a self-service tool to understand whether data comes from an authoritative source.
  • Data engineers can't root cause issues due to a lack of reliable way to track all the data transformations.
  • Data engineers and analysts can't fully assess possible impact before modifying or deleting tables.
  • Data governors can't understand how sensitive data is used throughout the organization and ensure adherence to regulatory requirements.

Data lineage is a solution that provides a practical way to:

  • Understand how data is sourced and transformed with the help of lineage graph visualizations.
  • Trace errors related to entries and data operations back to their root causes.
  • Enable better change management through impact analysis: avoid downtime or unexpected errors, understand dependent entries and collaborate with relevant stakeholders.

Lineage visualization graph

Lineage graphs represent information gathered by Data Lineage API for a particular Data Catalog entry:

The sample graph shows data from two tables being transformed then merged
  into a new table, with a details panel showing SQL code docked at the bottom.
Figure 1. Example of a lineage visualization graph in Dataplex UI.

Dataplex works with the Data Lineage API to identify entries whose fully qualified name matches entities recognized by data lineage. For matched Dataplex entries, you can access the Lineage tab on their details page and view the graph.

Lineage graphs display two types of elements:

  • Wide, rectangular buttons that represent entities involved in constructing lineage information as sources or targets of a lineage event.
  • Smaller, square buttons representing processes responsible for creating or updating the source or target entities. The process buttons use icons specific to the source system that reported them to Data Lineage API. For example, BigQuery jobs use the BigQuery lineage process icon. icon.

Data lineage information model

In its basic form, lineage is a record of data being transformed from sources to targets. Data Lineage API collects that information and organizes it into a hierarchical data model using the concepts of processes, runs, and events.

Process

A process is the definition of a data transformation operation supported for a specific system. In the context of BigQuery lineage, a process is one of the supported job types.

Run

A run is an execution of a process. Processes can have multiple runs. Runs contain details such as start and end times, state, or additional attributes. For more information, see the run resource reference.

Event

An event represents a point in time when a data transformation operation took place and resulted in data moving between a source and a target entity.

Events contain a list of links that define which entry was the source and which was the target in a particular event. While events are used to compute lineage visualization graphs, they are not directly exposed on the Google Cloud console. You can create, read, and delete (but not update) them using Data Lineage API.

Example

Consider the following example where data is copied between BigQuery tables:

Example extracts data from tables called customer_year and customers to derive a table called top_customer.
Figure 2. Example of a graph showing the sources of table data.

How data moves between the tables is described by the lineage process (represented on the graph by the BigQuery lineage process icon. icon): it could be a SQL CREATE TABLE AS SELECT query or an INSERT statement.

Each execution of that SQL statement would constitute an individual run. Runs contain events - these record which tables were used as the sources and which as the targets. In this example, the tables customer_year and customers are both the source for the target top_customer table.

Automated data lineage tracking

When you enable Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources. See the following sections for more details on every supported product.

BigQuery

Enabling data lineage in your BigQuery project causes Dataplex to automatically record lineage information for:

BigQuery copy, query, and load jobs are represented as processes (click the looking-glass icon on the lineage visualization graph to see process details). Each process contains the BigQuery job_id in the attributes list for the most recent BigQuery job.

Other services

Data lineage supports integration with the following Google Cloud services:

Data lineage for custom data sources

You can use the Data Lineage API in Dataplex to record lineage information manually for any data source that isn't supported by the integrated systems.

Dataplex can create visualization graphs for manually recorded lineage if you use a fullyQualifiedNames that match the fully qualified names of existing Data Catalog entries. If you want to record lineage for a custom data source, first create a custom Data Catalog entry.

Each process for custom data source may contain sql key in the attributes list. The value of such key will be used to render code highlight in details panel of the data lineage graph. SQL statement will be displayed as it was provided. The user is responsible for filtering out sensitive information. The key name sql is case-sensitive.

OpenLineage

If you're already using OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Dataplex and display these events in the Google Cloud console. For details, see Integrate with OpenLineage.

Current feature limitations

  • All lineage information is retained in the system for 30 days only.
  • Lineage information persists after you remove its related data source. That is, if you remove a BigQuery table and its Data Catalog entry, you can still read the lineage for that table using the API for up to 30 days.

Access data lineage

You can access data lineage functionalities using:

What's next