Data lineage systems for a data warehouse

Stay organized with collections Save and categorize content based on your preferences.

This document introduces technologies, related terminology, and use cases for data lineage. This document also describes an architecture for building a data lineage system for entities in a data warehouse. The data lineage system provides storage for and access to data lineage records and enables the data warehouse to automatically extract data lineage.

This document is intended for people who work in data governance and who manage data warehouses and big data systems.

This document is part of a series which helps you to understand data lineage concepts using BigQuery:

Data lineage overview

In a big data environment, operational systems feed fresh data into a data warehouse. Data warehouses act as a central repository for integrated data from one or more disparate sources, enabling analysts to build reports.

Traceability is defined as the ability to trace where the data in the data warehouse originated and how it flows through a business. Traceability helps enterprises to meet data governance objectives such as the following examples:

  • Putting safeguards on sensitive data to meet compliance and auditability standards such as GDPR legislation.
  • Enforcing access policies.
  • Performing root cause analysis on low-quality data.

Data lineage tells you where data originated. Data lineage also shows any transformations that have been applied to the data—for example, enriching source data by combining it with other data. Transformations can also be manipulations of data—for example, combining or changing data through mathematical operators.

Data lineage also establishes traceability, which defines the connection between different source and target entities in a data warehouse. A data source is the point where data is first read or retrieved for processing, such as tables in a database, files on a storage system, or a message bus or queue. Source entities in a data lineage record tell you where data came from. Target (or data destination) entities tell you where data is written after processing. Both source and target entities can both be defined as systems that store data temporarily or permanently.

Data lineage is also important for the management of data access controls. An effective way to gather information about data lineage for data governance is by harnessing the operational metadata generated by operations in the data warehouse.

Data lineage use cases

Data lineage can be useful in different contexts. The following list shows examples of how different kinds of users apply data lineage to their work:

  • Business users can use data lineage for the following activities:

    • Validating report fields by confirming data sources and transformations.
    • Performing data discovery, combining data lineage with a data glossary or catalog.
  • Data engineers can use data lineage for the following activities:

    • Identifying stale reports that could trigger upstream job failures.
    • Troubleshooting data quality issues in datasets by identifying sources to identify and fix any upstream issues.
  • Data governance users can use data lineage for the following activities:

    • Enforce data access control policies on sensitive data and data derivatives by identifying all downstream uses.
    • Identify data conflicts—for example, more than one process updating a dataset.
    • Remove unused or duplicated datasets for compliance and reduce storage.
    • Identify datasets and columns to improve storage performance or access scrutiny.
    • Monitor data quality over time.
    • Flag data exfiltration when data is exported outside of the data warehouse.

Data warehouse concepts

Conceptually, a data warehouse, as shown in the following diagram, is a system that provides large-scale storage and easy querying of data.

Flow of information to and from a data engine.

The functionality of a data warehouse as shown in the preceding diagram is supported by the following components:

  • A data engine. The data engine provides the core functionality of the data warehouse by integrating all of the components and interfacing with the storage system for durably persisting data.
  • A schema provider component. In a data warehouse, the schema is a description of data's structure—for example, the schema of a table in a relational database defines the names and types of fields in the database. The schema provider component supplies the data engine with information about existing entities in a database or data warehouse. Entities have the following properties:

    • Entities can be classed as tables, views, stored procedures, custom functions, and queries.
    • Entity information is the schema definition and contains column definitions such as name and type. Entity information also shows other metadata-like relationships between other entities. Most database and data warehouse providers can make this information available for use—for example, Information Schema for MySQL, Data Dictionary for Oracle or BigQuery API and Data Catalog for BigQuery.
    • Entities have user-enriched metadata—for example, column descriptions in BigQuery.
  • A grammar provider component. This component supplies the query language semantics. The language semantics encapsulate the database syntax rules that can then be used along with the schema for parsing a query. One such SQL2011 compliant parser is ZetaSQL.

  • An operation log. The operation log captures operations executed by the data system, such as queries, loads, and deletions in an audit trail. An audit trail can be a relevant chronological record, or a set of records that provide documented evidence of a sequence of activities and their details. In BigQuery, an operation log is also referred to as an audit log, or as a general query log in MySQL.

Data lineage system concepts

A data lineage system is either active or passive.

In an active system, developers need to program their data pipelines to explicitly provide source and transform information to the lineage system. The data lineage system acts as a repository for data lineage records and provides access through a visual interface and, optionally, an API.

An active data lineage system is useful for non-query language-based data operations, such as running an Apache Beam pipeline, or an Apache Spark job on Hadoop. One example of an active data lineage system is Apache Atlas, which has a rich set of APIs to ingest and access data.

Passive data lineage systems are more suitable for SQL-like operations, because these kinds of systems allow lineage information to be identified by parsing the SQL statements captured from an operation log. This functionality cuts down on time-consuming instrumentation work for analysts and developers, such as adding data lineage update calls to pipelines.

Passive data lineage systems are also more suitable for typical data warehouses, because they support SQL-based operations. Collibra Lineage is an example of SQL-based passive data lineage systems. When you use SaaS-based lineage solutions, metadata information might leave your computing environment perimeter for processing.

Data lineage granularity

Data lineage can be described at different levels of granularity. Some commonly used levels of granularity are explained in the following table:

Granularity level Description Implementation complexity Use cases
Entity Table of a relational database management system (RDBMS),
entity of a NoSQL database, or a topic in Pub/Sub
Low
  • Data governance
  • Dependency identification
Field or column Column of a table in an RDBMS, or the
field of a NoSQL entity
Medium
  • Data governance
  • Dependency identification
Record Actual data value held in the field of an RDBMS table or message in a Pub/Sub topic Very high
  • Transformation debugging
  • Data forensics

Data lineage granularity is determined by the use case for the data, for example, enterprises generally set the granularity level for data governance at field or column level.

You might be tempted to apply record-level granularity for maximum fidelity. However, the cost for storing lineage at this level of granularity can far exceed the benefits. In transactional databases, to recover data that is lost when an error occurs, transaction log replay is a better alternative. In data warehouses, snapshots are a better option.

Lineage data model

In a data lineage system, the lineage data model is represented as a tuple {I, T, O}. This tuple has the following properties:

  • I is the source or a set of inputs.
  • T is the transformation or operation.
  • O is the sink or output.

The source (I) is the data entity that is the data source and the sink (O) is the target of the data. The source is described with the following attributes:

  • The type (also referred to as kind) attribute describes the source type so that a user can identify the right set of data readers to use—for example, a BigQuery table, a Cloud Storage file or a Pub/Sub topic.
  • The identifier attribute describes the data location to identify the source and destination in the data system described by type or kind.

There are two ways to describe an entity in Google Cloud:

linkedResource describes the entity in the form of a fully qualified REST resource. A BigQuery table, for example, is described as follows: //bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}.

sqlResource describes the entity in the form of a fully qualified SQL table reference for use in Dataflow SQL to represent table and non-table entities, such as Pub/Sub. sqlResource is written as follows: {systemType}.{systemEntityType}.{entityIdentifier}.

For a BigQuery table, the sqlResource is written as follows: bigquery.tables.{project-id}.{dataset-id}.{table-id}

For a Pub/Sub topic, the sqlResource is written as follows: pubsub.topic.{project-id}.{topic-id}

The following table is an example of an entity in Google Cloud at column-level granularity. The example shows a BigQuery table titled UserDetails, with a column titled phone_number. Using the linkedresource format, the source or sink entity information is represented as follows:

Attribute Example value
Type (kind) BIGQUERY_TABLE
Identifier: entity //bigquery.googleapis.com/projects/my-project/dataset/admin_dataset/tables/UserDetails
Identifier: field or column phone_number

The transform instruction (T) records the processing steps that were used to manipulate the data source. The actual transform instruction varies by lineage granularity—for example, at the entity level, the transform instruction is the type of job that generated the output—for example, copying from a source table or querying a set of source tables. At the field level, the transform details are the SQL operations used to create the output column—for example, SUM, CONCAT, or HASH.

Minimum lineage data model

The minimum lineage data model to describe the data lineage tuple {I,T,O} for a data warehouse system is a combination of the following information components:

  • Reconcile time
  • Transform job information
  • Table-level lineage
  • Column-level lineage information

The following table describes the information components in detail.

Information type Information purpose Column Column definition
Lineage system operation The time at which the general query log was processed to extract lineage information reconcileTime The timestamp which was applied when the lineage system processed or reconciled a data warehouse operation that manipulated data. The information is expressed as an unsigned 64-bit integer representing the milliseconds since midnight (00:00) of January 1, 1970, in UTC.
Job information Processing system-related information to uniquely identify a data processing job jobId A unique ID from the processing system to identify a transformation or processing job.
Job information Processing system-related information to uniquely identify a data processing job jobTime The time in UTC that the system processed the transformation job.
Job information Processing system-related information to uniquely identify a data processing job actuator The user or account that requested the job execution.
Job information Processing system-related information to uniquely identify a data processing job transform instruction The transformation-related information captured by the processing system.
Lineage Information Definition of the data source, sink, and transforms at varying levels target The sink entity which is the destination of the transformation.
Lineage Information Definition of the data source, sink, and transforms at varying levels sources One or more entities used as the data-source for transformation.
Lineage Information Definition of the data source, sink, and transforms at varying levels transforms/ operation A set of operations performed during processing of the source data to create data in the target entity/field.

Reference architecture

This section describes a conceptual architecture for building a passive data lineage system for a SQL-like data warehouse. You can implement the architecture in different ways.

The conceptual Lineage system needs three logical components which are as follows:

  • An ingestion system to monitor the operation logs of the data warehouse for operations. The ingestion system monitors operation logs in a data warehouse or database system and retrieves the logs. The system also decouples the log monitoring function from the log consumer. Examples of this kind of systems are Debezium, a BinLog monitoring system, and Cloud Logging in Google Cloud.

  • A data lineage extraction engine to parse the log entry to extract lineage information such as the data source or any applied transformations. The lineage extraction engine identifies the source data system and parses the operation log using corresponding providers. Databases or data warehouses have their own query syntax with in-built functions and custom functions. Using a data system-specific grammar and schema provider enables correct parsing of an operation or executed query.

  • A lineage store to persist lineage data and make it available for analysis, governance, reporting, or other enterprise needs. A database can be used for this purpose with appropriate sharding or partitioning to make it easy to retrieve data.

Optionally, to enable downstream systems to use data lineage information in real-time operations, you can add a message bus to the architecture.

Data lineage architecture.

The lineage extraction process

The lineage extraction process is shown in the following diagram.

Flow of events in lineage extraction process.

The following events occur in the lineage extraction process as depicted in the preceding diagram:

  1. The lineage extraction process starts with the identification of source entities that are consumed to generate the output or target entity through the query. The identification is done by parsing the transform instructions—for example, parsing the SQL statement in a data warehouse system using the grammar provider.
  2. Parsing a query requires the schema information of the source entities of the query. The source entity schema is retrieved using the schema provider.
  3. The grammar provider is invoked to identify the relationship between output columns to the source columns and the list of functions and transformations applied to each output column.

What's next