About data lineage

Data lineage helps you track how data moves through your systems. You can see the origin, destinations, and transformations applied to a data asset.

You can view lineage information in the Google Cloud console for Dataplex Universal Catalog, BigQuery, and Vertex AI assets, or you can retrieve it by using the Data Lineage API.

Why you need data lineage

Working with large datasets often means transforming data into many different forms for specific projects, such as text files, tables, reports, dashboards, and models.

For example, an online store might have a data pipeline with the following flow:

  1. A Dataflow job reads raw purchase events from a Pub/Sub topic, product details from Cloud Storage files, and customer information from a BigQuery table. The job joins this information and creates a purchases table in BigQuery.

  2. Subsequent BigQuery jobs transform the purchases table to create smaller aggregated tables, such as tables summarized by region or brand, and to calculate new columns, such as total_profit.

  3. Analysts use these tables to generate reports and dashboards in Looker.

This common scenario can present several challenges:

  • Data consumers lack a self-service way to verify whether data originates from an authoritative source.

  • Data engineers struggle to find the root cause of issues because they can't reliably track all data transformations. For example, if an analyst finds an error in a total_profit column, tracing the error back to its origin is difficult.

  • Data engineers and analysts can't fully assess the potential impact of modifying or deleting tables. For example, before deprecating a product_id column, they must identify all dependent downstream columns to avoid breaking reports.

  • Data governors lack visibility into how sensitive data is used across the organization, making it difficult to ensure compliance with regulatory requirements.

Data lineage solves these problems by providing a clear, visual map of your data's journey. With data lineage, you can do the following:

  • Understand how data is sourced and transformed by using lineage graphs.

  • Trace errors in data entries and operations back to their root causes.

  • Enable better change management through impact analysis to avoid downtime or unexpected errors, understand dependencies, and collaborate with stakeholders.

Data lineage workflow

The data lineage workflow includes the following steps:

  1. Data sources and ingestion: lineage information from your data sources initiates the entire process. For more information, see Lineage sources.

    • Google Cloud services: when the Data Lineage API is enabled, supported services like BigQuery and Dataflow automatically report lineage events whenever data is moved or transformed.

    • Custom sources: for any systems not automatically supported by Google Cloud integrations, you can use the Data Lineage API to manually record lineage information. We recommend importing events formatted according to the OpenLineage standard.

  2. Lineage platform: this central platform ingests, models, and stores all lineage data. For more information, see Lineage information model and granularity.

    • Data Lineage API: this API acts as the single entry point for all incoming lineage information. It uses a hierarchical data model consisting of three core concepts: process, run, and event.

    • Processing and storage: the platform processes incoming data and stores it in reliable, query-optimized databases.

  3. User experience: you can interact with the stored lineage information in two primary ways:

    • Visual exploration: in the Google Cloud console, a frontend service fetches and renders the lineage data as an interactive graph or list. This is supported for Dataplex Universal Catalog, BigQuery, and Vertex AI (for models, datasets, feature store views, and feature groups). This is ideal for visually exploring your data's journey. For more information, see Lineage views in the Google Cloud console.

    • Programmatic access: using an API client, you can directly communicate with the Data Lineage API to automate lineage management. This lets you write lineage information from custom sources. It also lets you read and query the stored lineage data for use in other applications or for building custom reports.
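
The process, run, and event hierarchy described above can be sketched as the JSON request bodies an API client might send. This is a minimal illustration only; the project, location, job, and table names are hypothetical placeholders, and the field shapes follow the Data Lineage API REST resources (processes, runs, lineageEvents) as commonly documented.

```python
# Sketch: JSON bodies for the Data Lineage API's three core resources.
# All project, location, and asset names below are hypothetical.
import json

PARENT = "projects/my-project/locations/us-central1"  # hypothetical parent

# 1. A process models a repeatable data-transformation job.
process = {
    "displayName": "Nightly purchases aggregation",  # hypothetical job name
}

# 2. A run models one execution of that process.
run = {
    "displayName": "2024-05-01 run",
    "startTime": "2024-05-01T02:00:00Z",
    "state": "COMPLETED",
}

# 3. A lineage event records data movement between assets during the run.
event = {
    "startTime": "2024-05-01T02:00:00Z",
    "links": [
        {
            "source": {"fullyQualifiedName": "bigquery:my-project.shop.purchases"},
            "target": {"fullyQualifiedName": "bigquery:my-project.shop.profit_by_region"},
        }
    ],
}

# Each resource is created under its parent, mirroring the hierarchy:
#   {PARENT}/processes/{id}/runs/{id}/lineageEvents/{id}
print(json.dumps({"process": process, "run": run, "event": event}, indent=2))
```

A client would create the process once, then create a run and its lineage events for each execution of the pipeline.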

Lineage sources

You can populate lineage information in Dataplex Universal Catalog in the following ways:

  • Automatically from integrated Google Cloud services
  • Manually, by using the Data Lineage API for custom sources
  • By importing events from OpenLineage

Automated data lineage tracking

When you enable the Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources.

BigQuery

When you enable data lineage in your BigQuery project, Dataplex Universal Catalog automatically records lineage information for BigQuery copy, query, and load jobs. Each job is represented as a process.

To view the details of a process, click the process node on the lineage graph.

Each process lists the job_id of the most recent BigQuery job in its attributes.

Other services

In addition to BigQuery, data lineage supports integration with other Google Cloud services, such as Dataflow and Vertex AI.

Data lineage for custom data sources

You can use the Data Lineage API to manually record lineage information for any data source that isn't supported by the integrated systems.

Dataplex Universal Catalog can create lineage graphs for manually recorded lineage if you use a fullyQualifiedName that matches the fully qualified names of existing Dataplex Universal Catalog entries. If you want to record lineage for a custom data source, you must first create a custom entry.

Each process for a custom data source can contain a sql key in the attributes list. The value of this key is used to render a code highlight in the details panel of the data lineage graph. The SQL statement is displayed as it was provided. You are responsible for filtering out sensitive information. The key name sql is case-sensitive.
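
As a minimal sketch, the following shows what a custom-source process with a sql attribute and a matching lineage link might look like. The system prefix, table paths, and SQL statement are hypothetical; the only requirement stated above is that the sql key is spelled in lowercase and that fullyQualifiedName values match existing Dataplex Universal Catalog entries.

```python
# Sketch: a process for a custom data source whose details panel shows SQL.
# The "sql" attribute key is case-sensitive, and its value is displayed
# verbatim, so strip any sensitive literals before recording it.
process = {
    "displayName": "Custom warehouse load",  # hypothetical
    "attributes": {
        # Rendered as a highlighted code block in the lineage graph's
        # details panel.
        "sql": "INSERT INTO sales.daily SELECT * FROM staging.daily",
    },
}

# A lineage link joins the graph only when each fullyQualifiedName matches
# an existing catalog entry. The "custom_warehouse:" prefix is hypothetical.
link = {
    "source": {"fullyQualifiedName": "custom_warehouse:staging.daily"},
    "target": {"fullyQualifiedName": "custom_warehouse:sales.daily"},
}
```

Because the SQL is rendered as provided, it is usually worth building the attribute value from a sanitized template rather than the raw executed statement.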

OpenLineage

If you already use OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Dataplex Universal Catalog and view these events in the Google Cloud console. For more information, see Integrate with OpenLineage.
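
For orientation, a minimal OpenLineage run event of the kind you might import looks roughly like the following. The namespaces, job name, run ID, and producer URL are hypothetical placeholders; the field names follow the OpenLineage RunEvent schema.

```python
# Sketch: a minimal OpenLineage COMPLETE event. All identifiers are
# hypothetical examples, not values required by the import.
ol_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-05-01T02:05:00Z",
    # One run of one job; runId is a UUID chosen by the producer.
    "run": {"runId": "c3f6a1d2-0000-4000-8000-000000000000"},
    "job": {"namespace": "my-pipeline", "name": "purchases_aggregation"},
    # Input and output datasets define the lineage edge.
    "inputs": [{"namespace": "bigquery", "name": "my-project.shop.purchases"}],
    "outputs": [{"namespace": "bigquery", "name": "my-project.shop.profit_by_region"}],
    # Identifies the system that emitted the event.
    "producer": "https://example.com/my-scheduler",
}
```

Systems that already emit events in this shape can forward them unchanged; the dataset namespaces and names determine which catalog entries the lineage attaches to.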

Limitations

The following are limitations for data lineage:

  • All lineage information is retained in the system for only 30 days.

  • Lineage information persists after you delete the related data source. For example, if you delete a BigQuery table, you can still view its lineage through the API and the console for up to 30 days.

Column-level lineage limitations

Column-level lineage has the following additional limitations:

  • Column-level lineage isn't collected for BigQuery load jobs or for routines.

  • Upstream column-level lineage isn't collected for external tables.

  • Column-level lineage isn't collected if a job creates more than 1,500 column-level links. In these cases, only table-level lineage is collected.

  • There is no API to create, read, update, delete, or search for column-level lineage.

  • Support for partitioned tables is limited, because partitioning columns like _PARTITIONDATE and _PARTITIONTIME aren't recognized in the lineage graph.

  • Console limitations:

    • The lineage graph traversal is limited to a depth of 20 levels and 10,000 links in each direction.

    • Column-level lineage is only fetched from the region where the root table is located. There is no support for cross-region lineage in the graph view.

Pricing

  • Dataplex Universal Catalog uses the premium processing SKU to charge for data lineage. For more information, see Pricing.

  • To separate data lineage charges from other charges in the Dataplex Universal Catalog premium processing SKU, on the Cloud Billing report, use the label goog-dataplex-workload-type with the value LINEAGE.

  • Calling the Data Lineage API with an Origin sourceType value other than CUSTOM incurs additional costs.

What's next