Use data lineage in Dataflow

Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Each pipeline that you run by using Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to eventual destination.

When you enable data lineage for your Dataflow jobs, Dataflow captures lineage events and publishes them to the Dataplex Data Lineage API.

To access lineage information through Dataplex, see Use data lineage with Google Cloud systems.

Before you begin

Set up your project:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account.
  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the Dataplex, BigQuery, and Data lineage APIs.

In Dataflow, you also need to enable lineage at the job level. See Enable data lineage in Dataflow in this document.

Required roles

To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the required data lineage IAM roles.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about data lineage roles, see Predefined roles for data lineage.

Support and limitations

Data lineage in Dataflow has the following limitations:

  • Data lineage is supported in Apache Beam SDK versions 2.63.0 and later.
  • You must enable data lineage on a per-job basis.
  • Data capture isn't instantaneous. It can take a few minutes for Dataflow job lineage data to appear in Dataplex.
  • The following sources and sinks are supported:

    • Apache Kafka
    • BigQuery
    • Bigtable
    • Cloud Storage
    • JDBC (Java Database Connectivity)
    • Pub/Sub
    • Spanner

    Dataflow templates that use these sources and sinks also automatically capture and publish lineage events.
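
As a pre-flight check for the SDK version requirement listed above, the following minimal sketch compares a version string against 2.63.0. The helper names parse_version and sdk_supports_lineage are hypothetical and aren't part of Apache Beam or Dataflow. The sketch assumes plain dotted version strings and ignores any pre-release suffix.

```python
import re

# Minimum Apache Beam SDK version for data lineage, as stated in the list above.
MIN_LINEAGE_SDK = (2, 63, 0)


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as a tuple.

    Assumption: suffixes such as "rc1" or ".dev" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def sdk_supports_lineage(version: str) -> bool:
    """Return True if the given SDK version is 2.63.0 or later."""
    return parse_version(version) >= MIN_LINEAGE_SDK
```

In a Python pipeline, you could pass apache_beam.__version__ to sdk_supports_lineage before you add the lineage service option.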

Enable data lineage in Dataflow

You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option as follows:

Java

--dataflowServiceOptions=enable_lineage=true

Python

--dataflow_service_options=enable_lineage=true

Go

--dataflow_service_options=enable_lineage=true

gcloud

Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.

--additional-experiments=enable_lineage=true
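
If you assemble launch arguments in a script, the flags above can be generated from a small lookup table. The following is a sketch only: the names LINEAGE_FLAGS, enable_lineage_flag, and with_lineage are hypothetical helpers, and the flag spellings are copied from the examples in this section.

```python
# Flag name for each launch interface, copied from the examples above.
LINEAGE_FLAGS = {
    "java": "--dataflowServiceOptions",
    "python": "--dataflow_service_options",
    "go": "--dataflow_service_options",
    "gcloud": "--additional-experiments",
}


def enable_lineage_flag(interface: str) -> str:
    """Return the complete flag that enables lineage for one interface."""
    key = interface.lower()
    if key not in LINEAGE_FLAGS:
        raise ValueError(f"unknown interface: {interface!r}")
    return f"{LINEAGE_FLAGS[key]}=enable_lineage=true"


def with_lineage(args: list, interface: str) -> list:
    """Return a copy of args with the lineage flag appended at most once."""
    flag = enable_lineage_flag(interface)
    return list(args) if flag in args else [*args, flag]
```

For example, with_lineage(["--runner=DataflowRunner"], "python") returns the original argument followed by the Python form of the flag.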

Optionally, you can specify one or both of the following parameters with the service option:

  • process_id: A unique identifier that Dataplex uses to group job runs. If not specified, the job name is used.
  • process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.

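Specify these options as follows. If you pass the flag on a shell command line, enclose it in quotes so that the shell doesn't interpret the semicolon as a command separator: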

Java

--dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Python

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Go

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

gcloud

--additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
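
Because the parameters are joined with a semicolon, which most shells treat as a command separator, it can help to build and quote the flag programmatically. The following sketch assumes the Python flag name shown above. The helper names and the sample values nightly-etl and NightlyETL are hypothetical.

```python
import shlex
from typing import Optional


def lineage_option_value(process_id: Optional[str] = None,
                         process_name: Optional[str] = None) -> str:
    """Build the value that follows "enable_lineage=" in the service option."""
    params = []
    if process_id:
        params.append(f"process_id={process_id}")
    if process_name:
        params.append(f"process_name={process_name}")
    # With no parameters, fall back to the plain boolean form.
    return ";".join(params) if params else "true"


def python_lineage_flag(process_id: Optional[str] = None,
                        process_name: Optional[str] = None) -> str:
    """Return the full flag in the form used by the Python SDK."""
    value = lineage_option_value(process_id, process_name)
    return f"--dataflow_service_options=enable_lineage={value}"


# Hypothetical values; shlex.quote wraps the flag in single quotes so that
# a POSIX shell passes the semicolon through unchanged.
flag = python_lineage_flag("nightly-etl", "NightlyETL")
shell_safe_flag = shlex.quote(flag)
```

If you pass arguments as a list, for example to subprocess.run, no shell is involved and you can use the unquoted flag directly.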

View lineage in Dataplex

Data lineage provides information about the relationships between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console in the form of a graph or a single table. You can also retrieve data lineage information from the Data Lineage API in the form of JSON data.
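
To illustrate the JSON route, the following sketch builds, but doesn't send, a request to the Data Lineage API searchLinks method for a given target asset. Treat the details as assumptions to check against the API reference: the endpoint path, the request body shape, and the fully qualified name format are written from memory of the v1 REST API. The function name and all argument values are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the Data Lineage REST API (v1).
API_ROOT = "https://datalineage.googleapis.com/v1"


def build_search_links_request(project: str, location: str,
                               target_fqn: str,
                               access_token: str) -> urllib.request.Request:
    """Build a POST request that searches for links ending at target_fqn."""
    url = f"{API_ROOT}/projects/{project}/locations/{location}:searchLinks"
    # Assumed body shape: an entity reference with a fully qualified name.
    body = json.dumps({"target": {"fullyQualifiedName": target_fqn}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical project, location, table, and token.
request = build_search_links_request(
    "my-project", "us", "bigquery:my-project.my_dataset.my_table", "TOKEN")
```

To send the request, you would pass it to urllib.request.urlopen with a valid OAuth 2.0 access token and parse the response body with json.loads.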

For more information, see Use data lineage with Google Cloud systems.

Disable data lineage in Dataflow

If data lineage is enabled for a specific job and you want to disable it, cancel the existing job and run a new version of the job without the enable_lineage service option.
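
When you relaunch a job from a saved argument list, you can drop the lineage option before you run the new job. The following is a minimal sketch with a hypothetical helper name. It assumes that each flag is a single --name=value token in one of the forms shown in this document, and it doesn't handle a lineage option that is combined with other values in the same flag.

```python
# Flag prefixes that enable lineage, in the forms shown in this document.
LINEAGE_PREFIXES = (
    "--dataflowServiceOptions=enable_lineage",
    "--dataflow_service_options=enable_lineage",
    "--additional-experiments=enable_lineage",
)


def without_lineage(args: list) -> list:
    """Return a copy of args with any enable_lineage flag removed."""
    return [arg for arg in args if not arg.startswith(LINEAGE_PREFIXES)]
```

The original list is left unchanged, so you can keep it for reference or for re-enabling lineage later.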

Billing

Using data lineage in Dataflow doesn't impact your Dataflow bill, but it might incur additional charges on your Dataplex bill. For more information, see Data lineage considerations and Dataplex pricing.

What's next