Spark data lineage

Overview

Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

After you enable data lineage, Dataproc Serverless for Spark workloads capture lineage events and publish them to the Dataplex Data Lineage API. Dataproc Serverless for Spark integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.
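
To give a rough sense of what a lineage event carries, the following is a simplified sketch of an OpenLineage-style run event built as a plain Python dictionary. It is illustrative only and is not the exact payload that Dataproc Serverless emits: the helper name `make_run_event`, the job namespace `example-namespace`, the `bigquery` dataset namespace, and the `PROJECT_ID` placeholder are all assumptions, and fields such as `producer` and `schemaURL` are omitted for brevity.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(job_name, input_tables, output_tables, event_type="COMPLETE"):
    """Build a simplified OpenLineage-style run event as a plain dict."""

    def dataset(table):
        # Using "bigquery" as the dataset namespace is an illustrative assumption.
        return {"namespace": "bigquery", "name": table}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # "example-namespace" is a hypothetical job namespace.
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [dataset(t) for t in input_tables],
        "outputs": [dataset(t) for t in output_tables],
    }


event = make_run_event(
    "LINEAGE_BQ_TO_BQ",
    ["PROJECT_ID.sample.source"],
    ["PROJECT_ID.sample.destination"],
)
print(json.dumps(event, indent=2))
```

The key idea is that each event ties one run of a job to the datasets it read and the datasets it wrote, which is the information a lineage graph needs.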

You can access lineage information through Dataplex, either in lineage visualization graphs or through the Data Lineage API. For more information, see View lineage graphs in Dataplex.

Availability, capabilities, and limitations

Data lineage is available for Dataproc Serverless for Spark 1.x runtimes, starting with runtime version 1.1.50. It captures lineage for BigQuery and Cloud Storage data sources.

Lineage support is not provided for the following:

  • BigQuery connector version 2 (Spark data source API version 2)
  • Spark streaming workloads
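
The version requirement above can be checked with a small helper. This is a minimal sketch, not part of any Google Cloud library: the function name `supports_lineage` is hypothetical, it expects a full three-part version string, and it assumes that "1.x runtimes, starting with 1.1.50" also covers later 1.x minor versions such as 1.2.0.

```python
def supports_lineage(runtime_version: str) -> bool:
    """Return True if a full runtime version such as '1.1.50' is 1.x and >= 1.1.50."""
    parts = tuple(int(p) for p in runtime_version.split("."))
    if len(parts) != 3:
        raise ValueError("expected a full version such as '1.1.50'")
    # Assumption: any 1.x version at or above 1.1.50 qualifies.
    return parts[0] == 1 and parts >= (1, 1, 50)


for version in ("1.1.49", "1.1.50", "1.2.0", "2.0.1"):
    print(version, supports_lineage(version))
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexicographically, where "1.1.9" would sort after "1.1.50".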

Before you begin

  1. On the project selector page in the Google Cloud console, select the project to use for your Dataproc Serverless for Spark workloads.

    Go to project selector

  2. Enable the Data Lineage and Data Catalog APIs.

    Enable the APIs

Required roles

To get the permissions that you need to use data lineage in Dataproc Serverless for Spark, ask your administrator to grant you the following IAM roles on the Dataproc cluster VM service account:

For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

Enable data lineage for a Spark batch workload

You can enable Spark data lineage for a batch workload by setting the spark.dataproc.lineage.enabled property to true when you submit the workload.

Google Cloud CLI example:

gcloud dataproc batches submit pyspark FILENAME.py \
    --region=REGION \
    --version=1.1 \
    --properties=spark.dataproc.lineage.enabled=true \
    other args ...
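
If you submit workloads from a script rather than by hand, the same command can be assembled in Python. The following is a minimal sketch under stated assumptions: the helper `build_batch_command` and its parameters are hypothetical, `FILENAME.py` and `REGION` are the same placeholders used above, and the code only builds and prints the command instead of running it.

```python
import shlex


def build_batch_command(script, region, version="1.1", extra_properties=None):
    """Return the gcloud argument list for a lineage-enabled PySpark batch."""
    # Lineage is enabled through a Spark property on the batch.
    properties = {"spark.dataproc.lineage.enabled": "true"}
    properties.update(extra_properties or {})
    props = ",".join(f"{key}={value}" for key, value in properties.items())
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--region={region}",
        f"--version={version}",
        f"--properties={props}",
    ]


cmd = build_batch_command("FILENAME.py", "REGION")
print(shlex.join(cmd))
# To actually submit the batch, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list, rather than as one string, avoids shell quoting problems when property values contain spaces or special characters.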

View lineage graphs in Dataplex

A lineage visualization graph displays the relations between your project resources and the processes that created them. You can view data lineage information in a graph visualization in the Google Cloud console or retrieve the information from the Data Lineage API as JSON data.

For more information, see Use data lineage with Google Cloud systems.
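
As a rough illustration of retrieving lineage as JSON, the sketch below builds a request for the Data Lineage API `searchLinks` method, which returns the links that have a given asset as their source or target. It only constructs the URL and request body and does not send anything. The helper name `build_search_links_request` and the placeholder values `PROJECT_ID` and `LOCATION` are assumptions, as is the `bigquery:PROJECT.DATASET.TABLE` form of the fully qualified name; check the API reference before relying on it.

```python
import json


def build_search_links_request(project, location, bigquery_table):
    """Return (url, json_body) for a searchLinks call targeting a BigQuery table.

    bigquery_table is expected in PROJECT.DATASET.TABLE form.
    """
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project}/locations/{location}:searchLinks"
    )
    # Search for links whose target is the given table, that is, what fed into it.
    body = {"target": {"fullyQualifiedName": f"bigquery:{bigquery_table}"}}
    return url, json.dumps(body)


url, body = build_search_links_request(
    "PROJECT_ID", "LOCATION", "PROJECT_ID.sample.destination"
)
print(url)
print(body)
# Sending this as an HTTP POST also requires an OAuth 2.0 access token
# in the Authorization header, which is not shown here.
```

Replacing `target` with `source` in the request body would instead search for links that start from the given table.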

Example:

The following Spark workload reads data from a BigQuery table, and then writes the output to a different BigQuery table.

#!/usr/bin/env python

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .appName('LINEAGE_BQ_TO_BQ') \
  .getOrCreate()

# Cloud Storage bucket that the BigQuery connector uses to stage data during writes.
bucket = 'lineage-ol-test'
spark.conf.set('temporaryGcsBucket', bucket)

# Read the source BigQuery table into a DataFrame.
source = 'sample.source'
words = spark.read.format('bigquery') \
  .option('table', source) \
  .load()
words.createOrReplaceTempView('words')

# Aggregate the word counts.
word_count = spark.sql('SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

# Write the result to a different BigQuery table.
destination = 'sample.destination'
word_count.write.format('bigquery') \
  .option('table', destination) \
  .save()

This Spark workload creates the following lineage graph in the Dataplex UI:

Sample lineage graph

What's next