This document describes how to enable data lineage for your Dataproc Serverless for Spark batch workloads at either the project level or the batch workload level.
Overview
Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Dataproc Serverless for Spark workloads capture lineage events and publish them to the Dataplex Data Lineage API. Dataproc Serverless for Spark integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.
You can access lineage information through Dataplex, using lineage visualization graphs and the Data Lineage API. For more information, see View lineage graphs in Dataplex.
Availability, capabilities, and limitations
Data lineage, which supports BigQuery and Cloud Storage data sources, is available for workloads run with Dataproc Serverless for Spark runtime versions 1.1.50+, 1.2.29+, and 2.2.29+, with the following exceptions and limitations:
- Data lineage is not available for SparkR or Spark streaming workloads.
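The version requirement above can be sketched as a small check. This is only an illustration: the helper name `supports_lineage` is hypothetical, and it assumes that "1.1.50+" means patch version 50 or later within the 1.1 line, and that runtime lines not listed above do not support lineage.

```python
# Minimum patch version with lineage support for each runtime line,
# taken from the list above (1.1.50+, 1.2.29+, 2.2.29+).
MIN_LINEAGE_PATCH = {(1, 1): 50, (1, 2): 29, (2, 2): 29}


def supports_lineage(runtime_version: str) -> bool:
    """Return True if a 'MAJOR.MINOR.PATCH' runtime version meets the minimum.

    Assumption: runtime lines that are not listed are treated as unsupported.
    """
    major, minor, patch = (int(part) for part in runtime_version.split("."))
    minimum = MIN_LINEAGE_PATCH.get((major, minor))
    return minimum is not None and patch >= minimum
```

A check like this does not cover the workload-type limitation: SparkR and Spark streaming workloads are excluded regardless of runtime version.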
Before you begin
On the project selector page in the Google Cloud console, select the project to use for your Dataproc Serverless for Spark workloads.
Enable the Data Lineage and Data Catalog APIs.
Required roles
To get the permissions that you need to use data lineage in Dataproc Serverless for Spark, ask your administrator to grant you the following IAM roles on the Dataproc cluster VM service account:
- View lineage visualization in Data Catalog or use the Data Lineage API: Data Lineage Viewer (roles/datalineage.viewer)
- Produce lineage manually using the API: Data Lineage Events Producer (roles/datalineage.producer)
- Edit lineage using the API: Data Lineage Editor (roles/datalineage.editor)
- Perform all operations on lineage: Data Lineage Administrator (roles/datalineage.admin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Enable data lineage at the project level
When you enable data lineage at the project level, Spark lineage is enabled for all subsequent batch workloads that you run in the project.
How to enable data lineage at the project level
To enable data lineage at the project level, set the following custom project metadata.
Key | Value |
---|---|
DATAPROC_LINEAGE_ENABLED | true |
DATAPROC_CLUSTER_SCOPES | https://www.googleapis.com/auth/cloud-platform |
You can disable data lineage at the project level by setting the DATAPROC_LINEAGE_ENABLED metadata to false.
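One way to set these keys is with the gcloud CLI's `gcloud compute project-info add-metadata` command. The sketch below only assembles and prints that command so that you can review it before running it. The helper name `build_metadata_command` and the project ID `my-project` are hypothetical, and using this command (rather than the Google Cloud console) is an assumption, not something this page prescribes.

```python
import shlex

# Metadata keys and values from the table above.
LINEAGE_METADATA = {
    "DATAPROC_LINEAGE_ENABLED": "true",
    "DATAPROC_CLUSTER_SCOPES": "https://www.googleapis.com/auth/cloud-platform",
}


def build_metadata_command(project_id, enabled=True):
    """Build the gcloud argument list that sets the lineage project metadata."""
    if enabled:
        metadata = LINEAGE_METADATA
    else:
        # Disabling only requires setting the flag to false.
        metadata = {"DATAPROC_LINEAGE_ENABLED": "false"}
    pairs = ",".join(f"{key}={value}" for key, value in metadata.items())
    return [
        "gcloud", "compute", "project-info", "add-metadata",
        f"--project={project_id}",
        f"--metadata={pairs}",
    ]


# Print the command for review; "my-project" is a placeholder project ID.
print(shlex.join(build_metadata_command("my-project")))
```

Calling `build_metadata_command("my-project", enabled=False)` produces the command that sets DATAPROC_LINEAGE_ENABLED to false.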
Enable data lineage for a Spark batch workload
You can enable data lineage on a batch workload by setting the spark.dataproc.lineage.enabled property to true when you submit the workload.
gcloud CLI example:
gcloud dataproc batches submit pyspark FILENAME.py --region=REGION \
    --properties=spark.dataproc.lineage.enabled=true
View lineage graphs in Dataplex
A lineage visualization graph displays relationships between your project resources and the processes that created them. You can view data lineage information in a graph visualization in the Google Cloud console or retrieve the information from the Data Lineage API as JSON data.
For more information, see Use data lineage with Google Cloud systems.
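To illustrate the API route, the following sketch builds a request for the Data Lineage API `searchLinks` REST method, which returns lineage links for an entity as JSON. It only constructs the URL and the request body. Sending it requires an authenticated POST with an OAuth access token, which is not shown. The helper name, the project ID `my-project`, the location `us-central1`, and the table name are hypothetical placeholders.

```python
import json

API_ROOT = "https://datalineage.googleapis.com/v1"


def build_search_links_request(project_id, location, fully_qualified_name,
                               direction="target"):
    """Return (url, body) for a searchLinks call.

    direction="target" asks for links that end at the entity (its upstream
    sources); direction="source" asks for links that start from it.
    """
    if direction not in ("source", "target"):
        raise ValueError("direction must be 'source' or 'target'")
    url = f"{API_ROOT}/projects/{project_id}/locations/{location}:searchLinks"
    body = {direction: {"fullyQualifiedName": fully_qualified_name}}
    return url, body


# Placeholder values; BigQuery tables use the form bigquery:PROJECT.DATASET.TABLE.
url, body = build_search_links_request(
    "my-project", "us-central1", "bigquery:my-project.sample.destination")
print(url)
print(json.dumps(body, indent=2))
```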
Example:
The following Spark workload reads data from a BigQuery table, and then writes the output to a different BigQuery table.
#!/usr/bin/env python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('LINEAGE_BQ_TO_BQ') \
    .getOrCreate()

# Cloud Storage bucket that the BigQuery connector uses for temporary data.
bucket = 'lineage-ol-test'
spark.conf.set('temporaryGcsBucket', bucket)

# Read the source BigQuery table.
source = 'sample.source'
words = spark.read.format('bigquery') \
    .option('table', source) \
    .load()
words.createOrReplaceTempView('words')

# Sum the word counts for each word.
word_count = spark.sql('SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

# Write the result to a different BigQuery table.
destination = 'sample.destination'
word_count.write.format('bigquery') \
    .option('table', destination) \
    .save()
This Spark workload creates a lineage graph in the Dataplex UI that connects the source table to the destination table.
What's next
- Learn more about data lineage.