Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Data lineage is available for all Dataproc Spark jobs except SparkR, with Dataproc Compute Engine 2.0.74+ and 2.1.22+ images. Lineage is available for BigQuery and Cloud Storage data sources.
Once you enable the feature in your Dataproc cluster, Dataproc Spark jobs capture lineage events and publish them to the Dataplex Data Lineage API. Dataproc integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.
You can access lineage information through Dataplex, using the following:
Limitations
Lineage is not supported for the following:
- BigQuery Connector version 2 (data source API version 2 of Spark)
- Spark streaming workload
Before you begin
In the Google Cloud console, on the project selector page, select the project that contains the Dataproc cluster for which you want to track lineage.
Enable Data Lineage API and Data Catalog API.
Required roles
To get the permissions that you need to use data lineage in Dataproc, ask your administrator to grant you the following IAM roles on the Dataproc cluster VM service account:
-
View lineage visualization in Data Catalog or to use the Data Lineage API:
Data Lineage Viewer (
roles/datalineage.viewer
) -
Produce lineage manually using the API:
Data Lineage Events Producer (
roles/datalineage.producer
) -
Edit lineage using the API:
Data Lineage Editor (
roles/datalineage.editor
) -
Perform all operations on lineage:
Data Lineage Administrator (
roles/datalineage.admin
)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Enable data lineage in Dataproc
Enable lineage at the cluster level, so that all submitted Spark jobs in the cluster report lineage information to the Data Lineage API.
Create a Dataproc cluster
Create a Dataproc cluster
with the property dataproc:dataproc.lineage.enabled
set to true
.
gcloud dataproc clusters create CLUSTER_NAME \
--region REGION \
--zone ZONE \
--project PROJECT_ID \
--properties 'dataproc:dataproc.lineage.enabled=true' \
--scopes https://www.googleapis.com/auth/cloud-platform
Submit a Spark job
When you submit a Spark job on a Dataproc cluster that was created with lineage enabled, Dataproc captures and reports the lineage information to the Data Lineage API.
gcloud dataproc jobs submit spark \
--project PROJECT_ID \
--cluster=CLUSTER_NAME \
--region REGION \
--class CLASS \
--jars=gs://APPLICATION_BUCKET/spark-application.jar \
--properties=spark.openlineage.namespace=CUSTOM_NAMESPACE,spark.openlineage.appName=CUSTOM_APPNAME
The properties spark.openlineage.namespace
and spark.openlineage.appName
are
optional, and are used to uniquely identify the job. If you don't pass these
properties, Dataproc uses the following default values:
- Default value for
spark.openlineage.namespace
: PROJECT_ID - Default value for
spark.openlineage.appName
:spark.app.name
View lineage graphs in Dataplex
A lineage visualization graph displays the relations between your project resources and the processes that created them. You can view data lineage information in the form of a graph visualization in the Google Cloud console, or retrieve it from the Data Lineage API in the form of JSON data.
For more information, see View lineage graphs in Dataplex UI.
Example
Consider the following Spark job that reads data from a BigQuery table and writes to another BigQuery table:
#!/usr/bin/env python
from pyspark.sql import SparkSession
import sys
spark = SparkSession \
.builder \
.appName('LINEAGE_BQ_TO_BQ') \
.getOrCreate()
bucket = lineage-ol-test
spark.conf.set('temporaryGcsBucket', bucket)
source = sample.source
words = spark.read.format('bigquery') \
.option('table', source) \
.load()
words.createOrReplaceTempView('words')
word_count = spark.sql('SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
destination = sample.destination
word_count.write.format('bigquery') \
.option('table', destination) \
.save()
This Spark job creates the following lineage graph in the Dataplex UI:
Disable data lineage in Dataproc
After you enable linage when you create a cluster, you
cannot disable lineage at the cluster level. To disable lineage in a
Dataproc cluster, recreate the cluster without the
dataproc:dataproc.lineage.enabled
property.
To disable lineage for a particular job on a cluster that was created with
lineage enabled, you must pass the spark.extraListeners
property with empty
value when submitting the job.
What's next
- Learn more about data lineage.