Enable data lineage in Dataproc

This document describes how to enable data lineage for your Dataproc Spark jobs either at the project or cluster level.

Overview

Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Data lineage is available for all Dataproc Spark jobs except SparkR and Spark streaming jobs. It requires Dataproc on Compute Engine image versions 2.0.74+ or 2.1.22+, and supports BigQuery and Cloud Storage data sources.

Once you enable the feature in your Dataproc cluster, Dataproc Spark jobs capture data lineage events and publish them to the Dataplex Data Lineage API. Dataproc integrates with the Data Lineage API through OpenLineage, using the OpenLineage Spark plugin.

You can access data lineage information in Dataplex by using lineage visualization graphs in the Google Cloud console, or by retrieving lineage data from the Data Lineage API.

Limitations

Data lineage is not available for SparkR or Spark streaming jobs.

Before you begin

  1. In the Google Cloud console, on the project selector page, select the project that contains the Dataproc cluster for which you want to track lineage.

    Go to project selector

  2. Enable Data Lineage API and Data Catalog API.

    Enable the APIs

Required roles

To get the permissions that you need to use data lineage in Dataproc, ask your administrator to grant you the following IAM roles on the Dataproc cluster VM service account:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Enable data lineage at the project level

You can enable data lineage at the project level. Supported Spark jobs that run on clusters created after you enable data lineage on a project will have data lineage enabled. Note that jobs run on existing clusters—clusters created before you enabled data lineage at the project level—won't have data lineage enabled.

How to enable data lineage at the project level

To enable data lineage at the project level, set the following custom project metadata:

Key                        Value
DATAPROC_LINEAGE_ENABLED   true
DATAPROC_CLUSTER_SCOPES    https://www.googleapis.com/auth/cloud-platform

You can disable data lineage at the project level by setting the DATAPROC_LINEAGE_ENABLED metadata to false.
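As a sketch, one way to set this project metadata is with the gcloud CLI, where PROJECT_ID is a placeholder for your project ID:

```shell
# Set project-level metadata to enable data lineage for
# clusters created after this point.
gcloud compute project-info add-metadata \
    --project PROJECT_ID \
    --metadata DATAPROC_LINEAGE_ENABLED=true,DATAPROC_CLUSTER_SCOPES=https://www.googleapis.com/auth/cloud-platform
```

To disable, you would run the same command with `DATAPROC_LINEAGE_ENABLED=false`.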

Enable data lineage at the cluster level

You can enable data lineage when you create a cluster so that all supported Spark jobs submitted to the cluster will have data lineage enabled.

How to enable data lineage at the cluster level

To enable data lineage on a cluster, create a Dataproc cluster with the dataproc:dataproc.lineage.enabled cluster property set to true.

2.0 image version clusters: The cloud-platform access scope on cluster VMs is required for data lineage. Clusters created with image version 2.1 and later have the cloud-platform scope enabled by default. If you specify image version 2.0 when you create a cluster, set the scope to cloud-platform explicitly.

gcloud CLI example:

gcloud dataproc clusters create CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --properties 'dataproc:dataproc.lineage.enabled=true'
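If you create a cluster with a 2.0 image version, also set the cloud-platform scope. A sketch, with CLUSTER_NAME, PROJECT_ID, and REGION as placeholders:

```shell
# Create a 2.0 image version cluster with data lineage enabled.
# The cloud-platform scope must be set explicitly on 2.0 images.
gcloud dataproc clusters create CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --image-version 2.0 \
    --scopes cloud-platform \
    --properties 'dataproc:dataproc.lineage.enabled=true'
```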

Disable data lineage on a job

If you enable data lineage at the cluster level, you can disable data lineage on a specific job by passing the spark.extraListeners property with an empty value ("") when you submit the job.
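For example, a sketch of submitting a job with lineage disabled for that job only (CLUSTER_NAME and the other values are placeholders):

```shell
# Submit a Spark job with the OpenLineage listener disabled
# by passing spark.extraListeners with an empty value.
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --class CLASS \
    --jars=gs://APPLICATION_BUCKET/spark-application.jar \
    --properties=spark.extraListeners=""
```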

After you enable data lineage on a cluster, you cannot disable it on the cluster itself. To stop capturing data lineage for all jobs on the cluster, recreate the cluster without the dataproc:dataproc.lineage.enabled property.

Submit a Spark job

When you submit a Spark job on a Dataproc cluster that was created with data lineage enabled, Dataproc captures and reports the data lineage information to the Data Lineage API.

gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --class CLASS \
    --jars=gs://APPLICATION_BUCKET/spark-application.jar \
    --properties=spark.openlineage.namespace=CUSTOM_NAMESPACE,spark.openlineage.appName=CUSTOM_APPNAME

Notes:

  • Adding the spark.openlineage.namespace and spark.openlineage.appName properties, which are used to uniquely identify the job, is optional. If you don't add these properties, Dataproc uses the following default values:
    • Default value for spark.openlineage.namespace: PROJECT_ID
    • Default value for spark.openlineage.appName: spark.app.name

View lineage graphs in Dataplex

A lineage visualization graph displays relationships between your project resources and the processes that created them. You can view data lineage information in the form of a graph visualization in the Google Cloud console, or retrieve it from the Data Lineage API in the form of JSON data.
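As a sketch of retrieving lineage data as JSON, you can call the Data Lineage API REST endpoint directly. This assumes the processes.list method; PROJECT_ID and REGION are placeholders:

```shell
# List lineage process records for a project and location
# from the Data Lineage API.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://datalineage.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/processes"
```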

For more information, see View lineage graphs in Dataplex UI.

Example:

The following Spark job reads data from one BigQuery table and writes the results to another BigQuery table.

#!/usr/bin/env python

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .appName('LINEAGE_BQ_TO_BQ') \
  .getOrCreate()

# Temporary bucket used by the BigQuery connector for writes.
bucket = 'lineage-ol-test'
spark.conf.set('temporaryGcsBucket', bucket)

# Read the source table into a DataFrame.
source = 'sample.source'
words = spark.read.format('bigquery') \
  .option('table', source) \
  .load()
words.createOrReplaceTempView('words')

# Aggregate word counts.
word_count = spark.sql(
  'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')

# Write the results to the destination table.
destination = 'sample.destination'
word_count.write.format('bigquery') \
  .option('table', destination) \
  .save()

The Spark job creates the following lineage graph in the Dataplex UI:

Sample lineage graph

What's next