Enable Hive data lineage in Dataproc

This document shows you how to enable and use data lineage for Dataproc Hive jobs.

You enable data lineage for Dataproc Hive jobs by specifying an initialization action when you create a cluster.

When you enable Hive data lineage on a cluster, Hive jobs that you submit to the cluster capture data lineage events and publish them to Dataplex.

Visualize lineage information

A data lineage graph displays relationships between your project resources and the processes that created them. You can access lineage graphs using Dataplex, BigQuery Studio, and Vertex AI in the Google Cloud console.

Pricing

Dataproc Hive data lineage is offered during Preview at no additional charge. Standard Dataproc pricing applies.

Before you begin

  1. In the Google Cloud console, on the project selector page, select the project that contains the Dataproc cluster for which you want to track lineage.

    Go to project selector

  2. Enable the Data Lineage API and Dataplex API. You can enable the APIs in the Google Cloud console or with the gcloud CLI, as sketched after this list.

    Enable the APIs
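
If you prefer to work in the gcloud CLI, the following commands are a minimal sketch of both of the preceding steps; they assume PROJECT_ID is the project that contains the Dataproc cluster whose lineage you want to track.

# Select the project, then enable the Data Lineage and Dataplex APIs.
gcloud config set project PROJECT_ID
gcloud services enable datalineage.googleapis.com dataplex.googleapis.com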

Required roles

To get the permissions that you need to use data lineage in Dataproc, ask your administrator to grant the required IAM roles to the Dataproc cluster VM service account.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
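
If you manage IAM bindings yourself, the following gcloud command is a hedged sketch of granting a lineage role to the cluster VM service account. The role shown, roles/datalineage.producer, and the VM_SERVICE_ACCOUNT_EMAIL placeholder are assumptions; confirm the exact role to grant with your administrator or the Data Lineage IAM documentation.

# Sketch only: the role name is an assumption; confirm the required role first.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:VM_SERVICE_ACCOUNT_EMAIL" \
    --role="roles/datalineage.producer"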

Enable Hive data lineage

To enable Hive data lineage on a cluster, specify the hive-lineage.sh initialization action when you create a Dataproc cluster. This initialization action is stored in regional buckets in Cloud Storage.

gcloud CLI cluster creation example:

gcloud dataproc clusters create CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --initialization-actions gs://goog-dataproc-initialization-actions-REGION/hive-lineage/hive-lineage.sh

Replace the following:

  • CLUSTER_NAME: The name of the cluster.
  • PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
  • REGION: The Compute Engine region in which to locate the cluster.
  • --initialization-actions: Specifies the initialization action, located in a regional Cloud Storage bucket, that enables Hive data lineage.
    • Optionally, add the Hive-BigQuery connector initialization action. To integrate BigQuery tables with Hive workloads, you must install the Hive-BigQuery connector on the cluster. See the Hive data lineage with BigQuery example, which runs a connector initialization action to install the connector on the cluster.

Submit a Hive job

When you submit a Hive job to a Dataproc cluster that was created with Hive data lineage enabled, Dataproc captures and reports the data lineage information to Dataplex.

gcloud CLI Hive job submission example:

gcloud dataproc jobs submit hive \
    --cluster=CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --properties=hive.openlineage.namespace=CUSTOM_NAMESPACE \
    --execute HIVE_QUERY

Replace the following:

  • CLUSTER_NAME: The name of the cluster.
  • PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
  • REGION: The Compute Engine region where your cluster is located.
  • CUSTOM_NAMESPACE: An optional custom Hive namespace that identifies the Hive job.
  • HIVE_QUERY: The Hive query to submit to the cluster. Instead of specifying a query, you can replace the --execute HIVE_QUERY flag with a --file SQL_FILE flag to specify the location of a file that contains the query.

View lineage in Dataplex

A lineage graph displays relationships between your project resources and the processes that created them. You can view data lineage information in the Google Cloud console, or retrieve it from the Data Lineage API in the form of JSON data.
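
For example, you can retrieve lineage links by sending a searchLinks request to the Data Lineage API. The following curl command is a hedged sketch: it assumes a BigQuery table as the lineage target and the bigquery:PROJECT_ID.DATASET.TABLE fully qualified name format, so adjust the location and entity name for your resources.

# Sketch of a Data Lineage API searchLinks request for a BigQuery target table.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://datalineage.googleapis.com/v1/projects/PROJECT_ID/locations/REGION:searchLinks" \
    -d '{"target": {"fullyQualifiedName": "bigquery:PROJECT_ID.DATASET.TABLE"}}'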

Hive data lineage with BigQuery example

The example in this section consists of the following steps:

  1. Create a Dataproc cluster that has Hive data lineage enabled and the Hive-BigQuery connector installed on the cluster.
  2. Run a Hive query on the cluster to copy data between Hive tables.
  3. View the generated data lineage graph in BigQuery Studio.

Create a Dataproc cluster

Run the following command in a local terminal window or in Cloud Shell to create a Dataproc cluster.

gcloud dataproc clusters create CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --initialization-actions gs://goog-dataproc-initialization-actions-REGION/connectors/connectors.sh,gs://goog-dataproc-initialization-actions-REGION/hive-lineage/hive-lineage.sh \
    --metadata hive-bigquery-connector-version=VERSION

Notes:

  • Replace the CLUSTER_NAME, PROJECT_ID, and REGION placeholders as described in Enable Hive data lineage.
  • VERSION: The version of the Hive-BigQuery connector to install on the cluster.

Run a Hive query

Run a Hive query to perform the following actions:

  • Create a us_states external table that reads sample data from gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout.
  • Create a us_states_copy managed table in the specified BigQuery dataset.
  • Copy all data from us_states to us_states_copy.

To run the query:

  1. In a local terminal window or in Cloud Shell, use a text editor, such as vi or nano, to copy the following Hive query into a hive-example.sql file, then save the file in the current directory.
  2. Submit the hive-example.sql file to the Dataproc cluster created earlier, replacing the --execute HIVE_QUERY flag with a --file SQL_FILE flag that points to the saved hive-example.sql file (see the submission sketch after the query). Note that you must replace the PROJECT and BQ_DATASET placeholders in the query with your project ID and BigQuery dataset name.

The hive-example.sql query uses the Hive BigQueryStorageHandler to write to BigQuery:

CREATE EXTERNAL TABLE us_states (
    name STRING,
    post_abbr STRING
)
STORED AS PARQUET
LOCATION 'gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout';

CREATE TABLE us_states_copy (
    name STRING,
    post_abbr STRING
)
STORED BY 'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler'
TBLPROPERTIES (
  'bq.table'='PROJECT.BQ_DATASET.us_states_copy'
);

INSERT INTO us_states_copy SELECT * FROM us_states;
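
The following command is a minimal sketch of step 2, assuming the query was saved as hive-example.sql in the current directory and its PROJECT and BQ_DATASET placeholders have already been replaced. Substitute the cluster placeholders as in the earlier examples.

# Submit the saved query file to the lineage-enabled cluster.
gcloud dataproc jobs submit hive \
    --cluster=CLUSTER_NAME \
    --project PROJECT_ID \
    --region REGION \
    --file hive-example.sql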

View the data lineage graph

After the Hive job finishes successfully, view the data lineage in BigQuery Studio in the Google Cloud console:

Figure: Hive lineage graph in BigQuery Studio

For information about displaying graphs in BigQuery Studio, see View lineage in BigQuery. For information about understanding graphs, see Data lineage information model.

What's next