Exploring data lineage

This tutorial shows how to use Cloud Data Fusion to explore data lineage: the data's origins and its movement over time.

Cloud Data Fusion Data Lineage

Cloud Data Fusion data lineage helps you:

  • detect the root cause of bad data events
  • perform an impact analysis prior to making data changes

Cloud Data Fusion provides lineage at the dataset level and the field level. Lineage is time-bound, so you can see how it changes over time.

  • Dataset level lineage shows the relationship between datasets and pipelines in a selected time interval.

  • Field level lineage shows the operations that were performed on a set of fields in the source dataset to produce a different set of fields in the target dataset.

Tutorial Scenario

In this tutorial, you work with two pipelines:

  • The Shipment Data Cleansing pipeline reads raw shipment data from a small sample dataset and applies transformations to clean the data.

  • The Delayed Shipments USA pipeline then reads the cleansed shipment data, analyzes it, and finds shipments within the USA that were delayed by more than a specified threshold.

These tutorial pipelines demonstrate a typical scenario in which raw data is cleaned and then sent for downstream processing. You can explore this data trail, from raw data to cleaned shipment data to analytic output, using the Cloud Data Fusion lineage feature.
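
As a rough illustration of this two-stage flow, the following Python sketch mimics what the two pipelines do on a few in-memory rows. It is not the tutorial's pipeline definition: the field names (shipment_id, country, delay_days), the sample rows, and the threshold value are all assumptions made up for the example.

```python
def cleanse(raw_rows):
    """Stage 1 (cleansing): trim whitespace, uppercase the country,
    and drop rows whose delay is missing or not a whole number."""
    cleaned = []
    for row in raw_rows:
        delay = str(row.get("delay_days", "")).strip()
        if not delay.isdigit():
            continue  # unusable delay value, so the row is discarded
        cleaned.append({
            "shipment_id": str(row["shipment_id"]).strip(),
            "country": str(row["country"]).strip().upper(),
            "delay_days": int(delay),
        })
    return cleaned


def delayed_in_usa(cleaned_rows, threshold_days):
    """Stage 2 (analysis): keep USA shipments delayed by more than the threshold."""
    return [
        row for row in cleaned_rows
        if row["country"] == "USA" and row["delay_days"] > threshold_days
    ]


# Hypothetical raw rows, standing in for the sample dataset.
raw = [
    {"shipment_id": " S1 ", "country": "usa", "delay_days": "5"},
    {"shipment_id": "S2", "country": "USA", "delay_days": "1"},
    {"shipment_id": "S3", "country": "CA", "delay_days": "9"},
    {"shipment_id": "S4", "country": "USA", "delay_days": ""},
]

cleaned = cleanse(raw)
delayed = delayed_in_usa(cleaned, threshold_days=2)
print([row["shipment_id"] for row in delayed])
```

The output of the first function is the only input of the second, which is the same dependency that the lineage graph later shows between the two pipelines.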

Objectives

  • Run sample pipelines to produce lineage
  • Explore dataset and field level lineage
  • Learn how to pass handshaking information from the upstream pipeline to the downstream pipeline

Costs

This tutorial uses billable components of Google Cloud, including:

  • Cloud Data Fusion
  • Cloud Storage
  • BigQuery

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Data Fusion, Cloud Storage, Dataproc, and BigQuery APIs.

    Enable the APIs

  5. Create a Cloud Data Fusion Enterprise Edition instance.
  6. Click on the following links to download these small sample datasets to your local machine:

Open the Cloud Data Fusion UI

When using Cloud Data Fusion, you use both the Cloud Console and the separate Cloud Data Fusion UI. In the Cloud Console, you can create a Google Cloud project, and create and delete Cloud Data Fusion instances. In the Cloud Data Fusion UI, you can use the various pages, such as Lineage, to access Cloud Data Fusion features.

  1. In the Cloud Console, open the Instances page.

    Open the Instances page

  2. In the Actions column for the instance, click the View Instance link. The Cloud Data Fusion UI opens in a new browser tab.

  3. Click Studio from the Integrate panel or the left navigation panel to open the Cloud Data Fusion Studio page.

Deploy and run pipelines

  1. Import the raw Shipping Data. Click Import in the top right of the Studio page (or click +→Pipeline→Import), then select and import the Shipment Data Cleansing pipeline that you downloaded in Before you begin.

  2. Deploy the pipeline. Click Deploy in the top-right of the Studio page. After deployment, the Pipeline page opens.

  3. Run the pipeline. Click Run in the top-center of the Pipeline page.

  4. Import, deploy, and run the Delayed Shipments data and pipeline. After the status of the Shipment Data Cleansing pipeline shows "Succeeded", apply the preceding steps to the Delayed Shipments USA data that you downloaded in Before you begin. Return to the Studio page to import the data, then deploy and run this second pipeline from the Pipeline page. After this second pipeline completes successfully, continue with the remaining steps.

Discover datasets

You must discover a dataset before exploring its lineage. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page. Because the Shipment Data Cleansing pipeline specified "Cleaned-Shipments" as the reference dataset, enter "shipment" in the Search box. The search results include this dataset.

Using tags to discover datasets

A Metadata search discovers datasets that have been consumed, processed, or generated by Cloud Data Fusion pipelines. Pipelines execute on a structured framework that generates and collects technical and operational metadata. The technical metadata includes dataset name, type, schema, fields, creation time, and processing information. This technical information is used by the Cloud Data Fusion metadata search and lineage features.

Cloud Data Fusion also supports the annotation of datasets with business metadata, such as tags and key-value properties, which can be used as search criteria. For example, to add and search for a business tag annotation on the Raw Shipping Data dataset:

  1. Click the Properties button of the Raw Shipping Data node on the Shipment Data Cleansing Pipeline page to open the GCS Properties page.

  2. Click View Metadata to open the Search page.

  3. Under Business Tags, click +, enter a tag name (alphanumeric and underscore characters are allowed), and then press Enter.
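
If you generate tag names in a script before entering them, a small check like the following can catch names that break the character rule in step 3. This is only a sketch: it assumes "alphanumeric" means ASCII letters and digits and that empty names are rejected, neither of which is confirmed here, and is_valid_tag is a hypothetical helper, not part of Cloud Data Fusion.

```python
import re

# Assumption: a tag is one or more ASCII letters, digits, or underscores.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_tag(tag):
    """Return True if the whole tag consists only of allowed characters."""
    return TAG_PATTERN.fullmatch(tag) is not None


for candidate in ["raw_shipments_2024", "raw shipments", "raw-shipments", ""]:
    print(repr(candidate), is_valid_tag(candidate))
```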

Explore lineage

Dataset level lineage

Click the Cleaned-Shipments dataset name listed on the Search page (from Discover datasets), then click the Lineage tab. The lineage graph shows that this dataset was generated by the Shipments-Data-Cleansing pipeline, which consumed the Raw_Shipping_Data dataset.

The left and right arrows allow you to navigate back and forward through any previous or subsequent dataset lineage. In this example, the graph displays the full lineage for the Cleaned-Shipments dataset.
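
To make the time-bound aspect concrete, the sketch below models dataset-level lineage as a list of pipeline runs and returns only the edges produced inside a chosen interval. It is an in-memory illustration, not how Cloud Data Fusion stores or serves lineage; the PipelineRun class, the timestamps, and the Delayed-Shipments-USA dataset name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """One pipeline run: the datasets it read and wrote, and when it ran."""
    pipeline: str
    reads: tuple
    writes: tuple
    timestamp: int  # made-up run time, in seconds


def dataset_lineage(runs, start, end):
    """Return sorted (source, pipeline, target) edges for runs
    whose timestamp satisfies start <= timestamp < end."""
    edges = set()
    for run in runs:
        if start <= run.timestamp < end:
            for source in run.reads:
                for target in run.writes:
                    edges.add((source, run.pipeline, target))
    return sorted(edges)


runs = [
    PipelineRun("Shipments-Data-Cleansing",
                ("Raw_Shipping_Data",), ("Cleaned-Shipments",), 100),
    PipelineRun("Delayed-Shipments-USA-pipeline",  # hypothetical pipeline ID
                ("Cleaned-Shipments",), ("Delayed-Shipments-USA",), 200),
]

# A narrow interval shows only the first run; a wider one shows both.
print(dataset_lineage(runs, 0, 150))
print(dataset_lineage(runs, 0, 300))
```

Widening or narrowing the interval changes which edges appear, which is the behavior the time range selector on the lineage graph exposes.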

Field level lineage

Cloud Data Fusion field level lineage shows the relationship between the fields of a dataset and the transformations that were performed on a set of fields to produce a different set of fields. Like dataset level lineage, field level lineage is time-bound, and its results change with time.

Continuing from the Dataset level lineage step, click the Field Level Lineage button in the top right of the Cleaned-Shipments dataset level lineage graph to display its field level lineage graph.

The field level lineage graph shows connections between fields. You can select a field to view its lineage. Select View→Pin field to view that field's lineage only.

Select View→View impact to perform an impact analysis.

The cause and impact links show the transformations performed on both sides of a field in a human-readable ledger format. This information can be essential for reporting and governance.
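
The idea behind cause and impact views can be sketched as a graph walk over field-to-field edges: following edges upstream gives the cause of a field, and following them downstream gives its impact. The example below is a minimal sketch under assumptions, not the Cloud Data Fusion implementation; the field names, operation names, and the trace function are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical field-level edges: (source field, operation, target field).
FIELD_EDGES = [
    ("raw.ship_date", "parse-date", "cleaned.ship_date"),
    ("raw.delivery_date", "parse-date", "cleaned.delivery_date"),
    ("cleaned.ship_date", "compute-delay", "delayed.delay_days"),
    ("cleaned.delivery_date", "compute-delay", "delayed.delay_days"),
    ("raw.country", "uppercase", "cleaned.country"),
]


def trace(field, edges, direction):
    """Walk the lineage graph breadth-first from `field`.

    direction="impact" follows edges downstream (what the field feeds);
    direction="cause" follows edges upstream (what the field came from).
    Returns a ledger of (from_field, operation, to_field) steps in visit order.
    """
    if direction not in ("impact", "cause"):
        raise ValueError("direction must be 'impact' or 'cause'")
    neighbors = defaultdict(list)
    for source, operation, target in edges:
        if direction == "impact":
            neighbors[source].append((operation, target))
        else:
            neighbors[target].append((operation, source))
    seen, ledger, queue = {field}, [], deque([field])
    while queue:
        current = queue.popleft()
        for operation, nxt in neighbors[current]:
            ledger.append((current, operation, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return ledger


impact = trace("raw.ship_date", FIELD_EDGES, "impact")
cause = trace("delayed.delay_days", FIELD_EDGES, "cause")
for step in impact:
    print("impact:", step)
for step in cause:
    print("cause:", step)
```

An impact walk from a raw field lists every downstream field that a change would touch, which is the question an impact analysis answers before you modify the data.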

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, clean up the resources you created after you finish, so that they don't take up quota and you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.

Delete the tutorial dataset

This tutorial creates a logistics_demo dataset with several tables in your project.

You can delete the dataset from the BigQuery web UI in the Cloud Console.

Delete the Cloud Data Fusion instance

Follow the instructions to delete your Cloud Data Fusion instance.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next