Track data lineage for a BigQuery table

Data lineage lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Learn how to get started with tracking data lineage for BigQuery copy and query jobs:

  1. Copy two tables from a publicly available new_york_taxi_trips dataset.

  2. Combine the total number of taxi rides from both tables into a new table.

  3. View a lineage visualization graph for all three operations.

Before you begin

Set up your project:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Data Catalog, BigQuery, and data lineage APIs.

    Enable the APIs

  5. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the Data Catalog, BigQuery, and data lineage APIs.

    Enable the APIs

Required roles

To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Add a public dataset to your project

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click Add.

  3. In the Add pane, search for Public datasets, and select the Public datasets result.

  4. In the Marketplace pane, search for NYC TLC Trips and click the NYC TLC Trips result.

  5. Click View Dataset.

This adds the public dataset's project as a reference that you can view in the Explorer pane. The details pane shows Dataset info, including information such as Dataset ID, Data location, and Last modified date.

Create a dataset in your project

  1. In the Explorer pane, select the project where you want to create the dataset.

  2. Click Actions and click Create dataset.

  3. On the Create dataset page, in the Dataset ID field, enter: data_lineage_demo. Leave the other fields with their default values.

  4. Click Create dataset.

  5. In the Explorer pane, click the newly added data_lineage_demo.

The details pane shows its Dataset info.

Copy two publicly accessible tables to your dataset

  1. Open a query editor: In the details pane, next to the tab called data_lineage_demo, click (Compose new query). This step creates a tab called Untitled.

  2. In the query editor, copy the first table by entering the following query. Replace PROJECT_ID with your project's identifier.

    CREATE TABLE `PROJECT_ID.data_lineage_demo.nyc_green_trips_2021`
    COPY `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2021`
    
  3. Click Run. This step creates the first table, called nyc_green_trips_2021.

  4. In the Query results pane, click Go to table. This step displays the contents of the first table.

  5. In the query editor, copy the second table by replacing the previous query with the following query. Replace PROJECT_ID with your project's identifier.

    CREATE TABLE `PROJECT_ID.data_lineage_demo.nyc_green_trips_2022`
    COPY `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
    
  6. Click Run. This step creates the second table, called nyc_green_trips_2022.

  7. In the Query results pane, click Go to table. This step displays the contents of the second table.

Aggregate data into a new table

  1. In the query editor, enter the following query. Replace PROJECT_ID with your project's identifier.

    CREATE TABLE `PROJECT_ID.data_lineage_demo.total_green_trips_22_21`
    AS SELECT vendor_id, COUNT(*) AS number_of_trips
    FROM (
         SELECT vendor_id FROM `PROJECT_ID.data_lineage_demo.nyc_green_trips_2022`
         UNION ALL
         SELECT vendor_id FROM `PROJECT_ID.data_lineage_demo.nyc_green_trips_2021`
    )
    GROUP BY vendor_id
    
  2. Click Run. This step creates a combined table, called total_green_trips_22_21.

  3. In the Query results pane, click Go to table. This step displays the combined table.

View the lineage graph in Dataplex

  1. In the Google Cloud console, go to the Dataplex Search page.

    Go to Dataplex Search

  2. For Choose search platform, select Data Catalog as the search mode.

  3. In the Search box, enter total_green_trips_22_21 and click Search.

  4. From the results list, click total_green_trips_22_21. This step displays the BigQuery table Details tab.

  5. Click the Lineage tab.

The total_green_trips_22_21 table with details panel docked to the bottom.
Figure 1. Data lineage with node details

In the lineage graph, each rectangular node represents a table, either an original, copied, or combined table. You can do the following:

  • To show or hide the origin of a table, click + (Expand) or - (Collapse).

  • To show table information, click a node. This step displays a node Details pane.

  • To show process information, click view lineage process details. This step displays a process Details pane showing the job that transformed a source table to a target table.

The intermediary nyc_green_trips_2021 table with details panel docked to the bottom.
Figure 2. Data lineage with process details

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the dataset

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, search for the data_lineage_demo dataset you created.

  3. Right-click the dataset and select Delete.

  4. Confirm your delete action.

What's next