Track data lineage for a BigQuery table
Data lineage lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.
Learn how to get started with tracking data lineage for BigQuery copy and query jobs:
Copy two tables from a publicly available
new_york_taxi_trips
dataset.Combine the total number of taxi rides from both tables into a new table.
View a lineage visualization graph for all three operations.
Before you begin
Set up your project:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Data Catalog, BigQuery, and data lineage APIs.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Data Catalog, BigQuery, and data lineage APIs.
Required roles
To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:
-
Data Catalog viewer (
roles/datacatalog.viewer
) on the Data Catalog resource project -
Data lineage viewer (
roles/datalineage.viewer
) on the project where you use BigQuery -
BigQuery data viewer (
roles/bigquery.dataViewer
) on the project where you use BigQuery
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Add a public dataset to your project
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click Add.
In the Add pane, search for
Public datasets
, and select the Public datasets result.In the Marketplace pane, search for
NYC TLC Trips
and click the NYC TLC Trips result.Click View Dataset.
This step adds the dataset new_york_taxi_trips to your project. The details pane shows Dataset info, including information such as Dataset ID, Data location, and Last modified date.
Create a dataset in your project
In the Explorer pane, select the project where you want to create the dataset.
Click the
Actions icon and click Create dataset.In the Create dataset page, in the Dataset ID field, enter:
data_lineage_demo
. Leave the other fields with their default values.Click Create dataset.
In the Explorer pane, click the newly added
data_lineage_demo
.
The details pane shows its Dataset info.
Copy two publicly accessible tables to your dataset
Open a query editor: In the details pane, next to the tab called
data_lineage_demo
, click (Compose new query). This step creates a tab calledUntitled
.In the query editor, copy the first table by entering the following query. Replace
PROJECT_ID
with your project's identifier.CREATE TABLE `PROJECT_ID.data_lineage_demo.nyc_green_trips_2021` COPY `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2021`
Click
Run. This step creates the first table, callednyc_green_trips_2021
.In the Query results pane, click Go to table. This step displays the contents of the first table.
In the query editor, copy the second table by replacing the previous query with the following query. Replace
PROJECT_ID
with your project's identifier.CREATE TABLE `PROJECT_ID.data_lineage_demo.nyc_green_trips_2022` COPY `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2022`
Click
Run. This step creates the second table, callednyc_green_trips_2022
.In the Query results pane, click Go to table. This step displays the contents of the second table.
Aggregate data into a new table
In the query editor, enter the following query. Replace
PROJECT_ID
with your project's identifier.CREATE TABLE `PROJECT_ID.data_lineage_demo.total_green_trips_22_21` AS SELECT vendor_id, COUNT(*) AS number_of_trips FROM ( SELECT vendor_id FROM `PROJECT_ID.data_lineage_demo.nyc_green_trips_2022` UNION ALL SELECT vendor_id FROM `PROJECT_ID.data_lineage_demo.nyc_green_trips_2021` ) GROUP BY vendor_id
Click
Run. This step creates a combined table, calledtotal_green_trips_22_21
.In the Query results pane, click Go to table. This step displays the combined table.
View the lineage graph in Dataplex
Open the Dataplex Search page.
For Choose search platform, select Data Catalog as the search mode.
In the Search box, enter
total_green_trips_22_21
and click Search.From the results list, click
total_green_trips_22_21
. This step displays the BigQuery table Details tab.Click the Lineage tab.
In the lineage graph, each rectangular node represents a table, either an original, copied, or combined table. You can do the following:
Show or hide the origin of a table, by clicking + (Expand) or - (Collapse).
Show table information, by clicking a node. This step displays a node Details pane.
Show process information, by clicking a process icon. This step displays a process Details pane showing the job that transformed a source table to a target table.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the dataset
Go to the BigQuery page.
In the Explorer pane, search for the
data_lineage_demo
dataset you created.Right-click the dataset and select Delete.
Confirm your delete action.
What's next
- Learn more about Dataplex and data lineage.
- Learn how to run BigQuery queries.
- Learn how to use data lineage and view data lineage graphs.
- Learn about Dataplex pricing and billing.