View lineage in Dataplex

This page describes how to view the data lineage generated by your Cloud Data Fusion pipelines alongside other data movement on Google Cloud, for discovery and governance purposes. You can view lineage graphs for supported data sources on the Dataplex page in the Google Cloud console, or use the Data Lineage API to retrieve complete data lineage records.

Plugins that support Dataplex data lineage

Cloud Data Fusion and Dataplex support asset-level lineage for the following plugins:

  • Amazon S3
  • BigQuery
  • BigQuery Multi Table sink (version 6.9.1 and later)
  • Spanner
  • Cloud Storage
  • Cloud SQL for MySQL
  • Cloud SQL for PostgreSQL
  • Dataplex
  • FTP
  • Generic Database
  • HTTP
  • MSSQL/SQL Server
  • Multiple Database Tables source (version 6.9.1 and later)
  • MySQL
  • Oracle
  • PostgreSQL
  • SAP OData
  • SAP ODP
  • SAP Table

For more information, see Cloud Data Fusion plugins.

Before you begin

To enable viewing Cloud Data Fusion lineage graphs on the Dataplex page in the console, do the following:

  1. Create a data pipeline that uses only the supported plugins.

  2. Enable the Data Lineage API in the project that contains your Cloud Data Fusion instance.

  3. Grant the Data Lineage Events Producer role (roles/datalineage.producer) to the Cloud Data Fusion-managed service account, the Cloud Data Fusion API Service Agent. The process differs if your instance runs a version of Cloud Data Fusion earlier than 6.10.0 and role-based access control (RBAC) is enabled.

    6.10+ or no RBAC

    If your Cloud Data Fusion instance uses version 6.10.0 or later, or your instance uses an earlier version and RBAC isn't enabled, follow these steps:

    1. In the Google Cloud console, go to the IAM page.

    2. Select the Include Google-provided role grants checkbox.

    3. Select the Cloud Data Fusion API Service Agent service account and click Edit.

    4. Click Add another role and select the Data Lineage Events Producer role.

    5. Click Save.

    <6.10 with RBAC

    If your Cloud Data Fusion instance uses a version earlier than 6.10.0 and RBAC is enabled, the service account doesn't appear in the list of principals on the IAM page. You must enter the service account name manually.

    To grant the required role, follow these steps:

    1. In the Google Cloud console, go to the IAM page.

    2. Click Grant access.

    3. In the New principals field, enter the Cloud Data Fusion API Service Agent service account. Use the following format: datafusion-system@TENANT_PROJECT_ID.iam.gserviceaccount.com.

      Replace TENANT_PROJECT_ID with the tenant project ID for your instance. To find the tenant project ID, go to the Instances page and click the instance name to open the instance details.

    4. Select the Data Lineage Events Producer role.

    5. Click Save.
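
Steps 2 and 3 above can also be sketched from the command line. The following is a minimal sketch, not the documented procedure: it assumes the gcloud CLI is installed and authenticated, and the project names, the DRY_RUN switch, and the run helper are hypothetical. The service account address uses the format given in the manual-entry step; confirm the actual address for your instance before granting the role.

```shell
#!/bin/sh
# Sketch: enable the Data Lineage API and grant the producer role.
# PROJECT_ID and TENANT_PROJECT_ID are placeholders (assumptions); replace them.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project}"
TENANT_PROJECT_ID="${TENANT_PROJECT_ID:-my-tenant-project}"
# Address format taken from the manual-entry step; verify it for your instance.
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-datafusion-system@${TENANT_PROJECT_ID}.iam.gserviceaccount.com}"
# 1 = only print the commands (default), 0 = actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 2: enable the Data Lineage API in the instance's project.
enable_lineage_api() {
  run gcloud services enable datalineage.googleapis.com --project="$PROJECT_ID"
}

# Step 3: grant the Data Lineage Events Producer role to the service account.
grant_producer_role() {
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/datalineage.producer"
}

enable_lineage_api
grant_producer_role
```

By default the script only prints the two gcloud commands so that you can review them; set DRY_RUN=0 to run them.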

Enable Dataplex data lineage in Cloud Data Fusion

For new instances in Cloud Data Fusion, Dataplex data lineage is turned off by default. For instances created before January 27, 2024 with version 6.8.0 or later, it's turned on by default once you complete the steps in Before you begin.

REST API

Enable Dataplex data lineage when you create an instance

To enable Dataplex data lineage when you create an instance, set the optional dataplex_data_lineage_integration_enabled property to true:

echo '{ "description": "CDAPinstance", "dataplex_data_lineage_integration_enabled": "true"}' | curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data @- \
  "https://datafusion.googleapis.com/v1/projects/PROJECT/locations/LOCATION/instances?instanceId=INSTANCE_NAME"

To leave lineage turned off, either set the property to false or omit it; lineage is turned off by default when you create a new instance.
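
After you create or update an instance, you might want to read the setting back. The following is a minimal sketch that fetches the instance resource and reports the flag. It assumes the response uses the camelCase field name dataplexDataLineageIntegrationEnabled; the project, location, and instance values, the RUN switch, and both helper functions are hypothetical.

```shell
#!/bin/sh
# Sketch: read back the lineage setting of a Cloud Data Fusion instance.
# PROJECT, LOCATION, and INSTANCE_NAME are placeholders (assumptions).
set -eu

PROJECT="${PROJECT:-my-project}"
LOCATION="${LOCATION:-us-central1}"
INSTANCE_NAME="${INSTANCE_NAME:-my-instance}"

# Build the URL of the instance resource.
instance_url() {
  printf 'https://datafusion.googleapis.com/v1/projects/%s/locations/%s/instances/%s' \
    "$PROJECT" "$LOCATION" "$INSTANCE_NAME"
}

# Read instance JSON on stdin and print "true", "false", or "unset".
# The field name is an assumption about the response format.
lineage_flag() {
  json=$(tr -d ' \t\n')
  case "$json" in
    *'"dataplexDataLineageIntegrationEnabled":true'*) echo true ;;
    *'"dataplexDataLineageIntegrationEnabled":false'*) echo false ;;
    *) echo unset ;;
  esac
}

# RUN=1 sends the request; the default only prints what would be requested.
if [ "${RUN:-0}" = "1" ]; then
  curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "$(instance_url)" | lineage_flag
else
  echo "GET $(instance_url)"
fi
```

A result of unset means the field is missing from the response, which may simply indicate that lineage is turned off, because some APIs omit default values.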

Enable Dataplex data lineage in an existing instance

To enable Dataplex data lineage in an existing instance in Cloud Data Fusion, send a PATCH request to the instance that sets the dataplex_data_lineage_integration_enabled property to true and names the property in the updateMask query parameter:

echo '{ "description": "CDAPinstance", "dataplex_data_lineage_integration_enabled": "true"}' | curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data @- \
  "https://datafusion.googleapis.com/v1/projects/PROJECT/locations/LOCATION/instances/INSTANCE_NAME?updateMask=dataplex_data_lineage_integration_enabled"

Turn off Dataplex data lineage in an existing instance

To turn off Dataplex data lineage in an existing instance in Cloud Data Fusion, send a PATCH request to the instance that sets the dataplex_data_lineage_integration_enabled property to false and names the property in the updateMask query parameter:

echo '{ "description": "CDAPinstance", "dataplex_data_lineage_integration_enabled": "false"}' | curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data @- \
  "https://datafusion.googleapis.com/v1/projects/PROJECT/locations/LOCATION/instances/INSTANCE_NAME?updateMask=dataplex_data_lineage_integration_enabled"

View data lineage graphs

To view lineage graphs for entities across all Google Cloud services, do the following:

  1. Go to your instance in Cloud Data Fusion and run a data pipeline that uses supported plugins.

  2. On the Dataplex page in the Google Cloud console, find the asset for which you want lineage information and view its lineage graph.
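
As an alternative to the console, lineage records can be retrieved through the Data Lineage API. The following is a minimal sketch that searches for lineage links that end at a given asset, using the API's searchLinks method. The project, location, and fully qualified name are placeholder assumptions, and the RUN switch and helper functions are hypothetical.

```shell
#!/bin/sh
# Sketch: search for lineage links whose target is a given asset.
# PROJECT, LOCATION, and TARGET_FQN are placeholders (assumptions).
set -eu

PROJECT="${PROJECT:-my-project}"
LOCATION="${LOCATION:-us}"
TARGET_FQN="${TARGET_FQN:-bigquery:my-project.my_dataset.my_table}"

# Build the searchLinks endpoint for the project and location.
search_links_url() {
  printf 'https://datalineage.googleapis.com/v1/projects/%s/locations/%s:searchLinks' \
    "$PROJECT" "$LOCATION"
}

# Build the request body: links that end at the target asset.
search_links_body() {
  printf '{"target": {"fullyQualifiedName": "%s"}}' "$TARGET_FQN"
}

# RUN=1 sends the request; the default only prints the request for review.
if [ "${RUN:-0}" = "1" ]; then
  search_links_body | curl -s -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    --data @- \
    "$(search_links_url)"
else
  echo "POST $(search_links_url)"
  search_links_body
  echo
fi
```

To find links that start at an asset instead, the body would use source in place of target.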

Limitations

Viewing lineage in Dataplex has the following limitations:

What's next