Integrate with OpenLineage

OpenLineage is an open platform for collecting and analyzing data lineage information. It defines an open standard for lineage data and captures lineage events from data pipeline components that use an OpenLineage API to report on runs, jobs, and datasets.

Through the Data Lineage API, you can import OpenLineage events to display in the Dataplex web interface alongside lineage information from Google Cloud services, such as BigQuery, Cloud Composer, Cloud Data Fusion, and Dataproc.

To import events that follow the OpenLineage specification, use the ProcessOpenLineageRunEvent REST API method, which maps OpenLineage facets to Data Lineage API attributes.

Limitations

- The Data Lineage API supports OpenLineage major versions 1 and 2.
- The Data Lineage API doesn't support the following:
  - Any subsequent OpenLineage release with message format changes
  - DatasetEvent
  - JobEvent
- The maximum size of a single message is 5 MB.
- The length of each fully qualified name (FQN) in inputs and outputs is limited to 4000 characters.
- Links are grouped by events with 100 links. The maximum aggregate number of links is 1000.
- Dataplex displays a lineage graph for each job run, showing the inputs and outputs of lineage events. It doesn't support lower-level processes like Spark stages.
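
Some of these limits can be checked on the client before an event is sent. The following is a minimal sketch, not part of the Data Lineage API: the function names are hypothetical, "5 MB" is read as 5 × 1024 × 1024 bytes, the FQN is assumed to be the namespace and name joined with ":", and an event is treated as a RunEvent when it carries a `run` object.

```python
import json

MAX_MESSAGE_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB
MAX_FQN_CHARS = 4000


def fqn(dataset):
    # Assumption: namespace and name are joined with ":".
    # The limits above only refer to a "fully qualified name".
    return f"{dataset['namespace']}:{dataset['name']}"


def check_event(event):
    """Return a list of problems; an empty list means no checked limit was hit."""
    problems = []

    # Heuristic: a RunEvent carries a "run" object, while DatasetEvent
    # and JobEvent (both unsupported) do not.
    if "run" not in event:
        problems.append("not a RunEvent (no 'run' field)")

    # Size of the JSON-serialized message in bytes.
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_MESSAGE_BYTES:
        problems.append(f"message is {size} bytes, over the 5 MB limit")

    # FQN length for every input and output dataset.
    for side in ("inputs", "outputs"):
        for dataset in event.get(side, []):
            if len(fqn(dataset)) > MAX_FQN_CHARS:
                problems.append(f"{side} FQN longer than {MAX_FQN_CHARS} characters")

    return problems
```

The link-count limits are not checked here because how links are counted across events is decided by the service.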

OpenLineage mapping

The REST API method ProcessOpenLineageRunEvent maps OpenLineage attributes to Data Lineage API attributes as follows:

Data Lineage API attributes	OpenLineage attributes
Process.name	projects/`PROJECT_NUMBER`/locations/`LOCATION`/processes/`HASH_OF_NAMESPACE_AND_NAME`
Process.displayName	Job.namespace + ":" + Job.name
Process.attributes	Job.facets (see Stored data)
Run.name	projects/`PROJECT_NUMBER`/locations/`LOCATION`/processes/`HASH_OF_NAMESPACE_AND_NAME`/runs/`HASH_OF_RUNID`
Run.displayName	Run.runId
Run.attributes	Run.facets (see Stored data)
Run.startTime	eventTime
Run.endTime	eventTime
Run.state	eventType
LineageEvent.name	projects/`PROJECT_NUMBER`/locations/`LOCATION`/processes/`HASH_OF_NAMESPACE_AND_NAME`/runs/`HASH_OF_RUNID`/lineageEvents/`HASH_OF_JOB_RUN_INPUT_OUTPUTS_OF_EVENT` (for example, projects/11111111/locations/us/processes/1234/runs/4321/lineageEvents/111-222-333)
LineageEvent.EventLinks.source	inputs (the FQN is the concatenation of namespace and name)
LineageEvent.EventLinks.target	outputs (the FQN is the concatenation of namespace and name)
LineageEvent.startTime	eventTime
LineageEvent.endTime	eventTime
requestId	Defined by the method user
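
To make the table concrete, the sketch below applies the field-level mappings to an OpenLineage run event held as a Python dictionary. It is only an illustration of the table, not the service's implementation: `map_run_event` and the shape of the returned dictionary are hypothetical, the ":" separator for dataset FQNs is an assumption, and the resource names are left out because the hash function they depend on is not described here.

```python
def map_run_event(event):
    """Illustrate the mapping table for one OpenLineage run event (a dict)."""
    job = event["job"]
    event_time = event["eventTime"]

    def fqns(datasets):
        # Assumption: FQN = namespace + ":" + name.
        return [f"{d['namespace']}:{d['name']}" for d in datasets]

    return {
        "process": {
            # Process.displayName = Job.namespace + ":" + Job.name
            "displayName": f"{job['namespace']}:{job['name']}",
        },
        "run": {
            "displayName": event["run"]["runId"],
            "startTime": event_time,
            "endTime": event_time,
            # Raw eventType; any translation to the API's state values is not shown.
            "state": event["eventType"],
        },
        "lineageEvent": {
            "sources": fqns(event.get("inputs", [])),
            "targets": fqns(event.get("outputs", [])),
            "startTime": event_time,
            "endTime": event_time,
        },
    }
```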

Import an OpenLineage event

If you haven't yet set up OpenLineage, see Getting started.

To import an OpenLineage event into Dataplex, call the REST API method ProcessOpenLineageRunEvent:

curl -X POST "https://datalineage.googleapis.com/v1/projects/{project}/locations/{location}:processOpenLineageRunEvent" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data '{"eventTime":"2023-04-04T13:21:16.098Z","eventType":"COMPLETE","inputs":[{"name":"somename","namespace":"somenamespace"}],"job":{"name":"somename","namespace":"somenamespace"},"outputs":[{"name":"somename","namespace":"somenamespace"}],"producer":"someproducer","run":{"runId":"somerunid"},"schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"}'
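
The same call can be prepared from Python using only the standard library. This is a sketch under stated assumptions: `build_import_request` is a hypothetical helper, and the caller is assumed to already hold an OAuth 2.0 access token. The function only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_import_request(project, location, event, access_token):
    """Build (but do not send) a POST request to processOpenLineageRunEvent."""
    url = (
        f"https://datalineage.googleapis.com/v1/projects/{project}"
        f"/locations/{location}:processOpenLineageRunEvent"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),  # the OpenLineage event as the JSON body
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the event, pass the returned object to `urllib.request.urlopen`.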

Analyze information from OpenLineage

To analyze the imported OpenLineage events, see View lineage graphs in Dataplex UI.

Stored data

The Data Lineage API doesn't store all facet data from OpenLineage messages. It stores only the following facet fields:

- spark_version
  - openlineage-spark-version
  - spark-version
- all spark.logicalPlan.*
- environment-properties (custom Google Cloud lineage facet)
  - origin.sourcetype and origin.name
  - spark.app.id
  - spark.app.name
  - spark.batch.id
  - spark.batch.uuid
  - spark.cluster.name
  - spark.cluster.region
  - spark.job.id
  - spark.job.uuid
  - spark.project.id
  - spark.query.node.name
  - spark.session.id
  - spark.session.uuid

In addition to facet fields, the Data Lineage API stores the following event information:

- eventTime
- run.runId
- job.namespace
- job.name