Tracking provenance and lineage metadata for healthcare data

This document describes how researchers, data scientists, and IT teams can track provenance and lineage metadata for healthcare data in Google Cloud.

Provenance and lineage metadata can help healthcare organizations track where their clinical and operational data comes from, what happens to the data, and where it is stored. This tracking can help your organization to achieve the following goals when working with healthcare data:

  • Adhere to organizational policies and external requirements.
  • Produce repeatable, reproducible, and justifiable data processing workloads.

Provenance and lineage metadata can be tracked at several levels of granularity, depending on the use case. This document covers three levels: dataset, field (column), and patient record. It shows how built-in functionality in Google Cloud lets you access and track provenance and lineage metadata at each of these levels.

Data provenance

Data provenance is the origin of your data. It's important to keep track of which source produces which data, especially when you're harmonizing multiple data sources to a common schema. For more information, see Transforming and harmonizing data for BigQuery.

Provenance information is also useful when you're running data quality checks or performing data profiling. For example, if you know the data's origin, you can decide whether the data meets your quality standards or if it needs to be cleansed.

There are multiple ways to track provenance in Google Cloud. For example, you can track the provenance of arbitrary datasets, such as those in Cloud Storage, by using a filename convention or a folder structure. If the data source is defined in the filename convention, you can use Cloud Data Fusion to parse the filename and add the source system as a structured data element into the dataset. This lets downstream users filter by source system and run validation checks based on data provenance. For example, the following filename structure is parsed into multiple sections:

gs://bucket-name/data-source/data-type/data-name-and-time

In the preceding example, each data source has its own top-level folder in the bucket, and each data type has a subfolder within that folder. The object name combines the name of the data and its timestamp. This convention is parsed during processing so that the bucket, the folders, and the name can each be added as separate data elements in the final output.
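
The parsing step can be illustrated outside of any particular tool. The following Python sketch is not Cloud Data Fusion code; it only shows how a path that follows the convention above splits into separate data elements. The class name, function name, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectProvenance:
    """Data elements recovered from the filename convention (hypothetical names)."""
    bucket: str
    data_source: str
    data_type: str
    name_and_time: str


def parse_provenance(uri: str) -> ObjectProvenance:
    """Split gs://bucket/data-source/data-type/data-name-and-time into its parts."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    # Split at most three times so that any extra slashes stay in the object name.
    parts = uri[len(prefix):].split("/", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"URI does not match the expected convention: {uri!r}")
    return ObjectProvenance(*parts)


example = parse_provenance("gs://bucket-name/data-source/data-type/data-name-and-time")
print(example.data_source)  # prints "data-source"
```

Once the source system is a separate field, downstream users can filter on it or apply source-specific validation rules.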

FHIR provenance resource

The Fast Healthcare Interoperability Resources (FHIR) Specification, an established standard for exchanging healthcare information electronically, includes a resource for maintaining provenance information. If you use the Google Cloud tools for structural transformations, you can use the FHIR Provenance resource to track those transformations and mappings. Each element that you map outputs one Provenance resource, regardless of how many FHIR resources that element produces. Because a Provenance resource references the individual FHIR resources that it describes, it lets you track lineage at the level of patient records.
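
To make the shape of such a resource concrete, the following Python sketch builds a minimal FHIR R4 Provenance resource as a plain dictionary. It is an illustration only, not the output of any Google Cloud tool: the function name and the example references (such as Device/mapping-engine) are hypothetical, and the resource that your mapping tool emits may carry different or additional fields.

```python
import json
from datetime import datetime, timezone


def build_provenance(target_refs, agent_ref, source_ref, recorded=None):
    """Return a minimal FHIR R4 Provenance resource as a plain dict."""
    if not target_refs:
        raise ValueError("Provenance needs at least one target")
    recorded = recorded or datetime.now(timezone.utc)
    return {
        "resourceType": "Provenance",
        # One Provenance can point at every FHIR resource produced by one mapping.
        "target": [{"reference": ref} for ref in target_refs],
        # FHIR 'instant' values need a time zone, so use an aware datetime.
        "recorded": recorded.isoformat(),
        "agent": [{"who": {"reference": agent_ref}}],
        "entity": [{"role": "source", "what": {"reference": source_ref}}],
    }


provenance = build_provenance(
    target_refs=["Patient/example-1", "Encounter/example-2"],  # hypothetical IDs
    agent_ref="Device/mapping-engine",                         # hypothetical agent
    source_ref="DocumentReference/source-message",             # hypothetical source
)
print(json.dumps(provenance, indent=2))
```

In this sketch, one mapped element that produced two FHIR resources yields a single Provenance resource with two targets.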

Data lineage

Data lineage is the record of what happens to the data at every step of the pipeline. It's important to track which transformations are applied to which data in case you need to reproduce a result or provide information to a third party. Cloud Data Fusion automatically tracks data lineage for all integrated datasets at the dataset level and the field level. This automatic capture reduces the work of managing lineage data and helps users to understand data pipelines.

As a fully managed data integration service, Cloud Data Fusion provides a graphical user interface (GUI) that lets you visually track pipelines and data fields, and an API that lets you extract the lineage data stored in Cloud Data Fusion. Together, these two interfaces let you combine Cloud Data Fusion lineage with lineage data from other sources, including on-premises systems, to manage data transformations across the ecosystem. Currently, Cloud Data Fusion supports lineage at the dataset level and the field level.
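
As a rough illustration of extracting lineage through the API, the following Python sketch builds request URLs for dataset-level and field-level lineage. The path layout is an assumption based on the lineage endpoints of the open source CDAP REST API, so verify it against the documentation for your instance version. The endpoint, namespace, and dataset names are hypothetical, and the sketch builds the URL only; it does not send a request.

```python
from urllib.parse import quote, urlencode


def lineage_url(api_endpoint, namespace, dataset, start, end, field_level=False):
    """Build a lineage request URL (path layout assumed from the CDAP REST API)."""
    path = (
        f"/v3/namespaces/{quote(namespace, safe='')}"
        f"/datasets/{quote(dataset, safe='')}/lineage"
    )
    if field_level:
        # Assumed sub-path for field-level (column-level) lineage.
        path += "/fields"
    query = urlencode({"start": start, "end": end})
    return f"{api_endpoint.rstrip('/')}{path}?{query}"


# Hypothetical instance endpoint and dataset name.
endpoint = "https://example.invalid/api"
print(lineage_url(endpoint, "default", "patients", "now-7d", "now"))
print(lineage_url(endpoint, "default", "patients", "now-7d", "now", field_level=True))
```

A real request would also need an OAuth 2.0 access token in the Authorization header and the API endpoint of your own instance.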

Best practices

Some best practices for tracking provenance and lineage data in Google Cloud are as follows:

  • Enable Cloud Logging when you create a Cloud Data Fusion instance. Also enable Cloud Logging with the Cloud Healthcare API and with any additional cloud-based tool or product that you use.
  • Use Cloud Data Fusion for as much of your pipeline as possible, because it can track lineage only for processes that run inside the instance. If transformations happen outside the instance (for example, in a different cloud or on-premises), make sure that you have a documented process in place to track that data. Alternatively, you can use the open source Cask Data Application Platform (CDAP) to capture lineage information.
  • Synchronize the data tags and the metadata tags across your organization so that the tags are searchable across business units.
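
For transformations that run outside Cloud Data Fusion, one lightweight option is to have each step emit a structured lineage record to your logs. The following Python sketch shows the idea with the standard library only. The record layout, field names, step name, and paths are hypothetical and are not part of any Google Cloud API.

```python
import json
import logging
from datetime import datetime, timezone


def lineage_event(step, inputs, outputs, when=None):
    """Describe one transformation step as a JSON-serializable record."""
    when = when or datetime.now(timezone.utc)
    return {
        "event": "lineage",
        "step": step,
        # Sort so that the same step always produces the same record layout.
        "inputs": sorted(inputs),
        "outputs": sorted(outputs),
        "recorded": when.isoformat(),
    }


logging.basicConfig(level=logging.INFO, format="%(message)s")
event = lineage_event(
    step="deidentify-labs",                        # hypothetical step name
    inputs=["gs://bucket-name/data-source/raw"],   # hypothetical paths
    outputs=["gs://bucket-name/data-source/clean"],
)
# One JSON object per line is easy to search and to export later.
logging.getLogger("lineage").info(json.dumps(event))
```

Using the same field names in every pipeline keeps these records searchable across business units.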

What's next