Transforming and harmonizing healthcare data for BigQuery

This document describes the processes and considerations involved in harmonizing healthcare data on Google Cloud for researchers, data scientists, and IT teams who want to create an analytics data lake in BigQuery.

By harmonizing data, you bring together data of varying file formats and standards and then transform the data into one cohesive, standardized dataset ready for in-depth analysis. To harmonize data, you need to validate data sources and then build a series of pipelines to modify the data through structural transformation, conversion, identity resolution, and de-identification.

Validating source data

Clinical data can be inaccurate and incomplete in its raw form, so you need to assess the quality of incoming data prior to the data transformation. A validation check includes the following processes:

  • Ensures that the necessary data elements are present.
  • Confirms the record counts so that the source system and the ingestion data lake contain the same number of records.
  • Confirms that the value ranges are reasonable prior to investing compute resources.

Google Cloud provides the following ways to validate source data:

  • Cloud Data Fusion prepares data so that you can view and explore a small subset of the data (1,000 rows or fewer).
  • BigQuery lets you perform many validations with SQL, as shown in the sketch after this list. If there are discrepancies in the data, you can create a view to grant access to identified users for validation, or export selected rows to Google Sheets for annotation and collaboration with a team.
  • For automatic data quality checks, you can create user-defined directives (UDDs) by using Cloud Data Fusion. UDDs let you specify custom processing, data checks, and functions within Cloud Data Fusion. Creating UDDs also enables the automatic flagging of outliers.
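For example, the checks described earlier in this section can be expressed as SQL queries and run from a pipeline step. The following is a minimal sketch that uses the BigQuery Python client; the `staging.observations` table and its columns are hypothetical, so adapt the names and thresholds to your own schema.

```python
# A minimal sketch of SQL-based validation checks in BigQuery, assuming a
# hypothetical `staging.observations` table with `patient_id` and
# `heart_rate` columns.
from google.cloud import bigquery

client = bigquery.Client()

checks = {
    # Necessary data elements are present (no NULL patient identifiers).
    "missing_patient_id": """
        SELECT COUNT(*) AS failures
        FROM `staging.observations`
        WHERE patient_id IS NULL
    """,
    # Record count to compare against the count reported by the source system.
    "row_count": """
        SELECT COUNT(*) AS total_rows
        FROM `staging.observations`
    """,
    # Value ranges are plausible before you invest compute resources.
    "heart_rate_out_of_range": """
        SELECT COUNT(*) AS failures
        FROM `staging.observations`
        WHERE heart_rate NOT BETWEEN 20 AND 300
    """,
}

for name, sql in checks.items():
    row = list(client.query(sql).result())[0]
    print(name, dict(row.items()))
```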

Transforming data and building a pipeline

Structural transformation is an important step in building a data transformation pipeline and includes field mapping, parsing, and formatting data.

If you use the Cloud Healthcare API as your ingestion method, there's no need to parse or index data. The Cloud Healthcare API offers flexibility with incoming data formats, accepting resources and bundles in both multiline JSON and newline-delimited JSON formats. For example, if you're working with the Fast Healthcare Interoperability Resources (FHIR) data type, you don't need to develop a custom JSON parser. Instead, you can use the built-in functionalities of the Cloud Healthcare API to ingest the data.
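For example, the following minimal sketch bulk-imports newline-delimited FHIR resources from Cloud Storage by calling the Cloud Healthcare API `fhirStores.import` method; the project, location, dataset, FHIR store, and bucket names are placeholders.

```python
# A minimal sketch of importing ndjson FHIR resources with the Cloud
# Healthcare API; all resource names below are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

store = (
    "projects/my-project/locations/us-central1"
    "/datasets/my-dataset/fhirStores/my-fhir-store"
)
body = {
    "contentStructure": "RESOURCE",  # each ndjson line is one FHIR resource
    "gcsSource": {"uri": "gs://my-bucket/fhir/*.ndjson"},
}
response = session.post(
    f"https://healthcare.googleapis.com/v1/{store}:import", json=body
)
response.raise_for_status()
print(response.json())  # returns a long-running operation to poll
```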

You can also perform structural transformation for conversions, such as converting the HL7v2 data type to the FHIR data type, or converting the FHIR data type to the SQL on FHIR schema.
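As a sketch of the FHIR-to-SQL-on-FHIR direction, the following request exports a FHIR store to BigQuery with the analytics schema, which is based on SQL on FHIR; the resource names and the destination dataset are placeholders.

```python
# A minimal sketch of exporting a FHIR store to BigQuery with the
# analytics (SQL on FHIR-based) schema; names are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

store = (
    "projects/my-project/locations/us-central1"
    "/datasets/my-dataset/fhirStores/my-fhir-store"
)
body = {
    "bigqueryDestination": {
        "datasetUri": "bq://my-project.fhir_analytics",
        "schemaConfig": {
            "schemaType": "ANALYTICS",
            "recursiveStructureDepth": 2,  # a typical depth; tune as needed
        },
    }
}
response = session.post(
    f"https://healthcare.googleapis.com/v1/{store}:export", json=body
)
response.raise_for_status()
print(response.json())  # long-running operation to poll
```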

Cloud Data Fusion has a wide range of prebuilt plugins for parsing, formatting, compressing, and converting data. Cloud Data Fusion also includes Wrangler, a visualization tool that interactively filters, cleans, formats, and projects the data, based on a small sample (1000 rows) of the dataset. After you define the transformation, Wrangler automatically generates a Dataproc pipeline that applies the changes to the full dataset.

For more advanced transformations, you can use a configuration-driven mapping language for definition, management, and portability of structural mappings.
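The mapping language itself is beyond the scope of this document, but the following hypothetical Python sketch illustrates the underlying idea: the structural mapping is declared as configuration data rather than code, so it can be versioned, reviewed, and ported between pipelines independently.

```python
# An illustrative, hypothetical configuration-driven mapping: target fields
# are declared as dotted paths into the source record. This is a concept
# sketch, not the actual mapping language.
from typing import Any

# Declarative config: target field -> dotted path in the source record.
MAPPING = {
    "patient_id": "subject.reference",
    "code": "code.coding.0.code",
    "value": "valueQuantity.value",
    "unit": "valueQuantity.unit",
}

def get_path(record: dict[str, Any], path: str) -> Any:
    """Walk a dotted path, treating numeric segments as list indexes."""
    current: Any = record
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current.get(segment)
        if current is None:
            return None
    return current

def apply_mapping(record: dict[str, Any]) -> dict[str, Any]:
    return {target: get_path(record, path) for target, path in MAPPING.items()}

observation = {
    "subject": {"reference": "Patient/123"},
    "code": {"coding": [{"code": "29463-7"}]},  # LOINC code for body weight
    "valueQuantity": {"value": 70.5, "unit": "kg"},
}
print(apply_mapping(observation))
```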

Resolving patient identity

An important step in data transformation is matching records to their corresponding unique patient identifiers (UPIs). By matching records, you can create a longitudinal record that can help to identify inconsistencies and duplications in the data, and can also help to ensure consistent data transformations across individual patient records. You can match patient records by calling a master patient index (MPI) with custom pipelines or models. If you're consolidating the data through FHIR, there's also a patient identifier search parameter that you can use to join records.
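For example, the following minimal sketch looks up records through the FHIR `identifier` search parameter on a Cloud Healthcare API FHIR store; the store path and the identifier system are placeholders, and a production MPI integration would wrap this lookup in a matching pipeline.

```python
# A minimal sketch of matching patient records by the FHIR `identifier`
# search parameter; the store path and identifier system are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

store = (
    "projects/my-project/locations/us-central1"
    "/datasets/my-dataset/fhirStores/my-fhir-store"
)
# Search for the Patient resource that carries a given MPI identifier.
response = session.get(
    f"https://healthcare.googleapis.com/v1/{store}/fhir/Patient",
    params={"identifier": "http://example.org/mpi|UPI-0001"},
)
response.raise_for_status()
bundle = response.json()  # a FHIR searchset Bundle
print(bundle.get("total"), "matching patient record(s)")
```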

Converting terminology

Clinical terminology environments change often as coding systems, governing bodies, and organizational needs evolve. When you combine historical records with newly created resources, inconsistencies can emerge between former and current terminology. Converting and harmonizing clinical terms helps to ensure consistency and continuity between old and new terminology standards.
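As a simplified illustration, the following Python sketch harmonizes legacy ICD-9 codes to ICD-10. The two sample pairs are real mappings, but a production pipeline would use a complete, curated crosswalk or a FHIR ConceptMap.

```python
# A minimal sketch of harmonizing legacy codes to a current standard with a
# lookup table; the pairs below are illustrative samples only.
ICD9_TO_ICD10 = {
    "250.00": "E11.9",  # type 2 diabetes mellitus without complications
    "401.9": "I10",     # essential (primary) hypertension
}

def harmonize_code(code: str, system: str) -> tuple[str, str]:
    """Translate ICD-9 codes to ICD-10; pass current codes through."""
    if system == "ICD-9" and code in ICD9_TO_ICD10:
        return ICD9_TO_ICD10[code], "ICD-10"
    return code, system

print(harmonize_code("250.00", "ICD-9"))  # ('E11.9', 'ICD-10')
```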

Converting units of measure

Units of measure can vary widely, based on the ingestion sources, the organizations doing the measurement, and the geographical region. In the US, for example, a baby's weight is often measured in grams, whereas an adult's weight is typically measured in pounds. Converting and harmonizing units of measurement helps to ensure that analytics captures all units of measure and includes all patient populations.
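For example, the following minimal sketch normalizes weights to a single canonical unit so that gram-denominated and pound-denominated records can be analyzed together.

```python
# A minimal sketch of normalizing weights to kilograms as the canonical unit.
GRAMS_PER_KG = 1000.0
POUNDS_PER_KG = 2.20462

def to_kilograms(value: float, unit: str) -> float:
    conversions = {
        "g": lambda v: v / GRAMS_PER_KG,
        "lb": lambda v: v / POUNDS_PER_KG,
        "kg": lambda v: v,
    }
    return conversions[unit](value)

print(to_kilograms(3400, "g"))  # newborn weight recorded in grams
print(to_kilograms(154, "lb"))  # adult weight recorded in pounds
```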

De-identifying data

De-identification is the last step in the transformation process because it can interfere with the ability to harmonize data, especially in identity resolution and clinical entity extraction. For example, if you de-identify too early in the pipeline, you might not be able to perform the accurate identity resolution that's needed for harmonization. Google Cloud offers many options, including customizations, to help you de-identify and pseudonymize healthcare data.

For unstructured text data or traditional structured data such as CSVs, you can use Sensitive Data Protection to classify and redact sensitive data elements. Sensitive Data Protection also lets you customize de-identification for your security needs and use cases by using techniques such as masking, secure hashing, tokenization, bucketing, and format-preserving encryption.
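The following minimal sketch masks person names and phone numbers in free text by using the Sensitive Data Protection `deidentify_content` method; the project ID and the sample text are placeholders.

```python
# A minimal sketch of masking sensitive elements in free text with the
# Sensitive Data Protection (Cloud DLP) API; the project ID is hypothetical.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "PHONE_NUMBER"}]
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        # Applies to all inspected infoTypes above.
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }
                ]
            }
        },
        "item": {"value": "Patient John Doe called from 555-867-5309."},
    }
)
print(response.item.value)
```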

The Cloud Healthcare API also has built-in de-identification functionality, which is available for Digital Imaging and Communications in Medicine (DICOM) and FHIR datasets. This option is useful when you want to preserve the initial data model.
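As a sketch, the following request de-identifies a FHIR store by writing the results to a destination store through the `fhirStores.deidentify` method; the store names are placeholders, and the empty `config` relies on the API's default de-identification behavior, which you would normally tailor to your use case.

```python
# A minimal sketch of de-identifying a FHIR store within the Cloud
# Healthcare API data model; all resource names are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

source = (
    "projects/my-project/locations/us-central1"
    "/datasets/my-dataset/fhirStores/my-fhir-store"
)
destination = (
    "projects/my-project/locations/us-central1"
    "/datasets/my-dataset/fhirStores/my-deid-store"
)
response = session.post(
    f"https://healthcare.googleapis.com/v1/{source}:deidentify",
    # An empty config uses default behavior; customize for production.
    json={"destinationStore": destination, "config": {}},
)
response.raise_for_status()
print(response.json())  # long-running operation to poll
```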

If you're handling time-series data in the FHIR format, you can preserve the sequence of resources based on the UPI. This means that when you use the date-shift functionality, you can be sure that the order of the resources is maintained.
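The same consistency idea can be illustrated with Sensitive Data Protection's date-shift transformation: keying the shift on a patient identifier applies the same offset to every record for that patient, which preserves the relative order of events within that patient's timeline. The following sketch uses hypothetical field names and a transient key.

```python
# A minimal sketch of consistent date shifting keyed on a patient
# identifier, using Sensitive Data Protection record transformations.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

table = {
    "headers": [{"name": "patient_id"}, {"name": "encounter_date"}],
    "rows": [
        {"values": [{"string_value": "UPI-0001"},
                    {"date_value": {"year": 2023, "month": 5, "day": 1}}]},
        {"values": [{"string_value": "UPI-0001"},
                    {"date_value": {"year": 2023, "month": 6, "day": 15}}]},
    ],
}
response = client.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",
        "deidentify_config": {
            "record_transformations": {
                "field_transformations": [
                    {
                        "fields": [{"name": "encounter_date"}],
                        "primitive_transformation": {
                            "date_shift_config": {
                                "upper_bound_days": 100,
                                "lower_bound_days": -100,
                                # Same patient_id -> same shift offset.
                                "context": {"name": "patient_id"},
                                "crypto_key": {
                                    "transient": {"name": "shift-key"}
                                },
                            }
                        },
                    }
                ]
            }
        },
        "item": {"table": table},
    }
)
print(response.item.table)
```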

Using intermediate storage

As a best practice, store the results of each transformation in intermediate storage, an ephemeral staging layer, so that you can troubleshoot issues or return to a previous state of the data in a disaster recovery case. You can rewrite intermediate storage every time the pipeline runs, in the form of a raw export to Cloud Storage or to BigQuery.
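For example, the following minimal sketch materializes a pipeline stage as a raw newline-delimited JSON export from BigQuery to Cloud Storage, so the stage can be inspected or replayed later; the table and bucket names are placeholders.

```python
# A minimal sketch of exporting an intermediate BigQuery table to Cloud
# Storage as ndjson; table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
job = client.extract_table(
    "my-project.staging.observations_transformed",
    "gs://my-bucket/intermediate/observations-*.json",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
)
job.result()  # wait for the export to finish
```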

Understanding the pipeline architecture

The following diagram shows the pipelines for data transformation and harmonization.

Pipelines for data transformation and harmonization.

The preceding diagram shows the full process of data moving through the pipelines, including all the steps involved in transformation and harmonization. The data is first ingested as raw data, and then proceeds through the structural transformation pipeline and the identity resolution pipeline. The data then undergoes additional transformations—for example, terminology conversion and de-identification—and exits through the output pipeline to BigQuery, where it is stored as harmonized data ready for analytics. In addition, metadata in intermediate storage is automatically written at runtime and stored in BigQuery as provenance and lineage data.

The diagram also shows two intermediate storage steps where, as a best practice, you can store each dataset in between pipeline stages.

What's next