Ingesting clinical and operational data with Cloud Data Fusion

This document explains how researchers, data scientists, and IT teams can use Cloud Data Fusion to unlock data by ingesting, transforming, and storing it in BigQuery, which serves as an aggregated data warehouse on Google Cloud.

Healthcare organizations rely on data to drive their healthcare analytics use cases, but most of the data is locked up in siloed systems. This document shows how you can access this data with Cloud Data Fusion.

Using Cloud Data Fusion as a data integration service

Cloud Data Fusion is a fully managed, cloud-native data integration service with a broad library of open source transformations and more than 100 available plugins that provide connectivity to a wide array of systems and data formats.

Cloud Data Fusion lets you ingest and integrate raw data from various sources and transform that data. For example, you can use Cloud Data Fusion to blend or join data sources before writing to BigQuery to analyze the data.
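To make the blend-and-join step concrete, the sketch below uses pure Python as a stand-in for a Data Fusion join transform: it merges records from two hypothetical sources on a shared key before the result would be written to a sink. The field names (patient_id, name, visit_date) are illustrative, not from any real schema.

```python
# Minimal sketch of a join transform: blend records from two sources
# on a shared key before writing the result to a sink such as BigQuery.
# The field names (patient_id, name, visit_date) are hypothetical.

def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key field."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        if match:
            # Right-side fields win on a name conflict.
            joined.append({**row, **match})
    return joined

ehr_records = [{"patient_id": "p1", "name": "Ana"},
               {"patient_id": "p2", "name": "Ben"}]
visit_records = [{"patient_id": "p1", "visit_date": "2024-01-15"}]

blended = join_on_key(ehr_records, visit_records, "patient_id")
```

In Cloud Data Fusion itself, this kind of step is configured visually with the Joiner plugin rather than written by hand; the code only illustrates the shape of the operation.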

Raw data is drawn from data sources, which can include relational databases, file systems, mainframes and other legacy systems, public cloud systems, and Google Cloud. Cloud Data Fusion destinations, also known as sinks, are the locations where the data is written—for example, Cloud Storage and BigQuery.

Using Cloud Storage as a data lake

You can use Cloud Storage as the collection point for the data that you plan to move into the cloud, and you can also use it as a data lake. With its many connectors, Cloud Data Fusion populates the data lake from on-premises systems.
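A common data-lake practice is to organize raw objects under source- and date-based prefixes so that downstream pipelines can find newly landed data. The layout below is a hypothetical convention, not a Cloud Storage requirement (Cloud Storage imposes no directory structure), and the bucket name is a placeholder.

```python
# Sketch of a partitioned object-naming convention for a Cloud Storage
# data lake. The bucket name and prefix layout are hypothetical choices.
from datetime import date

def lake_object_path(bucket, source_system, file_name, ingest_date=None):
    """Build a gs:// URI with source and date partitions in the prefix."""
    d = ingest_date or date.today()
    return (f"gs://{bucket}/raw/source={source_system}/"
            f"dt={d.isoformat()}/{file_name}")

path = lake_object_path("example-lake", "ehr", "patients.csv",
                        ingest_date=date(2024, 1, 15))
```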

Ingesting clinical data types by using the Cloud Healthcare API

The Cloud Healthcare API provides a managed solution for ingesting, storing, and accessing healthcare data on Google Cloud by creating a critical bridge between care systems and applications hosted on the cloud. In the Cloud Healthcare API, each modality-specific data store and its associated API conform to current standards. The Cloud Healthcare API supports Fast Healthcare Interoperability Resources (FHIR), HL7v2, and Digital Imaging and Communications in Medicine (DICOM) data types. For more information, see Getting to know the Cloud Healthcare API.
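For orientation, a FHIR resource is a JSON document identified by its resourceType field. The minimal Patient below is a hedged sketch of the kind of payload a FHIR store accepts; the identifier system URI and all values are placeholders, not real assignments.

```python
import json

# Minimal sketch of a FHIR R4 Patient resource. The identifier system
# URI and the demographic values are placeholders.
patient = {
    "resourceType": "Patient",
    "identifier": [{"system": "urn:example:mrn", "value": "12345"}],
    "name": [{"family": "Doe", "given": ["Jan"]}],
    "birthDate": "1980-01-01",
}

payload = json.dumps(patient)
```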

Recently, healthcare organizations have been adopting the FHIR standard in electronic health records (EHRs) and other healthcare systems to expand their ability to query clinical data across organizations. If your organization has access to FHIR, you can use the Cloud Healthcare API to ingest FHIR data for bulk uploads of clinical data.

The Cloud Healthcare API supports multiple versions of FHIR. For more information about supported versions and functionality, see the FHIR conformance statement.
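For bulk uploads, the Cloud Healthcare API exposes an import method on FHIR stores that reads newline-delimited JSON from Cloud Storage. The sketch below only constructs the request body for that fhirStores.import call; the bucket URI is a placeholder, the authenticated HTTP client needed to send the request is omitted, and you should confirm the field names against the current API reference.

```python
# Sketch of the request body for the Cloud Healthcare API
# fhirStores.import method (POST .../fhirStores/{store}:import).
# The bucket URI is a placeholder; sending the request requires an
# authenticated client, which is omitted here.
import_body = {
    "contentStructure": "RESOURCE",  # one FHIR resource per NDJSON line
    "gcsSource": {"uri": "gs://example-bucket/fhir/*.ndjson"},
}
```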

Ingesting other structured data

For expanded data integration capabilities, the Google Cloud products discussed in this document can handle common structured data formats such as CSV, JSON, Avro, ORC, and Parquet. In addition, Cloud Storage can ingest any data format as blob storage. For more information, see how to load data from Cloud Storage to BigQuery.

The open source raw data importer for BigQuery can import raw data into BigQuery and has the following features:

  • Automatic decompression of input files, with support for a variety of formats, including gzip, LZ4, tar, and zip file formats
  • Full dataset schema detection
  • Proper parallelization built on top of Dataflow
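As a rough illustration of what dataset schema detection involves, the sketch below infers a BigQuery-style type for each column from sampled CSV values. It is a simplified stand-in for the importer's logic, not its actual code, and the sample data is invented.

```python
import csv
import io

# Simplified sketch of CSV schema detection: infer a BigQuery-style
# type per column by sampling its values. Much cruder than the
# importer's real detection; for illustration only.

def infer_type(values):
    """Return INTEGER, FLOAT, or STRING for a column's sampled values."""
    try:
        for v in values:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "FLOAT"
    except ValueError:
        return "STRING"

def detect_schema(csv_text):
    """Map each header name to an inferred column type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {name: infer_type([r[i] for r in data])
            for i, name in enumerate(header)}

schema = detect_schema("id,weight_kg,name\n1,70.5,Ana\n2,81,Ben")
```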

The data importer tool is not limited to healthcare data. You can use the tool to import any kind of dataset in a supported format to BigQuery for further analysis. Currently, the tool supports the CSV format.

Loading data

There are two forms of data loading: full and incremental. The initial full load consists of batch-loading data that resides in on-premises data warehouses into the cloud data warehouse, BigQuery. This full load is performed only once.

An incremental loading process often follows the initial full ingestion, with the objective of keeping the data in the cloud in sync with the primary data storage. Incremental loads can take the form of periodic database dumps or real-time streaming. For periodic updates, you can load a batch of database updates to Cloud Storage and then incorporate the updates into the cloud data warehouse. For real-time updates, you can either set up real-time replication from online transaction processing (OLTP) databases or use messaging protocols, such as HL7v2 streaming. For more information, see options for cloud data transfer.
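A common pattern for the periodic-update path is watermark-based extraction: track the latest update timestamp you have loaded, and on each run extract only newer rows. The sketch below illustrates the idea with a hypothetical updated_at field and an in-memory list standing in for a source database.

```python
# Sketch of watermark-based incremental loading: each run extracts only
# rows modified since the previous watermark, then advances it. The
# updated_at field and in-memory rows are hypothetical stand-ins for a
# source database; ISO 8601 timestamps compare correctly as strings.

def incremental_extract(rows, watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

source_rows = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]

batch, wm = incremental_extract(source_rows, "2024-01-01T12:00:00Z")
```

Each extracted batch would then be staged in Cloud Storage and merged into BigQuery, as described above.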

Transferring large datasets

To transfer large datasets to Google Cloud, you need to consider transfer duration, cost, and complexity. For more information, see strategies for transferring large datasets.

Data lifecycle

Data ingestion is just the first step in the data lifecycle. Google Cloud provides technologies across the data lifecycle, including ingestion, storage, analysis, and visualization.

What's next