Analyzing FHIR data in BigQuery

This document explains to researchers, data scientists, and business analysts the processes and considerations for analyzing Fast Healthcare Interoperability Resources (FHIR) data in BigQuery.

Specifically, this document focuses on patient resource data that is exported from the FHIR store in the Cloud Healthcare API. This document also steps through a series of queries that demonstrate how FHIR schema data works in a relational format, and shows you how to access these queries for reuse through views.

Using BigQuery for analyzing FHIR data

The FHIR-specific API of the Cloud Healthcare API is designed for real-time transactional interaction with FHIR data at the level of a single FHIR resource or a collection of FHIR resources. However, the FHIR API is not designed for analytics use cases. For these use cases, we recommend exporting your data from the FHIR API to BigQuery. BigQuery is a serverless, scalable data warehouse that lets you analyze large quantities of data retrospectively or prospectively.

Additionally, BigQuery conforms to ANSI:2011 SQL, which makes data accessible to data scientists and business analysts through tools that they typically use, such as Tableau, Looker, or Vertex AI Workbench.

In some applications, such as Vertex AI Workbench, you get access through built-in clients, such as the Python Client library for BigQuery. In these cases, the data returned to the application is available through built-in language data structures.

Accessing BigQuery

You can access BigQuery through the BigQuery web UI in the Google Cloud console, and also with the following tools:

By using these tools, you can integrate BigQuery into almost any application.

Working with the FHIR data structure

The built-in FHIR standard data structure is complex, with nested and embedded FHIR data types throughout any FHIR resource. These embeddable FHIR data types are referred to as complex data types. Arrays and structures are also referred to as complex data types in relational databases. The built-in FHIR standard data structure works well serialized as XML or JSON files in a document-oriented system, but the structure can be challenging to work with when translated into relational databases.

The following screenshot shows a partial view of a FHIR patient resource data type that illustrates the complex nature of the built-in FHIR standard data structure.

FHIR `patient resource` data type.

The preceding screenshot shows the primary components of a FHIR patient resource data type. For example, the cardinality column (indicated in the table as Card.) shows several items that can have zero, one, or more than one entries. The Type column shows the Identifier, HumanName, and Address data types, which are examples of complex data types that comprise the patient resource data type. Each of these rows can be recorded multiple times, as an array of structures.

Using arrays and structures

BigQuery supports arrays and STRUCT data types—nested, repeated data structures—as they are represented in FHIR resources, which makes data conversion from FHIR to BigQuery possible.

In BigQuery, an array is an ordered list consisting of zero or more values of the same data type. You can construct arrays of simple data types, such as the INT64 data type, and complex data types, such as the STRUCT data type. The exception is the ARRAY data type, because arrays of arrays are not currently supported. In BigQuery, an array of structures appears as a repeatable record.

You can specify nested data, or nested and repeated data, in the BigQuery UI or in a JSON schema file. To specify nested columns, or nested and repeated columns, use the RECORD (STRUCT) data type.

The Cloud Healthcare API supports the SQL on FHIR schema in BigQuery. This analytics schema is the default schema on the ExportResources() method and is supported by the FHIR community.

BigQuery supports denormalized data. This means that when you store your data, instead of creating a relational schema such as a star or snowflake schema, you can denormalize your data and use nested and repeated columns. Nested and repeated columns maintain relationships between data elements without the performance impact of preserving a relational (normalized) schema.

Accessing your data through the UNNEST operator

Every FHIR resource in the FHIR API is exported into BigQuery as one row of data. You can think of an array or structure inside any row as an embedded table. You can access data in that "table" in either the SELECT clause or the WHERE clause of your query by flattening the array or structure by using the UNNEST operator. The UNNEST operator takes an array and returns a table with a single row for each element in the array. For more information, see working with arrays in standard SQL.

The UNNEST operation doesn't preserve the order of the array elements, but you can reorder the table by using the optional WITH OFFSET clause. This returns an additional column with the OFFSET clause for each array element. You can then use the ORDER BY clause to order the rows by their offset.

When joining unnested data, BigQuery uses a correlated CROSS JOIN operation that references the column of arrays from each item in the array with the source table, which is the table that directly precedes the call to UNNEST in the FROM clause. For each row in the source table, the UNNEST operation flattens the array from that row into a set of rows containing the array elements. The correlated CROSS JOIN operation joins this new set of rows with the single row from the source table.

Investigating the schema with queries

To query FHIR data in BigQuery, it's important to understand the schema that's created through the export process. BigQuery lets you inspect the column structure of every table in the dataset through the INFORMATION_SCHEMA feature, a series of views that display metadata. The remainder of this document refers to the SQL on FHIR schema, which is designed to be accessible for retrieving data.

The following sample query explores the column details for the patient table in the SQL on FHIR schema. The query references the Synthea Generated Synthetic Data in FHIR public dataset, which hosts over 1 million synthetic patient records generated in the Synthea and FHIR formats.

When you query the INFORMATION_SCHEMA.COLUMNS view, the query results contain one row for each column (field) in a table. The following query returns all columns in the patient table:

SELECT *
FROM `bigquery-public-data.fhir_synthea.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name='patient'

The following screenshot of the query result shows the identifier data type, and the array within the data type that contains the STRUCT data types.

`Identifier` data type and the array within the data type that contains the `STRUCT` data type.

Using the FHIR patient resource in BigQuery

The patient medical record number (MRN), a critical piece of information stored in your FHIR data, is used throughout an organization's clinical and operational data systems for all patients. Any method of accessing data for an individual patient or a set of patients must filter for or return the MRN, or do both.

The following sample query returns the internal FHIR server identifier to the patient resource itself, including the MRN and the date of birth for all patients. The filter to query on a specific MRN is also included, but is commented out in this example.

In this query, you unnest the identifier complex data type twice. You also use correlated CROSS JOIN operations to join unnested data with its source table. The bigquery-public-data.fhir_synthea.patient table in the query was created by using the SQL on FHIR schema version of the FHIR to BigQuery export.

SELECT id, i.value as MRN, birthDate
FROM `bigquery-public-data.fhir_synthea.patient`
#This is a correlated cross join
,UNNEST(identifier) i
,UNNEST(i.type.coding) it
WHERE
# identifier.type.coding.code
it.code = "MR"
#uncomment to get data for one patient, this MRN exists
#AND i.value = "a55c8c2f-474b-4dbd-9c84-effe5c0aed5b"

The output is similar to the following:

Output showing ID, medical record number and date of birth for identifier type set to MR

In the preceding query, the identifier.type.coding.code value set is the FHIR identifier value set that enumerates available identity data types, such as the MRN (MR identity data type), driver's license (DL identity data type), and passport number (PPN identity data type). Because the identifier.type.coding value set is an array, there can be any number of identifiers listed for a patient. But in this case, you want the MRN (MR identity data type).

Joining the patient table with other tables

Building on the patient table query, you can join the patient table with other tables in this dataset, such as the conditions table. The conditions table is where patient diagnoses are recorded.

The following sample query retrieves all entries for the medical condition of hypertension.

SELECT abatement.dateTime as abatement_dateTime, assertedDate, category, clinicalStatus, code, onset.dateTime as onset_dateTime, subject.patientid
FROM
`bigquery-public-data.fhir_synthea.condition`
,UNNEST(code.coding) as code
WHERE
code.system = 'http://snomed.info/sct'
#snomed code for Hypertension
AND code.code = '38341003'

The output is similar to the following:

Output showing records for medical condition of hypertension

In the preceding query, you reuse the UNNEST method to flatten the code.coding field. The abatement.dateTime and onset.dateTime code elements in the SELECT statement are aliased because they both end in dateTime, which would result in ambiguous column names in the output of a SELECT statement. When you select the Hypertension code, you also need to declare the terminology system that the code comes from—in this case, the SNOMED CT clinical terminology system.

As the final step, you use the subject.patientid key to join the condition table with the patient table. This key points to the identifier of the patient resource itself within the FHIR server.

Bringing the queries together

In the following sample query, you use the queries from the two preceding sections and join them by using the WITH clause, while performing some simple calculations.

WITH patient AS (
SELECT id as patientid, i.value as MRN, birthDate
FROM
`bigquery-public-data.fhir_synthea.patient`
#This is a correlated cross join
,UNNEST(identifier) i
,UNNEST(i.type.coding) it
WHERE
# identifier.type.coding.code
it.code = "MR"
#uncomment to get data for one patient, this MRN exists
#AND i.value = "a55c8c2f-474b-4dbd-9c84-effe5c0aed5b"
),
condition AS (
SELECT abatement.dateTime as abatement_dateTime, assertedDate, category, clinicalStatus, code, onset.dateTime as onset_dateTime, subject.patientid
FROM
`bigquery-public-data.fhir_synthea.condition`
,UNNEST(code.coding) as code
WHERE
code.system = 'http://snomed.info/sct'
#snomed code for Hypertension
AND code.code = '38341003'
)
SELECT patient.patientid, patient.MRN, patient.birthDate as birthDate_string,
#current patient age. now - birthdate
CAST(DATE_DIFF(CURRENT_DATE(),CAST(patient.birthDate AS DATE),MONTH)/12 AS INT) as patient_current_age_years,
CAST(DATE_DIFF(CURRENT_DATE(),CAST(patient.birthDate AS DATE),MONTH) AS INT) as patient_current_age_months,
CAST(DATE_DIFF(CURRENT_DATE(),CAST(patient.birthDate AS DATE),DAY) AS INT) as patient_current_age_days,
#age at onset. onset date - birthdate
DATE_DIFF(CAST(SUBSTR(condition.onset_dateTime,1,10) AS DATE),CAST(patient.birthDate AS DATE),YEAR)as patient_age_at_onset,
condition.onset_dateTime, condition.code.code, condition.code.display, condition.code.system
FROM patient JOIN condition
ON patient.patientid = condition.patientid

The output is similar to the following:

Output showing ID, MRN and date of birth for medical condition of hypertension.

In the preceding sample query, the WITH clause lets you isolate subqueries into their own defined segments. This approach can help with legibility, which becomes more important as your query grows larger. In this query, you isolate the subquery for patients and conditions into their own WITH segments, and then join them in the main SELECT segment.

You can also apply calculations to raw data. The following sample code, a SELECT statement, shows how to calculate patient's age at disease onset.

DATE_DIFF(CAST(SUBSTR(condition.onset_dateTime,1,10) AS DATE),CAST(patient.birthDate AS DATE),YEAR)as patient_age_at_onset

As indicated in the preceding code sample, you can perform a number of operations on the supplied dateTime string, condition.onset_dateTime. First, you select the date component of the string with the SUBSTR value. Then you convert the string into a DATE data type by using the CAST syntax. You also convert the patient.birthDate field to the DATEfield. Finally, you calculate the difference between the two dates by using the DATE_DIFF function.

What's next