Importing FHIR clinical data into the cloud using the Cloud Healthcare API

This article introduces the Fast Healthcare Interoperability Resources (FHIR) implementation of the Cloud Healthcare API and explains four methods of loading FHIR data into Google Cloud. After the FHIR data is loaded into Google Cloud, you can use the data for many use cases, from app access to analytics and machine learning.

This document is intended for app developers, system admins, software engineers, and other personnel who need to copy FHIR data into the cloud. The techniques shown here are suitable for both production and non-production use, and can be automated. This document assumes that you have a basic understanding of the FHIR specification and its use by electronic health records systems (EHRs).

Use cases

A number of healthcare use cases require access to clinical data that is structured in conformance with the FHIR standard. A Cloud Healthcare API FHIR store provides that access while facilitating security, access control, and standards-based interoperability. For example:

  • A Cloud Healthcare API FHIR store can serve as a managed repository to expose data to patient- and provider-facing apps.
  • Ingesting data into a FHIR store means you can export data into BigQuery, where you can use the data for analysis, machine learning (ML) model training, and many other apps. A FHIR store also serves as a repository to store inferences derived from analytics or ML.
  • Using the Cloud Healthcare API can accelerate the process of de-identifying clinical data for research purposes.
  • As a format for app processing of data, FHIR is significantly easier to work with than other formats such as HL7v2 because of its REST API and JSON-based structure. If your apps currently process HL7v2 data, converting that data to FHIR and storing it in Google Cloud can accelerate your app development work by opening up the use of many software development and data management tools.

When combined with other data, such as Digital Imaging and Communications in Medicine (DICOM) metadata from radiological studies, FHIR-structured data gives a holistic view of a patient and creates new capabilities that improve patient care and clinician effectiveness.

An overview of FHIR data ingestion into Google Cloud

There are two approaches to ingesting FHIR data into Google Cloud: near real-time (interactive) ingestion and batch ingestion. Both approaches start with the creation of a Cloud Healthcare API dataset and FHIR store to contain the data. Additional attributes are assigned to the store to control functionality, such as new resource notifications, referential integrity checks, and resource versioning.

For near real-time ingestion of data, you use the create interaction as defined in the FHIR API. This interaction enables resources, such as Patient or Observation, to be directly inserted into the FHIR store, one at a time. For near real-time insertion of larger batches of data, or transactional insertion of related resources, the FHIR modality of the Cloud Healthcare API offers a fhir.executeBundle method, which conforms to the FHIR standard's batch/transaction interactions.

For larger-scale batch ingestion not in near real-time, the FHIR modality offers an import method that can accept one or more files from Cloud Storage and insert them into the FHIR store in a single request. Alternatively, you can use the gcloud beta healthcare fhir-stores import gcs command to import from Cloud Storage, which uses the Cloud Healthcare API import method internally, but provides a streamlined user experience because you can use the command from command-line scripts.

Creating a service account and assigning permissions

The design and implementation of any process that stores sensitive data in a cloud platform begins with determining the best combination of accounts, security roles, and permissions needed for your particular use case and ingestion method.

Access to the Cloud Healthcare API and gcloud command-line tool requires that your Identity and Access Management (IAM) account has the appropriate roles for the function that you want to perform.

The Cloud Healthcare API has several roles to support data ingestion, as illustrated in the following table.

Role                               Purpose
Healthcare Dataset Administrator   Create or delete datasets.
Healthcare FHIR Administrator      Create or delete FHIR stores. Import data in bulk using the import API request or the gcloud tool.
Healthcare FHIR Resource Editor    Create, read, update, and delete resources in an existing FHIR store. Add data using the batch or transaction API request.

These roles only apply to the ingestion of FHIR resources. Similar roles are available for the ingestion of DICOM and HL7v2 resources.

The account you use can be either your Google Cloud account or an IAM service account. This document uses a service account, but the general information presented here applies regardless of which account type you use. For more information, see Understanding service accounts.

In the Google Cloud Console, you can create an IAM service account and assign it the appropriate permissions. When you create a service account, you can define its roles, and you can also download a JSON-formatted key for app use. For more information, see Creating and managing service accounts.
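
For example, the following gcloud commands sketch this setup from the command line. The service account name fhir-ingest and the choice of the Healthcare FHIR Resource Editor role are illustrative assumptions; substitute the account name and roles that fit your ingestion method.

# Create a service account for FHIR ingestion (the name is illustrative).
gcloud iam service-accounts create fhir-ingest \
    --display-name "FHIR ingestion"

# Grant the Healthcare FHIR Resource Editor role at the project level.
gcloud projects add-iam-policy-binding [PROJECT_ID] \
    --member serviceAccount:fhir-ingest@[PROJECT_ID].iam.gserviceaccount.com \
    --role roles/healthcare.fhirResourceEditor

# Download a JSON-formatted key for app use.
gcloud iam service-accounts keys create ./fhir-ingest-key.json \
    --iam-account fhir-ingest@[PROJECT_ID].iam.gserviceaccount.com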

Creating a Cloud Healthcare API dataset and FHIR store

In the Cloud Healthcare API, a dataset is a grouping of related data stores, each of which contains data for a particular modality, such as FHIR, HL7v2, and DICOM. You can create more than one store in the same dataset, and these stores can each be of different modalities. For more information, see Projects, datasets, and data stores.

The following diagram illustrates one possible arrangement of the Cloud Healthcare API datasets and stores of different modalities.

[Diagram: two Cloud Healthcare API datasets containing HL7v2, DICOM, and FHIR stores.]

In this diagram are two datasets. The first has three stores—one for HL7v2 and two for DICOM. The second has four stores—two for HL7v2, one for DICOM, and one for FHIR. You can create any number of stores in any number of datasets to satisfy your app's requirements.

A dataset, and all of the stores the dataset contains, exists in a single Google Cloud region. This architecture gives you control over the placement of your data, which can be critical to comply with national data privacy or sovereignty laws.

There are two ways to create the Cloud Healthcare API datasets and stores: using the gcloud tool or using the REST API. To learn more about the REST API method, see Create datasets and Create a FHIR store.

This article uses the gcloud tool.

  1. Create a Cloud Healthcare API dataset:

    gcloud beta healthcare datasets create [DATASET_NAME] \
        --location [GCP_REGION]
    

    Where:

    • [DATASET_NAME] is the name of the Cloud Healthcare API dataset that you are creating.
    • [GCP_REGION] is the name of the region in which you are creating the dataset.

    The output is similar to the following:

    Create request issued for: [[DATASET_NAME]]
    Waiting for operation [OPERATION_NUMBER] to complete...done.
    Created dataset [[DATASET_NAME]].
    

    Where [OPERATION_NUMBER] is the dataset creation operation identifier used for tracking the request.

  2. Create a FHIR store:

    gcloud beta healthcare fhir-stores create [STORE_NAME] \
        --dataset [DATASET_NAME]
    

    Where [STORE_NAME] is the name of the FHIR store.

    There are several additional options for this command:

    • --disable-referential-integrity lets you turn off referential integrity in the FHIR store. By default, referential integrity enforces relationships among FHIR entities as defined by FHIR entity references. With referential integrity enabled, you must either insert all independent entities before inserting entities that depend on them, or insert all interdependent entities as part of a single transaction bundle. Disabling referential integrity lets you insert FHIR resources into the store without regard to order or relationships.

    • --disable-resource-versioning disables resource versioning for this FHIR store. By default, resource versioning is enabled, and a historical record of resource changes is created. You can retrieve these changes by using the history features of the FHIR API. Disabling resource versioning prevents the creation of this historical record.

    • --enable-update-create lets you create resources by using update requests (HTTP PUT) if the resources don't already exist in the store. Enabling this setting also lets you use PUT requests to create resources with IDs that you can preassign. This setting can be useful when ID synchronization is important, such as when synchronizing data with an EHR, or when ingesting data with many resource cross-references. Using the import function requires the use of system-generated resource IDs. If this option isn't set, using update requests on resources that don't exist in the FHIR store results in an error.

    • --pubsub-topic [TOPIC_NAME] notifies you when data is inserted into this FHIR store. [TOPIC_NAME] represents the name of a Pub/Sub topic. Notifications are generated only when data is inserted one record at a time or by using the batch or transaction feature. Data ingested by using import or gcloud beta healthcare fhir-stores import doesn't generate notifications. By default, no Pub/Sub topic is configured.

At a minimum, you must specify the FHIR store name and the dataset name in the gcloud beta healthcare fhir-stores create command.
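
For example, the following command combines the flags described above into one illustrative invocation (the flag combination is an assumption for demonstration, not a requirement). It creates a store that accepts client-assigned IDs through PUT requests and publishes notifications to an existing Pub/Sub topic:

gcloud beta healthcare fhir-stores create [STORE_NAME] \
    --dataset [DATASET_NAME] \
    --enable-update-create \
    --pubsub-topic projects/[PROJECT_ID]/topics/[TOPIC_NAME]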

Obtaining an OAuth2 access token

When you use the Cloud Healthcare API to ingest data, an OAuth 2.0 access token is required. In this article, all of the example Cloud Healthcare API requests are shown using the cURL command-line utility. These examples use gcloud auth print-access-token to obtain an OAuth 2.0 bearer token and to include the token in the request's authorization header. For more information about this command, see gcloud auth print-access-token.
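
For example, a common shell pattern (a sketch, not a requirement of the API) captures the token in a variable before issuing requests:

# Obtain a short-lived OAuth 2.0 bearer token for the active gcloud account.
ACCESS_TOKEN=$(gcloud auth print-access-token)

# Pass the token in the Authorization header of each Cloud Healthcare API request.
curl -H "Authorization: Bearer ${ACCESS_TOKEN}" \
    'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/fhirStores/[STORE_NAME]/fhir/Patient'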

Near real-time data ingestion

Many systems have requirements to store single FHIR resources (such as Patient or Observation) or to insert relatively small numbers of related resources (such as a set of MedicationRequest entities for a particular Patient record). The following sections discuss ways to create resources that satisfy these use cases.

Use the FHIR create function

You can use the create interaction to insert a single resource into a FHIR store. The following is a sample curl request:

curl -X POST \
'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/fhirStores/[STORE_NAME]/fhir/[RESOURCE_TYPE]' \
    -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
    -H 'Content-Type: application/fhir+json;charset=utf-8' \
    -d '{ ... JSON FHIR resource ... }'

Where:

  • [PROJECT_ID] is the name of your Google Cloud project.
  • [LOCATION_NAME] is the location in which your dataset resides.
  • [DATASET_NAME] is the name of your dataset.
  • [STORE_NAME] is the name of your FHIR store.
  • [RESOURCE_TYPE] is the type of resource being created.

The Content-Type header value is required and must be set to application/fhir+json as defined in the FHIR specification. The payload for the POST request must be a valid JSON-encoded FHIR resource.

If this request is successful, the Cloud Healthcare API service returns an HTTP 201 Created status code. If there are any errors, the response includes a payload that describes the error. In most cases, this payload conforms to the OperationOutcome resource definition in the FHIR standard.
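
As a concrete illustration, the following request creates a minimal Patient resource. The patient details are hypothetical placeholders:

curl -X POST \
'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/fhirStores/[STORE_NAME]/fhir/Patient' \
    -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
    -H 'Content-Type: application/fhir+json;charset=utf-8' \
    -d '{
      "resourceType": "Patient",
      "name": [{"family": "Smith", "given": ["Darcy"]}],
      "gender": "female",
      "birthDate": "1970-01-01"
    }'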

Use the execute bundle request

You can use batch or transaction requests to insert one or more resources into a FHIR store. The following is an example of the input payload structure, specifying a PUT request for a single resource:

{
    "entry": [
        {
            "request": {
                "method": "PUT",
                "url": "<resource-url>"
            },
            "resource": {
                ... FHIR resource data ...
            }
        }
    ],
    "resourceType": "Bundle",
    "type": "transaction"
}

The request object in the payload indicates the operation to be performed for this resource. This request type differs from the create interaction in that it can accept a bundle containing multiple resources of various types as input. For each resource, you can use a different operation, such as POST to create, PUT to update, and DELETE to delete. The FHIR specification defines the content of this object.

There is a difference in behavior between an input bundle that specifies a type of batch instead of a type of transaction. When you use a type value of batch in the input bundle, each request in the bundle is treated as an individual request, and there cannot be any relationship dependencies between the resources in the bundle. A bundle with a type value of transaction, however, does allow relationships between resources, and the content of the bundle is treated as a single, "all-or-nothing" unit of work. For more information, see the FHIR specification's sections on batch and transaction API requests.
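
The following sketch illustrates the transaction case: an Observation references a Patient created in the same bundle through a temporary urn:uuid identifier, which the server rewrites when it assigns the real resource ID. The UUID and resource contents are illustrative assumptions:

{
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {
            "fullUrl": "urn:uuid:61ebe359-bfdc-4613-8bf2-c5e300945f0a",
            "request": {"method": "POST", "url": "Patient"},
            "resource": {
                "resourceType": "Patient",
                "name": [{"family": "Smith"}]
            }
        },
        {
            "request": {"method": "POST", "url": "Observation"},
            "resource": {
                "resourceType": "Observation",
                "status": "final",
                "code": {"text": "heart rate"},
                "subject": {"reference": "urn:uuid:61ebe359-bfdc-4613-8bf2-c5e300945f0a"}
            }
        }
    ]
}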

The following is a sample curl request that runs a batch or transaction function:

curl -X POST \
'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/fhirStores/[STORE_NAME]/fhir' \
    -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
    -H 'Content-Type: application/fhir+json;charset=utf-8' \
    -d '{ ... JSON FHIR bundle ... }'

The Content-Type header is required and must be set to application/fhir+json as defined in the FHIR specification. The payload of this POST request must be a valid FHIR bundle resource specifying either batch or transaction as the bundle type. The bundle must contain an array of entries containing FHIR resources and the corresponding request parameters.

If this request is successful, the Cloud Healthcare API returns a response bundle containing an array of entries representing the status of the operation for each resource in the input bundle. A successful operation has an entry in the output bundle with a location URI for the resource and a status of 201 Created. If an operation incurs any errors, details about the error are included in the array entry for the affected resource. Errors are returned with a payload describing the error. In most cases, this payload conforms to the OperationOutcome resource definition in the FHIR standard.

Batch data ingestion

Batch ingestion of resources is a common use case, especially in situations where a large number of resources are involved. The best approach is to store your resources in a Cloud Storage bucket, and then load them into your FHIR store. This approach significantly reduces network latency associated with transferring data and making multiple API requests.

When you create a FHIR store, you can use the disable-referential-integrity setting to indicate whether the store should enforce referential integrity. The import method is an exception: it ignores this setting due to the nature of the request type. If referential integrity checking is required, use the execute bundle or create functions instead.

Prepare your data for import

When you import data using the Cloud Healthcare API or the gcloud beta healthcare fhir-stores import command, the data in the Cloud Storage bucket should be in one of the following formats:

  • A series of files containing newline-delimited JSON. These files are also known as ndjson files. ndjson files can have multiple resources in one file, but each resource must be on its own line, terminated by a single newline character. You cannot have newline characters in the body of the resource itself. When importing FHIR resources stored in ndjson format, you can put as many ndjson files in the bucket as you want. An example ndjson file is shown after this list.
  • A series of files, each of which contains a single bundle. Unlike the bundles used in the execute bundle function, a bundle using the import function must specify a type parameter of collection only, and cannot contain REST request parameters such as method and url. You can have as many files in your Cloud Storage bucket as you want, but each file can contain only one bundle.
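
As referenced in the first item in the preceding list, an ndjson file with two resources might look like the following sketch (the field values are illustrative):

{"resourceType":"Patient","name":[{"family":"Smith","given":["Darcy"]}],"gender":"female"}
{"resourceType":"Observation","status":"final","code":{"text":"heart rate"},"valueQuantity":{"value":80,"unit":"beats/minute"}}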

Batch ingestion using the Cloud Healthcare API import function

The Cloud Healthcare API has an import request type that you can use to import a large number of resources from Cloud Storage.

The following is a sample curl request:

curl -X POST \
'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/fhirStores/[STORE_NAME]:import' \
    -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
    -H 'Content-Type: application/json;charset=utf-8' \
    -d '{
    "gcsSource": {
        "uri": "gs://[BUCKET_OR_RESOURCE_NAME]/*"
    }
}'

In the payload of this request, you must specify either the fully qualified path to a Cloud Storage resource (a file containing FHIR data) or the name of a Cloud Storage bucket that contains the data (the [BUCKET_OR_RESOURCE_NAME] value). The /* at the end of the Cloud Storage bucket name indicates that this is a bucket name rather than a file name, and that all data in the bucket is to be imported. If you want to do a recursive traversal of a folder hierarchy in a bucket, you can use /** at the end of the path.
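
For example, to import every file under a fhir/ prefix, including nested folders, the gcsSource portion of the payload might look like the following (the bucket layout is a hypothetical illustration):

"gcsSource": {
    "uri": "gs://[BUCKET_NAME]/fhir/**"
}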

If this request succeeds, the Cloud Healthcare API service returns a 200 OK status code, plus a long-running operation identifier. You can use this long-running operation identifier to track the progress of the request and to obtain the results of the operation after it is complete. The following is an example of a long-running operation identifier:

projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/operations/[OPERATION_NUMBER]

To monitor the status of the import operation, you can use the following long-running operation check API request:

curl -X GET \
'https://healthcare.googleapis.com/v1beta1/projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/operations/[OPERATION_NUMBER]' \
    -H 'Authorization: Bearer '$(gcloud auth print-access-token) \
    -H 'Content-Type: application/json; charset=utf-8'

When your import request is complete, the result of the long-running operation check request contains a payload with a done property set to true. Other information about the results of the request is also included in this payload.
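
The response follows the standard shape of Google long-running operations. A completed import might look similar to the following sketch; the exact metadata and response fields can vary, so treat this as illustrative only:

{
    "name": "projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]/operations/[OPERATION_NUMBER]",
    "metadata": { ... operation metadata ... },
    "done": true,
    "response": { ... import results ... }
}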

Batch ingestion using the gcloud tool

The gcloud beta healthcare fhir-stores import command provides an easier way to import resources from Cloud Storage into a FHIR store than the previously described methods. Behind the scenes, the command uses the same import functionality, but using the gcloud tool eliminates the need to poll for completion of the request.

The following is an example of using the gcloud tool to import resources:

gcloud beta healthcare fhir-stores import gcs [STORE_NAME] \
    --dataset [DATASET_NAME] \
    --location [LOCATION_NAME] \
    --gcs-uri gs://[BUCKET_OR_RESOURCE_NAME]/*

You can use the following additional flags:

  • --async to display information about the operation in progress, without waiting for the operation to complete.
  • --content-structure to define the content structure in the source location. The default is BUNDLE.

When the import operation succeeds, the output displays the following:

Request issued for: [[STORE_NAME]]
Waiting for operation [[OPERATION_NUMBER]] to complete...done.
name: projects/[PROJECT_ID]/locations/[LOCATION_NAME]/datasets/[DATASET_NAME]

Additional security and access control considerations

When you access a Cloud Healthcare API dataset and FHIR store using the REST API or the gcloud tool, access to resources is granted to any account that has access to the containing project and is authorized by one of the Cloud Healthcare API roles and permissions. This type of security makes project-level controls—such as creating only necessary accounts and setting up proper auditing—a critical part of helping to ensure the security of healthcare data.

There are two additional levels of security for the Cloud Healthcare API datasets and stores: IAM policies that govern who can have access to datasets, and resource-level restrictions imposed by security labels. These levels are described in more detail in the following sections.

IAM policy restrictions

The Cloud Healthcare API lets you create access control lists (ACLs) that can further restrict access to datasets and stores beyond what's possible with project-level controls. ACLs let you work with IAM roles and permissions at a dataset and store level by using three specialized API requests:

  • get IAM policy lets you retrieve the current policies at a dataset or store level.
  • set IAM policy lets you set IAM policies at a dataset or store level.
  • test IAM permissions allows apps to check that an account associated with the current OAuth 2 access token has a set of permissions for a particular dataset or store.

You can also use gcloud commands.
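
For example, the following command (a sketch that reuses the hypothetical fhir-ingest service account from earlier) grants a service account the Healthcare FHIR Resource Editor role on a single store rather than on the whole project:

gcloud beta healthcare fhir-stores add-iam-policy-binding [STORE_NAME] \
    --dataset [DATASET_NAME] \
    --location [GCP_REGION] \
    --member serviceAccount:fhir-ingest@[PROJECT_ID].iam.gserviceaccount.com \
    --role roles/healthcare.fhirResourceEditor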

For more information about how to set ACLs on datasets and stores, as well as details about restricting access at the ACL level, see controlling access.

Loading data using other Google Cloud services

Many use cases involving analytics or machine learning require that data pass through other services for normalization, cleansing, and format transformation. While a complete discussion of these processes is outside the scope of this article, there are different ways to integrate with the FHIR modality of the Cloud Healthcare API.

Cloud Functions

As a fully managed, serverless feature, Cloud Functions can often be used as a key component of a complete data ingestion pipeline. Cloud Functions can invoke the REST interface of the Cloud Healthcare API to ingest data from most sources.

By default, Cloud Functions use the default service account of App Engine, which has the project editor role. To restrict the ability of a Cloud Function to access sensitive healthcare data, you should strongly consider using a non-default service account that has only the roles that are appropriate to the function being performed and to the datasets and stores being accessed. For more information, read about assigning a non-default service account to a Cloud Function.

Cloud Storage

Many organizations use Cloud Storage to ingest large amounts of data into Google Cloud, and then use one of the Google Cloud processing services to cleanse, transform, or normalize that data. In addition to using the ingest process described in this document, it's also possible to use object change notification hooks to process the data when the upload finishes. You can use a Cloud Function that is invoked when a new file is uploaded to Cloud Storage. For example, a Cloud Function can invoke the import, batch, or transaction methods to import the data. You can also send the data through a processing pipeline built on Dataflow, Dataproc, or other third-party systems to convert the data into FHIR prior to ingestion.

Pub/Sub

Pub/Sub is Google's scalable event-processing system, and is frequently used with the Cloud Healthcare API to create and manage topics and subscriptions for notifications about events on the Cloud Healthcare API store types. Apps that process Pub/Sub messages require the appropriate permissions on their runtime service accounts in order to access a Pub/Sub subscription. You can also set explicit access control constraints on individual topics and subscriptions to limit the service accounts that can access them. The Cloud Healthcare API service agent account requires the Pub/Sub publisher role to publish messages to the topic you specified when creating your FHIR store.
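
For example, the following command grants the publisher role on a topic to the Cloud Healthcare API service agent. This is a sketch: it assumes the service agent follows the standard service-[PROJECT_NUMBER]@gcp-sa-healthcare.iam.gserviceaccount.com address pattern.

gcloud pubsub topics add-iam-policy-binding [TOPIC_NAME] \
    --member serviceAccount:service-[PROJECT_NUMBER]@gcp-sa-healthcare.iam.gserviceaccount.com \
    --role roles/pubsub.publisher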

When you use Pub/Sub to process data for loading into a Cloud Healthcare API FHIR store, we highly recommend that you review the following:

  • Roles and permissions granted to the processing app's service account.
  • Access control rules on the topics and subscriptions.

Dataflow

Dataflow is Google's high-performance, fully managed service for stream and batch data processing. Based on Apache Beam, Dataflow is often used in extract, transform, and load (ETL) pipelines to cleanse, normalize, and transform data. Dataflow can help transform HL7v2 messages to FHIR format.

Use the REST interface to have Dataflow pipelines invoke the FHIR modality of the Cloud Healthcare API.

What's next