Genomic data processing reference architecture

This document describes reference architectures for using the Cloud Life Sciences API with other Google Cloud products to perform genomic data processing by using different methods and workflow engines. Specifically, this document focuses on the alignment and variant calling steps of secondary analysis, and is intended for bioinformaticians, researchers, research IT teams, and other technical specialists in a life sciences organization.

Google Cloud offers a flexible set of APIs, services, and tools for running a cost-effective secondary analysis solution at scale. Secondary analysis includes, but is not limited to, filtering raw reads, aligning and assembling sequence reads, and performing QA and variant calling on aligned reads.

Architecture

The following diagram illustrates the steps in processing genomic data at scale, and shows which steps are performed in Google Cloud.

Process genomic data at scale with Google Cloud.

As the preceding diagram shows, a genomic data sample first undergoes primary analysis, and the resulting raw data is then ingested into Google Cloud for secondary analysis. The processed data then undergoes tertiary analysis, which produces reports such as PDFs that bioinformaticians and other technical specialists can download from the cloud.

Genomic data processing reference architecture components

This document includes details about using the Cloud Life Sciences API, as well as two reference architecture overviews that outline different ways of using the API to perform genomic data processing.

Using the Cloud Life Sciences API

The Cloud Life Sciences API provides a fully managed computing service that allocates optimal computing resources based on the requirements of a batch job. The API lets you create, run, and monitor command-line tools that run in Docker containers on Compute Engine instances. You can organize multiple command-line tools to run in a specific order, with dependencies between the steps. The Cloud Life Sciences API includes the WorkflowsServiceV2Beta service, which you can use to run workflows, such as pipelines consisting of Docker containers.

A Pipeline object consists of one or more Action descriptions and a Resources message that describes the cloud resources required to run the pipeline. Each Action describes a single Docker container execution and can include one or more commands. Action objects run sequentially, with each object waiting to run until the previous object exits, unless the preceding object is set to run in the background. Flags control how each Action object runs and how its exit status affects later actions in the pipeline, such as whether an Action object runs in the background or whether its exit status is ignored. As a pipeline author, you create the Resources message that describes the cloud resources required to run the pipeline.

Specifications in the Resources message can include the following (a minimal example follows the list):

  • Virtual machine (VM) sizes, including custom machine shapes and preemptibility
  • Zones and regions for VM allocation
  • Networking options (important for large processing projects, due to network size limits)
  • Attached accelerators (GPUs)
  • Attached disks
  • Service accounts and scopes
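
For illustration, the following Python dict sketches a Resources message in the shape that the v2beta REST API accepts. The project, disk, and service account names are placeholders, not values defined by the API:

# A minimal Resources message, expressed as the JSON body that the
# v2beta REST API accepts (shown here as a Python dict).
resources = {
    "regions": ["us-central1"],  # where the VM can be allocated
    "virtualMachine": {
        "machineType": "n1-standard-8",  # custom shapes such as "custom-6-20480" also work
        "preemptible": True,  # cheaper, but the VM can be reclaimed
        "bootDiskSizeGb": 50,
        "disks": [{"name": "work", "sizeGb": 200}],  # attached scratch disk
        "accelerators": [{"type": "nvidia-tesla-t4", "count": 1}],
        "serviceAccount": {
            "email": "pipeline-runner@my-project.iam.gserviceaccount.com",
            "scopes": ["https://www.googleapis.com/auth/cloud-platform"],
        },
    },
}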

The Cloud Life Sciences API performs the following tasks when it receives a Pipeline object through an API call (a minimal request sketch follows the list):

  1. Creates a Compute Engine VM instance, based on the Resources message content.
  2. Downloads all Docker images specified in an Action description.
  3. Runs each Action object as a new Docker container with the specified image and command.
  4. Deletes the Compute Engine VM instance.
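
As a minimal sketch of such a call, the following Python code uses the google-api-python-client library to submit a one-action pipeline; the project ID is a placeholder:

from googleapiclient.discovery import build  # pip install google-api-python-client

service = build("lifesciences", "v2beta")

request_body = {
    "pipeline": {
        "actions": [
            {"imageUri": "bash", "commands": ["-c", "echo 'hello'"]},
        ],
        "resources": {"regions": ["us-central1"]},
    },
}

operation = (
    service.projects()
    .locations()
    .pipelines()
    .run(parent="projects/my-project/locations/us-central1", body=request_body)
    .execute()
)
print(operation["name"])  # a long-running Operation; poll it to follow steps 1-4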

A typical set of tasks described in an Action description can include the following (a sketch follows the list):

  • Download input files.
  • Run commands.
  • Copy logs to Cloud Storage.
  • Upload output files.
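
The following sketch expresses that pattern as an actions list; it assumes a scratch disk named work is declared in the Resources message, and the bucket and image names are hypothetical:

actions = [
    {
        # 1. Localize: copy inputs from Cloud Storage to the shared disk.
        "imageUri": "google/cloud-sdk:slim",
        "commands": ["gsutil", "cp", "gs://my-bucket/sample.ubam", "/mnt/work/"],
        "mounts": [{"disk": "work", "path": "/mnt/work"}],
    },
    {
        # 2. Run the analysis tool against the localized input.
        "imageUri": "my-registry/my-aligner:latest",
        "commands": ["bash", "-c", "align /mnt/work/sample.ubam > /mnt/work/out.bam"],
        "mounts": [{"disk": "work", "path": "/mnt/work"}],
    },
    {
        # 3. Delocalize: upload outputs and logs; alwaysRun copies them
        # even if an earlier action failed.
        "imageUri": "google/cloud-sdk:slim",
        "commands": ["gsutil", "cp", "/mnt/work/out.bam", "gs://my-bucket/results/"],
        "mounts": [{"disk": "work", "path": "/mnt/work"}],
        "alwaysRun": True,
    },
]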

The Cloud Life Sciences API uses Google Cloud's scalable, high-performance storage and processing power, along with preemptible VM instances, GPUs, and tensor processing units (TPUs), to help provide a secure and scalable environment for data analysis. This includes running industry-standard secondary analysis frameworks, such as the Genome Analysis Toolkit (GATK) and DeepVariant, on a single sample or on joint datasets, quickly, easily, and cost-effectively.

The Cloud Life Sciences API is integrated with open source workflow engines, such as Cromwell, Nextflow, and Galaxy. In addition, Nextflow and Galaxy support computing on Google Cloud through direct integrations with Compute Engine and Cloud Storage. The following diagram shows a reference architecture that includes components of the Cloud Life Sciences API, as well as other services that are required to run a genomic secondary analysis pipeline in Google Cloud.

Reference architecture showing Cloud Life Sciences API running on Compute Engine compute nodes using Persistent Disk, Container-Optimized OS, and GPUs.

Genomic data processing: using Cromwell to run GATK Best Practices workflows

You can use the Cloud Life Sciences API with a workflow engine to orchestrate tasks. In the following example, Cromwell is used to perform secondary analysis, while applying the GATK Best Practices workflows.

Cromwell is an open source workflow management system that can run in a variety of environments to schedule, run, and manage workflows. Cromwell reads a workflow defined in the Workflow Description Language (WDL) and runs the workflow's individual tasks by using the Cloud Life Sciences API. In this case, a Cromwell server runs on a Compute Engine VM instance, with a Cloud SQL server storing operational data. The Cromwell server then runs each task defined in the WDL by making calls to the Cloud Life Sciences API, which starts a VM instance for each task.
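
For illustration, the following Python sketch submits a workflow to a Cromwell server's REST API. The server URL and filenames are hypothetical; in this architecture, the request passes through Identity-Aware Proxy, as described later in this section:

import requests

CROMWELL_URL = "http://cromwell.example.internal:8000"  # placeholder address

# Cromwell's workflows endpoint takes the WDL source and an inputs file
# as a multipart form POST.
with open("PairedEndSingleSampleWf.wdl", "rb") as wdl, open("inputs.json", "rb") as inputs:
    response = requests.post(
        f"{CROMWELL_URL}/api/workflows/v1",
        files={"workflowSource": wdl, "workflowInputs": inputs},
    )
response.raise_for_status()
print(response.json())  # for example: {"id": "<workflow-id>", "status": "Submitted"}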

The GATK includes tools for variant discovery and genotyping. The GATK Best Practices workflows include a number of pipelines for specific use cases. This example focuses on the production workflow PairedEndSingleSampleWf, which includes data preprocessing and initial variant calling for germline single nucleotide polymorphism (SNP) and indel discovery in human whole-genome sequencing data from a single sample.

The workflow takes human whole-genome paired-end sequencing data in the unmapped BAM (uBAM) format with one or more read groups, and uses the Hg38 reference genome with ALT contigs. The outputs include a CRAM file, a CRAM index, a CRAM MD5 checksum, a GVCF and its corresponding index, a BQSR report, and several summary metrics. The sample used is a downsampled version of NA12878 that contains just three read groups of the full sample. The workflow is packaged as a series of WDL files that define the individual tasks, along with an input specification and generic options. The WDL files specify the cloud resources for each task, such as the number of CPU cores and the amount of memory for each VM instance, which container to install on each VM instance, the command to run, and input and output file locations. They can also specify, for example, whether a step should run on preemptible machines and how many times to retry an Action object.
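
As a sketch of these specifications, the following Python dict mirrors a Cromwell workflow options file that sets defaults across all tasks; the bucket name and values are illustrative:

# Defaults that Cromwell applies to every task unless the WDL overrides them.
workflow_options = {
    "default_runtime_attributes": {
        "preemptible": 3,  # attempt up to 3 preemptible VMs before a standard VM
        "maxRetries": 1,   # retry a failed task once
        "zones": "us-central1-a us-central1-b",
    },
    "final_workflow_outputs_dir": "gs://my-bucket/outputs",  # where to copy final outputs
}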

The following diagram shows a typical architecture for performing genomic secondary analysis by using Cromwell to run the GATK Best Practices workflows with the Cloud Life Sciences API, including the steps required to run the analysis.

Using Cromwell to run GATK Best Practices with the Cloud Life Sciences API.

The diagram shows the following steps for running a secondary analysis:

  1. After sequencing, you save the raw base calls to local storage or network-attached storage (NAS), where they're converted to uBAM files. You then transfer these uBAM files to a Cloud Storage bucket.
  2. You authenticate to act as a service account by using Identity and Access Management (IAM).
  3. You submit the job to the Cromwell server that's running on Compute Engine.

    The request goes through Identity-Aware Proxy, which is configured with Google Cloud firewall rules to allow access to the Virtual Private Cloud network.

  4. The Cromwell server pulls the workflow for the GATK from public repositories.

  5. The Cromwell server makes calls to the Cloud Life Sciences API to start jobs.

  6. The Cloud Life Sciences API pulls the containers for each task from the GATK public repositories.

  7. The Cloud Life Sciences API starts VMs on Compute Engine for each task and starts the container on each machine.

  8. The VM instance retrieves input files from Cloud Storage buckets.

  9. The VM instance performs computational tasks and saves intermediate, log, and output files to a Cloud Storage bucket.

Genomic data processing: running DeepVariant

DeepVariant is an open source analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. Developed by a research team within Google, it uses TensorFlow for SNP and indel variant calling on exomes or genomes, transforming aligned sequencing reads into variant calls.

To use DeepVariant, at the highest level, you need to provide three inputs:

  • A reference genome in FASTA format and its corresponding .fai index file, generated by using the samtools faidx command.
  • An aligned reads file in BAM format and its corresponding index file (.bai). The reads must be aligned to the reference genome provided.
  • The DeepVariant ML model to use for variant calling.

The output of DeepVariant is a list of all variant calls, in VCF format.

You can run DeepVariant in a Docker container or directly from its binaries, and you can run it on-premises or in the cloud. DeepVariant includes support for hardware accelerators such as GPUs and TPUs. You can run DeepVariant from a Docker container with a single command, as described in the DeepVariant Quick Start guide. For production use cases on Google Cloud, we recommend that you integrate the DeepVariant process into your workflow and make the call from your workflow engine.
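
As a minimal sketch of that single command, the following Python snippet wraps the Quick Start invocation with subprocess; the version, paths, and shard count are placeholders to adapt to your environment:

import subprocess

BIN_VERSION = "1.6.0"  # placeholder: pin the DeepVariant release you have validated

# Run DeepVariant once, end to end, from the published Docker image.
# /data on the host holds the reference, reads, and index files described above.
subprocess.run(
    [
        "docker", "run",
        "-v", "/data:/data",
        f"google/deepvariant:{BIN_VERSION}",
        "/opt/deepvariant/bin/run_deepvariant",
        "--model_type=WGS",  # selects the ML model (for example, WGS or WES)
        "--ref=/data/hg38.fasta",  # reference FASTA, with its .fai alongside
        "--reads=/data/sample.bam",  # aligned reads, with their .bai alongside
        "--output_vcf=/data/sample.vcf.gz",
        "--output_gvcf=/data/sample.g.vcf.gz",
        "--num_shards=8",  # parallelism; match the machine's core count
    ],
    check=True,
)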

Alternatively, you can use DeepVariant Runner, which uses the Cloud Life Sciences API in a way that's similar to the methods explained in the Running DeepVariant tutorial. This approach lets you run DeepVariant at scale by using a Docker-based pipeline that's optimized for cost and speed.

The following diagram shows a typical architecture for running a DeepVariant pipeline on Google Cloud.

Architecture for running a DeepVariant pipeline on Google Cloud.

Following are the steps to run DeepVariant, as represented in the preceding diagram:

  1. After creating mapped DNA reads from samples, you transfer these BAM files to a Cloud Storage bucket.
  2. You authenticate to act as a service account by using IAM.
  3. You run DeepVariant, which makes calls to the Cloud Life Sciences API to start jobs.
  4. The Cloud Life Sciences API pulls the DeepVariant container for each task from the public repositories.
  5. The Cloud Life Sciences API starts VMs on Compute Engine for each task and starts the container on each machine.
  6. The VM instance retrieves input files from Cloud Storage buckets.
  7. The VM instance performs computational tasks and saves intermediate, log, and output files to a Cloud Storage bucket.

Data governance

The architecture described in this document focuses on the components that are specific to genomic data analysis. For a production integration that includes human genomic data, you might need to configure additional components in Google Cloud, depending on the requirements and best practices in your jurisdiction.

Google Cloud provides an end-to-end architecture that encapsulates Google Cloud best practices to help meet healthcare security, privacy, and compliance needs. The Cloud Healthcare Data Protection Toolkit includes many of the security and privacy best practice controls recommended for healthcare data, such as configuring appropriate access, maintaining audit logs, and monitoring for suspicious activities. To learn more about these capabilities and how Google Cloud helps to protect customer data, see the Trusting your data with Google Cloud whitepaper.

Data residency

For organizations that have data residency requirements, see the "General Services" section of the Service Specific Terms. That section documents Google Cloud's data residency commitments and highlights tools and controls that can help you configure your environment to support those requirements.

Organization policies

To limit the physical location of resources across a project, you can set an organization policy that includes a resource location constraint. For a list of supported services, see Resource locations supported services.

Google Cloud resource identifiers

For purposes of this document, data residency commitments do not apply to resource identifiers, attributes, or other data labels. You are responsible for ensuring that no sensitive data is exposed in these identifiers, attributes, or other data labels, such as in a filename.

Cloud Life Sciences API regions and zones

When making Cloud Life Sciences API calls, you must specify the region to which the request is sent. The following example shows an endpoint URI for running a pipeline in the Google Cloud project foo and the region us-central1:

v2beta/projects/foo/locations/us-central1/workflows:runPipeline

Any metadata that is saved for the operation—including container image names, input and output filenames, and other information sent in the request to the Cloud Life Sciences API—is saved in this region.

When the Cloud Life Sciences API starts a Compute Engine VM instance, the API call must include a region or zone in which to start the VM instance. You can configure one or more regions or zones to restrict the location of the VM. The VM instance, data at rest in Compute Engine, and persistent disks stay in the specified region. Available features for Compute Engine instances, such as CPU platforms, machine types, SSDs, and GPUs, differ by region and zone. Therefore, make sure that the required resources are available in a region or zone before you restrict usage to that region or zone.
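
Continuing the Python sketches from earlier sections, the following fragment pins both the request metadata and the VM resources to specific locations; the project and location names are illustrative:

# Operation metadata is stored in the region named in the request path.
parent = "projects/my-project/locations/us-central1"

# The VM instance, its data at rest, and its persistent disks stay in
# the zones listed here.
resources = {
    "zones": ["us-central1-a", "us-central1-b"],
    "virtualMachine": {"machineType": "n1-standard-8"},
}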

For more information, see Compute Engine data residency details and terms.

Cloud Storage

You can store input and output files, temporary work files, and persistent disk snapshots in Cloud Storage. Data at rest in Cloud Storage buckets stays within the region, dual-region, or multi-region that you select when configuring the bucket. For more information, see the current list of Cloud Storage bucket locations.

To optimize availability, performance, and efficiency, you can use a Cloud Storage dual-region. When you select this option, the data in the bucket is asynchronously copied to two specific regions, making the data redundant across regions. For more information, see Cloud Storage dual-regions.
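
For example, the following Python sketch uses the google-cloud-storage client library to create a dual-region bucket; the project and bucket names are placeholders, and NAM4 is the dual-region that pairs us-central1 and us-east1:

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client(project="my-project")

# Objects written to this bucket are asynchronously replicated to both
# regions in the NAM4 dual-region.
bucket = client.create_bucket("my-genomics-bucket", location="NAM4")
print(bucket.location)  # "NAM4"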

For performance and for data residency, use the same region for Cloud Storage, Compute Engine, and the Cloud Life Sciences API, if possible.

For more information, see Cloud Storage data residency details and terms.

What's next