This document describes reference architectures for using the Cloud Life Sciences API with other Google Cloud products to perform genomic data processing by using different methods and workflow engines. Specifically, this document focuses on the alignment and variant calling steps of secondary analysis, and is intended for bioinformaticians, researchers, research IT teams, and other technical specialists in a life sciences organization.
Google Cloud offers a flexible set of APIs, services, and tools for running a cost-effective secondary analysis solution at scale. Secondary analysis includes, but is not limited to, filtering raw reads, aligning and assembling sequence reads, and QA and variant calling on aligned reads. The following diagram illustrates the steps in processing genomic data at scale, and shows which steps are performed in Google Cloud.
As the preceding diagram shows, a genomic data sample first undergoes primary analysis, and then is ingested as raw data to Google Cloud for secondary analysis. The processed data then undergoes tertiary analysis and produces reports such as PDFs, which can be downloaded from the cloud by bioinformaticians and other technical specialists.
Genomic data processing reference architecture components
This document includes details about using the Cloud Life Sciences API, as well as two reference architecture overviews that outline different ways of using the API to perform genomic data processing.
Using the Cloud Life Sciences API
The Cloud Life Sciences API provides a fully managed computing service that provisions computing resources based on the resource requirements of a batch job. The API provides a way for you to create, run, and monitor command-line tools that run in Docker containers on Compute Engine instances.
You can organize multiple command-line tools to run in a specific order with dependencies between the steps. The Cloud Life Sciences API includes the WorkflowsServiceV2Beta service, which you can use for running workflows, such as pipelines consisting of Docker containers.
A Pipeline object consists of one or more Action descriptions, and a Resources message that describes which cloud resources are required to run the pipeline. Each action describes a single Docker container execution, which can include one or more command objects.
Action objects run sequentially, with each object waiting to run until the previous object exits, unless the preceding object is set to run in the background. There are flags that affect how each Action object runs and how the exit status affects actions in the pipeline; for example, whether an Action object should run in the background, or whether the exit status should be ignored.
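The ordering rules above can be illustrated with a toy model. The following is a minimal sketch, not the API's implementation: ToyAction and simulate are hypothetical names, and the model assumes that a non-ignored failure stops the pipeline, which simplifies the real behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyAction:
    """Hypothetical stand-in for an Action; not a type from the API."""
    name: str
    exit_status: int = 0
    run_in_background: bool = False
    ignore_exit_status: bool = False


def simulate(actions: List[ToyAction]) -> List[str]:
    """Return an event log showing when each action starts and exits."""
    log: List[str] = []
    for action in actions:
        log.append(f"start {action.name}")
        if action.run_in_background:
            # The next action starts without waiting for this one to exit.
            continue
        log.append(f"exit {action.name} status={action.exit_status}")
        if action.exit_status != 0 and not action.ignore_exit_status:
            # Simplifying assumption: the first non-ignored failure ends the run.
            log.append(f"pipeline failed at {action.name}")
            break
    return log


if __name__ == "__main__":
    for event in simulate([
        ToyAction("monitor", run_in_background=True),
        ToyAction("check", exit_status=1, ignore_exit_status=True),
        ToyAction("align"),
    ]):
        print(event)
```

In this model, the background action never blocks the action that follows it, and the ignored failure of the second action does not prevent the third from running.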
As a pipeline author, you create a Resources message that describes the cloud resources that are required to run the pipeline. Specifications in the Resources message can include the following:
- Virtual machine (VM) sizes, including custom machine shapes and preemptibility
- Zones and regions for VM allocation
- Networking options (important for large processing projects, due to network size limits)
- Attached accelerators (GPUs)
- Attached disks
- Service accounts and scopes
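As a rough illustration of how these specifications fit together, the following sketch builds a JSON request body with one action and a Resources description. The field names follow a reading of the v2beta REST reference and should be verified against it before use. The build_run_request function, the machine type, the disk name, and the image are illustrative assumptions.

```python
import json
from typing import Any, Dict, List


def build_run_request(
    image_uri: str,
    commands: List[str],
    regions: List[str],
    machine_type: str = "n1-standard-4",
    preemptible: bool = True,
    disk_size_gb: int = 100,
) -> Dict[str, Any]:
    """Build a request body with one action and a Resources description."""
    return {
        "pipeline": {
            "actions": [{"imageUri": image_uri, "commands": commands}],
            "resources": {
                # Regions in which the VM instance may be allocated.
                "regions": regions,
                "virtualMachine": {
                    "machineType": machine_type,
                    "preemptible": preemptible,
                    # Assumed disk name; actions would mount it by this name.
                    "disks": [{"name": "work", "sizeGb": disk_size_gb}],
                    "serviceAccount": {
                        "scopes": [
                            "https://www.googleapis.com/auth/cloud-platform"
                        ]
                    },
                },
            },
        }
    }


if __name__ == "__main__":
    body = build_run_request("ubuntu:22.04", ["echo", "hello"], ["us-central1"])
    print(json.dumps(body, indent=2))
```

Keeping the body as a plain dictionary makes it easy to inspect or adjust before it is sent with whichever HTTP or API client you prefer.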
The Cloud Life Sciences API performs the following tasks when it receives a Pipeline object through an API call:
- Creates a Compute Engine VM instance, based on the Resources message content.
- Downloads all Docker images specified in an Action description.
- Runs each Action object as a new Docker container with the specified image and command.
- Deletes the Compute Engine VM instance.
A typical set of tasks described in an Action description can include the following:
- Download input files.
- Run commands.
- Copy logs to Cloud Storage.
- Upload output files.
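The four tasks above might be expressed as a list of action descriptions, one container per task. This is a sketch under stated assumptions: typical_actions is a hypothetical helper, the Cloud SDK image name and the /mnt/work mount path are illustrative choices, the log path /google/logs/output is assumed to be where the service writes combined logs, and a disk named work is assumed to be declared in the Resources message.

```python
from typing import Any, Dict, List

SDK_IMAGE = "google/cloud-sdk:slim"  # Assumed image that provides gsutil.
WORK_MOUNT = [{"disk": "work", "path": "/mnt/work"}]  # Assumed disk and path.


def typical_actions(
    input_uri: str,
    output_uri: str,
    log_uri: str,
    tool_image: str,
    tool_command: List[str],
    log_path: str = "/google/logs/output",  # Assumed log location on the VM.
) -> List[Dict[str, Any]]:
    """Return actions that download, run, copy logs, and upload, in order."""
    return [
        {   # 1. Download the input file to the shared work disk.
            "imageUri": SDK_IMAGE,
            "commands": ["gsutil", "cp", input_uri, "/mnt/work/input"],
            "mounts": WORK_MOUNT,
        },
        {   # 2. Run the analysis tool against the downloaded input.
            "imageUri": tool_image,
            "commands": tool_command,
            "mounts": WORK_MOUNT,
        },
        {   # 3. Copy logs to Cloud Storage.
            "imageUri": SDK_IMAGE,
            "commands": ["gsutil", "cp", log_path, log_uri],
        },
        {   # 4. Upload the output file.
            "imageUri": SDK_IMAGE,
            "commands": ["gsutil", "cp", "/mnt/work/output", output_uri],
            "mounts": WORK_MOUNT,
        },
    ]
```

Because each action is a separate container, the shared disk mount is what lets the tool in step 2 read the file that step 1 downloaded.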
The Cloud Life Sciences API uses Google Cloud's scalable, high-performance storage and processing power, along with preemptible VM instances, GPUs, and tensor processing units (TPUs), to help provide a secure and scalable environment for data analysis. This includes running industry-standard secondary analysis frameworks, such as the Genome Analysis Toolkit (GATK) and DeepVariant, on a single sample or on joint datasets.
The Cloud Life Sciences API is integrated with open source workflow engines, such as Cromwell, Nextflow, and Galaxy. In addition, Nextflow and Galaxy support computing on Google Cloud through direct integrations with Compute Engine and Cloud Storage. The following diagram shows a reference architecture that includes components of the Cloud Life Sciences API, as well as other services that are required to run a genomic secondary analysis pipeline in Google Cloud.
Genomic data processing: using Cromwell to run GATK Best Practices workflows
You can use the Cloud Life Sciences API with a workflow engine to orchestrate tasks. In the following example, Cromwell is used to perform secondary analysis, while applying the GATK Best Practices workflows.
Cromwell is an open source workflow management system that can be run in a variety of environments to schedule, run, and manage workflows. Cromwell reads a workflow defined in the Workflow Description Language (WDL) and runs individual tasks from the workflow by using the Cloud Life Sciences API. In this case, a Cromwell server runs on a Compute Engine VM instance, with a Cloud SQL server for storing operational data. The Cromwell server then runs each task defined in the workflow by calling the Cloud Life Sciences API, which starts a VM instance for each task, as configured in the WDL.
The GATK includes tools for variant discovery and genotyping. The GATK Best Practices workflows include a number of pipelines for specific use cases. This example focuses on the production workflow, PairedEndSingleSampleWf. The workflow includes data preprocessing and initial variant calling for germline single nucleotide polymorphism (SNP) and indel discovery in single-sample human whole-genome sequencing data.
The workflow takes human whole-genome paired-end sequencing data in the unmapped BAM (uBAM) format with one or more read groups and uses the Hg38 reference genome with ALT contigs. The outputs include a CRAM file, a CRAM index, a CRAM MD5 checksum, a GVCF and a corresponding index, a BQSR report, and several summary metrics. The sample used is a downsampled version of NA12878 that contains only three read groups from the full sample.
The workflow is packaged as a series of WDL files that define the individual tasks, along with an input specification and generic options. The WDL files specify the cloud resources for each task, such as the number of CPU cores and memory for each VM instance, which container to install on each VM instance, and the command to run, as well as input and output file locations. There are other specifications as well; for example, whether a step should run on preemptible machines, and how many times to retry an Action object.
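To make the per-task specifications more concrete, the following sketch builds a Cromwell workflow options document that sets default runtime attributes, mirroring the kind of values a WDL task's runtime section carries. It assumes the attribute names docker, cpu, memory, preemptible, maxRetries, and zones as used by Cromwell's Google backend, so confirm them against the Cromwell documentation for your version. The workflow_options function and the image name are hypothetical.

```python
import json
from typing import Any, Dict, List


def workflow_options(
    docker_image: str,
    cpu: int,
    memory_gb: int,
    preemptible_tries: int,
    max_retries: int,
    zones: List[str],
) -> Dict[str, Any]:
    """Build workflow options with default runtime attributes for tasks."""
    if cpu < 1 or memory_gb < 1:
        raise ValueError("cpu and memory_gb must be positive")
    if preemptible_tries < 0 or max_retries < 0:
        raise ValueError("retry counts must not be negative")
    return {
        "default_runtime_attributes": {
            "docker": docker_image,            # Container to run on each VM.
            "cpu": cpu,                        # Number of CPU cores.
            "memory": f"{memory_gb} GB",       # Memory for each VM instance.
            "preemptible": preemptible_tries,  # Attempts on preemptible VMs.
            "maxRetries": max_retries,         # Retries after a task failure.
            "zones": " ".join(zones),          # Space-separated zone list.
        }
    }


if __name__ == "__main__":
    options = workflow_options(
        "example.invalid/gatk:placeholder",  # Hypothetical image name.
        cpu=4,
        memory_gb=16,
        preemptible_tries=3,
        max_retries=1,
        zones=["us-central1-a", "us-central1-b"],
    )
    print(json.dumps(options, indent=2))
```

Values set in a task's own runtime section would normally take precedence over defaults like these, so the options file is best suited for settings that apply to most tasks.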
The following diagram shows a typical architecture for performing genomic secondary analysis by using Cromwell to run the GATK Best Practices workflows with the Cloud Life Sciences API, including the steps required to run the analysis.
The diagram shows the following steps for running a secondary analysis:
- After sequencing, you save the raw base calls to local storage or network-attached storage (NAS), where they're converted to uBAM files. You then transfer these uBAM files to a Cloud Storage bucket.
- You authenticate to act as a service account by using Identity and Access Management (IAM).
- You submit the job to the Cromwell server that's running on Compute Engine.
- The request goes through Identity-Aware Proxy, which is configured to allow access to Virtual Private Cloud by using Google Cloud firewall rules.
- The Cromwell server pulls the workflow for the GATK from public repositories.
- The Cromwell server makes calls to the Cloud Life Sciences API to start jobs.
- The Cloud Life Sciences API pulls the containers for each task from the GATK public repositories.
- The Cloud Life Sciences API starts VMs on Compute Engine for each task and starts the container on each machine.
- The VM instance retrieves input files from Cloud Storage buckets.
- The VM instance performs computational tasks and saves intermediate, log, and output files to a Cloud Storage bucket.
Genomic data processing: running DeepVariant
DeepVariant is an open source analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. This analysis pipeline was developed by a research team within Google, and uses TensorFlow for SNP and indel variant calling on exomes or genomes. DeepVariant is used to transform aligned sequencing reads into variant calls.
To use DeepVariant, at the highest level, you need to provide three inputs:
- A reference genome in FASTA format and its corresponding .fai index file, generated by using the Samtools faidx command.
- An aligned reads file in BAM format and its corresponding index file (.bai). The reads must be aligned to the reference genome provided.
- The DeepVariant ML model to use for variant calling.
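The three inputs map onto the command line of the run_deepvariant entry point that the DeepVariant Quick Start describes. The following is a minimal sketch that only assembles the argument list. The helper names are hypothetical, the binary path is assumed to match the published Docker image, and the index file names assume the usual Samtools convention of appending .fai and .bai to the data file name.

```python
from typing import List


def expected_index_files(ref_fasta: str, reads_bam: str) -> List[str]:
    """Return the index files expected next to the reference and the reads."""
    return [ref_fasta + ".fai", reads_bam + ".bai"]


def deepvariant_command(
    ref_fasta: str,
    reads_bam: str,
    output_vcf: str,
    model_type: str = "WGS",
    num_shards: int = 1,
) -> List[str]:
    """Assemble the run_deepvariant argument list from the three inputs."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [
        "/opt/deepvariant/bin/run_deepvariant",  # Assumed path in the image.
        f"--model_type={model_type}",  # Selects the model used for calling.
        f"--ref={ref_fasta}",          # Reference genome in FASTA format.
        f"--reads={reads_bam}",        # Reads aligned to that reference.
        f"--output_vcf={output_vcf}",  # Variant calls in VCF format.
        f"--num_shards={num_shards}",
    ]


if __name__ == "__main__":
    print(" ".join(deepvariant_command(
        "/input/ref.fasta", "/input/sample.bam", "/output/sample.vcf.gz",
        num_shards=4,
    )))
    print(expected_index_files("/input/ref.fasta", "/input/sample.bam"))
```

Returning a list rather than a single string lets the same arguments be passed to a container's command field or to a local process runner without shell quoting issues.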
The output of DeepVariant is a list of all variant calls in VCF format. DeepVariant is integrated with the TensorFlow machine learning framework.
You can run DeepVariant from a Docker container or directly from binaries, and you can run it on on-premises hardware or in the cloud. DeepVariant includes support for hardware accelerators like GPUs and TPUs. You can run DeepVariant from a Docker container with a single command, as described in the DeepVariant Quick Start guide. For production use cases on Google Cloud, we recommend that you integrate the DeepVariant process into your workflow and make the call from your workflow engine.
Alternatively, you can use DeepVariant Runner, which uses the Cloud Life Sciences API in a way that's similar to the methods explained in the Running DeepVariant tutorial. This enables you to run DeepVariant at scale, by using a Docker-based pipeline that's optimized for cost and speed.
The following diagram shows a typical architecture for running a DeepVariant pipeline on Google Cloud.
The following steps, as represented in the preceding diagram, run DeepVariant:
- After creating mapped DNA reads from samples, you transfer these BAM files to a Cloud Storage bucket.
- You authenticate to act as a service account by using IAM.
- You run DeepVariant, which makes calls to the Cloud Life Sciences API to start jobs.
- The Cloud Life Sciences API pulls the DeepVariant container for each task from the public repositories.
- The Cloud Life Sciences API starts VMs on Compute Engine for each task and starts the container on each machine.
- The VM instance retrieves input files from Cloud Storage buckets.
- The VM instance performs computational tasks and saves intermediate, log, and output files to a Cloud Storage bucket.
Data governance
The architecture described in this document focuses on the components that are specific to genomic data analysis. For a production integration that includes human genomic data, you might need to configure additional components in Google Cloud, depending on the requirements and best practices in your jurisdiction.
Google Cloud provides an end-to-end architecture that encapsulates Google Cloud best practices to help meet healthcare security, privacy, and compliance needs. The Cloud Healthcare Data Protection Toolkit includes many of the security and privacy best practice controls recommended for healthcare data, such as configuring appropriate access, maintaining audit logs, and monitoring for suspicious activities. To learn more about these capabilities and how Google Cloud helps to protect customer data, see the Trusting your data with Google Cloud whitepaper.
Data residency
For organizations that have data residency requirements, see the "General Services" section of the Service Specific Terms to review Google Cloud's data residency commitment document, which highlights tools and controls that are available to help configure users' environments to support such requirements.
Organization policies
To limit the physical location of a resource across a project, you can set an organization policy that includes a resource location constraint. For a list of supported services, see resource locations supported services.
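As an illustration, the following sketch builds a policy body for the resource location constraint. It assumes the constraint name gcp.resourceLocations, the version 2 Organization Policy format, and value groups such as in:us-central1-locations, so verify each of these against the Organization Policy documentation. The resource_location_policy function is a hypothetical helper.

```python
from typing import Any, Dict, List


def resource_location_policy(
    project_id: str, allowed_locations: List[str]
) -> Dict[str, Any]:
    """Build a policy body that restricts where resources can be created."""
    if not allowed_locations:
        raise ValueError("at least one allowed location is required")
    return {
        # Assumed policy name format for a project-level policy.
        "name": f"projects/{project_id}/policies/gcp.resourceLocations",
        "spec": {
            "rules": [
                {"values": {"allowedValues": list(allowed_locations)}}
            ]
        },
    }


if __name__ == "__main__":
    import json

    policy = resource_location_policy("foo", ["in:us-central1-locations"])
    print(json.dumps(policy, indent=2))
```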
Google Cloud resource identifiers
For purposes of this document, data residency commitments do not apply to resource identifiers, attributes, or other data labels. You are responsible for ensuring that no sensitive data is exposed in these identifiers, attributes, or other data labels, such as in a filename.
Cloud Life Sciences API regions and zones
When making Cloud Life Sciences API calls, you must specify the region to which the request is sent. The following example shows an endpoint URI for running a pipeline in the Google Cloud project foo and the region us-central1:
v2beta/projects/foo/locations/us-central1/workflows:runPipeline
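The URI above can be assembled from its parts, as in the following sketch. The pipeline_endpoint function is a hypothetical helper. The method suffix defaults to the form shown in this document and is left as a parameter because the published REST reference might spell the method path differently, so check the reference before you send requests.

```python
def pipeline_endpoint(
    project: str,
    location: str,
    method: str = "workflows:runPipeline",
    version: str = "v2beta",
) -> str:
    """Build the endpoint path for a project and region."""
    for label, value in (("project", project), ("location", location)):
        # A slash would silently change the resource hierarchy in the path.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"{version}/projects/{project}/locations/{location}/{method}"


if __name__ == "__main__":
    print(pipeline_endpoint("foo", "us-central1"))
```

Because the region is part of the path, changing the location argument is what determines where the operation metadata is saved.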
Any metadata that is saved for the operation—including container image names, input and output filenames, and other information sent in the request to the Cloud Life Sciences API—is saved in this region.
When the Cloud Life Sciences API starts a Compute Engine VM instance, the API call must include a region or zone in which to start the VM instance. You can configure one or more regions or zones to restrict the location of the VM. The VM instance, data at rest in Compute Engine, and persistent disks stay in the specified region. Available features for Compute Engine instances, such as CPU platforms, machine types, SSDs, and GPUs, differ by region and zone. Therefore, make sure that the required resources are available in a region or zone before you restrict usage to that region or zone.
For more information, see Compute Engine data residency details and terms.
Google Cloud Storage
You can store input and output files, temporary work files, and persistent disk snapshots in Cloud Storage. Data at rest in Cloud Storage buckets stays within the region, dual-region, or multi-region that you select when configuring the bucket. For more information, see the current list of Cloud Storage bucket locations.
To optimize availability, performance, and efficiency, you can use a Cloud Storage dual-region. When you select this option, the data in the bucket is asynchronously copied to two specific regions, making the data redundant across regions. For more information, see Cloud Storage dual-regions.
For performance and for data residency, use the same region for Cloud Storage, Compute Engine, and the Cloud Life Sciences API, if possible.
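A configuration check along these lines can catch mismatched locations before a pipeline runs. This is a minimal sketch with hypothetical helper names. It assumes that zone names consist of a region name followed by a hyphen and a zone suffix (for example, us-central1-a), and it compares locations case-insensitively because bucket locations are often reported in uppercase.

```python
from typing import Iterable


def zone_to_region(zone: str) -> str:
    """Derive the region from a zone name such as us-central1-a."""
    region, separator, suffix = zone.rpartition("-")
    if not separator or not region or not suffix:
        raise ValueError(f"unexpected zone name: {zone!r}")
    return region


def colocated(
    bucket_location: str, vm_zones: Iterable[str], api_location: str
) -> bool:
    """Return True if the bucket, every VM zone, and the API share a region."""
    target = api_location.lower()
    if bucket_location.lower() != target:
        # Dual-region and multi-region buckets also end up here.
        return False
    return all(zone_to_region(zone).lower() == target for zone in vm_zones)


if __name__ == "__main__":
    print(colocated("US-CENTRAL1", ["us-central1-a", "us-central1-f"], "us-central1"))
```

The check treats dual-region and multi-region buckets as not co-located, which is a deliberately strict choice. Relax it if your residency requirements allow those bucket types.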
For more information, see Cloud Storage data residency details and terms.
What's next
- Review the Cloud Life Sciences documentation.
- Try out Cloud Life Sciences tutorials.
- Read the Cloud Storage introduction.
- Learn about IAM.
- Continue your genomic data processing by using Variant Transforms tools.
- Use BigQuery to perform analysis of your genomic data and variants.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.