Overview
Cloud Life Sciences is a suite of services and tools for managing, processing, and transforming life sciences data. It also enables advanced insights and operational workflows using highly scalable and compliant infrastructure. Cloud Life Sciences includes features such as the Cloud Life Sciences API, extract-transform-load (ETL) tools, and more.
This page provides an overview of the services and tools that Cloud Life Sciences (and Google Cloud more generally) offers and how you can leverage their features with your life sciences data.
Cloud Life Sciences API overview
The Cloud Life Sciences API provides a simple way to run a series of containers on Compute Engine VMs in Google Cloud. The Cloud Life Sciences API consists of a single main operation:
projects.locations.pipelines.run
: Runs a pipeline.
And three generic operations:
projects.locations.operations.get
: Gets the latest state of a pipeline.

projects.locations.operations.list
: Lists all pipelines running in a Google Cloud region in your Google Cloud project.

projects.locations.operations.cancel
: Cancels a pipeline.
The Cloud Life Sciences API is aimed at developers who want to build on or create job management tools, such as dsub, or workflow engines, such as Cromwell. The Cloud Life Sciences API serves as a backend for these tools and systems, providing job scheduling for Docker-based tasks that perform secondary genomic analysis on Compute Engine. You can submit batch operations from anywhere and run them on Google Cloud. The Docker images can be packaged manually, or you can use existing Docker images.
The most common use case for the Cloud Life Sciences API is running an existing tool or custom script that reads and writes files, typically to and from Cloud Storage. The Cloud Life Sciences API can run the tool or script independently over hundreds or thousands of these files.
You can access the Cloud Life Sciences API using the REST API, RPC API, or the Google Cloud CLI.
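For example, the following Python sketch uses the google-api-python-client library to build a discovery-based client for the Cloud Life Sciences v2beta API and list the pipeline operations in a region. The project ID and region are placeholders, and the snippet assumes that Application Default Credentials are available.

from googleapiclient import discovery

# Build a client for the Cloud Life Sciences v2beta API. Authentication uses
# Application Default Credentials (an assumption of this sketch).
service = discovery.build("lifesciences", "v2beta")

# Placeholder project ID and region; replace with your own values.
parent = "projects/my-project/locations/us-central1"

# projects.locations.operations.list: lists pipeline operations in the region.
response = service.projects().locations().operations().list(name=parent).execute()

for operation in response.get("operations", []):
    print(operation["name"], "done" if operation.get("done") else "running")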
Running the Cloud Life Sciences API
If you are creating a workflow engine, a typical series of steps that the engine takes is:

- Parsing the input workflow language and constructing a series of JSON-formatted Pipeline objects that the Cloud Life Sciences API accepts. The engine then sends the requests defined by each Pipeline object to the Cloud Life Sciences API (see the sketch following this list).
- Monitoring the requests and merging the outputs from the requests before continuing to the next step.
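As a sketch of the first step, a workflow engine might translate each stage of a parsed workflow into a JSON-serializable Pipeline object before submitting it. The stage format and helper function below are hypothetical; only the actions, imageUri, commands, and resources fields come from the Cloud Life Sciences API.

# Hypothetical parsed workflow stages: each stage names a container image
# and the command to run in it.
stages = [
    {"image": "google/cloud-sdk",
     "command": ["gsutil", "cp", "gs://my-bucket/input.in", "/tmp"]},
    {"image": "bash",
     "command": ["-c", "sha1sum /tmp/input.in > /tmp/output.sha1"]},
]

def stage_to_pipeline(stage):
    """Convert one hypothetical workflow stage into a Pipeline object."""
    return {
        "actions": [
            {"imageUri": stage["image"], "commands": stage["command"]},
        ],
        "resources": {
            "regions": ["us-central1"],
            "virtualMachine": {"machineType": "n1-standard-1"},
        },
    }

# The engine would send each of these Pipeline objects to pipelines.run.
pipelines = [stage_to_pipeline(stage) for stage in stages]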
The following provides a deeper explanation of the first step:
The pipeline runs by calling the pipelines.run method. This method takes a Pipeline object and an optional set of labels to start running a pipeline. The Pipeline object consists of one or more Action descriptions and a Resources object that describes the Google Cloud resources required to run the pipeline.

The following sample shows how to configure a simple Pipeline that runs a single Action (printing "Hello, world" to the terminal) on a small, standard (n1-standard-1) VM:
"pipeline": {
"actions": [
{
"imageUri": "bash",
"commands": [ "-c", "echo Hello, world" ]
},
],
"resources": {
"regions": ["us-central11"],
"virtualMachine": {
"machineType": "n1-standard-1",
}
}
}
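As a sketch, the following Python snippet submits the "Hello, world" pipeline above by calling projects.locations.pipelines.run through the google-api-python-client library. The project ID and region are placeholders, and the snippet assumes Application Default Credentials.

from googleapiclient import discovery

service = discovery.build("lifesciences", "v2beta")

# Placeholder project ID and region; replace with your own values.
parent = "projects/my-project/locations/us-central1"

# The request body wraps the Pipeline object shown above.
body = {
    "pipeline": {
        "actions": [
            {"imageUri": "bash", "commands": ["-c", "echo Hello, world"]},
        ],
        "resources": {
            "regions": ["us-central1"],
            "virtualMachine": {"machineType": "n1-standard-1"},
        },
    }
}

# pipelines.run returns a long-running operation describing the pipeline.
operation = service.projects().locations().pipelines().run(
    parent=parent, body=body).execute()
print("Started operation:", operation["name"])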
The following sample shows how to configure a pipeline with multiple Action objects that run in sequence. The actions copy a file from Cloud Storage to the VM, calculate the file's SHA-1 hash, and then write the hash back to the original Cloud Storage bucket.
"actions": [
{
"imageUri": "google/cloud-sdk",
"commands": [ "gsutil", "cp", "gs://my-bucket/input.in", "/tmp" ]
},
{
"imageUri": "bash",
"commands": [ "-c", "sha1sum /tmp/in > /tmp/test.sha1" ]
},
{
"imageUri": "google/cloud-sdk",
"commands": [ "gsutil", "cp", "/tmp/output.sha1", "gs://my-bucket/output.sha1" ]
},
],
Calling pipelines.run returns a long-running operation that you can query to check the status of the pipeline or to cancel it.
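For example, a minimal Python sketch for polling that operation might look like the following. The operation name is a placeholder taken from the response to pipelines.run, and the polling interval is an arbitrary choice.

import time

from googleapiclient import discovery

service = discovery.build("lifesciences", "v2beta")

# Placeholder: the "name" field of the operation returned by pipelines.run.
operation_name = "projects/my-project/locations/us-central1/operations/OPERATION_ID"

# Poll projects.locations.operations.get until the pipeline finishes.
operation = service.projects().locations().operations().get(
    name=operation_name).execute()
while not operation.get("done"):
    time.sleep(30)  # Arbitrary polling interval.
    operation = service.projects().locations().operations().get(
        name=operation_name).execute()

if "error" in operation:
    print("Pipeline failed:", operation["error"])
else:
    print("Pipeline succeeded:", operation.get("response"))

# To stop a running pipeline instead, call projects.locations.operations.cancel:
# service.projects().locations().operations().cancel(
#     name=operation_name, body={}).execute()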
Lifecycle of a Cloud Life Sciences API request
The typical lifecycle of a pipeline running on the Cloud Life Sciences API is as follows:
- The Cloud Life Sciences API allocates the Google Cloud resources required to run the pipeline. At a minimum, this typically involves allocating a Compute Engine virtual machine (VM) with disk space.
- After a VM becomes available, the Cloud Life Sciences API runs each action defined in the pipeline. These actions perform operations such as copying input files, processing data, or copying output files.
- When the pipeline finishes, the Cloud Life Sciences API releases any allocated resources, including deleting any created VMs.
BigQuery ETL using the Variant Transforms tool
To load your life sciences data into BigQuery for further analysis, use the Variant Transforms tool.
Variant Transforms is an open source tool that is based on Apache Beam and runs on Dataflow. It is the recommended way to transform and load genomics data into Google Cloud for further analysis.
Using other Google Cloud technologies with life sciences data
There are several Google Cloud technologies that interact with Cloud Life Sciences or can be used to analyze and process life sciences data. These include:
- BigQuery: Use BigQuery for ad-hoc queries of massive structured datasets, such as genomic variants. Use cases include analyzing variants and running complex JOIN queries to analyze data described by genomic regional intervals, or overlaps. The Variant Transforms tool provides a way to transform and load VCF files directly into BigQuery (see the sketch after this list).
- Cloud Storage: Use Cloud Storage as an object store for raw VCF, FASTQ, and BAM files that you can load into BigQuery using Variant Transforms for large-scale analysis.
- Dataflow: The Variant Transforms tool uses Dataflow to create highly scalable data processing pipelines that load data into BigQuery.
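For example, the following Python sketch runs an ad-hoc query with the BigQuery client library against a variants table produced by Variant Transforms. The table name is a placeholder, and the column names (reference_name, start_position) are assumed to follow the Variant Transforms schema.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table loaded by Variant Transforms; reference_name and
# start_position are assumed column names from its BigQuery schema.
query = """
    SELECT reference_name, COUNT(*) AS variant_count
    FROM `my-project.genomics.variants`
    WHERE start_position BETWEEN 1000000 AND 2000000
    GROUP BY reference_name
    ORDER BY variant_count DESC
"""

for row in client.query(query).result():
    print(row.reference_name, row.variant_count)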