Overview of Cloud Life Sciences

Overview

Cloud Life Sciences is a suite of services and tools for managing, processing, and transforming life sciences data. It also enables advanced insights and operational workflows using highly scalable and compliant infrastructure. Cloud Life Sciences includes features such as the Cloud Life Sciences API, extract-transform-load (ETL) tools, and more.

This page provides an overview of the services and tools that Cloud Life Sciences (and Google Cloud more generally) offers and how you can leverage their features with your life sciences data.

Cloud Life Sciences API overview

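The Cloud Life Sciences API provides a simple way to execute a series of Compute Engine containers on Google Cloud. The Cloud Life Sciences API consists of a single main operation, pipelines.run, which runs a pipeline, and three generic operations for getting, listing, and canceling the long-running operations that pipelines.run returns.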

The Cloud Life Sciences API is aimed at developers who want to build on or create job management tools, such as dsub, or workflow engines, such as Cromwell. It serves as a backend for these tools and systems by scheduling Docker-based tasks that perform secondary genomic analysis on Compute Engine containers. You can submit batch operations from anywhere and run them on Google Cloud. You can package the Docker images yourself or use existing images.

The most common use case for the Cloud Life Sciences API is running an existing tool or custom script that reads and writes files, typically to and from Cloud Storage. The Cloud Life Sciences API can run such a tool independently over hundreds or thousands of these files.

You can access the Cloud Life Sciences API using the REST API, RPC API, or the Google Cloud CLI.

Running the Cloud Life Sciences API

If you are creating a workflow engine, the engine typically takes the following steps:

  1. Parsing the input workflow language and constructing a series of JSON-formatted Pipeline objects that the Cloud Life Sciences API accepts. The engine sends a series of requests defined in the Pipeline object to the Cloud Life Sciences API.
  2. Monitoring the requests and merging the outputs from the requests before continuing to the next step.

The following provides a deeper explanation of the first step:

To run a pipeline, the engine calls the pipelines.run method. This method takes a Pipeline object and an optional set of labels. The Pipeline object consists of one or more Action descriptions and a Resources object that describes which Google Cloud resources are required to run the pipeline.

The following sample shows how to configure a simple Pipeline that runs a single Action (printing "Hello, world" to the terminal) on a small, standard (n1-standard-1) VM:

"pipeline": {
  "actions": [
    {
      "imageUri": "bash",
      "commands": [ "-c", "echo Hello, world" ]
    }
  ],
  "resources": {
    "regions": ["us-central1"],
    "virtualMachine": {
      "machineType": "n1-standard-1"
    }
  }
}
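As a rough sketch of how a client might wrap this Pipeline in a pipelines.run request, the following Python builds the request body (the Pipeline plus the optional labels) and a REST URL. The project ID, location, label values, and the helper name build_run_request are placeholders, and the v2beta endpoint pattern is an assumption to verify against the API reference. Nothing is sent over the network.

```python
import json

# Assumed REST endpoint pattern for pipelines.run; verify against the API reference.
RUN_URL_TEMPLATE = (
    "https://lifesciences.googleapis.com/v2beta/"
    "projects/{project}/locations/{location}/pipelines:run"
)


def build_run_request(project, location, labels=None):
    """Return (url, body) for a hello-world pipelines.run request."""
    pipeline = {
        "actions": [
            {"imageUri": "bash", "commands": ["-c", "echo Hello, world"]}
        ],
        "resources": {
            "regions": ["us-central1"],
            "virtualMachine": {"machineType": "n1-standard-1"},
        },
    }
    body = {"pipeline": pipeline}
    if labels:
        # Labels are optional, so the key is only added when provided.
        body["labels"] = dict(labels)
    url = RUN_URL_TEMPLATE.format(project=project, location=location)
    return url, body


if __name__ == "__main__":
    # "my-project" and the label are hypothetical values for illustration.
    url, body = build_run_request("my-project", "us-central1", {"team": "genomics"})
    print(url)
    print(json.dumps(body, indent=2))
```

Serializing with json.dumps guarantees strict JSON, which avoids hand-editing mistakes such as trailing commas.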

The following sample shows how to configure a series of Action objects that run one after another. The Actions copy a file from Cloud Storage to the VM, calculate the file's SHA-1 hash, and then write the hash back to the original Cloud Storage bucket.

"actions": [
  {
    "imageUri": "google/cloud-sdk",
    "commands": [ "gsutil", "cp", "gs://my-bucket/input.in", "/tmp" ]
  },
  {
    "imageUri": "bash",
    "commands": [ "-c", "sha1sum /tmp/input.in > /tmp/output.sha1" ]
  },
  {
    "imageUri": "google/cloud-sdk",
    "commands": [ "gsutil", "cp", "/tmp/output.sha1", "gs://my-bucket/output.sha1" ]
  }
],
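Because each file is processed independently, a client could generate one such action list per input object. The sketch below mirrors the copy, hash, and copy-back steps of the sample above for an arbitrary Cloud Storage URI. The helper names hash_actions and fan_out, the bucket paths, and the output naming scheme are all hypothetical and not part of the API.

```python
import posixpath
import shlex


def hash_actions(input_uri, output_uri, workdir="/tmp"):
    """Build copy-in, hash, and copy-out actions for one Cloud Storage object."""
    name = posixpath.basename(input_uri)       # e.g. "input.in"
    local_in = posixpath.join(workdir, name)   # where gsutil places the file
    local_out = local_in + ".sha1"
    # Quote paths because the hash step is passed to a shell via "-c".
    hash_cmd = f"sha1sum {shlex.quote(local_in)} > {shlex.quote(local_out)}"
    return [
        {"imageUri": "google/cloud-sdk",
         "commands": ["gsutil", "cp", input_uri, workdir]},
        {"imageUri": "bash",
         "commands": ["-c", hash_cmd]},
        {"imageUri": "google/cloud-sdk",
         "commands": ["gsutil", "cp", local_out, output_uri]},
    ]


def fan_out(input_uris, output_prefix):
    """Return one action list per input file, keyed by input URI."""
    prefix = output_prefix.rstrip("/")
    return {
        uri: hash_actions(uri, f"{prefix}/{posixpath.basename(uri)}.sha1")
        for uri in input_uris
    }
```

Each resulting action list would go into its own Pipeline object, so that every file is handled by a separate request.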

Calling pipelines.run returns a long-running operation that you can query for the pipeline's status or use to cancel the pipeline.
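A minimal polling sketch follows, assuming the usual long-running operation shape in which the returned resource carries a done flag and, once finished, either an error or a response field. The fetch_operation callable is hypothetical: it stands in for whatever client call retrieves the operation by name, so no specific client library is implied.

```python
import time


def wait_for_operation(fetch_operation, name, interval=10.0, timeout=3600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the operation reports done, then return its response.

    fetch_operation(name) must return a dict-like operation resource.
    Raises RuntimeError if the operation finished with an error and
    TimeoutError if it is still running after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        op = fetch_operation(name)
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(f"{name} failed: {op['error']}")
            return op.get("response", {})
        if clock() >= deadline:
            raise TimeoutError(f"{name} still running after {timeout}s")
        sleep(interval)
```

The sleep and clock parameters are injectable only so that the loop can be exercised without real waiting; a workflow engine would typically run one such wait per submitted request before merging outputs.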

Lifecycle of a Cloud Life Sciences API request

The typical lifecycle of a pipeline running on the Cloud Life Sciences API is as follows:

  1. The Cloud Life Sciences API allocates the Google Cloud resources required to run the pipeline. At a minimum, this typically involves allocating a Compute Engine virtual machine (VM) with disk space.
  2. After a VM becomes available, the Cloud Life Sciences API runs each action defined in the pipeline. These actions perform operations such as copying input files, processing data, or copying output files.
  3. The pipeline releases any allocated resources, including deleting any created VMs.

BigQuery ETL using the Variant Transforms tool

To load your life sciences data into BigQuery for further analysis, use the Variant Transforms tool.

Variant Transforms is an open source tool that is based on Apache Beam and runs on Dataflow. It is the recommended way to transform and load genomics data into Google Cloud for further analysis.

Using other Google Cloud technologies with life sciences data

There are several Google Cloud technologies that interact with Cloud Life Sciences or can be used to analyze and process life sciences data. These include: