Running DeepVariant on Google Cloud Platform

DeepVariant is a genomics variant caller that uses deep neural networks to call genetic variants in germline genomes. Originally developed by Google Brain and Verily Life Sciences, DeepVariant won the 2016 PrecisionFDA Truth Challenge award for Highest SNP Performance. Since then, the Google Brain team has reduced the error rate by over 50%. On Dec 4th, 2017, Google Brain released DeepVariant as open source.

Through a collaboration between the genomics teams at Google Cloud and Google Brain, Google has created two highly scalable versions of DeepVariant for those who want to run it on large data sets. Both versions can be used on Google Cloud Platform at no extra charge, though you may incur compute, storage, and other cloud service costs based on your usage.

For an organization that wants to call thousands to millions of genomes, cost is a significant driver in resource planning. The cost-optimized version of DeepVariant calls an aligned 30x genome in 3-4 hours for approximately $8-9 in cloud costs, and an aligned exome in a little over an hour for approximately $0.70.

For an organization where turnaround time is a priority, we have also developed a speed-optimized version of DeepVariant that calls an aligned 30x genome in approximately 90-120 minutes for about $35 in cloud costs, and an aligned exome in a little over half an hour for a few dollars.

If you have questions about DeepVariant on Google Cloud, or if you simply want to connect with other organizations using Google Cloud for their genomic data, join the google-genomics-discuss group. If you’d like to contact the genomics groups at Google Cloud and Google Brain about DeepVariant, please send email to google-genomics-contact@google.com.

Instructions

Get ready

If you are just getting started, please follow the instructions here to get familiar with Google Genomics Pipelines API.
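For example, a minimal setup might look like the sketch below. This assumes the Google Cloud SDK is already installed; the exact command for enabling the Genomics API may differ depending on your SDK version.

# Install the alpha components that provide the gcloud alpha genomics commands.
gcloud components install alpha

# Authenticate and select the project that will be billed for the run.
gcloud auth login
gcloud config set project [your alphanumeric project ID]

# Enable the Genomics API for the project.
gcloud services enable genomics.googleapis.com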

Pipeline runner script

The script below contains the common settings for running the pipeline and can be used to run DeepVariant with any of the configurations described below. Set PROJECT_ID to your alphanumeric project ID (not the numeric project number), OUTPUT_BUCKET to the Cloud Storage bucket where you want the output and intermediate files to be stored, and STAGING_FOLDER_NAME to a unique name for each of your runs (intermediate results are stored in the staging folder so that they can be reused). Once the pipeline finishes, the output file will be available at gs://[OUTPUT_BUCKET]/output.vcf. You may use the storage browser to browse through the Cloud Storage files.

#!/bin/bash
set -euo pipefail
# Set common settings.
PROJECT_ID=[your alphanumeric project ID]
OUTPUT_BUCKET=gs://[your output bucket]
STAGING_FOLDER_NAME=[a unique alphanumeric name for each run]
OUTPUT_FILE_NAME=output.vcf
MODEL=gs://deepvariant/models/DeepVariant/0.4.0/DeepVariant-inception_v3-0.4.0+cl-174375304.data-wgs_standard
IMAGE_VERSION=0.4.1
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
DOCKER_IMAGE_GPU=gcr.io/deepvariant-docker/deepvariant_gpu:"${IMAGE_VERSION}"

# Run the pipeline.
gcloud alpha genomics pipelines run \
  --project "${PROJECT_ID}" \
  --pipeline-file deepvariant_pipeline.yaml \
  --logging "${OUTPUT_BUCKET}"/runner_logs \
  --zones us-west1-b \
  --inputs `echo \
      PROJECT_ID="${PROJECT_ID}", \
      OUTPUT_BUCKET="${OUTPUT_BUCKET}", \
      MODEL="${MODEL}", \
      DOCKER_IMAGE="${DOCKER_IMAGE}", \
      DOCKER_IMAGE_GPU="${DOCKER_IMAGE_GPU}", \
      STAGING_FOLDER_NAME="${STAGING_FOLDER_NAME}", \
      OUTPUT_FILE_NAME="${OUTPUT_FILE_NAME}" \
      | tr -d '[:space:]'`

Please note the operation ID returned by the above script. You can track the status of your operation by running:

gcloud alpha genomics operations describe <operation-id>

The returned data will have done: true when the operation is done. Please see the API documentation for detailed descriptions of all fields. More detailed output from the operation can be found in the path provided by the --logging argument (i.e. "${OUTPUT_BUCKET}"/runner_logs).
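For example, a simple way to wait for completion and then fetch the results is sketched below. This assumes gsutil is installed and that OUTPUT_BUCKET and OUTPUT_FILE_NAME match the values used in the runner script; the exact string printed by --format='value(done)' may vary slightly across gcloud versions.

# Poll the operation until it reports done: true.
OPERATION_ID=[operation ID returned by the runner script]
until [[ $(gcloud alpha genomics operations describe "${OPERATION_ID}" \
             --format='value(done)') == "True" ]]; do
  sleep 60
done

# Inspect the runner logs and copy the final VCF to the local machine.
gsutil ls "${OUTPUT_BUCKET}"/runner_logs
gsutil cp "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" .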

Pipeline configurations

DeepVariant can be run with different configurations (e.g. with and without GPUs, using preemptible VMs, or with different numbers of shards). We have provided some of the most common configurations below. Please save each configuration as deepvariant_pipeline.yaml and use the script provided above to run the pipeline.

You do not need to make any changes to these configurations and can copy them as is. However, once you are more familiar with the pipeline, you may choose to modify the number of shards, workers, zones, etc. depending on your needs. See additional options for more settings.
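For example, assuming you saved the runner script above as run_deepvariant_pipeline.sh (a hypothetical file name), launching any of the configurations is simply:

# deepvariant_pipeline.yaml must be in the current working directory, since the
# runner script passes it via --pipeline-file.
bash run_deepvariant_pipeline.sh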

Quickstart test data configuration

The configuration below runs DeepVariant with the quickstart test data.

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones 'us-*' \
      --docker_image "${DOCKER_IMAGE}" \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam \
      --ref gs://deepvariant/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta.gz \
      --regions "chr20:10,000,000-10,010,000"

Speed-optimized configuration

The configuration below is optimized for speed. It uses 16x64-core VMs for the make_examples step, 16x8-core VMs with GPUs for the call_variants step, and an 8-core VM for the postprocess_variants step. It processes a 30x whole genome sample in ~2 hours and costs ~$35.

Please ensure you have the necessary quota in the specified zones (1025 CPU cores, 16 GPUs, 3.2TB disk, 17 IPs) prior to running the pipeline (see the Compute Engine Resource Quotas page for details).
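For example, you can inspect the current quota limits and usage before launching the run (a quick sketch using standard gcloud commands; check the region that contains your chosen zones):

# Per-region quotas (CPUs, GPUs, persistent disk, in-use IP addresses) for us-west1.
gcloud compute regions describe us-west1 --format="yaml(quotas)"

# Project-wide quotas are listed here.
gcloud compute project-info describe --format="yaml(quotas)"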

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones us-west1-b us-east1-d \
      --docker_image "${DOCKER_IMAGE}" \
      --docker_image_gpu "${DOCKER_IMAGE_GPU}" \
      --gpu \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 1024 \
      --make_examples_workers 16 \
      --make_examples_cores_per_worker 64 \
      --make_examples_ram_per_worker_gb 240 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 16 \
      --call_variants_cores_per_worker 8 \
      --call_variants_ram_per_worker_gb 30 \
      --call_variants_disk_per_worker_gb 50

The default configuration uses NVIDIA® Tesla® K80 GPUs, but you may also use NVIDIA® Tesla® P100 GPUs by adding --accelerator_type nvidia-tesla-p100 to the above configuration. The pipeline runs roughly 15 minutes faster and costs about $4 more. Note that P100 GPUs are available only in a limited set of zones. Please refer to GPUs on Compute Engine for more details.
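For example, with P100 GPUs the GPU-related flags in the cmd section above would become the following (remember to restrict --zones to zones where P100s are available):

      --docker_image_gpu "${DOCKER_IMAGE_GPU}" \
      --gpu \
      --accelerator_type nvidia-tesla-p100 \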

Cost-optimized configuration

The configuration below is optimized to run DeepVariant at low cost. It uses preemptible VMs, which are up to 80% cheaper than regular VMs, but may get preempted at any time. There is built-in logic in the DeepVariant runner to automatically retry preempted jobs. Note that preemptible VMs are not covered by any Service Level Agreement, so if you require guarantees on turnaround time, please use the speed-optimized configuration. You may also specify the number of times you are willing to retry with preemptible VMs using the --max_preemptible_tries flag.

The runtime and cost vary depending on the number of instances that get preempted, but we have observed that for a 30x whole genome sample, the pipeline generally completes in ~3-4 hours and costs ~$8-9. Please refer to the best practices guidelines for how to use preemptible VMs effectively.

Please ensure you have enough quota in the specified zones (1025 CPU cores, 6.4TB disk, 33 IPs) prior to running the pipeline (see the Compute Engine Resource Quotas page for details). Note that preemptible GPUs are not supported.

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones 'us-*' \
      --docker_image "${DOCKER_IMAGE}" \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 512 \
      --make_examples_workers 32 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 32 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5

Exome configuration

The configuration below calls only exome regions. It uses preemptible VMs, costs ~$0.70, and runs in ~70 minutes when no instances are preempted. If you require faster and/or guaranteed turnaround time, remove --preemptible and increase the number of shards and workers in the make_examples step (e.g. 256 shards with 8x32-core workers); a sketch of these changes follows the configuration below.

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones 'us-*' \
      --docker_image "${DOCKER_IMAGE}" \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam \
      --bai gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai \
      --ref gs://deepvariant/exome-case-study-testdata/hs37d5.fa.gz \
      --regions gs://deepvariant/exome-case-study-testdata/refseq.coding_exons.b37.extended50.bed \
      --shards 64 \
      --make_examples_workers 4 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 100 \
      --call_variants_workers 1 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5
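For example, a faster non-preemptible variant of the exome configuration would drop --preemptible and --max_preemptible_tries and replace the sharding and worker flags with something like the sketch below (the per-worker RAM value is an assumption, scaled up from the 16-core setting; adjust it to your quota):

      --shards 256 \
      --make_examples_workers 8 \
      --make_examples_cores_per_worker 32 \
      --make_examples_ram_per_worker_gb 120 \
      --make_examples_disk_per_worker_gb 100 \
      --call_variants_workers 1 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50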

Additional options

In addition to the basic configurations provided above, you may specify additional settings depending on your use case. Please add these to the arguments passed to ./opt/deepvariant_runner/bin/gcp_deepvariant_runner in deepvariant_pipeline.yaml.

  • job_name_prefix: You may specify a prefix to be added to the job names, which is useful for distinguishing particular pipeline runs from others (e.g. for billing purposes). For instance, by specifying --job_name_prefix gpu_, the billing reports would show gpu_make_examples, gpu_call_variants, and gpu_postprocess_variants as the three stages of the pipeline.
  • jobs_to_run: You may use this setting to run only part of the pipeline, which is useful if part of your pipeline failed for some reason (e.g. due to incorrect input paths). For instance, if your pipeline failed in the call_variants step, you can rerun the pipeline by specifying --jobs_to_run call_variants postprocess_variants, which would skip the make_examples step and reuse existing results (see the example after this list). Please note that all settings must be the same as the original run in order to reuse existing intermediate results (in particular, the staging location and the number of shards and workers).
  • bai, ref_fai, ref_gzi: These settings are useful if your index files are not in the same location as the raw files and/or do not have the common extensions. For BAM files, we require a .bam.bai file to be present at the same location as the BAM file; similarly, we require .fai and .gzi files for the reference index and gzip index, respectively. Note that a .gzi file is only required if your reference is compressed.
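For example, to rerun only the last two stages of a GPU pipeline with a distinguishing prefix, append the following to the existing argument list in deepvariant_pipeline.yaml (add a trailing backslash to the previous last line):

      --job_name_prefix gpu_ \
      --jobs_to_run call_variants postprocess_variants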