Running a DeepVariant Pipeline

This page explains how to run a pipeline on Google Cloud Platform using DeepVariant. DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline on Google Cloud Platform using DeepVariant
  • Write configuration files for different DeepVariant use cases
  • Estimate costs and turnaround times for different DeepVariant pipeline configurations

Costs

This tutorial uses billable components of GCP, including:

  • Compute Engine

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Genomics and Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Update and install gcloud components:
    gcloud components update &&
    gcloud components install alpha

Run the pipeline

To run the pipeline, you need to create a configuration file and run a script that invokes the Cloud Genomics Pipelines API. The script has common settings so that you can run it with other pipeline configurations.

  1. To create a configuration file, copy the following text and save it to a file named deepvariant_pipeline.yaml.

    name: deepvariant_pipeline
    inputParameters:
    - name: PROJECT_ID
    - name: OUTPUT_BUCKET
    - name: MODEL
    - name: DOCKER_IMAGE
    - name: DOCKER_IMAGE_GPU
    - name: STAGING_FOLDER_NAME
    - name: OUTPUT_FILE_NAME
    docker:
      imageName: gcr.io/deepvariant-docker/deepvariant_runner
      cmd: |
        ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
        --project "${PROJECT_ID}" \
        --zones 'us-*' \
        --docker_image "${DOCKER_IMAGE}" \
        --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
        --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
        --model "${MODEL}" \
        --bam gs://deepvariant/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam \
        --ref gs://deepvariant/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta.gz \
        --regions "chr20:10,000,000-10,010,000"
    

  2. Copy the following text and save it to a file named script.sh, substituting the variables with the relevant resources from your GCP project. The OUTPUT_BUCKET variable designates a Cloud Storage bucket that holds the pipeline's output and intermediate files. The STAGING_FOLDER_NAME variable is a unique name for a directory in the bucket. You can set the variable for each run of the pipeline.

    #!/bin/bash
    set -euo pipefail
    # Set common settings.
    PROJECT_ID=PROJECT_ID
    OUTPUT_BUCKET=gs://OUTPUT_BUCKET
    STAGING_FOLDER_NAME=STAGING_FOLDER_NAME
    OUTPUT_FILE_NAME=output.vcf
    # Model for calling whole genome sequencing data.
    MODEL=gs://deepvariant/models/DeepVariant/0.6.0/DeepVariant-inception_v3-0.6.0+cl-191676894.data-wgs_standard
    # Model for calling exome sequencing data.
    # MODEL=gs://deepvariant/models/DeepVariant/0.6.0/DeepVariant-inception_v3-0.6.0+cl-191676894.data-wes_standard
    IMAGE_VERSION=0.6.1
    DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
    DOCKER_IMAGE_GPU=gcr.io/deepvariant-docker/deepvariant_gpu:"${IMAGE_VERSION}"
    # Run the pipeline.
    gcloud alpha genomics pipelines run \
        --project "${PROJECT_ID}" \
        --pipeline-file deepvariant_pipeline.yaml \
        --logging "${OUTPUT_BUCKET}"/runner_logs \
        --zones us-west1-b \
        --inputs `echo \
            PROJECT_ID="${PROJECT_ID}", \
            OUTPUT_BUCKET="${OUTPUT_BUCKET}", \
            MODEL="${MODEL}", \
            DOCKER_IMAGE="${DOCKER_IMAGE}", \
            DOCKER_IMAGE_GPU="${DOCKER_IMAGE_GPU}", \
            STAGING_FOLDER_NAME="${STAGING_FOLDER_NAME}", \
            OUTPUT_FILE_NAME="${OUTPUT_FILE_NAME}" \
            | tr -d '[:space:]'`
    

  3. Run the following command to make script.sh executable:

    chmod +x script.sh
    

  4. Run script.sh:

    ./script.sh
    

  5. The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID
    

  6. The operations describe command returns done: true when the pipeline finishes.

    You can run the following simple bash loop to check every 30 seconds whether the job is running, has finished, or returned an error:

    while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
       echo "Job not done, sleeping for 30 seconds..."
       sleep 30
    done
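
    When the loop exits, you can check whether the operation succeeded or ended in an error. For long-running operations, the error field is populated only on failure; the following is a sketch that assumes the field is exposed by the operations describe command:

    gcloud --format='value(error)' alpha genomics operations describe OPERATION_ID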
    

    You can find more details about the operation in the path provided by the --logging flag. In this tutorial, the path is "${OUTPUT_BUCKET}"/runner_logs.
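
    For example, you can list the log files with gsutil; the exact file names vary by run:

    gsutil ls "${OUTPUT_BUCKET}"/runner_logs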

  7. After the pipeline finishes, it outputs a VCF file, output.vcf, to your Cloud Storage bucket. Run the following command to list the files in your bucket, and check that the output.vcf file is in the list:

    gsutil ls gs://BUCKET/
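
    For a quick sanity check of the result, you can also print the first lines of the file; a valid VCF begins with ## header lines:

    gsutil cat gs://BUCKET/output.vcf | head -n 20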
    

Pipeline configurations

You can run DeepVariant on Google Cloud Platform with different configuration settings, such as:

  • With and without GPUs
  • Using preemptible VMs
  • Using different numbers of shards

Some of the most common configurations are provided below. To try out a configuration, copy and paste it into a file called deepvariant_pipeline.yaml, then run the script.sh script from Run the pipeline.

After trying out the configurations, see Additional configuration options for information on how to set advanced settings.

Speed-optimized configuration

The following configuration is optimized to run DeepVariant quickly. Running the configuration lets you process a 30x whole genome sample in roughly two hours, and costs about $35.00.

The configuration has the following properties:

  • Uses 16 VMs with 64 cores each for the make_examples step
  • Uses 16 VMs with 8 cores and a GPU each for the call_variants step
  • Uses one 8-core VM for the postprocess_variants step

Before you can run the pipeline using this configuration, you must request the following Compute Engine quota increases:

  • CPUs: 1025
  • GPUs: 16
  • Persistent Disk: 3.2 TB
  • In-use IP addresses: 17
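
Before requesting increases, you can inspect your current regional quotas and usage. The following sketch assumes your resources run in the us-west1 region; adjust the region as needed:

gcloud compute regions describe us-west1 --format="yaml(quotas)"

The full configuration follows:
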
name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones us-west1-b us-east1-d \
      --docker_image "${DOCKER_IMAGE}" \
      --docker_image_gpu "${DOCKER_IMAGE_GPU}" \
      --gpu \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 1024 \
      --make_examples_workers 16 \
      --make_examples_cores_per_worker 64 \
      --make_examples_ram_per_worker_gb 240 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 16 \
      --call_variants_cores_per_worker 8 \
      --call_variants_ram_per_worker_gb 30 \
      --call_variants_disk_per_worker_gb 50

The configuration uses NVIDIA® Tesla® K80 GPUs, but you can use NVIDIA® Tesla® P100 GPUs instead by adding the --accelerator_type nvidia-tesla-p100 flag to the ./opt/deepvariant_runner/bin/gcp_deepvariant_runner command in the cmd block.

If you use the NVIDIA® Tesla® P100 GPUs, the pipeline takes roughly one hour and 45 minutes and costs about $39.00. See GPUs on Compute Engine for more details.
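
For example, the beginning of the cmd block would then look like this (the remaining flags are unchanged):

./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
  --accelerator_type nvidia-tesla-p100 \
  --project "${PROJECT_ID}" \
  ...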

Cost-optimized configuration using CPUs and GPUs

The following configuration is optimized to run DeepVariant at low cost. It uses GPUs and CPUs attached to preemptible VMs, which are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, use the speed-optimized configuration.

The DeepVariant runner contains built-in logic to automatically retry preempted jobs. You can also specify the maximum number of retries using the --max_preemptible_tries flag.

The total runtime and cost vary depending on the number of instances that get preempted, but for a 30x whole genome sample you can expect the pipeline to take roughly 3 hours and cost between $6.00 and $7.00. See the Compute Engine best practices guidelines for how to effectively use preemptible VMs.

The configuration has the following properties:

  • Uses 16 VMs with 32 cores each for the make_examples step
  • Uses 16 VMs with 8 cores and a GPU each for the call_variants step

Before you can run the pipeline using this configuration, you must request the following Compute Engine quota increases:

  • CPUs: 513
  • GPUs: 16
  • Persistent Disk: 3.2 TB
  • In-use IP addresses: 17

Finally, make sure that NVIDIA® Tesla® K80 GPUs are available in the zone(s) hosting your resources (see GPUs on Compute Engine for details).
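
One way to verify availability is to list the accelerator types offered in the zones you plan to use; the following sketch uses the zones from this configuration (adjust the zone names for your setup):

gcloud compute accelerator-types list \
    --filter="zone:( us-west1-b us-east1-d )"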

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones us-west1-b us-east1-d \
      --docker_image "${DOCKER_IMAGE}" \
      --docker_image_gpu "${DOCKER_IMAGE_GPU}" \
      --gpu \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 512 \
      --make_examples_workers 16 \
      --make_examples_cores_per_worker 32 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 16 \
      --call_variants_cores_per_worker 8 \
      --call_variants_ram_per_worker_gb 30 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5

Cost-optimized configuration using only CPUs

The following configuration is optimized to run DeepVariant at low cost. You can also use this configuration if GPUs are unavailable in the zone(s) hosting your resources.

The configuration uses CPUs attached to preemptible VMs, which are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, use the speed-optimized configuration.

The DeepVariant runner contains built-in logic to automatically retry preempted jobs. You can also specify the maximum number of retries using the --max_preemptible_tries flag.

The total runtime and cost vary depending on the number of instances that get preempted, but for a 30x whole genome sample you can expect the pipeline to complete in 3 to 4 hours and cost between $8.00 and $9.00. See the Compute Engine best practices guidelines for how to effectively use preemptible VMs.

The configuration has the following properties:

  • Uses 32 VMs with 16 cores each for the make_examples step
  • Uses 32 VMs with 32 cores each for the call_variants step

Before you can run the pipeline using this configuration, you must request the following Compute Engine quota increases:

  • CPUs: 1025
  • Persistent Disk: 6.4 TB
  • In-use IP addresses: 33

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones 'us-*' \
      --docker_image "${DOCKER_IMAGE}" \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 512 \
      --make_examples_workers 32 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 32 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5

Calling exome regions configuration

The following configuration calls only exome regions. It uses CPUs attached to preemptible VMs, which are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, use the speed-optimized configuration.

The DeepVariant runner contains built-in logic to automatically retry preempted jobs. You can also specify the maximum number of retries using the --max_preemptible_tries flag.

The total runtime and cost vary depending on the number of instances that get preempted, but you can expect the pipeline to take roughly 70 minutes and cost about $0.70. See the Compute Engine best practices guidelines for how to effectively use preemptible VMs.

When running the pipeline runner script, use the exome MODEL (uncommented below) instead of the model for calling whole genome sequencing data:

#!/bin/bash
set -euo pipefail
# Set common settings.
PROJECT_ID=PROJECT_ID
OUTPUT_BUCKET=gs://OUTPUT_BUCKET
STAGING_FOLDER_NAME=STAGING_FOLDER_NAME
OUTPUT_FILE_NAME=output.vcf
# Model for calling whole genome sequencing data.
# MODEL=gs://deepvariant/models/DeepVariant/0.6.0/DeepVariant-inception_v3-0.6.0+cl-191676894.data-wgs_standard
# Model for calling exome sequencing data.
MODEL=gs://deepvariant/models/DeepVariant/0.6.0/DeepVariant-inception_v3-0.6.0+cl-191676894.data-wes_standard
IMAGE_VERSION=0.6.1
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
DOCKER_IMAGE_GPU=gcr.io/deepvariant-docker/deepvariant_gpu:"${IMAGE_VERSION}"
# Run the pipeline.
gcloud alpha genomics pipelines run \
    --project "${PROJECT_ID}" \
    --pipeline-file deepvariant_pipeline.yaml \
    --logging "${OUTPUT_BUCKET}"/runner_logs \
    --zones us-west1-b \
    --inputs `echo \
    PROJECT_ID="${PROJECT_ID}", \
    OUTPUT_BUCKET="${OUTPUT_BUCKET}", \
    MODEL="${MODEL}", \
    DOCKER_IMAGE="${DOCKER_IMAGE}", \
    DOCKER_IMAGE_GPU="${DOCKER_IMAGE_GPU}", \
    STAGING_FOLDER_NAME="${STAGING_FOLDER_NAME}", \
    OUTPUT_FILE_NAME="${OUTPUT_FILE_NAME}" \
    | tr -d '[:space:]'`

The configuration has the following properties:

  • Uses 4 VMs with 16 cores each for the make_examples step
  • Uses 1 VM with 32 cores for the call_variants step

name: deepvariant_pipeline
inputParameters:
- name: PROJECT_ID
- name: OUTPUT_BUCKET
- name: MODEL
- name: DOCKER_IMAGE
- name: DOCKER_IMAGE_GPU
- name: STAGING_FOLDER_NAME
- name: OUTPUT_FILE_NAME
docker:
  imageName: gcr.io/deepvariant-docker/deepvariant_runner
  cmd: |
    ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project "${PROJECT_ID}" \
      --zones 'us-*' \
      --docker_image "${DOCKER_IMAGE}" \
      --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
      --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
      --model "${MODEL}" \
      --bam gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam \
      --bai gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai \
      --ref gs://deepvariant/exome-case-study-testdata/hs37d5.fa.gz \
      --regions gs://deepvariant/exome-case-study-testdata/refseq.coding_exons.b37.extended50.bed \
      --shards 64 \
      --make_examples_workers 4 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 100 \
      --call_variants_workers 1 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5

If you need the pipeline to run faster or you need to guarantee its turnaround time, you can make the following changes to the configuration:

  • Remove the --preemptible flag
  • Quadruple the number of shards from 64 to 256 by doubling both the number of workers and the number of cores per worker in the make_examples step (8 workers x 32 cores = 256 shards), like so:

    ...
    --shards 256 \
    --make_examples_workers 8 \
    --make_examples_cores_per_worker 32 \
    ...
    

Genomic VCF (gVCF) configuration

DeepVariant supports gVCF as an output format. To generate gVCF output, use any of the above configurations and replace the --outfile flag with the --gvcf_outfile flag, which specifies the destination path for the gVCF file.

If you run any of the above configurations and output gVCF data, the runtime of the pipeline increases by roughly 2 hours. Additionally, the postprocess_variants step requires a larger disk. For best results when generating gVCF output, add the --postprocess_variants_disk_gb 200 flag to the ./opt/deepvariant_runner/bin/gcp_deepvariant_runner command in the cmd block.

You can also use the --gvcf_gq_binsize flag to control the bin size used to quantize gVCF genotype qualities. Larger bin sizes reduce the number of gVCF records but also reduce the granularity of the quality values.
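
Putting these options together, the relevant portion of the cmd block might look like the following sketch. The output file name and the bin size value are examples only, and the elided flags are unchanged from the base configuration:

./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
  --project "${PROJECT_ID}" \
  --gvcf_outfile "${OUTPUT_BUCKET}"/output.g.vcf \
  --gvcf_gq_binsize 5 \
  --postprocess_variants_disk_gb 200 \
  ...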

Additional configuration options

In addition to the basic configurations provided above, you can specify the following additional settings to fit your use case.

When specifying any of the following flags, add them to the ./opt/deepvariant_runner/bin/gcp_deepvariant_runner command in the cmd block of the configuration file.

--job_name_prefix

  Description: A prefix added to job names. This can be useful for distinguishing pipeline runs, such as for billing purposes.

  Example use case: By specifying --job_name_prefix gpu_, a billing report shows gpu_make_examples, gpu_call_variants, and gpu_postprocess_variants as the three stages of the pipeline.

--jobs_to_run

  Description: Runs only part of the pipeline. This can be useful if part of a pipeline failed for some reason, such as if you specified incorrect input paths. When using this flag, all settings (such as the staging location and the numbers of shards and workers) must be identical to the original run so that existing intermediate results can be reused.

  Example use case: If the pipeline failed in the call_variants step, you could rerun the pipeline by specifying --jobs_to_run call_variants postprocess_variants, which skips the make_examples step and reuses the existing results.

--bai, --ref_fai, --ref_gzi

  Description: By default, the BAM file must have a .bam.bai index present at the same location as the BAM file, and the reference must have a .fai index (plus a .gzi index if the reference is compressed). These flags let you point to the index files explicitly.

  Example use case: Use these flags if the index files are not in the same location as the raw files, or if the index files don't have the common extensions.
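
For example, to rerun only the last two stages after a failure in the call_variants step while also tagging the jobs for a billing report, you might add flags like the following to the runner command (a sketch; every other setting must match the original run so the intermediate results in the staging folder can be reused):

./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
  --job_name_prefix gpu_ \
  --jobs_to_run call_variants postprocess_variants \
  ...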

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources you created on Google Cloud Platform after you've finished so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
