Running DeepVariant

This page explains how to run DeepVariant on Google Cloud Platform. DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Objectives

After completing this tutorial, you'll know how to:

  • Run DeepVariant on GCP
  • Write configuration files for different DeepVariant use cases
  • Estimate costs and turnaround times for different DeepVariant pipeline configurations

Costs

This tutorial uses billable components of GCP, including:

  • Compute Engine

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Genomics and Compute Engine APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Update and install gcloud components:
    gcloud components update &&
    gcloud components install alpha

Run DeepVariant

To run DeepVariant on GCP, you need to run a script that invokes the Cloud Genomics Pipelines API.

  1. Copy the following text and save it to a file named script.sh, substituting the variables with the relevant resources from your GCP project. The OUTPUT_BUCKET variable designates a Cloud Storage bucket that holds the pipeline's output and intermediate files. The STAGING_FOLDER_NAME variable is a unique name for a directory in the bucket; you can set it to a different value for each run of the pipeline.

    #!/bin/bash
    set -euo pipefail
    # Set common settings.
    PROJECT_ID=PROJECT_ID
    OUTPUT_BUCKET=gs://OUTPUT_BUCKET
    STAGING_FOLDER_NAME=STAGING_FOLDER_NAME
    OUTPUT_FILE_NAME=output.vcf
    # Model for calling whole genome sequencing data.
    MODEL=gs://deepvariant/models/DeepVariant/0.7.2/DeepVariant-inception_v3-0.7.2+data-wgs_standard
    IMAGE_VERSION=0.7.2
    DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
    COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project ${PROJECT_ID} \
      --zones us-west1-* \
      --docker_image ${DOCKER_IMAGE} \
      --outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
      --staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
      --model ${MODEL} \
      --bam gs://deepvariant/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam \
      --ref gs://deepvariant/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta.gz \
      --regions chr20:10,000,000-10,010,000 \
      --gcsfuse"
    # Run the pipeline.
    gcloud alpha genomics pipelines run \
        --project "${PROJECT_ID}" \
        --service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
        --logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
        --regions us-west1 \
        --docker-image gcr.io/deepvariant-docker/deepvariant_runner:"${IMAGE_VERSION}" \
        --command-line "${COMMAND}"
    
  2. Run the following command to make script.sh executable:

    chmod +x script.sh
    
  3. Run script.sh:

    ./script.sh
    
  4. The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID
    
  5. The operations describe command returns done: true when the pipeline finishes.

    You can run the following simple bash loop to check every 30 seconds whether the job is running, has finished, or returned an error:

    while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
       echo "Job not done, sleeping for 30 seconds..."
       sleep 30
    done
    

    You can find more details about the operation in the path provided by the --logging flag. In this tutorial, the logs are written to "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}", in a file named runner_logs_ followed by a timestamp.
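For example, you can list and read the runner logs with gsutil, assuming the same OUTPUT_BUCKET and STAGING_FOLDER_NAME values used in script.sh (replace the TIMESTAMP placeholder with a value from the listing):

    # List the timestamped runner logs written via the --logging flag.
    gsutil ls "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_*.log"

    # Print a specific log (substitute TIMESTAMP from the listing above).
    gsutil cat "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_TIMESTAMP.log"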

  6. After the pipeline finishes, it outputs a VCF file, output.vcf, to your Cloud Storage bucket. Run the following command to list the files in your bucket, and check that the output.vcf file is in the list:

    gsutil ls gs://OUTPUT_BUCKET/
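
Once output.vcf appears in the listing, you can copy it locally and inspect the first variant records; this sketch assumes the OUTPUT_BUCKET value from script.sh:

    # Copy the pipeline output from the bucket to the current directory.
    gsutil cp "${OUTPUT_BUCKET}/output.vcf" .

    # Print the first few variant records, skipping the header lines.
    grep -v '^#' output.vcf | head -n 5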
    

Pipeline configurations

You can run DeepVariant on GCP with different configuration settings, such as:

  • With and without GPUs
  • Using preemptible VMs
  • Using different numbers of shards

Some of the most common configurations are provided below. After trying out the configurations, see additional options for information on how to set advanced settings.

Cost-optimized configuration

The following configuration is optimized to run DeepVariant at low cost. The total runtime and cost vary depending on the number of instances that get preempted, but you can expect that, for a 30x whole genome sample, the pipeline will complete in 1 to 2 hours and cost between $3.00 and $4.00.

The configuration uses CPUs attached to preemptible VMs, which are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, do not use the --preemptible flag.

See the Compute Engine best practices for how to effectively use preemptible VMs.

The DeepVariant runner contains built-in logic to automatically retry preempted jobs. You can also specify the maximum number of retries using the --max_preemptible_tries flag.

The configuration has the following properties:

  • Uses 32 16-core VMs for the make_examples step
  • Uses 32 32-core VMs for the call_variants step

Before you can run the pipeline using this configuration, you must request the following Compute Engine quota increases:

  • CPUs: 1025
  • Persistent Disk: 6.4 TB
  • In-use IP addresses: 33
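These quota numbers follow from the worker counts above. As a sanity check, here is the arithmetic as a small shell sketch; the assumptions (not stated explicitly in the configuration) are that the make_examples and call_variants stages run one after the other rather than concurrently, and that the extra CPU and IP address account for a small runner VM:

```shell
#!/bin/bash
# make_examples stage: 32 workers with 16 cores each.
make_examples_cores=$((32 * 16))      # 512
# call_variants stage: 32 workers with 32 cores each.
call_variants_cores=$((32 * 32))      # 1024
# Assuming the stages run sequentially, peak CPU usage is the larger
# stage plus 1 CPU for the runner VM.
peak_cpus=$((call_variants_cores + 1))
# Persistent disk: 32 make_examples workers with 200 GB each.
disk_gb=$((32 * 200))                 # 6400 GB = 6.4 TB
# In-use IP addresses: 32 workers plus the runner VM.
ips=$((32 + 1))
echo "CPUs: ${peak_cpus}, disk: ${disk_gb} GB, IPs: ${ips}"
```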
    #!/bin/bash
    set -euo pipefail
    # Set common settings.
    PROJECT_ID=PROJECT_ID
    OUTPUT_BUCKET=gs://OUTPUT_BUCKET
    STAGING_FOLDER_NAME=STAGING_FOLDER_NAME
    OUTPUT_FILE_NAME=output.vcf
    # Model for calling whole genome sequencing data.
    MODEL=gs://deepvariant/models/DeepVariant/0.7.2/DeepVariant-inception_v3-0.7.2+data-wgs_standard
    IMAGE_VERSION=0.7.2
    DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
    COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project ${PROJECT_ID} \
      --zones us-west1-* \
      --docker_image ${DOCKER_IMAGE} \
      --outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
      --staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
      --model ${MODEL} \
      --bam gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam \
      --ref gs://deepvariant/performance-testdata/hs37d5.fa.gz \
      --shards 512 \
      --make_examples_workers 32 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 200 \
      --call_variants_workers 32 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5 \
      --gcsfuse"
    # Run the pipeline.
    gcloud alpha genomics pipelines run \
        --project "${PROJECT_ID}" \
        --service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
        --logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
        --regions us-west1 \
        --docker-image gcr.io/deepvariant-docker/deepvariant_runner:"${IMAGE_VERSION}" \
        --command-line "${COMMAND}"

Calling exome regions configuration

The following configuration calls only exome regions. It uses CPUs attached to preemptible VMs, which are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, do not use the --preemptible flag.

The DeepVariant runner contains built-in logic to automatically retry preempted jobs. You can also specify the maximum number of retries using the --max_preemptible_tries flag.

The total runtime and cost vary depending on the number of instances that get preempted, but you can expect the pipeline to take roughly 25 minutes and cost about $0.20. See the Compute Engine best practices for how to effectively use preemptible VMs.

The configuration has the following properties:

  • Uses 4 16-core VMs for the make_examples step
  • Uses 1 32-core VM for the call_variants step

    #!/bin/bash
    set -euo pipefail
    # Set common settings.
    PROJECT_ID=PROJECT_ID
    OUTPUT_BUCKET=gs://OUTPUT_BUCKET
    STAGING_FOLDER_NAME=STAGING_FOLDER_NAME
    OUTPUT_FILE_NAME=output.vcf
    # Model for calling exome sequencing data.
    MODEL=gs://deepvariant/models/DeepVariant/0.7.2/DeepVariant-inception_v3-0.7.2+data-wes_standard
    IMAGE_VERSION=0.7.2
    DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
    COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      --project ${PROJECT_ID} \
      --zones us-west1-* \
      --docker_image ${DOCKER_IMAGE} \
      --outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
      --staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
      --model ${MODEL} \
      --bam gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam \
      --bai gs://deepvariant/exome-case-study-testdata/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai \
      --ref gs://deepvariant/exome-case-study-testdata/hs37d5.fa.gz \
      --regions gs://deepvariant/exome-case-study-testdata/refseq.coding_exons.b37.extended50.bed \
      --shards 64 \
      --make_examples_workers 4 \
      --make_examples_cores_per_worker 16 \
      --make_examples_ram_per_worker_gb 60 \
      --make_examples_disk_per_worker_gb 100 \
      --call_variants_workers 1 \
      --call_variants_cores_per_worker 32 \
      --call_variants_ram_per_worker_gb 60 \
      --call_variants_disk_per_worker_gb 50 \
      --preemptible \
      --max_preemptible_tries 5 \
      --gcsfuse"
    # Run the pipeline.
    gcloud alpha genomics pipelines run \
        --project "${PROJECT_ID}" \
        --service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
        --logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
        --regions us-west1 \
        --docker-image gcr.io/deepvariant-docker/deepvariant_runner:"${IMAGE_VERSION}" \
        --command-line "${COMMAND}"
    

If you need the pipeline to run faster or you need to guarantee its turnaround time, you can make the following changes to the configuration:

  • Remove the --preemptible flag
  • Increase the number of shards to 256 by doubling both the number of workers and the cores per worker in the make_examples step, like so:

    ...
    --shards 256
    --make_examples_workers 8
    --make_examples_cores_per_worker 32
    ...
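
The flag changes above can be sanity-checked with a little shell arithmetic; in the configurations on this page the shard count equals the total number of make_examples cores (workers × cores per worker), a pattern assumed from those examples:

```shell
#!/bin/bash
# Original exome configuration: 4 workers with 16 cores each.
original_shards=$((4 * 16))    # 64
# Faster configuration: 8 workers with 32 cores each.
faster_shards=$((8 * 32))      # 256
echo "original shards: ${original_shards}"
echo "faster shards:   ${faster_shards}"
```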
    

Genomic VCF (gVCF) configuration

DeepVariant supports gVCF as an output format. To generate gVCF output, use any of the above configurations and add the --gvcf_outfile flag, which specifies the destination path for the gVCF file.

If you run any of the above configurations and output gVCF data, the runtime of the pipeline increases by roughly 2 hours. Additionally, the postprocess_variants step requires a larger disk. For best results when generating gVCF output, add the --postprocess_variants_disk_gb 200 flag to the /opt/deepvariant_runner/bin/gcp_deepvariant_runner command.

You can also use the --gvcf_gq_binsize flag to control the bin size used to quantize gVCF genotype qualities. Larger bin sizes reduce the number of gVCF records but result in a loss of granularity in the quality values.
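
For example, the gVCF-related flags could be added to the COMMAND string of any configuration on this page like this (the output filename and bin size are illustrative, not prescribed values):

    COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      ... \
      --postprocess_variants_disk_gb 200 \
      --gvcf_outfile ${OUTPUT_BUCKET}/output.g.vcf \
      --gvcf_gq_binsize 5 \
      --gcsfuse"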

Additional configuration options

In addition to the basic configurations provided above, you can specify the following additional settings to fit your use case.

When specifying any of the following flags, add them to the /opt/deepvariant_runner/bin/gcp_deepvariant_runner command.

Configuration option: job_name_prefix
Description: A prefix added to job names. This can be useful for distinguishing pipeline runs, such as for billing purposes.
Example use case: By specifying --job_name_prefix gpu_, a billing report shows gpu_make_examples, gpu_call_variants, and gpu_postprocess_variants as the three stages of the pipeline.

Configuration option: jobs_to_run
Description: Runs only one part of the pipeline. This can be useful if part of a pipeline failed for some reason, such as if you specified incorrect input paths. When using this flag, all settings (such as the staging location and the numbers of shards and workers) must be identical to the original run so that existing intermediate results can be reused.
Example use case: If the pipeline failed in the call_variants step, you could rerun the pipeline by specifying --jobs_to_run call_variants postprocess_variants, which skips the make_examples step and reuses its existing results.

Configuration option: bai, ref_fai, ref_gzi
Description: BAM files must have a .bam.bai index file present in the same location as the BAM file. If the reference has a FASTA index or a gzip index, the corresponding .fai and .gzi files must also be present. A .gzi file is required only if your reference is compressed.
Example use case: Can be used if the index files are not in the same location as the raw files, or if the index files don't have the common extensions.
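
As an illustration, rerunning only the later stages of a failed run under a distinguishing prefix could combine the flags like this (the rerun_ prefix is a hypothetical value; all other flags must match the original run):

    COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
      ... \
      --job_name_prefix rerun_ \
      --jobs_to_run call_variants postprocess_variants \
      ..."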

Cleaning up

To avoid incurring charges to your GCP account for the resources used in this tutorial:

After you've finished the Running DeepVariant tutorial, you can clean up the resources you created on GCP so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
