Running DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

This page explains how to run DeepVariant on Google Cloud using a single Compute Engine instance.

More complex configurations are available in the DeepVariant GitHub repository. For example, you can run DeepVariant on multiple instances. These configurations can improve processing speed and reduce costs.

Running DeepVariant consists of three stages:

  1. Making examples: DeepVariant pre-processes the input data and saves examples from the data using an internal TensorFlow format. You can run this stage in parallel because each input shard is processed independently.

  2. Calling variants: DeepVariant runs a deep neural network that makes inferences from the examples and saves them into shared files using an internal TensorFlow format.

  3. Post-processing variants: DeepVariant converts variants from the internal TensorFlow format to VCF or gVCF files. This stage runs on a single thread.

In this tutorial, you run all of these stages on a single instance. The first two stages (making examples and calling variants) benefit from parallelization across multiple cores, but the third stage (post-processing variants) does not, because it runs on a single thread.
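
As a rough illustration of how the parallel work is split, the sketch below lists the task IDs that a sharded run would use, one per worker. The helper name shard_task_ids is hypothetical and not part of DeepVariant. The numbering from 0 to N-1 follows the seq 0 63 invocation that DeepVariant prints for a 64-shard run later in this tutorial.

```shell
# Hypothetical helper: print the task IDs for an N-shard run (0 .. N-1).
shard_task_ids() {
  seq 0 "$(( $1 - 1 ))"
}

# Use the number of available cores, as the tutorial does with $(nproc);
# fall back to 1 if nproc is unavailable on this system.
num_shards="$(nproc 2>/dev/null || echo 1)"
echo "Shards on this machine: ${num_shards}"

# Each ID would be handed to one parallel make_examples worker.
shard_task_ids 4
```

Because every task ID maps to an independent shard, adding cores adds workers for the first stage, while the single-threaded third stage is unaffected.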

Objectives

After completing this tutorial, you'll know how to:

  • Run DeepVariant on Google Cloud

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Compute Engine API.

  5. Install and initialize the Cloud SDK.

    Tip: Need a command prompt? You can use Cloud Shell, a command-line environment that already includes the Cloud SDK, so you don't need to install it.
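
If you work from your own terminal, a minimal sketch like the following can confirm that the Cloud SDK tools used in this tutorial are on your PATH before you continue. The helper name require_tool is hypothetical. The commented-out gcloud services enable command is the command-line counterpart of enabling the Compute Engine API in the console; it assumes that you have already initialized the SDK and replaced PROJECT_ID with your own project ID.

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# This tutorial uses both tools; report each one without aborting the shell.
for tool in gcloud gsutil; do
  require_tool "${tool}" || true
done

# Command-line alternative to enabling the API in the console
# (uncomment after replacing PROJECT_ID):
# gcloud services enable compute.googleapis.com --project PROJECT_ID
```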

Creating a Compute Engine instance

Before running DeepVariant, you need to create a Compute Engine instance on which DeepVariant will run. You can create a Linux virtual machine instance in Compute Engine using either the Google Cloud Console or the gcloud command-line tool.

Console

  1. In the Cloud Console, go to the VM Instances page.

  2. Click Create instance.
  3. Choose a Name for the instance in the format PROJECT_ID-deepvariant-run where PROJECT_ID is the ID for your Google Cloud project.
  4. Choose a Region and Zone for the instance. Unless you have a specific reason to run the instance in a certain location, select us-central1 (Iowa) for the Region and us-central1-a for the Zone.
  5. In the Machine type menu, select n1-standard-64 (64 vCPU, 240 GB memory).
  6. In the CPU platform menu, select Intel Skylake or later.
  7. In the Boot disk section, click Change to begin configuring your boot disk.
  8. On the OS images tab, choose Google Drawfork Ubuntu 16.04 LTS. In the Boot disk type menu, select Standard persistent disk. In the Size (GB) field, enter 300. Click Select.
  9. Click Create to create the instance.

gcloud

gcloud compute instances create \
    PROJECT_ID-deepvariant-run \
    --project PROJECT_ID \
    --zone ZONE \
    --scopes "cloud-platform" \
    --image-project ubuntu-os-cloud \
    --image-family ubuntu-1604-lts \
    --machine-type n1-standard-64 \
    --min-cpu-platform "Intel Skylake" \
    --boot-disk-size=300GB

where:

  • PROJECT_ID is the ID for your Google Cloud project.
  • ZONE is the zone in which your instance is deployed. A zone is an approximate regional location in which your instance and its resources live. For example, us-west1-a is a zone in the us-west1 region. If you previously set a default zone using gcloud config set compute/zone, the value of this flag overrides that default.

Allow a short time for the instance to start up. Once ready, it will be listed on the VM Instances page with a green status icon.
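
The placeholders in the gcloud command can be kept in shell variables so that later commands reuse them. The sketch below assumes a hypothetical project ID, my-project, and the us-central1-a zone suggested in the console steps; the helper name instance_status is also hypothetical. It wraps gcloud compute instances describe, which reports a status such as RUNNING once the instance has started.

```shell
# Assumed values for illustration; replace with your own project and zone.
PROJECT_ID="my-project"
ZONE="us-central1-a"
INSTANCE_NAME="${PROJECT_ID}-deepvariant-run"
echo "Instance name: ${INSTANCE_NAME}"

# Hypothetical helper: print the instance status (for example, RUNNING).
# Requires an installed and authenticated gcloud tool.
instance_status() {
  gcloud compute instances describe "${INSTANCE_NAME}" \
    --project "${PROJECT_ID}" \
    --zone "${ZONE}" \
    --format 'value(status)'
}

# Usage, once the instance has been created:
# instance_status
```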

Connecting to the instance

You can connect to the instance using either the Cloud Console or the gcloud tool:

Console

  1. In the Cloud Console, go to the VM Instances page.

  2. In the list of virtual machine instances, click SSH in the row of the instance that you just created.

gcloud

gcloud compute ssh PROJECT_ID-deepvariant-run --zone ZONE

Running DeepVariant

Configure your environment and run DeepVariant by completing the following steps on the Compute Engine instance you created:

  1. Install Docker Community Edition (CE) by running the following commands:

    sudo apt-get -qq -y update
    sudo apt-get -qq -y install \
      apt-transport-https \
      ca-certificates \
      curl \
      gnupg-agent \
      software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) \
      stable"
    sudo apt-get -qq -y update
    sudo apt-get -qq -y install docker-ce
    
  2. Configure the DeepVariant environment variables by copying and pasting the following commands:

    BIN_VERSION="0.9.0"
    BASE="${HOME}/deepvariant-run"
    INPUT_DIR="${BASE}/input"
    REF="hs37d5.fa.gz"
    BAM="HG002_NIST_150bp_chr20_downsampled_30x.bam"
    OUTPUT_DIR="${BASE}/output"
    DATA_DIR="${INPUT_DIR}/data"
    OUTPUT_VCF="HG002.output.vcf.gz"
    OUTPUT_GVCF="HG002.output.g.vcf.gz"
    
  3. Create the local directory structure for the input data directory and the output directory:

    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${INPUT_DIR}"
    mkdir -p "${DATA_DIR}"
    
  4. This tutorial uses a publicly available HG002 genome at 30x coverage, mapped to the GRCh37 reference. To shorten the runtime, you add the --regions 20 flag when you run DeepVariant so that it processes only chromosome 20 (chr20).

    The sample data was created using Illumina sequencing. DeepVariant supports the following types of input data:

    • Whole genome (Illumina) (WGS)
    • Exome (Illumina) (WES)
    • Whole genome (PacBio)

    Using the gsutil cp command, copy the input test data from the deepvariant Cloud Storage bucket to the directories on the instance you created in the previous step:

    # Input BAM and BAI files:
    gsutil cp gs://deepvariant/performance-testdata/"${BAM}" "${DATA_DIR}"
    gsutil cp gs://deepvariant/performance-testdata/"${BAM}".bai "${DATA_DIR}"
    
    # GRCh37 reference FASTA file:
    gsutil cp gs://deepvariant/case-study-testdata/"${REF}"* "${DATA_DIR}"
    
  5. DeepVariant is a containerized application staged in a pre-built Docker image in Container Registry. To pull the image, run the following command:

    sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
    
  6. Run the following command to start DeepVariant. The total running time for the command is roughly eight minutes.

    sudo docker run \
        -v "${DATA_DIR}:/input" \
        -v "${OUTPUT_DIR}:/output" \
        gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref="/input/${REF}" \
        --reads="/input/${BAM}" \
        --output_vcf="/output/${OUTPUT_VCF}" \
        --output_gvcf="/output/${OUTPUT_GVCF}" \
        --regions 20 \
        --num_shards="$(nproc)"
    

    The following table describes the flags passed in to the command:

    Flag           Description
    model_type     DeepVariant supports several different types of input data. This tutorial uses whole genome sequencing (WGS).
    ref            The location of the reference FASTA file.
    reads          The location of the input BAM file.
    output_vcf     The location of the output VCF files.
    output_gvcf    The location of the output gVCF files.
    regions        (Optional) A space-separated list of chromosome regions to process. Individual elements can be region literals, such as chr20:10-20, or paths to BED/BEDPE files.
    num_shards     The number of shards to run in parallel. For best results, set the value of this flag to the number of cores on the machine where DeepVariant runs.

    If the command starts to run successfully, it outputs a message similar to the following:

    ***** Running the command:*****
    time seq 0 63 | parallel
    -k
    --line-buffer /opt/deepvariant/bin/make_examples
    --mode calling
    --ref "/input/hs37d5.fa.gz"
    --reads "/input/HG002_NIST_150bp_chr20_downsampled_30x.bam"
    --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@64.gz"
    --regions "20"
    --gvcf "/tmp/deepvariant_tmp_output/gvcf.tfrecord@64.gz"
    --task {}
    
  7. After DeepVariant finishes, it outputs the following files to the deepvariant-run/output directory:

    • HG002.output.g.vcf.gz
    • HG002.output.g.vcf.gz.tbi
    • HG002.output.vcf.gz
    • HG002.output.vcf.gz.tbi

    Run the following command to list the files in the output directory, and check that all of the output files display:

    ls "${OUTPUT_DIR}"
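
To go one step beyond listing the directory, a sketch along these lines checks that each expected file is present and counts the variant records in the VCF. The helper names missing_files and count_variants are hypothetical and not part of DeepVariant. The count relies only on the VCF convention that header lines begin with #.

```shell
# Hypothetical helper: print each named file that is missing from a directory.
# Returns non-zero if at least one file is missing.
missing_files() {
  dir="$1"
  shift
  rc=0
  for name in "$@"; do
    if [ ! -f "${dir}/${name}" ]; then
      echo "missing: ${dir}/${name}"
      rc=1
    fi
  done
  return "${rc}"
}

# Hypothetical helper: count non-header records in a gzip-compressed VCF.
count_variants() {
  gzip -dc "$1" | grep -vc '^#'
}

# Usage on the instance, with the variables defined earlier in this tutorial:
# missing_files "${OUTPUT_DIR}" \
#   "${OUTPUT_VCF}" "${OUTPUT_VCF}.tbi" "${OUTPUT_GVCF}" "${OUTPUT_GVCF}.tbi" \
#   && count_variants "${OUTPUT_DIR}/${OUTPUT_VCF}"
```

The same missing_files helper could be pointed at the input data directory before starting the run, to catch a failed download early.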
    

Cost and runtime estimates

The following table shows the approximate runtime and costs when running DeepVariant using a 30x whole genome sample in a BAM file. These estimates do not include the time required to set up the instance and download any sample data from Cloud Storage.

The table contains estimates for preemptible VMs and non-preemptible VMs. The runtime estimates are based on using non-preemptible VMs.

Preemptible VMs are up to 80% cheaper than regular VMs. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. Additionally, preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, do not pass the --preemptible flag when you create the instance.

See the Compute Engine best practices for how to effectively use preemptible VMs.

Machine type      Runtime (hours)   Cost (non-preemptible)   Cost (preemptible)
n1-standard-8     19.7              $8.06                    $2.16
n1-standard-16    11.3              $8.91                    $2.14
n1-standard-32    7.25              $11.20                   $2.54
n1-standard-64    5.65              $17.30                   $3.78
n1-standard-96    4.52              $20.80                   $4.48
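
The figures in the table can be turned into two derived numbers that make the trade-off easier to see: the implied hourly rate (non-preemptible cost divided by runtime) and the preemptible saving (1 minus the ratio of the two costs). The sketch below computes both with awk from values copied out of the table. The function name cost_summary is hypothetical, and the derived numbers are only as accurate as the rounded table entries; they are not official pricing.

```shell
# Hypothetical helper: derive the hourly rate and the preemptible saving
# for each machine type from the table above.
# Input columns:  machine type, runtime (hours), cost, preemptible cost (USD).
# Output columns: machine type, USD per hour, preemptible saving in percent.
cost_summary() {
  awk '{ printf "%s %.2f %.1f\n", $1, $3 / $2, (1 - $4 / $3) * 100 }' <<'EOF'
n1-standard-8 19.7 8.06 2.16
n1-standard-16 11.3 8.91 2.14
n1-standard-32 7.25 11.20 2.54
n1-standard-64 5.65 17.30 3.78
n1-standard-96 4.52 20.80 4.48
EOF
}

cost_summary
```

In these estimates, larger machine types shorten the runtime but raise the total cost of a non-preemptible run.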

Cleaning up

After you've finished this tutorial, you can clean up the resources that you created on Google Cloud so that you aren't billed for them in the future. The following section describes how to delete these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

  2. In the project list, select the checkbox next to the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
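
If you want to keep the project and remove only the instance, the gcloud tool offers equivalent commands. The sketch below only prints the commands so that you can review them before running anything. The function name print_cleanup_commands and the example values are hypothetical. gcloud compute instances delete removes the instance, and gcloud projects delete shuts down the whole project.

```shell
# Hypothetical helper: print (but do not run) the cleanup commands for the
# instance name used in this tutorial, PROJECT_ID-deepvariant-run.
print_cleanup_commands() {
  project_id="$1"
  zone="$2"
  echo "gcloud compute instances delete ${project_id}-deepvariant-run --project ${project_id} --zone ${zone}"
  echo "gcloud projects delete ${project_id}"
}

# Example values are assumptions; substitute your own project ID and zone.
print_cleanup_commands "my-project" "us-central1-a"
```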
