Running DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

This page explains how to run DeepVariant on Google Cloud using a single Compute Engine instance.

More complex configurations are available in the DeepVariant GitHub repository. For example, you can run DeepVariant on multiple instances. These configurations can improve processing speed and reduce costs.

Running DeepVariant consists of three stages:

  1. Making examples: DeepVariant pre-processes the input data and saves examples from the data using an internal TensorFlow format. Because the input shards are processed independently, you can run this stage in parallel.

  2. Calling variants: DeepVariant runs a deep neural network that makes inferences from the examples and saves them into shared files using an internal TensorFlow format.

  3. Post-processing variants: DeepVariant converts variants from the internal TensorFlow format to VCF or gVCF files. This stage runs on a single thread.

In this tutorial, you run these stages using a single instance. The first and second stages benefit from parallelization on multiple cores. The third stage does not have the same benefits because it runs on a single thread.
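The fan-out pattern that lets the first two stages use many cores can be sketched in plain shell. This is only an illustration under stated assumptions, not DeepVariant's launcher: run_sharded and demo_worker are hypothetical names, the worker just prints its shard index instead of processing reads, and background jobs stand in for whatever job runner the real pipeline uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start one background job per shard, then wait for all.
# Usage: run_sharded NUM_SHARDS COMMAND [ARGS...]
# COMMAND receives the shard index and the shard count as its last two arguments.
run_sharded() {
  local num_shards="$1"
  shift
  local task=0
  while [ "${task}" -lt "${num_shards}" ]; do
    "$@" "${task}" "${num_shards}" &
    task=$((task + 1))
  done
  wait  # Block until every shard has finished.
}

# Stand-in for real per-shard work (assumption: each shard is independent).
demo_worker() {
  echo "processing shard $1 of $2"
}

# Shards finish in any order, so sort the output to make it stable.
run_sharded 4 demo_worker | sort
```

A single-threaded stage, such as post-processing, gains nothing from this pattern, which is why adding cores mainly speeds up the first two stages.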


After completing this tutorial, you'll know how to:

  • Run DeepVariant on Google Cloud


This tutorial uses billable components of Google Cloud, including:

  • Compute Engine

To generate a cost estimate based on your projected usage, use the Pricing Calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine API.

  5. Install and initialize the Cloud SDK.

    Tip: If you need a command prompt, you can use Cloud Shell, a command-line environment that already includes the Cloud SDK, so you don't need to install it.

Creating a Compute Engine instance

You must create a Compute Engine instance to run DeepVariant. You can create the instance using the Google Cloud Console or the gcloud tool.


  1. In the Cloud Console, go to the VM Instances page.

  2. Click Create instance.
  3. Choose a Name for the instance in the format PROJECT_ID-deepvariant-run where PROJECT_ID is the ID for your Google Cloud project.
  4. Choose a Region and Zone for the instance. Unless you have a specific reason to run the instance in a certain location, select us-central1 (Iowa) for the Region and us-central1-a for the Zone.
  5. In the Machine type menu, select n1-standard-64 (64 vCPU, 240 GB memory).
  6. In the CPU platform menu, select Intel Skylake or later.
  7. In the Boot disk section, click Change to begin configuring your boot disk.
  8. On the OS images tab, choose Google Drawfork Ubuntu 16.04 LTS. In the Boot disk type menu, select Standard persistent disk. In the Size (GB) field, enter 300. Click Select.
  9. Click Create to create the instance.


gcloud compute instances create \
    PROJECT_ID-deepvariant-run \
    --project PROJECT_ID \
    --zone ZONE \
    --scopes "cloud-platform" \
    --image-project ubuntu-os-cloud \
    --image-family ubuntu-1604-lts \
    --machine-type n1-standard-64 \
    --min-cpu-platform "Intel Skylake" \
    --boot-disk-size 300GB


Replace the following:

  • PROJECT_ID is the ID for your Google Cloud project.
  • ZONE is the zone in which your instance is deployed. A zone is an approximate regional location in which your instance and its resources live. For example, us-west1-a is a zone in the us-west1 region. If you previously set a default zone using gcloud config set compute/zone, the value of this flag overrides that default.

Allow a short time for the instance to start up. After it's ready, it is listed on the VM Instances page with a green status icon.
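If you created the instance with the gcloud tool, you can also check readiness from the command line instead of watching the Cloud Console. The following is a minimal sketch, assuming the gcloud tool is installed and authenticated and that your default project is already configured. The helper name wait_for_instance and its retry defaults are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until the instance reports the RUNNING status.
# Usage: wait_for_instance INSTANCE_NAME ZONE [MAX_TRIES] [DELAY_SECONDS]
wait_for_instance() {
  local name="$1" zone="$2" max_tries="${3:-30}" delay="${4:-10}"
  local try=1 status
  while [ "${try}" -le "${max_tries}" ]; do
    # Ask Compute Engine for the current lifecycle status of the instance.
    status="$(gcloud compute instances describe "${name}" \
      --zone "${zone}" --format='value(status)' 2>/dev/null)"
    if [ "${status}" = "RUNNING" ]; then
      echo "Instance ${name} is running."
      return 0
    fi
    echo "Status: ${status:-unknown} (try ${try} of ${max_tries})"
    sleep "${delay}"
    try=$((try + 1))
  done
  echo "Timed out waiting for ${name}." >&2
  return 1
}

# Example (replace the placeholders before running):
# wait_for_instance PROJECT_ID-deepvariant-run ZONE
```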

Connecting to the instance

You can connect to the instance using either the Cloud Console or the gcloud tool:


  1. In the Cloud Console, go to the VM Instances page.

  2. In the list of virtual machine instances, click SSH in the row of the instance that you created.


gcloud compute ssh PROJECT_ID-deepvariant-run --zone ZONE

Running DeepVariant

Configure your environment and run DeepVariant by completing the following steps on the Compute Engine instance you created:

  1. Install Docker Community Edition (CE) by running the following commands:

    sudo apt-get -qq -y install \
      apt-transport-https \
      ca-certificates \
      curl \
      gnupg-agent \
      software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) \
      stable"
    sudo apt-get -qq -y update
    sudo apt-get -qq -y install docker-ce
  2. Configure the DeepVariant environment variables that the remaining steps use: BIN_VERSION, INPUT_DIR, OUTPUT_DIR, DATA_DIR, REF, BAM, FTPDIR, OUTPUT_VCF, and OUTPUT_GVCF.

  3. Create the local directory structure for the input data directory and the output directory:

    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${INPUT_DIR}"
    mkdir -p "${DATA_DIR}"
  4. This tutorial uses a publicly available HG003 genome at 30x coverage, mapped to the GRCh38 reference. To keep the runtime short, you add the --regions chr20 flag when you run DeepVariant so that it processes only chromosome 20 (chr20).

    The sample data was created using Illumina sequencing. DeepVariant supports the following types of input data (the corresponding model_type value is shown in parentheses):

    • Whole genome (Illumina) (WGS)
    • Exome (Illumina) (WES)
    • Whole genome (PacBio) (PACBIO)
    • Whole genome PacBio and Illumina hybrid (HYBRID_PACBIO_ILLUMINA)

    Copy the input BAM and BAI files from the deepvariant Cloud Storage bucket to the instance by running the gsutil cp command, and download the reference FASTA file and its index by using curl:

    # Input BAM and BAI files:
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}" "${DATA_DIR}"
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}".bai "${DATA_DIR}"
    # GRCh38 reference FASTA file:
    curl "${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz" | gunzip > "${DATA_DIR}/${REF}"
    curl "${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai" > "${DATA_DIR}/${REF}.fai"
  5. DeepVariant is a containerized application staged in a pre-built Docker image in Container Registry. To pull the image, run the following command:

    sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
  6. To start DeepVariant, run the following command:

    sudo docker run \
        -v "${DATA_DIR}":"/input" \
        -v "${OUTPUT_DIR}:/output" \
        gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref="/input/${REF}" \
        --reads="/input/${BAM}" \
        --output_vcf=/output/${OUTPUT_VCF} \
        --output_gvcf=/output/${OUTPUT_GVCF} \
        --regions chr20 \
        --num_shards=$(nproc) \
        --intermediate_results_dir /output/intermediate_results_dir

    The following table describes the flags passed in to the command:

    Flag                      Description
    model_type                DeepVariant supports several types of input data. This tutorial uses whole genome sequencing (WGS).
    ref                       The location of the reference FASTA file.
    reads                     The location of the input BAM file.
    output_vcf                The location of the output VCF file.
    output_gvcf               The location of the output gVCF file.
    regions                   (Optional) A space-separated list of chromosome regions to process. Individual elements can be region literals, such as chr20:10-20, or paths to BED/BEDPE files.
    num_shards                The number of shards to run in parallel. For best results, set the value of this flag to the number of cores on the machine where DeepVariant runs.
    intermediate_results_dir  (Optional) The directory for the intermediate outputs of the make_examples and call_variants stages. After the command completes, the intermediate files remain in this directory.

    If the command starts to run successfully, it outputs a message starting with the following:

    ***** Running the command:*****
    time seq 0 63 | parallel
    --line-buffer /opt/deepvariant/bin/make_examples
    --mode calling
    --ref "/input/GRCh38_no_alt_analysis_set.fasta"
    --reads "/input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam"
    --examples "/output/intermediate_results_dir/make_examples.tfrecord@64.gz"
    --regions "chr20"
    --gvcf "/output/intermediate_results_dir/gvcf.tfrecord@64.gz"
    --task {}
  7. After DeepVariant finishes, it outputs the following files to the deepvariant-run/output directory:

    • HG003.output.g.vcf.gz
    • HG003.output.g.vcf.gz.tbi
    • HG003.output.vcf.gz
    • HG003.output.vcf.gz.tbi
    • HG003.output.visual_report.html

    Run the following command to list the files in the output directory, and confirm that all of the output files are present:

    ls "${OUTPUT_DIR}"
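The steps above depend on several environment variables whose definitions are not shown on this page. The following is a hedged sketch of what they could look like. REF, BAM, OUTPUT_VCF, and OUTPUT_GVCF are taken from file names that appear in the log output and in step 7. Placing the working directory under ${HOME}/deepvariant-run and the data directory under the input directory are assumptions. BIN_VERSION and FTPDIR cannot be recovered from this page, so they are left for you to supply.

```shell
#!/usr/bin/env bash
# Assumed values, reconstructed from names that appear elsewhere on this page.
BASE="${HOME}/deepvariant-run"          # assumption: matches the deepvariant-run/output path in step 7
INPUT_DIR="${BASE}/input"
DATA_DIR="${INPUT_DIR}/data"            # assumption: data lives under the input directory
OUTPUT_DIR="${BASE}/output"
REF="GRCh38_no_alt_analysis_set.fasta"  # from the make_examples log output
BAM="HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam"
OUTPUT_VCF="HG003.output.vcf.gz"
OUTPUT_GVCF="HG003.output.g.vcf.gz"

# Not recoverable from this page: set these yourself before running the steps.
BIN_VERSION="${BIN_VERSION:-}"  # DeepVariant image tag
FTPDIR="${FTPDIR:-}"            # base URL that hosts the GRCh38 reference files

# Warn about any required variable that is still empty.
for name in BIN_VERSION FTPDIR; do
  eval "value=\${${name}}"
  [ -n "${value}" ] || echo "warning: ${name} is not set" >&2
done
```

Run these definitions in the same shell session as the later commands, because the variables are not exported or persisted.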

Cost and runtime estimates

The following table shows the approximate runtime and costs when running DeepVariant using a 30x whole genome sample in a BAM file. These estimates do not include the time required to set up the instance and download any sample data from Cloud Storage.

The table contains estimates for preemptible VMs and non-preemptible VMs. The runtime estimates are based on using non-preemptible VMs.

Preemptible VMs are up to 80% cheaper than regular VMs. However, if Compute Engine requires access to those resources for other tasks, it might terminate (preempt) these instances. Preemptible VMs are not covered by any Service Level Agreement (SLA), so if you require guarantees on turnaround time, do not use the --preemptible flag.

See the Compute Engine best practices for how to effectively use preemptible VMs.

Machine type     Runtime (hours)   Cost (non-preemptible)   Cost (preemptible)
n1-standard-8    27.6              $11.3                    $3.04
n1-standard-16   15.4              $12.1                    $2.92
n1-standard-32   9.47              $14.7                    $3.32
n1-standard-64   6.8               $20.9                    $4.55
n1-standard-96   5.58              $25.6                    $5.53
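To compare machine types, it can help to derive the implied hourly rate and the preemptible discount from the table. The following sketch does that arithmetic with awk. The figures are copied from the table above, and cost_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Columns: machine type, runtime (hours), non-preemptible cost, preemptible cost.
# The values are copied from the cost and runtime table.
cost_summary() {
  awk '{
    hourly  = $3 / $2              # implied non-preemptible cost per hour
    savings = (1 - $4 / $3) * 100  # percent saved by using a preemptible VM
    printf "%-15s $%.2f/hour  %.1f%% cheaper when preemptible\n", $1, hourly, savings
  }' <<'EOF'
n1-standard-8  27.6 11.3 3.04
n1-standard-16 15.4 12.1 2.92
n1-standard-32 9.47 14.7 3.32
n1-standard-64 6.8  20.9 4.55
n1-standard-96 5.58 25.6 5.53
EOF
}

cost_summary
```

For these rows the preemptible discount falls between roughly 73% and 78%, which is consistent with the "up to 80%" figure. Total cost rises with machine size even though runtime falls, so the smaller machine types are cheaper when turnaround time is not critical.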

Cleaning up

After you finish this tutorial, you can clean up the resources that you created on Google Cloud so that you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

  2. In the project list, select the checkbox next to the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
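If you want to keep the project and remove only the instance, the gcloud tool can delete it. The following is a minimal sketch, assuming the instance uses the PROJECT_ID-deepvariant-run naming from this tutorial. The helper name delete_tutorial_instance is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete only the tutorial instance and keep the project.
# Usage: delete_tutorial_instance PROJECT_ID ZONE
delete_tutorial_instance() {
  local project_id="$1" zone="$2"
  if [ -z "${project_id}" ] || [ -z "${zone}" ]; then
    echo "usage: delete_tutorial_instance PROJECT_ID ZONE" >&2
    return 2
  fi
  # --quiet skips the interactive confirmation prompt, so double-check the name.
  gcloud compute instances delete "${project_id}-deepvariant-run" \
    --project "${project_id}" --zone "${zone}" --quiet
}

# To delete the whole project from the command line instead:
# gcloud projects delete PROJECT_ID
```

Deleting the instance also deletes its boot disk by default, so copy any output files that you want to keep before you run it.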

What's next