Run DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

This tutorial explains how to run DeepVariant on Google Cloud using sample data. You run DeepVariant on a single Compute Engine instance.

Objectives

After completing this tutorial, you'll know how to run DeepVariant on Google Cloud.

Costs

This tutorial uses the following billable components of Google Cloud:

  • Compute Engine

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine API.

    Enable the API

  5. Install and initialize the Cloud SDK.

    Tip: Need a command prompt? You can use Cloud Shell. Cloud Shell is a command-line environment that already includes the Cloud SDK, so you don't need to install it.
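
If you prefer the command line to the Cloud Console, the project selection and API enablement steps above can be sketched with the gcloud tool. This is a minimal sketch, assuming the Cloud SDK is installed and you are already signed in. The helper name setup_project is hypothetical and is not part of the tutorial.

```shell
# Hypothetical helper: select a project and enable the Compute Engine API.
# Assumes the Cloud SDK is installed and you are already authenticated.
setup_project() {
  project_id="$1"
  if [ -z "${project_id}" ]; then
    echo "usage: setup_project PROJECT_ID" >&2
    return 1
  fi
  # Make the project the default for later gcloud commands.
  gcloud config set project "${project_id}" || return 1
  # Enable the Compute Engine API in that project.
  gcloud services enable compute.googleapis.com --project "${project_id}"
}

# Usage (replace PROJECT_ID with your own project ID):
#   setup_project PROJECT_ID
```

Billing still has to be enabled for the project separately, as described in the steps above.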

Create a Compute Engine instance

To run DeepVariant, create a Compute Engine instance using either the Google Cloud Console or the gcloud tool.

Console

  1. In the Cloud Console, go to the VM Instances page.

    Go to the VM Instances page

  2. Click Create instance.
  3. Choose a Name for the instance in the format PROJECT_ID-deepvariant-run where PROJECT_ID is the ID for your Google Cloud project.
  4. Choose a Region and Zone for the instance. Unless you have a specific reason to run the instance in a certain location, select us-central1 (Iowa) for the Region and us-central1-a for the Zone.
  5. In the Machine type menu, select n1-standard-64 (64 vCPU, 240 GB memory).
  6. In the CPU platform menu, select Intel Skylake or later.
  7. In the Boot disk section, click Change to begin configuring your boot disk.
  8. On the Public images tab, choose Ubuntu 20.04 LTS. In the Boot disk type menu, select Standard persistent disk. In the Size (GB) field, enter 300. Click Select.
  9. Click Create to create the instance.

gcloud

gcloud compute instances create \
    PROJECT_ID-deepvariant-run \
    --project PROJECT_ID \
    --zone ZONE \
    --scopes "cloud-platform" \
    --image-project ubuntu-os-cloud \
    --image-family ubuntu-2004-lts \
    --machine-type n1-standard-64 \
    --min-cpu-platform "Intel Skylake" \
    --boot-disk-size=300GB

Replace the following:

  • PROJECT_ID: your Google Cloud project ID
  • ZONE: the zone in which to deploy your instance. A zone is an approximate regional location in which your instance and its resources live. For example, us-west1-a is a zone in the us-west1 region. If you've set a default zone previously using gcloud config set compute/zone, the value of this flag overrides that default.

Allow a short time for the instance to start up. After it's ready, it will appear on the VM Instances page with a green status icon.
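
Instead of watching the VM Instances page, you can poll the instance status from the command line. The following is a minimal sketch, assuming the gcloud tool is installed and authenticated. The helper name wait_for_instance, the default of 30 attempts, and the 10-second pause are illustrative choices, not values from this tutorial.

```shell
# Hypothetical helper: poll until the instance reports RUNNING.
# Arguments: INSTANCE_NAME ZONE [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_instance() {
  name="$1"
  zone="$2"
  max_attempts="${3:-30}"
  sleep_seconds="${4:-10}"
  attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    # "status" is a field of the Compute Engine instance resource.
    status="$(gcloud compute instances describe "${name}" \
      --zone "${zone}" --format='value(status)' 2>/dev/null)"
    if [ "${status}" = "RUNNING" ]; then
      echo "${name} is RUNNING"
      return 0
    fi
    echo "attempt ${attempt}: status is '${status:-unknown}', retrying" >&2
    sleep "${sleep_seconds}"
    attempt=$((attempt + 1))
  done
  echo "${name} did not reach RUNNING after ${max_attempts} attempts" >&2
  return 1
}

# Usage (replace PROJECT_ID and ZONE with your own values):
#   wait_for_instance PROJECT_ID-deepvariant-run ZONE
```

A RUNNING status means the VM has started; the SSH service inside it can take a few more seconds to accept connections.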

Connect to the instance

You can connect to the instance using either the Cloud Console or the gcloud tool:

Console

  1. In the Cloud Console, go to the VM Instances page.

    Go to the VM Instances page

  2. In the list of virtual machine instances, click SSH in the row of the instance that you created.

gcloud

gcloud compute ssh PROJECT_ID-deepvariant-run --zone ZONE

Run DeepVariant

Configure your environment and run DeepVariant on the Compute Engine instance you created:

  1. Install Docker Community Edition (CE):

    sudo apt-get -qq -y update
    sudo apt-get -qq -y install \
      apt-transport-https \
      ca-certificates \
      curl \
      gnupg-agent \
      software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) \
      stable"
    sudo apt-get -qq -y update
    sudo apt-get -qq -y install docker-ce
    
  2. Configure the DeepVariant environment variables by copying and pasting the following commands into the terminal session on your instance:

    BIN_VERSION="1.2.0"
    BASE="${HOME}/deepvariant-run"
    INPUT_DIR="${BASE}/input"
    REF="GRCh38_no_alt_analysis_set.fasta"
    BAM="HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam"
    OUTPUT_DIR="${BASE}/output"
    DATA_DIR="${INPUT_DIR}/data"
    OUTPUT_VCF="HG003.output.vcf.gz"
    OUTPUT_GVCF="HG003.output.g.vcf.gz"
    
  3. Create the local directory structure for the input data directory and the output directory:

    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${INPUT_DIR}"
    mkdir -p "${DATA_DIR}"
    
  4. Download the sample data. This tutorial uses a publicly available HG003 genome at 30x coverage mapped to the GRCh38 reference. To shorten the runtime, the DeepVariant command later in this tutorial passes the --regions chr20 flag so that DeepVariant processes only chromosome 20 (chr20).

    The sample data was created using Illumina sequencing. DeepVariant supports the following types of input data. The value in parentheses is the corresponding model_type setting:

    • Whole genome (Illumina) (WGS)
    • Exome (Illumina) (WES)
    • Whole genome (PacBio) (PACBIO)
    • Whole genome PacBio and Illumina hybrid (HYBRID_PACBIO_ILLUMINA)

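    Run the following commands to copy the input BAM and BAI files from the deepvariant Cloud Storage bucket with gsutil cp, and to download the GRCh38 reference FASTA and index files from the NCBI FTP site with curl. All files are saved to the data directory on the instance you created: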

    # Input BAM and BAI files:
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}" "${DATA_DIR}"
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}".bai "${DATA_DIR}"
    
    # GRCh38 reference FASTA file:
    FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
    curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > "${DATA_DIR}/${REF}"
    curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > "${DATA_DIR}/${REF}".fai
    
  5. DeepVariant is a containerized application staged in a pre-built Docker image in Container Registry. To pull the image, run the following command:

    sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
    
  6. To start DeepVariant, run the following command:

    sudo docker run \
        -v "${DATA_DIR}":"/input" \
        -v "${OUTPUT_DIR}:/output" \
        gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"  \
        /opt/deepvariant/bin/run_deepvariant \
        --model_type=WGS \
        --ref="/input/${REF}" \
        --reads="/input/${BAM}" \
        --output_vcf=/output/${OUTPUT_VCF} \
        --output_gvcf=/output/${OUTPUT_GVCF} \
        --regions chr20 \
        --num_shards=$(nproc) \
        --intermediate_results_dir /output/intermediate_results_dir
    

    The following table describes the flags passed to the command:

    Flag                      Description
    model_type                The type of input data. DeepVariant supports several types; this tutorial uses whole genome sequencing (WGS).
    ref                       The location of the reference FASTA file.
    reads                     The location of the input BAM file.
    output_vcf                The location of the output VCF files.
    output_gvcf               The location of the output gVCF files.
    regions                   (Optional) A space-separated list of chromosome regions to process. Individual elements can be region literals, such as chr20:10-20, or paths to BED/BEDPE files.
    num_shards                The number of shards to run in parallel. For best results, set the value of this flag to the number of cores on the machine where DeepVariant runs.
    intermediate_results_dir  (Optional) The directory for the intermediate outputs of the make_examples and call_variants stages. After the command completes, the files are saved to that directory in the following formats: call_variants_output.tfrecord.gz, gvcf.tfrecord-SHARD_NUMBER-of-NUM_OF_SHARDS.gz, and make_examples.tfrecord-SHARD_NUMBER-of-NUM_OF_SHARDS.gz.
    dry_run                   (Optional) If set to true, the commands are printed instead of being executed. This tutorial does not set this flag.
    
  7. After DeepVariant finishes, it outputs the following files to the deepvariant-run/output directory:

    • HG003.output.g.vcf.gz
    • HG003.output.g.vcf.gz.tbi
    • HG003.output.vcf.gz
    • HG003.output.vcf.gz.tbi
    • HG003.output.visual_report.html

    Run the following command to list the files in the output directory and confirm that the output files are present:

    ls "${OUTPUT_DIR}"
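
As a complement to inspecting the ls output by eye, the check can be scripted. The following is a minimal sketch, assuming the output file names listed in step 7. The helper name check_outputs is hypothetical. It reports each expected file as present or missing and, if all are present, counts the data records in the VCF file.

```shell
# Hypothetical helper: confirm the expected DeepVariant outputs exist and
# are non-empty, then count the data records in the VCF file.
# Argument: the output directory (OUTPUT_DIR in this tutorial).
check_outputs() {
  out_dir="$1"
  missing=0
  for f in \
    HG003.output.vcf.gz \
    HG003.output.vcf.gz.tbi \
    HG003.output.g.vcf.gz \
    HG003.output.g.vcf.gz.tbi \
    HG003.output.visual_report.html
  do
    if [ -s "${out_dir}/${f}" ]; then
      echo "ok       ${f}"
    else
      echo "MISSING  ${f}"
      missing=$((missing + 1))
    fi
  done
  if [ "${missing}" -gt 0 ]; then
    return 1
  fi
  # VCF header lines start with '#'; every other line is one record.
  records="$(gzip -dc "${out_dir}/HG003.output.vcf.gz" | grep -vc '^#' || true)"
  echo "VCF records: ${records}"
}

# Usage:
#   check_outputs "${OUTPUT_DIR}"
```

The record count includes every line that DeepVariant wrote to the VCF file, whatever its FILTER value, so treat it as a sanity check rather than a count of confident variant calls.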
    

Runtime estimates

The following table shows the approximate runtime when running DeepVariant using a 30x whole genome sample in a BAM file. These estimates do not include the time required to set up the instance and download any sample data from Cloud Storage.

You can refer to Compute Engine pricing for pricing per hour. Consider using Spot VMs, which are significantly cheaper than regular VMs.

Machine type      Runtime in hours
n1-standard-8     24.63
n1-standard-16    13.30
n1-standard-32    7.77
n1-standard-64    5.64
n1-standard-96    4.38
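
To compare machine types, it can help to convert the runtimes above into vCPU-hours (vCPUs multiplied by runtime). The following sketch does that arithmetic with awk, using only the numbers from the table. It assumes that the hourly price within the n1-standard family scales roughly with the vCPU count, so vCPU-hours is only a rough proxy for relative cost; check Compute Engine pricing for actual figures. The function names are illustrative.

```shell
# Runtime figures copied from the table above: "vCPUs hours" per line.
runtime_table() {
  cat <<'EOF'
8 24.63
16 13.30
32 7.77
64 5.64
96 4.38
EOF
}

# Print vCPU-hours per run and the speedup relative to n1-standard-8.
summarize_runtimes() {
  runtime_table | awk '
    NR == 1 { base = $2 }
    {
      printf "n1-standard-%d  %6.2f h  %7.2f vCPU-hours  %.2fx speedup\n",
             $1, $2, $1 * $2, base / $2
    }'
}

summarize_runtimes
```

Under that assumption, larger machines trade cost for time: n1-standard-96 finishes about 5.6 times faster than n1-standard-8 but uses a little over twice the vCPU-hours (420.48 compared with 197.04).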

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
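
If you want to keep the project and remove only the instance, the gcloud tool offers an alternative to the Console steps above. The following is a minimal sketch, assuming the instance was named with the PROJECT_ID-deepvariant-run convention from this tutorial. The helper name delete_tutorial_instance is hypothetical.

```shell
# Hypothetical helper: delete only the tutorial's instance and keep the project.
# Arguments: PROJECT_ID ZONE
delete_tutorial_instance() {
  project_id="$1"
  zone="$2"
  if [ -z "${project_id}" ] || [ -z "${zone}" ]; then
    echo "usage: delete_tutorial_instance PROJECT_ID ZONE" >&2
    return 1
  fi
  # The instance name follows the PROJECT_ID-deepvariant-run convention
  # used when the instance was created. --quiet skips the confirmation prompt.
  gcloud compute instances delete "${project_id}-deepvariant-run" \
    --project "${project_id}" --zone "${zone}" --quiet
}

# Usage (replace PROJECT_ID and ZONE with your own values):
#   delete_tutorial_instance PROJECT_ID ZONE
#
# To delete the whole project from the command line instead:
#   gcloud projects delete PROJECT_ID
```

Copy any output files that you want to keep off the instance before you delete it.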

What's next