DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
This page explains how to run DeepVariant on Google Cloud using a single Compute Engine instance.
More complex configurations are available in the DeepVariant GitHub repository. For example, you can run DeepVariant using multiple instances. These configurations can improve processing speed and reduce costs.
Running DeepVariant consists of three stages:
- Making examples: DeepVariant pre-processes the input data and saves examples from the data using an internal TensorFlow format. You can run this stage in parallel where the input shards are processed independently.
- Calling variants: DeepVariant runs a deep neural network that makes inferences from the examples and saves them into shared files using an internal TensorFlow format.
- Post-processing variants: DeepVariant converts variants from the internal TensorFlow format to VCF or gVCF files. This stage runs on a single thread.
In this tutorial, you run these stages using a single instance. The first and second stages (making examples and calling variants, respectively) can benefit from parallelization on multiple cores. However, the third stage (post-processing variants) does not have the same benefits because it runs on a single thread.
Objectives
After completing this tutorial, you'll know how to:
- Run DeepVariant on Google Cloud
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
Generate a cost estimate based on your projected usage by using the Pricing Calculator. New Cloud Platform users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Compute Engine API.
- Install and initialize the Cloud SDK.
Tip: Need a command prompt? You can use the Cloud Shell. The Cloud Shell is a command-line environment that already includes the Cloud SDK, so you don't need to install it.
Creating a Compute Engine instance
Before you run DeepVariant, create the Compute Engine instance that it will run on.
You can create a Linux virtual machine instance in Compute Engine using either the Google Cloud Console or the gcloud command-line tool.
Console
- In the Cloud Console, go to the VM Instances page.
- Click Create instance.
- Choose a Name for the instance in the format PROJECT_ID-deepvariant-run, where PROJECT_ID is the ID for your Google Cloud project.
- Choose a Region and Zone for the instance. Unless you have a specific reason to run the instance in a certain location, select us-central1 (Iowa) for the Region and us-central1-a for the Zone.
- In the Machine type menu, select n1-standard-64 (64 vCPU, 240 GB memory).
- In the CPU platform menu, select Intel Skylake or later.
- In the Boot disk section, click Change to begin configuring your boot disk.
- On the OS images tab, choose Google Drawfork Ubuntu 16.04 LTS. In the Boot disk type menu, select Standard persistent disk. In the Size (GB) field, enter 300. Click Select.
- Click Create to create the instance.
gcloud
    gcloud compute instances create \
      PROJECT_ID-deepvariant-run \
      --project PROJECT_ID \
      --zone ZONE \
      --scopes "cloud-platform" \
      --image-project ubuntu-os-cloud \
      --image-family ubuntu-1604-lts \
      --machine-type n1-standard-64 \
      --min-cpu-platform "Intel Skylake" \
      --boot-disk-size=300GB
where:
- PROJECT_ID is the ID for your Google Cloud project.
- ZONE is the zone in which your instance is deployed. A zone is an approximate regional location in which your instance and its resources live. For example, us-west1-a is a zone in the us-west1 region. If you've previously set a default zone using gcloud config set compute/zone, the value of this flag overrides that default.
Allow a short time for the instance to start up. After it's ready, it will be listed on the VM Instances page with a green status icon.
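If you created the instance with gcloud, you can also wait for it from the command line instead of watching the VM Instances page. The following is a minimal sketch, not part of the official tutorial: wait_for_instance is a hypothetical helper name, and the retry count and delay are arbitrary defaults. It relies on gcloud compute instances describe with the value(status) format, which prints only the instance status.

```shell
# Sketch: poll an instance until Compute Engine reports it as RUNNING.
# wait_for_instance is a hypothetical helper, not a gcloud command.
# Arguments: instance name, zone, max attempts (default 30),
# seconds between attempts (default 10).
wait_for_instance() {
  local instance="$1" zone="$2" attempts="${3:-30}" delay="${4:-10}"
  local status i=1
  while [ "${i}" -le "${attempts}" ]; do
    # value(status) prints only the status field, for example RUNNING.
    status="$(gcloud compute instances describe "${instance}" \
      --zone "${zone}" --format='value(status)' 2>/dev/null)" || status=""
    if [ "${status}" = "RUNNING" ]; then
      echo "${instance} is running"
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "${instance} did not reach RUNNING after ${attempts} checks" >&2
  return 1
}

# Example (replace the placeholders first):
# wait_for_instance PROJECT_ID-deepvariant-run ZONE
```

A RUNNING status means the VM has started booting. SSH might need a few more seconds before it accepts connections.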
Connecting to the instance
You can connect to the instance using either the Cloud Console
or the gcloud
tool:
Console
- In the Cloud Console, go to the VM Instances page.
- In the list of virtual machine instances, click SSH in the row of the instance that you created.
gcloud
gcloud compute ssh PROJECT_ID-deepvariant-run --zone ZONE
Running DeepVariant
Configure your environment and run DeepVariant by completing the following steps on the Compute Engine instance you created:
Install Docker Community Edition (CE) by running the following commands:
    sudo apt-get -qq -y install \
      apt-transport-https \
      ca-certificates \
      curl \
      gnupg-agent \
      software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) \
      stable"
    sudo apt-get -qq -y update
    sudo apt-get -qq -y install docker-ce
Configure the DeepVariant environment variables by copying and pasting the following commands:
    BIN_VERSION="1.1.0"
    BASE="${HOME}/deepvariant-run"
    INPUT_DIR="${BASE}/input"
    REF="GRCh38_no_alt_analysis_set.fasta"
    BAM="HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam"
    OUTPUT_DIR="${BASE}/output"
    DATA_DIR="${INPUT_DIR}/data"
    OUTPUT_VCF="HG003.output.vcf.gz"
    OUTPUT_GVCF="HG003.output.g.vcf.gz"
Create the local directory structure for the input data directory and the output directory:
    mkdir -p "${OUTPUT_DIR}"
    mkdir -p "${INPUT_DIR}"
    mkdir -p "${DATA_DIR}"
This tutorial uses a publicly available HG003 genome at 30x coverage mapped to the GRCh38 reference. To shorten the runtime, you add the --regions chr20 flag when you run DeepVariant so that it processes only chromosome 20 (chr20). The sample data was created using Illumina sequencing. DeepVariant supports the following types of input data:
- Whole genome (Illumina) (WGS)
- Exome (Illumina) (WES)
- Whole genome (PacBio)
- Whole genome PacBio and Illumina hybrid (HYBRID_PACBIO_ILLUMINA)
Copy the input test data to the directories on the instance you created. The BAM and BAI files come from the deepvariant Cloud Storage bucket, which you copy with the gsutil cp command. The GRCh38 reference FASTA file and its index come from the NCBI FTP server, which you download with curl:

    # Input BAM and BAI files:
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}" "${DATA_DIR}"
    gsutil cp gs://deepvariant/case-study-testdata/"${BAM}".bai "${DATA_DIR}"

    # GRCh38 reference FASTA file:
    FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
    curl "${FTPDIR}"/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > "${DATA_DIR}/${REF}"
    curl "${FTPDIR}"/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > "${DATA_DIR}/${REF}".fai
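Because the reference download streams through gunzip, an interrupted transfer can leave a missing or empty file behind. The following sketch checks that the four input files exist and are not empty before you start a long run. check_inputs is a hypothetical helper name introduced here for illustration; it only tests for presence and size, not for file integrity.

```shell
# check_inputs is a hypothetical helper: it confirms that the BAM, its
# .bai index, the reference FASTA, and its .fai index exist in the data
# directory and are not empty.
# Arguments: data directory, BAM file name, reference file name.
check_inputs() {
  local data_dir="$1" bam="$2" ref="$3"
  local missing=0 f
  for f in "${bam}" "${bam}.bai" "${ref}" "${ref}.fai"; do
    # -s is true only when the file exists and has a size greater than 0.
    if [ -s "${data_dir}/${f}" ]; then
      echo "ok       ${f}"
    else
      echo "MISSING  ${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example, using the variables defined earlier in this tutorial:
# check_inputs "${DATA_DIR}" "${BAM}" "${REF}"
```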
DeepVariant is a containerized application staged in a pre-built Docker image in Container Registry. To pull the image, run the following command:
sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
To start DeepVariant, run the following command:
    sudo docker run \
      -v "${DATA_DIR}":"/input" \
      -v "${OUTPUT_DIR}:/output" \
      gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
      /opt/deepvariant/bin/run_deepvariant \
      --model_type=WGS \
      --ref="/input/${REF}" \
      --reads="/input/${BAM}" \
      --output_vcf=/output/${OUTPUT_VCF} \
      --output_gvcf=/output/${OUTPUT_GVCF} \
      --regions chr20 \
      --num_shards=$(nproc) \
      --intermediate_results_dir /output/intermediate_results_dir
The following table describes the flags passed in to the command:
Flag | Description |
---|---|
model_type | DeepVariant supports several different types of input data. This tutorial uses whole genome sequencing (WGS). |
ref | The location of the reference FASTA file. |
reads | The location of the input BAM file. |
output_vcf | The location of the output VCF files. |
output_gvcf | The location of the output gVCF files. |
regions | (Optional) A space-separated list of chromosome regions to process. Individual elements can be region literals, such as chr20:10-20, or paths to BED/BEDPE files. |
num_shards | The number of shards to run in parallel. For best results, set the value of this flag to the number of cores on the machine where DeepVariant runs. |
intermediate_results_dir | (Optional) The directory for the intermediate outputs of the make_examples and call_variants stages. After the command completes, the files are saved to this directory in the following formats: call_variants_output.tfrecord.gz, gvcf.tfrecord-SHARD_NUMBER-of-NUM_OF_SHARDS.gz, and make_examples.tfrecord-SHARD_NUMBER-of-NUM_OF_SHARDS.gz. |
If the command starts to run successfully, it outputs a message starting with the following:
    ***** Running the command:*****
    time seq 0 63 | parallel -k --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/input/GRCh38_no_alt_analysis_set.fasta" --reads "/input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam" --examples "/output/intermediate_results_dir/make_examples.tfrecord@64.gz" --regions "chr20" --gvcf "/output/intermediate_results_dir/gvcf.tfrecord@64.gz" --task {}
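In the message, a name such as make_examples.tfrecord@64.gz is shorthand for a set of 64 shard files that follow the SHARD_NUMBER-of-NUM_OF_SHARDS pattern, one per task started by seq 0 63. The following sketch expands that shorthand into individual file names so that you know what to expect in the intermediate results directory. shard_names is a hypothetical helper, and the five-digit zero padding is an assumption about the naming convention; compare it against the files that your run produces.

```shell
# shard_names is a hypothetical helper that expands a sharded file spec
# such as make_examples.tfrecord@64.gz into per-shard file names.
# Assumption: shard numbers are zero-padded to five digits.
# Arguments: file prefix, number of shards, file suffix.
shard_names() {
  local prefix="$1" num_shards="$2" suffix="$3"
  local i=0
  while [ "${i}" -lt "${num_shards}" ]; do
    printf '%s-%05d-of-%05d%s\n' "${prefix}" "${i}" "${num_shards}" "${suffix}"
    i=$((i + 1))
  done
}

# Print the first and last of the 64 shard names used in this tutorial.
shard_names make_examples.tfrecord 64 .gz | sed -n '1p;$p'
```

Shard numbering starts at 0, so 64 shards end at shard 63, which matches the seq 0 63 range in the message.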
After DeepVariant finishes, it outputs the following files to the deepvariant-run/output directory:
- HG003.output.g.vcf.gz
- HG003.output.g.vcf.gz.tbi
- HG003.output.vcf.gz
- HG003.output.vcf.gz.tbi
- HG003.output.visual_report.html
Run the following command to list the files in the output directory, and check that all of the output files display:
ls $OUTPUT_DIR
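Beyond confirming that the files exist, you might want a quick check that the VCF file actually contains variant records. The following sketch counts the non-header lines in a compressed VCF file. count_records is a hypothetical helper name, and the count is only a rough sanity check, not a validation of the calls.

```shell
# count_records is a hypothetical helper: it counts the non-header lines
# (one per variant record) in a gzip-compressed VCF file.
# Argument: path to a .vcf.gz file.
count_records() {
  # Header lines in a VCF file start with "#"; every other line is a record.
  gzip -dc "$1" | grep -vc '^#'
}

# Example, using the variables defined earlier in this tutorial:
# count_records "${OUTPUT_DIR}/${OUTPUT_VCF}"
```

A result of 0 for the chr20 run would suggest that something went wrong upstream, for example an empty or mismatched input file.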
Cost and runtime estimates
The following table shows the approximate runtime and costs when running DeepVariant using a 30x whole genome sample in a BAM file. These estimates do not include the time required to set up the instance and download any sample data from Cloud Storage.
The table contains estimates for preemptible VMs and non-preemptible VMs. The runtime estimates are based on using non-preemptible VMs.
Preemptible VMs are up to 80% cheaper than regular VMs.
However, if Compute Engine requires access to those resources for other tasks,
it might terminate (preempt) these instances. Preemptible VMs are not covered
by any Service Level Agreement (SLA),
so if you require guarantees on turnaround time, do not use the --preemptible
flag.
See the Compute Engine best practices for how to effectively use preemptible VMs.
Machine type | Runtime in hours | Cost (non-preemptible) | Cost (preemptible) |
---|---|---|---|
n1-standard-8 | 27.6 | $11.3 | $3.04 |
n1-standard-16 | 15.4 | $12.1 | $2.92 |
n1-standard-32 | 9.47 | $14.7 | $3.32 |
n1-standard-64 | 6.8 | $20.9 | $4.55 |
n1-standard-96 | 5.58 | $25.6 | $5.53 |
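To compare the rows, it can help to derive two numbers from the table: the implied hourly rate (non-preemptible cost divided by runtime) and the speedup relative to the smallest machine. The following sketch does that arithmetic with awk. The rows are copied from the table above, and summarize is a hypothetical helper name.

```shell
# Rows copied from the table above: machine type, runtime (hours),
# non-preemptible cost (USD), preemptible cost (USD).
estimates='n1-standard-8 27.6 11.3 3.04
n1-standard-16 15.4 12.1 2.92
n1-standard-32 9.47 14.7 3.32
n1-standard-64 6.8 20.9 4.55
n1-standard-96 5.58 25.6 5.53'

# summarize is a hypothetical helper. For each row it prints the machine
# type, the implied hourly rate (cost / runtime), and the speedup relative
# to the first row (first-row runtime / this row's runtime).
summarize() {
  awk 'NR == 1 { base = $2 }
       { printf "%s %.2f %.2f\n", $1, $3 / $2, base / $2 }'
}

echo "machine_type usd_per_hour speedup_vs_first_row"
echo "${estimates}" | summarize
```

By this arithmetic, moving from 8 to 64 vCPUs shortens the run by a factor of about 4 rather than 8, so larger machines finish sooner but cost more per run.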
Cleaning up
After you finish this tutorial, you can clean up the resources that you created on Google Cloud so that you aren't billed for them in the future. The following sections describe how to delete or turn off these resources.
Deleting the project
The easiest way to eliminate billing is to delete the project you used for the tutorial.
To delete the project:
- In the Cloud Console, go to the Projects page.
- In the project list, select the project you want to delete and click Delete project.
- In the dialog, type the project ID, and then click Shut down to delete the project.
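If you prefer the command line, gcloud projects delete performs the same shutdown. The following sketch wraps it in a hypothetical helper, delete_project, that asks you to retype the project ID first, mirroring the console dialog. The wrapper and its prompt text are illustrative assumptions. The --quiet flag suppresses the confirmation prompt that gcloud would otherwise show, because the wrapper has already asked.

```shell
# delete_project is a hypothetical wrapper around `gcloud projects delete`.
# It asks for the project ID to be typed again, mirroring the console dialog.
# Argument: the ID of the project to delete.
delete_project() {
  local project_id="$1" typed
  printf 'Type the project ID (%s) to confirm: ' "${project_id}"
  read -r typed
  if [ "${typed}" != "${project_id}" ]; then
    echo "Project ID did not match; nothing deleted." >&2
    return 1
  fi
  # --quiet skips the interactive gcloud prompt; the check above replaces it.
  gcloud projects delete "${project_id}" --quiet
}

# Example (replace the placeholder first):
# delete_project PROJECT_ID
```

Deleting a project removes every resource in it, so use this only for a project that you created for this tutorial.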
What's next
- Read the DeepVariant documentation on GitHub.
- Read a Google AI blog post about DeepVariant's open source release.
- If you have questions about DeepVariant, you can file a GitHub issue. If you have questions about Google Cloud, you can post to the gcp-life-sciences-discuss@googlegroups.com mailing list.