Verily Life Sciences and Google are working together so that you can run the DeepVariant germline variant calling algorithm on human whole genome sequencing data on Google Genomics, powered by Google Cloud Platform.
DeepVariant uses a deep convolutional neural network to call genetic variants in aligned next-generation sequencing read data. The data preparation pipeline aligns the input FASTQs using bwa (v0.7.12), marks duplicate reads using Picard MarkDuplicates (v2.1.0), and performs local haplotype alignment using Verily's local assembler. Candidate variant loci are identified using the reassembled reads, and a deep convolutional neural network is used both to call genotypes and perform variant filtering. DeepVariant won the Precision FDA Truth Challenge for Highest SNP Performance. Verily has a publication in preparation which describes the approach in more detail.
By running DeepVariant on Google Genomics, you will incur Google Cloud Platform charges.
To run the pipeline:
This is an alpha release by invitation only.
Request access to run DeepVariant.
Only a limited number of testers will be granted access to DeepVariant at this time. Eventually our intention is to release an open-source version that's open to all.
- Enable the Cloud Storage and Compute Engine APIs on a new or existing Google Cloud Project using the Cloud Console
- Follow the Google Genomics getting started instructions to install the Google Cloud SDK, which includes the gcloud command-line tool and gsutil, on your laptop or workstation.
To install the bio commands, run:
gcloud components install alpha
Increase your CPU quota in us-central1 to at least 50 cores (each BWA alignment requires 32 cores).
Create a bucket for your DeepVariant results
To run DeepVariant on Google Genomics, you'll need a Cloud Storage bucket where output results and logs will be stored.
gsutil mb gs://my-bucket
Change my-bucket to a unique name that follows the bucket-naming conventions.
By default the bucket will be in the US. You can change the location using the -l command-line option.
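Before creating the bucket, it can help to sanity-check the name. The following is a minimal sketch, not part of the DeepVariant service: the function name is_plausible_bucket_name is hypothetical, the check covers only a subset of the bucket-naming conventions (length, allowed characters, first and last character), and us-central1 is an assumed location chosen to match the quota step.

```shell
#!/usr/bin/env bash
# Sketch: check a candidate bucket name before calling `gsutil mb`.
# Only a subset of the naming conventions is checked here.
is_plausible_bucket_name() {
  local name="$1"
  # 3-63 characters; lowercase letters, digits, dashes, underscores, dots;
  # must start and end with a letter or digit.
  [[ "$name" =~ ^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$ ]]
}

bucket="my-bucket"       # placeholder: replace with your own unique name
location="us-central1"   # assumed location; use whichever suits your project

if is_plausible_bucket_name "$bucket"; then
  # Printed rather than executed so you can review the command first.
  echo "gsutil mb -l ${location} gs://${bucket}"
else
  echo "Bucket name '${bucket}' does not look valid" >&2
fi
```

A name that passes this check can still be rejected, for example because another project already uses it.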
The DeepVariant alpha service supports whole genome sequencing at high coverage (20-60x) from Illumina PCR-free sample preparation. Input is paired-end FASTQ files, from a single library and without barcodes. The DeepVariant service will align reads and call variants against build 38 of the human reference genome. The service typically processes each sample within about 48 hours.
Although the DeepVariant algorithm can support many other modes, including different genome assemblies and organisms, various sequencing technologies, and other sample preparation approaches, the alpha service only supports high coverage Illumina sequencing, analyzed with build 38 of the human reference.
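To judge whether a sample falls in the supported 20-60x range, a rough coverage estimate is often enough. The sketch below uses the common approximation of total sequenced bases divided by genome size. The genome size of 3.1 billion bases, the function names, and the example read counts are assumptions for illustration and do not come from the DeepVariant service.

```shell
#!/usr/bin/env bash
# Assumed approximate size of the human genome in bases (not from the service docs).
GENOME_SIZE_BASES=3100000000

# Estimate mean coverage from the number of read pairs and the read length.
# Uses integer arithmetic, so the result is rounded down.
estimate_coverage() {
  local read_pairs="$1" read_length="$2"
  echo $(( read_pairs * 2 * read_length / GENOME_SIZE_BASES ))
}

# Succeeds if the estimated coverage is within the supported 20-60x range.
coverage_supported() {
  local coverage="$1"
  (( coverage >= 20 && coverage <= 60 ))
}

# Hypothetical sample: 400 million read pairs of 150 bases each.
coverage="$(estimate_coverage 400000000 150)"
if coverage_supported "$coverage"; then
  echo "~${coverage}x: within the 20-60x range"
else
  echo "~${coverage}x: outside the 20-60x range"
fi
```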
For this example, we'll use public data in Google Cloud Storage from NA12878.
If you are a whitelisted alpha tester, you can run DeepVariant from your terminal:
gcloud alpha bio pipelines run deepvariant-alpha1 \
  --logging gs://my-bucket/my-path/logging \
  --sample-name NA12878 \
  --input-pair gs://genomics-public-data/precision-fda/input/HG001-NA12878-pFDA_S1_L001_R1_001.fastq.gz,gs://genomics-public-data/precision-fda/input/HG001-NA12878-pFDA_S1_L001_R2_001.fastq.gz \
  --output-path=gs://my-bucket/my-path/
Running: [operations/operation-id]
Change my-bucket and my-path to specify the desired output location for your DeepVariant VCF and gVCF files and logs.
The output is an operation-id that you can use to monitor the status or cancel the job.
When the job completes, you can find your output in the output path you specified.
A copy of the expected output VCF files can be found in
To run DeepVariant on your own Illumina human WGS data in Cloud Storage, update the input file paths to reference your own FASTQ files.
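When running several samples, it may be convenient to generate the run command from a few variables. The sketch below only prints the command so that it can be reviewed before running. It uses the same flags as the NA12878 example; the function name build_deepvariant_cmd and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a DeepVariant run command for one sample.
# Arguments: sample name, R1 FASTQ path, R2 FASTQ path, output path.
build_deepvariant_cmd() {
  local sample="$1" fastq_r1="$2" fastq_r2="$3" output_path="$4"
  # ${output_path%/} strips one trailing slash so paths join cleanly.
  echo "gcloud alpha bio pipelines run deepvariant-alpha1" \
    "--logging ${output_path%/}/logging" \
    "--sample-name ${sample}" \
    "--input-pair ${fastq_r1},${fastq_r2}" \
    "--output-path=${output_path%/}/"
}

# Hypothetical sample and paths; replace with your own.
build_deepvariant_cmd "my-sample" \
  "gs://my-bucket/my-path/my-sample_R1.fastq.gz" \
  "gs://my-bucket/my-path/my-sample_R2.fastq.gz" \
  "gs://my-bucket/my-path/results"
```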
Monitor the operation
To monitor the status, run:
gcloud alpha bio operations describe operation-id \
  --format='yaml(done, error, metadata.events)'
Change operation-id to the value returned above.
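For scripted monitoring, the describe output can be reduced to a single state. The sketch below assumes the YAML output contains a top-level `done: true` line once the operation has finished and a top-level `error:` key only when it has failed. That layout is an assumption about the output format, and the function name operation_state is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: classify an operation from its `describe` output, read on stdin.
# Assumes top-level `done:` and `error:` keys in the YAML (an assumption).
operation_state() {
  local description
  description="$(cat)"
  if grep -q '^error:' <<<"$description"; then
    echo "failed"
  elif grep -q '^done: true' <<<"$description"; then
    echo "succeeded"
  else
    echo "running"
  fi
}

# Example usage (not executed here):
#   gcloud alpha bio operations describe operation-id \
#     --format='yaml(done, error, metadata.events)' | operation_state
```

Calling this from a loop with a sleep of a few minutes between checks would be one way to wait for completion, given that a sample typically takes many hours.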
To cancel the operation, run:
gcloud alpha bio operations cancel operation-id
Verify the results
When all of the steps complete, you can check out the results by looking in your output directory:
gsutil ls gs://my-bucket/my-path/outputs/
That’s it! You have now run Verily's DeepVariant caller on Google Genomics.
Run DeepVariant on your own data
Before you scale up your processing to run multiple workflows concurrently, confirm quotas for your cloud resources.
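As a rough planning aid, the CPU quota needed for concurrent runs can be estimated from the per-sample figure. The sketch below assumes that each concurrent sample needs the same 50 cores recommended for a single run and that the requirement scales linearly. Both are assumptions, not documented behavior, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Assumed cores per concurrent sample, taken from the single-run quota figure.
CORES_PER_SAMPLE=50

# Estimate the CPU quota needed to run a given number of samples concurrently.
cores_for_samples() {
  local samples="$1"
  echo $(( samples * CORES_PER_SAMPLE ))
}

echo "Estimated cores for 4 concurrent samples: $(cores_for_samples 4)"
```

Your current regional quotas can be inspected with `gcloud compute regions describe us-central1` and compared against this estimate.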
To run DeepVariant on your own FASTQ files, transfer your data to a Google Cloud Storage bucket using gsutil, which you installed as part of the Google Cloud SDK:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp my-file gs://my-bucket/my-path
Change my-file to the name of the file you want to copy, and my-bucket/my-path to the bucket and path where you’d like your files to go.
(If you have files in a different format, like aligned or unaligned BAM, you must convert them first. You can convert them locally before uploading, or if you'd like to convert them in the cloud, the Pipelines API can help.)
Re-run this command as needed if you encounter any failures due to temporary network issues. (And in case you were wondering, parallel_composite_upload_threshold is a setting that helps with very large files, such as 100 GB BAM files.) For best practices when uploading large files to Google Cloud Storage, see Using Google Cloud Storage with Big Data.
Note that gsutil always verifies checksums automatically, so when the transfer succeeds, your data should be ready for DeepVariant.
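Re-running the upload after a transient failure can be automated with a small retry helper. This is a generic sketch rather than a gsutil feature: the function name retry and the RETRY_DELAY_SECONDS variable are hypothetical, and the default delay of 30 seconds is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: run a command up to max_attempts times, pausing between attempts.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed; retrying..." >&2
    attempt=$(( attempt + 1 ))
    sleep "${RETRY_DELAY_SECONDS:-30}"
  done
}

# Example usage (not executed here):
#   retry 3 gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' \
#     cp my-file gs://my-bucket/my-path
```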
To verify that your files are uploaded, list the files:
gsutil ls gs://my-bucket/my-path
Important: Keep in mind that all of the input file restrictions described in the section Run DeepVariant apply.
- Reading from and writing to a different cloud project. Your project's default Compute Engine service account must have read access to the input files, and write access to the output bucket path. This is automatically the case if you run DeepVariant in the same cloud project. If the output bucket is in a different cloud project, you'll need to grant the service account permission to access the bucket.
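One way to grant that access is with `gsutil iam ch`. The sketch below prints the commands rather than running them. It assumes the default Compute Engine service account follows the usual PROJECT_NUMBER-compute@developer.gserviceaccount.com pattern, and that objectViewer on the input bucket and objectAdmin on the output bucket are sufficient. The role choice, the function name, and the example values are assumptions, so adjust them to your own access policy.

```shell
#!/usr/bin/env bash
# Sketch: print commands that grant the default Compute Engine service
# account read access to an input bucket and write access to an output bucket.
# Arguments: project number, input bucket name, output bucket name.
print_grant_cmds() {
  local project_number="$1" input_bucket="$2" output_bucket="$3"
  # Assumed default service account naming pattern.
  local account="${project_number}-compute@developer.gserviceaccount.com"
  echo "gsutil iam ch serviceAccount:${account}:objectViewer gs://${input_bucket}"
  echo "gsutil iam ch serviceAccount:${account}:objectAdmin gs://${output_bucket}"
}

# Hypothetical project number and bucket names; replace with your own.
print_grant_cmds "123456789012" "my-input-bucket" "my-bucket"
```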
For general questions about setting up accounts, billing, authentication, transferring files, and calling APIs:
For questions about DeepVariant: