Running Verily DeepVariant

Verily Life Sciences and Google are working together so that you can run the DeepVariant germline variant calling algorithm on human whole genome sequencing data on Google Genomics, powered by Google Cloud Platform.

DeepVariant uses a deep convolutional neural network to call genetic variants in aligned next-generation sequencing read data. The data preparation pipeline aligns the input FASTQs using bwa (v0.7.12), marks duplicate reads using Picard MarkDuplicates (v2.1.0), and performs local haplotype alignment using Verily's local assembler. Candidate variant loci are identified from the reassembled reads, and a deep convolutional neural network is used both to call genotypes and to filter variants. DeepVariant won the PrecisionFDA Truth Challenge award for Highest SNP Performance. Verily has a publication in preparation that describes the approach in more detail.

By running DeepVariant on Google Genomics, you will incur Google Cloud Platform charges.

To run the pipeline:

  1. Request access
  2. Get ready
  3. Run DeepVariant

Request access

This is an alpha release by invitation only.

Request access to run DeepVariant.

Only a limited number of testers will be granted access to DeepVariant at this time. We eventually intend to release an open-source version that is available to everyone.

Get ready

Prerequisites

  1. Enable the Cloud Storage and Compute Engine APIs on a new or existing Google Cloud Project using the Cloud Console.
  2. Follow the Google Genomics getting started instructions to install the Google Cloud SDK, which includes the gcloud command-line tool and gsutil, on your laptop or workstation.
  3. To install the bio commands, run:

    gcloud components install alpha

  4. Increase your quota for CPUs in us-central1 to at least 50 cores (each BWA alignment requires 32 cores).

Create a bucket for your DeepVariant results

To run DeepVariant on Google Genomics, you'll need a Cloud Storage bucket where output results and logs will be stored.

To create a bucket, use the storage browser or run the command-line utility gsutil, included in the Cloud SDK:

gsutil mb gs://my-bucket

Change my-bucket to a unique name that follows the bucket-naming conventions. By default the bucket will be in the US. You can change the location settings using the -l command-line option.
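
As a sketch of the step above, the following helper checks a candidate name against a subset of the bucket-naming conventions and prints the matching gsutil mb command with an explicit location. The function name make_bucket_cmd is hypothetical, and the check covers only the length and character rules, so a name that passes may still be rejected by Cloud Storage.

```shell
#!/bin/sh
# Hypothetical helper: validate a bucket name, then print the gsutil command.
# Only a subset of the naming rules is checked: 3-63 characters; lowercase
# letters, digits, dashes, underscores, and dots; first and last character
# must be a letter or digit.
# The body runs in a subshell, so the LC_ALL setting does not leak out.
make_bucket_cmd() (
  name=$1
  location=${2:-US}   # US is the default location described above
  LC_ALL=C            # force ASCII ranges so a-z excludes uppercase letters
  len=${#name}
  if [ "$len" -lt 3 ] || [ "$len" -gt 63 ]; then
    echo "bucket name must be 3-63 characters: $name" >&2
    exit 1
  fi
  case $name in
    *[!a-z0-9._-]*)
      echo "invalid character in bucket name: $name" >&2
      exit 1 ;;
    [!a-z0-9]*|*[!a-z0-9])
      echo "bucket name must start and end with a letter or digit: $name" >&2
      exit 1 ;;
  esac
  # Printed rather than executed, so the command can be reviewed first.
  echo "gsutil mb -l $location gs://$name"
)
```

For example, make_bucket_cmd my-bucket us-central1 prints gsutil mb -l us-central1 gs://my-bucket, which you can then run yourself.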

Run DeepVariant

The DeepVariant alpha service supports whole genome sequencing at high coverage (20-60x) from Illumina PCR-free sample preparation. Input is paired-end FASTQ files, from a single library and without barcodes. The DeepVariant service will align reads and call variants against build 38 of the human reference genome. The service typically processes each sample within about 48 hours.

Although the DeepVariant algorithm can support many other modes, including different genome assemblies and organisms, various sequencing technologies, and other sample preparation approaches, the alpha service only supports high coverage Illumina sequencing, analyzed with build 38 of the human reference.

For this example, we'll use public data in Google Cloud Storage from NA12878.

If you are a whitelisted alpha tester, you can run DeepVariant from your terminal:

gcloud alpha bio pipelines run deepvariant-alpha1 \
  --logging gs://my-bucket/my-path/logging \
  --sample-name NA12878 \
  --input-pair gs://genomics-public-data/precision-fda/input/HG001-NA12878-pFDA_S1_L001_R1_001.fastq.gz,gs://genomics-public-data/precision-fda/input/HG001-NA12878-pFDA_S1_L001_R2_001.fastq.gz \
  --output-path=gs://my-bucket/my-path/
Running: [operations/operation-id]

Set my-bucket and my-path to specify the desired output location for your DeepVariant VCF and gVCF files and logs.

The output is an operation-id that you can use to monitor the status or cancel the job.

When the job completes, you can find your output in the output-path you specified. A copy of the expected output VCF files can be found in gs://genomics-public-data/precision-fda-deepvariant/vcf.

To run DeepVariant on your own Illumina human WGS data in Cloud Storage, update the input file paths to reference your own FASTQ files.
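
If you run more than one sample, it can help to build the command from variables. The following is a minimal sketch, not part of the service: the function name deepvariant_cmd is hypothetical, it uses only the flags shown in the example above, and it assumes logs should go under a logging directory inside the output path.

```shell
#!/bin/sh
# Hypothetical wrapper: print the DeepVariant run command for one sample.
# Arguments: sample name, R1 FASTQ path, R2 FASTQ path, output path.
deepvariant_cmd() {
  sample=$1
  r1=$2
  r2=$3
  out=$4
  # All paths are expected to be Cloud Storage locations.
  for path in "$r1" "$r2" "$out"; do
    case $path in
      gs://*) ;;
      *) echo "not a gs:// path: $path" >&2; return 1 ;;
    esac
  done
  # Make sure the output path ends with a slash.
  case $out in
    */) ;;
    *) out=$out/ ;;
  esac
  # Printed rather than executed, so the command can be reviewed first.
  printf 'gcloud alpha bio pipelines run deepvariant-alpha1 --logging %slogging --sample-name %s --input-pair %s,%s --output-path=%s\n' \
    "$out" "$sample" "$r1" "$r2" "$out"
}
```

Call it with your own sample name and FASTQ paths, review the printed command, and then run it in your terminal.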

Monitor the operation

To monitor the status, run:

gcloud alpha bio operations describe operation-id \
    --format='yaml(done, error, metadata.events)'

Set the operation-id to the value returned above.
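
Because a run can take many hours, you may prefer to poll rather than check by hand. The following is a sketch under stated assumptions: the function name wait_for_operation is hypothetical, and it assumes that asking gcloud for only the done field with --format='value(done)' prints some capitalization of "true" once the operation has finished. The default of 600 polls at 300 seconds covers roughly 50 hours.

```shell
#!/bin/sh
# Hypothetical helper: poll an operation until its "done" field is true.
# Arguments: operation id, seconds between polls (default 300),
# maximum number of polls (default 600).
wait_for_operation() {
  op=$1
  interval=${2:-300}
  max_polls=${3:-600}
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    # Ask gcloud for the "done" field only; lowercase it before comparing
    # (assumption: the field prints as True or true when finished).
    done_flag=$(gcloud alpha bio operations describe "$op" \
      --format='value(done)' | tr '[:upper:]' '[:lower:]')
    if [ "$done_flag" = "true" ]; then
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "operation $op not finished after $max_polls polls" >&2
  return 1
}
```

A finished operation is not necessarily a successful one, so after the function returns, run the describe command above once more and inspect the error field.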

To cancel the operation, run:

gcloud alpha bio operations cancel operation-id

Verify the results

When all of the steps complete, you can view the results by listing your output directory:

gsutil ls gs://my-bucket/my-path/outputs/

That’s it! You have now run Verily's DeepVariant caller on Google Genomics.

Run DeepVariant on your own data

Before you scale up your processing to run multiple workflows concurrently, confirm quotas for your cloud resources.

To run DeepVariant on your own FASTQ files, transfer your data to a Google Cloud Storage bucket using gsutil, which you installed as part of the Google Cloud SDK:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp my-file gs://my-bucket/my-path

Change my-file to the name of the file you want to copy, and my-bucket and my-path to the bucket and path where you’d like your files to go.

(If you have files in a different format, like aligned or unaligned BAM, you must convert them first. You can convert them locally before uploading, or if you'd like to convert them in the cloud, the Pipelines API can help.)

Re-run this command as needed if you encounter failures due to temporary network issues. (The parallel_composite_upload_threshold setting helps with very large files, such as 100 GB BAM files.) For best practices when uploading large files to Google Cloud Storage, see using Google Cloud Storage with Big Data.
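
Re-running the upload can also be scripted. The following sketch wraps the gsutil command shown above in a simple retry loop. The function name upload_with_retry and the default of 5 attempts with a 30-second pause are illustrative assumptions, not recommended values.

```shell
#!/bin/sh
# Hypothetical helper: retry the upload when gsutil exits with an error.
# Arguments: source file, destination path, maximum attempts (default 5),
# seconds to wait between attempts (default 30).
upload_with_retry() {
  src=$1
  dest=$2
  max_attempts=${3:-5}
  delay=${4:-30}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Same command as in the text above, with the paths as arguments.
    if gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' \
        cp "$src" "$dest"; then
      return 0
    fi
    echo "upload attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
    # Pause only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, upload_with_retry my-file gs://my-bucket/my-path tries the copy up to five times and returns a non-zero status if every attempt fails.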

Note that gsutil always verifies checksums automatically, so when the transfer succeeds, your data should be ready for DeepVariant.

To verify that your files are uploaded, list the files:

gsutil ls gs://my-bucket/my-path

Important: Keep in mind that all of the input file restrictions described in the section Run DeepVariant apply.

Troubleshooting

Common issues:

  • Reading from and writing to a different cloud project. Your project's default Compute Engine service account must have read access to the input files, and write access to the output bucket path. This is automatically the case if you run DeepVariant in the same cloud project. If the output bucket is in a different cloud project, you'll need to grant the service account permission to access the bucket.
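
As a sketch of the cross-project case above, the following helper prints gsutil iam ch commands that grant the default Compute Engine service account access to an input bucket and an output bucket. The function name grant_cmds is hypothetical, and the choice of the objectViewer and objectAdmin roles is an assumption; your project may call for different roles. The default Compute Engine service account is named after your project number, in the form PROJECT_NUMBER-compute@developer.gserviceaccount.com.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that grant the default Compute
# Engine service account read access to the input bucket and write access
# to the output bucket.
# Arguments: project number, input bucket name, output bucket name.
grant_cmds() {
  project_number=$1
  input_bucket=$2
  output_bucket=$3
  sa="${project_number}-compute@developer.gserviceaccount.com"
  # Role choice is an assumption: read-only on inputs, object admin on outputs.
  echo "gsutil iam ch serviceAccount:${sa}:objectViewer gs://${input_bucket}"
  echo "gsutil iam ch serviceAccount:${sa}:objectAdmin gs://${output_bucket}"
}
```

Run the printed commands as a user who is allowed to change permissions on each bucket.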

Support

For general questions about setting up accounts, billing, authentication, transferring files, and calling APIs:

For questions about DeepVariant: