Running a Sentieon DNAseq Pipeline

This page explains how to run Sentieon DNAseq as a Google Cloud Platform (GCP) pipeline. The pipeline matches the results of the GATK Best Practices version 3.7 pipeline: alignment, sorting, duplicate removal, BQSR, and variant calling. It accepts either fastq files or aligned, sorted BAM files as input.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline on GCP using Sentieon DNAseq
  • Write configuration files for different Sentieon DNAseq use cases

Costs

This tutorial uses billable components of GCP, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Install Python 2.7+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  3. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  4. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  5. Enable the Cloud Genomics, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  6. Install and initialize the Cloud SDK.
  7. Update and install gcloud components:
    gcloud components update &&
    gcloud components install alpha
  8. Install git to download the required files.

    Download git

  9. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.

    For best results in this tutorial, request additional quota above your project's default. The following list gives the recommended quota increases, which are also the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:

    • CPUs: 64
    • Persistent Disk Standard (GB): 375

    You can leave other quota request fields empty to keep your current quotas.
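
    To check your current limits and usage in the us-central1 region before filing a request, you can describe the region with the Cloud SDK. The output includes a quotas list with entries such as CPUS and DISKS_TOTAL_GB (the metric names shown in your project's output may vary):

    gcloud compute regions describe us-central1 --project PROJECT_ID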

Get an evaluation license

Email support@sentieon.com with your GCP project ID to request a GCP evaluation license. You can find your project ID and number in the Google Cloud Platform Console Dashboard.

Set up your local environment and install prerequisites

  1. If you don't have virtualenv, install it using pip:

    pip install virtualenv
    
  2. Create an isolated Python environment and install dependencies:

    virtualenv env
    source env/bin/activate
    pip install --upgrade \
        pyyaml \
        google-api-python-client \
        google-auth \
        google-cloud-storage \
        google-auth-httplib2
    

Download the pipeline script

Download the example files and set your current directory:

git clone https://github.com/sentieon/sentieon-google-genomics.git
cd sentieon-google-genomics

Understanding the input format

The pipeline uses parameters specified in a JSON file as its input.

In the repository you downloaded, there is an examples/example.json file with the following content:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID"
}

The following table describes the JSON keys in the file:

JSON key Description
FQ1 The fastq file containing the first reads of each read pair.
FQ2 The fastq file containing the second reads of each read pair.
BAM The input BAM file, if applicable.
REF The reference genome. If set, the fastq/BAM index files are assumed to exist.
OUTPUT_BUCKET The bucket and directory used to store the data output from the pipeline.
ZONES A comma-separated list of GCP zones to use for the worker node.
PROJECT_ID Your GCP project ID.
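
Before editing this file, you can optionally confirm that the example inputs are readable with gsutil (these are the same Cloud Storage paths used in examples/example.json):

gsutil ls \
    gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz \
    gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz \
    gs://sentieon-test/pipeline_test/reference/hs37d5.fa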

Run the pipeline

  1. In the sentieon-google-genomics directory, edit the examples/example.json file, substituting the BUCKET and PROJECT_ID variables with the relevant resources from your GCP project:

    {
      "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
      "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
      "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
      "OUTPUT_BUCKET": "gs://BUCKET",
      "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
      "PROJECT_ID": "PROJECT_ID"
    }
    
  2. Run the following command to execute the DNAseq pipeline on the small test dataset referenced in the configuration file. By default, the script verifies that the input files exist in Cloud Storage before starting the pipeline.

    python runner/sentieon_runner.py examples/example.json
    

If you specified more than one preemptible try, the pipeline restarts automatically whenever its instances are preempted. After the pipeline finishes, it prints a message to the console stating whether it succeeded or failed.
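
When a run succeeds, the results are written under the OUTPUT_BUCKET path from your configuration file. One way to inspect them is a recursive listing; BUCKET below is the same placeholder used in the configuration, and the exact file layout depends on which outputs you enabled:

gsutil ls -r gs://BUCKET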

For most situations, the following configuration optimizes both turnaround time and cost. With it, a 30x human whole genome costs roughly $1.25 and takes about 2 hours, and a human whole exome costs roughly $0.35 and takes about 45 minutes. Both estimates assume that the pipeline's instances are not preempted.

{
  "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "PREEMPTIBLE_TRIES": "2",
  "NONPREEMPTIBLE_TRY": true,
  "STREAM_INPUT": "True",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID"
}
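
To use this configuration, save it to a file such as my_sample.json (the filename is arbitrary), replace the FQ1, FQ2, BUCKET, and PROJECT_ID values with your own resources, and pass the file to the same runner script used above:

python runner/sentieon_runner.py my_sample.json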

Additional options

You can customize a pipeline using the following additional options.

Input file options

The pipeline supports multiple comma-separated fastq files as input:

"FQ1": "gs://my-bucket/s1_prep1_1.fastq.gz,gs://my-bucket/s1_prep2_1.fastq.gz",
"FQ2": "gs://my-bucket/s1_prep1_2.fastq.gz,gs://my-bucket/s1_prep2_2.fastq.gz",

The pipeline also accepts comma-separated BAM files as input through the BAM JSON key. Reads in the BAM files are not aligned to the reference genome; instead, processing starts at the deduplication stage of the pipeline. A complete example configuration follows the snippet below.

"BAM": "gs://my-bucket/s1_prep1.bam,gs://my-bucket/s1_prep2.bam"

Whole-exome data or large datasets configuration

The settings in the recommended configuration are optimized for human whole-genome samples sequenced to an average coverage of 30x. For datasets that are much smaller or larger than a standard whole genome, you can decrease or increase the resources available to the instance accordingly. For best results with large datasets, use the following settings; a sketch for smaller whole-exome inputs follows the table below:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "DISK_SIZE": 600,
  "MIN_CPU": 64,
  "MIN_RAM_GB": 100
}

The following table provides a description of the settings used:

JSON key Description
DISK_SIZE SSD space (in GB) available to the worker node.
MIN_CPU Minimum number of CPU cores.
MIN_RAM_GB Minimum RAM (in GB).
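
For smaller inputs such as a whole exome, you can instead scale the resources down, as noted above. The values below are illustrative assumptions rather than tested recommendations; adjust them based on the size of your data:

{
  "FQ1": "gs://my-bucket/exome_sample_1.fastq.gz",
  "FQ2": "gs://my-bucket/exome_sample_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "DISK_SIZE": 300,
  "MIN_CPU": 16,
  "MIN_RAM_GB": 32
}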

Preemptible instances

You can use preemptible instances in your pipeline by setting the PREEMPTIBLE_TRIES JSON key.

By default, the runner falls back to a standard instance if the preemptible tries are exhausted or if the PREEMPTIBLE_TRIES key is unset or set to 0. You can turn off this fallback by setting the NONPREEMPTIBLE_TRY key to false.

"PREEMPTIBLE_TRIES": 2,
"NONPREEMPTIBLE_TRY": false

The following table provides a description of the settings used:

JSON key Description
PREEMPTIBLE_TRIES The number of times to attempt the pipeline when using preemptible instances.
NONPREEMPTIBLE_TRY Determines whether to try running the pipeline with a standard instance after preemptible tries are exhausted.

Read groups

Read groups are added when fastq files are aligned to the reference genome using Sentieon BWA. You can supply multiple comma-separated read groups. The number of read groups must match the number of input fastq files. The default read group is @RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA. To change the read group, set the READGROUP key in the JSON input file:

"READGROUP": "@RG\\tID:my-rgid-1\\tSM:my-sm\\tPL:ILLUMINA,@RG\\tID:my-rgid-2\\tSM:my-sm\\tPL:ILLUMINA"

The following table provides a description of the setting used:

JSON key Description
READGROUP A read group containing sample metadata.

For more information on read groups, see the Broad Institute documentation.

Streaming input from Cloud Storage

Streaming the input fastq files directly from Cloud Storage can reduce the total runtime of the pipeline. To enable streaming, set the STREAM_INPUT JSON key to True:

"STREAM_INPUT": "True"

The following table provides a description of the setting used:

JSON key Description
STREAM_INPUT Determines whether to stream input fastq files directly from Cloud Storage.

Duplicate marking

By default, the pipeline removes duplicate reads from BAM files. You can change this behavior by setting the DEDUP JSON key:

"DEDUP": "markdup"

The following table provides a description of the setting used:

JSON key Description
DEDUP Duplicate marking behavior. Valid values:
  • unset (default): reads marked as duplicate are removed.
  • markdup: duplicates are marked but not removed.
  • nodup: duplicate marking is skipped.

Base quality score recalibration (BQSR) and known sites

BQSR requires known sites of genetic variation. By default, this stage of the pipeline is skipped. To enable BQSR, supply known sites with the BQSR_SITES JSON key. If supplied, a DBSNP file can be used to annotate the output variants during variant calling.

"BQSR_SITES": "gs://my-bucket/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://my-bucket/reference/1000G_phase1.indels.b37.vcf.gz,gs://my-bucket/reference/dbsnp_138.b37.vcf.gz",
"DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz"

The following table provides a description of the settings used:

JSON key Description
BQSR_SITES Turns on BQSR and uses a comma-separated list of supplied files as known sites.
DBSNP A dbSNP file used during variant calling.

Intervals

For some applications, such as targeted or whole-exome sequencing, you might be interested in only a portion of the genome. In those cases, supplying a file of target intervals can speed up processing and reduce low-quality off-target variant calls. You can specify intervals with the INTERVAL_FILE and INTERVAL JSON keys.

"INTERVAL_FILE": "gs://my-bucket/capture-targets.bed",
"INTERVAL": "9:80331190-80646365"

The following table provides a description of the settings used:

JSON key Description
INTERVAL_FILE A file containing genomic intervals to process.
INTERVAL A string containing a genomic interval to process.

Output options

By default, the pipeline produces a preprocessed BAM file, quality control metrics, and variant calls. You can disable any of these outputs with the NO_BAM_OUTPUT, NO_METRICS, and NO_HAPLOTYPER JSON keys. If NO_HAPLOTYPER is not supplied (so variant calling runs), you can set the GVCF_OUTPUT JSON key to produce variant calls in gVCF format rather than VCF format.

"NO_BAM_OUTPUT": "true",
"NO_METRICS": "true",
"NO_HAPLOTYPER": "true",
"GVCF_OUTPUT": "true",

The following table provides a description of the settings used:

JSON key Description
NO_BAM_OUTPUT Determines whether to output a preprocessed BAM file.
NO_METRICS Determines whether to output file metrics.
NO_HAPLOTYPER Determines whether to output variant calls.
GVCF_OUTPUT Determines whether to output variant calls in gVCF format.
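
For example, to keep only the variant calls and produce them in gVCF format, you could combine the keys as follows (a sketch; NO_HAPLOTYPER is deliberately left unset so that variant calling still runs):

"NO_BAM_OUTPUT": "true",
"NO_METRICS": "true",
"GVCF_OUTPUT": "true"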

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources you created on Google Cloud Platform after you've finished. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
