Run Sentieon® DNASeq®

This page explains how to run Sentieon® DNASeq® as a Google Cloud pipeline for secondary genomic analysis. The pipeline matches the results of the Genome Analysis Toolkit (GATK) Best Practices version 3.7 and includes the following stages:

  • Alignment
  • Sorting
  • Duplicate removal
  • Base quality score recalibration (BQSR)
  • Variant calling

Input formats include the following:

  • fastq files
  • Aligned and sorted BAM files

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline on Google Cloud using Sentieon® DNASeq®
  • Write configuration files for different Sentieon® DNASeq® use cases

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Install Python 2.7+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  3. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  4. Make sure that billing is enabled for your Google Cloud project.

  5. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  6. Install the Google Cloud CLI.
  7. To initialize the gcloud CLI, run the following command:

    gcloud init
  8. Update and install gcloud components:

    gcloud components update
    gcloud components install beta
  9. Install git to download the required files.

    Download git

  10. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.

    For best results in this tutorial, you should request additional quota above your project's default. Recommendations for quota increases are provided in the following list alongside the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:

    • CPUs: 64
    • Persistent Disk Standard (GB): 375

    You can leave other quota request fields empty to keep your current quotas.

Sentieon® evaluation license

When using this pipeline, Sentieon® automatically grants you a free two-week evaluation license of its software for use with Google Cloud. To receive the license, enter your email address in the EMAIL field when configuring the pipeline. See Understand the input format for information on setting this field.

To continue using Sentieon® after the evaluation license expires, contact support@sentieon.com.

Set up your local environment and install prerequisites

  1. If you don't have virtualenv, run the following command to install it using pip:

    pip install virtualenv
  2. Run the following command to create an isolated Python environment and install dependencies:

    virtualenv env
    source env/bin/activate
    pip install --upgrade \
        pyyaml \
        google-api-python-client \
        google-auth \
        google-cloud-storage \
        google-auth-httplib2

Download the pipeline script

Run the following command to download the example files and change into the repository directory:

git clone https://github.com/sentieon/sentieon-google-genomics.git
cd sentieon-google-genomics

Understand the input format

The pipeline uses parameters specified in a JSON file as its input.

In the repository you downloaded, there is an examples/example.json file with the following content:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID"
  "REQUESTER_PROJECT": "PROJECT_ID",
  "EMAIL": "YOUR_EMAIL_HERE"
}

The following list describes the JSON keys in the file:

  • FQ1: The first pair of reads in the input fastq file.
  • FQ2: The second pair of reads in the input fastq file.
  • BAM: The input BAM file, if applicable.
  • REF: The reference genome. If set, the fastq/BAM index files are assumed to exist.
  • OUTPUT_BUCKET: The bucket and directory used to store the data output from the pipeline.
  • ZONES: A comma-separated list of Google Cloud zones to use for the worker node.
  • PROJECT_ID: Your Google Cloud project ID.
  • REQUESTER_PROJECT: A project to bill when transferring data from Requester Pays buckets.
  • EMAIL: Your email address.
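
For example, a configuration that starts from an aligned and sorted BAM file instead of fastq files replaces the FQ1 and FQ2 keys with the BAM key, as described in Input file options later in this page. The following is a minimal sketch; the gs://my-bucket path is a hypothetical placeholder:

{
  "BAM": "gs://my-bucket/sample1.bam",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "REQUESTER_PROJECT": "PROJECT_ID",
  "EMAIL": "YOUR_EMAIL_HERE"
}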

Run the pipeline

  1. In the sentieon-google-genomics directory, edit the examples/example.json file, substituting the BUCKET, REQUESTER_PROJECT, EMAIL, and PROJECT_ID variables with the relevant resources from your Google Cloud project:

    {
      "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
      "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
      "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
      "OUTPUT_BUCKET": "gs://BUCKET",
      "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
      "PROJECT_ID": "PROJECT_ID",
      "REQUESTER_PROJECT": "PROJECT_ID",
      "EMAIL": "EMAIL_ADDRESS"
    }
  2. Set the PROJECT_ID variable in your environment:

    export PROJECT_ID=PROJECT_ID

  3. Run the following command to execute the DNASeq® pipeline on a small test dataset identified by the inputs in the configuration file. By default, the script verifies that the input files exist in your Cloud Storage bucket before starting the pipeline.

    python runner/sentieon_runner.py --requester_project $PROJECT_ID examples/example.json

If you specified multiple preemptible tries, the pipeline restarts whenever its instances are preempted. After the pipeline finishes, it outputs a message to the console that states whether the pipeline succeeded or failed.

For most situations, you can optimize turnaround time and cost using the following configuration. The configuration runs a 30x human genome at a cost of roughly $1.25 and takes about 2 hours. A human whole exome costs roughly $0.35 and takes about 45 minutes. Both of these estimates assume that the pipeline's instances are not preempted.

{
  "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "PREEMPTIBLE_TRIES": "2",
  "NONPREEMPTIBLE_TRY": true,
  "STREAM_INPUT": "True",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS"
}

Additional options

You can customize a pipeline using the following additional options.

Input file options

The pipeline supports multiple comma-separated fastq files as input, as the following configuration shows:

"FQ1": "gs://my-bucket/s1_prep1_1.fastq.gz,gs://my-bucket/s1_prep2_1.fastq.gz",
"FQ2": "gs://my-bucket/s1_prep1_2.fastq.gz,gs://my-bucket/s1_prep2_2.fastq.gz",

The pipeline also accepts comma-separated BAM files as input using the BAM JSON key. Reads in BAM files are not aligned to the reference genome; instead, they enter the pipeline at the duplicate marking stage. The following sample shows a configuration using two BAM files as input:

"BAM": "gs://my-bucket/s1_prep1.bam,gs://my-bucket/s1_prep2.bam"

Whole-exome data or large datasets configuration

The settings in the recommended configuration are optimized for human whole-genome samples sequenced to an average coverage of 30x. For files that are much smaller or larger than standard whole-genome datasets, you can increase or decrease the resources available to the instance. For best results with large datasets, use the following settings:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS",
  "DISK_SIZE": 600,
  "MACHINE_TYPE": "n1-highcpu-64",
  "CPU_PLATFORM": "Intel Broadwell"
}

The following list describes the settings used:

  • DISK_SIZE: SSD space (in GB) available to the worker node.
  • MACHINE_TYPE: The type of Compute Engine virtual machine to use. Defaults to n1-standard-1.
  • CPU_PLATFORM: The CPU platform to request. Must be a valid Compute Engine CPU platform name (such as "Intel Skylake").
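
Conversely, for inputs much smaller than a 30x whole genome, such as a whole exome, you can reduce cost by requesting fewer resources. The following fragment is an illustrative sketch rather than a tuned recommendation; the DISK_SIZE and MACHINE_TYPE values shown are assumptions that you should adapt to your data:

"DISK_SIZE": 300,
"MACHINE_TYPE": "n1-highcpu-32"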

Preemptible instances

You can use preemptible instances in your pipeline by setting the PREEMPTIBLE_TRIES JSON key.

By default, the runner falls back to running the pipeline on a standard instance if the preemptible tries are exhausted or if the PREEMPTIBLE_TRIES JSON key is set to 0. You can turn off this behavior by setting the NONPREEMPTIBLE_TRY key to false, as shown in the following configuration:

"PREEMPTIBLE_TRIES": 2,
"NONPREEMPTIBLE_TRY": false

The following list describes the settings used:

  • PREEMPTIBLE_TRIES: The number of times to attempt the pipeline when using preemptible instances.
  • NONPREEMPTIBLE_TRY: Determines whether to try running the pipeline with a standard instance after the preemptible tries are exhausted.
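
To keep the default fallback to a standard instance while still using preemptible instances, set NONPREEMPTIBLE_TRY to true, as the recommended configuration earlier in this page does:

"PREEMPTIBLE_TRIES": 2,
"NONPREEMPTIBLE_TRY": true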

Read groups

Read groups are added when fastq files are aligned with a reference genome using Sentieon® BWA. You can supply multiple comma-separated read groups. The number of read groups must match the number of input fastq files. The default read group is @RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA. To change the read group, set the READGROUP key in the JSON input file, as shown in the following configuration:

"READGROUP": "@RG\\tID:my-rgid-1\\tSM:my-sm\\tPL:ILLUMINA,@RG\\tID:my-rgid-2\\tSM:my-sm\\tPL:ILLUMINA"

The following list describes the setting used:

  • READGROUP: A read group containing sample metadata.
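
Because the number of read groups must match the number of input fastq files, a run with two comma-separated fastq preps pairs the first read group with the first prep and the second read group with the second. The following sketch combines the earlier examples; the bucket paths are hypothetical placeholders:

"FQ1": "gs://my-bucket/s1_prep1_1.fastq.gz,gs://my-bucket/s1_prep2_1.fastq.gz",
"FQ2": "gs://my-bucket/s1_prep1_2.fastq.gz,gs://my-bucket/s1_prep2_2.fastq.gz",
"READGROUP": "@RG\\tID:my-rgid-1\\tSM:my-sm\\tPL:ILLUMINA,@RG\\tID:my-rgid-2\\tSM:my-sm\\tPL:ILLUMINA"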

For more information on read groups, see Read groups.

Streaming input from Cloud Storage

You can stream input fastq files from Cloud Storage, which can reduce the total runtime of the pipeline. To stream input fastq files from Cloud Storage, set the STREAM_INPUT JSON key to True:

"STREAM_INPUT": "True"

The following list describes the setting used:

  • STREAM_INPUT: Determines whether to stream input fastq files directly from Cloud Storage.

Duplicate marking

By default, the pipeline removes duplicate reads from BAM files. You can change this behavior by setting the DEDUP JSON key, as shown in the following configuration:

"DEDUP": "markdup"

The following list describes the setting used:

  • DEDUP: Duplicate marking behavior. Valid values:
      • The default configuration removes reads marked as duplicate.
      • markdup marks duplicates but does not remove them.
      • nodup skips duplicate marking.

Base quality score recalibration (BQSR) and known sites

BQSR requires known sites of genetic variation. The default behavior is to skip this stage of the pipeline. However, you can enable BQSR by supplying known sites with the BQSR_SITES JSON key. If supplied, a DBSNP file can be used to annotate the output variants during variant calling.

"BQSR_SITES": "gs://my-bucket/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://my-bucket/reference/1000G_phase1.indels.b37.vcf.gz,gs://my-bucket/reference/dbsnp_138.b37.vcf.gz",
"DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz"

The following list describes the settings used:

  • BQSR_SITES: Turns on BQSR and uses a comma-separated list of supplied files as known sites.
  • DBSNP: A dbSNP file used during variant calling.

Intervals

For some applications, such as targeted or whole-exome sequencing, you might be interested in only a portion of the genome. In those cases, supplying a file of target intervals can speed up processing and reduce low-quality off-target variant calls. You can specify intervals with the INTERVAL_FILE and INTERVAL JSON keys.

"INTERVAL_FILE": "gs://my-bucket/capture-targets.bed",
"INTERVAL": "9:80331190-80646365"

The following list describes the settings used:

  • INTERVAL_FILE: A file containing genomic intervals to process.
  • INTERVAL: A string containing a genomic interval to process.

Output options

By default, the pipeline produces a preprocessed BAM file, quality control metrics, and variant calls. You can disable any of these outputs using the NO_BAM_OUTPUT, NO_METRICS, and NO_HAPLOTYPER JSON keys. If NO_HAPLOTYPER is not supplied or is NULL, you can use the GVCF_OUTPUT JSON key to produce variant calls in gVCF format rather than VCF format.

"NO_BAM_OUTPUT": "true",
"NO_METRICS": "true",
"NO_HAPLOTYPER": "true",
"GVCF_OUTPUT": "true",

The following list describes the settings used:

  • NO_BAM_OUTPUT: Determines whether to output a preprocessed BAM file.
  • NO_METRICS: Determines whether to output file metrics.
  • NO_HAPLOTYPER: Determines whether to output variant calls.
  • GVCF_OUTPUT: Determines whether to output variant calls in gVCF format.
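
For example, to skip the BAM and metrics outputs but keep variant calling enabled and emit a gVCF rather than a VCF, leave NO_HAPLOTYPER unset, as in the following sketch:

"NO_BAM_OUTPUT": "true",
"NO_METRICS": "true",
"GVCF_OUTPUT": "true"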

Sentieon® DNASeq® versions

You can use any recent version of the Sentieon® DNASeq® software package with the Cloud Life Sciences API by specifying the SENTIEON_VERSION JSON key, as shown in the following configuration:

"SENTIEON_VERSION": "201808.08"

The following versions are valid:

  • 201711.01
  • 201711.02
  • 201711.03
  • 201711.04
  • 201711.05
  • 201808
  • 201808.01
  • 201808.03
  • 201808.05
  • 201808.06
  • 201808.07
  • 201808.08

Clean up

After finishing the tutorial, you can clean up the resources you created on Google Cloud so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next