Running Sentieon DNAseq

This page explains how to run Sentieon DNAseq as a Google Cloud (GCP) pipeline. The pipeline matches the results of the GATK Best Practices version 3.7 pipeline: alignment, sorting, duplicate removal, base quality score recalibration (BQSR), and variant calling. It accepts either fastq files or aligned, sorted BAM files as input.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline on Google Cloud using Sentieon DNAseq
  • Write configuration files for different Sentieon DNAseq use cases

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Install Python 2.7+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Account.

    If you don't already have a Google Account, sign up for a new account.

  3. Select or create a GCP project.

    Go to the project selector page

  4. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  5. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  6. Install and initialize the Cloud SDK.
  7. Update and install gcloud components:
    gcloud components update &&
    gcloud components install beta
  8. Install git to download the required files.

    Download git

  9. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.

    For best results in this tutorial, request additional quota above your project's default. The following list provides the recommended quota increases, which are also the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region (a sketch for checking your current quotas follows this list):

    • CPUs: 64
    • Persistent Disk Standard (GB): 375

    You can leave other quota request fields empty to keep your current quotas.
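
If you want to check your current usage against these limits programmatically, the Compute Engine API reports per-region quotas. The following minimal Python sketch assumes application default credentials and the google-api-python-client package (installed later in this tutorial); replace the PROJECT_ID placeholder with your project ID:

from googleapiclient import discovery

PROJECT_ID = "PROJECT_ID"  # replace with your project ID

# Build a Compute Engine API client using application default credentials.
compute = discovery.build("compute", "v1")

# regions.get returns each quota's metric name, limit, and current usage.
region = compute.regions().get(project=PROJECT_ID, region="us-central1").execute()
for quota in region["quotas"]:
    if quota["metric"] in ("CPUS", "DISKS_TOTAL_GB"):
        print("{metric}: {usage} used of {limit}".format(**quota))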

Sentieon evaluation license

When using this pipeline, Sentieon automatically grants you a free two-week evaluation license for its software for use with Google Cloud. To receive the license, enter your email address in the EMAIL field when configuring the pipeline. See Understanding the input format for information on setting this field.

To continue using Sentieon after the evaluation license expires, contact support@sentieon.com.

Set up your local environment and install prerequisites

  1. If you don't have virtualenv, install it using pip:

    pip install virtualenv
    
  2. Create an isolated Python environment and install dependencies:

    virtualenv env
    source env/bin/activate
    pip install --upgrade \
        pyyaml \
        google-api-python-client \
        google-auth \
        google-cloud-storage \
        google-auth-httplib2
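
    To confirm that the dependencies installed correctly, you can try importing them from the virtual environment; a quick sanity check:

    # Each import corresponds to one of the packages installed above.
    import yaml                  # pyyaml
    import googleapiclient       # google-api-python-client
    import google.auth           # google-auth
    import google.cloud.storage  # google-cloud-storage

    print("All pipeline dependencies imported successfully")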
    

Download the pipeline script

Download the example files and set your current directory:

git clone -b v0.2.3 https://github.com/sentieon/sentieon-google-genomics.git
cd sentieon-google-genomics

Understanding the input format

The pipeline uses parameters specified in a JSON file as its input.

In the repository you downloaded, there is an examples/example.json file with the following content:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID"
  "EMAIL": "YOUR_EMAIL_HERE"
}

The following list describes the JSON keys in the file:

  • FQ1: The fastq file containing the first reads of the pairs.
  • FQ2: The fastq file containing the second reads of the pairs.
  • BAM: The input BAM file, if applicable.
  • REF: The reference genome. If set, the fastq/BAM index files are assumed to exist.
  • OUTPUT_BUCKET: The bucket and directory used to store the data output from the pipeline.
  • ZONES: A comma-separated list of Google Cloud zones to use for the worker node.
  • PROJECT_ID: Your Google Cloud project ID.
  • EMAIL: Your email address.
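
If you process many samples, you may prefer to generate these JSON files programmatically instead of editing them by hand. The following minimal Python sketch fills in the keys described above for one sample; the write_config helper and the fastq naming scheme are hypothetical, for illustration only:

import json

def write_config(sample, bucket, project_id, email, path):
    # Minimal DNAseq configuration for one sample (hypothetical file layout).
    config = {
        "FQ1": "gs://{0}/{1}_1.fastq.gz".format(bucket, sample),
        "FQ2": "gs://{0}/{1}_2.fastq.gz".format(bucket, sample),
        "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
        "OUTPUT_BUCKET": "gs://{0}/output/{1}".format(bucket, sample),
        "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
        "PROJECT_ID": project_id,
        "EMAIL": email,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

write_config("sample1", "my-bucket", "PROJECT_ID", "EMAIL_ADDRESS", "sample1.json")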

Run the pipeline

  1. In the sentieon-google-genomics directory, edit the examples/example.json file, substituting the BUCKET and PROJECT_ID variables with the relevant resources from your Google Cloud project:

    {
      "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
      "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
      "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
      "OUTPUT_BUCKET": "gs://BUCKET",
      "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
      "PROJECT_ID": "PROJECT_ID",
      "EMAIL": "EMAIL_ADDRESS"
    }
    
  2. Run the following command to execute the DNAseq pipeline on a small test dataset identified by the inputs in the configuration file. By default, the script verifies that the input files exist in your Cloud Storage bucket before starting the pipeline (a sketch of this kind of check appears after this step).

    python runner/sentieon_runner.py examples/example.json
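
    The runner performs this existence check itself; for illustration, the following minimal sketch shows such a check using the google-cloud-storage package installed earlier (the gcs_object_exists helper is illustrative, not part of the runner):

    from google.cloud import storage

    def gcs_object_exists(url):
        # Split "gs://bucket/object" into the bucket name and object name.
        bucket_name, _, blob_name = url[len("gs://"):].partition("/")
        client = storage.Client()
        # Blob.exists() issues a metadata request and returns False on 404.
        return storage.Blob(blob_name, client.bucket(bucket_name)).exists(client)

    print(gcs_object_exists("gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz"))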
    

If you specified multiple preemptible tries, the pipeline will restart whenever its instances are preempted. After the pipeline finishes, it outputs a message to the console that states whether the pipeline succeeded or failed.

For most situations, you can optimize turnaround time and cost using the following configuration. The configuration will run a 30x human genome at a cost of roughly $1.25 and will take about 2 hours. A human whole exome will cost roughly $0.35 and take about 45 minutes. Both of these estimates are based on the pipeline's instances not being preempted.

{
  "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "PREEMPTIBLE_TRIES": "2",
  "NONPREEMPTIBLE_TRY": true,
  "STREAM_INPUT": "True",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS"
}

Additional options

You can customize a pipeline using the following additional options.

Input file options

The pipeline supports multiple comma-separated fastq files as input:

"FQ1": "gs://my-bucket/s1_prep1_1.fastq.gz,gs://my-bucket/s1_prep2_1.fastq.gz",
"FQ2": "gs://my-bucket/s1_prep1_2.fastq.gz,gs://my-bucket/s1_prep2_2.fastq.gz",

It also accepts comma-separated BAM files as input using the BAM JSON key. Reads in the BAM files are not aligned to the reference genome; instead, they enter the pipeline at the duplicate marking stage.

"BAM": "gs://my-bucket/s1_prep1.bam,gs://my-bucket/s1_prep2.bam"

Whole-exome data or large datasets configuration

The settings in the recommended configuration are optimized for human whole-genome samples sequenced to an average coverage of 30x. For files that are much smaller or larger than standard whole-genome datasets, you can increase or decrease the resources available to the instance. For best results with large datasets, use the following settings:

{
  "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
  "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS",
  "DISK_SIZE": 600,
  "MACHINE_TYPE": "n1-highcpu-64",
  "CPU_PLATFORM": "Intel Broadwell"
}

The following list describes the settings used:

  • DISK_SIZE: SSD space (in GB) available to the worker node.
  • MACHINE_TYPE: The type of Compute Engine virtual machine to use. Defaults to n1-standard-1.
  • CPU_PLATFORM: The CPU platform to request. Must be a valid Compute Engine CPU platform name (such as "Intel Skylake").

Preemptible instances

You can use preemptible instances in your pipeline by setting the PREEMPTIBLE_TRIES JSON key.

By default, the runner falls back to a standard instance if the preemptible tries are exhausted or if the PREEMPTIBLE_TRIES JSON key is set to 0. You can turn off this behavior by setting the NONPREEMPTIBLE_TRY key to false.

"PREEMPTIBLE_TRIES": 2,
"NONPREEMPTIBLE_TRY": false

The following list describes the settings used:

  • PREEMPTIBLE_TRIES: The number of times to attempt the pipeline when using preemptible instances.
  • NONPREEMPTIBLE_TRY: Determines whether to try running the pipeline with a standard instance after the preemptible tries are exhausted.

Read groups

Read groups are added when fastq files are aligned with a reference genome using Sentieon BWA. You can supply multiple comma-separated read groups. The number of read groups must match the number of input fastq files. The default read group is @RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA. To change the read group, set the READGROUP key in the JSON input file:

"READGROUP": "@RG\\tID:my-rgid-1\\tSM:my-sm\\tPL:ILLUMINA,@RG\\tID:my-rgid-2\\tSM:my-sm\\tPL:ILLUMINA"

The following list describes the setting used:

  • READGROUP: A read group containing sample metadata.

For more information on read groups, see the Broad Institute documentation.
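
Because the number of read groups must match the number of input fastq files, you can generate the READGROUP value from the same list used to build FQ1. In the following sketch (prep names hypothetical), r"\t" keeps a literal backslash-t in each Python string, which json.dumps then encodes as the \\t shown above:

import json

preps = ["s1_prep1", "s1_prep2"]  # one read group per fastq pair
readgroups = ",".join(
    r"@RG\tID:{0}\tSM:my-sm\tPL:ILLUMINA".format(p) for p in preps
)
print(json.dumps({"READGROUP": readgroups}))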

Streaming input from Cloud Storage

You can stream input fastq files directly from Cloud Storage, which can reduce the total runtime of the pipeline. To enable streaming, set the STREAM_INPUT JSON key to True:

"STREAM_INPUT": "True"

The following list describes the setting used:

  • STREAM_INPUT: Determines whether to stream input fastq files directly from Cloud Storage.

Duplicate marking

By default, the pipeline removes duplicate reads from BAM files. You can change this behavior by setting the DEDUP JSON key:

"DEDUP": "markdup"

The following list describes the setting used:

  • DEDUP: Duplicate marking behavior. Valid values:
    • Unset (the default): removes reads marked as duplicates.
    • markdup: marks duplicates but does not remove them.
    • nodup: skips duplicate marking.

Base quality score recalibration (BQSR) and known sites

BQSR requires known sites of genetic variation. By default, the pipeline skips this stage. To enable BQSR, supply known sites with the BQSR_SITES JSON key. If supplied, a DBSNP file can be used to annotate the output variants during variant calling.

"BQSR_SITES": "gs://my-bucket/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://my-bucket/reference/1000G_phase1.indels.b37.vcf.gz,gs://my-bucket/reference/dbsnp_138.b37.vcf.gz",
"DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz"

The following list describes the settings used:

  • BQSR_SITES: Turns on BQSR and uses a comma-separated list of supplied files as known sites.
  • DBSNP: A dbSNP file used during variant calling.

Intervals

For some applications, such as targeted or whole-exome sequencing, you might be interested only in a portion of the genome. In those cases, supplying a file of target intervals can speed up processing and reduce low quality off-target variant calls. You can use intervals with the INTERVAL_FILE and INTERVAL JSON keys.

"INTERVAL_FILE": "gs://my-bucket/capture-targets.bed",
"INTERVAL": "9:80331190-80646365"

The following list describes the settings used:

  • INTERVAL_FILE: A file containing genomic intervals to process.
  • INTERVAL: A string containing a genomic interval to process.
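
If your targets are in a BED file but you want to pass a region as an INTERVAL string, note that BED coordinates are 0-based and half-open, while interval strings such as 9:80331190-80646365 are 1-based and inclusive. The following conversion sketch uses an illustrative bed_to_intervals helper:

def bed_to_intervals(path):
    # Convert BED records (0-based, half-open) to 1-based, inclusive interval strings.
    intervals = []
    with open(path) as f:
        for line in f:
            if line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            chrom, start, end = fields[0], int(fields[1]), fields[2]
            intervals.append("{0}:{1}-{2}".format(chrom, start + 1, end))
    return intervals

# For example, use the first target region as the INTERVAL value:
# interval = bed_to_intervals("capture-targets.bed")[0]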

Output options

By default, the pipeline produces a preprocessed BAM file, quality control metrics, and variant calls. You can disable any of these outputs using the NO_BAM_OUTPUT, NO_METRICS, and NO_HAPLOTYPER JSON keys. If NO_HAPLOTYPER is unset or null, you can use the GVCF_OUTPUT JSON key to produce variant calls in gVCF format rather than VCF format.

"NO_BAM_OUTPUT": "true",
"NO_METRICS": "true",
"NO_HAPLOTYPER": "true",
"GVCF_OUTPUT": "true",

The following list describes the settings used:

  • NO_BAM_OUTPUT: Determines whether to output a preprocessed BAM file.
  • NO_METRICS: Determines whether to output file metrics.
  • NO_HAPLOTYPER: Determines whether to output variant calls.
  • GVCF_OUTPUT: Determines whether to output variant calls in gVCF format.

Sentieon DNAseq versions

You can use any recent version of the Sentieon DNAseq software package with the Cloud Life Sciences API by specifying the SENTIEON_VERSION JSON key, like so:

"SENTIEON_VERSION": "201808.08"

The following versions are valid:

  • 201711.01
  • 201711.02
  • 201711.03
  • 201711.04
  • 201711.05
  • 201808
  • 201808.01
  • 201808.03
  • 201808.05
  • 201808.06
  • 201808.07
  • 201808.08

Future updates will continue to work with the Cloud Life Sciences API.

Cleaning up

After you've finished the tutorial, you can clean up the resources you created on Google Cloud so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • Cloud Life Sciences Documentation