Running a dsub Pipeline

dsub is a command-line tool for defining and running batch jobs in the cloud. It is an open source alternative to running pipelines through the Pipelines API with the gcloud tool.

Objectives

After completing this tutorial, you'll know how to:

  • Run a dsub pipeline on Google Cloud Platform that creates an index (BAI file) from a large binary file of DNA sequences (BAM file)

Costs

This tutorial uses billable components of GCP, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Install Python 2.7+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
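
    To confirm that a suitable interpreter and pip are available, you can check their versions from a shell (exact output varies by system):

    python --version
    pip --version
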
  2. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  3. Select or create a Google Cloud Platform project.

    Go to the Manage resources page
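
    If you prefer the command line and have the Cloud SDK installed, a new project can also be created with gcloud (PROJECT_ID below is a placeholder):

    gcloud projects create PROJECT_ID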

  4. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing
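
    As an alternative to the console, billing can typically be linked from the command line with the beta gcloud billing commands; BILLING_ACCOUNT_ID is a placeholder, and the command group may differ by SDK release:

    gcloud beta billing projects link PROJECT_ID --billing-account BILLING_ACCOUNT_ID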

  5. Enable the Cloud Genomics, Compute Engine, and Cloud Storage APIs.

    Enable the APIs
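
    The same APIs can also be enabled from the command line with gcloud; the service names below are the usual identifiers, but they may differ by release:

    gcloud services enable genomics.googleapis.com \
        compute.googleapis.com \
        storage.googleapis.com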

Create a BAI file

Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.

  1. Clone the GitHub repository googlegenomics/dsub, then change to the directory for the dsub tool. The indexing itself runs in a pre-built Docker image (specified later with the --image flag) that includes samtools.

    git clone https://github.com/googlegenomics/dsub.git
    cd dsub
    
  2. Install dsub and its dependencies:

    python setup.py install
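
    Alternatively, released versions of dsub are published to PyPI, so you can typically install them with pip instead of from source:

    pip install dsub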
    
  3. Run the dsub tool to create the BAI file, replacing PROJECT_ID with your GCP project ID and BUCKET with the name of a Cloud Storage bucket to which you have write access:

    dsub \
        --project PROJECT_ID \
        --zones "us-*" \
        --logging gs://BUCKET/logs \
        --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
        --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
        --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
        --command 'samtools index ${BAM} ${BAI}' \
        --wait
    

    The samtools command runs on the data file provided with the --input flag. The pipeline writes the output file and logs to your Cloud Storage bucket.
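
    Because --wait is set, dsub blocks until the job finishes. If you omit --wait, you can check the job's status with the dstat tool that is installed alongside dsub; the flags shown here are a minimal sketch and may vary by dsub release:

    dstat --project PROJECT_ID --status '*'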

    If you have multiple inputs, you can specify them using multiple --input flags. The inputs can be specified in any order. The following sample shows how to specify two inputs:

    ...
    --input INPUT_FILE_1=gs://PATH/TO/INPUT_FILE_1 \
    --input INPUT_FILE_2=gs://PATH/TO/INPUT_FILE_2 \
    ...
    
  4. Verify that the BAI file was generated:

    gsutil ls gs://BUCKET
    

    The command should return the following:

    gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
    

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources you created on Google Cloud Platform after you've finished. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
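
Alternatively, if you have the Cloud SDK installed, you can delete the project from the command line (PROJECT_ID is your project's ID); this shuts down the project just as the console flow does:

    gcloud projects delete PROJECT_ID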

What's next

Read the dsub documentation on GitHub for more details and examples of how to use dsub with genomics data.
