Running dsub

dsub is a command-line tool that you can use to run batch computing tasks and workflows on Google Cloud.


After completing this tutorial, you'll know how to:

  • Run a dsub pipeline on Google Cloud that creates an index (BAI file) from a large binary file of DNA sequences (BAM file)


This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Install Python 3.5+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  3. In the Cloud Console, on the project selector page, select or create a Cloud project.

  4. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  5. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
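
The Python requirement in step 1 can be checked from a shell before you install dsub. The following is a minimal sketch, not part of the dsub tooling: version_at_least is a hypothetical helper that compares only the major and minor version numbers, and the script assumes that the interpreter is available as python3.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Only the major and minor components are compared.
version_at_least() {
  have_major=${1%%.*}
  want_major=${2%%.*}
  case "$1" in
    *.*) have_minor=${1#*.}; have_minor=${have_minor%%.*} ;;
    *)   have_minor=0 ;;
  esac
  case "$2" in
    *.*) want_minor=${2#*.}; want_minor=${want_minor%%.*} ;;
    *)   want_minor=0 ;;
  esac
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}

# Assumption: the interpreter is installed under the name python3.
if command -v python3 >/dev/null 2>&1; then
  installed=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if version_at_least "$installed" 3.5; then
    echo "Python $installed meets the 3.5+ requirement"
  else
    echo "Python $installed is older than 3.5" >&2
  fi
else
  echo "python3 was not found on PATH" >&2
fi
```

If the check reports an older interpreter, install a newer Python before continuing with the tutorial.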

Create a BAI file

Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.

  1. Clone the GitHub repository databiosphere/dsub, then change to the directory for the dsub tool. The repository contains a pre-built Docker image that uses samtools to do the indexing.

    git clone https://github.com/databiosphere/dsub.git
    cd dsub
  2. Install dsub and its dependencies:

    python setup.py install
  3. Run the dsub tool to create the BAI file, replacing PROJECT_ID with your Google Cloud project ID, BUCKET with a Cloud Storage bucket to which you have write access, and IMAGE with the name of the Docker image that provides samtools:

    dsub \
        --provider google-cls-v2 \
        --project PROJECT_ID \
        --logging gs://BUCKET/logs \
        --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
        --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
        --image IMAGE \
        --command 'samtools index ${BAM} ${BAI}'

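    dsub copies the file given with the --input flag to the worker and sets BAM to its local path, so the samtools command reads the file through ${BAM}. When the command finishes, the pipeline copies the file written to ${BAI} to the --output location and writes the logs to the --logging path in your Cloud Storage bucket.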

    If you have multiple inputs, you can specify them using multiple --input flags. The inputs can be specified in any order. The following sample shows how to specify two inputs:

    --input INPUT_FILE_1=gs://PATH/TO/INPUT_FILE_1 \
    --input INPUT_FILE_2=gs://PATH/TO/INPUT_FILE_2 \
  4. Verify that the BAI file was generated:

    gsutil ls gs://BUCKET

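    The output should include the BAI file and the logs directory:

    gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
    gs://BUCKET/logs/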
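
In step 3, the output path is the input file name with .bai appended, and each additional input takes its own --input flag. As a rough sketch, both patterns can be generated in the shell instead of typed by hand. The helpers bai_uri_for and input_flags are hypothetical and are not part of dsub; the script only prints text and never calls dsub.

```shell
# Hypothetical helper: print the BAI destination for a BAM URI, that is, the
# BAM file name with ".bai" appended, at the top level of the given bucket.
bai_uri_for() {
  bucket=$1
  bam_uri=$2
  printf 'gs://%s/%s.bai\n' "$bucket" "${bam_uri##*/}"
}

# Hypothetical helper: print one "--input INPUT_FILE_N=URI" pair per URI.
# Assumes that the URIs contain no spaces.
input_flags() {
  n=0
  flags=""
  for uri in "$@"; do
    n=$((n + 1))
    flags="${flags}${flags:+ }--input INPUT_FILE_${n}=${uri}"
  done
  printf '%s\n' "$flags"
}

# BUCKET and the gs://PATH/TO/... URIs are the tutorial's placeholders.
BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
bai_uri_for BUCKET "$BAM"
input_flags gs://PATH/TO/INPUT_FILE_1 gs://PATH/TO/INPUT_FILE_2
```

The printed values can be pasted into the --output and --input positions of the dsub command. Inside --command, each input is then available under its variable name, for example ${INPUT_FILE_1}.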


Cleaning up

After you finish the tutorial, you can clean up the resources that you created on Google Cloud so that you aren't billed for them in the future. The following section describes how to delete these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

  2. In the project list, select the checkbox next to the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
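
The console steps have a command-line counterpart. The sketch below prints the relevant commands without running them, so that you can review them first. cleanup_commands is a hypothetical helper, and the two gsutil lines assume that you kept the output and log paths used earlier in this tutorial; they are an alternative for readers who want to keep the project and remove only the tutorial's files.

```shell
# Hypothetical helper: print (do not run) the cleanup commands for a given
# project ID and bucket name.
cleanup_commands() {
  project=$1
  bucket=$2
  # Option 1: keep the project and remove only the files the tutorial wrote.
  printf 'gsutil rm gs://%s/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai\n' "$bucket"
  printf 'gsutil -m rm -r gs://%s/logs\n' "$bucket"
  # Option 2: shut down the whole project.
  printf 'gcloud projects delete %s\n' "$project"
}

# PROJECT_ID and BUCKET are the tutorial's placeholders.
cleanup_commands PROJECT_ID BUCKET
```

Deleting the project removes everything in it, so run the last command only if the project holds nothing else that you need.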

What's next

Read the dsub documentation on GitHub for more details and examples of how to use dsub with genomics data.