Run dsub

With dsub, you write scripts and then run them as batch computing tasks and workflows on Google Cloud.

Objectives

After completing this tutorial, you'll know how to run a dsub pipeline on Google Cloud that creates an index (BAI file) from a large binary file of DNA sequences (BAM file).

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Install Python 3.6+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  3. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  4. Make sure that billing is enabled for your Google Cloud project.

  5. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs
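
To confirm that your interpreter meets the Python 3.6+ requirement from step 1, you can use a small shell check such as the following sketch. The function name python_version_ok is hypothetical and isn't part of dsub or the Google Cloud CLI; it only compares a "major.minor[.patch]" version string against 3.6.

```shell
# Hypothetical helper (not part of dsub): succeed if the given
# "major.minor[.patch]" version string is 3.6 or later.
python_version_ok() {
  major="${1%%.*}"     # text before the first dot
  rest="${1#*.}"       # text after the first dot
  minor="${rest%%.*}"  # text before the next dot, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
}

# Example (assumes python3 is on your PATH):
#   python_version_ok "$(python3 -c 'import platform; print(platform.python_version())')" \
#     && echo "Python is new enough for dsub"
```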

Create a BAI file

Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.

  1. Clone the databiosphere/dsub GitHub repository:

    git clone https://github.com/databiosphere/dsub.git
    
  2. Change to the directory for the dsub tool. The pipeline in this tutorial uses a pre-built Docker image that runs samtools to do the indexing.

    cd dsub
    
  3. Install dsub and its dependencies:

    sudo python3 setup.py install
    
  4. Run the dsub tool to create the BAI file. Replace PROJECT_ID with your Google Cloud project ID and BUCKET with the name of a Cloud Storage bucket to which you have write access:

    dsub \
        --provider google-cls-v2 \
        --project PROJECT_ID \
        --logging gs://BUCKET/logs \
        --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
        --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
        --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
        --command 'samtools index ${BAM} ${BAI}' \
        --wait
    

    The samtools command runs on the data file provided with the --input flag. The pipeline writes the output file and logs to your Cloud Storage bucket.

  5. Verify that the BAI file was generated:

    gsutil ls gs://BUCKET
    

    The output includes the following object:

    gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
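
The command in step 4 repeats the sample file name in both the --input and --output flags. To index other BAM files, you might derive the output path from the input, as in the following sketch. The function names bai_output_path, print_dsub_command, and listing_has_object are hypothetical helpers, not part of dsub or gsutil. The sketch only prints the dsub command so that you can review it; it doesn't run anything on Google Cloud.

```shell
# Hypothetical helpers for illustration; none of these names come from dsub.

# Derive the BAI object path in your bucket from a BAM URI.
#   $1: BAM URI, $2: bucket name
bai_output_path() {
  printf 'gs://%s/%s.bai\n' "$2" "${1##*/}"
}

# Print (do not run) the dsub command from step 4 for one BAM file.
# Assumes the arguments contain no spaces or shell metacharacters.
#   $1: project ID, $2: bucket name, $3: BAM URI
print_dsub_command() {
  cat <<EOF
dsub \\
    --provider google-cls-v2 \\
    --project $1 \\
    --logging gs://$2/logs \\
    --input BAM=$3 \\
    --output BAI=$(bai_output_path "$3" "$2") \\
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \\
    --command 'samtools index \${BAM} \${BAI}' \\
    --wait
EOF
}

# Succeed if the listing on stdin contains exactly the given object URI.
#   $1: object URI to look for
listing_has_object() {
  grep -qxF -e "$1"
}
```

For example, print_dsub_command PROJECT_ID BUCKET BAM_URI prints a command that you can inspect before you run it. After the job finishes, you can pipe the output of gsutil ls gs://BUCKET into listing_has_object "$(bai_output_path BAM_URI BUCKET)" to check for the index file, where BAM_URI is the Cloud Storage path of your input file.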
    

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Read the dsub documentation on GitHub for more details and examples of how to develop with dsub locally or use dsub to scale up to many tasks on Google Cloud.