Cloud Life Sciences is deprecated and will no longer be available on Google Cloud after July 8, 2025. Use cases for Cloud Life Sciences are now supported by Batch. To learn how to migrate your workload, see Migrate to Batch.

Run dsub

Use dsub to write scripts and then run batch computing tasks and workflows on Google Cloud.

Objectives

After completing this tutorial, you'll know how to run a dsub pipeline on Google Cloud that creates an index (BAI file) from a large binary file of DNA sequences (BAM file).

Costs

In this document, you use the following billable components of Google Cloud:

Compute Engine
Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Install Python 3.6+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
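Before you install dsub, it can help to confirm that the interpreter on your path meets the 3.6 minimum. The following is a minimal sketch and is not part of dsub: the function name `python_version_ok` is hypothetical, and it assumes a dotted version string such as `3.9.2`.

```shell
# Hypothetical helper: succeed if a dotted Python version string is 3.6 or later.
python_version_ok() {
  local version="$1"
  local major="${version%%.*}"   # text before the first dot
  local rest="${version#*.}"     # text after the first dot
  local minor="${rest%%.*}"      # second component
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
}

# Usage (requires python3 on your path):
#   python_version_ok "$(python3 -c 'import platform; print(platform.python_version())')" \
#     && echo "Python version is supported"
```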

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

Create a BAI file

Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.

Clone the databiosphere/dsub GitHub repository:

```
git clone https://github.com/databiosphere/dsub.git
```

Change to the directory for the dsub tool. The pipeline in this tutorial uses a pre-built Docker image that runs samtools to do the indexing.
```
cd dsub
```
Install dsub and its dependencies:
```
sudo python3 setup.py install
```

Run the dsub tool to create the BAI file, replacing PROJECT_ID with your Google Cloud project and BUCKET with a Cloud Storage bucket to which you have write access:

```
dsub \
    --provider google-cls-v2 \
    --project PROJECT_ID \
    --logging gs://BUCKET/logs \
    --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
    --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    --command 'samtools index ${BAM} ${BAI}' \
    --wait
```

The samtools command runs on the data file provided with the --input flag. The pipeline writes the output file and logs to your Cloud Storage bucket.
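In the command above, the output object name is the input file name with `.bai` appended. If you index more than one BAM file, you might derive that path instead of typing it. The following is a minimal sketch under that assumption; `bai_uri_for` is a hypothetical helper, not a dsub feature.

```shell
# Hypothetical helper: derive the BAI output URI from a BAM input URI
# and a destination bucket name (without the gs:// prefix).
bai_uri_for() {
  local bam_uri="$1"
  local bucket="$2"
  # Strip everything up to the last slash to get the file name.
  local bam_name="${bam_uri##*/}"
  printf 'gs://%s/%s.bai\n' "$bucket" "$bam_name"
}

# Usage:
#   bai_uri_for gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam BUCKET
```

You could pass the result to the `--output BAI=` flag in place of the hard-coded path.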

Verify that the BAI file was generated:

```
gcloud storage ls gs://BUCKET
```

The output includes the BAI file:

```
gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
```
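Because the pipeline also writes logs to the bucket, the listing can contain more than one entry. To check for the BAI file in a script rather than by eye, one option is to filter the listing for the exact object URI. This is a minimal sketch; `listing_contains` is a hypothetical helper that reads the listing from standard input.

```shell
# Hypothetical helper: succeed only if the exact object URI appears
# as a full line in the listing read from standard input.
listing_contains() {
  local expected_uri="$1"
  # -F: fixed string, -x: match the whole line, -q: print nothing.
  grep -Fxq -- "$expected_uri"
}

# Usage (requires gcloud and read access to the bucket):
#   gcloud storage ls gs://BUCKET \
#     | listing_contains "gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai" \
#     && echo "BAI file found"
```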

Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Read the dsub documentation on GitHub for more details and examples of how to develop with dsub locally or use dsub to scale up to many tasks on Google Cloud.