Use dsub to write scripts and then run batch computing tasks and workflows on Google Cloud.
Objectives
After completing this tutorial, you'll know how to run a dsub pipeline on Google Cloud that creates an index (BAI file) from a large binary file of DNA sequences (BAM file).
Costs
In this document, you use the following billable components of Google Cloud:
- Compute Engine
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
- Install Python 3.6+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
Create a BAI file
Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.
Clone the databiosphere/dsub GitHub repository:
git clone https://github.com/databiosphere/dsub.git
Change to the directory for the dsub tool. The repository contains a pre-built Docker image that uses samtools to do the indexing.
cd dsub
Install dsub and its dependencies:
sudo python3 setup.py install
Run the dsub tool to create the BAI file, replacing PROJECT_ID with your Google Cloud project and BUCKET with a Cloud Storage bucket to which you have write access:
dsub \
    --provider google-cls-v2 \
    --project PROJECT_ID \
    --logging gs://BUCKET/logs \
    --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
    --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    --command 'samtools index ${BAM} ${BAI}' \
    --wait
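As a minimal sketch, the same arguments can be generated from two variables, so that the project and bucket are each substituted in one place and the result can be reviewed before anything is submitted. The helper name `dsub_args` is hypothetical and is not part of dsub; it only prints the arguments from the command above, one per line, and does not call dsub.

```shell
# Sketch: print the tutorial's dsub arguments, one per line, for review.
# dsub_args is a hypothetical helper, not part of dsub.
# $1 is the Google Cloud project ID, $2 is the Cloud Storage bucket name.
dsub_args() (
  project="$1"
  bucket="$2"
  sample="HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam"
  # The --command value is single-quoted so that ${BAM} and ${BAI} stay
  # literal here and are expanded only inside the job's container.
  printf '%s\n' \
    --provider google-cls-v2 \
    --project "$project" \
    --logging "gs://${bucket}/logs" \
    --input "BAM=gs://genomics-public-data/1000-genomes/bam/${sample}" \
    --output "BAI=gs://${bucket}/${sample}.bai" \
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    --command 'samtools index ${BAM} ${BAI}' \
    --wait
)

# Dry run with the tutorial's placeholders: nothing is submitted.
dsub_args PROJECT_ID BUCKET
```

Printing the arguments instead of running them keeps the sketch safe to try without credentials. In bash 4 or later, one way to submit the job from the same helper would be `mapfile -t args < <(dsub_args my-project my-bucket)` followed by `dsub "${args[@]}"`, where `my-project` and `my-bucket` stand in for your own values.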
The samtools command runs on the data file provided with the --input flag. The pipeline writes the output file and logs to your Cloud Storage bucket.
Verify that the BAI file was generated:
gcloud storage ls gs://BUCKET
The command returns the following response:
gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
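The verification can also be scripted rather than checked by eye. The following sketch derives the expected index path from the BAM path used in this tutorial and tests whether a bucket listing contains it. The function names `bai_uri` and `listing_has_bai` are hypothetical, and the sketch assumes the index is written to the top level of the bucket, as in the dsub command above.

```shell
# Sketch: derive the expected BAI path and look for it in a bucket listing.
# bai_uri and listing_has_bai are hypothetical helper names.

BAM_URI="gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam"

# Print the Cloud Storage path of the index: the BAM file name plus ".bai",
# placed at the top level of the bucket. $1 is the bucket, $2 the BAM path.
bai_uri() {
  printf 'gs://%s/%s.bai\n' "$1" "${2##*/}"
}

# Read a listing on stdin (one path per line) and succeed only if one line
# matches the expected path in $1 exactly.
listing_has_bai() {
  grep -Fxq -- "$1"
}

# Usage (requires the gcloud CLI and read access to the bucket):
#   gcloud storage ls gs://BUCKET | listing_has_bai "$(bai_uri BUCKET "$BAM_URI")" \
#     && echo "index found" || echo "index missing"
```

Matching the whole line with `grep -Fx` avoids a false positive from a similarly named object, such as a log file whose name merely contains the BAM file name.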
Clean up
After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
Read the dsub documentation on GitHub for more details and examples of how to develop with dsub locally or use dsub to scale up to many tasks on Google Cloud.