This tutorial explains how to run a dsub pipeline on Batch. Specifically, the example dsub pipeline processes DNA sequencing data in a Binary Alignment Map (BAM) file to create a BAM index (BAI) file.
This tutorial is intended for Batch users who want to use dsub with Batch.

dsub is an open source job scheduler for orchestrating batch-processing workflows on Google Cloud. To learn more about how to use Batch with dsub, see the dsub documentation for Batch.
Objectives
- Run a dsub pipeline on Batch that reads and writes files in Cloud Storage buckets.
- View the output files in a Cloud Storage bucket.
Costs
In this document, you use the following billable components of Google Cloud:
- Batch
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator.
The resources created in this tutorial typically cost less than a dollar, assuming you complete all the steps, including the cleanup, in a timely manner.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Create or select a Google Cloud project.
  - Create a Google Cloud project:
    gcloud projects create PROJECT_ID
    Replace PROJECT_ID with a name for the Google Cloud project you are creating.
  - Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
    Replace PROJECT_ID with your Google Cloud project name.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Batch, Cloud Storage, Compute Engine, and Logging APIs:
  gcloud services enable batch.googleapis.com \
    compute.googleapis.com logging.googleapis.com storage.googleapis.com
- Make sure that your project has at least one service account with the permissions required for this tutorial.
Each job requires a service account that allows the Batch service agent to create and access the resources required to run the job. For this tutorial, the job's service account is the Compute Engine default service account.
To ensure that the Compute Engine default service account has the necessary permissions to allow the Batch service agent to create and access resources for Batch jobs, ask your administrator to grant the Compute Engine default service account the following IAM roles:
  - Batch Agent Reporter (roles/batch.agentReporter) on the project
  - Storage Admin (roles/storage.admin) on the project
  - (Recommended) Let jobs generate logs in Cloud Logging: Logs Writer (roles/logging.logWriter) on the project
  For more information about granting roles, see Manage access to projects, folders, and organizations. For an example of granting these roles with the gcloud CLI, see the sketch after this list.
Your administrator might also be able to give the Compute Engine default service account the required permissions through custom roles or other predefined roles.
- Make sure that you have the permissions required for this tutorial.
  To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles:
  - Batch Job Editor (roles/batch.jobsEditor) on the project
  - Service Account User (roles/iam.serviceAccountUser) on the job's service account, which for this tutorial is the Compute Engine default service account
  - Storage Object Admin (roles/storage.objectAdmin) on the project
- Install dsub and its dependencies. For more information, see the dsub installation documentation.
  - Make sure you have installed versions of Python and pip that are supported by the latest version of dsub. To view the currently installed versions, run the following command:
    pip --version
    If you need to install or update pip or Python, follow the steps for installing Python.
  - Recommended: To prevent dependency-conflict errors when installing dsub, create and activate a Python virtual environment:
    python -m venv dsub_libs && source dsub_libs/bin/activate
  - Clone the dsub GitHub repository using git and open it:
    git clone https://github.com/databiosphere/dsub.git && cd dsub
  - Install dsub and its dependencies:
    python -m pip install .
    The output is similar to the following:
    ...
    Successfully installed cachetools-5.3.1 certifi-2023.7.22 charset-normalizer-3.3.1 dsub-0.4.9 funcsigs-1.0.2 google-api-core-2.11.0 google-api-python-client-2.85.0 google-auth-2.17.3 google-auth-httplib2-0.1.0 google-cloud-batch-0.10.0 googleapis-common-protos-1.61.0 grpcio-1.59.0 grpcio-status-1.59.0 httplib2-0.22.0 idna-3.4 mock-4.0.3 parameterized-0.8.1 proto-plus-1.22.3 protobuf-4.24.4 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-3.1.1 python-dateutil-2.8.2 pytz-2023.3 pyyaml-6.0 requests-2.31.0 rsa-4.9 six-1.16.0 tabulate-0.9.0 tenacity-8.2.2 uritemplate-4.1.1 urllib3-2.0.7
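To confirm that dsub is installed in the active environment, you can check the installed package with pip. This verification step is an optional addition to the tutorial:

# Prints the installed dsub version and location; assumes the virtual environment is still active.
python -m pip show dsub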
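The following commands are a minimal sketch of how an administrator might grant the job's service account the roles listed in the prerequisites by using the gcloud CLI. The PROJECT_NUMBER placeholder is an assumption introduced here: it is your project's numeric ID, which forms the Compute Engine default service account's email address.

# Sketch only: replace PROJECT_ID and PROJECT_NUMBER with your own values.
# The Compute Engine default service account is PROJECT_NUMBER-compute@developer.gserviceaccount.com.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/batch.agentReporter"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/storage.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/logging.logWriter"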
Create a Cloud Storage bucket
To create a Cloud Storage bucket for storing the output files from the sample dsub pipeline using the gcloud CLI, run the gcloud storage buckets create command:
gcloud storage buckets create gs://BUCKET_NAME \
--project PROJECT_ID
Replace the following:
- BUCKET_NAME: a globally unique name for your bucket.
- PROJECT_ID: the project ID of your Google Cloud project.
The output is similar to the following:
Creating gs://BUCKET_NAME/...
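To confirm that the bucket exists before continuing, you can optionally describe it with the gcloud CLI; this check is an addition to the tutorial, not a required step:

# Prints the bucket's metadata if it exists; the command fails if the bucket wasn't created.
gcloud storage buckets describe gs://BUCKET_NAME \
  --project PROJECT_ID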
Run the dsub pipeline
The sample dsub pipeline indexes a BAM file from the 1,000 Genomes Project and outputs the results to a Cloud Storage bucket.

To run the sample dsub pipeline, run the following dsub command:
dsub \
--provider google-batch \
--project PROJECT_ID \
--logging gs://BUCKET_NAME/WORK_DIRECTORY/logs \
--input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
--output BAI=gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
--image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
--command 'samtools index ${BAM} ${BAI}' \
--wait
Replace the following:
- PROJECT_ID: the project ID of your Google Cloud project.
- BUCKET_NAME: the name of the Cloud Storage bucket that you created.
- WORK_DIRECTORY: the name for a new directory that the pipeline can use to store logs and outputs. For example, enter workDir.
The dsub pipeline runs a Batch job that writes the BAI file and logs to the specified directory in your Cloud Storage bucket. Specifically, the dsub repository contains a prebuilt Docker image that uses samtools to index the BAM file that you specified in the --input flag.

The command doesn't finish until the dsub pipeline finishes running, which might vary based on when the Batch job is scheduled. Usually, this takes about 10 minutes: Batch usually starts running the job within a few minutes, and the job's runtime is about 8 minutes.
While the command is still running, the output is similar to the following:
Job properties:
job-id: JOB_NAME
job-name: samtools
user-id: USERNAME
Provider internal-id (operation): projects/PROJECT_ID/locations/us-central1/jobs/JOB_NAME
Launched job-id: JOB_NAME
To check the status, run:
dstat --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME' --status '*'
To cancel the job, run:
ddel --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME'
Waiting for job to complete...
Waiting for: JOB_NAME.
Then, after the job has successfully finished, the command ends and the output is similar to the following:
JOB_NAME: SUCCESS
JOB_NAME
This output includes the following values:
- JOB_NAME: the name of the job.
- USERNAME: your Google Cloud username.
- PROJECT_ID: the project ID of your Google Cloud project.
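Because dsub runs the pipeline as a regular Batch job, you can also inspect the job directly with the gcloud CLI. The following optional command assumes the job's location is us-central1, the location shown in the dsub output:

# Shows the Batch job's full configuration and current state.
gcloud batch jobs describe JOB_NAME \
  --location us-central1 \
  --project PROJECT_ID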
View the output files
To view the output files created by the sample dsub pipeline using the gcloud CLI, run the gcloud storage ls command:
gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY \
--project PROJECT_ID
Replace the following:
- BUCKET_NAME: the name of the Cloud Storage bucket that you created.
- WORK_DIRECTORY: the directory that you specified in the dsub command.
- PROJECT_ID: the project ID of your Google Cloud project.
The output is similar to the following:
gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
gs://BUCKET_NAME/WORK_DIRECTORY/logs/
This output includes the BAI file and a directory containing the job's logs.
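If you want to examine the index locally, you can optionally copy it out of the bucket; this step is an addition to the tutorial and assumes the output path shown above:

# Downloads the generated BAI file to the current directory.
gcloud storage cp gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai . \
  --project PROJECT_ID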
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
The easiest way to eliminate billing is to delete the current project.
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete individual resources
If you want to keep using the current project, then delete the individual resources used in this tutorial.
Delete the bucket
After the pipeline finishes running, it creates and stores output files in the WORK_DIRECTORY directory of your Cloud Storage bucket.

To reduce Cloud Storage charges to the current Google Cloud account, do one of the following:
- If you no longer need the bucket you used in this tutorial, then use the gcloud storage rm command with the --recursive flag to delete the bucket and all of its contents:
  gcloud storage rm gs://BUCKET_NAME \
    --recursive \
    --project PROJECT_ID
  Replace the following:
  - BUCKET_NAME: the name of the Cloud Storage bucket that you created.
  - PROJECT_ID: the project ID of your Google Cloud project.
- Otherwise, if you still need the bucket, then use the gcloud storage rm command with the --recursive flag to delete only the WORK_DIRECTORY directory and all of its contents:
  gcloud storage rm gs://BUCKET_NAME/WORK_DIRECTORY \
    --recursive \
    --project PROJECT_ID
  Replace the following:
  - BUCKET_NAME: the name of the Cloud Storage bucket that you created.
  - WORK_DIRECTORY: the directory that you specified in the dsub command.
  - PROJECT_ID: the project ID of your Google Cloud project.
Delete the job
To delete a job using the gcloud CLI, run the gcloud batch jobs delete command:
gcloud batch jobs delete JOB_NAME \
--location us-central1 \
--project PROJECT_ID
Replace the following:
- JOB_NAME: the name of the job.
- PROJECT_ID: the project ID of your Google Cloud project.
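If you no longer have the dsub output and need to look up the job's name, you can optionally list your Batch jobs; this command assumes the us-central1 location used in this tutorial:

# Lists Batch jobs in us-central1, including their names and states.
gcloud batch jobs list \
  --location us-central1 \
  --project PROJECT_ID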
What's next
- Learn more about dsub and dsub for Batch.
- Learn more about using storage volumes with Batch.